Calculate Revised LD Indices
๐ก For more information about the LD indices calculated by TAALED, see the LEARN MORE section at the bottom of this page.
๐ก For index selection guidelines, go directly to the Summary.
[Review] Create a ldvals object
All the commands on this page assume that you have created a ldvals
object. The following is the recap on how to create it using the sample text provided in the package.
from taaled import ld #import a package
from pylats import lats #used for preprocessing in this example, but optional in your analysis
clnsmpl = lats.Normalize(ld.txtsmpl, ld.params) #preprocessing
ldvals = ld.lexdiv(clnsmpl.toks) #calculate LD indices
MTLDโ โ โ
The Measure of Textual Lexical Diversity (MTLD; McCarthy, 200517; McCarthy & Jarvis, 201019) measures the average number of tokens it takes for the TTR value to reach a point of stabilization (which McCarthy & Jarvis, 2010 defined as TTR = .720). Texts that are less lexically diverse will have lower MTLD values than more lexically diverse texts.
MTLD is calculated by determining the number of factors in a text (the number of non-overlapping portions of text that reach TTR = .720) and the length of each factor. In most texts, a partial factor will occur at the end of the text. MTLD runs both forwards and backwards.
TAALED calculates two closely related variants of MTLD. The first variant (MTLD Original) is described in McCarthy and Jarvis (2010)19 and is used in previous versions of TAALED (up to and including version 1.4). The second variant (MTLD) is the version used in studies where Scott Jarvis is the lead author (e.g., Jarvis & Hashimoto, 20218). We recommend using MTLD.
Check MTLDO value
The .mtldo
method reports the original MTLD as it was explained by McCarthy and Jarvis (2010) in their paper. This approach divides the sum of the factor lengths (including both full and partial factors) by the sum of factor scores (a full factor recieves the score of 1, and the partial factor score ranges from 0 to 1), which can gives you a slightly different results from the .mtld
implementation.
print(ldvals.mtldo) # mtld value as it is described in McCarthy & Jarvis (2010) paper
81.38912039538282
Report MTLD value
The .mtld
reports the MTLD value, where the average of the factor lengths is calculated after the final partial factor is estimated by the factor proportion.
print(ldvals.mtld) # mtld value
82.7575
Check MTLDVALS value
The .mtldvals
method returns lengths of each factor used for calculation of the final MTLD score. In the example text, there were eight factors. The first factor was 99 words in length, the second factor was 74 words in length (etc.). Any factors lengths with decimal places represent estimates based on partial factors. It is possible to have two estimated factor lengths (one at the end of forward factor calculation and one at the end of backward factor calculation). It is also possible (but rare) to have only one estimated factor length (or none).
print(ldvals.mtldvals)
[99.0, 74.0, 77.0, 47.32000000000001, 48.0, 96.0, 103.0, 117.73999999999997]
Check MTLDLISTS value
The .mtldlists
method reports a list of tokens, factor lengths, and factor proportions used for each calculation. This information can be used for diagnostic purposes.
print(ldvals.mtldlists)
Plot MTLD (Requires the plotnine package)
The .mtldplot
method returns a density plot for factor lengths. This information will tell you the distribution of factor lengths and the mean factor length (represented by the vertical dashed red line).
print(ldvals.mtldplot)
MATTRโ โ โ
The Moving-Average Type-Token Ratio calculates the moving average for all segments of a given length. For a segment length of 50 tokens (i.e., window length), TTR is calculated on tokens 1-50, 2-51, 3-52, etc., and the resulting TTR measurements are averaged to produce the final MATTR value (Covington & McFall, 20103).
Report MATTR value
To get the MATTR value, use .mattr
method.
print(ldvals.mattr) #moving average TTR value
0.7922466960352423
Check TTR values for each window
The .mattrs
method gives you a list of TTR values for each moving window. This information can be used for diagnostic purposes, or more widely to understand how MATTR is calculated over different windows.
print(ldvals.mattrs)
[0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.78, 0.78, 0.78, 0.78, 0.8, 0.82, 0.84, 0.82, 0.8, 0.8, 0.82, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.78, 0.8, 0.78, 0.76, 0.76, 0.76, 0.78, 0.78, 0.78, 0.78, 0.8, 0.8, 0.78, 0.76, 0.78, 0.78, 0.8, 0.82, 0.8, 0.82, 0.82, 0.82, 0.82, 0.8, 0.82, 0.82, 0.8, 0.78, 0.78, 0.8, 0.8, 0.8, 0.8, 0.8, 0.78, 0.78, 0.8, 0.82, 0.8, 0.8, 0.78, 0.8, 0.8, 0.8, 0.8, 0.8, 0.82, 0.8, 0.82, 0.82, 0.84, 0.84, 0.86, 0.86, 0.84, 0.84, 0.86, 0.84, 0.84, 0.84, 0.84, 0.84, 0.86, 0.84, 0.84, 0.82, 0.8, 0.78, 0.78, 0.8, 0.82, 0.82, 0.82, 0.82, 0.84, 0.84, 0.86, 0.86, 0.84, 0.84, 0.84, 0.84, 0.82, 0.82, 0.8, 0.78, 0.76, 0.76, 0.76, 0.76, 0.78, 0.8, 0.8, 0.78, 0.76, 0.76, 0.76, 0.78, 0.78, 0.78, 0.78, 0.76, 0.76, 0.74, 0.76, 0.76, 0.76, 0.78, 0.78, 0.78, 0.78, 0.78, 0.76, 0.78, 0.78, 0.8, 0.8, 0.8, 0.78, 0.78, 0.76, 0.76, 0.76, 0.78, 0.78, 0.78, 0.76, 0.76, 0.78, 0.76, 0.78, 0.76, 0.76, 0.76, 0.76, 0.76, 0.78, 0.78, 0.78, 0.78, 0.78, 0.78, 0.76, 0.78, 0.8, 0.8, 0.8, 0.8, 0.8, 0.82, 0.8, 0.8, 0.8, 0.78, 0.78, 0.8, 0.8, 0.82, 0.84, 0.82, 0.82, 0.82, 0.84, 0.84, 0.84, 0.82, 0.8, 0.8, 0.8, 0.78, 0.8, 0.78, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.78, 0.78, 0.76, 0.76, 0.78, 0.76, 0.74, 0.74, 0.74, 0.74, 0.72, 0.74, 0.74, 0.74, 0.76, 0.76, 0.74, 0.74, 0.72, 0.72]
Check tokens in each window
print(ldvals.mattrwins[0]) #first window
print(ldvals.mattrwins[1]) #second window
['there_PRON', 'be_VERB', 'a_DET', 'saying_NOUN', 'in_ADP', 'my_PRON', 'language_NOUN', 'that_PRON', 'go_VERB', 'like_ADP', 'if_SCONJ', 'only_ADV', 'the_DET', 'young_ADJ', 'could_AUX', 'know_VERB', 'and_CCONJ', 'the_DET', 'old_ADJ', 'could_AUX', 'do_VERB', 'this_PRON', 'explain_VERB', 'an_DET', 'important_ADJ', 'lesson_NOUN', 'but_CCONJ', 'one_PRON', 'have_VERB', 'to_PART', 'attain_VERB', 'a_DET', 'certain_ADJ', 'degree_NOUN', 'of_ADP', 'wisdom_NOUN', 'to_PART', 'understand_VERB', 'it_PRON', 'in_ADP', 'my_PRON', 'opinion_NOUN', 'be_AUX', 'young_ADJ', 'be_AUX', 'more_ADV', 'enjoyable_ADJ', 'be_AUX', 'old_ADJ', 'may_AUX']
['be_VERB', 'a_DET', 'saying_NOUN', 'in_ADP', 'my_PRON', 'language_NOUN', 'that_PRON', 'go_VERB', 'like_ADP', 'if_SCONJ', 'only_ADV', 'the_DET', 'young_ADJ', 'could_AUX', 'know_VERB', 'and_CCONJ', 'the_DET', 'old_ADJ', 'could_AUX', 'do_VERB', 'this_PRON', 'explain_VERB', 'an_DET', 'important_ADJ', 'lesson_NOUN', 'but_CCONJ', 'one_PRON', 'have_VERB', 'to_PART', 'attain_VERB', 'a_DET', 'certain_ADJ', 'degree_NOUN', 'of_ADP', 'wisdom_NOUN', 'to_PART', 'understand_VERB', 'it_PRON', 'in_ADP', 'my_PRON', 'opinion_NOUN', 'be_AUX', 'young_ADJ', 'be_AUX', 'more_ADV', 'enjoyable_ADJ', 'be_AUX', 'old_ADJ', 'may_AUX', 'make_VERB']
Plot MATTR (Requires the plotnine package)
The .mattrplot
method will return a density plot for TTR values for each moving window. This information will tell you the range of TTR values before taking the average. This information can be used for diagnostic purposes, or more widely to understand how MATTR is calculated over different windows.
print(ldvals.mattrplot)
HD-Dโ โ โ
HD-D (Hypergeometric Distribution Diversity) is a revised index of vocd-D (for more information about vocd-D, see McCarthy & Jarvis, 200718). For each word type in a text, HD-D uses the hypergeometric distribution to calculate the probability of encountering one of its tokens in a random sample of 42 tokens, and these probabilities are then added together to produce the final value for the text. For convenience, TAALED reports these values on the same scale as TTR.
Report hdd value
The .hdd
method returns the value of HD-D.
print(ldvals.hdd)
0.8355613021691302
Maasโ โ โ
Maas' index is a more complex transformation of TTR that attempts to fit the value to a logarithmic curve. It is often referred to simply as Maas (Maas, 197213).
Report maas value
The .maas
method returns the Maas' index.
print(ldvals.maas)
0.04348168494633678
LEARN MORE
Previous studies with L1 corpus
-
Observing that vocd-D(Malvern & Richards, 199714) was rapidly becoming the preferred LD measures in the field, McCarthy and Jarvis (2007)18 conducted an assessment of its sensitivity to text length using a corpus of 23 genres taken from various corpora. As a result, they concluded that vocd-D scores were not as independent of text length. In contrast, they found a small relationship (r=.282) between text length and HD-D for longer texts (up to 2000 words).
-
McCarthy and Jarvis (2010)19 used 16 different written and spoken L1 corpus (comprising texts from a range of registers such as editorials, official documents, academic prose, fiction, etc.). The results suggested that MTLD was stable to the text length (r=.016, p=.530) whereas other revised indices were fairly correlated to the text length (HD-D: r=.282, p<.001; Mass: r=.125, p<.001) and the TTR was extremely correlated (r=.811, p<.001). They also found that MTLD had a low correlation with the simple TTR, which can be a desirable property of a lexical diversity index since the simple TTR strongly correlates with text length. Meanwhile, they showed that Maas was one of the main predictors of their register prediction model (along with MTLD and vocd-D).
Previous studies with L2 spoken corpus
-
Koizumi (2012)10 evaluated MTLD (including simple TTR, Root TTR, and vocd-D) using spoken English samples from 20 Japanese adolescents. Each text was clipped to the first 200 words and then subdivided into 25 segments ranging in length from 50 to 200 tokens by parallel sampling. Results indicated that MTLD values stabilized at roughly 100 tokens, and none of the other indices produced stable values within the text length ranges included in the study.
-
Koizumi and In'nami (2012)11 continued the same study after including more participants (38 Japanese teenagers). They reconfirmed the result of the previous research in that MTLD was the only index that yields stable values, with a stabilization point of around 100 words. They found that HD-D was less affected by text length than other commonly used indices, while still reporting meaningful and significant effects.
A previous study with L2 written corpus
- Zenker and Kyle (2021)22 evaluated MTLD using 4,542 argumentative essays by English learners from Asian regions. The result indicated that MTLD-Original and MTLD-MA-Wrap were robust to even shorter texts such as 50 words. They found a negligible relationship (r=.064) between HD-D and text length.
A previous study on validating LD indices
- Kyle et al. (2021)12 explored the relationship between the indices and human ratings of LD in both L1 and L2 argumentative essays. The major finding was that the abundance (i.e., the total number of word types) was the strongest predictor of lexical diversity ratings followed by volume (i.e., the total number of word tokens) and variety, which was measured by four LD indices of HD-D, MATTR, MTLD, and MTLD-BI-Wrap. Among the indices, HD-D showed the strongest correlation with human judgments of the LD (r=.602, p<.001). Considering the correlation between abundance and volume as a potential problem in measuring large and complex constructs in writing assessment, the researchers concluded that measuring the dimension of variety with relatively text length independent indices is critical in L2 assessment.
Summary
This is a brief guideline for index selection considering the minimum text length based on the results of Zenker and Kyle (2021)22.
Recommendation | Index (Minimum Text Length (tokens)) |
---|---|
Use with confidence (โ โ โ ) | MATTR (50), MTLD-Original(50) |
Use with caution (โ โ โ) | HD-D (50), Maas (100) |
Avoid (โ โโ) | MSTTR, Log TTR, Root TTR, Simple TTR |