Calculate Classic (but Flawed) LD Indices
A number of lexical diversity indices have been proposed and used in fields such as educational psychology, applied linguistics, and other fields that include text analytics. An important issue with the use of lexical diversity indices (particularly when they are used to index an aspect of writing/speaking proficiency) is that many of these indices are intrinsically related to text length. Text length is a reasonable (if imperfect) general index of writing/speaking proficiency. It is, however, rather imprecise as a number of factors can be related to text length (fluency, development of ideas, etc.). If we want to measure the construct of lexical diversity itself, we need to use indices that are independent of text length.
Indices such as TTR are negatively correlated with text length. This is problematic because we would expect more proficient writers/speakers to both write longer texts (higher fluency, ability to create more elaborated texts, etc.), but we would also expect more proficient writers/speakers to write more diverse texts.
Indices such as Root TTR (also referred to as Guiraud's Index) are positively correlated with text length. This is problematic because it overestimates diversity of lexical items in a text (and conflates lexical diversity with other constructs such as fluency and development).
๐ก For more information about the lexical diversity indices calculated by taaled, see the LEARN MORE section at the bottom of this page.
[Review] Create a ldvals object
All the commands on this page assume that you have created a ldvals
object. The following is the recap on how to creat it using the sample text provided in the package.
from taaled import ld #import a package
from pylats import lats #used for preprocessing in this example, but optional in your analysis
clnsmpl = lats.Normalize(ld.txtsmpl, ld.params) #preprocessing
ldvals = ld.lexdiv(clnsmpl.toks) #calculate LD indices
Type-token ratio (TTR)โ โโ
TTR is calculated as the number of unique words in a text (types) divided by the number of running words (tokens) (Johnson, 19449). To get the TTR value, use .ttr
method.
print(ldvals.ttr)
0.5507246376811594
Root TTRโ โโ
Root TTR is calculated as the number of types divided by the square root of the number of tokens (Guiraud, 19604, also called Guiraud's index). To get the RTTR value, use .rttr
method.
print(ldvals.rttr)
9.14932483451846
Log TTRโ โโ
Log TTR is calculated by dividing the logarithm of the number of word types by the logarithm of the number of word tokens (Chotlos, 19442; Herdan, 19605, also known as Herdan's C). To get the Log TTR value, use .lttr
method.
print(ldvals.lttr)
0.8938651603109881
MSTTRโ โโ
The Mean-Segmental Type-Token Ratio (MSTTR) is the average TTR for successive segments of text containing a standard number of word tokens (Johnson, 1939, 19449). According to Jarvis & Hashimoto (2021)8, it was the only LD measure that had compelling evidence of overcoming the text-length sensitivity problem in the early 2000s.
After Johnson recognized the text-length problem of TTR, he suggested a solution to break the text into a number of evenly-sized segments, calculating the TTR of each segment, and measuring the mean TTR across all segments. Even though this index appears to be independent of the text length, many researchers (e.g., Covington & McFall, 20103; Malvern et al., 200416 ; McCarthy & Jarvis, 201019) have addressed that MSTTR is flawed in that it drops the remaining (and potentially valuable) data (see Jarvis & Hashimoto, 20218, p.165).
To get the MSTTR value, use .msttr
method.
print(ldvals.msttr)
0.796
LEARN MORE
-
TTR is perhaps the most well known LD index. Johnson (1939, 19449), in the field of psychology, discussed the type-token ratio (TTR) to come up with reliable measures of language behavior. He also listed varying statistical or mathematical procedures of TTRs, which include the over-all TTR, the mean segmental TTR, and the cumulative TTR curve, the decremental TTR curve, and type frequencies, and etc. Here, the mean segmental TTR refers to MSTTR. He thought that the MSTTR makes the samples of different magnitudes can be compared by dividing each sample into like-sized segments (e.g., 100 words each), computing the TTR for each segment and then averaging the segmental TTR's for each example. As long as the samples represent the segments of equal size, they are comparable. However, according to Hess, Sefton, and Jandry (1986)7, all the four measures have been criticized due to its sensitivity to the text length.
-
Root TTR and Log TTR were two early attempts to correct for TTR's sensitivity to text length using simple mathematical transformations. Chotlos (1944)2 presented a log TTR graph (i.e., the relationship of the log number of tokens and the log number of types) of the L1 children's writing samples. Herdan (1960)5 also explored the log TTR drawing upon 61 works by Chaucer and two parts of the New Testament and stated that the index remained "sensibly constant for samples of different size from as a style characteristic" (p.26, according to Hess et al., 19867).
-
TTR values are intrinsically skewed by the length of a text, wherein longer texts tend to have lower TTR scores because the proportion of repeated words increases as the text grows longer (e.g., Koizumi & In'nami, 201211; Tweedie & Baayen, 199820).
-
Hess, Sefton, and Landry (1986)7 analyzed oral language samples from 83 preschool children using five LD measures that were popular at the time: simple TTR, corrected TTR (Carroll, 19641), Root TTR, Log TTR, and Characteristic K (Yule, 194421). They concluded that all these LD measures were unsuitable for comparing texts of different lengths.
-
Hess, Haug, and Landry (1989)6 used oral language samples from 52 elementary school children to analyze four versions of TTR: simple TTR, corrected TTR, Root TTR, and Log TTR. As was the case in the earlier study by Hess et al. (1986)7, results suggested that none of the TTR measures were stable across texts of different lengths.
-
In Malvern & Richards (2002)15, MSTTR computes TTR values for equal-sized segments (sometimes 30 words, but more typically 100 words) out of the original text and averages the values for each non-overlapping segments. The remaining words are discarded. However, a number of weaknesses have been identified in this approach.
Previous studies with L1 corpus
-
Tweedie and Baayen (1998)20, see also Baayen & Tweedie, 1998) chose the twelve LD indices, including a few classic indices (e,g., root TTR, Maas) to test the text length effects. They evaluated their constancy across a range of text lengths using randomization techniques โ shuffling word orders randomly before measuring the values). The text lengths were decided by the English literature corpus, based on the fact that some of the literature consisted of relatively short words and some of them were not. They concluded that only a few measures were independent of the text length, but most were not.
-
McCarthy and Jarvis (2007)18 used a parallel sampling technique to investigate the relationship between text length and LD indices in an L1 corpus of speaking and writing. They found that most indices (including TTR and root TTR) were strongly correlated with the text length.
A previous study with L2 spoken corpus
- Koizuimi (2012)10 evaluated TTR and Root TTR and the result indicated that both indices were not stable to the text length effects.
A previous study with L2 written corpus
- Zenker and Kyle (2021)22 found out that well-known classic indices (i.e., simple TTR, root TTR, log TTR, MSTTR) were not stable across texts from 50 to 200 words in length.