Tool for the Automatic Analysis of Lexical Diversity (TAALED)
TAALED is a Python package for calculating lexical diversity (LD) indices. It is designed for researchers, students, and teachers in (applied) linguistics who need to calculate LD indices that are stable across different text lengths (i.e., revised LD indices) as well as classic LD indices. The package was developed by Kristopher Kyle. Many thanks to Scott Jarvis, who provided valuable insights about the calculation of MTLD and HD-D. This documentation page was contributed by Hakyung Sung and Masaki Eguchi of the LCR-ADS Lab at the University of Oregon.
Quick Start Guides
How to Install TAALED
To install TAALED, you can use pip (a package installer for Python):
pip install taaled
How to Install Related Packages
This tutorial presumes that you have also installed a few helpful packages for text preprocessing and visualization. These are optional but recommended.
TAALED takes a list of strings as input and returns various indices of LD (and diagnostic information). In the rest of the tutorial, we will use pylats to preprocess texts (e.g., tokenization, lemmatization, homograph disambiguation, and checking for misspelled words). Currently, pylats only supports advanced features for English (models for other languages are forthcoming). pylats was tested with spacy version 3.2 and by default uses the "en_core_web_sm" model. To install spacy and a language model, see the spacy installation instructions.
However, TAALED can work with any language as long as texts are tokenized (and appropriately preprocessed). See tools such as spacy, stanza, and trankit for NLP pipelines that cover a wide range of languages.
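For example, a pre-tokenized text can be passed directly to TAALED. The sketch below is a minimal illustration, using naive whitespace tokenization on the sample text that ships with the package; ld.lexdiv is the interface covered later in this guide, and whitespace splitting here merely stands in for a proper NLP pipeline:
from taaled import ld
#any flat list of strings will work; here we use naive whitespace
#tokenization on the sample text included with TAALED
toks = ld.txtsmpl.lower().split()
ldvals = ld.lexdiv(toks) #calculate LD indices from a flat list of strings
print(ldvals.mtld) #e.g., MTLD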
pip install pylats
TAALED also makes use of plotnine for data visualization. This package is not required for TAALED to function properly, but it is needed if data visualization (e.g., density plots of MTLD factor lengths) is desired.
pip install plotnine
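As a quick illustration of the kind of plot this enables, the sketch below draws a density plot with plotnine from a hypothetical list of MTLD factor lengths; the variable factor_lengths is illustrative only and is not part of the TAALED API:
import pandas as pd
from plotnine import ggplot, aes, geom_density, labs
#hypothetical MTLD factor lengths collected from a set of texts
factor_lengths = [31.2, 42.5, 38.9, 55.1, 47.3, 36.8, 44.0]
df = pd.DataFrame({"factor_length": factor_lengths})
plot = ggplot(df, aes(x="factor_length")) + geom_density() + labs(x="MTLD factor length", y="density")
plot.save("mtld_density.png") #or print(plot) to display interactively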
How to Import
We will import the installed packages in Python.
from taaled import ld
from pylats import lats #optional, but recommended for text preprocessing
How to Preprocess a Text
Because some indices presume that texts are at least 50 words in length (see, e.g., McCarthy & Jarvis, 2010; Kyle, Crossley, & Jarvis, 2021; Zenker & Kyle, 2021), we will use a longer text in this example that is conveniently included in TAALED.
Minimally, a text string must be turned into a flat list of strings to work with TAALED.
Ideally, a number of text preprocessing/normalization steps will be used. In the example below, the pylats package is used to tokenize the text, remove most punctuation, add part-of-speech tags (for homograph disambiguation), lemmatize each word, check for (and ignore) misspelled words (misspelled words will inappropriately inflate LD values), and convert all words to lowercase. pylats is quite flexible/customizable, and the taaled package includes a default parameters object, ld.params, for use with pylats.
#if pylats is installed, preprocess the sample text using the default taaled parameters file
clnsmpl = lats.Normalize(ld.txtsmpl, ld.params)
print(clnsmpl.toks[:10]) #check sample output
['there_PRON', 'be_VERB', 'a_DET', 'saying_NOUN', 'in_ADP', 'my_PRON', 'language_NOUN', 'that_PRON', 'go_VERB', 'like_ADP']
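As a preview of the next part of the guide, the preprocessed tokens can then be passed to ld.lexdiv to obtain LD indices. A brief sketch follows; attribute names such as mtld and hdd reflect the taaled interface, and the linked guide is the authoritative reference for the full set of indices and options:
ldvals = ld.lexdiv(clnsmpl.toks) #calculate LD indices for the preprocessed tokens
print(ldvals.mtld) #MTLD
print(ldvals.hdd) #HD-D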
To continue the quick-start guide, follow this link to learn how to calculate LD indices.
How to Cite
Kyle, K., Crossley, S. A., & Jarvis, S. (2021). Assessing the validity of lexical diversity indices using direct judgements. Language Assessment Quarterly, 18(2), 154-170.