Tool for the Automatic Analysis of Lexical Diversity (TAALED)

TAALED is a Python package for calculating lexical diversity (LD) indices. The package is designed for researchers, students, and teachers in (applied) linguistics who need to calculate LD indices that are stable across different text lengths (i.e., revised LD indices) as well as classic LD indices. The package was developed by Kristopher Kyle. Many thanks to Scott Jarvis, who provided valuable insights about the calculation of MTLD and HD-D. This documentation page was contributed by Hakyung Sung and Masaki Eguchi of the LCR-ADS Lab at the University of Oregon.

Quick Start Guides

How to Install TAALED

To install TAALED, you can use pip (a package installer for Python):

pip install taaled

This tutorial presumes that you have also installed a few helpful packages for text preprocessing and visualization. They are optional, but recommended.

TAALED takes a list of strings as input and returns various indices of LD (and diagnostic information). In the rest of the tutorial, we will use pylats to preprocess texts (e.g., tokenization, lemmatization, word disambiguation, and checking for misspelled words). Currently, pylats only supports advanced features for English (models for other languages are forthcoming). pylats was tested with spacy version 3.2 and uses the "en_core_web_sm" model by default. To install spacy and a language model, see the spacy installation instructions.

However, TAALED can work with any language, as long as texts are tokenized (and appropriately preprocessed). See tools such as spacy, stanza, and trankit for NLP pipelines for a wide range of languages.
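For instance, the sketch below passes a whitespace-tokenized Spanish sentence straight to taaled. This is a minimal sketch: ld.lexdiv is the index calculator introduced later in this guide, the toy text is far below the recommended 50-word minimum, and the .mtld attribute is shown only for illustration.

from taaled import ld

#any flat list of strings works as input, regardless of language
toks = "el gato negro duerme junto al perro".split()
ldvals = ld.lexdiv(toks) #toy example; real texts should be much longer
print(ldvals.mtld) #e.g., MTLD (see the index-calculation guide)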

pip install pylats

TAALED also makes use of plotnine for data visualization. This package is not required for taaled to function properly, but it is needed if data visualization (e.g., density plots of mtld factor lengths) is desired.

pip install plotnine
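To illustrate the kind of plot this enables, the sketch below draws a density plot with plotnine. The factor lengths here are invented for illustration; they are not real taaled output.

import pandas as pd #used only to build the plotting dataframe
from plotnine import ggplot, aes, geom_density

#hypothetical mtld factor lengths, invented for this example
fls = pd.DataFrame({"factor_length" : [34, 41, 38, 52, 47, 39, 44, 36]})
print(ggplot(fls, aes(x = "factor_length")) + geom_density())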

How to Import

We will import the installed packages in Python.

from taaled import ld
from pylats import lats #optional, but recommended for text preprocessing

How to Preprocess a Text

Because some indices presume that texts are at least 50 words in length (see, e.g., McCarthy & Jarvis, 2010; Kyle, Crossley, & Jarvis, 2021; Zenker & Kyle, 2021), we will use a longer text in this example that is conveniently included in TAALED.
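As a quick sanity check (a minimal sketch using a rough whitespace token count), you can confirm that a text clears the 50-word threshold before computing length-sensitive indices:

print(len(ld.txtsmpl.split())) #rough word count of the built-in sample text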

Minimally, a text string must be turned into a flat list of strings to work with TAALED.
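In its simplest form, this can be done with Python's built-in string methods. This minimal sketch skips the normalization steps described below:

#lowercase the built-in sample text and split it on whitespace
toks = ld.txtsmpl.lower().split()
print(toks[:5]) #a flat list of strings, ready for taaled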

Ideally, a number of text preprocessing/normalization steps will be applied. In the example below, the pylats package is used to tokenize the text, remove most punctuation, add part-of-speech tags (for homograph disambiguation), lemmatize each word, check for (and ignore) misspelled words (misspelled words will inappropriately inflate LD values), and convert all words to lower case. pylats is quite flexible and customizable, and the taaled package includes a default parameters object, ld.params, for use with pylats.

#if pylats is installed, preprocess the sample text using the default taaled parameters object
clnsmpl = lats.Normalize(ld.txtsmpl, ld.params)
print(clnsmpl.toks[:10]) #check sample output
['there_PRON', 'be_VERB', 'a_DET', 'saying_NOUN', 'in_ADP', 'my_PRON', 'language_NOUN', 'that_PRON', 'go_VERB', 'like_ADP']

To continue reading the quick-start guide, follow this link to learn how to calculate LD indices.

How to Cite

Kyle, K., Crossley, S. A., & Jarvis, S. (2021). Assessing the validity of lexical diversity indices using direct judgements. Language Assessment Quarterly, 18(2), 154-170.