Quick Start Guides
How to Install TAALED
To install TAALED, you can use pip (a package installer for Python)
:
pip install taaled
How to Install Related Packages
While not strictly necessary, this tutorial will presume that you have also installed following pacakges for text preprocessing and visualization. These are optional but recommended.
TAALED takes a list of strings as input and returns various indices of LD (and diagnostic information). In the rest of the tutorial, we will use pylats for preprocessing of texts (e.g., tokenization, lemmatization, word disambiguation, checking for misspelled words). Currently pylats only supports advanced features for English (models for other languages are forthcoming). Pylats was tested using spacy version 3.2 and be default uses the "en_core_web_sm" model. To install spacy and a language model, see the spacy installation instructions.
However, TAALED can work with any language, as long as texts are tokenized (and appropriately preprocessed). See tools such as spacy, stanza, and trankit for NLP pipelines for a wide range of languages.
pip install pylats
TAALED also makes use of plotnine for data visualization. This package is not required for taaled to function properly, but is needed if data visualization (e.g., density plots for mtld factor lengths) are desired.
pip install plotnine
How to Import
We will import the installed packages in Python.
from taaled import ld
from pylats import lats #optional, but recommended for text preprocessing
How to Preprocess a text
Because some indices presume that texts are at least 50 words in length (see, e.g., McCarthy & Jarvis, 201018; Kyle, Crossley, & Jarvis, 202112; Zenker & Kyle, 202122, we will use a longer text in this example that conveniently is included in TAALED.
Minimally, a text string must be turned into a flat list of strings to work with TAALED.
Ideally, a number of text preprocessing/normalization steps will be used. In the example below, the pylats package is used to tokenize the text, remove most punctuation, add part of speech tags (for homograph disambiguation), lemmatize each word, check for (and ignore) misspelled words (misspelled words will innapropriately inflate ld values), and convert all words to lower case. pylats is quite flexible/customizable, and the taaled package includes a default parameters object ld.params
for use with pylats.
#if pylats is installed, preprocess the sample text using the default taaled parameters file
clnsmpl = lats.Normalize(ld.txtsmpl, ld.params)
print(clnsmpl.toks[:10]) #check sample output
['there_PRON', 'be_VERB', 'a_DET', 'saying_NOUN', 'in_ADP', 'my_PRON', 'language_NOUN', 'that_PRON', 'go_VERB', 'like_ADP']
Create a ldvals object
To calculate LD indices for a single text, all you need is to run the following command over a tokenized list.
ldvals = ld.lexdiv(clnsmpl.toks)
The ldvals
object contains the result of the LD analysis, which can be accessed through respective methods. These are explained below.
Calculate basic numbers
Once you created an ldvals
object, you will be able to see the result of the analysis by calling a method.
Here, we present how to inspect basic textual information such as the number of tokens, types, and document frequency for each word (i.e., how many times a particular word occurs in the input text).
For how to output modern LD indices (e.g., MTLD, MATTR), see Calculate Revised LD indices. For classic LD indices (not recommended except for comparison), see Calculate Classic (but Flawed) LD Indices.
Number of Tokens
The .ntokens
method will return the number of tokens in the input text.
print(ldvals.ntokens)
276
Number of Types
The .ntypes
method will return the number of types (i.e., unique words) in the input text.
print(ldvals.ntypes)
154
Word Frequency
The .freqd
method will return the document frequencies (i.e., how many times each word occurs in the text) of each word in the input text.
print(ldvals.freqd)
{'there_PRON': 1, 'be_VERB': 1, 'a_DET': 5, 'saying_NOUN': 1, 'in_ADP': 4, 'my_PRON': 2, 'language_NOUN': 1, 'that_PRON': 1, 'go_VERB': 1, 'like_ADP': 1, 'if_SCONJ': 2, 'only_ADV': 2, 'the_DET': 9, 'young_ADJ': 5, 'could_AUX': 2, 'know_VERB': 1, 'and_CCONJ': 8, 'old_ADJ': 6, 'do_VERB': 1, 'this_PRON': 1, 'explain_VERB': 1, 'an_DET': 2, 'important_ADJ': 1, 'lesson_NOUN': 1, 'but_CCONJ': 2, 'one_PRON': 1, 'have_VERB': 2, 'to_PART': 9, 'attain_VERB': 1, 'certain_ADJ': 2, 'degree_NOUN': 1, 'of_ADP': 4, 'wisdom_NOUN': 1, 'understand_VERB': 1, 'it_PRON': 8, 'opinion_NOUN': 1, 'be_AUX': 17, 'more_ADV': 4, 'enjoyable_ADJ': 1, 'may_AUX': 1, 'make_VERB': 2, 'somebody_PRON': 1, 'experienced_ADJ': 1, 'would_AUX': 4, 'not_PART': 6, 'his_PRON': 1, 'life_NOUN': 1, 'less_ADV': 1, 'boring_ADJ': 1, 'while_SCONJ': 1, 'you_PRON': 7, 'your_PRON': 2, 'mind_NOUN': 2, 'fresh_ADJ': 1, 'open_ADJ': 1, 'for_ADP': 1, 'idea_NOUN': 2, 'future_NOUN': 1, 'full_ADJ': 1, 'excitement_NOUN': 1, 'preferable_ADJ': 1, 'than_ADP': 2, 'regret_NOUN': 1, 'which_PRON': 1, 'inevitably_ADV': 1, 'come_VERB': 1, 'with_ADP': 4, 'age_NOUN': 2, 'even_ADV': 3, 'feeling_NOUN': 1, 'can_AUX': 3, 'beat_VERB': 1, 'feel_VERB': 2, 'when_SCONJ': 3, 'upon_SCONJ': 1, 'new_ADJ': 2, 'discovery_NOUN': 1, 'just_ADV': 1, 'remember_VERB': 1, 'great_ADJ': 1, 'triumph_NOUN': 1, 'same_ADJ': 2, 'after_ADP': 2, 'ten_NUM': 1, 'year_NOUN': 1, 'pass_VERB': 1, 'nobody_PRON': 1, 'deny_VERB': 1, 'that_SCONJ': 1, 'energetic_ADJ': 1, 'body_NOUN': 2, 'well_ADJ': 1, 'frail_ADJ': 1, 'dependant_ADJ': 1, 'on_ADP': 3, 'medicine_NOUN': 1, 'or_CCONJ': 1, 'someone_PRON': 1, 'try_VERB': 1, 'parachute_VERB': 1, 'jump_VERB': 1, 'past_ADJ': 1, '60_NUM': 1, 'doctor_NOUN': 1, 'allow_VERB': 1, 'eating_NOUN': 1, 'lot_NOUN': 1, 'constraint_NOUN': 1, 'always_ADV': 1, 'easy_ADJ': 1, 'meet_VERB': 1, 'get_VERB': 4, 'well_ADV': 1, 'people_NOUN': 4, 'as_ADP': 1, 'person_NOUN': 1, 'accept_VERB': 1, 'trust_VERB': 1, 'each_DET': 1, 'other_ADJ': 1, 'easily_ADV': 1, 'simple_ADJ': 1, 'find_VERB': 1, 'common_ADJ': 1, 'ground_NOUN': 1, 'tend_VERB': 1, 'rigid_ADJ': 1, 'their_PRON': 1, 'view_NOUN': 1, 'trait_NOUN': 1, 'begin_VERB': 1, 'turn_VERB': 1, 'to_ADP': 1, 'stone_NOUN': 1, 'become_VERB': 2, 'impossible_ADJ': 1, 'change_VERB': 1, 'they_PRON': 5, 'let_VERB': 1, 'us_PRON': 1, 'face_VERB': 1, 'do_AUX': 1, 'wise_ADJ': 1, 'both_PRON': 1, 'hard_ADJ': 2, 'deal_VERB': 1, 'so_ADV': 1, 'some_DET': 1, 'family_NOUN': 1, 'consider_VERB': 1, 'send_VERB': 1, 'away_ADV': 1}
We can also sort the output using the .freqs
method. In the following, we output 10 most frequent items in the input text.
print(ldvals.freqs[:10]) #sorted ten most frequent items
[('be_AUX', 17), ('the_DET', 9), ('to_PART', 9), ('and_CCONJ', 8), ('it_PRON', 8), ('you_PRON', 7), ('old_ADJ', 6), ('not_PART', 6), ('a_DET', 5), ('young_ADJ', 5)]
You can apply the codes your own sample essay.
- The sample essay below is randomly chosen from the The Gachon Learner Corpus.
sample ='''
On the road almost crash makes me very frightening.
We have every Saturday and Sunday off. I decided to go shopping on Saturday.
I bought some daily necessities. And I have eat lunch in a restaurant.
So I have been going home. When I passed the crossroad the green light was not bright.
After a while the green light I ready to cross the road. When I walk in the middle of the road suddenly a car rushed.
I was still. I want to finish it to hit me. On me from the place near the car stopped. The tyres was made brake loud.
The driver rushed off to apologize to me. This thing that made me think I fear.
'''
clnsmpl2 = lats.Normalize(sample, ld.params)
ldvals2 = ld.lexdiv(clnsmpl2.toks)
print(ldvals2.ntokens)
120
print(ldvals2.ntypes)
70
print(ldvals2.freqs[:5])
[('i_PRON', 15), ('the_DET', 11), ('to_PART', 5), ('be_AUX', 4), ('on_ADP', 3)]
Let's start calculating LD indices!
This quick-start guide walked you through how to create ldvals
object and output basic information. In the following pages, we outline how to output lexical diversity indices and how to batch process input text files and save the output in csv.