Batch Processing and Writing Results to a File
The ld.ldwrite function allows you to save your analysis results to a tab-delimited txt file. Just as importantly, it can batch process multiple input text files in a single call. These characteristics make the function handy in many use cases. The following sections describe how to use it.
1. Import necessary packages
from taaled import ld
from pylats import lats
#for creating an output filename
from datetime import datetime
from datetime import date
#for finding a list of texts
import glob
2. Get a list of corpus files
For this analysis, we will use the sample essays from the Gachon Learner Corpus 2.1 (Carlstrom & Price, 2014). We will use the text file version of the corpus distributed on the linked page.
#set the path to the folder where your corpus is located
corp_path = "/Users/Unknown/Documents/TAALED/Corpus/sample"
#get a list of filenames from the folder (need to tweak depending on your filenames)
files = glob.glob(corp_path + "/*.txt")
#(optional) check the number of files in the folder
len(files) #16111
3. Define a file-naming function
The following helper function defines the filename for your output. This is not strictly necessary, but our experience suggests that timestamping output files enhances the transparency and replicability of analyses: small changes made at different points in a project can alter the output, and timestamps make it clear which run produced which file.
def outname_creator(fldname, isprll, other=None):
    day = date.today().strftime("%Y%m%d") #get date
    time = datetime.now().strftime("%H%M%S") #get time
    ldv = "taaledv" + ld.version #version strings, in case you want to
    latsv = "pylatsv" + lats.version #record them in the filename as well
    if isprll: #this tutorial focuses on the "nopa" (no parallel analysis) option
        pa = "pa"
    else:
        pa = "nopa"
    if other is None:
        outn = "_".join([day, time, fldname, pa]) + ".txt"
    else:
        outn = "_".join([day, time, fldname, pa, other]) + ".txt"
    return outn

outname = outname_creator("sample", False)
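As a quick sanity check, the naming scheme above can be reproduced with only the standard library (the taaled/pylats version strings are omitted here, and the folder name "sample" is just an example):

```python
import re
from datetime import date, datetime

# Rebuild the filename pattern: YYYYMMDD_HHMMSS_<folder>_<pa|nopa>.txt
day = date.today().strftime("%Y%m%d")
time = datetime.now().strftime("%H%M%S")
outname = "_".join([day, time, "sample", "nopa"]) + ".txt"

# e.g. 20220328_002547_sample_nopa.txt (date and time will vary)
assert re.fullmatch(r"\d{8}_\d{6}_sample_nopa\.txt", outname)
print(outname)
```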
4. Make an output file
You can run the ld.ldwrite function without parallel analysis (see the LEARN MORE section at the bottom of this page to learn more about parallel analysis).
#calculate ld for the entire corpus (without a parallel analysis)
#the output filename will follow the pattern "date_time_sample_nopa.txt" (no parallel analysis)
ld.ldwrite(files,outname)
You can set the prll argument to True to implement parallel analysis (this is for evaluating the relationship between lexical diversity indices and text length).
#run a parallel analysis for the entire corpus
#regenerate the output filename so it is labeled "pa" (parallel analysis), e.g. "date_time_sample_pa.txt"
outname = outname_creator("sample", True)
ld.ldwrite(files, outname, mx = 200, prll = True) #mx defines the maximum length (in tokens) of each parallel sample
5. Examples of the output files
When the corpus files are successfully processed by TAALED, an output file will appear in the working directory.
The name of the file is "20220328_002547_sample_nopa.txt"
If you want to do further (statistical) analysis with the calculated values, you can copy the output into another program (e.g., Excel). Alternatively, some statistical programs can directly import this tab-delimited file.
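If you prefer to stay in Python, the tab-delimited output can also be read back with the standard csv module. The column names below are illustrative placeholders, not the actual header that ld.ldwrite writes:

```python
import csv
import io

# Mock of a tab-delimited output file; the real column names come
# from ld.ldwrite and may differ (these are illustrative only).
mock_output = (
    "filename\tmtld\thdd\n"
    "essay1.txt\t72.5\t0.84\n"
    "essay2.txt\t65.1\t0.81\n"
)

# DictReader maps each row to {column name: value}; values are strings.
rows = list(csv.DictReader(io.StringIO(mock_output), delimiter="\t"))
print(rows[0]["filename"], rows[0]["mtld"])  # essay1.txt 72.5
```

To read a real output file, replace io.StringIO(mock_output) with open(outname, newline="", encoding="utf-8").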
During the parallel analysis, a text will be skipped if it is shorter than mx tokens.
The name of the file is "20220328_004640_sample_pa.txt" (parallel).
LEARN MORE: Parallel Analysis
-
Hess et al. (1986) first used the parallel sampling method to objectively measure LD indices from short text samples of young children 3, 4, and 5 years of age. First, each sample was clipped to the first 200 tokens. Then the resulting texts were subdivided into four texts of 50 tokens, two texts of 100 tokens, one text of 150 tokens, and one text of 200 tokens. LD scores were then calculated for each text, and values from texts of the same length were averaged. Subsequent analysis using repeated measures ANOVAs showed that all the LD measures in the experiment were significantly affected by text length.
-
Parallel analysis has been actively used in previous studies of LD indices. McCarthy and Jarvis (2007) selected nine representative samples from each genre and clipped each to a consistent maximum length of 2,000 tokens. The samples were divided into eleven sections ranging from 100 to 2,000 tokens for the parallel sampling method. Additionally, McCarthy and Jarvis (2010) tested MTLD and HD-D using the same method.
-
Koizumi (2012) and Koizumi and In'nami (2012) clipped each spoken text to the first 200 tokens, and the overall texts were divided into 25 segments.
-
Zenker & Kyle (2021) investigated the minimum text lengths needed to produce stable LD values in L2 written texts by clipping each text to the first 200 tokens and subdividing the essays into texts ranging from 50 to 200 tokens in length, increasing at increments of five tokens (four texts of 50 tokens, three texts of 55 tokens, etc.).
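The clip-and-subdivide procedure described above (shown here for the 200-token, 50-token-step case used by Hess et al.) can be sketched as follows; parallel_samples is a hypothetical helper name for illustration, not part of TAALED:

```python
def parallel_samples(tokens, mx=200, step=50):
    # Clip to the first mx tokens, then cut the clipped text into
    # non-overlapping segments of each length step, 2*step, ..., mx.
    clipped = tokens[:mx]
    samples = {}
    for seg_len in range(step, mx + 1, step):
        n = len(clipped) // seg_len
        samples[seg_len] = [clipped[i * seg_len:(i + 1) * seg_len]
                            for i in range(n)]
    return samples

toks = ["w%d" % i for i in range(200)]  # a 200-token dummy text
counts = {length: len(segs)
          for length, segs in parallel_samples(toks).items()}
print(counts)  # {50: 4, 100: 2, 150: 1, 200: 1}
```

An LD index would then be computed for each segment, and the values averaged within each length, which is what makes the resulting length-vs-score comparison "parallel."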
Citation for the learner corpus used on this page
Carlstrom, B., & Price, N. (2012-2014). The Gachon Learner Corpus. Available online at http://koreanlearnercorpusblog.blogspot.kr/p/corpus.html