Ngram frequency
Definition
Ngram frequency refers to how often sequences of n words (e.g., bigrams, trigrams) occur in a corpus. High-frequency ngrams reflect conventional multiword patterns (e.g., a lot of), while low-frequency ngrams may indicate less typical or more sophisticated usage (e.g., is about being).
Methodology
For each target text, bigrams and trigrams are looked up in a reference corpus, and three kinds of scores are computed from their reference frequencies: mean frequency, mean log-transformed frequency, and proportional frequency.
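The three score types can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function names, the toy reference frequencies, the base-10 logarithm, the choice to skip n-grams that are absent from the reference list, and the reading of the proportion score as "share of the text's n-grams that fall among the top_k most frequent reference n-grams" are all assumptions made for illustration.

```python
import math


def ngrams(tokens, n):
    """Return every sequence of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_scores(tokens, ref_freq, n=2, top_k=10_000):
    """Compute mean, mean log, and proportional n-gram frequency for one text.

    ref_freq is a hypothetical dict mapping n-gram tuples to their
    (positive) frequency in a reference corpus.
    """
    grams = ngrams(tokens, n)
    if not grams:
        return None  # text is shorter than n tokens

    # Assumption: n-grams missing from the reference list are ignored
    # when averaging, rather than counted as zero.
    found = [ref_freq[g] for g in grams if g in ref_freq]
    mean_freq = sum(found) / len(found) if found else 0.0
    # Assumption: base-10 logarithm of each reference frequency.
    mean_log = sum(math.log10(f) for f in found) / len(found) if found else 0.0

    # Assumption: the threshold (e.g., 10k) selects the top_k most
    # frequent reference n-grams; the score is the share of the text's
    # n-grams that belong to that set.
    top = set(sorted(ref_freq, key=ref_freq.get, reverse=True)[:top_k])
    proportion = sum(g in top for g in grams) / len(grams)

    return {"mean": mean_freq, "mean_log": mean_log, "proportion": proportion}


# Toy reference frequencies (invented for this example only).
toy_ref = {("a", "lot"): 100.0, ("lot", "of"): 1000.0, ("of", "time"): 10.0}
scores = ngram_scores("a lot of spare time".split(), toy_ref, n=2, top_k=2)
print(scores)
```

In this toy run, two of the four bigrams in the text appear in the reference list, so the mean scores are averaged over those two, while the proportion is computed over all four bigrams.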
Corpora used
- BNC
- COCA
- TOEFL11
Register
- Academic, Fiction, Magazine, News, Spoken, Written
Calculated indices
- Replace [ ] with a register (e.g., academic, fiction) and [10k–100k] with a threshold size (e.g., 10k, 20k, …, 100k).
BNC
The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken British English from the late 20th century (BNC Consortium, 2007).
Raw frequency
- Indices:
- BNC_[ ]_Bigram_Freq_Normed
- BNC_[ ]_Trigram_Freq_Normed
Logarithmic frequency
- Indices:
- BNC_[ ]_Bigram_Freq_Normed_Log
- BNC_[ ]_Trigram_Freq_Normed_Log
Proportional frequency
- Indices:
- BNC_[ ]_Bigram_Proportion
- BNC_[ ]_Trigram_Proportion
COCA
The Corpus of Contemporary American English (COCA) includes more than one billion words from spoken, fiction, magazine, newspaper, and academic texts, offering frequency data for a variety of registers (Davies, 2009).
Raw frequency
- Indices:
- COCA_[ ]_Bigram_Frequency
- COCA_[ ]_Trigram_Frequency
Logarithmic frequency
- Indices:
- COCA_[ ]_Bigram_Frequency_Log
- COCA_[ ]_Trigram_Frequency_Log
Proportional frequency
- Indices:
- COCA_[ ]bi_prop[10k–100k]
- COCA_[ ]tri_prop[10k–100k]
TOEFL11
The TOEFL11 Corpus is a learner corpus containing essays written by English language learners, categorized by proficiency level and L1 background (Blanchard et al., 2013). The TOEFL11 L2 corpus comprises 12,000 essays, each written by a different learner taking the TOEFL iBT test. These essays were produced under timed conditions (30 minutes) for an independent writing task that mirrors first-year college writing. The corpus features responses from learners with eleven different native language backgrounds, written on eight different essay prompts. Expert raters assigned each essay a holistic score on a scale from 1.0 to 5.0, resulting in 1,201 essays rated as low proficiency, 5,964 as medium, and 3,835 as advanced. In total, the corpus contains 3,509,001 words (Monteiro et al., 2020, p. 286).
Raw frequency
- Indices:
- NNS_Raw_Bigram_Freq_[High/Med/Low/WC]_AW
- NNS_Raw_Trigram_Freq_[High/Med/Low/WC]_AW
- WC stands for "whole corpus".
Logarithmic frequency
- Indices:
- NNS_Raw_Bigram_Freq_[High/Med/Low/WC]_AW_log
- NNS_Raw_Trigram_Freq_[High/Med/Low/WC]_AW_log
References
- Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M. (2013). TOEFL11: A corpus of non‐native English. ETS Research Report Series, 2013(2), i-15. https://doi.org/10.1002/j.2333-8504.2013.tb02331.x
- BNC Consortium. (2007). British national corpus. Oxford Text Archive Core Collection.
- Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2), 159-190. https://doi.org/10.1075/ijcl.14.2.02dav
- Monteiro, K. R., Crossley, S. A., & Kyle, K. (2020). In search of new benchmarks: Using L2 lexical frequency and contextual diversity indices to assess second language writing. Applied Linguistics, 41(2), 280-300. https://doi.org/10.1093/applin/amy056