Ngram strength of association
Definition
Ngram strength of association (SOA) measures how strongly the words in an ngram are connected based on their co-occurrence patterns.
Methodology
TAALES calculates five statistical association measures for bigrams and trigrams across COCA subcorpora:
- MI: Mutual Information
- MI²: Mutual Information Squared
- T: T-score
- DP: Delta P (predictive association)
- AC: Approximate Collexeme (semantic co-occurrence)
Corpus used
- BNC (BNC, 2007)
- COCA (Davies, 2009)
- TOEFL11 (NNS, Monteiro et al., 2020)
Register
- Academic, Fiction, Magazine, News, Spoken
Calculated indices
- Replace [ ] with a register (e.g., academic, fiction), for example COCA_academic_bi_MI or COCA_fiction_bi_MI.
- For COCA, you can choose the lemma option; however, for indices from BNC and TOEFL11, only the raw option is available.
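As a minimal sketch of the naming pattern above, the register slot can be filled in programmatically. The helper name `coca_index_names` is hypothetical, and the lowercase spellings of magazine, news, and spoken are an assumption; only the academic and fiction forms appear in the examples given here.

```python
# Registers listed for COCA. Lowercase spelling of the last three is assumed
# by analogy with COCA_academic_bi_MI and COCA_fiction_bi_MI.
REGISTERS = ["academic", "fiction", "magazine", "news", "spoken"]


def coca_index_names(ngram: str = "bi", measure: str = "MI") -> list[str]:
    """Fill the [ ] slot of COCA_[ ]_<ngram>_<measure> with each register."""
    return [f"COCA_{register}_{ngram}_{measure}" for register in REGISTERS]


print(coca_index_names("bi", "MI")[:2])
# ['COCA_academic_bi_MI', 'COCA_fiction_bi_MI']
```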
Mutual information (MI)
- Description: Highlights rare but highly exclusive combinations. Sensitive to low-frequency ngrams.
- Indices:
- BNC/COCA/NNS_[ ]_bi_MI
- BNC/COCA/NNS_[ ]_tri_MI
- BNC/COCA/NNS_[ ]_tri_2_MI
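To illustrate why MI is sensitive to low-frequency ngrams, the sketch below uses the standard textbook formulation, MI = log2(observed / expected), where expected = f(item 1) × f(item 2) / N. This is an assumption about the general measure, not a confirmed account of how TAALES computes its scores, and all counts are hypothetical.

```python
import math


def mutual_information(observed: float, freq_item1: float,
                       freq_item2: float, corpus_size: float) -> float:
    """Textbook MI: log2 of observed over expected co-occurrence frequency."""
    expected = freq_item1 * freq_item2 / corpus_size
    return math.log2(observed / expected)


# Hypothetical counts in a 1,000,000-token corpus.
# A pair seen once, whose two words each occur only once.
rare_pair = mutual_information(1, 1, 1, 1_000_000)
# A pair seen 1,000 times, made of two common words.
frequent_pair = mutual_information(1000, 20_000, 10_000, 1_000_000)

print(round(rare_pair, 2), round(frequent_pair, 2))
```

Under these counts the one-off pair receives a much higher MI than the frequent pair, because its expected frequency is tiny.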
Mutual information squared (MI²)
- Description: Modifies MI to reduce low-frequency bias by squaring the numerator. Favors exclusive, yet not extremely rare, combinations.
- Indices:
- BNC/COCA/NNS_[ ]_bi_MI2
- BNC/COCA/NNS_[ ]_tri_MI2
- BNC/COCA/NNS_[ ]_tri_2_MI2
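The effect of squaring the numerator can be sketched with the standard formulation MI² = log2(observed² / expected), where expected = f(item 1) × f(item 2) / N. This is the general textbook form, assumed here for illustration rather than taken from the TAALES source, and the counts are hypothetical.

```python
import math


def mi_squared(observed: float, freq_item1: float,
               freq_item2: float, corpus_size: float) -> float:
    """Textbook MI squared: the observed frequency is squared before the log."""
    expected = freq_item1 * freq_item2 / corpus_size
    return math.log2(observed ** 2 / expected)


# Hypothetical counts in a 1,000,000-token corpus.
rare_pair = mi_squared(1, 1, 1, 1_000_000)               # seen once
frequent_pair = mi_squared(1000, 20_000, 10_000, 1_000_000)  # seen 1,000 times

print(round(rare_pair, 2), round(frequent_pair, 2))
```

Squaring adds log2(observed) to the plain MI value, so a pair seen once gains nothing while a pair seen 1,000 times gains about 10 points, which narrows the advantage of very rare combinations.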
T-score (T)
- Description: Emphasizes more frequent word combinations with weaker associations. Often reflects formulaic or generic phrases.
- Indices:
- BNC/COCA/NNS_[ ]_bi_T
- BNC/COCA/NNS_[ ]_tri_T
- BNC/COCA/NNS_[ ]_tri_2_T
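The preference of the T-score for frequent combinations can be sketched with the standard formulation T = (observed − expected) / sqrt(observed), where expected = f(item 1) × f(item 2) / N. This is the general textbook form, assumed for illustration and not confirmed as the TAALES implementation; the counts are hypothetical.

```python
import math


def t_score(observed: float, freq_item1: float,
            freq_item2: float, corpus_size: float) -> float:
    """Textbook T-score: (observed - expected) / sqrt(observed)."""
    expected = freq_item1 * freq_item2 / corpus_size
    return (observed - expected) / math.sqrt(observed)


# Hypothetical counts in a 1,000,000-token corpus.
rare_pair = t_score(1, 1, 1, 1_000_000)                  # seen once
frequent_pair = t_score(1000, 20_000, 10_000, 1_000_000)  # seen 1,000 times

print(round(rare_pair, 2), round(frequent_pair, 2))
```

With these counts the frequent pair scores far higher than the one-off pair, even though it occurs only five times more often than chance would predict.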
Delta P (DP)
- Description: Captures predictive strength in directional co-occurrence. A high DP score means one word strongly predicts the next.
- Indices:
- BNC/COCA/NNS_[ ]_bi_DP
- BNC/COCA/NNS_[ ]_tri_DP
- BNC/COCA/NNS_[ ]_tri_2_DP
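The directionality of Delta P can be sketched with the standard formulation ΔP(outcome | cue) = P(outcome | cue) − P(outcome | no cue), computed from a 2×2 contingency table. The function name `delta_p`, the counts, and the choice to treat the corpus size as the table total are assumptions for illustration, not a confirmed description of the TAALES implementation.

```python
def delta_p(pair_freq: float, cue_freq: float,
            outcome_freq: float, corpus_size: float) -> float:
    """Textbook Delta P: P(outcome | cue) - P(outcome | cue absent)."""
    a = pair_freq                  # cue followed by outcome
    b = cue_freq - pair_freq       # cue followed by something else
    c = outcome_freq - pair_freq   # outcome preceded by something else
    d = corpus_size - a - b - c    # neither cue nor outcome
    return a / (a + b) - c / (c + d)


# Hypothetical bigram: word 1 occurs 100 times, word 2 occurs 10,000 times,
# and they occur together 80 times in a 1,000,000-token corpus.
forward = delta_p(80, cue_freq=100, outcome_freq=10_000, corpus_size=1_000_000)
backward = delta_p(80, cue_freq=10_000, outcome_freq=100, corpus_size=1_000_000)

print(round(forward, 3), round(backward, 3))
```

Here word 1 strongly predicts word 2 (forward score near 0.79), while word 2 barely predicts word 1 (backward score near 0.008), which is the asymmetry that symmetric measures such as MI cannot show.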
Approximate collexeme (AC)
- Description: Measures how semantically cohesive a word pair is. High AC scores indicate formulaic or strongly related word sequences.
- Indices:
- BNC/COCA/NNS_[ ]_bi_AC
- BNC/COCA/NNS_[ ]_tri_AC
- BNC/COCA/NNS_[ ]_tri_2_AC
Notes
tri_MI treats the first word as item 1 and the following bigram (i.e., the next two words combined) as item 2. tri_2_MI considers the first bigram (i.e., the first two words combined) as item 1 and the remaining word as item 2.
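The two ways of splitting a trigram described above can be sketched as follows. The function name `split_trigram` and the example trigram are hypothetical and serve only to make the item 1 / item 2 distinction concrete.

```python
def split_trigram(trigram: tuple[str, str, str], variant: str = "tri"):
    """Return (item 1, item 2) for the 'tri' or 'tri_2' treatment of a trigram."""
    w1, w2, w3 = trigram
    if variant == "tri":
        # First word vs. the following bigram.
        return (w1,), (w2, w3)
    if variant == "tri_2":
        # First bigram vs. the remaining word.
        return (w1, w2), (w3,)
    raise ValueError(f"unknown variant: {variant!r}")


print(split_trigram(("in", "order", "to"), "tri"))
# (('in',), ('order', 'to'))
print(split_trigram(("in", "order", "to"), "tri_2"))
# (('in', 'order'), ('to',))
```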
References
- BNC Consortium. (2007). British national corpus. Oxford Text Archive Core Collection.
- Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2), 159-190. https://doi.org/10.1075/ijcl.14.2.02dav
- Monteiro, K. R., Crossley, S. A., & Kyle, K. (2020). In search of new benchmarks: Using L2 lexical frequency and contextual diversity indices to assess second language writing. Applied Linguistics, 41(2), 280-300. https://doi.org/10.1093/applin/amy056