Lexical Diversity Analysis

Calculates multiple lexical diversity metrics for a document-feature matrix (DFM) or tokens object. Supports all quanteda.textstats measures plus MTLD (Measure of Textual Lexical Diversity), which McCarthy & Jarvis (2010) found to be largely unaffected by text length.
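
As a rough illustration of what the simpler measures compute, the sketch below derives four of them from a plain character vector of tokens in base R. The helper name `simple_lexdiv` and the toy tokens are made up for this example; the function documented here relies on quanteda.textstats rather than on code like this.

```r
# Illustrative only: base-R versions of four classic measures.
# `simple_lexdiv` is a hypothetical helper, not part of the package.
simple_lexdiv <- function(tokens) {
  n <- length(tokens)          # N: total number of tokens
  v <- length(unique(tokens))  # V: number of distinct types
  c(
    TTR  = v / n,              # type-token ratio
    C    = log(v) / log(n),    # Herdan's C
    R    = v / sqrt(n),        # Guiraud's root TTR
    CTTR = v / sqrt(2 * n)     # Carroll's corrected TTR
  )
}

toks <- c("the", "cat", "sat", "on", "the", "mat", "and", "the", "dog", "sat")
round(simple_lexdiv(toks), 3)
```

All four depend only on the counts N and V, which is why they can be computed from a DFM as well as from a tokens object.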

Usage

lexical_diversity_analysis(x, measures = "all", texts = NULL, cache_key = NULL)

Arguments

x: A quanteda DFM or tokens object. A tokens object is preferred for accurate MTLD calculation because it preserves token order.
measures: Character vector of measures to calculate. Default is "all", which includes: TTR, C, R, CTTR, U, S, K, I, D, Vm, Maas, MATTR, MSTTR, and MTLD. Recommended for length-independent comparison: "MTLD" or "MATTR".
texts: Optional character vector of original texts. Required for MTLD calculation when using DFM input (a DFM does not retain token order).
cache_key: Optional cache key (e.g., from digest::digest) for caching expensive calculations. Reusing the same cache_key retrieves the cached results.
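
The general idea behind a key-based cache can be sketched in a few lines of base R. This is only an assumption about how such a cache might work: `sketch_cache`, `with_cache`, and `slow_metric` are hypothetical names, and the package's own caching code is not shown on this page.

```r
# Hypothetical in-memory cache keyed by a caller-supplied string.
sketch_cache <- new.env(parent = emptyenv())

with_cache <- function(cache_key, compute) {
  # No key supplied: always recompute.
  if (is.null(cache_key)) return(compute())
  # Key seen before: return the stored result without recomputing.
  if (exists(cache_key, envir = sketch_cache, inherits = FALSE)) {
    return(get(cache_key, envir = sketch_cache, inherits = FALSE))
  }
  result <- compute()
  assign(cache_key, result, envir = sketch_cache)
  result
}

# Stand-in for an expensive calculation; counts how often it really runs.
calls <- 0
slow_metric <- function() {
  calls <<- calls + 1
  42
}

first  <- with_cache("demo-key", slow_metric)
second <- with_cache("demo-key", slow_metric)  # served from the cache
```

Deriving the key from the input (for example with digest::digest, as in the examples) means a changed input produces a new key instead of a stale result.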

Value

A list containing:

lexical_diversity: Data frame with per-document lexical diversity scores
summary_stats: List of summary statistics (mean, median, sd) for each measure
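
To make the shape of the return value more concrete, the sketch below builds a small table of per-document scores and derives mean, median, and sd for each measure. The scores are invented and the exact column and element names in the real output are an assumption.

```r
# Hypothetical per-document scores; column names mirror the measure names.
scores <- data.frame(
  document = c("doc1", "doc2", "doc3"),
  TTR  = c(0.60, 0.70, 0.80),
  MTLD = c(40, 55, 70)
)

# One list entry per measure, each holding mean, median and sd.
measure_cols <- setdiff(names(scores), "document")
summary_stats <- lapply(scores[measure_cols], function(x) {
  list(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE)
  )
})

summary_stats$MTLD$mean
```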

Details

MTLD (Measure of Textual Lexical Diversity) is calculated using the algorithm from McCarthy & Jarvis (2010). The text is read token by token while a running TTR is tracked; each time the TTR falls below 0.72, one "factor" is counted and the running TTR is reset. The number of tokens is then divided by the number of factors, which gives a measure of lexical diversity that is largely independent of text length.
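
The factor-counting procedure can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: `mtld_pass` and `mtld_sketch` are hypothetical names, the partial-factor credit for leftover tokens and the averaging of a forward and a backward pass follow the published algorithm, and the strict "below 0.72" comparison follows the wording above.

```r
# Hypothetical helper: one directional pass of MTLD over a token vector.
mtld_pass <- function(tokens, threshold = 0.72) {
  factors <- 0
  seen <- character(0)  # types seen in the current segment
  n_seg <- 0            # tokens in the current segment
  for (tok in tokens) {
    n_seg <- n_seg + 1
    if (!(tok %in% seen)) seen <- c(seen, tok)
    if (length(seen) / n_seg < threshold) {
      # Running TTR fell below the threshold: close a full factor and reset.
      factors <- factors + 1
      seen <- character(0)
      n_seg <- 0
    }
  }
  if (n_seg > 0) {
    # Leftover tokens count as a partial factor, scaled by how far
    # their TTR has moved from 1 towards the threshold.
    factors <- factors + (1 - length(seen) / n_seg) / (1 - threshold)
  }
  if (factors == 0) return(NA_real_)  # undefined if the TTR never drops below 1
  length(tokens) / factors
}

# Average of a forward and a backward pass over the same tokens.
mtld_sketch <- function(tokens, threshold = 0.72) {
  mean(c(mtld_pass(tokens, threshold), mtld_pass(rev(tokens), threshold)))
}

# Same six tokens, different order: the scores differ.
grouped     <- c("a", "a", "b", "b", "c", "c")
interleaved <- c("a", "b", "c", "a", "b", "c")
mtld_sketch(grouped)
mtld_sketch(interleaved)
```

The two toy vectors contain identical word counts, so a DFM built from either would be the same, yet the sketch scores them differently. This is why token order has to be available for MTLD.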

Important notes:

For MTLD accuracy, pass a tokens object (not a DFM) as input
If using a DFM, provide the texts argument so that MTLD can be calculated
MATTR and MSTTR window sizes are automatically adjusted for short documents
Results are cached when cache_key is provided, which speeds up repeated analysis
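
One plausible way to adjust a window for short documents is to cap it at the document length, sketched below for MATTR. The capping rule, the default window of 100, and the name `mattr_sketch` are assumptions for illustration; the page does not state the exact rule the function applies.

```r
# Hypothetical helper: moving-average TTR with the window capped at document length.
mattr_sketch <- function(tokens, window = 100L) {
  n <- length(tokens)
  if (n == 0) return(NA_real_)
  w <- min(window, n)  # assumed adjustment for documents shorter than the window
  starts <- seq_len(n - w + 1)
  # TTR of every window of w consecutive tokens.
  ttrs <- vapply(starts, function(i) {
    length(unique(tokens[i:(i + w - 1)])) / w
  }, numeric(1))
  mean(ttrs)
}

short_doc <- c("a", "b", "a", "b")
mattr_sketch(short_doc, window = 2)    # three windows of two tokens
mattr_sketch(short_doc, window = 100)  # window shrinks to the whole document
```

With the window capped at the document length, the result for a very short document reduces to its plain TTR.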

References

McCarthy, P. M., & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381-392.

See also

Other lexical: calculate_dispersion_metrics(), calculate_lexical_dispersion(), calculate_log_odds_ratio(), calculate_text_readability(), clear_lexdiv_cache(), detect_multi_words(), extract_keywords_keyness(), extract_keywords_tfidf(), extract_morphology(), extract_named_entities(), extract_noun_chunks(), extract_pos_tags(), extract_subjects_objects(), find_similar_words(), get_sentences(), get_spacy_embeddings(), get_spacy_model_info(), get_word_similarity(), init_spacy_nlp(), lexical_analysis, lexical_frequency_analysis(), parse_morphology_string(), plot_keyness_keywords(), plot_keyword_comparison(), plot_lexical_diversity_distribution(), plot_morphology_feature(), plot_readability_by_group(), plot_readability_distribution(), plot_tfidf_keywords(), plot_top_readability_documents(), render_displacy_dep(), render_displacy_ent(), spacy_extract_entities(), spacy_has_vectors(), spacy_initialized(), spacy_lemmatize(), spacy_parse_full(), summarize_morphology()

Examples

if (FALSE) { # \dontrun{
data(SpecialEduTech)
texts <- SpecialEduTech$abstract[1:10]
corp <- quanteda::corpus(texts)
toks <- quanteda::tokens(corp)

# Preferred: pass a tokens object so MTLD sees the original token order
lex_div <- lexical_diversity_analysis(toks, texts = texts)

# With caching for repeated analysis: derive the key from the input texts
cache_key <- digest::digest(texts)
lex_div <- lexical_diversity_analysis(toks, texts = texts, cache_key = cache_key)

# Alternative: pass a DFM and supply the texts so MTLD can still be calculated
dfm_obj <- quanteda::dfm(toks)
lex_div <- lexical_diversity_analysis(dfm_obj, texts = texts)

# Prints the per-document scores and the summary statistics
print(lex_div)
} # }