Lexical analysis examines word patterns and frequencies.

Setup

library(TextAnalysisR)

mydata <- SpecialEduTech  # example dataset bundled with TextAnalysisR
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))  # combine text columns into one field
tokens <- prep_texts(united_tbl, text_field = "united_texts")  # tokenize and preprocess the combined text
dfm_object <- quanteda::dfm(tokens)  # build a document-feature matrix

Word Frequency

plot_word_frequency(dfm_object, top_n = 20)

TF-IDF Keyword Extraction

Find distinctive words per document using term frequency-inverse document frequency (TF-IDF):

keywords <- extract_keywords_tfidf(dfm_object, top_n = 10)
plot_tfidf_keywords(keywords, n_docs = 5)

TF-IDF gives higher weight to terms that are frequent within a document but rare across the corpus, so the highest-weighted terms reflect each document's distinctive vocabulary.
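
If you want to inspect the weighting directly, quanteda's dfm_tfidf() applies the same idea to the document-feature matrix built above; a minimal sketch (the exact weighting scheme used by extract_keywords_tfidf() may differ):

library(quanteda)

tfidf_dfm <- dfm_tfidf(dfm_object)   # apply TF-IDF weighting to the dfm
topfeatures(tfidf_dfm[1, ], n = 10)  # top 10 weighted terms for the first document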


Keyness Analysis

Compare word usage between groups:

keyness <- extract_keywords_keyness(
  dfm_object,
  target_group = "Journal Article",
  reference_groups = "Conference Paper",
  category_var = "reference_type"
)
plot_keyness_keywords(keyness)

Keyness analysis identifies statistically significant differences in word usage between groups.
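
Keyness statistics of this kind can also be computed directly with quanteda.textstats::textstat_keyness(); a minimal sketch, assuming the dfm carries a reference_type document variable as in the call above:

library(quanteda)
library(quanteda.textstats)

# Collapse documents into one group per reference type, then test which words
# are over-represented in journal articles relative to the other groups
dfm_grouped <- dfm_group(dfm_object, groups = reference_type)
keyness_stats <- textstat_keyness(dfm_grouped, target = "Journal Article", measure = "chi2")
head(keyness_stats)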


N-gram Analysis

N-grams are contiguous sequences of n words. Frequent n-grams capture multi-word expressions like “machine learning” or “New York City” that carry meaning as complete phrases.

Types:

  • Bigrams: 2-word sequences (e.g., “data analysis”)
  • Trigrams: 3-word sequences (e.g., “natural language processing”)
  • 4-grams & 5-grams: Longer phrases (e.g., “statistical significance test results”)

Usage: Set minimum frequency (how often phrases appear) and lambda (collocation strength) to detect meaningful multi-word expressions.

tokens <- detect_multi_words(tokens, min_count = 10)
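
For comparison, collocation strength (lambda) can be inspected directly with quanteda.textstats::textstat_collocations(); a minimal sketch (not necessarily how detect_multi_words() is implemented):

library(quanteda.textstats)

# Score 2- and 3-word sequences; higher lambda means the words co-occur
# more often than chance would predict
collocs <- textstat_collocations(tokens, size = 2:3, min_count = 10)
head(collocs[order(-collocs$lambda), ])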

Learn More: Text Mining with R - N-grams Chapter


Part-of-Speech Tagging

Part-of-speech (POS) tagging identifies the grammatical category of each word. Requires Python with spaCy.

Tags (Universal Dependencies):

  • NOUN, VERB, ADJ, ADV: Content words
  • PROPN: Proper nouns (names)
  • DET, ADP, PRON: Function words
  • NUM, PUNCT: Numbers, punctuation

Usage: Filter by tags to focus on specific word types (e.g., nouns and verbs for content analysis).
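
As a concrete sketch, the same tagging can be done directly with the spacyr package (assumed here as the R-to-spaCy bridge; TextAnalysisR's own interface may differ):

library(spacyr)

# Start the spaCy backend (requires Python, spaCy, and an English model)
spacy_initialize(model = "en_core_web_sm")

# Tag each token with its Universal Dependencies POS category
pos_tagged <- spacy_parse(united_tbl$united_texts, pos = TRUE, lemma = TRUE)

# Keep only content words (nouns and verbs) for content analysis
subset(pos_tagged, pos %in% c("NOUN", "VERB"))

spacy_finalize()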

Learn More: Universal Dependencies POS Tags


Morphological Analysis

Morphological analysis extracts grammatical features from words. It uses Python's spaCy via reticulate.

Features:

  • Number (singular/plural): Sing, Plur
  • Tense (verb tense): Past, Pres, Fut
  • VerbForm (verb form): Fin, Inf, Part, Ger
  • Person (grammatical person): 1, 2, 3
  • Case (grammatical case): Nom, Acc, Dat, Gen

parsed <- extract_pos_tags(united_tbl$united_texts)  # uses spacy_parse_full
# Returns columns: doc_id, token, lemma, pos, tag

Usage: Analyze verb tenses for temporal patterns, number agreement, or grammatical complexity.

Learn More: spaCy Morphology


Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies named entities in text. Requires Python with spaCy.

Entity Types:

  • PERSON, ORG: People, organizations
  • GPE, LOC: Places, locations
  • DATE, MONEY, PERCENT: Temporal, monetary values

Usage: Filter by entity type. Add custom entities for qualitative coding.
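
A minimal sketch using spacyr (assumed here as the R-to-spaCy bridge), extracting entities from the united texts and keeping only organizations:

library(spacyr)

spacy_initialize(model = "en_core_web_sm")

# Extract named entities; the result lists each entity with its type
entities <- spacy_extract_entity(united_tbl$united_texts)
subset(entities, ent_type == "ORG")  # keep only organization mentions

spacy_finalize()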

Learn More: spaCy Named Entity Recognition


Word Networks

Co-occurrence

word_co_occurrence_network(dfm_object, co_occur_n = 10)

Correlation

word_correlation_network(dfm_object, corr_n = 0.3)

Log Odds Ratio Analysis

Log odds ratio compares word frequencies between categories to identify distinctive vocabulary.

Simple Log Odds Ratio:

log_odds <- calculate_log_odds_ratio(
  dfm_object,
  group_var = "category",
  comparison_mode = "binary",
  top_n = 15
)
plot_log_odds_ratio(log_odds)

Weighted Log Odds Ratio:

For publication-quality analysis, use the weighted log odds method, which accounts for sampling variability by standardizing each log odds ratio by its estimated variance (reported as a z-score). This method identifies words that reliably distinguish the groups, not just rare words with extreme ratios.

# Requires tidylo package: install.packages("tidylo")
weighted_odds <- calculate_weighted_log_odds(
  dfm_object,
  group_var = "category",
  top_n = 15
)
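
For reference, the weighted log odds can also be computed directly with tidylo on a tidy count table; a minimal sketch, assuming a "category" document variable as in the calls above and the dplyr, tidytext, and tidylo packages:

library(dplyr)

# Tidy the dfm into (document, term, count) rows and attach the grouping variable
counts <- tidytext::tidy(dfm_object) |>
  left_join(
    data.frame(document = quanteda::docnames(dfm_object),
               category = quanteda::docvars(dfm_object, "category")),
    by = "document"
  ) |>
  count(category, term, wt = count, name = "n")

# Weighted (z-score standardized) log odds for each term within each category
weighted <- tidylo::bind_log_odds(counts, set = category, feature = term, n = n)
head(arrange(weighted, desc(log_odds_weighted)))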

Learn More: tidylo: Weighted Log Odds


Lexical Dispersion

Lexical dispersion (X-ray plot) shows where terms appear across documents.

dispersion <- calculate_lexical_dispersion(tokens, terms = c("education", "technology"))
plot_lexical_dispersion(dispersion)
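
The same kind of X-ray plot can be drawn directly with quanteda; a minimal sketch using the tokens object from the setup step:

library(quanteda)
library(quanteda.textplots)

# Locate each term in every document and plot the positions of its occurrences
textplot_xray(
  kwic(tokens, pattern = "education"),
  kwic(tokens, pattern = "technology")
)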

Readability Metrics

Readability metrics quantify text complexity using statistical measures of sentence structure and word characteristics.

Available Metrics:

  • Flesch Reading Ease: sentence length + syllables; output 0-100 (higher = easier)
  • Flesch-Kincaid: sentence length + syllables; output U.S. grade level
  • Gunning Fog: sentence length + complex words; output years of education
  • SMOG: polysyllabic word count; output years of education
  • ARI: characters per word; output U.S. grade level
  • Coleman-Liau: letters per 100 words; output U.S. grade level

Usage Notes:

  • Different formulas may produce slightly different grade level estimates
  • These formulas measure surface-level text features (word length, sentence length)
  • Short texts may produce less reliable scores

readability <- calculate_text_readability(united_tbl$united_texts)
plot_readability_distribution(readability)
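
These scores can also be computed directly with quanteda.textstats::textstat_readability(); a minimal sketch using a few of the measures listed above (the exact set used by calculate_text_readability() may differ):

library(quanteda.textstats)

readability_scores <- textstat_readability(
  united_tbl$united_texts,
  measure = c("Flesch", "Flesch.Kincaid", "FOG", "SMOG", "ARI")
)
head(readability_scores)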

Learn More: quanteda textstat_readability Documentation


Lexical Diversity Metrics

Lexical diversity measures vocabulary richness by quantifying the relationship between unique words (types) and total words (tokens).

Available Metrics:

  • TTR: types / tokens (sensitive to text length)
  • CTTR: types / sqrt(2 × tokens) (partially corrects for length)
  • MSTTR: mean segmental TTR, computed over fixed-size segments
  • MATTR: moving-average TTR (more stable across lengths)
  • MTLD: mean length of word sequences maintaining a TTR threshold (largely independent of text length)
  • Maas: log-based formula (lower = more diverse)

Usage Notes:

  • MTLD and MATTR are more stable across different text lengths
  • TTR is sensitive to text length - compare only similar-length texts
  • Maas, Yule's K, and Simpson's D use inverse scales (lower values = more diverse)
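
A minimal sketch of computing several of these metrics directly with quanteda.textstats::textstat_lexdiv(), using the objects from the setup step:

library(quanteda.textstats)

# Type-token based measures computed per document from the dfm
head(textstat_lexdiv(dfm_object, measure = c("TTR", "CTTR", "Maas")))

# MATTR needs token order, so it is computed on the tokens object
head(textstat_lexdiv(tokens, measure = "MATTR", MATTR_window = 100))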

Learn More: quanteda textstat_lexdiv Documentation