Lexical analysis examines word patterns and frequencies.

Setup

library(TextAnalysisR)

mydata <- SpecialEduTech  # example dataset bundled with TextAnalysisR
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))  # combine text columns into one field
tokens <- prep_texts(united_tbl, text_field = "united_texts")  # tokenize and preprocess the combined text
dfm_object <- quanteda::dfm(tokens)  # build a document-feature matrix

Word Frequency

plot_word_frequency(dfm_object, top_n = 20)

TF-IDF Keyword Extraction

Find distinctive words per document using term frequency-inverse document frequency (TF-IDF):

keywords <- extract_keywords_tfidf(dfm_object, top_n = 10)
plot_tfidf_keywords(keywords, n_docs = 5)

TF-IDF gives higher weight to terms that are frequent within a document but rare across the corpus, so the highest-weighted terms reflect each document's distinctive vocabulary.
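
If you want to inspect the weighting directly, quanteda's dfm_tfidf() applies the same idea to the document-feature matrix built above; a minimal sketch (the exact weighting scheme used by extract_keywords_tfidf() may differ):

library(quanteda)

tfidf_dfm <- dfm_tfidf(dfm_object)   # apply TF-IDF weighting to the dfm
topfeatures(tfidf_dfm[1, ], n = 10)  # top 10 weighted terms for the first document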


Keyness Analysis

Compare word usage between groups:

keyness <- extract_keywords_keyness(
  dfm_object,
  target_group = "Journal Article",
  reference_groups = "Conference Paper",
  category_var = "reference_type"
)
plot_keyness_keywords(keyness)

Keyness analysis identifies statistically significant differences in word usage between groups.
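
Keyness statistics of this kind can also be computed directly with quanteda.textstats::textstat_keyness(); a minimal sketch, assuming the dfm carries a reference_type document variable as in the call above:

library(quanteda)
library(quanteda.textstats)

# Collapse documents into one group per reference type, then test which words
# are over-represented in journal articles relative to the other groups
dfm_grouped <- dfm_group(dfm_object, groups = reference_type)
keyness_stats <- textstat_keyness(dfm_grouped, target = "Journal Article", measure = "chi2")
head(keyness_stats)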


N-gram Analysis

N-grams are contiguous sequences of n words. Frequent n-grams capture multi-word expressions like “machine learning” or “New York City” that carry meaning as complete phrases.

Types:

  • Bigrams: 2-word sequences (e.g., “data analysis”)
  • Trigrams: 3-word sequences (e.g., “natural language processing”)
  • 4-grams & 5-grams: Longer phrases (e.g., “statistical significance test results”)

Usage: Set minimum frequency (how often phrases appear) and lambda (collocation strength) to detect meaningful multi-word expressions.

tokens <- detect_multi_words(tokens, min_count = 10)
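
For comparison, collocation strength (lambda) can be inspected directly with quanteda.textstats::textstat_collocations(); a minimal sketch (not necessarily how detect_multi_words() is implemented):

library(quanteda.textstats)

# Score 2- and 3-word sequences; higher lambda means the words co-occur
# more often than chance would predict
collocs <- textstat_collocations(tokens, size = 2:3, min_count = 10)
head(collocs[order(-collocs$lambda), ])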

Learn More: Text Mining with R - N-grams Chapter


Part-of-Speech Tagging

Part-of-speech (POS) tagging identifies the grammatical category of each word. Requires Python with spaCy.

Tags (Universal Dependencies):

  • NOUN, VERB, ADJ, ADV: Content words
  • PROPN: Proper nouns (names)
  • DET, ADP, PRON: Function words
  • NUM, PUNCT: Numbers, punctuation

Usage: Filter by tags to focus on specific word types (e.g., nouns and verbs for content analysis).
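
As a concrete sketch, the same tagging can be done directly with the spacyr package (assumed here as the R-to-spaCy bridge; TextAnalysisR's own interface may differ):

library(spacyr)

# Start the spaCy backend (requires Python, spaCy, and an English model)
spacy_initialize(model = "en_core_web_sm")

# Tag each token with its Universal Dependencies POS category
pos_tagged <- spacy_parse(united_tbl$united_texts, pos = TRUE, lemma = TRUE)

# Keep only content words (nouns and verbs) for content analysis
subset(pos_tagged, pos %in% c("NOUN", "VERB"))

spacy_finalize()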

Learn More: Universal Dependencies POS Tags


Morphological Analysis

Morphological analysis extracts grammatical features from words. It uses Python's spaCy via reticulate.

Features:

  • Number (singular/plural): Sing, Plur
  • Tense (verb tense): Past, Pres, Fut
  • VerbForm (verb form): Fin, Inf, Part, Ger
  • Person (grammatical person): 1, 2, 3
  • Case (grammatical case): Nom, Acc, Dat, Gen

parsed <- extract_pos_tags(united_tbl$united_texts)  # uses spacy_parse_full
# Returns columns: doc_id, token, lemma, pos, tag

Usage: Analyze verb tenses for temporal patterns, number agreement, or grammatical complexity.

Learn More: spaCy Morphology


Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies named entities in text. Requires Python with spaCy.

Entity Types:

  • PERSON, ORG: People, organizations
  • GPE, LOC: Places, locations
  • DATE, MONEY, PERCENT: Temporal, monetary values

Usage: Filter by entity type. Add custom entities for qualitative coding.
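
A minimal sketch using spacyr (assumed here as the R-to-spaCy bridge), extracting entities from the united texts and keeping only organizations:

library(spacyr)

spacy_initialize(model = "en_core_web_sm")

# Extract named entities; the result lists each entity with its type
entities <- spacy_extract_entity(united_tbl$united_texts)
subset(entities, ent_type == "ORG")  # keep only organization mentions

spacy_finalize()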

Learn More: spaCy Named Entity Recognition


Word Networks

Co-occurrence

word_co_occurrence_network(dfm_object, co_occur_n = 10)

Correlation

word_correlation_network(dfm_object, corr_n = 0.3)

Log Odds Ratio Analysis

Log odds ratio compares word frequencies between categories to identify distinctive vocabulary.

Simple Log Odds Ratio:

log_odds <- calculate_log_odds_ratio(
  dfm_object,
  group_var = "category",
  comparison_mode = "binary",
  top_n = 15
)
plot_log_odds_ratio(log_odds)

Weighted Log Odds Ratio:

For publication-quality analysis, use the weighted log odds method, which accounts for sampling variability by standardizing each log odds ratio by its estimated variance (reported as a z-score). This method identifies words that reliably distinguish the groups, not just rare words with extreme ratios.

# Requires tidylo package: install.packages("tidylo")
weighted_odds <- calculate_weighted_log_odds(
  dfm_object,
  group_var = "category",
  top_n = 15
)
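
For reference, the weighted log odds can also be computed directly with tidylo on a tidy count table; a minimal sketch, assuming a "category" document variable as in the calls above and the dplyr, tidytext, and tidylo packages:

library(dplyr)

# Tidy the dfm into (document, term, count) rows and attach the grouping variable
counts <- tidytext::tidy(dfm_object) |>
  left_join(
    data.frame(document = quanteda::docnames(dfm_object),
               category = quanteda::docvars(dfm_object, "category")),
    by = "document"
  ) |>
  count(category, term, wt = count, name = "n")

# Weighted (z-score standardized) log odds for each term within each category
weighted <- tidylo::bind_log_odds(counts, set = category, feature = term, n = n)
head(arrange(weighted, desc(log_odds_weighted)))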

Learn More: tidylo: Weighted Log Odds


Lexical Dispersion

Lexical dispersion (X-ray plot) shows where terms appear across documents.

dispersion <- calculate_lexical_dispersion(tokens, terms = c("education", "technology"))
plot_lexical_dispersion(dispersion)
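
The same kind of X-ray plot can be drawn directly with quanteda; a minimal sketch using the tokens object from the setup step:

library(quanteda)
library(quanteda.textplots)

# Locate each term in every document and plot the positions of its occurrences
textplot_xray(
  kwic(tokens, pattern = "education"),
  kwic(tokens, pattern = "technology")
)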

Readability Metrics

Readability metrics quantify text complexity using statistical measures of sentence structure and word characteristics.

Available Metrics:

  • Flesch Reading Ease: sentence length + syllables; output 0-100 (higher = easier)
  • Flesch-Kincaid: sentence length + syllables; output U.S. grade level
  • Gunning Fog: sentence length + complex words; output years of education
  • SMOG: polysyllabic word count; output years of education
  • ARI: characters per word; output U.S. grade level
  • Coleman-Liau: letters per 100 words; output U.S. grade level

Usage Notes:

  • Different formulas may produce slightly different grade level estimates
  • These formulas measure surface-level text features (word length, sentence length)
  • Short texts may produce less reliable scores

readability <- calculate_text_readability(united_tbl$united_texts)
plot_readability_distribution(readability)
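
These scores can also be computed directly with quanteda.textstats::textstat_readability(); a minimal sketch using a few of the measures listed above (the exact set used by calculate_text_readability() may differ):

library(quanteda.textstats)

readability_scores <- textstat_readability(
  united_tbl$united_texts,
  measure = c("Flesch", "Flesch.Kincaid", "FOG", "SMOG", "ARI")
)
head(readability_scores)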

Learn More: quanteda textstat_readability Documentation


Lexical Diversity Metrics

Lexical diversity measures vocabulary richness by quantifying the relationship between unique words (types) and total words (tokens).

Available Metrics:

  • TTR: types / tokens (sensitive to text length)
  • CTTR: types / sqrt(2 × tokens) (partially corrects for length)
  • MSTTR: mean segmental TTR, computed over fixed-size segments
  • MATTR: moving-average TTR (more stable across lengths)
  • MTLD: mean length of word sequences maintaining a TTR threshold (largely independent of text length)
  • Maas: log-based formula (lower = more diverse)

Usage Notes:

  • MTLD and MATTR are more stable across different text lengths
  • TTR is sensitive to text length - compare only similar-length texts
  • Maas, Yule's K, and Simpson's D use inverse scales (lower values = more diverse)
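
A minimal sketch of computing several of these metrics directly with quanteda.textstats::textstat_lexdiv(), using the objects from the setup step:

library(quanteda.textstats)

# Type-token based measures computed per document from the dfm
head(textstat_lexdiv(dfm_object, measure = c("TTR", "CTTR", "Maas")))

# MATTR needs token order, so it is computed on the tokens object
head(textstat_lexdiv(tokens, measure = "MATTR", MATTR_window = 100))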

Learn More: quanteda textstat_lexdiv Documentation