
Calculate Log Odds Ratio Between Categories
Source: R/lexical_analysis.R
calculate_log_odds_ratio.Rd
Compares word frequencies between categories using a Laplace-smoothed log
frequency ratio, ranked by z-score. Identifies words distinctively used in
one category. For an informative-prior weighted log-odds, see
calculate_weighted_log_odds().
Usage
calculate_log_odds_ratio(
  dfm_object,
  group_var,
  comparison_mode = c("binary", "one_vs_rest", "pairwise"),
  reference_level = NULL,
  top_n = 10,
  min_count = 5
)
Arguments
- dfm_object
A quanteda dfm object
- group_var
Character, name of the grouping variable in docvars
- comparison_mode
Character, one of "binary", "one_vs_rest", or "pairwise"
binary: Compare two categories directly
one_vs_rest: Compare each category against all others combined
pairwise: Compare all pairs of categories
- reference_level
Character, reference category for binary comparison (default: first level)
- top_n
Number of top terms per comparison (default: 10)
- min_count
Minimum word count to include (default: 5)
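The difference between the comparison modes can be illustrated with a small base-R sketch. The count matrix, the category names, and the "rest" label below are hypothetical and chosen only for illustration; this is not the package's internal code, just one plausible way the counts for each mode could be assembled.

```r
# Hypothetical term-by-category count matrix (illustrative values only)
counts <- matrix(
  c(30, 10, 4,
     5, 20, 9,
    12, 12, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("apple", "banana", "cherry"), c("A", "B", "C"))
)

# one_vs_rest: each category against the sum of all other categories.
# The "rest" label is an assumption, not necessarily what the function returns.
one_vs_rest <- lapply(colnames(counts), function(lvl) {
  data.frame(
    term = rownames(counts),
    category1 = lvl,
    category2 = "rest",
    count1 = counts[, lvl],
    count2 = rowSums(counts) - counts[, lvl],
    row.names = NULL
  )
})
names(one_vs_rest) <- colnames(counts)

# pairwise: every unordered pair of categories
pairs <- combn(colnames(counts), 2, simplify = FALSE)
```

With three categories, one_vs_rest yields three comparisons and pairwise yields three pairs; a binary comparison would use a single pair only.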
Value
Data frame with columns:
term: The word/feature
category1: First category in comparison
category2: Second category in comparison
count1: Count in category 1
count2: Count in category 2
odds1: Odds in category 1
odds2: Odds in category 2
odds_ratio: Ratio of odds
log_odds_ratio: Log of odds ratio (positive = more in compared category)
variance: Variance of the log ratio, 1/(count1 + 1) + 1/(count2 + 1)
z_score: Log ratio divided by its standard error (the square root of variance)
Terms are ranked by absolute z-score.
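As a rough illustration of how these columns relate, the following base-R sketch computes a Laplace-smoothed log frequency ratio, the variance formula given above, and the resulting z-score. The counts are hypothetical, and the exact smoothing denominator (total count plus vocabulary size) is an assumption; the package's own computation of the odds columns may differ.

```r
# Hypothetical counts for three terms in two categories (illustrative only)
count1 <- c(apple = 30, banana = 5, cherry = 12)
count2 <- c(apple = 10, banana = 20, cherry = 12)

# Laplace smoothing: add 1 to each count.
# Assumption: denominators grow by the vocabulary size V.
V <- length(count1)
freq1 <- (count1 + 1) / (sum(count1) + V)
freq2 <- (count2 + 1) / (sum(count2) + V)

# Assumed definition: log ratio of the smoothed relative frequencies
log_ratio <- log(freq1 / freq2)

# Variance as stated in the Value section
variance <- 1 / (count1 + 1) + 1 / (count2 + 1)

# z-score: log ratio divided by its standard error
z_score <- log_ratio / sqrt(variance)

# Rank terms by absolute z-score, largest first
result <- data.frame(term = names(count1), log_ratio, variance, z_score)
result <- result[order(-abs(result$z_score)), ]
```

In this toy data, "cherry" has equal counts in both categories, so its z-score is close to zero and it ranks last.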
Examples
# \donttest{
articles <- TextAnalysisR::SpecialEduTech[1:20, ]
corpus <- quanteda::corpus(
  articles$abstract,
  docvars = data.frame(reference_type = articles$reference_type)
)
dfm_object <- quanteda::dfm(quanteda::tokens(corpus))
log_odds <- calculate_log_odds_ratio(dfm_object, "reference_type")
# }