Topic modeling discovers hidden themes in text collections.
Setup
library(TextAnalysisR)
mydata <- SpecialEduTech
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts")
dfm_object <- quanteda::dfm(tokens)
Find Optimal Topics
find_optimal_k(dfm_object, topic_range = 5:30)
STM (Statistical)
The structural topic model (STM) can use document metadata, such as year or document type, as covariates:
out <- quanteda::convert(dfm_object, to = "stm")
model <- stm::stm(
  documents = out$documents,
  vocab = out$vocab,
  K = 15,
  prevalence = ~ reference_type + s(year),
  data = out$meta
)
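The prevalence formula above lets topic proportions vary with document metadata. As a rough sketch of how such effects can be inspected, the block below fits a small STM on the `gadarian` example data that ships with stm (not the SpecialEduTech data used in this guide) and estimates covariate effects with `stm::estimateEffect()`. The dataset, K = 3, the iteration cap, and the object names are illustrative assumptions; `textProcessor()` also needs the tm and SnowballC packages installed.

```r
library(stm)

# Example survey data bundled with stm (assumption: used here only for illustration)
data(gadarian)

# Tokenize, stem, and attach metadata; then drop empty documents and rare terms
processed <- textProcessor(
  documents = gadarian$open.ended.response,
  metadata = gadarian,
  verbose = FALSE
)
prepped <- prepDocuments(
  processed$documents,
  processed$vocab,
  processed$meta,
  verbose = FALSE
)

# Fit a 3-topic model whose topic prevalence depends on two covariates
fit <- stm(
  documents = prepped$documents,
  vocab = prepped$vocab,
  K = 3,
  prevalence = ~ treatment + s(pid_rep),
  data = prepped$meta,
  init.type = "Spectral",
  max.em.its = 50,  # cap iterations to keep the sketch quick
  seed = 1,
  verbose = FALSE
)

# Regress topic proportions on the same covariates
effects <- estimateEffect(
  1:3 ~ treatment + s(pid_rep),
  fit,
  metadata = prepped$meta
)
summary(effects)
```

With the model fitted in this guide, the analogous call would presumably use `1:15 ~ reference_type + s(year)`, `model`, and `out$meta`.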
# View topics
terms <- get_topic_terms(model, top_term_n = 10)
Embedding-based (Semantic)
Captures meaning using neural networks:
results <- fit_semantic_model(
  texts = united_tbl$united_texts,
  n_topics = 15
)
Hybrid
Combines statistical and semantic approaches:
results <- fit_hybrid_model(
  texts = united_tbl$united_texts,
  metadata = united_tbl[, c("reference_type", "year")],
  n_topics_stm = 15
)
AI Topic Labels
labels <- generate_topic_labels(
  terms,
  model = "gpt-4o-mini",
  openai_api_key = Sys.getenv("OPENAI_API_KEY")
)
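Because `Sys.getenv()` returns an empty string when a variable is unset, the call above may pass an empty key without any warning. A minimal sketch of a guard, using a hypothetical helper named `require_api_key()` that is not part of TextAnalysisR:

```r
# Hypothetical helper (not part of TextAnalysisR): stop early if the key is missing.
require_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(sprintf("Environment variable %s is not set.", var), call. = FALSE)
  }
  invisible(key)
}
```

It could then be used as `openai_api_key = require_api_key()` in place of the bare `Sys.getenv()` call.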