Topic modeling discovers hidden themes in text collections.
Setup

```r
library(TextAnalysisR)
mydata <- SpecialEduTech
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts")
dfm_object <- quanteda::dfm(tokens)
```
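As a quick sanity check before modeling, you can list the most frequent terms in the document-feature matrix with quanteda's `topfeatures()` (a minimal sketch; adjust `n` to taste):

```r
# Inspect the most frequent terms in the dfm before fitting any model
quanteda::topfeatures(dfm_object, n = 10)
```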
Find Optimal Topics

```r
find_optimal_k(dfm_object, topic_range = 5:30)
```

STM (Structural Topic Model)
STM discovers latent topics using probabilistic modeling while incorporating document metadata as covariates. It models how topics vary across documents based on metadata like time, author, or category.
Best For:
- Metadata analysis: Relate topics to document characteristics
- Covariate effects: Test how metadata affects topics
- Interpretability: Clear word-probability distributions
Quality Metrics:
- Semantic Coherence: Measures word co-occurrence within topics
- Exclusivity: Measures how unique words are to each topic
- Held-out Likelihood: Predictive performance on unseen documents
```r
out <- quanteda::convert(dfm_object, to = "stm")
model <- stm::stm(
  documents = out$documents,
  vocab = out$vocab,
  K = 15,
  prevalence = ~ reference_type + s(year),
  data = out$meta
)
```
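A minimal sketch for checking the quality metrics and covariate effects described above, using the stm package's built-in helpers on the fitted model (`M` is the number of top words scored per topic):

```r
# Semantic coherence and exclusivity per topic (higher is better for both)
coherence <- stm::semanticCoherence(model, out$documents, M = 10)
exclusivity <- stm::exclusivity(model, M = 10)
summary(coherence); summary(exclusivity)

# Covariate effects: how topic prevalence varies with metadata
effects <- stm::estimateEffect(1:15 ~ reference_type + s(year),
                               model, metadata = out$meta)
summary(effects)
```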
```r
terms <- get_topic_terms(model, top_term_n = 10)
```

Learn More: Structural Topic Model | STM Vignette
Embedding-based Topics
Uses transformer embeddings with dimensionality reduction and clustering. Best for short texts, multilingual content, and semantic similarity.
Backend Options:
| Backend | When to Use |
|---|---|
| `"python"` | Full BERTopic features (default) |
| `"r"` | No BERTopic installed |
| `"auto"` | Auto-detect available backend |
```r
# Python backend (BERTopic)
results <- fit_embedding_model(
  texts = united_tbl$united_texts,
  method = "umap_hdbscan",
  n_topics = 15
)
```
```r
# R backend (no BERTopic needed)
results <- fit_embedding_model(
  texts = united_tbl$united_texts,
  method = "umap_dbscan",
  backend = "r",
  n_topics = 10
)
```

R Backend Methods: Format is `{dimred}_{clustering}` (e.g., `"umap_dbscan"`, `"tsne_kmeans"`, `"pca_hierarchical"`).
Learn More: BERTopic | Sentence-BERT
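Conceptually, each `{dimred}_{clustering}` method reduces document vectors to a low-dimensional space and then clusters them. A standalone sketch of the UMAP-plus-density-clustering idea using the uwot and dbscan packages (an illustration only: TF-IDF vectors stand in for transformer embeddings here, and the package's actual backend may differ):

```r
library(uwot)    # umap()
library(dbscan)  # hdbscan()

# TF-IDF document vectors as a stand-in for embeddings (sketch assumption)
doc_vectors <- as.matrix(quanteda::dfm_tfidf(dfm_object))

# Reduce dimensionality, then density-cluster the reduced space
reduced <- umap(doc_vectors, n_neighbors = 15, n_components = 5)
clusters <- hdbscan(reduced, minPts = 5)
table(clusters$cluster)  # cluster 0 = noise/outlier documents
```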
Hybrid Topic Modeling
Combines the strengths of both STM and embedding-based approaches. Uses transformer embeddings for semantic understanding while maintaining STM’s ability to model covariate relationships and provide probabilistic topic assignments.
Best For:
- Best of both worlds: Need semantic coherence AND covariate modeling
- Complex research: Testing hypotheses about how metadata affects semantically defined topics
- Validation: Compare and validate findings across different methodological approaches
Quality Metrics:
- Semantic Coherence: How often top words co-occur in documents (higher is better)
- Exclusivity: How unique words are to each topic (higher is better)
- Silhouette Score: Cluster separation for embedding topics (-1 to 1, higher is better)
- Alignment Score: Agreement between STM and embedding topic assignments
- Adjusted Rand Index: Clustering agreement corrected for chance
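Two of these metrics can be computed by hand with standard packages. A standalone sketch with toy data (assumptions: `stm_topics` and `emb_topics` are placeholders for per-document assignments you would extract from your own fitted models):

```r
library(cluster)  # silhouette()
library(mclust)   # adjustedRandIndex()

set.seed(1)
emb <- matrix(rnorm(40), ncol = 2)    # toy 2-D document embeddings
stm_topics <- rep(1:2, each = 10)     # placeholder STM assignments
emb_topics <- kmeans(emb, 2)$cluster  # placeholder embedding clusters

# Silhouette: cluster separation in embedding space (-1 to 1)
sil <- cluster::silhouette(emb_topics, dist(emb))
mean(sil[, "sil_width"])

# Adjusted Rand Index: assignment agreement corrected for chance
mclust::adjustedRandIndex(stm_topics, emb_topics)
```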
```r
results <- fit_hybrid_model(
  texts = united_tbl$united_texts,
  metadata = united_tbl[, c("reference_type", "year")],
  n_topics_stm = 15
)
```

Learn More: Structural Topic Model | BERTopic
AI Topic Labels
AI generates label suggestions based on topic terms. You review and edit:
- Generate: AI creates draft labels from top terms
- Review: Examine suggestions in the output table
- Edit: Modify any labels that need refinement
- Override: Use manual labels field to replace AI suggestions
```r
# AI suggests, human decides
labels <- generate_topic_labels(
  terms,
  provider = "ollama"  # or "openai"
)
# Review and edit labels before final use
```
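To override the AI suggestions entirely (workflow step 4), you can supply your own labels as a named character vector, the same format used in the content-generation example below:

```r
# Manual labels keyed by topic number (replaces AI suggestions)
topic_labels <- c("1" = "Assistive Technology", "2" = "Family Engagement")
```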
Topic-Grounded Content Generation
Generate draft content grounded in your topic model results rather than the AI's parametric knowledge. The LLM receives topic labels and term probabilities (beta scores), ensuring outputs are anchored to your data.
Workflow
- Run topic modeling and optionally generate/edit topic labels
- Select content type: survey items, research questions, theme descriptions, policy recommendations, or interview questions
- Generate drafts: AI creates content using your top terms ordered by beta scores
- Review and edit: Examine all outputs before use
- Export: Download as CSV or Excel
Content Types
| Type | Output | Use Case |
|---|---|---|
| Survey Item | Likert-scale statement | Scale development |
| Research Question | RQ for literature review | Systematic reviews |
| Theme Description | Qualitative theme summary | Thematic analysis |
| Policy Recommendation | Action-oriented statement | Policy analysis |
| Interview Question | Open-ended question | Qualitative research |
Example
```r
# Get topic terms with beta scores
top_terms <- get_topic_terms(model, top_term_n = 10)

# Optional: Add topic labels
topic_labels <- c("1" = "Digital Learning Tools", "2" = "Family Engagement")

# Generate content grounded in topic terms
content <- generate_topic_content(
  topic_terms_df = top_terms,
  content_type = "survey_item",
  topic_labels = topic_labels,  # Optional
  provider = "ollama",
  model = "llama3"
)

# Review before use
print(content)
```
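For the export step, a minimal sketch assuming `content` is a data frame (writexl is one option for Excel output; it is not required by the package):

```r
# Save generated content for review outside R
write.csv(content, "topic_content.csv", row.names = FALSE)

# Excel export via the writexl package (optional)
writexl::write_xlsx(content, "topic_content.xlsx")
```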
Prompt Format
The LLM receives structured prompts with your topic data:

```
Topic: Digital Learning Tools
Top Terms (highest to lowest beta score):
virtual (.035)
manipulatives (.022)
mathematical (.014)
solving (.013)
learning (.012)
```
This “topic-grounded” approach ensures content reflects your actual topic model results, not generic AI knowledge. Prompts include guidelines for person-first language and content-specific best practices.
Customizing Prompts:
```r
# Get default prompts
system_prompt <- get_content_type_prompt("survey_item")
user_template <- get_content_type_user_template("survey_item")

# Use custom prompts
content <- generate_topic_content(
  topic_terms_df = top_terms,
  content_type = "custom",
  system_prompt = "You are a survey methodology expert...",
  user_prompt_template = "Create an item for: {topic_label}\nTerms: {terms}"
)
```

Methods Comparison
| Feature | STM | Embedding | Hybrid |
|---|---|---|---|
| Speed | Fast | Medium | Slow |
| Metadata Support | Yes | No | Yes |
| Short Texts | Poor | Good | Good |
| Multilingual | No | Yes | Yes |
When to Use:
- STM: Metadata analysis, covariate effects
- Embedding: Short texts, semantic similarity (use `backend = "r"` if no BERTopic)
- Hybrid: Combine both approaches
Next Steps
- Semantic Analysis
- Python Environment (for embedding-based methods)