Topic modeling discovers hidden themes in text collections.

Setup

library(TextAnalysisR)

# Load the bundled example dataset
mydata <- SpecialEduTech

# Combine title, keyword, and abstract into one text column per document
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))

# Tokenize/preprocess the united texts and build a document-feature matrix
tokens <- prep_texts(united_tbl, text_field = "united_texts")
dfm_object <- quanteda::dfm(tokens)

Find Optimal Topics

# Compare model fit across candidate numbers of topics (K = 5 to 30)
find_optimal_k(dfm_object, topic_range = 5:30)
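
If you want the underlying diagnostics directly from stm, searchK() evaluates held-out likelihood, semantic coherence, and residuals over a grid of K values. A minimal sketch using stm's own searchK(), not the package wrapper:

# stm-format input for searchK(); mirrors the conversion in the STM section
stm_in <- quanteda::convert(dfm_object, to = "stm")
sk <- stm::searchK(stm_in$documents, stm_in$vocab, K = c(5, 10, 15, 20, 25, 30))
plot(sk)  # compare the diagnostics visually across K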

STM (Structural Topic Model)

STM discovers latent topics using probabilistic modeling while incorporating document metadata as covariates. It models how topics vary across documents based on metadata like time, author, or category.

Best For:

  • Metadata analysis: Relate topics to document characteristics
  • Covariate effects: Test how metadata affects topics
  • Interpretability: Clear word-probability distributions

Quality Metrics:

  • Semantic Coherence: Measures word co-occurrence within topics
  • Exclusivity: Measures how unique words are to each topic
  • Held-out Likelihood: Predictive performance on unseen documents

# Convert the dfm to the list format stm expects
out <- quanteda::convert(dfm_object, to = "stm")

# Fit a 15-topic STM; topic prevalence is modeled on reference type
# plus a smooth function of publication year
model <- stm::stm(
  documents = out$documents,
  vocab = out$vocab,
  K = 15,
  prevalence = ~ reference_type + s(year),
  data = out$meta
)

# Extract the top 10 terms per topic
terms <- get_topic_terms(model, top_term_n = 10)
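
To attach numbers to the quality metrics above, stm ships its own helpers; a minimal sketch using the objects fit here (semanticCoherence(), exclusivity(), and estimateEffect() are all stm functions):

# Per-topic quality diagnostics
coherence <- stm::semanticCoherence(model, out$documents)
exclus <- stm::exclusivity(model)

# Test how the metadata covariates relate to topic prevalence
effects <- stm::estimateEffect(1:15 ~ reference_type + s(year),
                               model, metadata = out$meta)
summary(effects, topics = 1)  # covariate effects for topic 1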

Learn More: Structural Topic Model | STM Vignette


Embedding-based Topics

Uses transformer embeddings with dimensionality reduction and clustering. Best for short texts, multilingual content, and semantic similarity.

Backend Options:

Backend     When to Use
"python"    Full BERTopic features (default)
"r"         Pure-R fallback when BERTopic is not installed
"auto"      Auto-detect the available backend

# Python backend (BERTopic)
results <- fit_embedding_model(
  texts = united_tbl$united_texts,
  method = "umap_hdbscan",
  n_topics = 15
)

# R backend (no BERTopic needed)
results <- fit_embedding_model(
  texts = united_tbl$united_texts,
  method = "umap_dbscan",
  backend = "r",
  n_topics = 10
)

R Backend Methods: Format is {dimred}_{clustering} (e.g., "umap_dbscan", "tsne_kmeans", "pca_hierarchical").
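
For intuition about what a method string like "umap_hdbscan" does, here is a standalone sketch of the reduce-then-cluster idea using the uwot and dbscan packages, with TF-IDF features as stand-in embeddings. This illustrates the concept only and is not the package's internal implementation:

# TF-IDF features as a stand-in for transformer embeddings
tfidf <- as.matrix(quanteda::dfm_tfidf(dfm_object))

# Reduce to a low-dimensional space that preserves local neighborhoods
emb_2d <- uwot::umap(tfidf, n_neighbors = 15, n_components = 2)

# Density-based clustering; points in sparse regions are labeled noise (0)
clusters <- dbscan::hdbscan(emb_2d, minPts = 10)
table(clusters$cluster)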

Learn More: BERTopic | Sentence-BERT


Hybrid Topic Modeling

Combines the strengths of both STM and embedding-based approaches. Uses transformer embeddings for semantic understanding while maintaining STM’s ability to model covariate relationships and provide probabilistic topic assignments.

Best For:

  • Best of both worlds: Need semantic coherence AND covariate modeling
  • Complex research: Testing hypotheses about how metadata affects semantically defined topics
  • Validation: Compare and validate findings across different methodological approaches

Quality Metrics:

  • Semantic Coherence: How often top words co-occur in documents (higher is better)
  • Exclusivity: How unique words are to each topic (higher is better)
  • Silhouette Score: Cluster separation for embedding topics (-1 to 1, higher is better)
  • Alignment Score: Agreement between STM and embedding topic assignments
  • Adjusted Rand Index: Clustering agreement corrected for chance

# Fit STM and embedding models on the same texts for comparison
results <- fit_hybrid_model(
  texts = united_tbl$united_texts,
  metadata = united_tbl[, c("reference_type", "year")],
  n_topics_stm = 15
)
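
The Adjusted Rand Index above can be computed from any two topic-assignment vectors in base R. In this sketch, stm_topics and emb_topics are hypothetical placeholders, since the structure of results depends on fit_hybrid_model():

# ARI: clustering agreement corrected for chance (base R only)
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  a <- rowSums(tab); b <- colSums(tab); n <- sum(tab)
  index <- sum(choose(tab, 2))
  expected <- sum(choose(a, 2)) * sum(choose(b, 2)) / choose(n, 2)
  max_index <- (sum(choose(a, 2)) + sum(choose(b, 2))) / 2
  (index - expected) / (max_index - expected)
}

# adjusted_rand_index(stm_topics, emb_topics)  # hypothetical assignment vectors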

Learn More: Structural Topic Model | BERTopic


AI Topic Labels

AI generates label suggestions based on topic terms. You review and edit:

  1. Generate: AI creates draft labels from top terms
  2. Review: Examine suggestions in the output table
  3. Edit: Modify any labels that need refinement
  4. Override: Use manual labels field to replace AI suggestions

# AI suggests, human decides
labels <- generate_topic_labels(
  terms,
  provider = "ollama"  # or "openai"
)
# Review and edit labels before final use
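
Assuming the returned object is a table with topic and label columns (a hypothetical structure; check the actual return value), individual suggestions can be overridden before downstream use:

# Hypothetical columns: replace a suggestion the reviewer disagrees with
labels$label[labels$topic == 2] <- "Family Engagement"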

Topic-Grounded Content Generation

Generate draft content grounded in your topic model results rather than the LLM's parametric knowledge. The LLM receives topic labels and term probabilities (beta scores), ensuring outputs are anchored to your data.
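
To inspect the beta scores the prompts draw on, an stm model can be tidied into one row per topic-term pair; a minimal sketch using tidytext and dplyr (neither is required by the workflow itself):

library(tidytext)
library(dplyr)

# One row per topic-term pair; beta = P(term | topic)
beta_tbl <- tidy(model, matrix = "beta")

# Top 10 terms per topic, ordered as in the prompt format below
beta_tbl %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  arrange(topic, desc(beta))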

Workflow

  1. Run topic modeling and optionally generate/edit topic labels
  2. Select content type: survey items, research questions, theme descriptions, policy recommendations, or interview questions
  3. Generate drafts: AI creates content using your top terms ordered by beta scores
  4. Review and edit: Examine all outputs before use
  5. Export: Download as CSV or Excel

Content Types

Type                    Output                      Use Case
Survey Item             Likert-scale statement      Scale development
Research Question       RQ for a literature review  Systematic reviews
Theme Description       Qualitative theme summary   Thematic analysis
Policy Recommendation   Action-oriented statement   Policy analysis
Interview Question      Open-ended question         Qualitative research

Example

# Get topic terms with beta scores
top_terms <- get_topic_terms(model, top_term_n = 10)

# Optional: Add topic labels
topic_labels <- c("1" = "Digital Learning Tools", "2" = "Family Engagement")

# Generate content grounded in topic terms
content <- generate_topic_content(
  topic_terms_df = top_terms,
  content_type = "survey_item",
  topic_labels = topic_labels,  # Optional
  provider = "ollama",
  model = "llama3"
)

# Review before use
print(content)

Prompt Format

The LLM receives structured prompts with your topic data:

Topic: Digital Learning Tools

Top Terms (highest to lowest beta score):
virtual (.035)
manipulatives (.022)
mathematical (.014)
solving (.013)
learning (.012)

This “topic-grounded” approach ensures content reflects your actual topic model results, not generic AI knowledge. Prompts include guidelines for person-first language and content-specific best practices.

Customizing Prompts:

# Get default prompts
system_prompt <- get_content_type_prompt("survey_item")
user_template <- get_content_type_user_template("survey_item")

# Use custom prompts
content <- generate_topic_content(
  topic_terms_df = top_terms,
  content_type = "custom",
  system_prompt = "You are a survey methodology expert...",
  user_prompt_template = "Create an item for: {topic_label}\nTerms: {terms}"
)

Methods Comparison

Feature            STM    Embedding   Hybrid
Speed              Fast   Medium      Slow
Metadata Support   Yes    No          Yes
Short Texts        Poor   Good        Good
Multilingual       No     Yes         Yes

When to Use:

  • STM: Metadata analysis, covariate effects
  • Embedding: Short texts, semantic similarity (use backend = "r" if no BERTopic)
  • Hybrid: Combine both approaches

Next Steps