Topic modeling discovers hidden themes in text collections.

Setup

library(TextAnalysisR)

# Load the bundled SpecialEduTech dataset and combine the text columns
mydata <- SpecialEduTech
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))

# Tokenize the combined text and build a document-feature matrix
tokens <- prep_texts(united_tbl, text_field = "united_texts")
dfm_object <- quanteda::dfm(tokens)

Find Optimal Topics

# Evaluate candidate topic counts (K) from 5 to 30
find_optimal_k(dfm_object, topic_range = 5:30)
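find_optimal_k() automates this search. If you want to compute comparable diagnostics with stm directly, searchK() evaluates held-out likelihood, semantic coherence, and exclusivity across candidate K values; a minimal sketch:

# stm's built-in search over a few candidate K values
out <- quanteda::convert(dfm_object, to = "stm")
k_search <- stm::searchK(out$documents, out$vocab, K = c(5, 10, 15, 20, 25, 30))
plot(k_search)  # compare diagnostics across K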

STM (Structural Topic Model)

STM discovers latent topics with a probabilistic model that incorporates document metadata as covariates, letting you model how topic prevalence varies with document characteristics such as time, author, or category.

Best For:

  • Metadata analysis: Relate topics to document characteristics
  • Covariate effects: Test how metadata affects topics
  • Interpretability: Clear word-probability distributions

Quality Metrics:

  • Semantic Coherence: Measures word co-occurrence within topics (Mimno et al., 2011)
  • Exclusivity: Measures how unique words are to each topic
  • Held-out Likelihood: Predictive performance on unseen documents

# Convert the dfm to stm's input format (documents, vocab, metadata)
out <- quanteda::convert(dfm_object, to = "stm")

model <- stm::stm(
  documents = out$documents,
  vocab = out$vocab,
  K = 15,                                   # number of topics
  prevalence = ~ reference_type + s(year),  # metadata covariates; s() fits a spline on year
  data = out$meta
)

# Extract the top 10 terms per topic, with their beta scores
terms <- get_topic_terms(model, top_term_n = 10)
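The quality metrics above can be inspected directly with stm's built-in diagnostics; a minimal sketch (held-out likelihood requires a held-out split via stm::make.heldout(), omitted here):

# Per-topic semantic coherence and exclusivity for the fitted model
coherence <- stm::semanticCoherence(model, out$documents)
exclus <- stm::exclusivity(model)
data.frame(topic = 1:15, coherence, exclusivity = exclus)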

Learn More: Structural Topic Model | STM Vignette


Embedding-based Topics (BERTopic)

BERTopic uses transformer embeddings to capture semantic meaning, then applies UMAP for dimensionality reduction and HDBSCAN for clustering to discover topics. Because the embeddings are contextual, the resulting topics cohere at the level of meaning rather than surface word co-occurrence.

Best For:

  • Semantic coherence: Capture meaning beyond word co-occurrence
  • Short texts: Tweets, reviews, survey responses
  • Multilingual: Handles multiple languages

Key Components:

  • Sentence Transformers: Generate contextual embeddings (e.g., all-MiniLM-L6-v2)
  • UMAP: Dimensionality reduction preserving local structure
  • HDBSCAN: Density-based clustering for topic discovery
  • c-TF-IDF: Class-based TF-IDF for topic keyword extraction

# Fit an embedding-based topic model on the combined texts
results <- fit_embedding_model(
  texts = united_tbl$united_texts,
  n_topics = 15
)
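fit_embedding_model() runs this pipeline through Python. Purely to illustrate the UMAP and HDBSCAN stages, here is a rough R analogue using the CRAN packages umap and dbscan on a placeholder matrix standing in for sentence-transformer embeddings (not the package's actual implementation):

# Illustrative only: reduce placeholder embeddings, then density-cluster them
library(umap)
library(dbscan)

set.seed(1)
emb <- matrix(rnorm(200 * 384), nrow = 200)        # stand-in for 384-dim embeddings
reduced <- umap(emb, n_components = 5)$layout      # UMAP: reduce to 5 dimensions
clusters <- hdbscan(reduced, minPts = 10)$cluster  # HDBSCAN: 0 marks outliers
table(clusters)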

Learn More: BERTopic | Sentence-BERT


Hybrid Topic Modeling

Hybrid modeling combines the strengths of the STM and embedding-based approaches: transformer embeddings supply semantic understanding, while STM retains the ability to model covariate relationships and provide probabilistic topic assignments.

Best For:

  • Best of both worlds: Need semantic coherence AND covariate modeling
  • Complex research: Testing hypotheses about how metadata affects semantically-defined topics
  • Validation: Compare and validate findings across different methodological approaches

Quality Metrics:

  • Semantic Coherence: How often top words co-occur in documents (higher is better)
  • Exclusivity: How unique words are to each topic (higher is better)
  • Silhouette Score: Cluster separation for embedding topics (-1 to 1, higher is better)
  • Alignment Score: Agreement between STM and embedding topic assignments
  • Adjusted Rand Index: Clustering agreement corrected for chance

# Fit STM and embedding-based models on the same texts and metadata
results <- fit_hybrid_model(
  texts = united_tbl$united_texts,
  metadata = united_tbl[, c("reference_type", "year")],
  n_topics_stm = 15
)
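The Adjusted Rand Index compares the two sets of per-document topic assignments; a minimal sketch of that computation using mclust (the element names in results are assumptions, check the actual return value):

# Hypothetical assignment vectors extracted from the hybrid results
stm_topics <- results$stm_assignments        # assumed name
emb_topics <- results$embedding_assignments  # assumed name
mclust::adjustedRandIndex(stm_topics, emb_topics)  # 1 = identical partitions, ~0 = chance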

Learn More: Structural Topic Model | BERTopic


AI Topic Labels

AI generates label suggestions based on topic terms. You review and edit:

  1. Generate: AI creates draft labels from top terms
  2. Review: Examine suggestions in the output table
  3. Edit: Modify any labels that need refinement
  4. Override: Use manual labels field to replace AI suggestions

# AI suggests, human decides
labels <- generate_topic_labels(
  terms,
  provider = "ollama"  # or "openai"
)
# Review and edit labels before final use
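Overriding a suggestion is just an edit to the returned table (the column names here are assumptions; check the actual output structure):

# Replace an AI suggestion with a manual label (hypothetical column names)
labels$label[labels$topic == 1] <- "Digital Learning Tools"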

Topic-Grounded Content Generation

Generate draft content grounded in your topic model results rather than the LLM's parametric knowledge. The LLM receives topic labels and term probabilities (beta scores), so outputs are anchored to your data.
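If you want to inspect the beta scores that feed these prompts, one option is tidytext, which tidies stm models directly (a sketch, assuming the stm model fitted above):

# Term probabilities (beta) per topic, top 10 terms each
library(tidytext)
library(dplyr)

betas <- tidy(model, matrix = "beta")
betas %>% group_by(topic) %>% slice_max(beta, n = 10)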

Workflow

  1. Run topic modeling and optionally generate/edit topic labels
  2. Select content type: survey items, research questions, theme descriptions, policy recommendations, or interview questions
  3. Generate drafts: AI creates content using your top terms ordered by beta scores
  4. Review and edit: Examine all outputs before use
  5. Export: Download as CSV or Excel

Content Types

Type                    Output                      Use Case
Survey Item             Likert-scale statement      Scale development
Research Question       RQ for literature review    Systematic reviews
Theme Description       Qualitative theme summary   Thematic analysis
Policy Recommendation   Action-oriented statement   Policy analysis
Interview Question      Open-ended question         Qualitative research

Example

# Get topic terms with beta scores
top_terms <- get_topic_terms(model, top_term_n = 10)

# Optional: Add topic labels
topic_labels <- c("1" = "Digital Learning Tools", "2" = "Family Engagement")

# Generate content grounded in topic terms
content <- generate_topic_content(
  topic_terms_df = top_terms,
  content_type = "survey_item",
  topic_labels = topic_labels,  # Optional
  provider = "ollama",
  model = "llama3"
)

# Review before use
print(content)

Prompt Format

The LLM receives structured prompts with your topic data:

Topic: Digital Learning Tools

Top Terms (highest to lowest beta score):
virtual (.035)
manipulatives (.022)
mathematical (.014)
solving (.013)
learning (.012)

This “topic-grounded” approach ensures content reflects your actual topic model results, not generic AI knowledge. Prompts include guidelines for person-first language and content-specific best practices.

Customizing Prompts:

# Get default prompts
system_prompt <- get_content_type_prompt("survey_item")
user_template <- get_content_type_user_template("survey_item")

# Use custom prompts
content <- generate_topic_content(
  topic_terms_df = top_terms,
  content_type = "custom",
  system_prompt = "You are a survey methodology expert...",
  user_prompt_template = "Create an item for: {topic_label}\nTerms: {terms}"
)

Methods Comparison
Feature            STM             Embedding       Hybrid
Speed              Fast            Medium          Slow
Metadata Support   Yes             No              Yes
Semantic Meaning   Word patterns   Deep semantic   Both
Interpretability   High            Medium          High
Short Texts        Poor            Good            Good
Multilingual       No              Yes             Yes
Requires Python    No              Yes             Yes

When to Use Each:

  • STM: Academic research with metadata, covariate effects, interpretability priority
  • Embedding: Short texts, multilingual, semantic similarity over frequency
  • Hybrid: High-stakes research, methodological validation, comprehensive analysis

References

Structural Topic Model (STM):

  • Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. Journal of Statistical Software, 91(2), 1–40. https://doi.org/10.18637/jss.v091.i02

Embedding-based Topic Modeling:

  • Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794. https://arxiv.org/abs/2203.05794
  • Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (pp. 3982–3992). Association for Computational Linguistics. https://arxiv.org/abs/1908.10084
  • McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426
  • Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science (Vol. 7819, pp. 160–172). Springer. https://doi.org/10.1007/978-3-642-37456-2_14

Quality Metrics:

  • Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262–272). Association for Computational Linguistics. https://aclanthology.org/D11-1024/
  • Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218. https://doi.org/10.1007/BF01908075

Next Steps