Fit Hybrid Topic Model

Fits a hybrid topic model combining STM with embedding-based methods. This function integrates structural topic modeling (STM) with semantic embeddings for enhanced topic discovery. The STM component provides statistical rigor and covariate modeling capabilities, while the embedding component adds semantic coherence.

Effect Estimation: Covariate effects on topic prevalence can be estimated using the STM component via stm::estimateEffect(). The embedding component provides semantically meaningful topic representations but does not support direct covariate modeling.

Usage

fit_hybrid_model(
  texts,
  metadata = NULL,
  n_topics_stm = 10,
  embedding_model = "all-MiniLM-L6-v2",
  stm_prevalence = NULL,
  stm_init_type = "Spectral",
  compute_quality = TRUE,
  stm_weight = 0.5,
  verbose = TRUE,
  seed = 123
)

Arguments

texts: A character vector of texts to analyze.
metadata: Optional data frame with document metadata for STM covariate modeling.
n_topics_stm: Number of topics for STM (default: 10).
embedding_model: Embedding model name (default: "all-MiniLM-L6-v2").
stm_prevalence: Formula for STM prevalence covariates (e.g., ~ category + s(year, df=3)).
stm_init_type: STM initialization type (default: "Spectral").
compute_quality: Logical, if TRUE, computes quality metrics (default: TRUE).
stm_weight: Weight for STM in keyword combination, 0-1 (default: 0.5).
verbose: Logical, if TRUE, prints progress messages.
seed: Random seed for reproducibility.

Value

A list containing:

stm_result: The STM model output (use this for effect estimation)
embedding_result: The embedding-based topic model output
alignment: Comprehensive alignment metrics including cosine similarity, assignment agreement, correlation, and Adjusted Rand Index
quality_metrics: Quality metrics including coherence, exclusivity, silhouette scores, and combined quality score
combined_topics: Integrated topic representations with weighted keywords
stm_data: STM-formatted data (needed for effect estimation)
metadata: Metadata used in modeling

Note

For covariate effect estimation, use stm::estimateEffect() on the stm_result$model component with stm_data$meta as the metadata.

Examples

if (FALSE) { # \dontrun{
  texts <- c("Computer-assisted instruction improves math skills for students with disabilities",
             "Assistive technology supports reading comprehension for learning disabled students",
             "Mobile devices enhance communication for students with autism spectrum disorder")

  hybrid_model <- fit_hybrid_model(
    texts = texts,
    n_topics_stm = 3,
    compute_quality = TRUE,
    verbose = TRUE
  )

  # View alignment metrics
  hybrid_model$alignment$overall_alignment
  hybrid_model$alignment$adjusted_rand_index

  # View quality metrics
  hybrid_model$quality_metrics$stm_coherence_mean
  hybrid_model$quality_metrics$combined_quality

  # View combined keywords with source attribution
  hybrid_model$combined_topics[[1]]$combined_keywords
} # }

Usage

Arguments

Value

Note

See also

Examples