Automatically searches for optimal hyperparameters for embedding-based topic modeling. Evaluates multiple configurations of UMAP and HDBSCAN parameters and returns the best model based on the specified metric. Embeddings are generated once and reused across all configurations for efficiency.
Usage
auto_tune_embedding_topics(
  texts,
  embeddings = NULL,
  embedding_model = "all-MiniLM-L6-v2",
  n_trials = 12,
  metric = "silhouette",
  seed = 123,
  verbose = TRUE
)

Arguments
- texts
Character vector of documents to analyze.
- embeddings
Optional matrix of precomputed document embeddings, one row per document. If NULL, embeddings are generated from texts using embedding_model (see the sketch after this list).
- embedding_model
Embedding model name (default: "all-MiniLM-L6-v2").
- n_trials
Maximum number of configurations to try (default: 12).
- metric
Optimization metric: "silhouette", "coherence", or "combined" (default: "silhouette").
- seed
Random seed for reproducibility.
- verbose
Logical; if TRUE, progress messages are printed during tuning.
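If document embeddings are already available (for example, computed once and shared across several analyses), they can be passed via the embeddings argument so the function does not regenerate them. A minimal sketch, assuming one row per document and a 384-column matrix such as all-MiniLM-L6-v2 produces; random values stand in for real embeddings here purely for illustration:

texts <- c("Machine learning for image recognition",
           "Deep learning neural networks",
           "Natural language processing models")

# Placeholder matrix; in practice these rows would come from an embedding model.
emb <- matrix(rnorm(length(texts) * 384), nrow = length(texts))

result <- auto_tune_embedding_topics(
  texts = texts,
  embeddings = emb,  # reused across all trials; skips in-function embedding generation
  n_trials = 4,
  metric = "silhouette"
)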
Value
A list containing:
- best_config: Data frame with the optimal hyperparameter configuration
- best_model: The topic model fitted with the optimal parameters
- all_results: List of all evaluated configurations with their metrics
- n_trials_completed: Number of configurations successfully evaluated
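A brief sketch of inspecting the returned list; the internal structure of the all_results elements is an assumption here, so check it with str() before relying on particular field names:

res <- auto_tune_embedding_topics(texts, n_trials = 6)

res$best_config         # optimal hyperparameter combination
res$n_trials_completed  # number of configurations evaluated successfully

# Per-configuration fields are not documented above; inspect one element first.
str(res$all_results[[1]])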
Details
The function searches over these parameters (a sketch of the implied grid follows the list):
- n_neighbors: UMAP neighborhood size (5, 10, 15, 25)
- min_cluster_size: HDBSCAN minimum cluster size (3, 5, 10)
- cluster_selection_method: "eom" (broader clusters) or "leaf" (finer-grained clusters)
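With these values the implied grid has 4 x 3 x 2 = 24 candidate combinations, and n_trials caps how many are actually evaluated. An illustrative reconstruction of that grid (the package may order or sample configurations differently):

grid <- expand.grid(
  n_neighbors = c(5, 10, 15, 25),
  min_cluster_size = c(3, 5, 10),
  cluster_selection_method = c("eom", "leaf"),
  stringsAsFactors = FALSE
)
nrow(grid)  # 24 combinations; at most n_trials of these are tried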
See also
Other topic-modeling:
analyze_semantic_evolution(),
assess_embedding_stability(),
assess_hybrid_stability(),
calculate_assignment_consistency(),
calculate_eval_metrics_internal(),
calculate_keyword_stability(),
calculate_semantic_drift(),
calculate_topic_probability(),
calculate_topic_stability(),
find_optimal_k(),
find_topic_matches(),
fit_embedding_model(),
fit_hybrid_model(),
fit_temporal_model(),
generate_topic_labels(),
get_topic_prevalence(),
get_topic_terms(),
get_topic_texts(),
identify_topic_trends(),
plot_model_comparison(),
plot_quality_metrics(),
run_contrastive_topics_internal(),
run_neural_topics_internal(),
run_temporal_topics_internal(),
validate_semantic_coherence()
Examples
if (FALSE) { # \dontrun{
texts <- c("Machine learning for image recognition",
           "Deep learning neural networks",
           "Natural language processing models",
           "Computer vision applications")

tuning_result <- auto_tune_embedding_topics(
  texts = texts,
  n_trials = 6,
  metric = "silhouette",
  verbose = TRUE
)

# View best configuration
tuning_result$best_config

# Use the best model
best_model <- tuning_result$best_model
} # }
