This function is deprecated. Please use fit_embedding_model() instead.
Usage
fit_embedding_topics(
  texts,
  method = "umap_hdbscan",
  n_topics = 10,
  embedding_model = "all-MiniLM-L6-v2",
  clustering_method = "kmeans",
  similarity_threshold = 0.7,
  min_topic_size = 3,
  umap_neighbors = 15,
  umap_min_dist = 0,
  umap_n_components = 5,
  representation_method = "c-tfidf",
  diversity = 0.5,
  reduce_outliers = TRUE,
  seed = 123,
  verbose = TRUE,
  precomputed_embeddings = NULL
)
Arguments
- texts
A character vector of texts to analyze.
- method
The topic modeling method. One of "umap_hdbscan" (uses BERTopic), "embedding_clustering", or "hierarchical_semantic" (default: "umap_hdbscan").
- n_topics
The number of topics to identify (default: 10). For UMAP+HDBSCAN, use NULL or "auto" for automatic determination, or specify an integer.
- embedding_model
The embedding model to use (default: "all-MiniLM-L6-v2").
- clustering_method
The clustering method for the embedding-based approach: "kmeans", "hierarchical", "dbscan", or "hdbscan" (default: "kmeans").
- similarity_threshold
The similarity threshold for topic assignment (default: 0.7).
- min_topic_size
The minimum number of documents per topic (default: 3).
- umap_neighbors
The number of neighbors for UMAP dimensionality reduction (default: 15).
- umap_min_dist
The minimum distance for UMAP (default: 0.0). Use 0.0 for tight, well-separated clusters. Use 0.1 or higher for visualization. Range: 0.0 to 0.99.
- umap_n_components
The number of UMAP components (default: 5).
- representation_method
The method for topic representation: "c-tfidf", "tfidf", "mmr", "frequency" (default: "c-tfidf").
- diversity
Topic diversity parameter between 0 and 1 (default: 0.5).
- reduce_outliers
Logical, if TRUE, reduces outliers in HDBSCAN clustering (default: TRUE).
- seed
Random seed for reproducibility (default: 123).
- verbose
Logical, if TRUE, prints progress messages (default: TRUE).
- precomputed_embeddings
Optional matrix of pre-computed document embeddings (default: NULL). If provided, embedding generation is skipped, which improves performance. The matrix must have one row per element of texts.
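
The row-count requirement on precomputed_embeddings can be checked before the call. The following is a minimal sketch, not the package's own validation: check_embeddings() is a hypothetical helper, the texts are made up, and the embeddings are random stand-ins rather than real model output. The call itself is guarded so the snippet runs even when the package that exports fit_embedding_topics() is not attached, and the argument values shown are illustrative assumptions.

```r
# Hypothetical example texts, for illustration only.
texts <- c(
  "Solar panels convert sunlight into electricity.",
  "Wind turbines generate power from moving air.",
  "The central bank raised interest rates.",
  "Inflation slowed in the last quarter."
)

# Stand-in embeddings: random numbers, NOT real sentence embeddings.
# 384 columns matches the output dimension of all-MiniLM-L6-v2.
set.seed(123)
embeddings <- matrix(
  rnorm(length(texts) * 384),
  nrow = length(texts),
  ncol = 384
)

# Hypothetical helper (not part of the package): enforces the documented
# rule that the matrix has one row per element of texts.
check_embeddings <- function(texts, embeddings) {
  stopifnot(is.character(texts), is.matrix(embeddings), is.numeric(embeddings))
  if (nrow(embeddings) != length(texts)) {
    stop(sprintf(
      "embeddings has %d rows but texts has %d elements",
      nrow(embeddings), length(texts)
    ))
  }
  invisible(TRUE)
}

check_embeddings(texts, embeddings)

# Run the call only if the function is available in the current session.
if (exists("fit_embedding_topics", mode = "function")) {
  result <- fit_embedding_topics(
    texts,
    method = "embedding_clustering",
    n_topics = 2,
    clustering_method = "kmeans",
    min_topic_size = 2,
    seed = 123,
    verbose = FALSE,
    precomputed_embeddings = embeddings
  )
}
```

Checking the shape up front gives a clearer error than a failure deep inside clustering. Because the function is deprecated, new code would normally target fit_embedding_model() instead; its arguments are not documented on this page, so they are not shown here.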
