Searches for the optimal number of topics (K) using stm::searchK and visualizes the resulting diagnostics: held-out likelihood, residuals, semantic coherence, and the lower bound.

Usage

evaluate_optimal_topic_number(
  dfm_object,
  topic_range,
  max.em.its = 75,
  categorical_var = NULL,
  continuous_var = NULL,
  height = 600,
  width = 800,
  verbose = TRUE,
  ...
)

Arguments

dfm_object

A quanteda document-feature matrix (dfm).

topic_range

A numeric vector specifying the range of topics (K) to search over.

max.em.its

Maximum number of EM iterations (default: 75).

categorical_var

An optional character string naming a categorical variable in the metadata.

continuous_var

An optional character string naming a continuous variable in the metadata.

height

The height of the resulting Plotly plot in pixels (default: 600).

width

The width of the resulting Plotly plot in pixels (default: 800).

verbose

Logical; if TRUE, prints progress information (default: TRUE).

...

Further arguments passed to stm::searchK.
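
Because topic_range accepts any numeric vector, the candidate values of K need not be consecutive. stm::searchK fits a separate model for each K, so a coarser grid means fewer model fits. The sketch below only builds two candidate vectors in base R; the specific values are arbitrary illustrations, not recommendations.

```r
# Illustrative candidate grids for topic_range (values chosen arbitrarily)
full_grid <- 5:30                    # every K from 5 to 30
coarse_grid <- seq(5, 30, by = 5)    # every fifth K from 5 to 30

# Number of models that would be fit for each grid
length(full_grid)
length(coarse_grid)
```

Either vector could be passed as topic_range; a common pattern is to scan a coarse grid first and then search a narrower range around the most promising values.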

Value

A plotly object showing the diagnostics for each candidate number of topics (K).

Examples

if (interactive()) {
  df <- TextAnalysisR::SpecialEduTech

  united_tbl <- TextAnalysisR::unite_text_cols(df, listed_vars = c("title", "keyword", "abstract"))

  tokens <- TextAnalysisR::preprocess_texts(united_tbl, text_field = "united_texts")

  dfm_object <- quanteda::dfm(tokens)

  optimal_topic_range <- TextAnalysisR::evaluate_optimal_topic_number(
    dfm_object = dfm_object,
    topic_range = 5:30,
    max.em.its = 75,
    categorical_var = "reference_type",
    continuous_var = "year",
    height = 600,
    width = 800,
    verbose = TRUE
  )
  print(optimal_topic_range)
}
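
For readers who want the raw diagnostics rather than the plot, the following is a minimal sketch of calling stm::searchK directly. It is not taken from this package's implementation: the built-in quanteda corpus, the trimming threshold, the two K values, and the low iteration cap are all assumptions chosen to keep the example small, and the object names (dfm_small, stm_input, search_result) are hypothetical.

```r
# Sketch only: calls stm::searchK directly on a small built-in corpus.
# Preprocessing choices and K values are illustrative assumptions.
if (requireNamespace("stm", quietly = TRUE) &&
    requireNamespace("quanteda", quietly = TRUE)) {

  toks <- quanteda::tokens(quanteda::data_corpus_inaugural, remove_punct = TRUE)
  dfm_small <- quanteda::dfm_trim(quanteda::dfm(toks), min_docfreq = 5)

  # Convert the dfm to the documents/vocab/meta list that stm expects
  stm_input <- quanteda::convert(dfm_small, to = "stm")

  search_result <- stm::searchK(
    documents = stm_input$documents,
    vocab = stm_input$vocab,
    K = c(3, 5),
    max.em.its = 10,   # forwarded to stm::stm through ...
    verbose = FALSE    # also forwarded to stm::stm
  )

  # One row per K with the diagnostics that the plot summarizes
  print(search_result$results[, c("K", "heldout", "residual", "semcoh", "lbound")])
}
```

The results table holds one row per candidate K, which makes it straightforward to compare held-out likelihood, residuals, semantic coherence, and the lower bound side by side.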