Searches for the optimal number of topics (K) using stm::searchK and visualizes the resulting diagnostics: held-out likelihood, residuals, semantic coherence, and the lower bound.

Usage

evaluate_optimal_topic_number(
  dfm_object,
  topic_range,
  max.em.its = 75,
  categorical_var = NULL,
  continuous_var = NULL,
  height = 600,
  width = 800,
  verbose = TRUE,
  ...
)

Arguments

dfm_object

A quanteda document-feature matrix (dfm).

topic_range

A numeric vector of candidate topic numbers (K) to search over, for example 5:30.

max.em.its

Maximum number of EM iterations (default: 75).

categorical_var

An optional character string for a categorical variable in the metadata.

continuous_var

An optional character string for a continuous variable in the metadata.

height

The height of the resulting Plotly plot in pixels (default: 600).

width

The width of the resulting Plotly plot in pixels (default: 800).

verbose

Logical; if TRUE, prints progress information (default: TRUE).

...

Further arguments passed to stm::searchK.
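
Because stm::searchK fits a model for each value of K it is given, the length of topic_range largely determines how long the search takes. The following base-R sketch compares the full range used in the example with a coarser grid; the step size of 5 and the variable names are illustrative assumptions, not package recommendations.

```r
# Full range from the example: one candidate value of K per integer.
full_range <- 5:30
length(full_range)

# Coarser grid (hypothetical step size of 5): far fewer candidate values.
coarse_range <- seq(from = 5, to = 30, by = 5)
coarse_range
length(coarse_range)
```

Either vector can be supplied as topic_range. One possible workflow is to start with the coarse grid and then rerun the search over a narrower range around the values of K that look promising in the diagnostics.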

Value

A plotly object showing the diagnostics for the number of topics (K).

Examples

if (interactive()) {
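  # Load the example dataset shipped with the package
  df <- TextAnalysisR::SpecialEduTech
  # Combine the selected text columns into a single text field
  united_tbl <- TextAnalysisR::unite_text_cols(
    df,
    listed_vars = c("title", "keyword", "abstract")
  )
  # Preprocess the combined text into tokens, then build the dfm
  tokens <- TextAnalysisR::preprocess_texts(united_tbl, text_field = "united_texts")
  dfm_object <- quanteda::dfm(tokens)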
  TextAnalysisR::evaluate_optimal_topic_number(
    dfm_object = dfm_object,
    topic_range = 5:30,
    max.em.its = 75,
    categorical_var = "reference_type",
    continuous_var = "year",
    height = 600,
    width = 800,
    verbose = TRUE
  )
}