Preprocesses text data by:

  • Constructing a corpus

  • Tokenizing text into words

  • Converting to lowercase

  • Removing default English stopwords and optional custom stopwords

  • Removing tokens shorter than a minimum length (min_char)

Typically used before constructing a dfm and fitting an STM model.
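
The steps listed above can be sketched directly with quanteda. This is a minimal illustration, not the function's actual source: the toy data frame, the custom stopword "app", and the choice to drop punctuation are assumptions made for the example.

```r
library(quanteda)

# Toy input (hypothetical); the column name mirrors the text_field default.
df <- data.frame(
  united_texts = c(
    "Students USE assistive technology in a classroom.",
    "The teachers are evaluating an app for math."
  ),
  stringsAsFactors = FALSE
)

# 1. Construct a corpus from the text column.
corp <- corpus(df, text_field = "united_texts")

# 2. Tokenize into words (dropping punctuation is an assumption here).
toks <- tokens(corp, remove_punct = TRUE)

# 3. Convert to lowercase.
toks <- tokens_tolower(toks)

# 4. Remove default English stopwords plus a custom one.
custom_stopwords <- c("app")  # hypothetical custom stopword
toks <- tokens_remove(toks, pattern = c(stopwords("en"), custom_stopwords))

# 5. Keep only tokens with at least two characters.
toks <- tokens_select(toks, min_nchar = 2)

toks
```

Running the steps separately like this makes it easy to inspect the tokens after each stage when tuning the stopword list or min_char.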

Usage

preprocess_texts(
  united_tbl,
  text_field = "united_texts",
  custom_stopwords = NULL,
  min_char = 2,
  ...
)

Arguments

united_tbl

A data frame that contains text data.

text_field

The name of the column in united_tbl that contains text data.

custom_stopwords

A character vector of additional stopwords to remove. Default is NULL.

min_char

Minimum length in characters for tokens (default is 2).

...

Further arguments passed to quanteda::corpus.

Value

A quanteda tokens object: a list-like object with one element per document, each holding that document's tokens in their original order. The object is ready for constructing a dfm and fitting an STM model.
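
As a rough sketch of the downstream step, a tokens object can be turned into a dfm and then into the list format that stm expects. The two short documents below are made up for illustration, and the commented-out stm::stm call (with K = 2) only indicates where model fitting would go; it assumes the stm package is installed.

```r
library(quanteda)

# Hypothetical, already-cleaned texts standing in for preprocess_texts() output.
toks <- tokens(c(
  doc1 = "assistive technology supports students",
  doc2 = "teachers evaluate math software"
))

# Build a document-feature matrix: one row per document, one column per feature.
dfm_obj <- dfm(toks)

# Convert to the list structure (documents, vocab, meta) used by stm.
stm_input <- convert(dfm_obj, to = "stm")
names(stm_input)

# Fitting would then look roughly like this (not run here):
# fit <- stm::stm(documents = stm_input$documents,
#                 vocab = stm_input$vocab,
#                 K = 2)
```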

Examples

if (interactive()) {
  df <- TextAnalysisR::SpecialEduTech
  united_tbl <- TextAnalysisR::unite_text_cols(df, listed_vars = c("title", "keyword", "abstract"))
  tokens <- TextAnalysisR::preprocess_texts(united_tbl, text_field = "united_texts")
  tokens
}