Preprocesses text data by:

  • Constructing a corpus

  • Tokenizing text into words

  • Converting tokens to lowercase

  • Removing tokens shorter than a minimum character length

Typically used before constructing a document-feature matrix (dfm) and fitting a structural topic model (STM); see the sketch below.
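
A rough sketch of that downstream workflow, assuming the quanteda and stm packages and a tokens object returned by preprocess_texts() as in the Examples below; the trimming threshold and the number of topics K are illustrative choices, not package defaults:

# Downstream sketch: tokens -> dfm -> STM (illustrative settings)
dfm <- quanteda::dfm(tokens)
dfm <- quanteda::dfm_trim(dfm, min_termfreq = 2)

# Convert to the input format expected by stm::stm()
stm_input <- quanteda::convert(dfm, to = "stm")
stm_model <- stm::stm(documents = stm_input$documents,
                      vocab = stm_input$vocab,
                      data = stm_input$meta,
                      K = 10)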

Usage

preprocess_texts(
  united_tbl,
  text_field = "united_texts",
  min_char = 2,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = TRUE,
  split_tags = TRUE,
  include_docvars = TRUE,
  keep_acronyms = FALSE,
  padding = FALSE,
  verbose = FALSE,
  ...
)

Arguments

united_tbl

A data frame that contains text data.

text_field

The name of the column that contains the text data (default: "united_texts").

min_char

The minimum number of characters for a token to be included (default: 2).

remove_punct

Logical; remove punctuation from the text (default: TRUE).

remove_symbols

Logical; remove symbols from the text (default: TRUE).

remove_numbers

Logical; remove numbers from the text (default: TRUE).

remove_url

Logical; remove URLs from the text (default: TRUE).

remove_separators

Logical; remove separators from the text (default: TRUE).

split_hyphens

Logical; split hyphenated words into separate tokens (default: TRUE).

split_tags

Logical; split social media tags (e.g., hashtags and usernames) into separate tokens (default: TRUE).

include_docvars

Logical; include document variables in the tokens object (default: TRUE).

keep_acronyms

Logical; preserve upper-case acronyms when converting to lowercase (default: FALSE).

padding

Logical; if TRUE, leave an empty string where removed tokens previously existed, preserving token positions (default: FALSE).

verbose

Logical; print verbose output (default: FALSE).

...

Additional arguments passed to quanteda::tokens().
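
For example, assuming the dots are forwarded to quanteda::tokens() as documented, a tokenizer option such as what can be passed through (using united_tbl from the Examples below):

# Passing a quanteda::tokens() option through ... (illustrative)
tokens <- TextAnalysisR::preprocess_texts(united_tbl,
                                          text_field = "united_texts",
                                          what = "fasterword")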

Value

A quanteda tokens object containing the preprocessed text data.
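
Since the function wraps quanteda::tokens(), the return value can be inspected with the usual quanteda accessors, for example:

# Inspect the preprocessed tokens
quanteda::ntoken(tokens)   # token counts per document
quanteda::docvars(tokens)  # document variables retained via include_docvars
head(tokens)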

Examples

if (interactive()) {
  # Load the example dataset bundled with TextAnalysisR
  df <- TextAnalysisR::SpecialEduTech

  # Combine the title, keyword, and abstract columns into a single text column
  united_tbl <- TextAnalysisR::unite_text_cols(df, listed_vars = c("title", "keyword", "abstract"))

  # Tokenize and preprocess the united text column
  tokens <- TextAnalysisR::preprocess_texts(
    united_tbl,
    text_field = "united_texts",
    min_char = 2,
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE,
    remove_url = TRUE,
    remove_separators = TRUE,
    split_hyphens = TRUE,
    split_tags = TRUE,
    include_docvars = TRUE,
    keep_acronyms = FALSE,
    padding = FALSE,
    verbose = FALSE
  )
  print(tokens)
}