Preprocesses text data by:
- Constructing a corpus
- Tokenizing text into words
- Converting to lowercase
- Removing default English stopwords and optional custom stopwords
- Removing tokens shorter than a minimum length.
Typically used before constructing a dfm and fitting an STM model.
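The steps above can be sketched directly with quanteda. This is a minimal illustration, not the function's actual source: the toy data frame, the extra stopword "pets", and the choice of the stopwords package's English list are assumptions made for the example.

```r
# Toy input standing in for united_tbl (hypothetical data)
united_tbl <- data.frame(
  united_texts = c("The cat sat on a mat", "Dogs and cats are pets"),
  stringsAsFactors = FALSE
)
custom_stopwords <- c("pets")  # hypothetical extra stopword
min_char <- 2

# 1. Construct a corpus from the text column
corp <- quanteda::corpus(united_tbl, text_field = "united_texts")

# 2. Tokenize into words
toks <- quanteda::tokens(corp, what = "word")

# 3. Convert to lowercase
toks <- quanteda::tokens_tolower(toks)

# 4. Remove default English stopwords plus the custom ones
stops <- c(stopwords::stopwords("en"), custom_stopwords)
toks <- quanteda::tokens_remove(toks, pattern = stops)

# 5. Keep only tokens with at least min_char characters
toks <- quanteda::tokens_select(toks, min_nchar = min_char)

toks
```

Keeping each step as a separate call makes it easy to inspect the tokens object after any stage when a result looks unexpected.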
Usage
preprocess_texts(
  united_tbl,
  text_field = "united_texts",
  custom_stopwords = NULL,
  min_char = 2,
  ...
)
Arguments
- united_tbl
A data frame that contains text data.
- text_field
The name of the column in united_tbl that contains text data.
- custom_stopwords
A character vector of additional stopwords to remove. Default is NULL.
- min_char
Minimum length in characters for tokens (default is 2).
- ...
Further arguments passed to quanteda::corpus.
Value
A quanteda tokens object with one element per document, each holding that document's tokens in their original order.
The object is ready for constructing a dfm and fitting an STM model.
Examples
if (interactive()) {
  df <- TextAnalysisR::SpecialEduTech
  united_tbl <- TextAnalysisR::unite_text_cols(
    df,
    listed_vars = c("title", "keyword", "abstract")
  )
  tokens <- TextAnalysisR::preprocess_texts(united_tbl, text_field = "united_texts")
  tokens
}
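As a hedged sketch of the downstream step, a tokens object can be turned into a dfm and then into the list format that the stm package works with. The toy texts below are assumptions for illustration and stand in for the output of preprocess_texts.

```r
# Toy tokens object standing in for the result of preprocess_texts()
toks <- quanteda::tokens(c(
  d1 = "students use tablets",
  d2 = "teachers use software",
  d3 = "tablets support students"
))

# Build a document-feature matrix: one row per document, one column per feature
dfmat <- quanteda::dfm(toks)

# Convert to the list structure used by stm (documents, vocab, meta)
out <- quanteda::convert(dfmat, to = "stm")

names(out)
```

The `documents` and `vocab` elements of the converted list correspond to the `documents` and `vocab` arguments of `stm::stm()`. A corpus this small is only useful for checking the shape of the objects, not for fitting a meaningful model.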