Preprocessing cleans and prepares text for analysis.
Workflow
library(TextAnalysisR)
# 1. Load data
mydata <- SpecialEduTech
# 2. Combine text columns
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
# 3. Tokenize and clean
tokens <- prep_texts(
united_tbl,
text_field = "united_texts",
remove_punct = TRUE,
remove_numbers = TRUE
)
# 4. Remove stopwords
tokens_clean <- quanteda::tokens_remove(tokens, quanteda::stopwords("en"))
# 5. Create document-feature matrix
dfm_object <- quanteda::dfm(tokens_clean)Options
| Parameter | Default | Use Case |
|---|---|---|
remove_punct |
TRUE | FALSE for sentiment analysis |
remove_numbers |
TRUE | FALSE for quantitative text |
lowercase |
TRUE | FALSE to preserve case |
Multi-word Expressions
Detect phrases like “machine learning”:
tokens <- detect_multi_words(tokens, min_count = 10)