Preprocessing cleans and prepares text for analysis.

Workflow

library(TextAnalysisR)

# 1. Load data
mydata <- SpecialEduTech

# 2. Combine text columns
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))

# 3. Tokenize and clean
tokens <- prep_texts(
  united_tbl,
  text_field = "united_texts",
  remove_punct = TRUE,
  remove_numbers = TRUE
)

# 4. Remove stopwords
tokens_clean <- quanteda::tokens_remove(tokens, quanteda::stopwords("en"))

# 5. Create document-feature matrix
dfm_object <- quanteda::dfm(tokens_clean)

Unite Text Columns

Uniting combines multiple text columns into a single column for analysis. This is useful when text content is spread across multiple fields that should be analyzed together.

Examples:

  • Survey Data: Combine multiple open-ended response columns
  • Multi-field Text: Merge title, abstract, and body fields
  • Comments: Concatenate multiple comment or note columns

Usage: Select one or multiple text columns to combine. Columns are concatenated with spaces between them. The united column becomes the text source for all subsequent preprocessing and analysis steps.
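
A minimal sketch of uniting hypothetical survey columns (the data frame and column names below are illustrative placeholders, not part of the package's example data):

# Combine open-ended survey responses into one text column
survey_tbl <- unite_cols(
  survey_data,
  listed_vars = c("response_q1", "response_q2", "response_q3")
)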

Learn More: tidyr Unite Function


Tokenization Options

Tokenization segments continuous text into individual units (tokens), typically words, converting unstructured text into a structured format for computational analysis.

Options:

  • Lowercase: Convert all text to lowercase to treat “Text” and “text” as identical
  • Remove Punctuation: Strip punctuation marks like periods, commas, quotes
  • Remove Numbers: Eliminate numeric digits (keep for technical texts)
  • Remove Symbols: Remove special characters (@, #, $, etc.)
  • Remove URLs: Identify and remove web addresses

Parameter        Default  Use Case
remove_punct     TRUE     Set FALSE for sentiment analysis
remove_numbers   TRUE     Set FALSE for quantitative text
lowercase        TRUE     Set FALSE to preserve case

Usage: Select preprocessing options based on your analysis goals. Sentence segmentation splits text into sentences before tokenization when sentence structure is important (e.g., sentiment analysis).
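
A minimal sketch applying these options directly with quanteda (the sample sentence is illustrative):

# Tokenize a sample string with common cleaning options
toks <- quanteda::tokens(
  "Visit https://example.com: 3 quick tips for #NLP!",
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE
)

# Lowercasing is applied as a separate step in quanteda
toks <- quanteda::tokens_tolower(toks)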

Learn More: quanteda Tokens Documentation


Stopword Removal

Stopwords are common words (e.g., “the”, “is”, “and”) that appear frequently but carry little meaningful content for analysis. Removing them reduces noise and improves focus on content-bearing words.

When to Remove:

  • Topic Modeling: Helps identify content themes by removing function words
  • Keyword Extraction: Ensures meaningful terms rise to the top
  • Content Analysis: Focuses on substantive vocabulary

Usage: Use predefined stopword lists (e.g., Snowball) or add custom words. For sentiment analysis or syntactic studies, consider keeping stopwords as they may carry important meaning.

tokens_clean <- quanteda::tokens_remove(tokens, quanteda::stopwords("en"))
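
To add custom stopwords, append them to a predefined list (the extra terms below are hypothetical examples):

# Extend the built-in list with domain-specific terms
custom_stops <- c(quanteda::stopwords("en"), "however", "thus")
tokens_clean <- quanteda::tokens_remove(tokens, custom_stops)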

Learn More: stopwords Package Documentation


Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma). For example, “running”, “ran”, and “runs” all become “run”. This groups related word forms together for more meaningful analysis.

Comparison:

  • Lemmatization: Uses linguistic knowledge to produce valid dictionary words (studies → study)
  • Stemming: Uses simple rules to chop word endings (studies → studi)
  • Advantage: Lemmatization produces readable, meaningful base forms

Usage: Apply lemmatization after tokenization to consolidate word variants. Particularly useful for topic modeling and keyword extraction where grouping related forms improves interpretability. Requires Python with spaCy.
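
A minimal sketch of lemmatization via the spacyr bridge to spaCy (assuming spaCy and an English model are installed, e.g., with spacyr::spacy_install(); this illustrates the general approach, not necessarily TextAnalysisR's internal implementation):

# Parse text with spaCy and extract lemmas
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

parsed <- spacy_parse("The studies were running smoothly.", lemma = TRUE)
parsed$lemma  # e.g., "studies" -> "study", "running" -> "run"

spacy_finalize()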

Learn More: spaCy Lemmatization Guide


Document-Feature Matrix (DFM)

A Document-Feature Matrix (DFM) is a mathematical representation where rows are documents, columns are unique tokens (features), and cells contain frequency counts. It converts unstructured text into a structured numerical format for computational analysis.

Process:

  • Tokenization: Text is split into individual tokens (words)
  • Vocabulary: All unique tokens form the matrix columns
  • Counting: Each document-token pair is counted
  • Sparse Matrix: Efficient storage format for large corpora

Usage: The DFM is the foundation for all downstream analyses including keyword extraction, topic modeling, and semantic analysis. Create it after preprocessing (tokenization, stopword removal, lemmatization).

dfm_object <- quanteda::dfm(tokens_clean)
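
To verify the result, inspect the matrix dimensions and the most frequent features:

dim(dfm_object)                            # documents x features
quanteda::topfeatures(dfm_object, n = 10)  # ten most frequent features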

Learn More: quanteda DFM Documentation


Multi-word Expressions

Multi-word expressions are phrases whose tokens function as a single unit of meaning (e.g., “machine learning”). Detecting and compounding them keeps such phrases intact rather than splitting them into less informative single tokens:

tokens <- detect_multi_words(tokens, min_count = 10)
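
A minimal sketch of the underlying technique using quanteda tools (assuming detect_multi_words wraps a similar collocation-scoring step; this is an illustration, not a confirmed detail of the package):

# Score two-word collocations and compound frequent ones into single tokens
library(quanteda.textstats)
collocs <- textstat_collocations(tokens, size = 2, min_count = 10)
tokens <- quanteda::tokens_compound(tokens, pattern = collocs)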