Preprocessing cleans and prepares text for analysis.

Workflow

library(TextAnalysisR)

# 1. Load data
mydata <- SpecialEduTech

# 2. Combine text columns
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))

# 3. Tokenize and clean
tokens <- prep_texts(
  united_tbl,
  text_field = "united_texts",
  remove_punct = TRUE,
  remove_numbers = TRUE
)

# 4. Remove stopwords
tokens_clean <- quanteda::tokens_remove(tokens, quanteda::stopwords("en"))

# 5. Create document-feature matrix
dfm_object <- quanteda::dfm(tokens_clean)

Unite Text Columns

Uniting combines multiple text columns into a single column for analysis. This is useful when text content is spread across multiple fields that should be analyzed together.

Examples:

  • Survey Data: Combine multiple open-ended response columns
  • Multi-field Text: Merge title, abstract, and body fields
  • Comments: Concatenate multiple comment or note columns

Usage: Select one or multiple text columns to combine. Columns are concatenated with spaces between them. The united column becomes the text source for all subsequent preprocessing and analysis steps.
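
A minimal sketch of uniting hypothetical survey columns (the data frame and column names below are illustrative placeholders, not part of the package's example data):

# Combine open-ended survey responses into one text column
survey_tbl <- unite_cols(
  survey_data,
  listed_vars = c("response_q1", "response_q2", "response_q3")
)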

Learn More: tidyr Unite Function


Tokenization Options

Tokenization segments continuous text into individual units (tokens), typically words, converting unstructured text into a structured format for computational analysis.

Options:

  • Lowercase: Convert all text to lowercase to treat “Text” and “text” as identical
  • Remove Punctuation: Strip punctuation marks like periods, commas, quotes
  • Remove Numbers: Eliminate numeric digits (keep for technical texts)
  • Remove Symbols: Remove special characters (@, #, $, etc.)
  • Remove URLs: Identify and remove web addresses

Parameter        Default  Use Case
remove_punct     TRUE     Set FALSE for sentiment analysis
remove_numbers   TRUE     Set FALSE for quantitative text
lowercase        TRUE     Set FALSE to preserve case

Usage: Select preprocessing options based on your analysis goals. Sentence segmentation splits text into sentences before tokenization when sentence structure is important (e.g., sentiment analysis).
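
A minimal sketch applying these options directly with quanteda (the sample sentence is illustrative):

# Tokenize a sample string with common cleaning options
toks <- quanteda::tokens(
  "Visit https://example.com: 3 quick tips for #NLP!",
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE
)

# Lowercasing is applied as a separate step in quanteda
toks <- quanteda::tokens_tolower(toks)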

Learn More: quanteda Tokens Documentation


Stopword Removal

Stopwords are common words (e.g., “the”, “is”, “and”) that appear frequently but carry little meaningful content for analysis. Removing them reduces noise and improves focus on content-bearing words.

When to Remove:

  • Topic Modeling: Helps identify content themes by removing function words
  • Keyword Extraction: Ensures meaningful terms rise to the top
  • Content Analysis: Focuses on substantive vocabulary

Usage: Use predefined stopword lists (e.g., Snowball) or add custom words. For sentiment analysis or syntactic studies, consider keeping stopwords as they may carry important meaning.

tokens_clean <- quanteda::tokens_remove(tokens, quanteda::stopwords("en"))
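
To add custom stopwords, append them to a predefined list (the extra terms below are hypothetical examples):

# Extend the built-in list with domain-specific terms
custom_stops <- c(quanteda::stopwords("en"), "however", "thus")
tokens_clean <- quanteda::tokens_remove(tokens, custom_stops)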

Learn More: stopwords Package Documentation


Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma). For example, “running”, “ran”, and “runs” all become “run”. This groups related word forms together for more meaningful analysis.

Comparison:

  • Lemmatization: Uses linguistic knowledge to produce valid dictionary words (studies → study)
  • Stemming: Uses simple rules to chop word endings (studies → studi)
  • Advantage: Lemmatization produces readable, meaningful base forms

Usage: Apply lemmatization after tokenization to consolidate word variants. Particularly useful for topic modeling and keyword extraction where grouping related forms improves interpretability. Requires Python with spaCy.
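
A minimal sketch of lemmatization via the spacyr bridge to spaCy (assuming spaCy and an English model are installed, e.g., with spacyr::spacy_install(); this illustrates the general approach, not necessarily TextAnalysisR's internal implementation):

# Parse text with spaCy and extract lemmas
library(spacyr)
spacy_initialize(model = "en_core_web_sm")

parsed <- spacy_parse("The studies were running smoothly.", lemma = TRUE)
parsed$lemma  # e.g., "studies" -> "study", "running" -> "run"

spacy_finalize()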

Learn More: spaCy Lemmatization Guide


Document-Feature Matrix (DFM)

A Document-Feature Matrix (DFM) is a mathematical representation where rows are documents, columns are unique tokens (features), and cells contain frequency counts. It converts unstructured text into a structured numerical format for computational analysis.

Process:

  • Tokenization: Text is split into individual tokens (words)
  • Vocabulary: All unique tokens form the matrix columns
  • Counting: Each document-token pair is counted
  • Sparse Matrix: Efficient storage format for large corpora

Usage: The DFM is the foundation for all downstream analyses including keyword extraction, topic modeling, and semantic analysis. Create it after preprocessing (tokenization, stopword removal, lemmatization).

dfm_object <- quanteda::dfm(tokens_clean)
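
To verify the result, inspect the matrix dimensions and the most frequent features:

dim(dfm_object)                            # documents x features
quanteda::topfeatures(dfm_object, n = 10)  # ten most frequent features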

Learn More: quanteda DFM Documentation


Multi-word Expressions

Multi-word expressions are phrases whose tokens function as a single unit of meaning (e.g., “machine learning”). Detecting and compounding them keeps such phrases intact rather than splitting them into less informative single tokens:

tokens <- detect_multi_words(tokens, min_count = 10)
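
A minimal sketch of the underlying technique using quanteda tools (assuming detect_multi_words wraps a similar collocation-scoring step; this is an illustration, not a confirmed detail of the package):

# Score two-word collocations and compound frequent ones into single tokens
library(quanteda.textstats)
collocs <- textstat_collocations(tokens, size = 2, min_count = 10)
tokens <- quanteda::tokens_compound(tokens, pattern = collocs)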