Preprocesses text data following the complete workflow implemented in the Shiny application:
- Constructing a corpus from united texts
- Tokenizing text into words with configurable options
- Converting to lowercase with acronym preservation option
- Applying character length filtering
- Optional multi-word expression detection and compound term creation
- Stopword removal and lemmatization capabilities
This function serves as the foundation for all subsequent text analysis workflows.
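The workflow above can be approximated with plain quanteda calls. This is a minimal sketch, not the function's actual source: the toy data frame, the choice of compound phrase, and the exact ordering of steps are assumptions made for illustration.

```r
library(quanteda)

# Toy input standing in for the output of unite_cols(); the column name
# "united_texts" matches the default text_field (the data itself is invented).
united_tbl <- data.frame(
  id = 1:2,
  united_texts = c(
    "The NASA program tested 3 hand-held calculators.",
    "Computer assisted instruction improves math facts!"
  ),
  stringsAsFactors = FALSE
)

# 1. Build a corpus from the united text column.
corp <- corpus(united_tbl, text_field = "united_texts")

# 2. Tokenize into words, mirroring the removal options listed under Usage.
toks <- tokens(
  corp,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = TRUE,
  include_docvars = TRUE
)

# 3. Lowercase, leaving all-uppercase acronyms such as "NASA" untouched.
toks <- tokens_tolower(toks, keep_acronyms = TRUE)

# 4. Drop tokens shorter than two characters (the min_char default).
toks <- tokens_select(toks, min_nchar = 2)

# 5. Join a multi-word expression into one compound token (assumed phrase).
toks <- tokens_compound(toks, pattern = phrase("math facts"))

# 6. Remove stopwords from the snowball English list.
toks <- tokens_remove(toks, pattern = stopwords("en", source = "snowball"))

print(toks)
```

In prep_texts() itself these steps are controlled through the arguments described below rather than called one by one.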
Usage
prep_texts(
united_tbl,
text_field = "united_texts",
min_char = 2,
lowercase = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE,
split_hyphens = TRUE,
split_tags = TRUE,
include_docvars = TRUE,
keep_acronyms = FALSE,
padding = FALSE,
remove_stopwords = FALSE,
stopwords_source = "snowball",
stopwords_language = "en",
custom_stopwords = NULL,
custom_valuetype = "glob",
math_mode = FALSE,
verbose = FALSE,
...
)
Arguments
- united_tbl
A data frame that contains text data.
- text_field
The name of the column that contains the text data.
- min_char
The minimum number of characters for a token to be included (default: 2).
- lowercase
Logical; convert all tokens to lowercase (default: TRUE). Recommended for most text analysis tasks.
- remove_punct
Logical; remove punctuation from the text (default: TRUE).
- remove_symbols
Logical; remove symbols from the text (default: TRUE).
- remove_numbers
Logical; remove numbers from the text (default: TRUE).
- remove_url
Logical; remove URLs from the text (default: TRUE).
- remove_separators
Logical; remove separators from the text (default: TRUE).
- split_hyphens
Logical; split hyphenated words into separate tokens (default: TRUE).
- split_tags
Logical; split tags into separate tokens (default: TRUE).
- include_docvars
Logical; include document variables in the tokens object (default: TRUE).
- keep_acronyms
Logical; preserve acronyms in their original case when lowercasing (default: FALSE).
- padding
Logical; add padding to the tokens object (default: FALSE).
- remove_stopwords
Logical; remove stopwords from the text (default: FALSE).
- stopwords_source
Character; source for stopwords, e.g., "snowball", "stopwords-iso" (default: "snowball").
- stopwords_language
Character; language for stopwords (default: "en").
- custom_stopwords
Character vector; additional words to remove (default: NULL).
- custom_valuetype
Character; valuetype for custom_stopwords pattern matching, one of "glob", "regex", or "fixed" (default: "glob").
- math_mode
Logical; if TRUE, preserve math content (numbers, operators, symbols) by forcing remove_punct, remove_symbols, and remove_numbers all to FALSE, then strip only sentence-end punctuation such as periods, commas, question marks, exclamation marks, colons, semicolons, parentheses, brackets, braces, quotation marks, em dashes, and en dashes. The min_char default of 2 still applies, so noisy single-character tokens are dropped; pass min_char = 1 to keep them. Use for math or STEM corpora where multi-character operators and numerals carry meaning (default: FALSE).
- verbose
Logical; print verbose output (default: FALSE).
- ...
Additional arguments passed to quanteda::tokens().
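The interaction between math_mode and min_char described above can be illustrated with quanteda directly. This is a rough sketch of the described behavior, not the package's implementation: the sample sentence and the punctuation list are assumptions, and the real function may strip a different set of characters.

```r
library(quanteda)

# Invented sample text containing numerals and an operator.
txt <- c(eq = "Add 12 + 30, then divide by 20.")

# Keep numbers, symbols, and punctuation at tokenization time,
# as math_mode = TRUE is described as doing.
toks <- tokens(
  txt,
  remove_punct = FALSE,
  remove_symbols = FALSE,
  remove_numbers = FALSE
)

# Strip only sentence punctuation, matched as exact tokens (assumed list).
sentence_punct <- c(".", ",", "?", "!", ":", ";",
                    "(", ")", "[", "]", "{", "}", "\"")
toks <- tokens_remove(toks, pattern = sentence_punct, valuetype = "fixed")

# With a minimum length of 2, single-character tokens such as "+" are dropped.
toks_min2 <- tokens_select(toks, min_nchar = 2)

# With a minimum length of 1, they are kept.
toks_min1 <- tokens_select(toks, min_nchar = 1)

print(toks_min2)
print(toks_min1)
```

Comparing the two results shows why min_char = 1 is suggested when single-character operators matter.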
See also
unite_cols() to combine text columns first; lemmatize_tokens() to reduce words to base form (e.g., running -> run); quanteda::dfm() to build a document-feature matrix from the result
Examples
# \donttest{
mydata <- TextAnalysisR::SpecialEduTech
united_tbl <- TextAnalysisR::unite_cols(
mydata,
listed_vars = c("title", "keyword", "abstract")
)
tokens <- TextAnalysisR::prep_texts(united_tbl,
text_field = "united_texts",
min_char = 2,
lowercase = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE,
split_hyphens = TRUE,
split_tags = TRUE,
include_docvars = TRUE,
keep_acronyms = FALSE,
padding = FALSE,
verbose = FALSE)
print(tokens)
#> Tokens consisting of 490 documents and 6 docvars.
#> text1 :
#> [1] "dyscalculia" "and" "the" "minicalculator"
#> [5] "the" "alp" "program" "arithmetic"
#> [9] "arithmetic" "remedial" "teaching" "education"
#> [ ... and 109 more ]
#>
#> text2 :
#> [1] "the" "effects" "of" "computer"
#> [5] "assisted" "instruction" "for" "mastery"
#> [9] "of" "multiplication" "facts" "on"
#> [ ... and 72 more ]
#>
#> text3 :
#> [1] "computer" "assisted" "instruction" "with" "learning"
#> [6] "disabled" "students" "computer" "assisted" "instruction"
#> [11] "computer" "programs"
#> [ ... and 53 more ]
#>
#> text4 :
#> [1] "arc" "ed" "curriculum" "applicability"
#> [5] "for" "severely" "handicapped" "pupils"
#> [9] "computer" "assisted" "instruction" "games"
#> [ ... and 48 more ]
#>
#> text5 :
#> [1] "arc" "ed" "curriculum" "the" "application"
#> [6] "of" "video" "game" "formats" "to"
#> [11] "educational" "software"
#> [ ... and 128 more ]
#>
#> text6 :
#> [1] "the" "effect" "of" "the" "hand"
#> [6] "held" "calculator" "on" "mathematics" "speed"
#> [11] "accuracy" "and"
#> [ ... and 345 more ]
#>
#> [ reached max_ndoc ... 484 more documents ]
# }
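The custom_valuetype options and the quanteda::dfm() step mentioned under See also can be sketched as follows. The example calls quanteda::tokens_remove() directly on invented texts, on the assumption that custom_stopwords and custom_valuetype are matched in a similar way; the patterns are hypothetical.

```r
library(quanteda)

# Invented two-document example.
toks <- tokens(c(
  d1 = "computer assisted instruction for students",
  d2 = "computers and student learning"
))

# "glob": wildcard matching, so "comput*" covers "computer" and "computers".
toks_glob <- tokens_remove(toks, pattern = "comput*", valuetype = "glob")

# "fixed": exact token match only, so "students" is left in place.
toks_fixed <- tokens_remove(toks, pattern = "student", valuetype = "fixed")

# "regex": regular expression, here any token starting with "stud".
toks_regex <- tokens_remove(toks, pattern = "^stud", valuetype = "regex")

# Build a document-feature matrix from one of the results.
dfm_glob <- dfm(toks_glob)
print(featnames(dfm_glob))
```

Glob patterns are usually the most convenient for word families, while fixed matching avoids removing more than intended.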
