Remove Common Words Across Documents — remove_common

Removes the specified common words from a quanteda tokens object, applies two dictionaries to categorize the remaining tokens, and returns a document-feature matrix (dfm) built from the processed tokens. If no words are specified for removal (remove_vars is NULL), it returns the initial dfm produced by dfm_init_func() instead.

Usage

remove_common_words(tokens, remove_vars, dfm_object)

Arguments

tokens: A tokens object from the quanteda package, typically processed using functions like tokens_select or tokens_remove.
remove_vars: A character vector of words to remove from the tokens. If NULL, the function returns the result of dfm_init_func().
dfm_object: A dfm object to process after removing the specified words.

Value

A dfm object with the specified words removed and the remaining tokens categorized by the two dictionaries.

Examples

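if (interactive()) {
  # Load the example dataset shipped with TextAnalysisR
  df <- TextAnalysisR::SpecialEduTech
  # Combine the selected text columns into a single field
  united_tbl <- TextAnalysisR::unite_text_cols(df, listed_vars = c("title", "keyword", "abstract"))
  # Tokenize and preprocess the combined text
  tokens <- TextAnalysisR::preprocess_texts(united_tbl, text_field = "united_texts")
  # Build the initial dfm from the preprocessed tokens
  dfm_object <- quanteda::dfm(tokens)
  # Remove the listed words and return the updated dfm
  TextAnalysisR::remove_common_words(tokens = tokens,
                                     remove_vars = c("level", "testing"),
                                     dfm_object = dfm_object)
}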
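
For readers who want to see the word-removal step in isolation, the following is a minimal sketch using only quanteda. It is not the package's implementation: the helper name drop_words_sketch, the toy texts, and the choice to fall back to a plain dfm when remove_vars is NULL are assumptions made for illustration, and the dictionary categorization step is omitted because the dictionaries are not specified here.

```r
library(quanteda)

# Hypothetical helper (not part of TextAnalysisR): drop the listed words
# from a tokens object and build a dfm from what remains.
drop_words_sketch <- function(toks, remove_vars = NULL) {
  # Assumption: with nothing to remove, return a dfm of the unmodified tokens
  if (is.null(remove_vars)) {
    return(dfm(toks))
  }
  # tokens_remove() drops every token matching the supplied patterns
  toks_clean <- tokens_remove(toks, pattern = remove_vars)
  dfm(toks_clean)
}

# Toy documents, made up for this example
txt <- c(d1 = "level testing improves student learning",
         d2 = "testing tools support student level goals")
toks <- tokens(txt)

dfm_all   <- drop_words_sketch(toks)
dfm_clean <- drop_words_sketch(toks, remove_vars = c("level", "testing"))

# Inspect the features that survive removal
featnames(dfm_clean)
```

By default tokens_remove() matches patterns case-insensitively with glob-style wildcards, so a pattern such as "test*" would also remove "testing" and "tests".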