
Preprocess text data in the following steps: construct a corpus from a data frame; segment the texts in the corpus into tokens; convert the tokens to lowercase; remove stopwords; and keep only tokens that meet a minimum length in characters (at least 2).
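The steps above can be sketched directly with quanteda. This is a minimal illustration under stated assumptions, not the function's actual implementation: the toy data frame, the choice of `remove_punct = TRUE`, and the English stopword list are all assumptions made for the example.

```r
library(quanteda)

# Hypothetical toy data; the column name mirrors the default text_field.
toy <- data.frame(
  united_texts = c("The students USED a calculator in class.",
                   "A video game format can teach math!"),
  stringsAsFactors = FALSE
)

# 1. Construct a corpus from the text column.
corp <- corpus(toy, text_field = "united_texts")

# 2. Segment the texts into tokens (dropping punctuation is an assumption).
toks <- tokens(corp, remove_punct = TRUE)

# 3. Convert tokens to lowercase.
toks <- tokens_tolower(toks)

# 4. Remove stopwords (the English list is an assumption).
toks <- tokens_remove(toks, pattern = stopwords("en"))

# 5. Keep only tokens with at least 2 characters.
toks <- tokens_select(toks, min_nchar = 2)

print(toks)
```

Each step returns a tokens object, so the calls can also be chained with a pipe.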

Usage

preprocess_texts(data, text_field = "united_texts", ...)

Arguments

data

A data frame that contains text data.

text_field

The name of the column in data that contains the text. Defaults to "united_texts".

...

Further arguments passed to quanteda::corpus.

Value

A tokens object, as returned by quanteda::tokens: a list-like object holding the tokenized and preprocessed text of each document.
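Because the return value is a quanteda tokens object, the usual quanteda accessors should apply to it. The sketch below uses a small hand-built tokens object as a hypothetical stand-in for the function's output; the document names and texts are made up for illustration.

```r
library(quanteda)

# Hypothetical stand-in for the returned object.
toks <- tokens(c(text1 = "learning disabled students",
                 text2 = "video game formats"))

n_docs   <- ndoc(toks)     # number of documents
n_tokens <- ntoken(toks)   # named vector of token counts per document
tok_list <- as.list(toks)  # plain named list of character vectors
dfmat    <- dfm(toks)      # document-feature matrix for downstream analysis

print(n_docs)
print(n_tokens)
```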

Examples

suppressWarnings({
SpecialEduTech %>% preprocess_texts(text_field = "abstract")
})
#> Tokens consisting of 490 documents and 5 docvars.
#> text1 :
#>  [1] "notes"          "alp"            "minicalculator" "program"       
#>  [5] "elementary"     "mathematics"    "worked"         "well"          
#>  [9] "clinical"       "setting"        "learning"       "disabled"      
#> [ ... and 40 more ]
#> 
#> text2 :
#>  [1] "study"        "investigated" "relationship" "locus"        "control"     
#>  [6] "achievement"  "learning"     "disabled"     "elementary"   "school"      
#> [11] "aged"         "children"    
#> [ ... and 6 more ]
#> 
#> text3 :
#>  [1] "results"       "investigation" "effectiveness" "computer"     
#>  [5] "assisted"      "instruction"   "learning"      "disabled"     
#>  [9] "students"      "elementary"    "school"        "indicate"     
#> [ ... and 11 more ]
#> 
#> text4 :
#>  [1] "arc"        "ed"         "curriculum" "uses"       "video"     
#>  [6] "game"       "formats"    "teach"      "math"       "language"  
#> [11] "arts"       "content"   
#> [ ... and 21 more ]
#> 
#> text5 :
#>  [1] "article"       "explores"      "applicability" "video"        
#>  [5] "arcade"        "game"          "formats"       "educational"  
#>  [9] "microcomputer" "software"      "four"          "variables"    
#> [ ... and 54 more ]
#> 
#> text6 :
#>  [1] "purpose"       "investigation" "determine"     "effect"       
#>  [5] "using"         "hand"          "held"          "calculator"   
#>  [9] "secondary"     "educable"      "mentally"      "retarded"     
#> [ ... and 208 more ]
#> 
#> [ reached max_ndoc ... 484 more documents ]