Preprocess text data in the following steps: construct a corpus; segment the texts in the corpus into tokens; preprocess the tokens; convert the tokens to lowercase; remove stopwords; and keep only tokens that meet a minimum length in characters (at least 2).
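The steps above can be sketched with plain quanteda calls. This is a minimal illustration under stated assumptions, not the source of `preprocess_texts`: the toy data frame `docs`, the English stopword list, and the choice to remove punctuation, symbols, and numbers during tokenization are all assumptions made for the example.

```r
library(quanteda)

# Hypothetical toy data; the column name mirrors the `text_field` used in Examples.
docs <- data.frame(
  id = 1:2,
  abstract = c(
    "The students used a calculator in class.",
    "Video games teach math and language arts!"
  ),
  stringsAsFactors = FALSE
)

# 1. Construct a corpus from the text column; other columns become docvars.
corp <- corpus(docs, text_field = "abstract")

# 2. Segment the texts into tokens (the removal options here are an assumption).
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE)

# 3. Convert tokens to lowercase.
toks <- tokens_tolower(toks)

# 4. Remove stopwords (English list assumed).
toks <- tokens_remove(toks, pattern = stopwords::stopwords("en"))

# 5. Keep only tokens that are at least 2 characters long.
toks <- tokens_select(toks, pattern = "*", selection = "keep", min_nchar = 2)

print(toks)
```

Running each step separately like this makes it easy to inspect the tokens after every stage, which can help when deciding which arguments to pass through `...`.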
Arguments
- data
A data frame that contains the text data.
- text_field
The name of the column in the data frame that contains the text data.
- ...
Further arguments passed to corpus.
Value
A tokens object output from quanteda::tokens.
The result is a list of tokenized and preprocessed text data.
Examples
suppressWarnings({
SpecialEduTech %>% preprocess_texts(text_field = "abstract")
})
#> Tokens consisting of 490 documents and 5 docvars.
#> text1 :
#> [1] "notes" "alp" "minicalculator" "program"
#> [5] "elementary" "mathematics" "worked" "well"
#> [9] "clinical" "setting" "learning" "disabled"
#> [ ... and 40 more ]
#>
#> text2 :
#> [1] "study" "investigated" "relationship" "locus" "control"
#> [6] "achievement" "learning" "disabled" "elementary" "school"
#> [11] "aged" "children"
#> [ ... and 6 more ]
#>
#> text3 :
#> [1] "results" "investigation" "effectiveness" "computer"
#> [5] "assisted" "instruction" "learning" "disabled"
#> [9] "students" "elementary" "school" "indicate"
#> [ ... and 11 more ]
#>
#> text4 :
#> [1] "arc" "ed" "curriculum" "uses" "video"
#> [6] "game" "formats" "teach" "math" "language"
#> [11] "arts" "content"
#> [ ... and 21 more ]
#>
#> text5 :
#> [1] "article" "explores" "applicability" "video"
#> [5] "arcade" "game" "formats" "educational"
#> [9] "microcomputer" "software" "four" "variables"
#> [ ... and 54 more ]
#>
#> text6 :
#> [1] "purpose" "investigation" "determine" "effect"
#> [5] "using" "hand" "held" "calculator"
#> [9] "secondary" "educable" "mentally" "retarded"
#> [ ... and 208 more ]
#>
#> [ reached max_ndoc ... 484 more documents ]