Development and Application of User-Defined Dictionary Objects for Text Mining Analysis in Special Education Technology Research

Mikyung Shin, Ph.D., Assistant Professor¹, Gahangir Hossain, Ph.D., Associate Professor²

¹ Department of Education, Center for Learning Disabilities, West Texas A&M University
² Department of Computer Information and Decision Management, West Texas A&M University

Introduction

Over the last four decades, there have been many attempts to synthesize research on the use of technology in teaching mathematics to students with disabilities. Given the continued development and expansion of technology use in mathematics instruction for these students, it is necessary to consider what themes define the corpus of research published in the STEM fields for students with disabilities.

Many researchers have analyzed large bibliographic datasets using machine learning-based text mining approaches (Sharma et al., 2019). In the text mining process, a domain-specific dictionary and a stop word list are essential. However, there are currently no dictionary or stop word objects available as open science resources to help researchers focus their analysis on special education technology. Thus, in this poster session, the researchers share an ongoing text mining analysis project on special education technology research.

Search Strategy

Inclusion Criteria.

  • Target participants: Students with disabilities in Grades K to 12.
  • Focus of studies: Teaching mathematics using technology.
  • Reference type: Journal articles or dissertations published in English between 1980 and 2021.
  • Document-level variables: Publication year (title and abstract should be reported).

Database Search. Applying the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for article selection procedures (Page et al., 2021), researchers conducted an electronic database search of ERIC (n = 3,548), Web of Science (n = 1,677), Academic Search Complete (n = 1,881), Education Source (n = 1,657), APA PsycInfo (n = 1,515), and MEDLINE (n = 604) for journal articles and dissertations published in English between 1980 and 2021, resulting in a total of 10,882 studies.

Development

Extraction of Bibliographic Data. The research team extracted a bibliographic citation file that included authors, titles, keywords, abstracts, reference types, and publication years. The extracted bibliographic data are available in the poster's online repository and in the textminingR R package (Shin, 2022).

Text Pre-Processing. The researchers pre-processed the textual data in three steps: (a) constructing a corpus by selecting a text column in the dataset (abstract) and combining the texts with document-level variables (publication year); (b) constructing a token object by segmenting the text into individual words (tokens); and (c) constructing a document-feature matrix.

  • Manually expanded a total of 146 acronyms (e.g., “WPS” to “word problem solving” and “VR” to “virtual-representational”).
  • Converted text to lowercase during tokenization.
  • Converted accented characters to American Standard Code for Information Interchange (ASCII) characters.
  • Split hyphenated words and tags.
  • Removed punctuation marks, symbols, numbers, Uniform Resource Locators (URLs), and separators.
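
The pre-processing steps listed above can be approximated in a few lines. The following is a minimal Python sketch using only the standard library, not the quanteda-based R workflow the researchers used. The function names `preprocess` and `document_feature_matrix` are hypothetical, and the acronym map holds only the two examples quoted above rather than the full set of 146.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical map with the two acronyms quoted in the text (the poster reports 146).
ACRONYMS = {"WPS": "word problem solving", "VR": "virtual-representational"}


def preprocess(text, acronyms=ACRONYMS):
    """Turn one abstract into a list of lowercase ASCII word tokens."""
    # Expand acronyms before lowercasing so uppercase acronyms are matched as written.
    for short, long_form in acronyms.items():
        text = re.sub(rf"\b{re.escape(short)}\b", long_form, text)
    text = text.lower()
    # Decompose accented characters and drop the non-ASCII combining marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove URLs, then split hyphenated words into their parts.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = text.replace("-", " ")
    # Keep only runs of letters; punctuation, symbols, numbers, and separators are dropped.
    return re.findall(r"[a-z]+", text)


def document_feature_matrix(token_lists):
    """Return (features, rows), where rows[d][f] counts feature f in document d."""
    features = sorted({tok for tokens in token_lists for tok in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[f] for f in features])
    return features, rows
```

For instance, `preprocess("WPS café-based apps")` yields `["word", "problem", "solving", "cafe", "based", "apps"]`. A dense list-of-lists matrix is used here only for readability; a sparse representation would be the more realistic choice for thousands of abstracts.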

Customized Dictionary Objects. Researchers developed two dictionary objects that can be passed to the quanteda R package (Benoit et al., 2018). To avoid matching the same words twice, these dictionary objects were applied sequentially. Before constructing the dictionary word lists, researchers identified frequently used multi-word expressions, including compound words and synonyms, that depend on word order within the dataset. To develop the dictionary lists, frequently co-occurring multi-word expressions were detected with the kwic() function. The first dictionary list included 224 words and the second included 166 words. Following the guidelines by Benoit et al. (2018), wildcard expressions were implemented. The two dictionary lists are available in the textminingR package.
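
To illustrate the idea of keyword-in-context inspection followed by wildcard dictionary matching, here is a minimal Python sketch. It is not the quanteda implementation: `kwic` below is a simplified stand-in for the R function of the same name, `apply_dictionary` is a hypothetical helper, and the two dictionary entries and their glob patterns are illustrative assumptions rather than entries copied from the published lists.

```python
from fnmatch import fnmatchcase


def kwic(tokens, keyword, window=2):
    """Return (left context, match, right context) for each token matching keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if fnmatchcase(tok, keyword):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    return hits


# Hypothetical dictionary: compound label -> one glob pattern per token position.
DICTIONARY = {
    "learning_disability": ["learning", "disabilit*"],
    "word_problem_solving": ["word", "problem*", "solving"],
}


def apply_dictionary(tokens, dictionary=DICTIONARY):
    """Replace each matching multi-word sequence with its compound label."""
    out, i = [], 0
    while i < len(tokens):
        for label, pattern in dictionary.items():
            window = tokens[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                fnmatchcase(tok, pat) for tok, pat in zip(window, pattern)
            ):
                out.append(label)
                i += len(pattern)  # skip the tokens consumed by this match
                break
        else:
            # No dictionary entry starts at this position; keep the token as is.
            out.append(tokens[i])
            i += 1
    return out
```

Applying two dictionaries sequentially, as described above, would amount to calling `apply_dictionary` twice with different dictionaries, so that tokens already compounded by the first pass cannot be matched again by the second.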

Customized Stop Word List. Researchers removed commonly observed units (tokens) of words or patterns, known as stop words, that are not distinctive across documents. To identify them, researchers calculated the inverse document frequency (idf) of each word. The idf of a term is a metric that shows how distinctive a word is across documents, commonly defined by the following formula:

$$\mathrm{idf}_i = \log\left(\frac{N}{df_i}\right)$$

Here, df_i is the number of documents in the corpus containing word i, and N is the total number of documents in the corpus (Hvitfeldt & Silge, 2021). When the idf was close to zero, a term was considered to appear in almost all publications. Researchers manually examined the words ranked in the bottom 10 percent by idf (i.e., 457 of the 4,643 words) and worked down the list until distinctive words were identified. In this process, 192 of the 457 words were retained for analysis, leaving the remaining 265 as the stop word list. For example, “learning_disability” ranked 19th from the bottom among the 457 words; however, researchers decided to retain it, considering the importance of this word in the current special education technology context. The researcher-developed stop word list is available at textminingR::stopwords_list.
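
The idf screening described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the natural logarithm is assumed (the text does not specify the base), and the function names `inverse_document_frequency` and `lowest_idf_terms` are hypothetical. The sketch only produces the candidate list; deciding which candidates to keep remained a manual judgment in the study.

```python
import math


def inverse_document_frequency(docs):
    """docs is a list of token lists; returns {term: log(N / df)} with natural log."""
    n_docs = len(docs)
    df = {}
    for tokens in docs:
        # Count each term once per document, regardless of repeats.
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


def lowest_idf_terms(idf, fraction=0.10):
    """Return the given fraction of terms with the lowest idf as stop word candidates."""
    ranked = sorted(idf, key=lambda term: (idf[term], term))  # ties broken alphabetically
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]
```

A term that appears in every document gets an idf of exactly zero, which matches the interpretation above that near-zero values mark words common to almost all publications.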

Application

Word Network Analysis. Researchers identified words that frequently co-occurred (at least five times) within publications concerning the use of technology in mathematics instruction for students with disabilities over the last 42 years (1980 to 2021). Applying the pairwise_count() function in the widyr R package (Robinson, 2021), researchers counted the number of times each pair of words appeared together within a publication. If two words (nodes) co-occur in a publication, the nodes are connected by a link (edge).

  • Examined the importance of each individual word through a measure of degree centrality, assuming that influential and important nodes have more neighbors (a higher degree) than other nodes (Newman, 2018).
  • Normalized the values of degree centrality (C) to be between zero and one, with one indicating a central node that is connected to all other nodes.
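
The co-occurrence counting and normalized degree centrality described above can be sketched in Python as follows. This is an illustration, not the widyr-based R code used in the project. It assumes that a pair is counted once per publication in which both words appear, that an edge is kept when that count reaches the threshold of five, and that centrality is normalized as degree divided by (n - 1), where n is the number of nodes in the retained network. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_word_pairs(docs):
    """Count, for each unordered word pair, the documents in which both words appear."""
    counts = Counter()
    for tokens in docs:
        # sorted(set(...)) gives each pair one canonical order and one count per document.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


def degree_centrality(pair_counts, min_count=5):
    """Normalized degree centrality for the network of pairs seen at least min_count times."""
    neighbors = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    n_nodes = len(neighbors)
    if n_nodes < 2:
        return {node: 0.0 for node in neighbors}
    # A node linked to every other node in the network scores 1.0.
    return {node: len(adj) / (n_nodes - 1) for node, adj in neighbors.items()}
```

Words that never reach the co-occurrence threshold with any other word do not appear in the resulting network, so they receive no centrality score in this sketch.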

Conclusion

Understanding domain-specific dictionary objects and stop word lists is essential in text mining. With them, practitioners and researchers can specify target research interests and contextualize the text mining process in the field. Because the lists reflect frequently used words related to technology in mathematics instruction for students with disabilities, practitioners can detect frequently used instructional words and word patterns. Future research can focus on generalizing these dictionary objects and stop words by validating their contents through external reviews. The open science practices suggested here can improve research transparency in special education and across interdisciplinary fields.

Contact.

Collaborators. Min Wook Ok, Sam Choo, Gahangir Hossain, and Diane P. Bryant

Reference. Shin, M. (2022). textminingR: Text Mining Workflow Tools. R package version 0.0.0.9000. https://github.com/mshin77/textminingR

Poster Online Repository: https://github.com/mshin77/2022_WTAMU_Poster
Automated Process: https://mkshin.shinyapps.io/textminingR
