Quick reference guide organized by workflow stage.
Quick Start Examples
Complete Workflow (5 steps)
library(TextAnalysisR)
# 1. Load data
data(SpecialEduTech)
texts <- SpecialEduTech$abstract
# 2. Preprocess
tokens <- prep_texts(texts, remove_punct = TRUE, remove_numbers = TRUE)
dfm <- quanteda::dfm(tokens)
# 3. Analyze keywords
keywords <- extract_keywords_tfidf(dfm, top_n = 20)
plot_tfidf_keywords(keywords)
# 4. Topic modeling
model <- fit_embedding_model(texts, n_topics = 5)
get_topic_terms(model, n_terms = 10)
# 5. Sentiment analysis
sentiment <- sentiment_lexicon_analysis(texts, lexicon = "bing")
plot_sentiment_distribution(sentiment)

Generate Embeddings
# Auto-detect best available provider
embeddings <- get_best_embeddings(texts)
# Reduce dimensions for visualization
reduced <- reduce_dimensions(embeddings, method = "umap", n_components = 2)
plot_semantic_viz(reduced)

Network Analysis
# Co-occurrence network
word_co_occurrence_network(dfm, top_node_n = 30, co_occur_n = 5)
# Correlation network
word_correlation_network(dfm, top_node_n = 30, corr_n = 0.3)

1. Data Import & Preprocessing
| Function | Purpose |
|---|---|
| `import_files()` | Import CSV, XLSX, PDF, DOCX, TXT files |
| `unite_cols()` | Combine multiple text columns into one |
| `prep_texts()` | Tokenize with full preprocessing options |
| `detect_multi_words()` | Find collocations (n-grams) |
| `get_available_dfm()` | Get best available DFM with fallback |
2. Lexical Analysis
| Function | Purpose |
|---|---|
| `calculate_word_frequency()` | Count word frequencies |
| `extract_keywords_tfidf()` | TF-IDF keyword extraction |
| `extract_keywords_keyness()` | Keyness-based keywords |
| `lexical_diversity_analysis()` | TTR, MATTR, MTLD metrics |
| `calculate_text_readability()` | Flesch, SMOG, ARI scores |
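To make the TF-IDF idea concrete, here is a minimal base R sketch of one common formulation (term frequency normalized by document length, natural-log inverse document frequency). The toy corpus is hypothetical, and the weighting scheme is an assumption for illustration; `extract_keywords_tfidf()` may use a different variant.

```r
# Toy corpus (hypothetical data, for illustration only)
docs <- c(
  d1 = "assistive technology supports students",
  d2 = "students use technology in class",
  d3 = "teachers design class activities"
)

# Tokenize on whitespace and build a document-by-term count matrix
tokens <- strsplit(tolower(docs), "\\s+")
vocab  <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))
rownames(counts) <- names(docs)

# Term frequency: counts divided by the number of tokens in each document
tf <- counts / rowSums(counts)

# Inverse document frequency: log(N / number of documents containing the term)
idf <- log(nrow(counts) / colSums(counts > 0))

# TF-IDF: scale each term column by its IDF weight
tfidf <- sweep(tf, 2, idf, `*`)

# Three highest-weighted terms for the first document
head(sort(tfidf["d1", ], decreasing = TRUE), 3)
```

Terms that appear in fewer documents receive a larger IDF weight, so they rank above terms shared across the corpus.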
Visualization Functions
| Function | Purpose |
|---|---|
| `plot_word_frequency()` | Bar chart of word frequencies |
| `plot_tfidf_keywords()` | TF-IDF keyword visualization |
| `plot_keyness_keywords()` | Keyness comparison plot |
| `plot_ngram_frequency()` | N-gram frequency plot |
| `plot_readability_distribution()` | Readability score distribution |
| `plot_lexical_diversity_distribution()` | Diversity metrics plot |
3. Sentiment Analysis
| Function | Purpose |
|---|---|
| `analyze_sentiment()` | Quick sentiment scoring |
| `sentiment_lexicon_analysis()` | Dictionary-based (no Python) |
| `sentiment_embedding_analysis()` | Neural sentiment (Python) |
| `analyze_sentiment_llm()` | LLM-based with explanations (Ollama/OpenAI/Gemini) |
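As a rough illustration of what dictionary-based scoring does, the base R sketch below sums word polarities from a tiny lexicon. The lexicon, the example texts, and the `score_text()` helper are all hypothetical; this is not how `sentiment_lexicon_analysis()` is implemented, only the general idea behind lexicon methods.

```r
# Hypothetical mini-lexicon; real lexicons such as Bing contain thousands of entries
lexicon <- c(good = 1, helpful = 1, effective = 1, poor = -1, difficult = -1)

texts <- c(
  "The tool was helpful and effective",
  "Setup was difficult and the documentation was poor",
  "Students used the tool in class"
)

# Hypothetical helper: score one text against the lexicon
score_text <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Sum the polarity of every word found in the lexicon; unmatched words are ignored
  sum(lexicon[words], na.rm = TRUE)
}

scores <- vapply(texts, score_text, numeric(1), lexicon = lexicon, USE.NAMES = FALSE)
labels <- ifelse(scores > 0, "positive", ifelse(scores < 0, "negative", "neutral"))

data.frame(text = texts, score = scores, label = labels)
```

Lexicon methods like this ignore negation and context, which is one reason to compare them against the neural or LLM-based options in the table.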
Visualization Functions
| Function | Purpose |
|---|---|
| `plot_sentiment_distribution()` | Sentiment score histogram |
| `plot_sentiment_by_category()` | Sentiment by group |
| `plot_sentiment_boxplot()` | Box plot comparison |
| `plot_emotion_radar()` | Emotion radar chart |
4. Semantic Analysis
| Function | Purpose |
|---|---|
| `get_best_embeddings()` | Auto-detect and use best embedding provider |
| `generate_embeddings()` | Create document embeddings (local) |
| `reduce_dimensions()` | PCA, t-SNE, UMAP reduction |
| `calculate_document_similarity()` | Compute similarity matrix |
| `semantic_similarity_analysis()` | Full similarity workflow |
| `semantic_document_clustering()` | Cluster similar documents |
| `generate_cluster_labels()` | AI-generated cluster names |
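A similarity matrix over document embeddings is typically built from cosine similarity. The base R sketch below shows that computation on a hypothetical four-document, three-dimensional embedding matrix; whether `calculate_document_similarity()` uses cosine similarity by default is an assumption here.

```r
# Hypothetical embeddings: one row per document (real embeddings have hundreds of dimensions)
embeddings <- rbind(
  doc1 = c(1, 0, 0),
  doc2 = c(2, 0, 0),
  doc3 = c(0, 1, 0),
  doc4 = c(1, 1, 0)
)

# Normalize each row to unit length, then take all pairwise dot products
unit <- embeddings / sqrt(rowSums(embeddings^2))
similarity <- tcrossprod(unit)

round(similarity, 3)
```

Because rows are normalized first, `doc1` and `doc2` are treated as identical in direction even though their magnitudes differ.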
Visualization Functions
| Function | Purpose |
|---|---|
| `plot_semantic_viz()` | 2D/3D semantic visualization |
| `plot_similarity_heatmap()` | Similarity matrix heatmap |
| `plot_cross_category_heatmap()` | Cross-category similarity comparison |
| `plot_cluster_terms()` | Cluster term visualization |
5. Network Analysis
| Function | Purpose |
|---|---|
| `word_co_occurrence_network()` | Word co-occurrence graph |
| `word_correlation_network()` | Word correlation graph |
Network Parameters
| Parameter | Default | Description |
|---|---|---|
| `node_label_size` | 22 | Font size for node labels (12-40) |
| `community_method` | `"leiden"` | Algorithm: `"leiden"`, `"louvain"` |
| `top_node_n` | 30 | Number of top nodes to display |
| `co_occur_n` | 10 | Minimum co-occurrence count (co-occurrence only) |
| `corr_n` | 0.4 | Minimum correlation threshold (correlation only) |
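The effect of a minimum co-occurrence count can be sketched in base R as follows. The binary document-term matrix and the `min_count` variable are hypothetical, and the sketch assumes `co_occur_n` acts as a simple lower bound on document-level co-occurrence, as the table describes; the package's actual counting rules may differ.

```r
# Hypothetical binary document-term matrix: 1 if the word appears in the document
dtm <- rbind(
  d1 = c(student = 1, technology = 1, class = 0, teacher = 0),
  d2 = c(student = 1, technology = 1, class = 1, teacher = 0),
  d3 = c(student = 1, technology = 0, class = 1, teacher = 1),
  d4 = c(student = 0, technology = 0, class = 1, teacher = 1)
)

# Word-by-word co-occurrence: number of documents containing both words
co_occurrence <- crossprod(dtm)
diag(co_occurrence) <- 0

# Keep only word pairs that co-occur in at least `min_count` documents
min_count <- 2
keep <- which(co_occurrence >= min_count & upper.tri(co_occurrence), arr.ind = TRUE)
edges <- data.frame(
  from  = rownames(co_occurrence)[keep[, "row"]],
  to    = colnames(co_occurrence)[keep[, "col"]],
  count = co_occurrence[keep]
)

edges
```

Raising the threshold prunes weak edges, which gives a sparser and more readable graph at the cost of dropping rarer associations.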
Network Statistics (9 Metrics)
| Metric | Description |
|---|---|
| Nodes | Total unique words |
| Edges | Total connections |
| Density | Edge density (0-1) |
| Diameter | Longest shortest path |
| Global Clustering | Network clustering tendency |
| Avg Local Clustering | Average local clustering |
| Modularity | Community structure quality |
| Assortativity | Similar node connection tendency |
| Avg Path Length | Average node distance |
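Several of these metrics follow directly from the adjacency matrix. The base R sketch below computes nodes, edges, density, and global clustering for a small hypothetical undirected graph (a triangle with one extra node attached). It is an illustration of the standard definitions, not the package's own code, which may rely on a graph library instead.

```r
# Hypothetical undirected graph: triangle a-b-c plus node d attached to c
nodes <- c("a", "b", "c", "d")
adj <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
edge_list <- rbind(c("a", "b"), c("b", "c"), c("a", "c"), c("c", "d"))
adj[edge_list] <- 1
adj[edge_list[, 2:1]] <- 1  # mirror each edge so the matrix is symmetric

n_nodes <- nrow(adj)
n_edges <- sum(adj) / 2

# Density: observed edges divided by the number of possible node pairs
density <- n_edges / (n_nodes * (n_nodes - 1) / 2)

# Global clustering (transitivity): 3 * triangles / connected triples
degree    <- rowSums(adj)
triangles <- sum(diag(adj %*% adj %*% adj)) / 6
triples   <- sum(degree * (degree - 1) / 2)
global_clustering <- 3 * triangles / triples

c(nodes = n_nodes, edges = n_edges, density = density,
  global_clustering = global_clustering)
```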
6. Topic Modeling
| Function | Purpose |
|---|---|
| `find_optimal_k()` | Search for optimal topic count |
| `fit_semantic_model()` | STM (Structural Topic Model) |
| `fit_embedding_model()` | Embedding-based topics (BERTopic) |
| `fit_hybrid_model()` | STM + embeddings hybrid |
| `get_topic_terms()` | Extract top words per topic |
| `get_topic_prevalence()` | Calculate topic prevalence |
| `generate_topic_labels()` | AI-generated topic names |
Visualization Functions
| Function | Purpose |
|---|---|
| `plot_topic_probability()` | Topic probability distribution |
| `plot_topic_effects_categorical()` | Topic effects by category |
| `plot_topic_effects_continuous()` | Topic effects over a continuous variable |
| `plot_word_probability()` | Word probability per topic |
| `plot_quality_metrics()` | Model quality metrics |
7. PDF Processing
| Function | Purpose |
|---|---|
| `process_pdf_unified()` | Auto-fallback PDF extraction |
| `extract_text_from_pdf()` | Extract text (R) |
| `extract_pdf_multimodal()` | Vision AI for images in PDFs |
| `detect_pdf_content_type()` | Detect PDF content type |
8. AI Integration
TextAnalysisR uses a human-in-the-loop approach where AI provides suggestions that you review, edit, and approve before use. Content generation is topic-grounded: drafts are based on validated topic terms and beta scores, not parametric AI knowledge.
Supports local (Ollama) and web-based (OpenAI, Gemini) providers.
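The topic-grounded, human-in-the-loop idea can be sketched without calling any provider. In the base R example below, the topic terms, beta values, the `build_label_prompt()` helper, and the draft record structure are all hypothetical; the package's own prompts and review flow may look quite different.

```r
# Hypothetical top terms and beta scores for one topic
topic_terms <- data.frame(
  term = c("technology", "assistive", "students"),
  beta = c(0.06, 0.08, 0.05)
)

# Hypothetical helper: build a prompt that exposes only the validated topic terms
build_label_prompt <- function(topic_terms, n_terms = 3) {
  top <- head(topic_terms[order(-topic_terms$beta), ], n_terms)
  term_list <- paste(sprintf("%s (%.3f)", top$term, top$beta), collapse = ", ")
  paste0(
    "Suggest a short label for a topic with these top terms and beta scores: ",
    term_list,
    ". Use only the listed terms as evidence."
  )
}

prompt <- build_label_prompt(topic_terms)

# The model's suggestion would be stored as a draft until a person reviews it
draft <- list(prompt = prompt, suggestion = NA_character_, approved = FALSE)
```

Keeping `approved = FALSE` until a reviewer edits or accepts the suggestion mirrors the review-before-use workflow described above.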
| Function | Purpose |
|---|---|
| `call_llm_api()` | Unified LLM API (all providers) |
| `call_ollama()` | Local Ollama API |
| `call_gemini_chat()` | Gemini API |
| `generate_topic_labels()` | AI-suggested topic labels |
| `generate_topic_content()` | Topic-grounded content drafts |
| `generate_cluster_labels()` | AI-suggested cluster names |
| `analyze_sentiment_llm()` | LLM-based sentiment analysis |
| `run_rag_search()` | RAG search over documents |
| `get_api_embeddings()` | Web-based embeddings (OpenAI, Gemini) |
| `get_spacy_embeddings()` | Local spaCy word embeddings |
Ollama Utilities
| Function | Purpose |
|---|---|
| `check_ollama()` | Verify Ollama availability |
| `list_ollama_models()` | List installed models |
| `get_recommended_ollama_model()` | Auto-select best model |
9. Linguistic Analysis
| Function | Purpose |
|---|---|
| `extract_pos_tags()` | Identify word types (nouns, verbs, adjectives) |
| `extract_named_entities()` | Find people, places, organizations in text |
| `extract_morphology()` | Analyze verb tenses, plural forms |
Requires Python. Run setup_python_env() first.
10. Python Environment
| Function | Purpose |
|---|---|
| `setup_python_env()` | Set up Python environment |
| `check_python_env()` | Check Python configuration |
11. Validation & Quality
Validation Functions
| Function | Purpose |
|---|---|
| `cross_analysis_validation()` | Cross-validate analysis |
| `validate_semantic_coherence()` | Check semantic coherence |
| `calculate_clustering_metrics()` | Clustering quality metrics |
