Quick reference guide organized by workflow stage.
1. Data Import & Preprocessing
| Function | Purpose |
|---|---|
| import_files() | Import CSV, XLSX, PDF, DOCX, TXT files |
| unite_cols() | Combine multiple text columns into one |
| prep_texts() | Tokenize with full preprocessing options |
| detect_multi_words() | Find collocations (n-grams) |
| get_available_dfm() | Get best available DFM with fallback |
2. Lexical Analysis
| Function | Purpose |
|---|---|
| calculate_word_frequency() | Count word frequencies |
| extract_keywords_tfidf() | TF-IDF keyword extraction |
| extract_keywords_keyness() | Keyness-based keywords |
| lexical_diversity_analysis() | TTR, MATTR, MTLD metrics |
| calculate_text_readability() | Flesch, SMOG, ARI scores |
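
As an illustration of two of the diversity metrics named above, the following is a minimal Python sketch of TTR (type-token ratio) and MATTR (moving-average TTR). It is independent of the package: the function names `type_token_ratio` and `moving_average_ttr` and the `window` parameter are assumptions made for this example, not the signature of lexical_diversity_analysis().

```python
def type_token_ratio(tokens):
    """TTR: number of unique tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def moving_average_ttr(tokens, window=5):
    """MATTR: mean TTR over every sliding window of `window` tokens.

    Falls back to plain TTR when the text is no longer than one window.
    """
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical toy input, tokenized on whitespace only.
tokens = "the cat sat on the mat and the dog sat too".split()
ttr = type_token_ratio(tokens)                # 8 unique types over 11 tokens
mattr = moving_average_ttr(tokens, window=5)  # averaged over 7 windows
```

TTR falls as texts get longer because common words repeat, which is why a windowed measure such as MATTR is usually preferred when comparing documents of different lengths.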
Visualization Functions
| Function | Purpose |
|---|---|
| plot_word_frequency() | Bar chart of word frequencies |
| plot_tfidf_keywords() | TF-IDF keyword visualization |
| plot_keyness_keywords() | Keyness comparison plot |
| plot_ngram_frequency() | N-gram frequency plot |
| plot_readability_distribution() | Readability score distribution |
| plot_lexical_diversity_distribution() | Diversity metrics plot |
3. Sentiment Analysis
| Function | Purpose |
|---|---|
| analyze_sentiment() | Quick sentiment scoring |
| sentiment_lexicon_analysis() | Dictionary-based (no Python) |
| sentiment_embedding_analysis() | Neural sentiment (Python) |
Visualization Functions
| Function | Purpose |
|---|---|
| plot_sentiment_distribution() | Sentiment score histogram |
| plot_sentiment_by_category() | Sentiment by group |
| plot_sentiment_boxplot() | Box plot comparison |
| plot_emotion_radar() | Emotion radar chart |
4. Semantic Analysis
| Function | Purpose |
|---|---|
| generate_embeddings() | Create document embeddings |
| reduce_dimensions() | PCA, t-SNE, UMAP reduction |
| calculate_document_similarity() | Compute similarity matrix |
| semantic_similarity_analysis() | Full similarity workflow |
| semantic_document_clustering() | Cluster similar documents |
| generate_cluster_labels() | AI-generated cluster names |
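
To make "similarity matrix" concrete, here is a small Python sketch that builds a pairwise matrix from document vectors. It assumes cosine similarity, which this reference does not specify, and the helper names `cosine_similarity` and `similarity_matrix` are hypothetical; the package's calculate_document_similarity() may work differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Symmetric matrix of pairwise cosine similarities."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical 2-dimensional "embeddings" for three documents.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sim = similarity_matrix(vectors)
```

Each row and column corresponds to one document, so the diagonal holds every document's similarity to itself.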
Visualization Functions
| Function | Purpose |
|---|---|
| plot_semantic_viz() | 2D/3D semantic visualization |
| plot_similarity_heatmap() | Similarity matrix heatmap |
| plot_cross_category_heatmap() | Cross-category similarity comparison |
| plot_cluster_terms() | Cluster term visualization |
5. Network Analysis
| Function | Purpose |
|---|---|
| semantic_cooccurrence_network() | Word/document co-occurrence graph |
| semantic_correlation_network() | Word/document correlation graph |
Network Parameters
| Parameter | Default | Description |
|---|---|---|
| feature_type | "words" | Feature space: "words", "ngrams", "embeddings" |
| embedding_sim_threshold | 0.5 | Similarity threshold for embedding networks (0.3-0.9) |
| node_label_size | 22 | Font size for node labels (12-40) |
| community_method | "leiden" | Algorithm: "leiden", "louvain", "label_prop", "fast_greedy" |
| top_node_n | 30 | Number of top nodes to display |
| co_occur_n | 10 | Minimum co-occurrence count (co-occurrence only) |
| corr_n | 0.4 | Minimum correlation threshold (correlation only) |
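
The effect of a minimum co-occurrence count such as co_occur_n can be sketched as follows. This is a conceptual Python illustration, not the package's implementation: it assumes co-occurrence is counted once per document for each unordered pair of distinct terms, and `cooccurrence_edges` and `min_count` are names invented for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(docs, min_count=2):
    """Count term pairs that share a document; keep pairs seen >= min_count times."""
    counts = Counter()
    for tokens in docs:
        # Each unordered pair of distinct terms counts once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical tokenized documents.
docs = [
    ["topic", "model", "text"],
    ["topic", "model"],
    ["text", "network"],
]
edges = cooccurrence_edges(docs, min_count=2)
```

Raising the threshold removes weak edges and yields a sparser, more readable graph; lowering it keeps rare pairings at the cost of visual clutter.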
Network Statistics (9 Metrics)
| Metric | Description |
|---|---|
| Nodes | Total unique terms/documents |
| Edges | Total connections |
| Density | Edge density (0-1) |
| Diameter | Longest shortest path |
| Global Clustering | Network clustering tendency |
| Avg Local Clustering | Average local clustering |
| Modularity | Community structure quality |
| Assortativity | Similar node connection tendency |
| Avg Path Length | Average node distance |
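
The first three metrics follow directly from the edge list. The Python sketch below assumes an undirected graph with no self-loops or duplicate edges, where density is the number of edges divided by the number of possible node pairs; `network_summary` is a hypothetical helper, not a package function.

```python
def network_summary(edges):
    """Return node count, edge count, and density for an undirected edge list."""
    # Treat (a, b) and (b, a) as the same edge and ignore self-loops.
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    nodes = {node for edge in unique_edges for node in edge}
    n, m = len(nodes), len(unique_edges)
    # Density = edges present / edges possible = 2m / (n * (n - 1)).
    density = 0.0 if n < 2 else 2 * m / (n * (n - 1))
    return {"nodes": n, "edges": m, "density": density}


# Hypothetical path graph a - b - c - d.
summary = network_summary([("a", "b"), ("b", "c"), ("c", "d")])
```

For this four-node path, 3 of the 6 possible edges are present, so the density is 0.5.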
6. Topic Modeling
| Function | Purpose |
|---|---|
| find_optimal_k() | Search for optimal topic count |
| fit_semantic_model() | STM (Structural Topic Model) |
| fit_embedding_topics() | Embedding-based topics (BERTopic) |
| fit_hybrid_model() | STM + embeddings hybrid |
| get_topic_terms() | Extract top words per topic |
| get_topic_prevalence() | Calculate topic prevalence |
| generate_topic_labels() | AI-generated topic names |
Visualization Functions
| Function | Purpose |
|---|---|
| plot_topic_probability() | Topic probability distribution |
| plot_topic_effects_categorical() | Topic effects by category |
| plot_topic_effects_continuous() | Topic effects over continuous var |
| plot_word_probability() | Word probability per topic |
| plot_quality_metrics() | Model quality metrics |
7. PDF Processing
| Function | Purpose |
|---|---|
| process_pdf_unified() | Auto-fallback PDF extraction |
| extract_text_from_pdf() | Extract text (R) |
| extract_pdf_multimodal() | Vision AI for images in PDFs |
| detect_pdf_content_type() | Detect PDF content type |
8. AI Integration
| Function | Purpose |
|---|---|
| check_ollama() | Verify Ollama availability |
| call_ollama() | Direct Ollama API call |
| call_openai_chat() | OpenAI API call |
| generate_topic_labels_langgraph() | Multi-agent topic labeling |
| generate_survey_items() | Generate survey items |
9. NLP with spaCy
| Function | Purpose |
|---|---|
| extract_pos_tags() | Extract POS tags using spacyr |
| extract_named_entities() | Extract named entities using spacyr |
Note: These functions use the spacyr R package for spaCy integration.
10. Python Environment
| Function | Purpose |
|---|---|
| setup_python_env() | Set up Python environment |
| check_python_env() | Check Python configuration |
11. Validation & Quality
Validation Functions
| Function | Purpose |
|---|---|
| cross_analysis_validation() | Cross-validate analysis |
| validate_semantic_coherence() | Check semantic coherence |
| calculate_clustering_metrics() | Clustering quality metrics |
