Semantic analysis finds patterns of meaning using embeddings and neural networks.
Setup
library(TextAnalysisR)
mydata <- SpecialEduTech
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts")
dfm_object <- quanteda::dfm(tokens)Document Similarity
similarity <- semantic_similarity_analysis(
texts = united_tbl$united_texts,
method = "cosine"
)Similarity Methods
Semantic analysis measures document similarity using different approaches to capture meaning, from simple vocabulary matching to deep neural representations.
Methods:
| Method | Description | Best For |
|---|---|---|
| Words | Lexical analysis using word frequency vectors (bag-of-words) | Finding documents with shared terminology |
| N-grams | Phrase-based analysis capturing word sequences | Detecting similar phraseology |
| Embeddings | Deep semantic analysis using transformer models | Conceptual similarity, handles synonyms |
Usage: Choose method based on your analysis goals. Words and n-grams are faster and interpretable. Embeddings capture deeper meaning but require more computation. All methods use cosine similarity for comparison.
Learn More: Sentence Transformers Documentation
Sentiment Analysis
Lexicon-based (no Python)
sentiment <- sentiment_lexicon_analysis(dfm_object, lexicon = "afinn")
plot_sentiment_distribution(sentiment$document_sentiment)Document Clustering
Shiny App: Document clustering is now in Topic Modeling → Embedding-based Topics, which combines clustering with automatic keyword extraction.
results <- fit_embedding_model(
texts = united_tbl$united_texts,
method = "umap_dbscan",
backend = "r",
n_topics = 5
)
results$topic_assignments
results$topic_keywordsFor standalone clustering without keywords, use
cluster_embeddings(). See Topic Modeling for details.
AI Cluster Labels
labels <- generate_cluster_labels(
results$topic_keywords,
provider = "ollama"
)Algorithms Reference
Clustering: K-means (spherical), Hierarchical (nested), DBSCAN (density-based), HDBSCAN (auto-detect K)
Dimensionality Reduction: PCA (fast, linear), t-SNE (local structure), UMAP (balanced)
Network Analysis
Visualize word relationships as interactive networks with community detection.
Word Co-occurrence Network
network <- word_co_occurrence_network(
dfm_object,
co_occur_n = 10, # Minimum co-occurrence count
top_node_n = 30, # Top nodes to display
node_label_size = 22, # Font size (12-40)
community_method = "leiden" # Community detection algorithm
)
network$plot # Interactive visNetwork plot
network$table # Node metrics (degree, eigenvector, community)
network$stats # 9 network statisticsWord Correlation Network
corr_network <- word_correlation_network(
dfm_object,
common_term_n = 20, # Minimum term frequency
corr_n = 0.4, # Minimum correlation threshold
community_method = "leiden"
)Category-Specific Analysis
Enable per-category networks in the Shiny app to generate separate networks for each category, displayed in a tabbed interface.
Network Statistics (9 Metrics)
Each network returns comprehensive statistics:
| Metric | Description |
|---|---|
| Nodes | Total unique words in network |
| Edges | Total connections between nodes |
| Density | Proportion of possible edges present (0-1) |
| Diameter | Longest shortest path in network |
| Global Clustering | Overall network clustering tendency |
| Avg Local Clustering | Average of local clustering coefficients |
| Modularity | Quality of community structure (higher = better separation) |
| Assortativity | Tendency of similar nodes to connect |
| Avg Path Length | Average distance between nodes |
Community Detection Methods
Community detection identifies clusters of semantically related nodes.
| Method | Description | Best For |
|---|---|---|
leiden |
Modern algorithm, guarantees well-connected communities | Default, best quality |
louvain |
Fast modularity optimization | Large networks |
label_prop |
Propagates labels through network | Very large networks |
fast_greedy |
Hierarchical agglomerative | Quick exploration |
Learn More: igraph Community Detection
Temporal Analysis
Track themes over time:
temporal <- temporal_semantic_analysis(
texts = united_tbl$united_texts,
timestamps = united_tbl$year
)Embedding Models
| Model | Speed | Quality | Use Case |
|---|---|---|---|
| all-MiniLM-L6-v2 | Fast | Good | General purpose |
| all-mpnet-base-v2 | Slow | Best | Highest quality |
| paraphrase-multilingual | Medium | Good | Multiple languages |
Learn More: Sentence Transformers Models
