Semantic analysis finds patterns of meaning using embeddings and neural networks.
Setup
library(TextAnalysisR)
mydata <- SpecialEduTech
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts")
dfm_object <- quanteda::dfm(tokens)
Document Similarity
similarity <- semantic_similarity_analysis(
  texts = united_tbl$united_texts,
  method = "cosine"
)
Similarity Methods
Semantic analysis measures document similarity using different approaches to capture meaning, from simple vocabulary matching to deep neural representations.
Methods:
| Method | Description | Best For |
|---|---|---|
| Words | Lexical analysis using word frequency vectors (bag-of-words) | Finding documents with shared terminology |
| N-grams | Phrase-based analysis capturing word sequences | Detecting similar phraseology |
| Embeddings | Deep semantic analysis using transformer models | Conceptual similarity, handles synonyms |
Usage: Choose a method based on your analysis goals. Words and n-grams are faster and easier to interpret. Embeddings capture deeper meaning but require more computation. All methods use cosine similarity for comparison.
Learn More: Sentence Transformers Documentation
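To make the "Words" row of the table concrete, here is a minimal base R sketch of cosine similarity over bag-of-words count vectors. It is an illustration only, not the implementation behind `semantic_similarity_analysis()`; the toy corpus and the names `docs`, `toks`, `dtm`, and `cos_sim` are assumptions made for this example.

```r
# Toy corpus (hypothetical; not taken from SpecialEduTech)
docs <- c(
  d1 = "students use math software",
  d2 = "math software helps students",
  d3 = "teachers grade essays"
)

# Tokenize on whitespace and build a document-term count matrix
toks <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(toks)))
dtm <- t(sapply(toks, function(tok) table(factor(tok, levels = vocab))))

# Cosine similarity: dot products divided by the product of the vector norms
norms <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)

round(cos_sim, 2)
```

Here `d1` and `d2` share three of their four words and score 0.75, while `d3` shares no vocabulary with either and scores 0. A purely lexical method cannot see that two documents use different words for the same idea, which is the gap embeddings are meant to close.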
Sentiment Analysis
Lexicon-based (no Python)
sentiment <- sentiment_lexicon_analysis(dfm_object, lexicon = "afinn")
plot_sentiment_distribution(sentiment$document_sentiment)
Document Clustering
clusters <- semantic_document_clustering(
  texts = united_tbl$united_texts,
  n_clusters = 5
)
AI-Suggested Cluster Labels
The AI generates suggested names for each cluster. You maintain full control:
- Review generated labels before applying
- Edit labels directly in the interface
- Regenerate with different parameters if needed
# AI suggests, human reviews and decides
labels <- generate_cluster_labels(
  clusters$cluster_keywords,
  provider = "ollama"  # or "openai"
)
# Review and edit before final use
Clustering Algorithms
Clustering groups documents with similar semantic content into categories. Documents within a cluster are more similar to each other than to documents in other clusters.
Algorithms:
| Algorithm | Description | Use Case |
|---|---|---|
| K-means | Creates K spherical clusters | Fast, simple, requires specifying K |
| Hierarchical | Builds tree of clusters | Exploring nested structures |
| DBSCAN | Density-based, finds outliers | Arbitrarily shaped clusters |
| HDBSCAN | Hierarchical density-based | Auto-determines cluster count |
Usage: Choose a discovery mode (Automatic, Manual, or Advanced), then select a semantic feature space and an algorithm. Automatic mode estimates the cluster count for you. Use the visualizations and quality metrics to evaluate the results.
Learn More: scikit-learn Clustering Guide
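The difference between K-means and hierarchical clustering in the table above can be sketched with base R alone. This is not how `semantic_document_clustering()` works internally; the simulated two-dimensional "document vectors" and the names `x`, `km`, `hc`, and `hc_labels` are assumptions made for illustration.

```r
set.seed(42)  # k-means starts from random centers

# Hypothetical 2-D document vectors: two well-separated groups of 10
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# K-means: the number of clusters K must be chosen up front
km <- kmeans(x, centers = 2, nstart = 10)

# Hierarchical: build the full tree first, then cut it into k groups
hc <- hclust(dist(x), method = "ward.D2")
hc_labels <- cutree(hc, k = 2)

# Cross-tabulate the two labelings to check whether they agree
table(kmeans = km$cluster, hierarchical = hc_labels)
```

On data this cleanly separated both methods recover the same two groups. With the hierarchical tree you can re-cut at a different `k` without refitting, which is what makes it useful for exploring nested structure.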
Dimensionality Reduction
Dimensionality reduction transforms high-dimensional data into 2D or 3D visualizations while preserving the structure and relationships between documents.
Algorithms:
| Algorithm | Description | Trade-offs |
|---|---|---|
| PCA | Principal Component Analysis, finds linear patterns | Fast, interpretable |
| t-SNE | Preserves local structure, reveals clusters | Slow, good for visualization |
| UMAP | Balances local and global structure | Faster than t-SNE, retains more global structure |
Usage: Select a semantic feature space (words, n-grams, or embeddings), then choose a reduction method. Adjust parameters (perplexity, neighbors, dimensions) based on your data size and structure. Use for visual exploration before clustering.
# `embeddings` is a document-embedding matrix computed beforehand
reduced <- reduce_dimensions(embeddings, method = "umap", n_components = 2)
plot_semantic_viz(reduced, plot_type = "dimensionality_reduction")
Learn More: scikit-learn Manifold Learning
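As a rough illustration of the PCA row in the table, the sketch below projects a matrix onto its first two principal components with base R's `prcomp()`. It does not reproduce `reduce_dimensions()`; the random matrix `emb` is a hypothetical stand-in for real embeddings, and `coords` and `var_explained` are names chosen for this example.

```r
set.seed(1)

# Hypothetical stand-in for an embedding matrix: 30 documents x 50 dimensions
emb <- matrix(rnorm(30 * 50), nrow = 30, ncol = 50)

# PCA via base R; keep the first two principal components for plotting
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# Share of total variance captured by the 2-D projection
var_explained <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)

dim(coords)
var_explained
```

Passing `coords` to `plot()` gives a quick scatter plot. Checking `var_explained` first is worthwhile: when two components capture only a small share of the variance, apparent clusters in the plot should be read with caution.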
Network Analysis
Visualize word relationships as interactive networks with community detection.
Word Co-occurrence Network
network <- word_co_occurrence_network(
  dfm_object,
  co_occur_n = 10,             # Minimum co-occurrence count
  top_node_n = 30,             # Top nodes to display
  node_label_size = 22,        # Font size (12-40)
  community_method = "leiden"  # Community detection algorithm
)
network$plot   # Interactive visNetwork plot
network$table  # Node metrics (degree, eigenvector, community)
network$stats  # 9 network statistics
Word Correlation Network
corr_network <- word_correlation_network(
  dfm_object,
  common_term_n = 20,  # Minimum term frequency
  corr_n = 0.4,        # Minimum correlation threshold
  community_method = "leiden"
)
Category-Specific Analysis
Enable per-category networks in the Shiny app to generate separate networks for each category, displayed in a tabbed interface.
Network Statistics (9 Metrics)
Each network returns comprehensive statistics:
| Metric | Description |
|---|---|
| Nodes | Total unique words in network |
| Edges | Total connections between nodes |
| Density | Proportion of possible edges present (0-1) |
| Diameter | Longest shortest path in network |
| Global Clustering | Overall network clustering tendency |
| Avg Local Clustering | Average of local clustering coefficients |
| Modularity | Quality of community structure (higher = better separation) |
| Assortativity | Tendency of similar nodes to connect |
| Avg Path Length | Average distance between nodes |
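A few of the metrics above (nodes, edges, density) can be computed by hand from a small co-occurrence matrix. The following base R sketch is only meant to show what those numbers mean; it is not the code behind `network$stats`, and the toy matrix `dtm` with its terms is hypothetical.

```r
# Hypothetical binary document-term matrix: 4 documents x 4 terms
dtm <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 1, 1, 0,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("math", "software", "student", "essay"))
)

# Term-by-term co-occurrence: number of documents containing both terms
cooc <- crossprod(dtm)
diag(cooc) <- 0

# Undirected adjacency matrix: keep pairs that co-occur at least once
adj <- (cooc >= 1) * 1

n_nodes <- nrow(adj)
n_edges <- sum(adj) / 2                             # each edge is counted twice
density <- n_edges / (n_nodes * (n_nodes - 1) / 2)  # share of possible edges
degree  <- rowSums(adj)
```

In this toy network three of the six possible edges exist, so the density is 0.5, and "essay" has degree 0 because it never co-occurs with another term. Raising the co-occurrence threshold (the role `co_occur_n` plays above) removes weak edges and lowers density.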
Community Detection Methods
Community detection identifies clusters of semantically related nodes.
| Method | Description | Best For |
|---|---|---|
| `leiden` | Modern algorithm, guarantees well-connected communities | Default, best quality |
| `louvain` | Fast modularity optimization | Large networks |
| `label_prop` | Propagates labels through network | Very large networks |
| `fast_greedy` | Hierarchical agglomerative | Quick exploration |
Learn More: igraph Community Detection
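Modularity, the score used to judge community structure, can be worked through on a tiny graph. The base R sketch below applies the standard Newman modularity formula to two triangles joined by one bridge edge. The graph and the helper `modularity_q()` are hypothetical and written for this example; they are not part of TextAnalysisR or igraph.

```r
# Two triangles joined by a single bridge edge (hypothetical toy graph)
edges <- rbind(c(1, 2), c(1, 3), c(2, 3),
               c(4, 5), c(4, 6), c(5, 6),
               c(3, 4))
n <- 6
adj <- matrix(0, n, n)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1  # mirror the entries to make the graph undirected

# Newman modularity:
# Q = (1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * [c_i == c_j]
modularity_q <- function(adj, membership) {
  k <- rowSums(adj)        # node degrees
  two_m <- sum(adj)        # twice the number of edges
  same <- outer(membership, membership, "==")
  sum((adj - outer(k, k) / two_m) * same) / two_m
}

q_good <- modularity_q(adj, c(1, 1, 1, 2, 2, 2))  # one community per triangle
q_bad  <- modularity_q(adj, c(1, 2, 1, 2, 1, 2))  # arbitrary split
```

Grouping each triangle into its own community gives Q = 5/14 (about 0.36), while the arbitrary split gives a negative value. Algorithms such as Louvain and Leiden search for the partition that makes this score as high as possible.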
Temporal Analysis
Track themes over time:
temporal <- temporal_semantic_analysis(
  texts = united_tbl$united_texts,
  timestamps = united_tbl$year
)
Embedding Models
| Model | Speed | Quality | Use Case |
|---|---|---|---|
| all-MiniLM-L6-v2 | Fast | Good | General purpose |
| all-mpnet-base-v2 | Slow | Best | Highest quality |
| paraphrase-multilingual | Medium | Good | Multiple languages |
Learn More: Sentence Transformers Models
