
Semantic analysis identifies patterns of meaning in text using embeddings and neural networks.

Setup

library(TextAnalysisR)

mydata <- SpecialEduTech
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts")
dfm_object <- quanteda::dfm(tokens)

Document Similarity

similarity <- semantic_similarity_analysis(
  texts = united_tbl$united_texts,
  method = "cosine"
)

Similarity Methods

Semantic analysis measures document similarity using different approaches to capture meaning, from simple vocabulary matching to deep neural representations.

Methods:

| Method | Description | Best For |
|--------|-------------|----------|
| Words | Lexical analysis using word frequency vectors (bag-of-words) | Finding documents with shared terminology |
| N-grams | Phrase-based analysis capturing word sequences | Detecting similar phraseology |
| Embeddings | Deep semantic analysis using transformer models | Conceptual similarity; handles synonyms |

Usage: Choose a method based on your analysis goals. Words and n-grams are faster and more interpretable; embeddings capture deeper meaning but require more computation. All methods use cosine similarity for comparison.
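
For reference, the bag-of-words variant of this comparison can be reproduced directly with quanteda.textstats (a sketch, assuming that package is installed; textstat_simil() is its documented similarity interface):

# Cosine similarity over word-frequency vectors (bag-of-words)
library(quanteda.textstats)
sim <- textstat_simil(dfm_object, method = "cosine", margin = "documents")
as.matrix(sim)[1:5, 1:5]  # similarities among the first five documents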

Learn More: Sentence Transformers Documentation


Sentiment Analysis

Lexicon-based (no Python)

sentiment <- sentiment_lexicon_analysis(dfm_object, lexicon = "afinn")
plot_sentiment_distribution(sentiment$document_sentiment)

Neural (requires Python)

sentiment <- sentiment_embedding_analysis(united_tbl$united_texts)

Document Clustering

clusters <- semantic_document_clustering(
  texts = united_tbl$united_texts,
  n_clusters = 5
)

AI-Suggested Cluster Labels

The AI generates suggested names for each cluster. You maintain full control:

  • Review generated labels before applying
  • Edit labels directly in the interface
  • Regenerate with different parameters if needed

# AI suggests, human reviews and decides
labels <- generate_cluster_labels(
  clusters$cluster_keywords,
  provider = "ollama"  # or "openai"
)
# Review and edit before final use

Clustering Algorithms

Clustering groups documents with similar semantic content into categories. Documents within a cluster are more similar to each other than to documents in other clusters.

Algorithms:

| Algorithm | Description | Use Case |
|-----------|-------------|----------|
| K-means | Creates K spherical clusters | Fast and simple; requires specifying K |
| Hierarchical | Builds a tree of clusters | Exploring nested structures |
| DBSCAN | Density-based; finds outliers | Arbitrarily shaped clusters |
| HDBSCAN | Hierarchical density-based | Auto-determines cluster count |

Usage: Choose a discovery mode (Automatic, Manual, or Advanced), then select a semantic feature space and algorithm. Automatic mode finds the optimal cluster count. Use the visualizations and quality metrics to evaluate results.
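
For intuition, the K-means case can be sketched in base R on a TF-IDF-weighted document-feature matrix, a simplified stand-in for the package's semantic feature spaces (not the package's own implementation):

# Illustrative K-means clustering on TF-IDF vectors
tfidf <- quanteda::dfm_tfidf(dfm_object)   # weight terms by TF-IDF
set.seed(42)                               # K-means is sensitive to initialization
km <- stats::kmeans(as.matrix(tfidf), centers = 5, nstart = 10)
table(km$cluster)                          # documents per cluster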

Learn More: scikit-learn Clustering Guide


Dimensionality Reduction

Dimensionality reduction transforms high-dimensional data into 2D or 3D visualizations while preserving the structure and relationships between documents.

Algorithms:

| Algorithm | Description | Trade-offs |
|-----------|-------------|------------|
| PCA | Principal Component Analysis; finds linear patterns | Fast, interpretable |
| t-SNE | Preserves local structure; reveals clusters | Slow; good for visualization |
| UMAP | Balances local and global structure | Faster than t-SNE; better topology preservation |

Usage: Select a semantic feature space (words, n-grams, or embeddings), then choose a reduction method. Adjust parameters (perplexity, neighbors, dimensions) based on your data size and structure. Use for visual exploration before clustering.

reduced <- reduce_dimensions(embeddings, method = "umap", n_components = 2)
plot_semantic_viz(reduced, plot_type = "dimensionality_reduction")
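
To tune UMAP directly, the same reduction can be sketched with the uwot package (an assumption here: uwot is installed separately; it is not part of TextAnalysisR):

# UMAP via uwot: n_neighbors trades off local vs. global structure
library(uwot)
coords <- umap(embeddings, n_neighbors = 15, n_components = 2, min_dist = 0.1)
plot(coords, pch = 19, xlab = "UMAP 1", ylab = "UMAP 2")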

Learn More: scikit-learn Manifold Learning


Network Analysis

Visualize word relationships as interactive networks with community detection.

Word Co-occurrence Network

network <- word_co_occurrence_network(
  dfm_object,
  co_occur_n = 10,                    # Minimum co-occurrence count
  top_node_n = 30,                    # Top nodes to display
  node_label_size = 22,               # Font size (12-40)
  community_method = "leiden"         # Community detection algorithm
)

network$plot   # Interactive visNetwork plot
network$table  # Node metrics (degree, eigenvector, community)
network$stats  # 9 network statistics

Word Correlation Network

corr_network <- word_correlation_network(
  dfm_object,
  common_term_n = 20,                 # Minimum term frequency
  corr_n = 0.4,                       # Minimum correlation threshold
  community_method = "leiden"
)

Category-Specific Analysis

Enable per-category networks in the Shiny app to generate separate networks for each category, displayed in a tabbed interface.


Network Statistics (9 Metrics)

Each network returns comprehensive statistics:

| Metric | Description |
|--------|-------------|
| Nodes | Total unique words in the network |
| Edges | Total connections between nodes |
| Density | Proportion of possible edges present (0-1) |
| Diameter | Longest shortest path in the network |
| Global Clustering | Overall network clustering tendency |
| Avg Local Clustering | Average of local clustering coefficients |
| Modularity | Quality of community structure (higher = better separation) |
| Assortativity | Tendency of similar nodes to connect |
| Avg Path Length | Average distance between node pairs |
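
These metrics correspond to standard igraph functions; a minimal sketch on a toy random graph (assuming igraph is installed):

library(igraph)
g <- sample_gnp(30, 0.15)                 # toy graph standing in for a word network
vcount(g); ecount(g)                      # Nodes, Edges
edge_density(g)                           # Density
diameter(g)                               # Diameter
transitivity(g, type = "global")          # Global clustering
transitivity(g, type = "localaverage")    # Avg local clustering
mean_distance(g)                          # Avg path length
assortativity_degree(g)                   # Assortativity
comm <- cluster_leiden(g, objective_function = "modularity")
modularity(g, membership(comm))           # Modularity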

Community Detection Methods

Community detection identifies clusters of semantically related nodes.

| Method | Description | Best For |
|--------|-------------|----------|
| leiden | Modern algorithm; guarantees well-connected communities | Default; best quality |
| louvain | Fast modularity optimization | Large networks |
| label_prop | Propagates labels through the network | Very large networks |
| fast_greedy | Hierarchical agglomerative clustering | Quick exploration |
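
In igraph terms, the four methods map onto the following constructors (a sketch, assuming an undirected igraph object g such as the one above):

# Each returns a communities object; membership() extracts node assignments
cluster_leiden(g, objective_function = "modularity")   # leiden
cluster_louvain(g)                                     # louvain
cluster_label_prop(g)                                  # label_prop
cluster_fast_greedy(g)                                 # fast_greedy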

Learn More: igraph Community Detection


Temporal Analysis

Track themes over time:

temporal <- temporal_semantic_analysis(
  texts = united_tbl$united_texts,
  timestamps = united_tbl$year
)

Embedding Models

| Model | Speed | Quality | Use Case |
|-------|-------|---------|----------|
| all-MiniLM-L6-v2 | Fast | Good | General purpose |
| all-mpnet-base-v2 | Slow | Best | Highest quality |
| paraphrase-multilingual | Medium | Good | Multiple languages |
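
To load one of these models outside the package's wrappers, sentence-transformers can be called through reticulate (a sketch, assuming a Python environment with the sentence-transformers package installed):

# Embed documents with a named model via reticulate
st <- reticulate::import("sentence_transformers")
model <- st$SentenceTransformer("all-MiniLM-L6-v2")    # fast, general-purpose model
emb <- model$encode(united_tbl$united_texts)           # one row per document
dim(emb)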

Learn More: Sentence Transformers Models


Next Steps