Semantic Analysis • TextAnalysisR

Semantic analysis finds patterns of meaning using embeddings and neural networks.

Setup

library(TextAnalysisR)

mydata <- SpecialEduTech
united_tbl <- unite_cols(mydata, listed_vars = c("title", "keyword", "abstract"))
tokens <- prep_texts(united_tbl, text_field = "united_texts")
dfm_object <- quanteda::dfm(tokens)

Document Similarity

similarity <- semantic_similarity_analysis(
  texts = united_tbl$united_texts,
  method = "cosine"
)

Similarity Methods

Semantic analysis measures document similarity using different approaches to capture meaning, from simple vocabulary matching to deep neural representations.

Methods:

Method	Description	Best For
Words	Lexical analysis using word frequency vectors (bag-of-words)	Finding documents with shared terminology
N-grams	Phrase-based analysis capturing word sequences	Detecting similar phraseology
Embeddings	Deep semantic analysis using transformer models	Conceptual similarity, handles synonyms

Usage: Choose a method based on your analysis goals. Words and n-grams are faster and easier to interpret. Embeddings capture deeper meaning but require more computation. All three methods compare documents with cosine similarity.
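To make the "Words" method concrete, here is a minimal base R sketch of cosine similarity on bag-of-words vectors. It is an illustration on a made-up toy corpus, not the internals of semantic_similarity_analysis(); the names toy_docs, toy_tokens, and cos_sim are hypothetical.

```r
# Toy corpus (illustrative; not taken from SpecialEduTech)
toy_docs <- c(
  d1 = "assistive technology supports students with disabilities",
  d2 = "technology supports students in math instruction",
  d3 = "teachers use video modeling for social skills"
)

# Tokenize on whitespace and build a document-term count matrix
toy_tokens <- strsplit(tolower(toy_docs), "\\s+")
vocab <- sort(unique(unlist(toy_tokens)))
dtm <- t(sapply(toy_tokens, function(tok) table(factor(tok, levels = vocab))))

# Cosine similarity: dot product divided by the product of vector norms
norms <- sqrt(rowSums(dtm^2))
cos_sim <- (dtm %*% t(dtm)) / (norms %o% norms)

round(cos_sim, 2)
```

Documents d1 and d2 share three of their six words, so they score higher than either does against d3, which shares no vocabulary with them. An embedding-based method would replace the count matrix with dense vectors but could keep the same cosine step.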

Learn More: Sentence Transformers Documentation

Sentiment Analysis

Lexicon-based (no Python)

sentiment <- sentiment_lexicon_analysis(dfm_object, lexicon = "afinn")
plot_sentiment_distribution(sentiment$document_sentiment)
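As a rough picture of what lexicon-based scoring does, the base R sketch below sums per-word valence scores for each document. AFINN assigns integer scores to words; the mini-lexicon here is invented for illustration, and toy_lexicon, toy_texts, and score_doc are hypothetical names, not part of TextAnalysisR.

```r
# Hypothetical mini-lexicon in the AFINN style (integer valence per word);
# these values are made up for illustration, not copied from AFINN
toy_lexicon <- c(effective = 2, helpful = 2, improve = 2,
                 difficult = -1, fail = -2, poor = -2)

toy_texts <- c(
  d1 = "the tool was helpful and effective",
  d2 = "students fail when instruction is poor",
  d3 = "the study describes a reading program"
)

# Sum the lexicon scores of the words in one document
score_doc <- function(text, lexicon) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  matched <- lexicon[words]          # NA for words not in the lexicon
  sum(matched, na.rm = TRUE)         # unmatched words contribute 0
}

scores <- vapply(toy_texts, score_doc, numeric(1), lexicon = toy_lexicon)
data.frame(document = names(scores), sentiment_score = unname(scores))
```

A document with no lexicon words scores 0, which is one reason lexicon methods can miss sentiment expressed in vocabulary the lexicon does not cover.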

Neural (requires Python)

sentiment <- sentiment_embedding_analysis(united_tbl$united_texts)

Document Clustering

Shiny App: Document clustering is now in Topic Modeling → Embedding-based Topics, which combines clustering with automatic keyword extraction.

results <- fit_embedding_model(
  texts = united_tbl$united_texts,
  method = "umap_dbscan",
  backend = "r",
  n_topics = 5
)

results$topic_assignments
results$topic_keywords

For standalone clustering without keywords, use cluster_embeddings(). See Topic Modeling for details.

AI Cluster Labels

labels <- generate_cluster_labels(
  results$topic_keywords,
  provider = "ollama"
)

Algorithms Reference

Clustering: K-means (spherical), Hierarchical (nested), DBSCAN (density-based), HDBSCAN (auto-detect K)

Dimensionality Reduction: PCA (fast, linear), t-SNE (local structure), UMAP (balanced)
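The two stages listed above (reduce, then cluster) can be sketched with base R alone. This example uses PCA and standard k-means from the stats package on simulated vectors; it is not the spherical k-means variant and not the package's own pipeline, and the data are random numbers standing in for real embeddings.

```r
set.seed(42)

# Simulated "embeddings": two groups of 20 documents in 50 dimensions
# (random numbers standing in for real embedding vectors)
group_a <- matrix(rnorm(20 * 50, mean = 0), nrow = 20)
group_b <- matrix(rnorm(20 * 50, mean = 3), nrow = 20)
emb <- rbind(group_a, group_b)

# Dimensionality reduction: keep the first two principal components
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
reduced <- pca$x[, 1:2]

# Clustering: k-means with k = 2 on the reduced coordinates
km <- kmeans(reduced, centers = 2, nstart = 10)

# Cross-tabulate assigned clusters against the simulated groups
table(cluster = km$cluster, true_group = rep(c("a", "b"), each = 20))
```

K-means needs the number of clusters up front, whereas density-based methods such as DBSCAN and HDBSCAN infer it from the data, which is the trade-off the reference list hints at.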

Network Analysis

Visualize word relationships as interactive networks with community detection.

Word Co-occurrence Network

network <- word_co_occurrence_network(
  dfm_object,
  co_occur_n = 10,                    # Minimum co-occurrence count
  top_node_n = 30,                    # Top nodes to display
  node_label_size = 22,               # Font size (12-40)
  community_method = "leiden"         # Community detection algorithm
)

network$plot   # Interactive visNetwork plot
network$table  # Node metrics (degree, eigenvector, community)
network$stats  # 9 network statistics
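For intuition about the co_occur_n threshold, the base R sketch below counts how many documents contain each pair of terms and keeps the pairs that meet a minimum count. It assumes co-occurrence is counted at the document level, which may differ from how word_co_occurrence_network() counts it; toy_dtm, min_count, and edges are hypothetical names.

```r
# Toy document-term matrix (rows = documents, columns = terms)
toy_dtm <- matrix(
  c(1, 1, 0, 1,
    1, 1, 1, 0,
    0, 1, 1, 0,
    1, 1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("student", "technology", "math", "reading"))
)

# Count, for each term pair, the number of documents containing both terms
present <- (toy_dtm > 0) * 1
co_occ <- crossprod(present)
diag(co_occ) <- 0                      # ignore self-pairs

# Keep pairs that co-occur in at least `min_count` documents
min_count <- 2
idx <- which(co_occ >= min_count & upper.tri(co_occ), arr.ind = TRUE)
edges <- data.frame(
  term1 = rownames(co_occ)[idx[, "row"]],
  term2 = colnames(co_occ)[idx[, "col"]],
  count = co_occ[idx]
)
edges
```

Raising min_count prunes weak edges and yields a sparser, more readable network; lowering it keeps more of the vocabulary connected.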

Word Correlation Network

corr_network <- word_correlation_network(
  dfm_object,
  common_term_n = 20,                 # Minimum term frequency
  corr_n = 0.4,                       # Minimum correlation threshold
  community_method = "leiden"
)
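The corr_n threshold can be pictured the same way. The base R sketch below computes Pearson correlations between term-count columns and keeps pairs at or above a cutoff. Whether word_correlation_network() uses Pearson correlation on counts or a different coefficient is not stated here, so treat this as an assumption; toy_counts, min_cor, and cor_edges are hypothetical names.

```r
# Toy term counts (rows = documents, columns = terms)
toy_counts <- matrix(
  c(2, 2, 0,
    1, 1, 1,
    0, 0, 2,
    3, 2, 0,
    0, 1, 3),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("autism", "video", "math"))
)

# Pairwise Pearson correlation between term columns
term_cor <- cor(toy_counts)

# Keep each pair once (upper triangle) if it meets the threshold
min_cor <- 0.4
idx <- which(term_cor >= min_cor & upper.tri(term_cor), arr.ind = TRUE)
cor_edges <- data.frame(
  term1 = rownames(term_cor)[idx[, "row"]],
  term2 = colnames(term_cor)[idx[, "col"]],
  correlation = term_cor[idx]
)
cor_edges
```

Unlike raw co-occurrence counts, correlation adjusts for how frequent each term is overall, so very common words do not dominate the network simply by appearing everywhere.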

Category-Specific Analysis

Enable per-category networks in the Shiny app to generate separate networks for each category, displayed in a tabbed interface.

Network Statistics (9 Metrics)

Each network object reports the following nine statistics in its stats element:

Metric	Description
Nodes	Total unique words in network
Edges	Total connections between nodes
Density	Proportion of possible edges present (0-1)
Diameter	Longest shortest path in network
Global Clustering	Overall network clustering tendency
Avg Local Clustering	Average of local clustering coefficients
Modularity	Quality of community structure (higher = better separation)
Assortativity	Tendency of similar nodes to connect
Avg Path Length	Average distance between nodes
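The metrics in the table correspond to standard graph measures. The sketch below computes each one with igraph on a small hand-built graph. It assumes the igraph package is installed, and it is only an approximation of what network$stats contains: the package's exact options (for example, which community algorithm feeds the modularity score) may differ.

```r
library(igraph)

# Small undirected toy graph: a triangle (a, b, c) with a tail c - d - e
toy_edges <- data.frame(
  from = c("a", "a", "b", "c", "d"),
  to   = c("b", "c", "c", "d", "e")
)
g <- graph_from_data_frame(toy_edges, directed = FALSE)

# Community structure used for the modularity score (Louvain as an example)
comm <- cluster_louvain(g)

toy_stats <- list(
  nodes                = vcount(g),
  edges                = ecount(g),
  density              = edge_density(g),
  diameter             = diameter(g),
  global_clustering    = transitivity(g, type = "global"),
  avg_local_clustering = mean(transitivity(g, type = "local"), na.rm = TRUE),
  modularity           = modularity(comm),
  assortativity        = assortativity_degree(g, directed = FALSE),
  avg_path_length      = mean_distance(g)
)
str(toy_stats)
```

Local clustering is undefined for nodes with fewer than two neighbors (node e here), so the average drops those values with na.rm = TRUE.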

Community Detection Methods

Community detection identifies clusters of semantically related nodes.

Method	Description	Best For
`leiden`	Modern algorithm, guarantees well-connected communities	Default, best quality
`louvain`	Fast modularity optimization	Large networks
`label_prop`	Propagates labels through network	Very large networks
`fast_greedy`	Hierarchical agglomerative	Quick exploration

Learn More: igraph Community Detection

Temporal Analysis

Track themes over time:

temporal <- temporal_semantic_analysis(
  texts = united_tbl$united_texts,
  timestamps = united_tbl$year
)

Embedding Providers

Document-to-document similarity is computed from vector embeddings. Multiple embedding providers are available:

Provider	Model	Notes
Ollama	nomic-embed-text (default), mxbai-embed-large, all-minilm	Free, local, private
OpenAI	text-embedding-3-small, text-embedding-3-large	API key required
Gemini	gemini-embedding-001	API key required
Sentence Transformers	all-MiniLM-L6-v2 (fast), all-mpnet-base-v2 (highest quality)	Requires Python

Embeddings are cached and shared across Document Similarity, Topic Modeling, and Semantic Search when using the same provider and model.
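The idea of sharing cached embeddings by provider and model can be illustrated with a small in-memory cache. Everything in this sketch is hypothetical (embedding_cache, fake_embed, get_embeddings); it does not show how TextAnalysisR implements its cache, and for brevity the key ignores the input texts, which a real cache would need to account for.

```r
# Hypothetical in-memory cache keyed by provider and model
embedding_cache <- new.env()
n_calls <- 0

# Stand-in embedder: returns one random 4-dimensional vector per text
# (a real provider call would go here)
fake_embed <- function(texts) {
  n_calls <<- n_calls + 1
  matrix(rnorm(length(texts) * 4), nrow = length(texts))
}

# Return cached embeddings for this provider/model pair, computing them once
get_embeddings <- function(texts, provider, model) {
  key <- paste(provider, model, sep = "::")
  if (!exists(key, envir = embedding_cache, inherits = FALSE)) {
    assign(key, fake_embed(texts), envir = embedding_cache)
  }
  get(key, envir = embedding_cache)
}

toy_texts <- c("first document", "second document")
e1 <- get_embeddings(toy_texts, "ollama", "nomic-embed-text")
e2 <- get_embeddings(toy_texts, "ollama", "nomic-embed-text")  # cache hit
e3 <- get_embeddings(toy_texts, "ollama", "all-minilm")        # new model, new call
```

The second request reuses the stored matrix, while switching the model triggers a fresh computation, which mirrors the "same provider and model" condition described above.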

Learn More: Sentence Transformers | Ollama Embedding Models

Next Steps

Topic Modeling