library(TextAnalysisR)
sample_text <- c(
  "Figure 1 shows the distribution of student outcomes.",
  "Table 2 reports the effect sizes for each intervention."
)
toks <- prep_texts(
  data.frame(united_texts = sample_text),
  text_field = "united_texts"
)
quanteda::ntoken(toks)
## text1 text2
##     7     8
Extract text from PDFs with charts, diagrams, and images using vision AI. R-native pipeline – no Python required.
How It Works
- Extracts text from each page using pdftools::pdf_text() (R-native)
- Renders each page as a PNG image via pdftools::pdf_render_page()
- Identifies sparse-text pages (< 500 characters) that likely contain figures
- Sends only those pages to a vision LLM for description
- Merges extracted text + image descriptions into a single text corpus
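The steps above can be sketched in plain R. This is a minimal illustration and not the package's internal code: `find_sparse_pages()`, `merge_page_content()`, and `extract_pdf_pages()` are hypothetical helpers. Trimming whitespace before counting characters and rendering only the sparse pages are assumptions made to keep the example short. The demo uses in-memory strings in place of real `pdftools::pdf_text()` output, so it runs without a PDF.

```r
# Hypothetical helper: flag pages whose extracted text falls below a
# character threshold (500 mirrors the cutoff described above).
find_sparse_pages <- function(page_texts, min_chars = 500) {
  which(nchar(trimws(page_texts)) < min_chars)
}

# Hypothetical helper: append image descriptions to the matching pages.
# `descriptions` is a named character vector; names are page numbers.
merge_page_content <- function(page_texts, descriptions) {
  merged <- page_texts
  for (p in names(descriptions)) {
    i <- as.integer(p)
    merged[i] <- paste(trimws(page_texts[i]), descriptions[[p]], sep = "\n")
  }
  merged
}

# Hypothetical wrapper around the two pdftools calls named above.
# Defined here for illustration; not called in the demo below.
extract_pdf_pages <- function(pdf_path, dpi = 150) {
  stopifnot(requireNamespace("pdftools", quietly = TRUE))
  page_texts <- pdftools::pdf_text(pdf_path)   # one string per page
  sparse <- find_sparse_pages(page_texts)
  # pdf_render_page() returns a bitmap array; it would still need to be
  # written out as a PNG before being sent to a vision model.
  images <- lapply(sparse, function(p) {
    pdftools::pdf_render_page(pdf_path, page = p, dpi = dpi)
  })
  list(text = page_texts, sparse = sparse, images = images)
}

# Demo: three stand-in pages, the second one nearly empty.
pages <- c(strrep("word ", 200), "Figure 1", strrep("text ", 150))
sparse <- find_sparse_pages(pages)

# Stand-in for a vision model's reply for page 2.
descriptions <- c("2" = "[Image: bar chart of outcomes]")
merged <- merge_page_content(pages, descriptions)
```

Only page 2 is flagged as sparse here, so only that page would be sent for description; the other pages keep their extracted text unchanged.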
Functions
process_pdf_unified() runs the full pipeline with
automatic fallback:
- Multimodal (pdftools + vision LLM) – extracts text and describes visual content
- Text-only (pdftools) – fallback when no vision provider is set
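The fallback rule can be illustrated with a small mode selector. This is a sketch under stated assumptions, not how process_pdf_unified() is implemented: `choose_pdf_mode()` is a hypothetical helper, and the environment variable name `OPENAI_API_KEY` is an assumed convention for where a key might be stored.

```r
# Hypothetical helper: pick the processing mode from whether a
# vision-provider key is available. The env var name is an assumption.
choose_pdf_mode <- function(api_key = Sys.getenv("OPENAI_API_KEY")) {
  if (nzchar(api_key)) "multimodal" else "text_only"
}

mode_without_key <- choose_pdf_mode("")            # no provider configured
mode_with_key    <- choose_pdf_mode("sk-example")  # placeholder key, not real
```

With an empty key the selector returns "text_only", matching the fallback described in the list above.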
describe_image() describes a single base64-encoded PNG.
Image description in either function requires a vision-provider API key
(OpenAI or Gemini) and network access; see their reference pages for usage.
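Producing the base64 string that an image-description call expects might look like the sketch below. `png_to_base64()` is a hypothetical helper, and using the jsonlite package for encoding is an assumption; the call to describe_image() itself is left out because its arguments are documented on its reference page, not here. The demo encodes the 8-byte PNG file signature as stand-in data, not a real image.

```r
# Hypothetical helper: read a PNG file and return it as a base64 string.
# Assumes the jsonlite package is installed.
png_to_base64 <- function(png_path) {
  stopifnot(file.exists(png_path))
  bytes <- readBin(png_path, what = "raw", n = file.size(png_path))
  jsonlite::base64_enc(bytes)
}

# Demo: write the PNG signature bytes to a temp file and round-trip them.
tmp <- tempfile(fileext = ".png")
png_signature <- as.raw(c(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A))
writeBin(png_signature, tmp)

encoded <- png_to_base64(tmp)
decoded <- jsonlite::base64_dec(encoded)  # back to raw bytes
round_trip_ok <- identical(decoded, png_signature)
```

Decoding the string and comparing it with the original bytes is a cheap check that nothing was altered before the image is sent to a provider.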
