Extract PDF with Multimodal Analysis

Extract both text and visual content from PDFs using R-native pdftools and vision LLM APIs. No Python required.

Usage

extract_pdf_multimodal(
  file_path,
  vision_provider = "ollama",
  vision_model = NULL,
  api_key = NULL,
  describe_images = TRUE,
  envname = "textanalysisr-env"
)

Arguments

file_path

Character: Path to the PDF file

vision_provider

Character: "ollama" (local, default), "openai", or "gemini"

vision_model

Character: Model name

For Ollama: "llava", "llava:13b", "bakllava"
For OpenAI: "gpt-4.1", "gpt-4.1-mini"
For Gemini: "gemini-2.5-flash", "gemini-2.5-pro"

api_key

Character: API key (required for openai/gemini providers)

describe_images

Logical: Convert page images to text descriptions (default: TRUE)

envname

Character: Retained for backward compatibility; ignored
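
Since api_key is required only for the openai and gemini providers, one way to supply it is from an environment variable rather than hard-coding it. The following is a minimal sketch, not part of the package: resolve_api_key() is a hypothetical helper, and the variable name OPENAI_API_KEY is an assumption (GEMINI_API_KEY is the name used in the Examples).

```r
# Hypothetical helper (not part of the package): look up the API key
# for a provider from an environment variable.
resolve_api_key <- function(vision_provider = c("ollama", "openai", "gemini")) {
  vision_provider <- match.arg(vision_provider)
  if (vision_provider == "ollama") {
    return(NULL)  # local provider; the argument docs require a key only for openai/gemini
  }
  # Variable names are assumptions; GEMINI_API_KEY matches the Examples section.
  var_name <- switch(vision_provider,
    openai = "OPENAI_API_KEY",
    gemini = "GEMINI_API_KEY"
  )
  key <- Sys.getenv(var_name, unset = "")
  if (!nzchar(key)) {
    stop("Set ", var_name, " before using the '", vision_provider,
         "' provider.", call. = FALSE)
  }
  key
}
```

The result could then be passed as api_key = resolve_api_key("gemini"), which fails early with a readable message when the variable is unset.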

Value

List with:

success: Logical
combined_text: Character string with all content for text analysis
text_content: List of text chunks
image_descriptions: List of image descriptions
num_images: Integer count of described pages
vision_provider: Character indicating provider used
message: Character status message
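
Because the return value carries a success flag and a status message, callers may want to check them before using combined_text. The sketch below assumes only the field names listed above; the mock result and the helper get_combined_text() are hypothetical and use invented values for illustration.

```r
# Mock result with the documented fields; all values are invented for illustration.
result <- list(
  success = TRUE,
  combined_text = "Introduction text.\n\n[image description] A line chart of accuracy.",
  text_content = list("Introduction text."),
  image_descriptions = list("A line chart of accuracy."),
  num_images = 1L,
  vision_provider = "ollama",
  message = "OK"
)

# Hypothetical helper: return the corpus text, or stop with the status message.
get_combined_text <- function(result) {
  if (!isTRUE(result$success)) {
    stop("PDF extraction failed: ", result$message, call. = FALSE)
  }
  result$combined_text
}

corpus_text <- get_combined_text(result)
```

Using isTRUE() treats a missing or NA success field as a failure rather than letting it pass silently.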

Details

Workflow:

Extracts text using pdftools (R-native)
Renders each page as an image
Sends sparse-text pages to the vision LLM for description
Merges text + descriptions into a single text corpus
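
The last two workflow steps can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: both helpers are hypothetical, the 100-character cutoff for "sparse" is an assumed threshold, and the page text and description are stand-ins for what pdftools and the vision LLM would return.

```r
# Hypothetical helper: flag pages whose extracted text is too short to be useful.
# The 100-character cutoff is an assumption, not the package's actual rule.
find_sparse_pages <- function(page_text, min_chars = 100) {
  which(nchar(trimws(page_text)) < min_chars)
}

# Hypothetical helper: append a description to each page that has one,
# then join all pages into a single string.
merge_page_content <- function(page_text, descriptions) {
  merged <- page_text
  for (page in names(descriptions)) {
    i <- as.integer(page)
    merged[i] <- paste0(
      trimws(merged[i]),
      "\n[Page ", i, " image description] ", descriptions[[page]]
    )
  }
  paste(merged, collapse = "\n\n")
}

# Stand-in for pdftools::pdf_text() output: one string per page.
page_text <- c(
  strrep("Body text of the paper. ", 10),
  "Figure 2",
  strrep("More body text follows. ", 10)
)

# Only page 2 falls below the assumed cutoff.
sparse <- find_sparse_pages(page_text)

# Stand-in for the vision LLM responses, keyed by page number.
descriptions <- list("2" = "Bar chart comparing accuracy across three models.")

combined_text <- merge_page_content(page_text, descriptions)
```

Keeping each description next to its page text preserves the reading order of the document in the merged corpus.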

Examples

if (FALSE) { # \dontrun{
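# Default: local Ollama vision provider; no API key needed
result <- extract_pdf_multimodal("research_paper.pdf")
text_for_analysis <- result$combined_text

# Cloud provider: read the Gemini API key from an environment variable
result <- extract_pdf_multimodal(
  "paper.pdf",
  vision_provider = "gemini",
  api_key = Sys.getenv("GEMINI_API_KEY")
)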
} # }
