Smart PDF Extraction with Auto-Detection — extract_pdf_smart • TextAnalysisR

Extracts text and visual content from PDFs using the R-native pdftools package and a vision LLM API. The function routes directly to multimodal extraction.

Usage

extract_pdf_smart(
  file_path,
  doc_type = "auto",
  vision_provider = "ollama",
  vision_model = NULL,
  api_key = NULL,
  envname = "textanalysisr-env"
)

Arguments

file_path: Character: Path to the PDF file
doc_type: Character: "auto" (default), "academic", or "general" (kept for compatibility)
vision_provider: Character: "ollama" (default), "openai", or "gemini"
vision_model: Character: Model name for vision analysis
api_key: Character: API key for the cloud providers ("openai" or "gemini")
envname: Character: Ignored; kept for backward compatibility
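
The provider arguments above can be chosen at run time. The following is a minimal sketch, not part of TextAnalysisR: the helper choose_vision_args() and the environment variable name OPENAI_API_KEY are assumptions made for illustration. It selects "openai" when a key is available and otherwise falls back to the documented default, "ollama". The call to extract_pdf_smart() is guarded so the snippet runs even when the package or the placeholder file is absent.

```r
# Hypothetical helper (not part of TextAnalysisR): pick provider arguments
# depending on whether a cloud API key is available.
# OPENAI_API_KEY is an assumed environment variable name.
choose_vision_args <- function(api_key = Sys.getenv("OPENAI_API_KEY")) {
  if (nzchar(api_key)) {
    list(vision_provider = "openai", api_key = api_key)
  } else {
    # Fall back to the documented default provider
    list(vision_provider = "ollama", api_key = NULL)
  }
}

pdf_path <- "document.pdf"  # placeholder path
vision_args <- choose_vision_args()

# Only call the package function when it is installed and the file exists
if (requireNamespace("TextAnalysisR", quietly = TRUE) && file.exists(pdf_path)) {
  result <- do.call(
    TextAnalysisR::extract_pdf_smart,
    c(list(file_path = pdf_path), vision_args)
  )
  str(result, max.level = 1)
}
```

Keeping the key in an environment variable avoids hard-coding it in scripts; vision_model is left at its default of NULL here.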

Value

List with the extracted content, ready for text analysis. The Examples section reads its combined_text element.
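
Because the full structure of the returned list is not spelled out here, it can help to check for the element you need before passing it on. The sketch below assumes only that the result carries a character element named combined_text, as used in the Examples. The helper get_combined_text() and the mock object are hypothetical and stand in for a real call.

```r
# Hypothetical helper (not part of TextAnalysisR): return the combined text
# from a result list, failing early if the expected element is missing.
get_combined_text <- function(result) {
  stopifnot(is.list(result))
  # [[ ]] matches the name exactly, unlike $ which allows partial matching
  text <- result[["combined_text"]]
  if (is.null(text) || !is.character(text)) {
    stop("result has no character element named 'combined_text'")
  }
  text
}

# Mock object standing in for a real extract_pdf_smart() result
mock_result <- list(combined_text = "Page 1 text. Figure 1 shows a bar chart.")
combined <- get_combined_text(mock_result)
```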

See also

Other pdf: check_vision_models(), detect_pdf_content_type(), detect_pdf_content_type_py(), extract_pdf_multimodal(), extract_tables_from_pdf_py(), extract_text_from_pdf(), extract_text_from_pdf_py(), process_pdf_file(), process_pdf_file_py()

Examples

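if (FALSE) { # \dontrun{
# Extract text and visual content with the default provider ("ollama")
result <- extract_pdf_smart("document.pdf")
# Pass the combined text on for preprocessing
corpus <- prep_texts(result$combined_text)
} # }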