
Extract text from PDF

Three extraction modes: direct (embedded text), Tesseract OCR, or Paddle OCR. Pick based on whether your PDF is born-digital or scanned.

The three modes

| Mode | When to use | Speed |
|---|---|---|
| Direct | PDF has embedded text (born-digital, exported from Word, etc.) | Instant |
| Tesseract OCR | PDF is a scan with no embedded text | Moderate |
| Paddle OCR | Scan with modern content, prioritise speed + table detection | Fast |

Workflow

  1. Drop PDFs into the file list.
  2. Pick the extraction mode.
  3. Click Extract. Per-page status updates live as pages process.
  4. Export as a single .txt or copy to clipboard.

How to tell if your PDF has embedded text

Open it in any PDF viewer and try to select text. If the cursor acts like a text cursor and you can highlight words, you have embedded text — use Direct mode. If selecting just draws rectangles on an image, it's scanned — use one of the OCR modes.
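
The same check can be approximated in a script once you have text for each page. The sketch below is a standalone illustration, not part of Kaizen OCR: it assumes you already hold one extracted string per page (for example from a Direct-mode export), and the names `has_embedded_text`, `suggest_modes`, and the 20-character threshold are hypothetical choices for this example.

```python
from __future__ import annotations


def has_embedded_text(page_text: str, min_chars: int = 20) -> bool:
    """Return True if a page's extracted text looks like real embedded text.

    ASSUMPTION: min_chars=20 is an arbitrary illustrative threshold.
    """
    # Count non-whitespace characters only, so a page of blank lines does not count.
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible >= min_chars


def suggest_modes(page_texts: list[str], min_chars: int = 20) -> list[str]:
    """Label each page 'direct' or 'ocr' based on how much text was extracted."""
    return [
        "direct" if has_embedded_text(text, min_chars) else "ocr"
        for text in page_texts
    ]


# Hypothetical per-page text: one born-digital page, two pages with nothing usable.
pages = ["Quarterly report\nRevenue grew in the third quarter.", "", "  \n \n"]
print(suggest_modes(pages))  # ['direct', 'ocr', 'ocr']
```

A character-count threshold is a blunt heuristic; a scanned page with a small embedded stamp or page number could still pass it, so treat the result as a hint rather than a verdict.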

Mixed PDFs

Some PDFs have embedded text on some pages and scans on others. In Direct mode, Kaizen OCR returns whatever embedded text is present. For the pages that come back without text, switch to an OCR mode and re-run. You can also combine the two results manually.
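
Combining results by hand can be sketched as a per-page merge: keep the Direct-mode text where a page has any, and fall back to the OCR text otherwise. This is a minimal illustration under the assumption that both runs have already been split into one string per page, in the same page order; `merge_page_texts` is a hypothetical helper, not a Kaizen OCR function.

```python
from __future__ import annotations


def merge_page_texts(direct_pages: list[str], ocr_pages: list[str]) -> list[str]:
    """Prefer Direct-mode text per page; use OCR text where Direct found nothing."""
    if len(direct_pages) != len(ocr_pages):
        raise ValueError("both runs must cover the same number of pages")
    merged = []
    for direct, ocr in zip(direct_pages, ocr_pages):
        # A page with only whitespace counts as "no embedded text".
        merged.append(direct if direct.strip() else ocr)
    return merged


# Hypothetical two-page document: page 1 is born-digital, page 2 is a scan.
direct_run = ["Invoice 2024\nTotal due: 120.00", ""]
ocr_run = ["lnvoice 2024\nTotal due: 120.00", "Signed by the customer"]
print(merge_page_texts(direct_run, ocr_run))
```

Preferring the Direct text where it exists avoids OCR misreads (such as the `lnvoice` in the sample data) on pages that never needed OCR.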

Output formatting

  • Pages are separated by a page-break marker in the output.
  • Line breaks follow the PDF's visual layout, so output from multi-column or heavily formatted sources may contain artifacts.
  • For cleaner output from complex layouts, use Searchable PDF with the Azure Layout model.
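
If you post-process the exported .txt, the page-break marker lets you recover per-page text. The sketch below assumes the marker is a form-feed character; this page does not say what the marker actually is, so `PAGE_BREAK` is a placeholder to replace after inspecting a real export. `split_pages` is a hypothetical helper.

```python
from __future__ import annotations

# ASSUMPTION: the real marker is not documented here; form feed is a placeholder.
PAGE_BREAK = "\f"


def split_pages(exported_text: str, marker: str = PAGE_BREAK) -> list[str]:
    """Split an exported text file into per-page strings at the page-break marker."""
    # Trim the newlines that surround each marker, but keep interior line breaks,
    # since those follow the PDF's visual layout.
    return [page.strip("\n") for page in exported_text.split(marker)]


sample = "Page one, line 1\nPage one, line 2\n\fPage two\n"
for number, page in enumerate(split_pages(sample), start=1):
    print(f"--- page {number} ---")
    print(page)
```

An empty string in the result means the page had no text, which is a useful signal for pages that still need an OCR pass.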