Extract text from PDF — Kaizen OCR & PDF Help

The three modes

Mode	When to use	Speed
Direct	PDF has embedded text (born-digital, exported from Word, etc.)	Instant
Tesseract OCR	PDF is a scan — no embedded text	Moderate
Paddle OCR	PDF is a scan of modern content, and speed and table detection are the priority	Fast

Workflow

1. Drop PDFs into the file list.
2. Pick the extraction mode.
3. Click Extract. Per-page status updates live as pages process.
4. Export as a single .txt or copy to clipboard.

How to tell if your PDF has embedded text

Open it in any PDF viewer and try to select text. If the pointer changes to a text cursor and you can highlight individual words, the PDF has embedded text, so use Direct mode. If dragging only draws a selection rectangle over the page image, the PDF is a scan, so use one of the OCR modes.
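The same check can be approximated in a script when you have many files to sort. This is a minimal sketch, not part of Kaizen OCR: it assumes the third-party pypdf package is installed, and the function names and the 20-character threshold are illustrative choices.

```python
from typing import List


def page_texts(pdf_path: str) -> List[str]:
    """Return the embedded text of each page ("" for image-only pages).

    Assumes the third-party pypdf package; it is imported lazily so the
    helper below can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def has_embedded_text(texts: List[str], min_chars: int = 20) -> bool:
    """True if any page carries at least min_chars non-whitespace characters.

    min_chars is an arbitrary threshold that filters out stray characters,
    such as a lone page number stamped on an otherwise scanned page.
    """
    return any(len("".join(t.split())) >= min_chars for t in texts)
```

Calling has_embedded_text(page_texts("report.pdf")) on a born-digital file should return True, which corresponds to choosing Direct mode; a False result suggests one of the OCR modes. The file name here is only a placeholder.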

Mixed PDFs

Some PDFs have embedded text on some pages and scans on others. Direct mode returns the embedded text that exists; for the pages that come back without text, switch to an OCR mode and re-run the extraction. You can also combine the two results manually.
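If you combine results outside the app, the merge can be sketched as follows. This is a hedged illustration that assumes you already have the Direct-mode text as one string per page and the OCR text keyed by 1-based page number; both function names and the 20-character threshold are hypothetical.

```python
from typing import Dict, List


def pages_needing_ocr(direct_texts: List[str], min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose Direct-mode text is empty or near-empty."""
    return [
        number
        for number, text in enumerate(direct_texts, start=1)
        if len("".join(text.split())) < min_chars
    ]


def merge_results(direct_texts: List[str], ocr_texts: Dict[int, str]) -> List[str]:
    """Keep Direct-mode text where it exists; use OCR text for the flagged pages.

    A flagged page with no OCR entry keeps its original (near-empty) text.
    """
    missing = set(pages_needing_ocr(direct_texts))
    return [
        ocr_texts.get(number, text) if number in missing else text
        for number, text in enumerate(direct_texts, start=1)
    ]
```

Preferring the embedded text wherever it exists avoids introducing recognition errors on pages that never needed OCR.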

Output formatting

Pages are separated by a page-break marker in the output.
Line breaks follow the PDF's visual layout, so output from multi-column or heavily formatted sources may contain artifacts.
For cleaner output from complex layouts, use Searchable PDF with the Azure Layout model.
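For light post-processing of an exported .txt file, the sketch below splits the text into pages and rejoins hard-wrapped lines. It is an illustration under stated assumptions: the page-break marker is assumed to be a form feed character, so substitute whatever marker actually appears in your export, and the function names are hypothetical. It does not repair multi-column reading order.

```python
import re
from typing import List

# Assumption: replace with the marker that appears in your exported file.
PAGE_BREAK = "\f"


def split_pages(text: str, marker: str = PAGE_BREAK) -> List[str]:
    """Split exported text into per-page strings on the page-break marker."""
    return [page.strip("\n") for page in text.split(marker)]


def unwrap_paragraphs(page: str) -> str:
    """Join hard-wrapped lines inside each paragraph.

    Blank lines are treated as paragraph boundaries and are preserved.
    """
    paragraphs = re.split(r"\n\s*\n", page)
    joined = [
        " ".join(line.strip() for line in paragraph.splitlines() if line.strip())
        for paragraph in paragraphs
    ]
    return "\n\n".join(joined)
```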
