The three modes
| Mode | When to use | Speed |
|---|---|---|
| Direct | PDF has embedded text (born-digital, exported from Word, etc.) | Instant |
| Tesseract OCR | PDF is a scan — no embedded text | Moderate |
| Paddle OCR | PDF is a scan with modern content; prioritises speed and table detection | Fast |
Workflow
- Drop PDFs into the file list.
- Pick the extraction mode.
- Click Extract. Per-page status updates live as pages process.
- Export as a single `.txt` or copy to clipboard.
How to tell if your PDF has embedded text
Open it in any PDF viewer and try to select text. If the pointer turns into a text cursor and you can highlight individual words, the PDF has embedded text: use Direct mode. If dragging only draws a rectangle over the page image, the PDF is scanned: use one of the OCR modes.
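The same check can be approximated in a script when there are many files to triage. The following is a minimal sketch, not part of Kaizen OCR: the helper names and the `min_chars` threshold are hypothetical, and `read_page_texts` assumes the third-party `pypdf` package is installed. A page counts as having embedded text if a plain text extraction yields at least a handful of non-whitespace characters.

```python
from typing import List


def classify_pages(page_texts: List[str], min_chars: int = 20) -> List[bool]:
    """Return True for each page whose extracted text looks like real embedded text.

    min_chars is an arbitrary (hypothetical) threshold that ignores stray
    characters such as a lone page number stamped onto a scan.
    """
    return [len("".join(text.split())) >= min_chars for text in page_texts]


def suggest_mode(page_texts: List[str], min_chars: int = 20) -> str:
    """Suggest 'direct', 'ocr', or 'mixed' from per-page extraction results."""
    flags = classify_pages(page_texts, min_chars)
    if flags and all(flags):
        return "direct"  # every page has embedded text
    if not any(flags):
        return "ocr"     # no page has usable text (or there are no pages)
    return "mixed"       # some pages need OCR


def read_page_texts(path: str) -> List[str]:
    """Extract text per page with pypdf (assumed installed; imported lazily)."""
    from pypdf import PdfReader

    return [page.extract_text() or "" for page in PdfReader(path).pages]
```

For example, `suggest_mode(read_page_texts("report.pdf"))` would return `"direct"` for a born-digital file, where `report.pdf` is a placeholder path.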
Mixed PDFs
Some PDFs have embedded text on some pages and scans on others. In Kaizen OCR, Direct mode returns whatever embedded text is there; for pages that come back without text, switch to an OCR mode and re-run. You can also combine the two results manually.
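Combining the results by hand amounts to a per-page choice between the two runs. As a rough sketch, assuming you have already split each run's output into one string per page (the function name and the `min_chars` threshold are hypothetical, not part of Kaizen OCR):

```python
from typing import List


def merge_page_results(direct: List[str], ocr: List[str], min_chars: int = 20) -> List[str]:
    """Prefer the Direct-mode text for a page; fall back to OCR when it is empty.

    Both lists must hold one entry per page, in page order.
    """
    if len(direct) != len(ocr):
        raise ValueError("both runs must cover the same pages")

    merged = []
    for direct_text, ocr_text in zip(direct, ocr):
        # Count non-whitespace characters to decide whether the page had embedded text.
        has_embedded = len("".join(direct_text.split())) >= min_chars
        merged.append(direct_text if has_embedded else ocr_text)
    return merged
```

Preferring the Direct text where it exists keeps the exact embedded characters and avoids OCR recognition errors on pages that never needed OCR.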
Output formatting
- Pages are separated by a page-break marker in the output.
- Line breaks follow the PDF's visual layout, so output from multi-column or heavily formatted sources may include artifacts.
- For cleaner output from complex layouts, use Searchable PDF with the Azure Layout model.
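If you post-process the exported text yourself, the page markers and layout-driven line breaks can be handled with a few lines of code. This is only a sketch under stated assumptions: the page-break marker is not specified here, so the form feed character below is a placeholder to replace with whatever marker your export actually contains, and both function names are hypothetical.

```python
import re
from typing import List

# Assumption: the export separates pages with a form feed.
# Replace this with the marker that actually appears in your output.
PAGE_MARKER = "\f"


def split_pages(text: str, marker: str = PAGE_MARKER) -> List[str]:
    """Split exported text into one string per page."""
    return [page.strip("\n") for page in text.split(marker)]


def unwrap_lines(page: str) -> str:
    """Undo layout line breaks inside paragraphs, keeping blank-line paragraph breaks."""
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    page = re.sub(r"(\w)-\n(\w)", r"\1\2", page)
    # Turn single newlines into spaces; double newlines (paragraph breaks) are kept.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", page)
```

Note that the hyphen rule also joins genuine compounds that happen to break at the hyphen (for example `multi-` / `column`), and it does nothing for text interleaved from multiple columns, so treat it as a convenience rather than a fix for complex layouts.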