Multilingual OCR

Tips for getting the best results from documents that mix multiple languages or use non-Latin scripts.

The golden rule

Tell the engine what to expect. OCR accuracy improves markedly when the language setting matches the source language. If you leave Tesseract on “English” while scanning Hindi, accuracy drops dramatically.

Single-language documents

  1. Open Settings → Language.
  2. Pick the single language that dominates your document.
  3. Run OCR.

Mixed-language documents

A page that mixes languages or scripts (e.g. English + Chinese, Hindi + English) needs more than one language hint. How you provide the hints depends on the engine:

With Tesseract

In Settings → Language, enter language codes separated by +:

  • eng+chi_sim — English + Simplified Chinese
  • eng+hin — English + Hindi
  • eng+ara — English + Arabic
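The + syntax is the same one Tesseract's own command line accepts through its -l option. As a minimal sketch, assuming you want to build that string programmatically, the helper below joins a list of codes and sanity-checks them. The function names and the validation pattern are illustrative and not part of Kaizen OCR & PDF; the command builder assumes a standalone tesseract binary with the relevant language data installed, and it only constructs the command without running it.

```python
import re

# Tesseract language codes look like "eng", "hin", or "chi_sim":
# three lowercase letters, optionally followed by "_suffix" parts.
# This pattern is a loose sanity check, not an official grammar.
_CODE_RE = re.compile(r"^[a-z]{3}(_[a-z0-9]+)*$")


def build_lang_arg(codes):
    """Join language codes with '+', dropping duplicates but keeping order."""
    seen = []
    for code in codes:
        code = code.strip()
        if not _CODE_RE.match(code):
            raise ValueError(f"unexpected language code: {code!r}")
        if code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


def tesseract_command(image_path, codes):
    """Return the argv list for a standalone Tesseract run (not executed here)."""
    # "stdout" as the output base tells Tesseract to print the recognized text.
    return ["tesseract", image_path, "stdout", "-l", build_lang_arg(codes)]


print(build_lang_arg(["eng", "chi_sim"]))
print(tesseract_command("scan.png", ["eng", "hin"]))
```

Rejecting a value such as "English" early is the point of the check: a full language name in place of a code is an easy mistake to make when typing the setting by hand.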

With Paddle AI

Paddle's multi model handles mixed scripts natively. If you need higher accuracy on a specific language pair, select the Paddle model dedicated to it.

With Azure Document Intelligence

Azure auto-detects language per text block. No configuration needed.

Right-to-left scripts

All three engines recognize Arabic, Hebrew, and Persian text and output it in the correct right-to-left reading order.

CJK (Chinese/Japanese/Korean)

Use Paddle AI for CJK — it's trained on huge CJK corpora and consistently outperforms Tesseract. For Traditional Chinese, pick the dedicated Paddle model.

Complex scripts (Thai, Khmer, Burmese)

These scripts do not separate words with spaces the way Latin scripts do, which makes segmentation harder. Expect:

  • Better results at higher DPI (300+ for scans)
  • Longer processing time
  • Paddle AI generally outperforms Tesseract for Thai
  • Azure Document Intelligence is the best choice for critical Thai / Khmer workloads
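To check whether an existing scan meets the 300 DPI guideline, divide its pixel width by the physical width of the page. The sketch below assumes you know the paper size; the function names and the A4 example values are illustrative only.

```python
def effective_dpi(pixel_width, page_width_inches):
    """Dots per inch implied by an image width and the physical page width."""
    if pixel_width <= 0 or page_width_inches <= 0:
        raise ValueError("dimensions must be positive")
    return pixel_width / page_width_inches


def meets_guideline(pixel_width, page_width_inches, minimum_dpi=300):
    """True if the scan is at or above the suggested minimum resolution."""
    # Round to the nearest whole DPI so 299.9 is not rejected for rounding noise.
    return round(effective_dpi(pixel_width, page_width_inches)) >= minimum_dpi


A4_WIDTH_INCHES = 8.27  # 210 mm

print(meets_guideline(2480, A4_WIDTH_INCHES))  # A4 page scanned at about 300 DPI
print(meets_guideline(1240, A4_WIDTH_INCHES))  # same page at about 150 DPI
```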

Handwriting in non-Latin scripts

Tesseract struggles with handwriting in every script, and Paddle AI is only slightly better. For handwriting in any script, Azure Document Intelligence (with the Read or Layout model) is your best bet.
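
The engine recommendations on this page can be condensed into a small lookup. This is only a sketch of the guidance above: the function name, the script labels, and the critical flag are hypothetical and do not correspond to any setting in the app.

```python
CJK = {"chinese", "japanese", "korean"}


def recommend_engine(script, handwritten=False, critical=False):
    """Return the engine this page suggests, or None if it names no favorite."""
    script = script.strip().lower()
    if handwritten:
        # Handwriting in any script: Azure (Read or Layout model).
        return "Azure Document Intelligence"
    if script in CJK:
        return "Paddle AI"
    if script in {"thai", "khmer"} and critical:
        # Critical Thai / Khmer workloads.
        return "Azure Document Intelligence"
    if script == "thai":
        return "Paddle AI"
    # The page gives no engine preference for other cases.
    return None


for args in [("Japanese",), ("Thai",), ("Khmer", False, True), ("Hindi", True)]:
    print(args, "->", recommend_engine(*args))
```

A None result means the page does not single out an engine for that case, so the language setting matters more than the engine choice.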