The golden rule
Tell the engine what to expect. OCR accuracy jumps when the language setting matches the source language. Don't leave Tesseract on “English” if you're scanning Hindi — accuracy drops dramatically.
Single-language documents
- Open Settings → Language.
- Pick the single language that dominates your document.
- Run OCR.
Mixed-language documents
A multi-script page (e.g. English + Chinese, Hindi + English) needs multiple language hints:
With Tesseract
In Settings → Language, enter language codes separated by +:
- eng+chi_sim: English + Simplified Chinese
- eng+hin: English + Hindi
- eng+ara: English + Arabic
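If you drive Tesseract from a script rather than the Settings screen, the same plus-separated value is passed to the command-line tool's -l option. The sketch below only builds the command and does not run it. The helper names (tesseract_lang_arg, tesseract_command) and the file names are illustrative assumptions, not part of any product.

```python
import shlex


def tesseract_lang_arg(codes):
    """Join language codes into the plus-separated form, e.g. 'eng+hin'."""
    cleaned = [c.strip() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    # Drop duplicates while keeping the order the caller supplied.
    unique = list(dict.fromkeys(cleaned))
    return "+".join(unique)


def tesseract_command(image_path, output_base, codes):
    """Build (but do not run) a tesseract CLI invocation as an argument list."""
    return ["tesseract", image_path, output_base, "-l", tesseract_lang_arg(codes)]


# Hypothetical file names, used only for illustration.
cmd = tesseract_command("scan.png", "scan_text", ["eng", "hin"])
print(shlex.join(cmd))
```

The matching language data files must be installed for every code in the list, otherwise Tesseract will report that it cannot load the language.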
With Paddle AI
Paddle's multi model handles mixed scripts natively. For dedicated accuracy on a specific language pair, select the appropriate Paddle model.
With Azure Document Intelligence
Azure auto-detects language per text block. No configuration needed.
Right-to-left scripts
Arabic, Hebrew, and Persian text is recognized and output in the correct reading direction. All three engines handle RTL correctly.
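A quick sanity check on RTL results is to measure how much of the recognized text actually consists of right-to-left letters. If a page you know to be Arabic or Hebrew comes back with a ratio near zero, the wrong language was probably selected. This is a minimal sketch using only Python's standard library. The function name rtl_ratio is an assumption for illustration, and the check says nothing about display order, which is handled by whatever renders the text.

```python
import unicodedata


def rtl_ratio(text):
    """Return the fraction of letters whose Unicode bidi class is RTL.

    'R' covers Hebrew-style letters, 'AL' covers Arabic-style letters
    (which includes Persian).
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(1 for ch in letters if unicodedata.bidirectional(ch) in ("R", "AL"))
    return rtl / len(letters)


print(rtl_ratio("hello"))
print(rtl_ratio("שלום"))
```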
CJK (Chinese/Japanese/Korean)
Use Paddle AI for CJK — it's trained on huge CJK corpora and consistently outperforms Tesseract. For Traditional Chinese, pick the dedicated Paddle model.
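For readers calling the open-source PaddleOCR package directly, choosing a dedicated model amounts to passing a language code when the engine is created. The sketch below is an assumption-laden illustration: the mapping reflects language codes commonly used by PaddleOCR ("ch", "chinese_cht", "japan", "korean"), the helper name paddle_lang_for is hypothetical, and the model names shown in your own Settings screen may differ. Check the codes against your installed version before relying on them.

```python
# Assumed mapping from a document language to a PaddleOCR `lang` code.
PADDLE_LANG = {
    "simplified chinese": "ch",
    "traditional chinese": "chinese_cht",
    "japanese": "japan",
    "korean": "korean",
}


def paddle_lang_for(language):
    """Look up the assumed PaddleOCR code for a CJK language name."""
    key = language.strip().lower()
    if key not in PADDLE_LANG:
        raise ValueError(f"no CJK model mapped for {language!r}")
    return PADDLE_LANG[key]


print(paddle_lang_for("Traditional Chinese"))

# Usage with the paddleocr package (not executed here; requires installation):
#   from paddleocr import PaddleOCR
#   ocr = PaddleOCR(lang=paddle_lang_for("Traditional Chinese"))
```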
Complex scripts (Thai, Khmer, Burmese)
Unlike Latin scripts, these scripts do not separate words with spaces, so the engine has to infer word boundaries itself. Expect:
- Better results at higher DPI (300+ for scans)
- Longer processing time
- Paddle AI generally outperforms Tesseract for Thai
- Azure Document Intelligence is the best choice for critical Thai / Khmer workloads
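To check whether a scan meets the 300 DPI guideline above, you can estimate its effective resolution from the image width in pixels and the physical page width. This is a rough sketch under stated assumptions: the function names are illustrative, and the page width must be known or guessed (8.5 inches for US Letter is used in the example).

```python
def effective_dpi(pixel_width, page_width_inches):
    """Estimate scan resolution from image width and physical page width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def upscale_factor(current_dpi, target_dpi=300):
    """Scale factor needed to reach target_dpi; 1.0 if already at or above it."""
    if current_dpi <= 0:
        raise ValueError("current DPI must be positive")
    return max(1.0, target_dpi / current_dpi)


# A US Letter page (8.5 in wide) scanned to an image 1275 pixels wide.
dpi = effective_dpi(1275, 8.5)
print(dpi, upscale_factor(dpi))
```

Upscaling an image does not add detail that the scanner never captured, so rescanning at a higher resolution is usually the better fix when the original is available.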
Handwriting in non-Latin scripts
Tesseract struggles with handwriting in every script. Paddle AI is slightly better. For handwriting in any script, Azure Document Intelligence (with the Read or Layout model) is your best bet.