Tesseract languages (bundled)
Kaizen OCR ships with Tesseract trained data for 80+ languages, including:
- Latin-script: English, German, French, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Polish, Czech, Hungarian, Romanian, Turkish, Vietnamese, Indonesian, Swahili
- Cyrillic: Russian, Ukrainian, Bulgarian, Serbian, Macedonian
- East Asian: Chinese (Simplified), Chinese (Traditional), Japanese, Korean
- South Asian: Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Marathi, Nepali, Sinhala, Urdu, Punjabi
- Middle Eastern: Arabic, Hebrew, Persian (Farsi)
- Other: Greek, Thai, Khmer, Burmese, Georgian, Armenian, Amharic, Tigrinya
Paddle AI languages
Paddle's models are multilingual by default: the primary model handles Latin-script, Chinese, and Japanese text automatically. For specialty languages, pick the dedicated model in Settings → OCR engine → Paddle language.
Azure Document Intelligence
If you use the Searchable PDF feature with Azure, you get access to 160+ languages. Azure auto-detects the language in most cases, so no configuration is needed. Full list: Microsoft's documentation.
Mixed-language documents
All three engines handle pages with multiple scripts (e.g. English headings with Chinese body text). For best results:
- Paddle: use the multi model
- Tesseract: specify multiple languages separated by + (e.g. eng+chi_sim) in Settings
- Azure: no config needed; it auto-detects per block
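The Tesseract language value is just the individual language codes joined with a plus sign. As a minimal sketch, assuming you want to build that value from a list of codes (the helper name tesseract_lang is hypothetical and not part of Kaizen OCR):

```python
def tesseract_lang(*codes):
    """Join Tesseract language codes with '+', keeping order and dropping repeats."""
    ordered = []
    for code in codes:
        code = code.strip()
        if not code or "+" in code:
            # Each argument must be one non-empty code such as "eng" or "chi_sim".
            raise ValueError(f"invalid language code: {code!r}")
        if code not in ordered:
            ordered.append(code)
    if not ordered:
        raise ValueError("at least one language code is required")
    return "+".join(ordered)


# English headings with Simplified Chinese body text.
print(tesseract_lang("eng", "chi_sim"))
```

The resulting string has the same form that Tesseract's own command line accepts through its -l option, so a value that works there should match what the Settings field expects.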
RTL languages
Arabic, Hebrew, and Persian text is recognized and output in the correct reading direction. In the text pane, RTL text renders right-to-left; copying to clipboard preserves the correct Unicode order.
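If you want to check text after pasting it elsewhere, one option is to look at the first strongly directional character, which is how the Unicode bidirectional algorithm picks a default paragraph direction. The following is a sketch using only Python's standard library; the function name is hypothetical and not part of Kaizen OCR.

```python
import unicodedata


def first_strong_direction(text):
    """Return 'rtl' or 'ltr' based on the first strongly directional character.

    Returns None when the text has no strong character (for example, digits only).
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew letters are R; Arabic-script letters are AL
            return "rtl"
        if bidi == "L":  # Latin and other left-to-right letters
            return "ltr"
    return None


print(first_strong_direction("שלום"))   # Hebrew sample
print(first_strong_direction("hello"))  # Latin sample
```

Text stored in logical order starts with the first character a reader would read, so an Arabic, Hebrew, or Persian paragraph should report rtl here.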
Performance notes
- Tesseract is fastest on English and other single-script Latin languages.
- Paddle is fastest on Chinese, Japanese, and mixed-script content.
- Complex scripts (Thai, Khmer, Arabic) are slower than Latin scripts on all engines; plan for 2× the processing time for text of similar length.
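The notes above can be turned into a rough rule of thumb. The sketch below is illustrative only: it guesses scripts from Unicode character names, the function names are hypothetical, and the lists of scripts contain only the examples named in the notes, so treat the output as a starting point and not as how Kaizen OCR chooses an engine.

```python
import unicodedata

# Assumption: only the scripts named in the performance notes are listed here.
COMPLEX_SCRIPTS = {"THAI", "KHMER", "ARABIC"}
CJK_SCRIPTS = {"CJK", "HIRAGANA", "KATAKANA"}


def scripts_in(text):
    """Return script labels, taken from the first word of each letter's Unicode name."""
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def suggest_engine(text):
    """Suggest 'tesseract', 'paddle', or None when the notes give no guidance."""
    scripts = scripts_in(text)
    if not scripts:
        return None
    if scripts == {"LATIN"}:
        return "tesseract"  # fastest on single-script Latin text
    if scripts & CJK_SCRIPTS or len(scripts) > 1:
        return "paddle"  # fastest on Chinese, Japanese, and mixed-script content
    return None


def time_budget(latin_seconds, text):
    """Double a Latin-text time estimate when a complex script is present."""
    factor = 2.0 if scripts_in(text) & COMPLEX_SCRIPTS else 1.0
    return latin_seconds * factor


print(suggest_engine("Hello world"))
print(suggest_engine("Hello 世界"))
print(time_budget(10, "สวัสดี"))
```

Single-script text outside Latin, Chinese, and Japanese returns None because the notes do not rank the engines for it.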