
Supported languages

Tesseract ships with 80+ languages, Paddle AI supports many more, and Azure Document Intelligence covers 160+. Every engine's language data is either bundled or plugged in, so there is nothing extra to download.

Tesseract languages (bundled)

Kaizen OCR ships with Tesseract trained data for 80+ languages, including:

  • Latin-script: English, German, French, Spanish, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Finnish, Polish, Czech, Hungarian, Romanian, Turkish, Vietnamese, Indonesian, Swahili
  • Cyrillic: Russian, Ukrainian, Bulgarian, Serbian, Macedonian
  • East Asian: Chinese (Simplified), Chinese (Traditional), Japanese, Korean
  • South Asian: Hindi, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Marathi, Nepali, Sinhala, Urdu, Punjabi
  • Middle Eastern: Arabic, Hebrew, Persian (Farsi)
  • Other: Greek, Thai, Khmer, Burmese, Georgian, Armenian, Amharic, Tigrinya

Paddle AI languages

Paddle's models are multilingual by default: its primary model handles Latin-script languages, Chinese, and Japanese automatically. For specialty languages, pick the dedicated model in Settings → OCR engine → Paddle language.

Azure Document Intelligence

If you use the Searchable PDF feature with Azure, you get access to 160+ languages. Azure auto-detects the language in most cases, so no configuration is needed. Full list: Microsoft's documentation.

Mixed-language documents

All three engines handle pages with multiple scripts (e.g. English headings with Chinese body text). For best results:

  • Paddle: use the multi model
  • Tesseract: specify multiple languages separated by + (e.g. eng+chi_sim) in Settings
  • Azure: no config needed — it auto-detects per block
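
For anyone scripting Tesseract outside the app, the + syntax above can be sketched in Python as follows. This is only an illustration: the helper name tesseract_lang_string and the small lookup table are hypothetical and not part of Kaizen OCR. The codes themselves (eng, chi_sim, and so on) are Tesseract's standard trained-data names.

```python
# Hypothetical helper, not part of Kaizen OCR: maps display names to
# Tesseract trained-data codes. Only a few languages are listed here.
TESSERACT_CODES = {
    "English": "eng",
    "German": "deu",
    "French": "fra",
    "Russian": "rus",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Arabic": "ara",
}


def tesseract_lang_string(names):
    """Join language codes with '+', keeping order and dropping duplicates."""
    codes = []
    for name in names:
        code = TESSERACT_CODES[name]  # raises KeyError for unlisted names
        if code not in codes:
            codes.append(code)
    return "+".join(codes)


print(tesseract_lang_string(["English", "Chinese (Simplified)"]))  # eng+chi_sim
```

The resulting string is what you would type into the Tesseract language field in Settings.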

RTL languages

Arabic, Hebrew, and Persian text is recognized and output in the correct reading direction. In the text pane, RTL text renders right-to-left; copying to clipboard preserves the correct Unicode order.
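
To check copied text yourself, the short Python sketch below may help. It assumes that "correct Unicode order" means logical order, that is, characters stored in the order they are read, with rendering direction left to the display. The function name has_rtl is hypothetical; unicodedata is part of Python's standard library.

```python
import unicodedata


def has_rtl(text):
    """Return True if any character has a right-to-left bidi class.

    "R" covers Hebrew letters; "AL" covers Arabic-script letters,
    which includes Persian.
    """
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


# Hebrew "shalom" in logical order: shin, lamed, vav, final mem.
sample = "\u05e9\u05dc\u05d5\u05dd"

print(has_rtl(sample))         # True
print(has_rtl("hello"))        # False
print(sample[0] == "\u05e9")   # True: the first stored character is the first one read
```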

Performance notes

  • Tesseract is fastest on English and other single-script Latin languages.
  • Paddle is fastest on Chinese, Japanese, and mixed-script content.
  • Complex scripts (Thai, Khmer, Arabic) are slower than Latin scripts on every engine. Plan for roughly 2× the processing time of Latin text of similar length.
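
As a rough planning aid, the 2× rule of thumb can be written as a small Python sketch. The function name estimate_seconds is hypothetical, and the baseline is assumed to be a time you measured yourself for Latin text of similar length; this document does not give absolute timings.

```python
# Scripts the performance notes call out as slower; extend as needed.
COMPLEX_SCRIPTS = {"Thai", "Khmer", "Arabic"}


def estimate_seconds(latin_baseline_seconds, script):
    """Scale a measured Latin-text baseline by 2 for complex scripts."""
    factor = 2.0 if script in COMPLEX_SCRIPTS else 1.0
    return latin_baseline_seconds * factor


print(estimate_seconds(30.0, "Thai"))    # 60.0
print(estimate_seconds(30.0, "German"))  # 30.0
```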