Multilingual OCR

Some documents contain text in more than one language. OCR & PDF Tools can handle multilingual documents by combining multiple language packs during recognition.

How Multilingual OCR Works

The Tesseract OCR engine can load multiple language models simultaneously. When processing a multilingual document, it attempts to recognize characters from all selected languages, choosing the best match for each word or character.
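
For readers who also script Tesseract directly, the same combination is written by joining language codes with `+` in the `-l` option. The sketch below assumes a standalone `tesseract` command-line install. `build_tesseract_command` and the file names are hypothetical, and this page does not state how OCR & PDF Tools invokes the engine internally.

```python
from typing import List


def build_tesseract_command(image_path: str, output_base: str,
                            languages: List[str]) -> List[str]:
    """Return an argv list for the tesseract CLI (hypothetical helper).

    Tesseract accepts several language codes joined by '+', e.g. 'eng+fra'.
    """
    if not languages:
        raise ValueError("select at least one language")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


# Placeholder file names; the codes are Tesseract's English, French, German.
command = build_tesseract_command("scan.png", "scan_text", ["eng", "fra", "deu"])
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on a machine where Tesseract and the three language models are installed.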

Setting Up Multilingual OCR

Step 1: Install Required Language Packs

Make sure a language pack is installed for every language present in your document. Go to Settings > Language Packs and install any that are missing.
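
Outside the app, a standalone Tesseract install reports its available models with `tesseract --list-langs`. As a rough sketch of the same check, the code below parses that style of output and reports which required codes are missing. The helper names and the sample output are illustrative assumptions, not output from OCR & PDF Tools.

```python
from typing import List, Set


def parse_installed_languages(list_langs_output: str) -> Set[str]:
    """Extract language codes from `tesseract --list-langs` style output.

    The first line is a header; each following non-empty line is one code.
    """
    lines = [line.strip() for line in list_langs_output.splitlines()]
    return {line for line in lines[1:] if line}


def missing_languages(required: List[str], installed: Set[str]) -> List[str]:
    """Return the required codes that are not installed, preserving order."""
    return [code for code in required if code not in installed]


# Illustrative output only; a real install will list different packs.
sample_output = """List of available languages in "/usr/share/tessdata/" (3):
eng
fra
osd
"""

installed = parse_installed_languages(sample_output)
print(missing_languages(["eng", "fra", "deu"], installed))  # ['deu']
```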

Step 2: Select Multiple Languages

  1. Click the Language dropdown in the toolbar.
  2. Hold Ctrl and click to select multiple languages.
  3. The selected languages appear in the dropdown display (e.g., "English + French + German").

Step 3: Run OCR

Process the image or document as usual. The OCR engine will use all selected language models to recognize text.

[Figure: Multilingual language selection]

Tips for Best Results

Language Order Matters

Place the primary (most common) language first in your selection. The OCR engine gives priority to the first selected language, using additional languages for words or characters that do not match the primary language.
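
When languages are chosen programmatically, the ordering rule can be sketched as a sort by how much of the document each language covers. The shares here are the user's own rough estimate, and `language_argument` is a hypothetical helper rather than part of the app.

```python
from typing import Dict


def language_argument(estimated_share: Dict[str, float]) -> str:
    """Join language codes with '+', most common language first.

    `estimated_share` maps a language code to a rough share of the
    document's text (a user-supplied estimate, not measured by the engine).
    """
    ordered = sorted(estimated_share,
                     key=lambda code: estimated_share[code],
                     reverse=True)
    return "+".join(ordered)


# A mostly English document with some French and a little German.
print(language_argument({"fra": 0.2, "eng": 0.7, "deu": 0.1}))  # eng+fra+deu
```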

Limit the Number of Languages

While you can select many languages simultaneously, using too many can:

  • Slow down processing
  • Increase the chance of misrecognition (the engine may confuse look-alike characters from different languages or scripts, such as Latin "A" and Cyrillic "А")

For best accuracy, select only the languages you know are present in the document.

Same-Script Languages

Languages that share the same script (e.g., English, French, and Spanish all use Latin characters) work well together. Combining languages with very different scripts (e.g., English and Chinese) also works but may require more processing time.

Common Multilingual Scenarios

Scenario                       | Recommended Languages
European business documents    | English + French + German
Academic papers with citations | English + source language
Bilingual signs or menus       | Primary language + secondary language
Mixed English and CJK text     | English + Chinese/Japanese/Korean
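
For scripted use, the language names in the table correspond to Tesseract model codes. The mapping below covers only the languages named above and uses the standard Tesseract codes. `to_language_argument` is a hypothetical helper, and the input format mirrors the dropdown display string as an assumption.

```python
# Tesseract model codes for the languages named in the table above.
TESSERACT_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
}


def to_language_argument(display_selection: str) -> str:
    """Convert 'English + French + German' into 'eng+fra+deu'.

    Raises KeyError for a language name that is not in TESSERACT_CODES.
    """
    names = [name.strip() for name in display_selection.split("+")]
    return "+".join(TESSERACT_CODES[name] for name in names)


print(to_language_argument("English + French + German"))  # eng+fra+deu
print(to_language_argument("English + Japanese"))         # eng+jpn
```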

Script Mixing

Documents that intermix scripts extensively (e.g., Arabic and Latin characters on the same line) may require additional preprocessing with Advanced OCR for optimal results.
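
One way to spot such documents is to check which scripts appear on each line of text from a first OCR pass. The heuristic below classifies letters by the leading word of their Unicode character name. It is an illustrative sketch only, covers just the scripts listed in `SCRIPT_PREFIXES`, and says nothing about how Advanced OCR preprocesses a page.

```python
import unicodedata
from typing import Set

# Script labels keyed by the first word of the Unicode character name.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "ARABIC": "Arabic",
    "CYRILLIC": "Cyrillic",
    "GREEK": "Greek",
    "CJK": "CJK",
    "HIRAGANA": "Japanese kana",
    "KATAKANA": "Japanese kana",
    "HANGUL": "Hangul",
}


def scripts_in_line(line: str) -> Set[str]:
    """Return the set of known scripts used by the letters in `line`."""
    found = set()
    for char in line:
        if not char.isalpha():
            continue  # skip digits, spaces, and punctuation
        prefix = unicodedata.name(char, "").split(" ", 1)[0]
        if prefix in SCRIPT_PREFIXES:
            found.add(SCRIPT_PREFIXES[prefix])
    return found


def is_mixed_script(line: str) -> bool:
    """True when letters from more than one known script share a line."""
    return len(scripts_in_line(line)) > 1


print(is_mixed_script("Invoice فاتورة 2024"))  # True
print(is_mixed_script("Hello world"))          # False
```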

Troubleshooting Multilingual OCR

  • Wrong characters recognized -- Verify the correct languages are selected. Remove any unnecessary languages.
  • Partial recognition -- Check that all relevant language packs are installed.
  • Slow processing -- Reduce the number of selected languages to only those present in the document.
