3.65 Extract text (OCR)

Updated on Oct 03, 2024

In this article, you will learn

which requirements must be met for this Fixup and
how to apply this Fixup.

1. Introduction

The Fixup Extract Text (OCR) is available in the Workflow. This Fixup extracts all text elements of the PDF file in the selected language and stores them as a text file (.txt) in the "Additional Data" tab.

2. General

Use the Extract Text (OCR) Fixup to extract characters and numbers from the Print Item in the selected language. Increasing the contrast can significantly improve the readability of the text in some cases. This Fixup does not change the Print Item itself.

For example, use this Fixup to check the spelling of the text in a file: the extracted text can be checked for spelling mistakes in a spellchecker of your choice.
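As an illustration of the spell-check use case, the following minimal Python sketch flags words in the extracted text that do not appear in a word list. The function name `find_unknown_words`, the sample text, and the word list are hypothetical and only serve the example; in practice, you would read the downloaded text file and use a real dictionary or spellchecker.

```python
import re

def find_unknown_words(text, known_words):
    """Return words from text that are not in known_words, in order of first appearance."""
    known = {w.lower() for w in known_words}
    seen = set()
    unknown = []
    # [^\W\d_]+ matches runs of letters, including accented and non-Latin letters.
    for word in re.findall(r"[^\W\d_]+", text):
        key = word.lower()
        if key not in known and key not in seen:
            seen.add(key)
            unknown.append(word)
    return unknown

# Hypothetical extracted text; in practice, read the downloaded .txt file instead.
sample_text = "Fresh orgnic apples from the regoin"
word_list = ["fresh", "organic", "apples", "from", "the", "region"]
print(find_unknown_words(sample_text, word_list))  # ['orgnic', 'regoin']
```

A simple word-list comparison like this only finds unknown words; it does not detect OCR errors that happen to produce valid words.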

The Fixup can be found in the Data Preparation tab of an Article, Order, or Production Job by

entering the name of the Fixup in the Filter area,
activating the Text option in the Category area,
activating the Create option in the Action area,
activating the Text or OCR option in the Search area.

3. Description

This Fixup uses OCR to extract text from the file, even if the text has already been converted to outlines. The resulting text file is saved in the "Additional Data" tab. Increasing the contrast can often improve the recognition of letters to the point where gray letters on a gray background can be distinguished.
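Why a higher contrast can help with gray-on-gray text can be illustrated with a simple linear contrast stretch. This is only a sketch of the general idea: the method the Fixup actually uses is not described here, and the function `stretch_contrast` and the factor values are assumptions made for the example.

```python
def stretch_contrast(pixels, factor):
    """Scale 8-bit gray values away from mid-gray (128) by factor, clamped to 0..255."""
    return [max(0, min(255, round(128 + (p - 128) * factor))) for p in pixels]

# Gray letters (value 110) on a slightly lighter gray background (value 140).
row = [140, 140, 110, 110, 140]
print(stretch_contrast(row, 1.0))  # [140, 140, 110, 110, 140] (unchanged)
print(stretch_contrast(row, 4.0))  # [176, 176, 56, 56, 176]
```

In this example, the difference between letter and background grows from 30 to 120 gray levels, which makes the letter shapes easier to separate from the background.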

Note that this Fixup does not change the Print Item.

3.1. Requirements and Functionality

To ensure that the Fixup works as described, the following requirements must be met:

There must be letters/numbers in the file.
The form in which the letters/numbers are present is irrelevant. They may be present
- as text objects,
- as vectors (texts converted to outlines) or
- in images (texts that have already been rendered).

Extraction sequence

Please note that the text passages are not always extracted in the expected reading order. This is particularly the case when a layout with texts is arranged across the entire Print Item and individual text blocks are formatted below it, e.g. in two columns. Text recognition starts at the top left and works its way down to the bottom right, depending on the structure of the file. Texts that have been converted to outlines or images are always added at the end, with outlined texts placed before texts in images.
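The sequence described above can be modeled roughly as a sort: first by the form of the text, then from top to bottom, then from left to right. The following Python sketch is a simplified model only, since the actual sequence also depends on the structure of the file. The `TextBlock` class, its fields, and the sample blocks are hypothetical.

```python
from dataclasses import dataclass

# Priority taken from the description: live text first, then outlines, then images.
KIND_PRIORITY = {"text": 0, "outline": 1, "image": 2}

@dataclass
class TextBlock:
    content: str
    kind: str    # "text", "outline", or "image"
    top: float   # distance from the top edge of the page
    left: float  # distance from the left edge of the page

def extraction_order(blocks):
    """Sort blocks by kind first, then top to bottom, then left to right."""
    return sorted(blocks, key=lambda b: (KIND_PRIORITY[b.kind], b.top, b.left))

blocks = [
    TextBlock("Heading (image)", "image", top=10, left=10),
    TextBlock("Right column", "text", top=50, left=300),
    TextBlock("Intro (outlined)", "outline", top=30, left=10),
    TextBlock("Left column", "text", top=50, left=10),
]
print([b.content for b in extraction_order(blocks)])
# ['Left column', 'Right column', 'Intro (outlined)', 'Heading (image)']
```

In this model, a heading that exists only as an image ends up at the end of the text file even though it sits at the top of the page.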

Figure 1: The dialog of the Fixup Extract Text (OCR)

To extract text from the Print Item, select the following options:

Language [1] – use the drop-down menu to select the language in which the text that should be extracted was written. The following options are available:
- German – the text to be extracted was written in German.
- English – the text to be extracted was written in English.
- Italian – the text to be extracted was written in Italian.
- French – the text to be extracted was written in French.
- Spanish – the text to be extracted was written in Spanish.
- Portuguese – the text to be extracted was written in Portuguese.
- Swedish – the text to be extracted was written in Swedish.
- Polish – the text to be extracted was written in Polish.
- Russian – the text to be extracted was written in Russian.
- Japanese – the text to be extracted was written in Japanese.
- Korean – the text to be extracted was written in Korean.
- Chinese – the text to be extracted was written in Chinese.
- Variable Content [4] – select the desired placeholder database field or custom field from which the value for the language should be retrieved. The following values must be found in the selected field for the selection to be executed:
  - deu
  - eng
  - ita
  - fra
  - spa
  - por
  - swe
  - pol
  - rus
  - jpn
  - kor
  - chi_tra
Accuracy [2] – use the drop-down menu to select the resolution at which the image should be rendered for OCR. The more ornate the writing, the higher the accuracy should be. However, keep in mind that processing at a higher resolution takes more time. The following options are available:
- Fast (large text size) – the file is rendered at a resolution of 150 dpi. This option is suitable for large text sizes.
- Normal (reading sizes) – the file is rendered at a resolution of 600 dpi. This option is suitable for reading sizes.
- High Quality (small text size) – the file is rendered at a resolution of 1200 dpi. This option is suitable for small text sizes.
- Variable Content [5] – select the desired placeholder database field or custom field from which the value for the accuracy should be retrieved. The following values must be found in the selected field for the selection to be executed:
  - 150
  - 600
  - 1200
Contrast [3] – use the drop-down menu to select the text contrast. In many cases, the higher the contrast, the more legible the text. The following options are available:
- Standard – the text contrast remains unchanged.
- High – the contrast of the image used for OCR is increased.
- Extreme – the contrast of the image used for OCR is increased to the maximum.
- Variable Content [6] – select the desired placeholder database field or custom field from which the value for the contrast should be retrieved. The following values must be found in the selected field for the selection to be executed:
  - Standard
  - High
  - Extreme
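
To get a feeling for why a higher accuracy takes more time, the following Python sketch calculates how many pixels are rendered for an A4 page at the three resolutions. The page size and the function `rendered_pixels` are assumptions made for the example; how closely the processing time follows the pixel count is not documented here.

```python
def rendered_pixels(width_mm, height_mm, dpi):
    """Return the number of pixels when a page of the given size is rendered at dpi."""
    mm_per_inch = 25.4
    width_px = round(width_mm / mm_per_inch * dpi)
    height_px = round(height_mm / mm_per_inch * dpi)
    return width_px * height_px

# A4 page (210 x 297 mm) at the three resolutions offered by the Accuracy option.
base = rendered_pixels(210, 297, 150)
for dpi in (150, 600, 1200):
    pixels = rendered_pixels(210, 297, dpi)
    print(dpi, pixels, round(pixels / base, 1))
```

Because the pixel count grows with the square of the resolution, 600 dpi produces roughly 16 times and 1200 dpi roughly 64 times as many pixels as 150 dpi.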

Figure 2: Left: Values present in the selected field for the Language option. Center: Values present in the selected field for the Accuracy option. Right: Values present in the selected field for the Contrast option.
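
Before filling a database field or custom field that is used as Variable Content, it can help to check the values against the lists above. The following Python sketch does this outside of the Workflow. The function `validate_ocr_fields` is hypothetical, and the exact, case-sensitive comparison is an assumption, since it is not documented here how strictly the Fixup matches the values.

```python
# Accepted values as listed for the three Variable Content options.
LANGUAGE_VALUES = {
    "deu", "eng", "ita", "fra", "spa", "por",
    "swe", "pol", "rus", "jpn", "kor", "chi_tra",
}
ACCURACY_VALUES = {"150", "600", "1200"}
CONTRAST_VALUES = {"Standard", "High", "Extreme"}

def validate_ocr_fields(language, accuracy, contrast):
    """Return a list of problems; an empty list means all three values are accepted."""
    problems = []
    if language not in LANGUAGE_VALUES:
        problems.append(f"Unknown language value: {language!r}")
    # Accuracy may arrive as a number or as text, so compare its text form.
    if str(accuracy) not in ACCURACY_VALUES:
        problems.append(f"Unknown accuracy value: {accuracy!r}")
    if contrast not in CONTRAST_VALUES:
        problems.append(f"Unknown contrast value: {contrast!r}")
    return problems

print(validate_ocr_fields("deu", 600, "High"))  # []
print(validate_ocr_fields("de", 300, "Max"))    # three problems are reported
```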

3.2. Before/After

To experiment with this function, refer to the example file "Sample_Extract text (OCR).pdf". There are three different fonts in the file.

The heading was created using the Bradley Hand Bold font, Bold style, and 8 pt font size. The heading was converted to an image.
The first paragraph was formatted using the Myriad Pro font, Regular and Italic styles, and a 5 pt font size. This part of the text was converted to paths, which means that it is now a vector.
The second paragraph was created using the Snell Roundhand font in Regular style and 6 pt font size. This is a normal text passage.

After applying the Fixup with the values from Figure 1, you will receive the text file "Sample_Extract text_OCR.txt", which can be downloaded in the Additional Data tab.

Figure 3: Left: initial file; Right: the Additional Data tab with the extracted text file

Article update: Workflow 1.18.0 – 08/2024
