6 September 2022

Optical character recognition in Safetica ONE

Introduction

Modern OCR technology enables the detection of sensitive data in PDF documents and image files. The entire process is carried out at the workstation level, so it does not burden network resources and works even when the station is offline. See how to set up optical character recognition step-by-step to better protect your company's assets.

How does OCR work in Safetica ONE?

This function is deactivated by default. This allows the administrator to selectively control the OCR technology’s load on the endpoints. Optical character recognition supports the following image types: .png, .fit, .jpg, .jpeg, .jpe, .bmp. However, you can also extract images of other formats, such as .pdf, presentation files, ebooks and many others.

The list of supported formats can be found by clicking here.

For performance reasons, Safetica only scans files that have at least 75% of the pages covered by images. The page covered with images must contain images on at least 70% of its surface.

How do I set the OCR language?

Language settings can be found in Protection > Data Categories > OCR Language. The configuration is independent of the Safetica Client language configured for the endpoints. The OCR language is also centralised and homogeneous for the entire environment (no other one can be set for different endpoints).

You can select 2 different languages for OCR scanning, which can be helpful for companies working in multiple languages. This is mainly useful for different character sets, such as Cyrillic, Latin and Chinese alphabets.

However, setting an additional language will affect the performance of the workstation.

Languages without special characters are a subset of languages with special characters. This means that if you set, for example, Czech or German as the primary language, it is not necessary to set English as an additional language because all English characters are already included in the Czech/ German character sets.

Where can I set up an OCR scan?

You can configure settings related to content inspection in Safetica Management Console > Protection > Data Categories. Simply select a sensitive data category from the list on the left or create a new one and click the Set Data Category button.

If you enable OCR in one data category, it will be activated for all file types specified in that data category.

In the next step, you will see the new options available in the detection rule configuration: Optical Character Recognition (OCR) and Extensions. The sliders are independent.

How do I choose file types for OCR?

When OCR is enabled, it is run only for the file formats specified in section Extensions and not for all files.

For example, if an administrator turns on OCR and then sets only the .pdf extension in the Extensions section, no other formats, even .jpeg or .bmp image files, will be scanned with OCR. If the administrator selects Recommended in the Extensions section, OCR will run the scan only for recommended file types.

How do I activate or deactivate OCR for a specific workstation?

Go to Settings > Deactivate Workstation, and select a particular workstation or group in the user tree. In the Functions section of the protection module, select the desired option using the Optical Character Recognition (OCR) slider.

OCR is dependent on the Safetica content analysis service. If you deactivate the content analysis service, the OCR will not start even if you activate it in the Protection Features section.

What does the administrator see in the logs?

The content control results are shown in the Safetica Management Console. Go to Protection > DLP Logs and click Details in the Records table.

Example of usage

The company often uses .pdf, .docx, .pptx, .html, and .xml files. However, administrators know sensitive data (credit card and IBANs) can only be found in .pdf and .docx files.

They create a data category that only scans .pdf and .docx files. This setting excludes image file types from OCR scanning, so the company will not burden workstations unnecessarily.