OCR for Financial and Legal Documents
Optical recognition of the content of (usually paper) documents has been broadly available since the early 1990s, in line with the availability of a range of document scanners. While both scanner hardware and recognition algorithms have progressed immensely since then, the basic assumptions of content recognition remain the same: the paper is scanned (typically at 300 DPI or more), and the image is converted to black and white using a set of filters that attempt to remove image noise, dirt, etc. and to emphasize those features of the scanned characters that allow the recognition algorithms to identify the characters as reliably as possible.
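As a rough illustration of the black-and-white conversion, the sketch below binarizes a grayscale image with Otsu's global threshold. Otsu's method is only one common choice and is assumed here for illustration; the text does not say which filters a given engine applies, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that best separates dark from light.

    `pixels` is a flat list of integer gray levels. Pixels <= t form the
    dark class; t maximizes the between-class variance (Otsu's method).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels in the dark class so far
    sum_dark = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet, nothing to separate
        w_light = total - w_dark
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels):
    """Map each pixel to 1 (ink) or 0 (background), assuming dark ink on light paper."""
    t = otsu_threshold(pixels)
    return [1 if p <= t else 0 for p in pixels]
```

A global threshold of this kind works best on evenly lit scans; noise and dirt removal, which the paragraph above also mentions, would need additional filtering that this sketch does not attempt.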
The text recognition algorithms typically perform three steps:
- (a) identification of text blocks and tokenization of the blocks into individual letters;
- (b) identification of individual letters;
- (c) dictionary-based, probability-driven correction of recognized words and/or sentences in order to make up for identification errors.
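Step (c) can be illustrated with a minimal sketch that snaps each recognized word to its closest entry in a word list. It uses Python's standard difflib module; the vocabulary, the 0.75 similarity cutoff and the function names are assumptions made for this example and are not taken from any particular OCR engine.

```python
import difflib

# Hypothetical, tiny vocabulary; a real engine would use a large,
# language-specific dictionary with word probabilities.
VOCABULARY = ["invoice", "total", "amount", "payment", "account", "number"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the closest dictionary word, or the input unchanged if none is close enough."""
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line):
    """Apply the word-level correction to every whitespace-separated token."""
    return " ".join(correct_word(token) for token in line.split())
```

Under these assumptions, a misread such as "inv0ice" is close enough to "invoice" to be corrected, while a token such as "4711" has no dictionary neighbour and is returned unchanged.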
Depending on the use case, the available resources (required recognition speed, hardware) and the principles of the given text recognition algorithm, the balance between the accuracy of individual character identification and the strength of the dictionary-based correction may vary. Generally, however, in all of the publicly available text recognition engines today, dictionary-based correction plays the prominent role in overall recognition accuracy.
There are, however, numerous cases where dictionary-based correction is impossible due to the nature of the documents to be digitized.
Our OCR technology targets the digitization of paper forms of various kinds (e.g. financial, manufacturing and trade documents), which share a set of common features, including:
- (a) a discrete (typically low) number of form-based document layouts in a given institution;
- (b) a subset of fields in a given layout contains text that cannot be matched against any available dictionary (e.g. amounts, IDs without a CRC, etc.);
- (c) very high sensitivity to character misidentification (e.g. a misidentified ID number may result in a transfer of funds to the wrong account).
In such cases, the quality of tokenization and character identification plays a critical role in the usefulness of the digitized documents.
Our OCR is designed to recognize the content of financial and legal documents, especially those that have form- or table-based layouts and structured content.
It is optimized to maximize the recognition rate of individual characters, the tokenization of sentences and words, and the recognition of table layouts. It also uses intelligent dictionaries and adaptive validation algorithms that understand the data likely to appear on invoices, reports, receipts and other types of financial and legal documents, and that can detect and correct typical recognition errors (using, for example, checksums in identifiers and account numbers, conventions for writing addresses, tax rates, first and last names, and numbers that should add up to totals).
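One way to picture checksum-driven validation is the IBAN check (the mod-97 rule of ISO 13616) combined with a table of visually similar characters. The sketch below is an illustration under stated assumptions, not a description of the product's internals: the confusion table, the single-substitution search and all function names are hypothetical. A second helper shows a check that line items add up to a total, using exact decimal arithmetic.

```python
from decimal import Decimal

# Hypothetical table of character pairs that OCR output tends to confuse.
CONFUSABLE = {"O": "0", "0": "O", "I": "1", "1": "I",
              "S": "5", "5": "S", "B": "8", "8": "B"}


def iban_is_valid(iban):
    """Check an IBAN with the mod-97 rule (country-specific lengths are not checked)."""
    s = iban.replace(" ", "").upper()
    if len(s) < 5 or not (s.isascii() and s.isalnum()):
        return False
    # Move the country code and check digits to the end.
    rearranged = s[4:] + s[:4]
    # Letters map to 10..35 (A=10, ..., Z=35); digits map to themselves.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def checksum_candidates(reading):
    """Return every single-character confusable substitution that passes the checksum."""
    s = reading.replace(" ", "").upper()
    candidates = []
    for i, ch in enumerate(s):
        if ch in CONFUSABLE:
            candidate = s[:i] + CONFUSABLE[ch] + s[i + 1:]
            if iban_is_valid(candidate):
                candidates.append(candidate)
    return candidates


def items_match_total(items, total):
    """True if the line items (decimal strings) add up exactly to the stated total."""
    return sum(Decimal(x) for x in items) == Decimal(total)
```

For example, the reading "DE89 3704 0044 0532 0I30 00" (letter I in place of digit 1) fails the checksum, and the corrected value DE89370400440532013000 appears among the candidates. If more than one candidate passes, the checksum alone cannot decide and the field would still need another source of evidence or human review.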
The system is also taught to recognize fonts that can be encountered on:
- Governmental and business forms
The Active Text OCR is most often used as part of an automated document processing workflow.