High Accuracy OCR - Cleaner Data Reduces OCR Errors

Overview

More than half of OCR errors are not flagged as suspicious by conventional OCR engines. These errors cannot be found cost-effectively by manual review, and are often missed by other error correction technology, so they flow through to the end user's application. For most applications, errors in the application data are extremely costly.

This document shows how Prime Recognition’s High Accuracy OCR engine, called PrimeOCR, can reduce the number of errors left in recognized data, AFTER manual error correction, by roughly 75%.

OCR error rates are highly variable based on the quality of images, font types, etc. This analysis uses a single, relatively typical, example to make it easy to follow. Feel free to contact Prime Recognition for a more elaborate model of error rates, or to ask how your particular situation may differ from this model. For many applications this analysis will understate the quality improvement of using Prime Recognition’s OCR engine.

This analysis uses Prime Recognition's entry level engine, which produces 65% (roughly two thirds) fewer errors than conventional OCR engines. Prime Recognition also offers higher accuracy options that produce 82% (roughly four fifths) fewer errors than conventional OCR engines.

Image Conversion Alternatives

An image of a document, i.e., a piece of paper converted into pixels in computer memory, is of little value unless you also electronically capture information about the image’s content. Ideally you want to capture all the text that appears on the document. The fast growth of imaging systems in recent years for automated processing of insurance forms, medical claims, legal documents, and other types of data on paper suggests that there is tremendous value in electronically capturing the text information of an image.

Currently, the most common way to capture this information is multiple pass manual data entry. Multiple passes (i.e., typing in the same text 2-3 times, comparing the results, and fixing the discrepancies) are required because a single pass is not accurate enough for most applications. A common accuracy target is 99.95%, or 0.5 errors in 1000 characters. Three pass manual data entry can usually reach this accuracy, but it is very expensive because of the cost of labor.
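To put the accuracy target in concrete terms, the short Python sketch below converts an accuracy rate into an expected number of wrong characters. The function name and the 2000 character page size (borrowed from the example later in this document) are illustrative choices, not part of any Prime Recognition product.

```python
def expected_errors(accuracy: float, characters: int) -> float:
    """Expected number of wrong characters at a given accuracy rate."""
    return (1.0 - accuracy) * characters


# The 99.95% target applied to 1000 characters.
per_thousand = expected_errors(0.9995, 1000)

# The same target applied to a 2000 character page (assumed page size).
per_page = expected_errors(0.9995, 2000)

print(f"errors per 1000 characters: {per_thousand:.2f}")
print(f"errors per 2000 character page: {per_page:.2f}")
```

In other words, the target allows about one wrong character on a typical full text page.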

OCR is a popular replacement for manual data entry because it is significantly less expensive. However, OCR is less accurate than multiple pass data entry, and sometimes less accurate than single pass data entry, even after OCR error correction. Since many applications require high data accuracy, many users cannot use OCR technology. Why is OCR less accurate than manual data entry? Or put another way, why not just review all OCR output, find the errors, and correct them? The answer lies in the limitations of the available error correction technology, which includes automated and manual techniques.

Automated OCR Error Correction

There are applications where it has been possible to automate a partial review of OCR data. For example, some data lends itself to automated review, such as spell check on English text, or mathematical checks on numeric tables, or table lookups on a "State Code" field on a form.
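As a rough illustration of the kinds of automated checks mentioned above, the Python sketch below flags a form field that fails a table lookup and a numeric table whose line items do not add up to the printed total. The state code list, the function names, and the sample values are all hypothetical; a real application would supply its own reference data and rules.

```python
from typing import Sequence

# Hypothetical reference table; a real application would load the full list.
VALID_STATE_CODES = {"CA", "NV", "NY", "TX", "WA"}


def flag_state_code(value: str) -> bool:
    """Return True when the field fails the lookup and needs review."""
    return value.strip().upper() not in VALID_STATE_CODES


def flag_column_total(amounts_in_cents: Sequence[int], printed_total: int) -> bool:
    """Return True when the line items do not sum to the printed total."""
    return sum(amounts_in_cents) != printed_total


print(flag_state_code("NV"))   # passes the lookup
print(flag_state_code("N7"))   # fails the lookup, e.g. "V" misread as "7"
print(flag_column_total([1200, 350, 75], 1625))   # items match the total
print(flag_column_total([1200, 850, 75], 1625))   # items do not match
```

Note that both checks only flag a field for review. They do not say which character is wrong or what the correct value should be.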

Unfortunately, automated OCR review is not a full solution:

Most images contain a large amount of data which does not lend itself to an automated review.

Automated reviews are only partially effective at finding mistakes. For example, an OCR error that converts "contract" to "contact" will not be found with spell check. Most conventional OCR packages include a linguistic technology which pushes OCR output to be a correctly spelled word. While this does improve conventional OCR accuracy it also negates much of the value of a post OCR spell check to identify errors.

Most users do not trust automated tests to correct mistakes since the tests have limited effectiveness. These users only use automated tests to flag errors. This means that a person must still review the potential error and correct it.

Any automated test beyond a simple spell check is highly custom to the specific application, its data, etc. It is time consuming, expensive, and, as with any new development effort, potentially risky to custom develop automated tests.
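The spell check limitation noted above can be sketched in a few lines of Python. The word list is a tiny stand-in for a real dictionary, and the misread strings are made-up examples; the point is only that an OCR error producing a valid word passes the check while a non-word error is caught.

```python
# Tiny stand-in word list; a real spell checker would use a full dictionary.
DICTIONARY = {"the", "contract", "contact", "was", "signed"}


def misspelled(words):
    """Return the words that are not in the dictionary."""
    return [w for w in words if w.lower() not in DICTIONARY]


# "contract" misread as another valid word.
ocr_real_word_error = "the contact was signed".split()
# "contract" misread as a non-word.
ocr_non_word_error = "the contrnct was signed".split()

flagged_real = misspelled(ocr_real_word_error)
flagged_non_word = misspelled(ocr_non_word_error)

print("real-word error flagged:", flagged_real)
print("non-word error flagged:", flagged_non_word)
```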

I don't need accurate data. I'll use an automated "fuzzy" search to find my data.

Some applications are less sensitive to OCR errors, e.g., full text searches with the new "fuzzy" search engines, so users are contemplating using OCR without manual error correction. However, even fuzzy searches assume a significant level of accuracy in the data. How are you going to find "profit" when OCR reports it as "moiit"? Even if you do find it, will you ever get to it if its ranking is 303rd out of 350 documents? Fuzzy search engines are great for finding data with a couple of errors in it, but the tradeoff is that they also find many documents that are not relevant, so you have to dig through the top ranked documents looking for good matches. The larger the database, the more digging.
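The tradeoff can be seen with a simple similarity score. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in for a fuzzy search engine; real engines use their own matching and ranking methods, and the 0.6 cutoff and the sample words other than "profit" and "moiit" are assumptions made for illustration.

```python
import difflib


def similarity(query: str, candidate: str) -> float:
    """Similarity ratio between 0 and 1 from difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, query, candidate).ratio()


CUTOFF = 0.6  # assumed threshold for counting a word as a match

light_damage = similarity("profit", "proflt")  # one misread character
heavy_damage = similarity("profit", "moiit")   # the badly damaged example
unrelated = similarity("profit", "merit")      # a different, valid word

for label, score in [("proflt", light_damage),
                     ("moiit", heavy_damage),
                     ("merit", unrelated)]:
    verdict = "match" if score >= CUTOFF else "no match"
    print(f"{label}: {score:.2f} ({verdict})")
```

With this scoring, the lightly damaged word clears the cutoff but "moiit" does not. Lowering the cutoff far enough to catch "moiit" would also admit unrelated words such as "merit", which is the extra digging described above.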

In many ways, high accuracy OCR is even more relevant when you decide not to correct the OCR data. If you are going to trust the OCR output, you had better get the best OCR output you can.

Prime Recognition offers a lower cost engine for applications that do not need to manually correct OCR errors.

The net result is that large, well-funded applications often do use automated review technology, but they treat it as a way to better mark OCR errors, not as a technology that can automatically find and remove OCR errors.

Manual OCR Error Correction

Manual OCR error correction, the process of reviewing OCR results and correcting mistakes, cannot practically find and fix all OCR errors. It is too expensive to verify every single OCR character. Among other reasons, "every character" OCR editing is very difficult because OCR errors often look correct. For exarnple, the "m" in the word "example" at the beginning of this sentence is actually an "r" and an "n" placed right next to each other, a common OCR mistake. Unless the editor goes through the text very slowly and methodically, i.e., incurring a lot of labor cost, this type of error could easily be missed.

Therefore, OCR error correction is typically based on verifying characters that have been flagged as "suspicious" by the OCR engine. The problem with this approach is that not all OCR errors get flagged. The ISRI group at the University of Nevada-Las Vegas has shown that even specially customized "research" engines from the conventional OCR vendors can only mark up to 60% of their errors. Prime Recognition has found that non-customized, commercially available, engines from these vendors, set at their highest level of error marking, only mark approximately 36-48% of errors as suspicious characters.

No Substitute for Accurate OCR

Automated and manual OCR error correction can only find and fix a fraction of the errors that are created in OCR. Therefore, for the "cleanest" output data it is very important to start with the most accurate OCR data.

The calculations below show how Prime Recognition's engine, PrimeOCR, creates 65%, i.e., two thirds, fewer errors to start with, and marks a higher percentage of its errors as suspicious than conventional OCR. The net result is that 70-75% fewer errors are in the data after manual error correction. This means that many more users will now be able to capture data from paper documents using OCR.

Accuracy Calculations

Example

Assumptions	Notes
Average OCR accuracy rate is 98%	40 characters out of 2000 on a typical full text page will be wrong. This is a typical average error rate on "real world" documents in real production sites. Note that error rates are highly dependent on image quality.

Conventional OCR

Assumptions	Notes
38% of OCR errors are marked as "suspicious" characters.	"Suspicious" characters are reviewed by data entry clerks to find and correct OCR errors. Errors that are not marked as suspicious - 62% of all errors for the leading conventional OCR engine - do not get reviewed, and are included in the final output. Users must use logical checks in their mainframe, database, or other target application to find and reject the data that includes errors (if possible).
The total number of marked characters is 1.9 times the number of true OCR errors.	Again, this performance is based on the leading conventional OCR engine. It represents the more sensitive of the engine's two settings. If the sensitivity is reduced, fewer characters are marked, and fewer errors are caught.

Calculations

A 2000 character page would generate:

74 characters marked as suspicious
40 true errors
25 errors that were not marked as suspicious, and hence left in the data after manual error correction (40 * 62%)
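The arithmetic behind these figures can be reproduced with a short Python sketch. The variable names are illustrative, and the sketch assumes, as the example does, that every error marked as suspicious is corrected during manual review and that unmarked errors are never found.

```python
PAGE_CHARACTERS = 2000   # characters on a typical full text page
OCR_ACCURACY = 0.98      # average OCR accuracy assumed in the example
MARKED_SHARE = 0.38      # share of true errors marked as suspicious

# Errors produced by the engine on one page.
true_errors = PAGE_CHARACTERS * (1 - OCR_ACCURACY)

# Errors that are marked, and therefore assumed to be fixed by a reviewer.
marked_errors = true_errors * MARKED_SHARE

# Errors that are never marked and stay in the final data.
residual_errors = true_errors - marked_errors

print(f"true errors per page: {true_errors:.1f}")
print(f"errors marked and corrected: {marked_errors:.1f}")
print(f"errors left in the data: {residual_errors:.1f}")
```

The unrounded result is 24.8 errors left per page, which the list above rounds to 25.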

Prime Recognition High Accuracy OCR Engine

Assumptions	Notes
65% fewer errors are generated by PrimeOCR.
55% of OCR errors are marked as "suspicious" characters.
The total number of marked characters is equal to the number of marked characters by the conventional OCR engine.	Prime Recognition allows 9 different settings so that users can fine tune the tradeoff between marking errors vs. marking more characters. In this example we have configured the setting so that the number of characters marked is equal to the conventional OCR engine. This removes this variable as a difference between the engines.

Prime Recognition High Accuracy OCR Calculations

74 characters marked as suspicious. (Defined to be equal to conventional OCR engine)

14 true errors (65% fewer errors)

6.3 errors left in the data after manual error correction (14 * 45%)
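The same calculation can be run for both engines side by side. This Python sketch uses only the assumptions stated in this example (40 conventional errors per page, 65% fewer errors for PrimeOCR, and marking rates of 38% and 55%); the function and variable names are illustrative, and it again assumes that every marked error is corrected during review.

```python
def errors_left(true_errors: float, marked_share: float) -> float:
    """Errors remaining after review, assuming every marked error is fixed."""
    return true_errors * (1 - marked_share)


CONVENTIONAL_ERRORS = 40.0   # 2% of a 2000 character page
ERROR_REDUCTION = 0.65       # PrimeOCR generates 65% fewer errors

prime_errors = CONVENTIONAL_ERRORS * (1 - ERROR_REDUCTION)

conventional_left = errors_left(CONVENTIONAL_ERRORS, 0.38)
prime_left = errors_left(prime_errors, 0.55)

# Relative reduction in errors left after manual correction.
net_reduction = 1 - prime_left / conventional_left

print(f"PrimeOCR true errors: {prime_errors:.1f}")
print(f"conventional errors left: {conventional_left:.1f}")
print(f"PrimeOCR errors left: {prime_left:.1f}")
print(f"net reduction: {net_reduction:.1%}")
```

Under these assumptions the net reduction comes out just under 75%, with the same number of characters sent to manual review for both engines.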

Conclusions

1. Prime Recognition’s OCR engine generates far fewer errors than conventional OCR engines.

2. Prime Recognition’s OCR engine does a better job of marking its errors as suspicious, especially considering that it must mark its errors on a much smaller base.

3. The net result is roughly 75% fewer errors in the PrimeOCR data vs. conventional OCR engine data after manual error correction (6.3 vs. 25 errors per page in this example). This result is achieved with an "off the shelf" solution that works in any OCR environment. It does not require custom programming for every application’s unique data.

4. The chart below illustrates another way of looking at the performance of the PrimeOCR engine, this time running in its highest accuracy setting (the Level 6 option); the example so far has focused on standard Level 3 performance. The errors generated by the PrimeOCR engine, before spell check and manual error correction, are roughly the same as those of conventional OCR engines AFTER spell check and manual error correction. In other words, you could eliminate all error correction effort with the PrimeOCR Level 6 engine and still have the same accuracy as a conventional OCR engine WITH spell check and expensive manual error correction.

Bar graph indicating that PrimeOCR's OCR output includes fewer errors than cleaned up traditional OCR
PRIME RECOGNITION
High Accuracy OCR Engine
Copyright © 1996-2012
Prime Recognition
All rights reserved.