. | High Accuracy OCR - Cleaner Data Reduces OCR Errors

Overview

More than half of OCR errors are not flagged as suspicious by conventional OCR engines. These errors cannot be found cost-effectively by manual review, and are often missed by other error correction technology, so they flow through to the end user's application. For most applications, errors in the application data are extremely costly.

This document shows how Prime Recognition's High Accuracy OCR engine, called PrimeOCR, can reduce the number of errors left in recognized data, AFTER manual error correction, by more than 75%.

OCR error rates vary widely with image quality, font types, etc. This analysis uses a single, relatively typical example to make it easy to follow. Feel free to contact Prime Recognition for a more elaborate model of error rates, or to ask how your particular situation may differ from this model.

For many applications this analysis will understate the quality improvement from using Prime Recognition's OCR engine. This analysis uses Prime Recognition's entry level engine, which produces 65% (roughly two thirds) fewer errors than conventional OCR. Prime Recognition also offers higher accuracy options that produce 82% (roughly four fifths) fewer errors than conventional OCR engines.

Image Conversion Alternatives

An image of a document, i.e., a piece of paper converted into pixels in computer memory, is of little value unless you also electronically capture information about the image's content. Ideally you want to capture all the text that appears on the document. The fast growth in recent years of imaging systems for automated processing of insurance forms, medical claims, legal documents, and other types of data on paper suggests that there is tremendous value in electronically capturing the text information of an image.

Currently, the most common way to capture this information is multiple pass manual data entry. Multiple passes (i.e., typing in the same text 2-3 times, comparing the results, and fixing the discrepancies) are required because a single pass is not accurate enough for most applications. A common accuracy target is 99.95%, or 0.5 errors per 1,000 characters. Three pass manual data entry can usually achieve this accuracy, but it is very expensive because of the cost of labor.

OCR is a popular replacement for manual data entry because it is significantly less expensive. However, OCR is less accurate than multiple pass data entry, and sometimes less accurate than single pass data entry, even after OCR error correction. Since many applications require high data accuracy, many users cannot use OCR technology.

Why is OCR less accurate than Manual Data Entry?

Or, put another way, why not just review all OCR output, find the errors, and correct them? The answer lies in the limitations of the available error correction technology, which includes automated and manual techniques.

Automated OCR Error Correction

There are applications where it has been possible to automate a partial review of OCR data. For example, some data lends itself to automated review, such as a spell check on English text, mathematical checks on numeric tables, or table lookups on a "State Code" field on a form.
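As a rough illustration of the kind of automated review described above, the following Python sketch applies a table lookup to a "State Code" field and a simple mathematical check to a column of numbers. The field names, the list of valid codes, and the function names are all hypothetical and chosen only for this example; they are not part of any Prime Recognition product.

```python
# Hypothetical lookup table: a small subset of state codes, for illustration only.
VALID_STATE_CODES = {"CA", "NV", "NY", "TX", "WA"}


def check_state_code(value):
    """Return True if the recognized field matches a known state code."""
    return value.strip().upper() in VALID_STATE_CODES


def check_column_total(values, stated_total, tolerance=0.005):
    """Return True if the recognized numbers add up to the recognized total."""
    return abs(sum(values) - stated_total) <= tolerance


def review_record(record):
    """Return the names of fields that fail an automated check."""
    suspicious = []
    if not check_state_code(record["state"]):
        suspicious.append("state")
    if not check_column_total(record["amounts"], record["total"]):
        suspicious.append("amounts")
    return suspicious


# "NU" is not a valid code, so the field is marked for manual review.
record = {"state": "NU", "amounts": [12.50, 7.25, 3.00], "total": 22.75}
print(review_record(record))
```

Note that a check like this can only mark a field as suspicious; it cannot say what the correct value is. It also passes any error that happens to produce another valid value, for instance "NV" misread as "NY".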
The net result is that large, well funded applications often do use automated review technology, but this technology is seen as a way to better mark OCR errors, not as a technology that can automatically find and remove OCR errors.

Manual OCR Error Correction

Manual OCR error correction, the process of reviewing OCR results and correcting mistakes, cannot practically find and fix all OCR errors. It is too expensive to verify every single OCR character. Among other reasons, "every character" OCR editing is very difficult because OCR errors often look correct. For exarnple, the "m" in the word "example" at the beginning of this sentence is actually an "r" and an "n" placed right next to each other, a common OCR mistake. Unless the editor goes through the text very slowly and methodically, i.e., incurring a lot of labor cost, this type of error could easily be missed.

Therefore, OCR error correction is typically based on verifying characters that have been flagged as "suspicious" by the OCR engine. The problem with this approach is that not all OCR errors get flagged. The ISRI group at the University of Nevada-Las Vegas has shown that even specially customized "research" engines from the conventional OCR vendors can mark at most 60% of their errors. Prime Recognition has found that non-customized, commercially available engines from these vendors, set at their highest level of error marking, mark only approximately 36-48% of errors as suspicious characters.

No Substitute for Accurate OCR

Automated and manual OCR error correction can only find and fix a fraction of the errors that are created in OCR. Therefore, for the "cleanest" output data it is very important to start with the most accurate OCR data. The calculations below show how Prime Recognition's engine, PrimeOCR, creates 65% (roughly two thirds) fewer errors to start with, and marks a higher percentage of its errors as suspicious than conventional OCR. The net result is that 70-75% fewer errors are in the data after manual error correction. This means that many more users will now be able to capture data from paper documents using OCR.

Accuracy Calculations Example
Conventional OCR
Calculations A 2000 character page would generate:
Prime Recognition High Accuracy OCR Engine
Prime Recognition High Accuracy OCR Calculations

74 characters marked as suspicious. (Defined to be equal to conventional OCR engine)
14 true errors (65% fewer errors)
6.3 errors left in the data after manual error correction (14 * 45%)

Conclusions

1. Prime Recognition's OCR engine generates far fewer errors than conventional OCR engines.
2. Prime Recognition's OCR engine does a better job of marking its errors as suspicious, especially considering that it must mark its errors on a much smaller base.
3. The net result is 75% fewer errors in the PrimeOCR data vs. conventional OCR engine data after manual error correction. This result is achieved with an "off the shelf" solution that works in any OCR environment. It does not require custom programming for every application's unique data.
4. The chart below illustrates another way of looking at the performance of the PrimeOCR engine running in its highest accuracy setting, the Level 6 option (the example so far has focused on standard Level 3 performance). The errors generated by the PrimeOCR engine, before spell check and manual error correction, are roughly the same as conventional OCR engines AFTER spell check and manual error correction. In other words, you could eliminate all error correction effort with the PrimeOCR Level 6 engine and still have the same accuracy as a conventional OCR engine WITH spell check and expensive manual error correction. |
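The arithmetic in the Accuracy Calculations Example can be reproduced with a short Python sketch. It assumes, as the example does, that manual correction fixes every flagged error and none of the unflagged ones. The conventional error count of 40 is not stated in the text; it is derived here from the statement that 14 errors is 65% fewer. The 36-48% marking rates are the ones quoted earlier for commercially available engines, and the function and variable names are illustrative only.

```python
def errors_left(true_errors, flag_rate):
    """Errors remaining if reviewers fix every flagged error and nothing else."""
    return true_errors * (1.0 - flag_rate)


PRIME_ERRORS = 14             # from the text: true errors on a 2000 character page
PRIME_FLAG_RATE = 0.55        # implied by the text: 45% of errors stay unmarked
PRIME_ERROR_REDUCTION = 0.65  # from the text: 65% fewer errors than conventional

# Derived assumption: 14 errors being 65% fewer implies 14 / 0.35 = 40
# true errors for the conventional engine on the same page.
conventional_errors = PRIME_ERRORS / (1.0 - PRIME_ERROR_REDUCTION)

prime_left = errors_left(PRIME_ERRORS, PRIME_FLAG_RATE)

# Range of marking rates quoted for commercially available conventional engines.
for flag_rate in (0.36, 0.48):
    conventional_left = errors_left(conventional_errors, flag_rate)
    reduction = 1.0 - prime_left / conventional_left
    print(f"conventional marking rate {flag_rate:.0%}: "
          f"{conventional_left:.1f} errors left vs. {prime_left:.1f} for PrimeOCR "
          f"({reduction:.0%} fewer)")
```

Under these assumptions the conventional engine leaves roughly 21 to 26 errors on the page against 6.3 for PrimeOCR, which is consistent with the 70-75% reduction quoted above.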