Prime Recognition Logo -  High Accuracy Optical Character Recognition
PrimeOCR Product Evaluation Report

Provided by: Aspen Systems Corporation

Overview

Aspen's Director of Systems & Technology reported on a test he had performed comparing a leading conventional OCR software product with PrimeOCR, the "Voting" OCR software solution from Prime Recognition. This highly detailed and well-executed pilot test shows that:

  • "over 66 more characters per page [were] converted correctly by the PrimeOCR engine vs. the (conventional leader) OCR engine", and

  • "The software cost would be paid for, under this scenario, after 4000 pages were OCRed."

Notes from the director’s report are presented below. These notes have been edited for brevity.

About Aspen Systems

Aspen Systems Corporation was founded in 1958 and has since grown into a 1,400-employee information management services company whose core competencies include:

  • survey design, data collection, and analysis

  • information center design and management

  • health data and information management

  • litigation support

  • information systems and telecommunications technical support

  • print and electronic publishing services

  • order fulfillment and distribution services


Tester Notes

Recently I devised and performed a test designed to evaluate a number of parameters which might affect the quality and throughput of the OCR process. The motivation for this testing was obtaining a 45-day evaluation copy of the Prime Recognition OCR software, which employs a voting engine methodology to improve accuracy.

The test was constructed in such a way as to test the following variables:

  • Good source material vs. bad source material

  • Kodak 500 vs. Fujitsu 3096 scanner

  • 200 DPI vs. 300 DPI scanning

  • Enhancement of scanned image data vs. use of raw image data

  • PrimeOCR Voting Engine vs. (conventional leader) OCR engine

The PrimeOCR software incorporates from 3 to 5 commercially available OCR engines as part of its process.

The Prime software provides an interface for controlling these engines, interprets the output from each, and intelligently arrives at what it considers the best possible result for each page. The Prime software can be tuned by means of Prime "string commands", which set attributes for the engine. Using these string commands, the operator can specify the confidence thresholds that determine whether each successive OCR engine is invoked.

Under normal operation, the OCR engines are invoked in sequence on a page-by-page basis. With the appropriate string command settings, successive engines will not be invoked if the measured accuracy level of the current engine either falls below a lower threshold (the quality of the material is so poor as to preclude accurate conversion) or rises above an upper threshold (the first engine or engines did such a good job that additional engines would yield no appreciable improvement).

Because there are five engines used as part of the OCR conversion process, the time to process a given page is approximately 5x longer than it would be using a single engine alone. However, this can be mitigated somewhat by use of string commands as described above.


Methodology

For the purposes of this test, I used a small sample of source hardcopy material. The sample was kept small to keep the test manageable and to allow manual inspection of the test results. There were two initial batches: Good and Bad source. The Good source material consisted of first-generation photocopies; some pages had mixed fonts, some had indented material, some had graphics, and some had numerical data.

The Bad source material consisted of two parts: one half of the pages were taken from the Good batch and successively photocopied until the text was very light (but still readable). The remaining pages were photocopies which included large amounts of tabular data, and mixed fonts and graphic data.

Each of these pages was scanned on both the Fujitsu and Kodak scanners, once at 200 DPI and once at 300 DPI. One half of the images were then image enhanced. Finally, the TIF files were passed through the (conventional OCR leader) and PrimeOCR engines, generating text files.

Both the Prime and (conventional OCR leader) OCR engines were run on similarly configured systems: Compaq DeskPro 400s running the Windows 95 operating system, with 64 MB of RAM and over 500 MB of free hard disk space. Image data was copied to the local hard drive prior to OCR conversion, and the conversion generated text files on the local hard drive.

As part of the conversion process, the following parameters were captured for each page:

  • OCR accuracy as reported by the engine

  • Number of characters and number of errors, as reported by the engine

  • Time to convert


Adjustments to Process and Measurements

When the test began, there were no string commands entered to modify the operation of the Prime software. As it became apparent that there would not be time to complete the test without some modification to reduce processing time, a string command was entered to establish thresholds for continuing to invoke OCR engines.

600,850

600,865

600,875

600,885

600,895

The numbers reflect overall page-level confidence levels for the conversion, on a scale of 0 to 900, with 900 being the highest confidence. With this command, the next engine in the sequence is not invoked if the output of the current engine is either below 600 (very poor) or above a laddered upper threshold that rises from 850 to 895.

As a result of this string command, the overall processing time was reduced.
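The stopping rule described above can be sketched in a few lines. This is an illustration under assumptions, not Prime's implementation: it assumes each of the five value pairs applies to one engine in sequence, that the comparisons are strict, and that a page-level confidence is available after each engine. The function name `engines_invoked` and the sample confidence values are hypothetical.

```python
# (lower, upper) page-confidence thresholds taken from the string command
# values in the report; mapping one pair to each engine is an assumption.
THRESHOLDS = [(600, 850), (600, 865), (600, 875), (600, 885), (600, 895)]


def engines_invoked(confidences, thresholds=THRESHOLDS):
    """Return how many engines run for one page.

    confidences[i] is a hypothetical page confidence (0-900) measured
    after engine i + 1 has processed the page.
    """
    used = 0
    for conf, (low, high) in zip(confidences, thresholds):
        used += 1
        # Stop if the page is hopeless (below low) or already good (above high).
        if conf < low or conf > high:
            break
    return used


print(engines_invoked([870, 880, 890, 895, 899]))  # stops after the first engine
print(engines_invoked([550, 560, 570, 580, 590]))  # stops after the first engine
print(engines_invoked([700, 870, 880, 890, 899]))  # stops after the second engine
print(engines_invoked([700, 800, 860, 880, 890]))  # all five engines run
```

Pages that are clearly good or clearly unreadable exit early, so only borderline pages pay the full multi-engine cost.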

The (conventional OCR leader) engine was determined to have a different problem. While it generated converted text as expected, the engine inserted a significant number of additional spaces into that text stream. While these extra spaces did not affect the final makeup of the resultant text document, they did inflate the total number of characters on the page, which consequently improved the reported OCR accuracy.

When the character counts reported by Prime and by (conventional OCR leader) were manually compared with the characters actually counted on the page, it was determined that the PrimeOCR engine was very accurate in its total page character count. The Prime-generated character counts were therefore used in calculating the statistics in this report.
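A small numeric sketch shows why extra spaces flatter the reported accuracy. It assumes accuracy is computed as correct characters divided by total characters; the report does not give the engines' actual formula, and the page figures below are hypothetical.

```python
def reported_accuracy(total_chars, errors):
    """Accuracy in percent, assuming accuracy = (chars - errors) / chars."""
    return 100.0 * (total_chars - errors) / total_chars


# Hypothetical page: 2000 real characters, 100 of them converted wrongly.
true_accuracy = reported_accuracy(2000, 100)

# Same page and same errors, but the engine also counts 500 inserted spaces.
inflated_accuracy = reported_accuracy(2000 + 500, 100)

print(true_accuracy)      # 95.0
print(inflated_accuracy)  # 96.0
```

The error count is unchanged, yet the larger denominator raises the reported figure, which is why a trusted character count matters when comparing engines.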


Results

The testing generated the following conclusions:

1) The quality of source material has the greatest impact on the resultant OCR accuracy. With the PrimeOCR engine, the test showed a 3.6% difference, and with the (conventional OCR leader) engine a 4.2% difference between the OCR accuracy for good and bad source material.

2) There does not appear to be any significant difference in the resultant OCR accuracy based upon DPI, scanner, or image enhancement. The overall differences in average OCR accuracy for the same data sets, with only one variable changed, are:

  • Scanner 0.16%

  • DPI 0.15%

  • Enhancement 0.09%

3) There is a significant improvement in OCR accuracy when using the PrimeOCR engine vs. the (conventional OCR leader) engine. Prime generates an average of 2.64% improvement in accuracy. While 2.64% may not sound like a lot, given that our average number of characters per page in this test was 2774, this translates to over 66 more characters per page converted correctly by the Prime OCR engine vs. the (conventional OCR leader) engine.
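As a quick check of the per-page figure, the two numbers quoted above can be multiplied directly. The variable names are illustrative; only the two input values come from the report.

```python
accuracy_gain = 2.64 / 100    # average accuracy improvement reported
avg_chars_per_page = 2774     # average characters per page in this test

extra_correct = accuracy_gain * avg_chars_per_page
print(round(extra_correct, 1))  # 73.2
```

The product comes to roughly 73 characters per page, which is consistent with the report's statement of "over 66".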


Recommendations

When the one-time cost of the PrimeOCR licenses is compared against the potential recurring costs of OCR correction, the Prime solution may be cost-effective. For example, assume that the Prime solution leaves 66 fewer characters per page to be corrected, and that a clerk can correct characters at the rate of 200 characters/hr. You would then receive a benefit of $3 per page OCRed by not having to correct these additional characters. Under this scenario, the software cost would be paid for after 4000 pages were OCRed.
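The break-even reasoning can be laid out step by step. The four inputs are the figures quoted in the report; the hourly labor cost and the software cost are not stated there, so the values computed below are only what those figures imply.

```python
# Figures quoted in the report.
chars_saved_per_page = 66     # characters per page that need no correction
correction_rate = 200         # characters a clerk corrects per hour
benefit_per_page = 3.00       # dollars saved per page OCRed
breakeven_pages = 4000        # pages after which the software is paid for

# Values implied by those figures (not stated in the report).
hours_saved_per_page = chars_saved_per_page / correction_rate
implied_hourly_cost = benefit_per_page / hours_saved_per_page
implied_software_cost = benefit_per_page * breakeven_pages

print(f"{hours_saved_per_page:.2f} clerk hours saved per page")
print(f"${implied_hourly_cost:.2f} implied labor cost per hour")
print(f"${implied_software_cost:,.0f} implied software cost")
```

Changing the correction rate or the per-page benefit shifts the break-even point proportionally, so the same sketch can be rerun with local labor figures.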
