OCR F.A.Q.

Below are answers to some common questions.

Does PrimeOCR perform handprint recognition?

PrimeOCR is designed to read machine-printed characters. For handwriting recognition, try searching the web for ICR (Intelligent Character Recognition) or handprint vendors.

What is the best scanning resolution for OCR?

Most OCR engines, including the ones used by PrimeOCR, are optimized for 300 dpi images. Scanning at true 300 dpi optical resolution is very important. Scanning at a lower resolution and then using scanner software to increase the dpi afterward does nothing for OCR, because interpolation adds no new detail to the image. When the characters on an image are very small (point size of 4 or less), scanning at 400 dpi can improve character recognition. This again requires a scanner that supports true 400 dpi optical resolution.
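The effect of resolution on small type can be checked with simple arithmetic. The sketch below assumes the usual convention of 72 points per inch; the function name is illustrative and not part of PrimeOCR. It gives the approximate pixel height of the nominal font size, and actual glyphs are somewhat smaller than that.

```python
POINTS_PER_INCH = 72  # typographic convention: 72 points in one inch


def char_height_pixels(point_size: float, dpi: int) -> float:
    """Approximate pixel height of a font's nominal size at a scan resolution."""
    return point_size / POINTS_PER_INCH * dpi


# Compare how many pixels a 4-point font receives at common resolutions.
for dpi in (200, 300, 400):
    print(f"{dpi} dpi: {char_height_pixels(4, dpi):.1f} px")
# 200 dpi: 11.1 px
# 300 dpi: 16.7 px
# 400 dpi: 22.2 px
```

At 4 points, moving from 300 to 400 dpi gives each character roughly a third more pixels in height, which is why the higher optical resolution can help with very small type.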

What is the difference between Forms-based OCR and Full-Text OCR?

We are all familiar with standard paper forms. A typical form has a structured page layout that contains both static and variable information. If the variable information on the form has been filled in using machine-printed characters, the form is a candidate for Forms-based OCR. If each page you want to OCR always has the same form (i.e., the layout of text on every page is the same), you can create a zone "template" that OCR can use to extract the data you are looking for. Full-Text OCR just means that you intend to OCR the entire page, without prior zoning. In effect, the entire page is treated as a single zone. There are cases, however, when zoning is valuable even in a full-text environment (see below).
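As a rough illustration of the difference, a zone template can be thought of as a list of named rectangles, and full-text OCR as a single rectangle covering the whole page. The sketch below is an assumption made for illustration only: the Zone class, crop, and full_page_zone are hypothetical names, the image is a plain list of pixel rows, and none of this reflects PrimeOCR's actual template format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A named rectangle in pixel coordinates; right and bottom are exclusive."""
    name: str
    left: int
    top: int
    right: int
    bottom: int


def crop(image, zone):
    """Return the part of a 2D pixel grid (list of rows) covered by the zone."""
    return [row[zone.left:zone.right] for row in image[zone.top:zone.bottom]]


def full_page_zone(image):
    """Full-text OCR: the whole page is treated as one zone."""
    return Zone("page", 0, 0, len(image[0]), len(image))


# A forms template is a fixed list of zones reused on every page (made-up values).
invoice_template = [
    Zone("invoice_number", 1200, 150, 1600, 210),
    Zone("total_due", 1200, 1900, 1600, 1960),
]
```

With a template, only the cropped regions would be passed to recognition, and each result is already labeled with the field it came from.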

Why is Forms-based OCR difficult?

The difficulty with Forms-based OCR lies in reliably matching the zones in your template to the correct data on the page. Scanner feed problems, image stretch and skew, or even slight variations in the page layout of each form can cause "zone to data" misalignment. Techniques such as Forms ID, Registration, and Image Enhancement are all methods of addressing these problems.

Prime Recognition provides these techniques in its PrimeZone application on a custom basis. Manual zoning and template creation on a page-by-page basis are also available through the PrimeView application within our PrimeProof software. Image Enhancement, such as image deskew and despeckling, is a standard option in PrimeOCR.
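To make the idea of registration concrete, here is a minimal sketch that assumes the simplest case: the scanned page is shifted relative to the template but not skewed or stretched. It assumes a known landmark (for example, a corner of a printed box) has already been located on both the template and the scanned page. The function name and the coordinates are hypothetical, and this is not a description of how PrimeZone works.

```python
def register_zones(zones, template_anchor, page_anchor):
    """Shift every template zone by the offset between two anchor points.

    zones: dict of name -> (left, top, right, bottom) in template pixels.
    template_anchor, page_anchor: (x, y) of the same landmark on the
    template and on the scanned page.
    Handles translation only; skew and stretch need a fuller transform.
    """
    dx = page_anchor[0] - template_anchor[0]
    dy = page_anchor[1] - template_anchor[1]
    return {
        name: (left + dx, top + dy, right + dx, bottom + dy)
        for name, (left, top, right, bottom) in zones.items()
    }


template_zones = {"total_due": (1200, 1900, 1600, 1960)}
# The page fed slightly off: the landmark sits 12 px right and 7 px up.
shifted = register_zones(template_zones, (100, 100), (112, 93))
print(shifted)  # {'total_due': (1212, 1893, 1612, 1953)}
```

Without this correction, a zone drawn tightly around a field could clip the characters or pick up neighboring text on a page that fed through the scanner slightly off position.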

Why do I need to zone multi-column text before OCR?

Pages with multiple columns are common. You find them in newspapers, books, trade journals, and reports, to name a few. It is important to identify the columns through zoning if, after OCR, you intend to search on the data (e.g., using a fuzzy search engine on the data after it has been stored in a database) or if you need to preserve the look and feel of the original page. If you perform a search without separating the columns, hyphenated words that wrap across lines won't be found, because the text is read straight across the page and the two halves of the word end up separated by text from the other column. Similarly, without column separation, two columns of text on the same line will appear as a single line in a word processor.
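The search problem is easy to reproduce with plain strings. The sketch below uses made-up sample text and a deliberately naive hyphen-joining rule (it would also remove a genuine hyphen followed by a space), so treat it as an illustration rather than a production routine.

```python
# Text lines as they appear in each column of a two-column page.
left_column = ["Optical char-", "acter recog-", "nition works."]
right_column = ["Zoning keeps", "columns sep-", "arate."]


def read_across(left, right):
    """Unzoned page: each physical line runs across both columns."""
    return " ".join(f"{l} {r}" for l, r in zip(left, right))


def read_by_column(left, right):
    """Zoned page: the left column is read in full, then the right column."""
    return " ".join(left + right)


def dehyphenate(text):
    """Naively rejoin words that were split by a hyphen at a line end."""
    return text.replace("- ", "")


unzoned = dehyphenate(read_across(left_column, right_column))
zoned = dehyphenate(read_by_column(left_column, right_column))

print("character" in unzoned)  # False
print("character" in zoned)    # True
print(zoned)  # Optical character recognition works. Zoning keeps columns separate.
```

In the unzoned result the first half of "character" is followed by text from the right-hand column, so neither an exact search nor a simple hyphen repair can recover the word.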

How do I zone multi-column text?

If the text layout of the page is always the same, you can treat the page as if it were a form and perform Forms-based OCR (see above). A good example of this would be a book where each page always has the same number of columns.

If the page layout varies, as with a newspaper or trade journal, then you have several other choices:

Manual Zoning - This process involves viewing the image prior to OCR and drawing zones over the areas that you want to read. Prime Recognition's PrimeProof software includes an application that allows you to view images and draw zones using the mouse. The advantage of manual zoning is that you can specify exactly what to OCR and in what order. The disadvantage is that each page must be zoned individually by hand.

Automatic Zoning - PrimeOCR includes a zoning option that tries to automatically recognize blocks of text such as a paragraph or column. A zone is generated for each block, similar to the manual zoning process. However, since the automatic process cannot determine the sequence in which the text flows on the page, OCR results may not always be presented in the proper reading order. Automatic zoning will also never be as good as someone defining zones manually. But in situations where the image presents a clear delineation between columns and manual zoning is not economical, automatic zoning can provide a major improvement over full-page OCR in the searchability of recognized text.
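To show why a clear delineation between columns matters, here is one common textbook approach to finding columns: count the ink in each vertical pixel column and treat a sufficiently wide blank run as a gutter. This is a generic sketch under simplifying assumptions (a clean, deskewed, black-and-white page represented as rows of 0 and 1). The function name and the min_gutter parameter are hypothetical, and this is not a description of PrimeOCR's automatic zoning.

```python
def find_column_zones(page, min_gutter=2):
    """Return (left, right) spans of text columns; right is exclusive.

    page: list of rows, each a list of 0 (white) or 1 (ink).
    min_gutter: number of consecutive blank pixel columns that counts as
    a gap between text columns rather than a gap between characters.
    """
    width = len(page[0])
    # Vertical projection profile: total ink in each pixel column.
    ink_per_column = [sum(row[x] for row in page) for x in range(width)]

    spans = []
    start = None    # left edge of the text column being tracked
    blank_run = 0   # consecutive blank pixel columns seen since last ink
    for x, ink in enumerate(ink_per_column):
        if ink:
            if start is None:
                start = x
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gutter:
                # The text column ended where the blank run began.
                spans.append((start, x - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        spans.append((start, width - blank_run))
    return spans


page = [
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
]
print(find_column_zones(page))  # [(0, 3), (6, 9)]
```

When the gutter is narrow, noisy, or crossed by a headline or image, the blank run never reaches the threshold and the columns merge into one zone, which is one reason manual zoning remains more reliable.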