Capture Automation software typically deals with 3 kinds of documents:
- Structured: These are forms with a fixed layout and boxes or underlines where someone fills in the data with a pen.
- Semi-Structured: Documents with a broadly similar layout that still varies from sender to sender, such as invoices or purchase orders.
- Unstructured: Documents with no consistent layout structure at all, like contracts or even emails.
Structured forms have been somewhat sidelined during the recent AI hype, which focuses mostly on semi-structured and unstructured documents. And yes, admittedly, that is also where AI shines the most.
But have we really seen the end of what’s possible for forms processing? Are there even still hand-printed paper forms to be processed and how are they recognized? How will forms be recognized in the future? Let’s take a look.
Are there still paper forms?
Yes. Despite forms being provided to people as PDFs more and more often, in most cases they get printed and sent back on paper anyway. The reason for that is mostly the signature. Most companies sending out forms to their customers are not equipped to provide electronic signature functionality. So even if a person fills the form on the computer, they still need to print it to apply a physical signature. Growing government regulation also leads to more forms being sent. This is especially true in banking and insurance, where we see most of the forms volume in the incoming mail.
So all in all, while more forms have the field data machine-printed, the volume of forms reaching your company on paper is still high.
The classic approach to forms processing is very design-intensive
The problem with forms is not that their structure is unknown; a given form looks essentially the same every time. The actual problem is twofold:
- There is a lot of data on these forms, and the data fields are close together.
- The fields are mostly filled with handwriting.
The classic approach to set up a capture automation software to read forms is roughly like this:
- Classify the form so you know which one you are dealing with.
- Draw a rectangle around each data field and name the field.
- For each field, fine-tune image perfection (remove the form background such as lines and boxes, optimize the handprint pen stroke, etc).
- Select a handprint recognition engine, if the capture software of your choice offers multiple ones.
- Fine-tune the engine settings per field, because the engine benefits from knowing to expect a numeric field or an alphanumeric field etc.
- If you can, set up multiple engines to read the same field and configure voting (voting compares 2 different extraction results character by character and delivers a more reliable joint result).
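The voting step in the last bullet is easy to picture in code. Here is a minimal sketch (the function name and data shapes are our own illustration, not any capture product's actual API) that compares two engines' character-level results, accepts characters where both engines agree, and flags disagreements for manual review:

```python
def vote(result_a, result_b):
    """Compare two OCR/ICR results for the same field character by character.

    Each result is a list of (char, confidence) tuples. Where the engines
    agree, the character is accepted with the higher of the two confidences;
    where they disagree, the character is marked uncertain ('?') so a human
    can review it. A real product is more sophisticated (alignment for
    inserted/dropped characters, per-engine weighting), but this is the idea.
    """
    voted = []
    for (ch_a, conf_a), (ch_b, conf_b) in zip(result_a, result_b):
        if ch_a == ch_b:
            voted.append((ch_a, max(conf_a, conf_b)))
        else:
            voted.append(("?", 0.0))
    return voted


# Two engines reading the same handwritten amount field ("150"),
# where engine B confuses the handwritten 5 with an S:
engine_a = [("1", 0.9), ("5", 0.4), ("0", 0.8)]
engine_b = [("1", 0.8), ("S", 0.6), ("0", 0.9)]
print(vote(engine_a, engine_b))  # [('1', 0.9), ('?', 0.0), ('0', 0.9)]
```

This is also why voting improves reliability: a character only passes silently if two independent recognizers agree on it.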
No, wait, not done! Because of the density of most forms, even the slightest variation in how the image was scanned can cause those rectangles you painstakingly drew to be located slightly off. Or way off.
Let’s say you defined a field to be located at point x,y with a certain width and height. If your incoming forms are scanned, every image will be slightly shifted, a bit skewed, and, depending on the scanner, sometimes even stretched. So on every single runtime image, the actual fields sit several pixels away from where you defined them. If the image is skewed, the defined field height no longer fits, because the rotation makes the field’s bounding box taller. Stretching effects can make the field wider or narrower than you defined it. All in all: a mess.
Because of that, most capture products provide anchoring functionality of various kinds. In some cases, you need to define anchors manually, in some cases they are automatically detected. The point of anchors is that the software can find them reliably even if the image is distorted. Once the anchors are found, the reading areas for each field can be adjusted.
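Under the hood, this adjustment amounts to fitting an affine transform. A hedged sketch with NumPy (the anchor coordinates are invented for illustration): given where the anchors sit in the design-time template and where they were actually found on the scanned image, a least-squares fit recovers the combined shift, skew, and stretch, which can then be applied to every field rectangle:

```python
import numpy as np

# Design-time anchor positions (x, y) and where they were found on the
# scanned image. These coordinates are made up for the example.
designed = np.array([[100, 50], [900, 60], [110, 1200]], dtype=float)
found    = np.array([[108, 57], [905, 75], [112, 1210]], dtype=float)

# Fit an affine transform: found ≈ [x, y, 1] @ params.
# Augmenting with a column of ones makes the translation part of the fit.
X = np.hstack([designed, np.ones((3, 1))])          # shape (3, 3)
params, *_ = np.linalg.lstsq(X, found, rcond=None)  # shape (3, 2)

def adjust(point):
    """Map a design-time coordinate to its expected position on the scan."""
    x, y = point
    return np.array([x, y, 1.0]) @ params

# A field corner defined at (200, 300) at design time is expected here:
print(adjust((200, 300)))
```

With three anchors the fit is exact; with more anchors, least squares averages out small detection errors, which is why products that auto-detect many anchors tend to be more robust.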
Another major problem arises if you receive forms whose fields can be filled with either machine-print (if the person filled the form on the computer) or handprint (if they printed the form and filled it with a pen). If a field can be both, how do you select and tune the OCR/ICR engine for that field? There are 2 options in this case:
- You can define to read the form (or each zone) twice and apply a voting technique.
- Or you always select the ICR engine and hope that it works for machine-printed data as well because hey, isn’t machine-print just like well-written handprint?
As you can imagine this is all very error-prone. Even if the reading area ends up in the right place, you still have to rely on a handprint recognition engine, which usually has a significantly lower accuracy than engines for machine-print.
Modern forms processing approaches blur the lines between document types
The classic approach is what it is because of the history of handprint recognition (ICR) engines.
If you want to know more about how ICR and OCR engines work and how they were invented, check out this post.
All the manual setup described above is necessary to squeeze the last percentage points of accuracy out of these aging engines by perfecting and positioning the reading zone as much as possible before letting the engine loose on it.
What if these reading areas were not necessary anymore? Wouldn’t it be great if you could just read a handprint-filled form with a full-page recognition engine just as if it were all machine-printed? Some engines take this approach, among them Microsoft Azure Read and Google Vision OCR, both in the cloud.
In this post we introduce you to the most popular OCR engines out there and how they work.
These engines don’t care about handprint or machine-print and just read the page. That may not always be perfect but they are getting there and they do get better with every release. The result is just a full-page recognition output (words with coordinates relative to the page) as if it were all machine-printed.
With that, you can now apply any data extraction technique that you would normally apply to an invoice or other semi-structured document. You can use regular expressions and keywords, you can use machine learning if the capture product supports it, or you can keep assigning reading areas.
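As a hedged illustration of the keyword-and-regex route (the word/coordinate format below is our own simplification, not any vendor's actual response schema), extracting a field from a full-page result can be as simple as locating a keyword and reading the nearest word to its right that matches an expected pattern:

```python
import re

# Simplified full-page OCR output: words with the (x, y) of their top-left
# corner. Real engines return richer structures (lines, confidence, etc.).
words = [
    {"text": "Customer",   "x": 50,  "y": 100},
    {"text": "No.:",       "x": 130, "y": 100},
    {"text": "4711-B",     "x": 180, "y": 100},
    {"text": "Date:",      "x": 50,  "y": 140},
    {"text": "2024-05-01", "x": 110, "y": 140},
]

def extract(words, keyword, pattern, y_tolerance=10):
    """Find the keyword, then return the leftmost word on the same line,
    to its right, whose text matches the given regular expression."""
    for kw in (w for w in words if w["text"] == keyword):
        candidates = [
            w for w in words
            if abs(w["y"] - kw["y"]) <= y_tolerance
            and w["x"] > kw["x"]
            and re.fullmatch(pattern, w["text"])
        ]
        if candidates:
            return min(candidates, key=lambda w: w["x"])["text"]
    return None

print(extract(words, "No.:", r"[\dA-Z\-]+"))           # 4711-B
print(extract(words, "Date:", r"\d{4}-\d{2}-\d{2}"))   # 2024-05-01
```

Note that no per-field reading rectangle appears anywhere in this sketch: the keyword itself acts as the anchor, so shifted or skewed scans matter far less.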
We think this is the future of forms processing. While ICR engines have also improved in recent years, forms processing remains difficult because of the image distortion and other issues described above. So we think 2 things will happen:
- Full-page engines like Microsoft Azure Read and Google Vision OCR will become even better, making it unnecessary to isolate the handprint areas and deal with them differently compared to the rest of the page.
- Companies will work on their forms to make them easier to process. Form design will increasingly take into account how the forms will be processed when they are returned on paper: fields will be better spaced, fewer lines and boxes that have to be removed will be used, and barcodes will more frequently store customer-specific data on the form so it doesn’t have to be read again from the scanned document.