Global Big Data Conference


Is Data a Differentiator for Your Business? If So, Traditional OCR Cannot Be the Answer

Posted on: Oct 22, 2021

If your business is driven by data, Optical Character Recognition (OCR) — as most of us know it — is not the answer.

For those of you who view OCR as an industry staple for document processing, let me explain.

OCR as a technology has been around for decades, and it still has its place in processing unstructured document formats such as PDFs, images, and other text formats that cannot be edited digitally. Users can quickly convert those files into editable documents. In short, it’s a terrific technology for editing and searching files that may have been “frozen.”

Using OCR makes editing easier and, more importantly, it reduces human error and is a start toward eliminating monotonous data processing tasks (think insurance forms, healthcare documents, and financial service reports). Although it remains an important enterprise tool, OCR doesn’t scale to meet more robust needs. Extracting usable structured data from documents involves multiple steps. It’s often assumed that OCR addresses all of those steps, but it doesn’t. OCR does not address text extraction, nor does it address post-processing, the step that eliminates wrong outputs and ensures better-quality results. In addition, while OCR is effective with printed text, its accuracy in reading handwritten text is inconsistent, a gap that must be addressed through other technology.

OCR only addresses the conversion of pixels to characters of text. It’s a common industry misconception that it also handles the two steps that follow. The first is text extraction, the process of identifying text and matching it to a desired field. The second is post-processing, which refines the extracted text further through normalization (e.g., converting “1, 23 4” to the number 1234), concatenation (e.g., combining “231” “Avery” “Lane” into “231 Avery Lane”), and interpretation (e.g., using named entity recognition to extract the diagnosis from a paragraph of text). These two steps are critical for modern businesses, which are constantly looking to leverage their data as a differentiator.
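To make the distinction concrete, here is a minimal Python sketch of what happens after OCR has produced raw characters. It is an illustration under stated assumptions, not how any particular product works: the function names (extract_field, normalize_number, concatenate_tokens) and the "Label: value" layout of the sample text are hypothetical, and the interpretation step (named entity recognition) is left out because it requires a trained model.

```python
import re
from typing import List, Optional


def extract_field(ocr_text: str, label: str) -> Optional[str]:
    """Text extraction (hypothetical rule): find the value after 'label:' on a line.

    Assumes the document uses a simple 'Label: value' layout; real documents
    usually need far more flexible matching than this.
    """
    pattern = rf"^{re.escape(label)}\s*:\s*(.+)$"
    match = re.search(pattern, ocr_text, flags=re.MULTILINE | re.IGNORECASE)
    return match.group(1).strip() if match else None


def normalize_number(raw: str) -> int:
    """Normalization: drop stray spaces and commas, then parse as an integer.

    Handles whole numbers only; decimals would need separate treatment.
    """
    digits = re.sub(r"[^\d]", "", raw)
    if not digits:
        raise ValueError(f"no digits found in {raw!r}")
    return int(digits)


def concatenate_tokens(tokens: List[str]) -> str:
    """Concatenation: join separately recognized tokens into one field value."""
    return " ".join(t.strip() for t in tokens if t.strip())


if __name__ == "__main__":
    # Made-up OCR output, used only to exercise the functions above.
    ocr_text = "Name: Jane Doe\nAmount : 1, 23 4\n"
    raw_amount = extract_field(ocr_text, "amount")
    print(normalize_number(raw_amount))
    print(concatenate_tokens(["231", "Avery", "Lane"]))
```

None of these functions recognizes a single character; they all operate on text that OCR has already produced, which is why they are separate steps that OCR alone does not cover.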

Making The Most of Your OCR Investment with AutoML

To this point, you’ve no doubt noticed I’ve identified a lot of challenges for you to consider. This isn’t to instill fear, but rather to clear up some of the confusion around what AI technologies can and cannot do. That confusion still leads to much of the failure we see with AI initiatives: a company spends a lot of money on a pre-trained AI solution, spends a lot of time and money on data science resources to implement it, and ultimately finds out that the solution is inadequate for the problem the business has.

One viable solution has emerged: Automated Machine Learning (AutoML), a nascent artificial intelligence (AI) technology that opens the power of machine learning (ML) to a much broader audience than data scientists and technologists. Instead of preparing for every possible scenario ahead of time, AutoML learns from your documents. It offers the benefits of an AI-driven approach for any set of documents you want to work with. You are not limited to the documents a product has encountered ahead of time and, as your own documents and data evolve, so do the algorithms that interpret them.
