Speaker: Eitan Anzenberg



Beyond OCR: Using deep learning to understand documents


Although the field of optical character recognition (OCR) has been around for half a century, document parsing and field extraction from images remain an open research topic. We utilize an end-to-end deep learning architecture to predict regions of interest within documents and automatically extract their text.
Extracting key fields from a variety of document types remains a challenging machine learning problem. Services such as AWS and Google Cloud provide text extraction products to "digitize" images or PDFs. These return phrases, words and characters with their corresponding coordinate locations. Working with these outputs is difficult and does not scale, because different document types require different heuristics and new types are uploaded daily. Furthermore, even if those heuristics worked perfectly, performance would hit a ceiling set by the accuracy of the underlying OCR service.
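To illustrate the kind of heuristic this implies, here is a minimal sketch in Python. It assumes a simplified OCR output (a list of words with pixel coordinates); the `Word` type and the `value_right_of` helper are hypothetical names for illustration, not part of any OCR service's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """One OCR token with its bounding box (hypothetical, simplified format)."""
    text: str
    left: float
    top: float
    width: float
    height: float


def value_right_of(words: List[Word], label: str) -> Optional[str]:
    """Return the nearest word to the right of `label` on the same text line."""
    # Match the label case-insensitively, ignoring a trailing colon.
    anchors = [w for w in words if w.text.lower().rstrip(":") == label.lower()]
    if not anchors:
        return None
    anchor = anchors[0]
    center_y = anchor.top + anchor.height / 2
    # Keep words that start past the label's right edge and whose
    # vertical extent contains the label's vertical center.
    candidates = [
        w for w in words
        if w is not anchor
        and w.left >= anchor.left + anchor.width
        and w.top <= center_y <= w.top + w.height
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.left).text


# Made-up coordinates for a tiny invoice-like layout.
words = [
    Word("Invoice", 10, 10, 60, 12),
    Word("Total:", 10, 40, 40, 12),
    Word("$42.00", 120, 41, 50, 12),
    Word("Date", 10, 70, 30, 12),
]
result = value_right_of(words, "total")
```

A rule like this works only while the value sits to the right of its label on the same line. A layout that places the value below the label needs a different rule, which is why per-document-type heuristics are hard to maintain.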
We propose an end-to-end, scalable solution that combines deep learning with OCR to automatically extract important text fields from documents. Computer vision algorithms based on deep learning achieve state-of-the-art classification accuracy and generalizability through training on millions of images. Region proposals are generated by off-the-shelf OCR engines, including Tesseract. We compare the accuracy of the in-house model with that of third-party OCR services.
We are working to build a paperless future. We parse millions of documents a year, including invoices, contracts, receipts and a variety of other types. Understanding those documents is critical to building intelligent products for our users.
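As a rough illustration of how OCR output could be turned into region proposals, the sketch below greedily merges word-level boxes into line-level regions. This is an assumption about one possible approach, not the method used in the talk; the `merge_into_lines` function, its `min_overlap` parameter and the box coordinates are all hypothetical.

```python
from typing import List, Tuple

# A box is (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def merge_into_lines(boxes: List[Box], min_overlap: float = 0.5) -> List[Box]:
    """Greedily merge word boxes that overlap vertically into line regions."""
    lines: List[Box] = []
    # Visit boxes top to bottom, then left to right.
    for left, top, right, bottom in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, (l, t, r, b) in enumerate(lines):
            # Vertical overlap relative to the shorter of the two boxes.
            overlap = min(bottom, b) - max(top, t)
            shorter = min(bottom - top, b - t)
            if shorter > 0 and overlap / shorter >= min_overlap:
                # Grow the existing line region to cover this word.
                lines[i] = (min(l, left), min(t, top), max(r, right), max(b, bottom))
                break
        else:
            # No existing line matched, so start a new region.
            lines.append((left, top, right, bottom))
    return lines


# Made-up word boxes: two words on one line, one word on the next.
word_boxes = [(10, 10, 50, 22), (60, 11, 100, 23), (10, 40, 80, 52)]
proposals = merge_into_lines(word_boxes)
```

Each resulting region could then be cropped from the image and passed to a classifier that decides whether it contains a field of interest.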


I'm a Chief Data Scientist with many years of experience as a scientist and researcher. My recent focus is on machine learning, deep learning, applied statistics and engineering. Previously, I was a Postdoctoral Scholar at Lawrence Berkeley National Lab. I received my PhD in Physics from Boston University and my B.S. in Astrophysics from the University of California, Santa Cruz. I have 3 patents and 11 publications to date and have spoken about data at various conferences around the world.