Status Report on OCR

February 5, 2009 — Eric Böhnisch-Volkmann

Just a quick update on our progress on the OCR side. While ABBYY was unable to reproduce the bug that affects the conversion of a multi-page PDF into separate images, we have now written a workaround that performs this conversion ourselves before feeding the data into the OCR engine. This also has other advantages, e.g. the ability to run OCR on very large documents without running into memory problems. We are now writing the new thread handling that this workaround requires and hope to begin internal testing tonight or tomorrow morning.
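To illustrate the general idea only: the sketch below is not our plugin code, just a minimal Python model of page-by-page processing with a bounded queue and one worker thread. The names `split_pages`, `recognize_page`, and `ocr_document` are hypothetical stand-ins; in particular, `recognize_page` fakes the OCR engine call and `split_pages` fakes the PDF-to-image conversion. The point is that only a few page images are held in memory at any time, regardless of document length.

```python
import queue
import threading


def split_pages(page_count):
    """Hypothetical splitter: yields one page image at a time (here just a label)."""
    for number in range(1, page_count + 1):
        yield f"page-image-{number}"


def recognize_page(image):
    """Hypothetical stand-in for the call into the OCR engine."""
    return f"text of {image}"


def ocr_document(page_count, max_pending=2):
    # Bounded queue: at most max_pending page images wait in memory at once.
    pending = queue.Queue(maxsize=max_pending)
    results = []

    def worker():
        while True:
            image = pending.get()
            if image is None:  # sentinel: no more pages
                break
            results.append(recognize_page(image))

    thread = threading.Thread(target=worker)
    thread.start()
    for image in split_pages(page_count):
        pending.put(image)  # blocks while the queue is full
    pending.put(None)
    thread.join()
    return results
```

With a single worker the recognized pages come back in document order; a design with several workers would need to tag each page with its number and reassemble the results afterwards.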

Update: Our workaround is working and we have just begun to test it on a number of different machines in our labs. Thanks to the new architecture, the plugin is chewing through a 40-page document here at the moment with moderate memory requirements, and queueing seems to work as expected. OCR results on both German- and English-language documents seem to be much better than with IRIS and contain fewer garbled characters, even though some tests show that the ABBYY engine seems to drop unrecognized words completely. We are curious to read your reports once we have released the new OCR plugin.

The DEVONtechnologies Blog
