Open-source OCR Technologies (or lack thereof) @ #cwithb

So I could post about how I’ve completely ignored this blog for months… but that would take time to explain my new job and all that jazz. I’d rather post about how baffling it is that, as old as OCR tech is, there just aren’t any good open-source libraries available that are even comparable to commercial libraries. Now I understand that comparing open-source projects (or just closed-source freeware) to commercial projects is sometimes night and day, but in my experience, especially over the past 5 or so years, it’s almost like MOST open-source or freeware projects are at least a great subset of the commercial versions… if not better in some regards.

Let’s start with the most frequently mentioned OCR library out there, Tesseract. I see on the Google project site that this was one of the top 3 OCR engines in 1995. I feel sorry for people in 1995. I’ve got experience with commercial OCR engines like LeadTools and even simple ones like Microsoft Office Document Imaging (MODI), and they’re much better than Tesseract.

Then there’s SimpleOCR. It’s closed source but freeware. Thanks to those guys for providing a freeware and royalty-free SDK but man, it’s absolutely terrible. In their defense, I don’t think it is maintained anymore (it doesn’t even support color), but I could take a screenshot of the letter S and this library would tell me it’s a T. Bleh. I’ve heard MODI is powered by the same engine as SimpleOCR, but if that were true you’d expect similar results, and you don’t get them.
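For anyone who wants to put a number on "much better" or "absolutely terrible", one common yardstick is character error rate: the edit distance between what the engine returned and what the image actually says, divided by the length of the correct text. Here is a minimal sketch in plain Python. The function names are my own and are not part of any OCR library, and the sample strings are made up for illustration, not real engine output.

```python
def edit_distance(expected: str, actual: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning `expected` into `actual`."""
    # previous[j] holds the distance between the first i-1 characters
    # of `expected` and the first j characters of `actual`.
    previous = list(range(len(actual) + 1))
    for i, exp_char in enumerate(expected, start=1):
        current = [i]
        for j, act_char in enumerate(actual, start=1):
            cost = 0 if exp_char == act_char else 1
            current.append(min(
                previous[j] + 1,         # drop a character from `expected`
                current[j - 1] + 1,      # add a character from `actual`
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(expected: str, actual: str) -> float:
    """Edit distance normalized by the length of the correct text."""
    if not expected:
        raise ValueError("expected text must not be empty")
    return edit_distance(expected, actual) / len(expected)


# The S-read-as-T case from above: one substitution in one character.
print(character_error_rate("S", "T"))  # 1.0

# Made-up strings, only to show how two engines could be compared
# against the same hand-typed ground truth.
truth = "privileged client data"
print(character_error_rate(truth, "privileged client data"))
print(character_error_rate(truth, "priviIeged cIient dala"))
```

Running every engine over the same set of images with hand-typed ground truth and averaging this score would turn the gut feeling into something comparable, though it says nothing about layout or speed.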

Now here’s one I want to try: Microsoft Research Project Hawaii, OCR in the Cloud. I don’t know how fully featured or how dependable this is since it’s a research project, but as soon as I get some free time, I’m gonna try it out. I’ve tried WiseTREND’s OCR Cloud service and I love it… it’s fast and super accurate (uses their ABBYY engine) and also supports a lot of file types… but if I want to distribute an open-source desktop application that a) doesn’t require internet connectivity and b) more importantly doesn’t require per-document pricing, WiseTREND is obviously not an option.

Just a rant on OCR products… maybe I’m wrong about some of the things I said and maybe there are better alternatives out there… hopefully both are true and some commenters will lead me down a better path. Overall, OCR in the Cloud is great because of how computationally expensive OCRing (and ICR even more so) is on any processing machine, so it’d be great to see more options of this nature. My only problem, coming from the litigation world, is that I don’t think lawyers are exactly comfortable sending privileged client data over the web for this, so in a lot of cases desktop-only solutions are the only solution.


Open Source Ocr - Bookshelf

Computer Vision and Information Technology: Advances and Applications

Recognition of Handwritten Roman Numerals Using Tesseract Open Source OCR Engine. Sandip Rakshit, Amitava Kundu, Mrinmoy Maity, Subhajit Mandal, ...

Research and Advanced Technology for Digital Libraries, 14th European Conference, ECDL 2010, Glasgow, UK, September 6-10, 2010, Proceedings

2 Customising OCRopus for Historical OCR. The open-source OCRopus OCR framework [1] affords us a great deal of flexibility in constructing workflows around ...

Mobile Computing, Applications, and Services, First International ICST Conference, MobiCASE 2009, San Diego, CA, USA, October 26-29, 2009, Revised Selected Papers

A detailed list of OCR engines is available at Wikipedia. We tested some of the open source OCR engines such as OCRAD [4], Tesseract [7], ...

Understanding Open Source Software development

The Utilities category contains projects like "Open Source OCR software." Lagging just behind System and Utilities are the somewhat active categories of ...

Sanskrit Computational Linguistics, First and Second International Symposia Rocquencourt, France, October 29-31, 2007 Providence, RI, USA, May 15-17, 2008, Revised Selected Papers

OCRopus is an open source OCR system currently being developed, intended to be omni-lingual and omni-script. In addition to modern digital library ...

Daily Information Directory


Open Source Ocr
By expanding the business Google has stepped in and introduced the free accessibility of Open Source OCR. ... Open source ocr is one such software which is introduced by Google. ...

OCR Open Source Software - claraocr.org
In the area of OCR software, numerous open source solutions are also offered. A brief overview of the free software solutions can be found here ...

Open-Source OCR Software, Sponsored by Google
Google sponsors the development of an open-source OCR software at the IUPR research group. ... In the past, open source OCR really hasn't come close to the performance level of ...

An open source collaborative network
The source code will read a binary, grey or color image and output text. ... OCRopus :- The open source document analysis and OCR system featuring pluggable layout ...

Open Source OCR Software
The purpose of OCR (optical character recognition) software is to extract text from image files, making them text-searchable and easier to work with.