Need good OCR for printed source code listing, any ideas?

At my work, I sometimes have to take printed source code and manually type it into a text editor. Do not ask why.

Obviously, typing it up takes a long time, and it always takes extra time to debug typing errors (oops, missed a “$” sign there).

I decided to try some OCR solutions like:

  • Microsoft Document Imaging – has built in OCR
    • Result: Missed all the leading whitespace, missed all the underscores, interpreted many of the punctuation characters incorrectly.
    • Conclusion: Slower than manually typing in code.
  • Various online web OCR apps
    • Result: Similar to or worse than Microsoft Document Imaging
    • Conclusion: Slower than manually typing in code.

I feel like source code would be very easy to OCR given the font is sans serif and monospace.

Have any of you found a good OCR solution that works well on source code?

Maybe I just need a better OCR solution (not necessarily source code specific)?

With OCR, there are currently three options:

  • ABBYY FineReader and OmniPage. Both are commercial products that are about on par in features and OCR results. I can’t say much about OmniPage, but FineReader does come with support for reading source code (for example, it has a Java language library).
  • The best OSS OCR engine is Tesseract. It’s much harder to use, and you’ll probably need to train it for your language.
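If you want to try Tesseract before buying anything, here is a rough sketch of driving its command-line tool from Python. It assumes the `tesseract` executable is installed and on your PATH; the helper names and the file name `page.png` are made up for illustration. `--psm 6` tells Tesseract to treat the page as a single uniform block of text, and `preserve_interword_spaces=1` asks it to keep runs of spaces between words. Neither is guaranteed to recover leading indentation or underscores, so treat this as a starting point, not a fix.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for one Tesseract run (hypothetical helper)."""
    return [
        "tesseract",
        image_path,
        "stdout",                             # write recognized text to stdout
        "-l", lang,                           # language / trained data to use
        "--psm", "6",                         # assume one uniform block of text
        "-c", "preserve_interword_spaces=1",  # keep multiple spaces between words
    ]


def ocr_listing(image_path, lang="eng"):
    """Run Tesseract on a scanned listing and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

Usage would be something like `print(ocr_listing("page.png"))`. Scanning at a higher resolution and cropping to just the listing will probably matter as much as any of the flags.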

I rarely do OCR, but I’ve found that the $150 for the commercial software is far outweighed by the time it saves.
