How can I detect the encoding/codepage of a text file?

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When read, these files sometimes contain garbage, because they were created in a different or unknown codepage.

Is there a way to (automatically) detect the codepage of a text file?

The detectEncodingFromByteOrderMarks parameter of the StreamReader constructor works for UTF-8 and other Unicode files that start with a byte order mark, but I’m looking for a way to detect code pages such as ibm850 and windows1252.
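To illustrate why byte order mark detection does not help with legacy code pages, here is a minimal sketch in Python (not the .NET StreamReader implementation). The function name `detect_bom` is made up for this example. It only recognises files that start with a Unicode BOM; a file saved as ibm850 or windows1252 has no such marker, so the function returns None.

```python
import codecs

# Longer marks come first: the UTF-32 LE mark begins with the same
# two bytes as the UTF-16 LE mark, so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

For example, `detect_bom(codecs.BOM_UTF8 + b"abc")` identifies UTF-8, while the bytes of "François" encoded as cp850 carry no marker at all, which is exactly the case the question is about.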


Thanks for your answers. This is what I’ve done.

The files we receive come from end-users who do not have a clue about codepages. The receivers are also end-users, and by now this is what they know about codepages: codepages exist, and they are annoying.

Solution:

  • Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something similar, you can guess the intended text with your human intelligence.
  • I’ve created a small app that the user can open the file with, and in which they enter a text that they know will appear in the file when the correct codepage is used.
  • Loop through all codepages and display the ones that produce the user-provided text.
  • If more than one codepage pops up, ask the user to specify more text.
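The loop described in the list above could look roughly like this. It is a sketch in Python, not the app mentioned in the post; the function name `matching_codepages` and the list of candidate codepages are assumptions made for illustration.

```python
def matching_codepages(data: bytes, known_text: str, codepages):
    """Return the codepages under which `data` decodes to text containing `known_text`."""
    matches = []
    for cp in codepages:
        try:
            decoded = data.decode(cp)
        except (UnicodeDecodeError, LookupError):
            # The bytes are invalid in this codepage, or the codepage
            # name is not known to this Python installation.
            continue
        if known_text in decoded:
            matches.append(cp)
    return matches


# Hypothetical candidate list; a real tool would try every codepage it supports.
CANDIDATES = ["cp437", "cp850", "cp1252", "utf-8"]

# Simulate a received file that was written in cp850.
received = "François Østergård".encode("cp850")

print(matching_codepages(received, "François", CANDIDATES))
print(matching_codepages(received, "Østergård", CANDIDATES))
```

With only "François" as the known text, both cp437 and cp850 match, because they use the same byte for ç. That is the situation in the last bullet: asking for more text, here a name containing Ø, narrows the result to cp850.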

21 Answers