I’m reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
1. The “ß” in “Fußball” should look like this in my database: “ß”. If it is a “ß”, it is displayed correctly.
2. Sometimes, the “ß” in “Fußball” looks like this in my database: “ÃÆß”. Then it is displayed wrongly, of course.
3. In other cases, the “ß” is saved without any change. Then it is also displayed wrongly.
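To illustrate, a pattern like the second case above can be reproduced when text that is already UTF-8 is treated as ISO 8859-1 and converted once more. This is a minimal sketch, assuming the mbstring extension is available; whether this is really what happens to the feed texts is an assumption.

```php
<?php
// "ß" as UTF-8 bytes (0xC3 0x9F), written with escapes so the
// encoding of this source file does not matter.
$utf8 = "Fu\xC3\x9Fball";

// Treating already-UTF-8 bytes as ISO 8859-1 and converting again
// turns each of the two bytes into its own two-byte sequence.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";
echo bin2hex($double), "\n";
```

The second string is two bytes longer than the first, which is the double-encoded form that shows up as several junk characters in place of one “ß”.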
What can I do to avoid cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it’s clear what the effect is, but when must I use the functions?), and when must I do nothing with the input?
Perhaps the function mb_detect_encoding() would help? Can I write a function for this? So my problems are:
- How do I find out what encoding the text uses?
- How do I convert it to UTF-8 – whatever the old encoding is?
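For the first question, one possible starting point is mb_detect_encoding() with an explicit candidate list and strict mode instead of 'auto'. This is only a sketch: the helper name detect_feed_encoding_sketch is made up for illustration, and the two-entry candidate list is an assumption about which encodings the feeds use. Detection remains a guess in any case, because any byte sequence can also be read as ISO 8859-1.

```php
<?php
// Hypothetical helper: guesses between the two encodings assumed
// to occur in the feeds. Returns the encoding name, or false if
// neither candidate matches.
function detect_feed_encoding_sketch($text)
{
    // Strict mode (third argument) rejects a candidate as soon as
    // the text contains a byte sequence that is invalid in it.
    return mb_detect_encoding($text, ['UTF-8', 'ISO-8859-1'], true);
}

// "ß" as the single ISO 8859-1 byte 0xDF, which is not valid UTF-8 here.
$latin1 = "Fu\xDFball";

var_dump(detect_feed_encoding_sketch($latin1));
```

If the feed declares its encoding in the XML prolog or the HTTP header, that declaration is usually a better source than guessing from the bytes.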
Would a function like this work?
function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}
I’ve tested it, but it doesn’t work. What’s wrong with it?
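One possible reason, offered as a guess rather than a diagnosis: with 'auto', the candidate list depends on the mbstring configuration and may not include ISO 8859-1 at all, so the detected encoding can be wrong or false. A variant that avoids detection is sketched below. It assumes the input is only ever UTF-8 or ISO 8859-1, and the name to_utf8_sketch is made up for illustration.

```php
<?php
// Hypothetical helper, assuming the feeds only use UTF-8 or ISO 8859-1.
function to_utf8_sketch($text)
{
    // Text that is already valid UTF-8 is returned untouched,
    // so it cannot be double-encoded.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    // Everything else is assumed (not detected) to be ISO 8859-1.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Fu\xDFball";      // "ß" as the single ISO 8859-1 byte
$utf8   = "Fu\xC3\x9Fball";  // "ß" as two UTF-8 bytes

var_dump(to_utf8_sketch($latin1) === $utf8);
var_dump(to_utf8_sketch($utf8) === $utf8);
```

The trade-off is that ISO 8859-1 text which happens to form valid UTF-8 byte sequences would be left unconverted; for ordinary prose that is rare, but it is not impossible.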