How to diagnose encoding issues and keep them in check?

What we have for a project is:

  • multiple sites (production, test, local development);
  • migrated by multiple methods (phpMyAdmin, Navicat, BackupBuddy).

The issue we are having is that while the original production site seems to work fine, the rest of the installations are constantly plagued by text encoding issues.

The original site is configured with Latin MySQL tables, but WP is configured for UTF-8 and serves pages as UTF-8, which I was told (in our chat) is already problematic. The rest of the sites (whose databases mostly mirror the original production site) display issues such as:

  • broken characters (correctable by tweaking WP encoding settings);
  • broken characters (not correctable by tweaking WP encoding settings);
  • site working fine, but feeding broken characters to external libraries.

Since I have been trying to untangle this for a while and there isn’t much information on diagnosing encoding issues in WP, my questions are as follows:

  1. How to reliably diagnose whether a site has encoding configuration issues, even if it might not display them under normal circumstances?

  2. Which rules should be formulated, documented and enforced to prevent encoding issues on migration?

2 Answers

So after about a year (on and off!) I have hopefully managed to get a handle on the encoding issue.

Why it breaks

What my experience boiled down to is that encoding issues like this are mostly caused by miscommunication when moving data around:

  • in the best case it is a read mismatch, where correct data is wrongly interpreted;
  • in the worst case it is a write mismatch, where data is saved incorrectly, causing a cascade of issues and various degrees of corruption down the line.
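
To make the two cases concrete, here is a minimal sketch of what happens to a single "ü" at the byte level. It assumes the mismatch is between UTF-8 and ISO-8859-1; MySQL's latin1 is actually closer to Windows-1252, so real data may differ slightly. The variable names are made up for illustration.

```php
<?php
// "ü" stored correctly as UTF-8 is the two bytes C3 BC.
$correct = "\xC3\xBC";

// Read mismatch: the stored bytes are fine, but a consumer decodes them as
// Latin-1 and therefore sees two characters ("Ã¼") instead of one.
$misread = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

// Write mismatch: the misread value gets saved back, so the stored bytes
// themselves are now double-encoded.
$stored = $misread;

echo bin2hex($correct), "\n"; // c3bc
echo bin2hex($stored), "\n";  // c383c2bc
```

As long as only the read side is wrong, fixing the configuration is enough. Once the second byte sequence has been written to the database, the data itself has to be repaired.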

Preemptive measures

The earliest you can screw up database encoding in WP is when creating the database, so before you have even downloaded the WP archive to install.

Do not rely on defaults, and make sure that components use the same encoding (such as UTF-8) internally, as well as when talking to each other and to visitors. This goes well beyond WP and involves MySQL configuration, possibly with some tweaks for Apache and PHP on top.
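
On the WP side, the relevant settings live in wp-config.php. The fragment below is only a sketch for a fresh install where the database itself was created as UTF-8. The value 'utf8' is an assumption matching the goal stated above; on an existing site with Latin tables, changing it without converting the data first can cause exactly the read mismatch described earlier.

```php
<?php
// Fragment for wp-config.php (assumed values, adjust to your setup).
// Charset WP uses when it connects to MySQL and creates tables.
define('DB_CHARSET', 'utf8');

// Empty string lets MySQL pick the default collation for that charset.
define('DB_COLLATE', '');
```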

See WordPress Database Charset and Collation Configuration

Fixing

When things are thoroughly broken you are in for a ton of pain figuring out what is wrong and how to get it back to normal.

I found mb_detect_encoding() highly useful. It’s not a magic wand, but (in strict mode) a false return from it is a good signal that things are not normal.
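
A minimal sketch of such a strict check, run over a few sample values. The helper name is_valid_utf8() and the samples are made up for illustration; in practice you would feed it values pulled from the database.

```php
<?php
// Returns true when the bytes form valid UTF-8.
function is_valid_utf8(string $value): bool
{
    // Third argument enables strict mode; invalid input yields false.
    return mb_detect_encoding($value, 'UTF-8', true) !== false;
}

$samples = [
    'utf8 umlaut'   => "\xC3\xBC", // "ü" encoded as UTF-8
    'latin1 umlaut' => "\xFC",     // "ü" encoded as Latin-1, invalid as UTF-8
    'plain ascii'   => 'abc',
];

foreach ($samples as $label => $value) {
    echo $label, ': ', is_valid_utf8($value) ? 'ok' : 'NOT valid UTF-8', "\n";
}
```

Keep in mind that a passing check only says the bytes are well-formed. Double-encoded text is still valid UTF-8 and will pass.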

On the WP-specific front, $wpdb has encoding-related properties, such as $wpdb->charset and $wpdb->collate.
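
One way to use them is to put what WP thinks next to what MySQL reports for the connection. The sketch below takes the database object as a parameter so it can be tried outside WP. The function name describe_db_encoding() is hypothetical; it relies only on the charset and collate properties and on get_results().

```php
<?php
// Collects WP's configured charset/collation and MySQL's own
// character_set_* variables into one array for side-by-side comparison.
function describe_db_encoding(object $db): array
{
    $report = [
        'wp_charset' => $db->charset,
        'wp_collate' => $db->collate,
    ];

    // Each row is an object with Variable_name and Value properties.
    $rows = (array) $db->get_results("SHOW VARIABLES LIKE 'character_set_%'");
    foreach ($rows as $row) {
        $report[$row->Variable_name] = $row->Value;
    }

    return $report;
}

// Inside WordPress this would be called as:
//   global $wpdb;
//   print_r(describe_db_encoding($wpdb));
```

If wp_charset says one thing and character_set_client or character_set_connection says another, that is a likely place for a read or write mismatch.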

When you have a reason, guess or idea of what is wrong, copy the data to a safe place and try to convert it into a consistent, normalized encoding. See:

  • Converting Database Character Sets
  • MySQL is destroying my Umlauts
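
For values that turn out to be double-encoded, a per-string repair could look like the sketch below. It is a heuristic, not the procedure from the linked articles: repair_double_encoded() is a hypothetical helper, and the ISO-8859-1 default is an assumption (MySQL's latin1 is closer to Windows-1252, so that name may be the better fit for real data). Run it on a copy first.

```php
<?php
// Undoes one round of "UTF-8 bytes misread as a legacy charset and re-saved
// as UTF-8". Returns the input unchanged when it does not look double-encoded.
function repair_double_encoded(string $value, string $legacy = 'ISO-8859-1'): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        return $value; // not UTF-8 at all, a different problem
    }

    // Map each character back to the single byte it was misread from.
    $candidate = mb_convert_encoding($value, $legacy, 'UTF-8');

    $looksDoubleEncoded =
        mb_check_encoding($candidate, 'UTF-8')               // result is valid UTF-8
        && preg_match('/[\x80-\xFF]/', $candidate) === 1     // and not plain ASCII
        && mb_convert_encoding($candidate, 'UTF-8', $legacy) === $value; // lossless

    return $looksDoubleEncoded ? $candidate : $value;
}

echo bin2hex(repair_double_encoded("\xC3\x83\xC2\xBC")), "\n"; // c3bc
echo bin2hex(repair_double_encoded("\xC3\xBC")), "\n";         // c3bc (untouched)
```

The round-trip check keeps correctly encoded text from being mangled, but text that legitimately contains sequences like "Ã¼" would be a false positive, so spot-check the results.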
