Please explain how WordPress works with MySQL character set and collation at a low level

As the question title suggests, I’m looking to understand how WordPress works with MySQL character sets and collation options. As I will show below, things don’t make much sense to me…

I installed WordPress by following the instructions on their installation page:

https://codex.wordpress.org/Installing_WordPress

As part of the instructions, I followed their advice for manual creation of the MySQL database on the commandline, namely the commands:

mysql> CREATE DATABASE databasename;
Query OK, 1 row affected (0.00 sec)

mysql> GRANT ALL PRIVILEGES ON databasename.* TO "wordpressusername"@"hostname"
-> IDENTIFIED BY "password";
Query OK, 0 rows affected (0.00 sec)

mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.0Best Answerec)

mysql> EXIT

Further, as instructed, I edited the “wp-config.php” file to use UTF-8 character set:

define( 'DB_CHARSET', 'utf8' );

…and left the collation setting blank:

define( 'DB_COLLATE', '' );

Here is where the fun starts…

  1. If I enter a character that is not part of MySQL UTF-8, but is part of UTF-8 MB4, such as 𝌆, into a post, it shows up correctly on the rendered page. I would have expected this not to happen, as I haven’t set the character set to UTF-8 MB4, but the more restricted UTF-8 (as defined by MySQL of course, not as generally understood).

  2. If I investigate the issue in MySQL on the commandline, it gets weirder. If I run show variables like 'char%';, I get this response:

    +--------------------------+----------------------------+
    | Variable_name            | Value                      |
    +--------------------------+----------------------------+
    | character_set_client     | utf8                       |
    | character_set_connection | utf8                       |
    | character_set_database   | latin1                     |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8                       |
    | character_set_server     | latin1                     |
    | character_set_system     | utf8                       |
    | character_sets_dir       | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+
    

I would have expected database character set to be UTF-8, not latin1.

  1. If I run the command show variables like 'collation%';, the output is:

    +----------------------+-------------------+
    | Variable_name        | Value             |
    +----------------------+-------------------+
    | collation_connection | utf8_general_ci   |
    | collation_database   | latin1_swedish_ci |
    | collation_server     | latin1_swedish_ci |
    +----------------------+-------------------+
    

That’s even stranger, for obvious reasons (wouldn’t have expected the default latin1_swedish_ci collation in a UTF-8 database).

  1. Finally, if I run show full columns from mywpdatabase.wp_posts;, the output lines, where the value is not NULL, show collation to be:

| post_content_filtered | longtext | utf8mb4_unicode_ci |

My question then – how can this be explained? Why is my WordPress install correctly rendering UTF-8 MB4 characters, when the database is defined as UTF-8 in the config? And why is the database showing in MySQL as latin1, swedish collation, instead of UTF-8? And how come, that despite all this, the individual fields in the table are utf8mb4_unicode_ci? A low-level explanation of the way WordPress works with MySQL would be very helpful. Thank you!

1

There are two defines in wp-config.php of WordPress website:

define('DB_CHARSET', 'utf8');
define('DB_COLLATE', '');

There are several things which are most commonly misunderstood. Names
of constants in those defines, might suggest that they are related to
the database itself. They are not. They are related to tables within
the database.

Database creation is totally independent from table creation.
WordPress does not create a database and does not care about
database’s default character set and collation, as long as it can
connect to the database.

The value ‘utf8’ in the first define means, the least restricted
character set from ‘utf8’ family, which is either ‘utf8’ or ‘utf8mb4’.

If you leave defines above unchanged, before an attempt to install your website, it is like telling WordPress to make its own choices, regarding database’s tables character set and collation, which are supported by MySQL ( depending of MySQL version ) and are least limiting.

The following are the things, WordPress analyses to determine its choices, during installation:

  • MySQL’s version
  • database’s collation ( in wp-config.php )

Based on MySQL’s version, WordPress decides, which group of utf8 family to use. There are two, distinguished by their names: utf8 and utf8mb4. Character sets from utf8 group, allow storing of a maximum 3-bytes long characters. Character sets from utf8mb4 group, allow storing of a maximum 4-bytes long characters.

Now, WordPress checks the value of DB_COLLATE define. If empty, it will use the least limiting collation from chosen utf8 family, otherwise, will use the value specified.

Examples

define('DB_CHARSET', 'utf8');
define('DB_COLLATE', '');

If MySQL does not support utf8mb4 ( older versions ) then tables
character set will be utf8 and collation will be
utf8_general_ci. Otherwise, we can expect utf8mb4 and utf8mb4_unicode_520_ci, or utf8mb4_unicode_ci ( MySQL version dependent ), respectively.

define('DB_CHARSET', 'utf8');
define('DB_COLLATE', 'utf8_polish_ci');

Older MySQL version – utf8 and utf8_polish_ci. Newer MySQL
version – utf8mb4 and utf8mb4_polish_ci ( _polish_ci
suffix is honoured )

define('DB_CHARSET', 'cp1250');
define('DB_COLLATE', 'cp1250_polish_ci');

Any MySQL version – cp1250 and cp1250_polish_ci.

define('DB_CHARSET', 'cp1250');
define('DB_COLLATE', 'utf8_general_ci');

Any MySQL version – error ( mismatch of character set and collation )

Summary

In most cases, leaving values of defines, explained above, unchanged, is a good choice. But, if you want tables collation to match the language of your website, you can modify the value of DB_COLLATE define, appropriately ( for example – utf8mb4_polish_ci ).

Note: that explains, why the character 𝌆 was stored and retrieved properly. Simply, your tables character set belonged to utf8mb4 group, not utf8.

Leave a Comment