New in Lighthouse: UTF8MB4 Database Conversion Tool

Lighthouse – Performance & Security Plugin for WordPress

Improving WordPress Database Performance and Emoji Support with UTF8MB4

WordPress has gradually moved to a 4‑byte UTF-8 encoding (called utf8mb4) to support emojis and more languages. The new Lighthouse release adds an easy way to convert your site’s database and tables to the recommended utf8mb4_unicode_ci collation. This was done because modern WordPress and plugins expect a database that can store any Unicode character, including the latest emojis. If your site is still on the old 3‑byte utf8 charset, characters like 🇺🇸 or 🤖 may not save correctly. Lighthouse’s new tool automates the upgrade, so your site can handle richer content without manual SQL hacks.

Charset vs. Collation: What They Mean

In plain terms, a character set is an encoding: a map of characters (letters, numbers, symbols) to bytes. For example, utf8 is a MySQL character set that uses up to 3 bytes per character, which covers most common characters, whereas utf8mb4 uses up to 4 bytes per character and can encode all of Unicode. A collation is a set of rules for how to compare and sort those characters. It defines things like case-sensitivity and special ordering. For instance, a collation tells the database whether “é” sorts the same as “e” or where “ß” fits alphabetically.

  • A charset (like utf8mb4) controls which characters you can store and how much space they take.
  • A collation (like utf8mb4_unicode_ci) controls how text comparisons work (for example, whether Love equals love, and how accented letters are handled).

The collation name usually ends in _ci (case‑insensitive), _cs (case‑sensitive), or _bin (binary). When you convert your database to utf8mb4_unicode_ci, you get a Unicode‑aware comparison that is case-insensitive.
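To make the byte counts concrete, here is a minimal PHP sketch that measures how many bytes a few characters occupy in UTF-8 and flags the ones that only fit in utf8mb4. The helper names utf8ByteLength() and needsUtf8mb4() are hypothetical and exist only for this illustration; they are not part of WordPress or Lighthouse.

```php
<?php
// Illustrative only: both helpers below are hypothetical names, not WordPress APIs.

/** Number of bytes a UTF-8 string occupies (strlen() counts bytes, not characters). */
function utf8ByteLength(string $text): int
{
    return strlen($text);
}

/** True if the text contains a code point above U+FFFF, which needs 4 bytes in UTF-8. */
function needsUtf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$samples = [
    'a'       => 'a',         // U+0061
    'e-acute' => "\u{00E9}",  // é
    'euro'    => "\u{20AC}",  // €
    'robot'   => "\u{1F916}", // 🤖
];

foreach ($samples as $label => $char) {
    printf(
        "%-8s %d byte(s), needs utf8mb4: %s\n",
        $label,
        utf8ByteLength($char),
        needsUtf8mb4($char) ? 'yes' : 'no'
    );
}
```

Only the last sample needs 4 bytes, which is why it cannot be stored in a column that uses the 3-byte utf8 charset.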

Why WordPress Recommends UTF8MB4

Since WordPress 4.2 (in 2015), the core team has preferred utf8mb4 for new sites because it “safely supports the widest set of characters… including Emoji, enabling better support for non-English languages”. In practical terms, using utf8mb4 means:

  • Emoji and symbols work: All current and future Unicode emojis are storable.
  • Broad language coverage: Characters from scripts beyond the Basic Multilingual Plane (like ancient scripts or rare Chinese/Kanji) are supported.
  • Backward compatibility: utf8mb4 is 100% compatible with older utf8 data – existing text will remain readable.

Today, most WordPress installations should use utf8mb4 with a Unicode collation such as utf8mb4_unicode_ci. On modern MySQL/MariaDB versions, a fresh install typically sets DB_CHARSET in wp-config.php to utf8mb4 and leaves DB_COLLATE empty, so WordPress picks a matching utf8mb4 Unicode collation itself. Lighthouse’s new converter ensures your entire database (and all tables) actually follow that standard, so you won’t see broken characters or “????” in place of emoji.

Collation Differences in Simple Terms

Not all UTF‑8 collations behave the same. Here are the common ones:

  • utf8_general_ci: An older, simplified collation. It sorts characters in a very basic way and is slightly faster, but less accurate for accented or special characters. For example, it may treat “ä” the same as “a” even when languages treat them differently.
  • utf8_unicode_ci: Uses an older version of the Unicode Collation Algorithm and gives more correct alphabetical ordering for many languages. It’s a bit slower than general_ci, but it handles cases general_ci gets wrong, such as treating “ß” as equal to “ss”.
  • utf8mb4_unicode_ci: Same rules as utf8_unicode_ci, but works for the 4‑byte utf8mb4 charset. This is the standard recommendation for WordPress because it handles all Unicode characters correctly.
  • utf8mb4_unicode_520_ci: A newer version based on Unicode 5.2 rules (MySQL added this in newer versions). It only has tiny differences – for example, it treats the Polish “Ł” exactly like “L”, unlike the older collation which orders it separately. In most cases, you won’t notice a practical difference, except that unicode_520_ci reflects a more up-to-date sorting standard.

Bottom line: utf8mb4_unicode_ci is generally recommended for WordPress sites. It’s broad, Unicode‑aware, and compatible. If you’re on MySQL 5.7+ or MariaDB and want the latest, you could also use utf8mb4_unicode_520_ci (or even utf8mb4_0900_ai_ci on MySQL 8+), but as the Yoast support team notes, the safest approach is to use the default Unicode collations and not tinker unless needed.

Performance and Compatibility Considerations

Switching to utf8mb4 has mostly positive outcomes, but it does have some small side effects:

  • Index length: MySQL limits index keys by bytes, and older InnoDB row formats cap a key at 767 bytes. With 3‑byte utf8, you could index up to 255 characters (767 / 3, rounded down). With 4‑byte utf8mb4, that drops to 191 characters (767 / 4, rounded down). WordPress 4.2’s upgrade addressed this by shortening certain indexes (e.g. to a varchar(191) prefix) so the database migration works smoothly. Lighthouse’s converter uses the same logic as WP core, so you won’t have to manually edit indexes.
  • Database size: Text that was already valid in utf8 takes the same number of bytes in utf8mb4; only characters outside the Basic Multilingual Plane, such as emoji, need 4 bytes. In practice, the increase is usually small.
  • Speed: The older general_ci collations are a bit faster than unicode_ci, but the difference is generally reported to be only a few percent. In real-world terms, the performance loss from unicode_ci is negligible for most sites. The benefit is more correct comparison and sorting (for example, “ß” is treated as “ss” rather than as a plain “s”).
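The index-length arithmetic above can be checked in a few lines of PHP. This is only a sketch of the calculation: maxIndexableChars() is a hypothetical helper, and the 767-byte figure is the limit quoted in the list above.

```php
<?php
// Illustrative only: maxIndexableChars() is a hypothetical helper.

/**
 * How many characters fit in an index key when every character
 * must be budgeted at its maximum byte width.
 */
function maxIndexableChars(int $indexByteLimit, int $maxBytesPerChar): int
{
    // Integer division: a partial character cannot be indexed.
    return intdiv($indexByteLimit, $maxBytesPerChar);
}

$limit = 767; // bytes, the index key limit quoted above

printf("utf8    (3 bytes/char): %d characters\n", maxIndexableChars($limit, 3));
printf("utf8mb4 (4 bytes/char): %d characters\n", maxIndexableChars($limit, 4));
```

The two results, 255 and 191, explain why an index on a varchar(255) column has to be shortened to 191 characters after the switch.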

Overall, using utf8mb4_unicode_ci is worth it for the greater compatibility, and any performance impact is minimal. Modern hosts and PHP setups (mysqlnd 5.0.9+ or MySQL/MariaDB 5.5.3+) fully support utf8mb4, so compatibility is broad. Lighthouse’s conversion tool handles the technical work (including any index adjustments) so you don’t have to worry.

Using Lighthouse’s New Converter

To use the new feature, update Lighthouse to the latest version. In the WordPress admin under Lighthouse → Performance, you should now see a Database Charset/Collation section. Selecting it and saving the changes will trigger the conversion process. Behind the scenes, Lighthouse will:

1. Load the WordPress upgrade functions (wp-admin/includes/upgrade.php).

2. Retrieve the list of tables in your database.

3. For each table, run maybe_convert_table_to_utf8mb4($table) (a core WP function) which effectively does:

ALTER TABLE your_table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

(It skips any table that has non-utf8 columns, and it leaves tables that are already utf8mb4 untouched.)

4. Report back which tables were converted and if any errors occurred.

This means you don’t have to manually execute SQL. Lighthouse provides progress indicators and success/error messages in the admin UI. After completion, your entire database (and all eligible tables/columns) will be in utf8mb4_unicode_ci.

Example: How Conversion Works in Code

For developers curious about the process, here’s a simplified example of what the conversion code might look like inside a plugin:

// This code runs during an admin action, e.g. when a button is clicked.
// maybe_convert_table_to_utf8mb4() is defined in the core upgrade library.
require_once ABSPATH . 'wp-admin/includes/upgrade.php';
global $wpdb;

// Get all table names in the current database.
$tables = $wpdb->get_col('SHOW TABLES');

// Convert each table to utf8mb4_unicode_ci if possible.
$on_utf8mb4 = array();
$skipped    = array();
foreach ($tables as $table) {
    // Truthy: the table was converted, or it was already utf8mb4.
    // False: it has non-utf8 columns (skipped) or a query failed.
    if (maybe_convert_table_to_utf8mb4($table)) {
        $on_utf8mb4[] = $table;
    } else {
        $skipped[] = $table;
    }
}
printf(
    'Conversion finished: %d table(s) on utf8mb4, %d skipped or failed.',
    count($on_utf8mb4),
    count($skipped)
);

In the real Lighthouse code, this logic is hooked into the plugin’s settings page. Users don’t need to write any code; the plugin takes care of loading the WP upgrade library and calling maybe_convert_table_to_utf8mb4() on each table. This core function checks the current column and table collations and only runs the ALTER statement when appropriate.
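The check described above can be sketched without a database. The following is a simplified, standalone model of the decision, assuming the collations have already been read from the table; charsetOfCollation() and conversionDecision() are hypothetical names used for illustration and are not the WordPress source.

```php
<?php
// Illustrative only: a pure-PHP model of the decision, not the core implementation.

/** Extract the charset prefix from a collation name, e.g. "utf8mb4_unicode_ci" -> "utf8mb4". */
function charsetOfCollation(string $collation): string
{
    return strtolower(explode('_', $collation)[0]);
}

/**
 * Decide what to do with one table.
 *
 * @param string[] $columnCollations Collations of the table's text columns.
 * @param string   $tableCollation   The table's default collation.
 * @return string  'skip', 'already', or 'convert'.
 */
function conversionDecision(array $columnCollations, string $tableCollation): string
{
    foreach ($columnCollations as $collation) {
        $charset = charsetOfCollation($collation);
        if ($charset !== 'utf8' && $charset !== 'utf8mb4') {
            // A non-utf8 column (e.g. latin1) makes the whole table ineligible.
            return 'skip';
        }
    }

    if (charsetOfCollation($tableCollation) === 'utf8mb4') {
        // Nothing to do, so no ALTER statement is needed.
        return 'already';
    }

    return 'convert';
}

echo conversionDecision(['utf8_general_ci'], 'utf8_general_ci'), "\n";
echo conversionDecision(['latin1_swedish_ci'], 'utf8_general_ci'), "\n";
echo conversionDecision(['utf8mb4_unicode_ci'], 'utf8mb4_unicode_ci'), "\n";
```

Keeping the decision separate from the query makes it easy to report a per-table reason in the admin UI before anything is altered.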

Best Practices for Plugin Developers

When building or updating your own plugins with regard to charsets and collations, keep these tips in mind:

  • Use $wpdb->get_charset_collate(): When creating custom tables (e.g. in an activation hook), always append $wpdb->get_charset_collate() to your SQL. This ensures your table uses the current WordPress charset and collation (which should be utf8mb4 if WP is up-to-date). For example:
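global $wpdb;
// dbDelta() is defined in the core upgrade library.
require_once ABSPATH . 'wp-admin/includes/upgrade.php';

$charset_collate = $wpdb->get_charset_collate();
$sql = "CREATE TABLE {$wpdb->prefix}mytable (
    id bigint(20) NOT NULL AUTO_INCREMENT,
    name varchar(255) NOT NULL,
    PRIMARY KEY  (id)
) $charset_collate;";
dbDelta($sql);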

This appends a clause such as DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; the exact charset and collation depend on the site’s configuration.

  • Avoid hard-coding “utf8” or other charsets: Don’t manually put CHARSET=utf8 or COLLATE=utf8_general_ci in your plugin’s SQL. That can conflict if the site is on utf8mb4. Using $wpdb->get_charset_collate() handles this for you.
  • Respect DB_CHARSET and DB_COLLATE: WordPress’s constants (DB_CHARSET, DB_COLLATE) reflect the site’s chosen settings (often set to utf8mb4 defaults). Plugins should rely on these settings and not override them without good reason.
  • Check before converting: If your plugin ever needs to change an existing table’s collation, check its current charset first (e.g. with SHOW TABLE STATUS) and only convert if it’s not utf8mb4. maybe_convert_table_to_utf8mb4() already does this check, so it avoids unnecessary ALTERs.
  • Backup before major changes: Converting charsets affects your data. It’s best to back up the database (or try on a staging site) before running any migration. This is true even with Lighthouse’s tool.
  • Test with sample data: After conversion, test inserting and retrieving special characters (e.g. emojis, non-Latin scripts) to ensure everything works. Also check your site’s frontend for any unexpected layout or content changes.
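As one way to act on the “avoid hard-coding” tip, a plugin’s test suite could scan its own SQL strings for legacy charset names. The sketch below is a rough heuristic under that assumption; findHardcodedLegacyCharsets() is a hypothetical helper, not a WordPress or Lighthouse function, and a regex scan like this will not catch every way of writing a charset clause.

```php
<?php
// Illustrative only: a heuristic lint for hard-coded 3-byte utf8 names in SQL.

/**
 * Return every legacy utf8 charset or collation name that is
 * hard-coded after CHARSET, CHARACTER SET, or COLLATE in $sql.
 *
 * @return string[]
 */
function findHardcodedLegacyCharsets(string $sql): array
{
    $pattern = '/\b(?:CHARSET|CHARACTER\s+SET|COLLATE)\s*=?\s*([A-Za-z0-9_]+)/i';
    preg_match_all($pattern, $sql, $matches);

    // Keep "utf8", "utf8mb3" and their collations; ignore utf8mb4 names.
    $legacy = array_filter($matches[1], function (string $name): bool {
        return preg_match('/^utf8(?:mb3)?(?:_|$)/i', $name) === 1;
    });

    return array_values($legacy);
}

$bad  = 'CREATE TABLE t (id int) DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci';
$good = 'CREATE TABLE t (id int) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci';

print_r(findHardcodedLegacyCharsets($bad));
print_r(findHardcodedLegacyCharsets($good));
```

The first statement is flagged for both its charset and its collation, while the second returns an empty list.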

In short, rely on WordPress APIs and defaults rather than guessing character sets. The WordPress ecosystem now standardizes on utf8mb4, and using the provided helper functions will keep your plugins compatible and data safe.

Ready to Convert?

We hope this new Lighthouse feature makes it easy to modernize your WordPress database. Upgrading to utf8mb4_unicode_ci unlocks emoji support and better international character handling, with minimal downsides. Go ahead, update Lighthouse and try the conversion on a test site. You’ll be stepping into a more future-proof encoding setup – and you’ll get a green light, knowing your database is ready for anything!
