
Improving WordPress Database Performance and Emoji Support with UTF8MB4
WordPress has gradually moved to a 4‑byte UTF-8 encoding (called utf8mb4) to support emojis and more languages. The new Lighthouse release adds an easy way to convert your site’s database and tables to the recommended utf8mb4_unicode_ci collation. This was done because modern WordPress and plugins expect a database that can store any Unicode character, including the latest emojis. If your site is still on the old 3‑byte utf8 charset, characters like 🇺🇸 or 🤖 may not save correctly. Lighthouse’s new tool automates the upgrade, so your site can handle richer content without manual SQL hacks.
Charset vs. Collation: What They Mean
In plain terms, a character set is an encoding – a map of characters (letters, numbers, symbols) to bytes. For example, utf8 is a MySQL character set that uses at most 3 bytes per character, which covers most common characters, whereas utf8mb4 uses up to 4 bytes per character and can encode all Unicode characters. A collation is a set of rules for how to compare and sort those characters. It defines things like case-sensitivity and special ordering. For instance, a collation tells the database whether “é” sorts the same as “e” or where “ß” fits alphabetically.
- A charset (like utf8mb4) controls which characters you can store and how much space they take.
- A collation (like utf8mb4_unicode_ci) controls how text comparisons work (for example, whether Love equals love, and how accented letters are handled).
The collation name usually ends in _ci (case‑insensitive), _cs (case‑sensitive), or _bin (binary). When you convert your database to utf8mb4_unicode_ci, you get a Unicode‑aware comparison that is case-insensitive.
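To make the byte sizes concrete, here is a minimal PHP sketch that measures how many bytes a few characters take in UTF-8 and whether they would fit in the 3‑byte utf8 charset. The helper names utf8_byte_length() and fits_in_legacy_utf8() are made up for this illustration; they are not WordPress or MySQL functions, and the check assumes the input is valid UTF-8.

```php
<?php
// Sketch only: both helpers are hypothetical, not WordPress or MySQL APIs.

// strlen() counts bytes, so on a UTF-8 string it reports the encoded size.
function utf8_byte_length(string $text): int
{
    return strlen($text);
}

// True when no character needs 4 bytes, i.e. the text could be stored
// in MySQL's legacy 3-byte utf8 charset. Assumes valid UTF-8 input.
function fits_in_legacy_utf8(string $text): bool
{
    // In valid UTF-8, bytes F0-F7 only appear as the lead byte of a 4-byte sequence.
    return preg_match('/[\xF0-\xF7]/', $text) === 0;
}

$samples = [
    'plain ASCII letter'  => 'e',
    'e with acute accent' => "\u{00E9}",
    'euro sign'           => "\u{20AC}",
    'robot emoji'         => "\u{1F916}",
];

foreach ($samples as $label => $char) {
    printf(
        "%s: %d byte(s), fits in utf8: %s\n",
        $label,
        utf8_byte_length($char),
        fits_in_legacy_utf8($char) ? 'yes' : 'no'
    );
}
```

Only the emoji needs 4 bytes, which is exactly the kind of character the older utf8 charset cannot hold.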
Why WordPress Recommends UTF8MB4
Since WordPress 4.2 (in 2015), the core team has preferred utf8mb4 for new sites because it “safely supports the widest set of characters… including Emoji, enabling better support for non-English languages”. In practical terms, using utf8mb4 means:
- Emoji and symbols work: All current and future Unicode emojis are storable.
- Broad language coverage: Characters from scripts beyond the Basic Multilingual Plane (like ancient scripts or rare Chinese/Kanji) are supported.
- Backward compatibility: utf8mb4 is 100% compatible with older utf8 data – existing text will remain readable.
Today, most WordPress installations should use utf8mb4 and utf8mb4_unicode_ci by default. In fact, the database setup in wp-config.php (constants DB_CHARSET and DB_COLLATE) now typically defaults to these for modern MySQL/MariaDB versions. Lighthouse’s new converter ensures your entire database (and all tables) actually follow that standard, so you won’t see broken characters or “????” for fancy emoji.
Collation Differences in Simple Terms
Not all UTF‑8 collations behave the same. Here are the common ones:
- utf8_general_ci: An older, simplified collation. It sorts characters in a very basic way and is slightly faster, but less accurate for accented or special characters. For example, it may treat “ä” the same as “a” even when languages treat them differently.
- utf8_unicode_ci: Uses the Unicode Collation Algorithm (an older version of the standard) and gives more correct alphabetical ordering for many languages. It’s a bit slower than general_ci, but respects accents better.
- utf8mb4_unicode_ci: Same rules as utf8_unicode_ci, but works for the 4‑byte utf8mb4 charset. This is the standard recommendation for WordPress because it handles all Unicode characters correctly.
- utf8mb4_unicode_520_ci: A newer version based on Unicode 5.2 rules (available since MySQL 5.6). It only has tiny differences – for example, it treats the Polish “Ł” exactly like “L”, unlike the older collation which orders it separately. In most cases, you won’t notice a practical difference, except that unicode_520_ci reflects a more up-to-date sorting standard.
Bottom line: utf8mb4_unicode_ci is generally recommended for WordPress sites. It’s broad, Unicode‑aware, and compatible. If you’re on MySQL 5.7+ or MariaDB and want the latest, you could also use utf8mb4_unicode_520_ci (or even utf8mb4_0900_ai_ci on MySQL 8+), but as the Yoast support team notes, the safest approach is to use the default Unicode collations and not tinker unless needed.
Performance and Compatibility Considerations
Switching to utf8mb4 has mostly positive outcomes, but it does have some small side effects:
- Index length: MySQL limits indexes by bytes. With 3‑byte utf8, you could index up to 255 characters in an InnoDB table that has the older 767-byte index limit (767 bytes / 3). With 4‑byte utf8mb4, that drops to 191 characters (767 / 4). WordPress 4.2’s upgrade addressed this by reducing certain index sizes (e.g. a varchar(191) index) so the database migration works smoothly. Lighthouse’s converter uses the same logic as WP core, so you won’t have to manually edit indexes.
- Database size: In some cases, utf8mb4 data can take more space, because characters outside the Basic Multilingual Plane (such as emoji) need 4 bytes each. Text that already fit in utf8 generally takes the same space as before, so in practice the increase is usually small.
- Speed: The older general_ci collations are a bit faster than unicode_ci, but benchmarks show only a few percent difference. In real-world terms, the performance loss from unicode_ci is negligible for most sites. The benefit is more correct sorting (for example, ensuring “café” sorts close to “cafe” rather than far apart).
Overall, using utf8mb4_unicode_ci is worth it for the greater compatibility, and any performance impact is minimal. Modern hosts and PHP setups (mysqlnd 5.0.9+ or MySQL/MariaDB 5.5.3+) fully support utf8mb4, so compatibility is broad. Lighthouse’s conversion tool handles the technical work (including any index adjustments) so you don’t have to worry.
Using Lighthouse’s New Converter
To use the new feature, update Lighthouse to the latest version. In the WordPress admin under Lighthouse → Performance, you should now see a Database Charset/Collation section. Selecting it and saving the changes will trigger the conversion process. Behind the scenes, Lighthouse will:
1. Load the WordPress upgrade functions (wp-admin/includes/upgrade.php).
2. Retrieve the list of tables in your database.
3. For each table, run maybe_convert_table_to_utf8mb4($table) (a core WP function), which effectively does:
   ALTER TABLE your_table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
   (It skips any table that has non-utf8 columns or is already utf8mb4.)
4. Report back which tables were converted and if any errors occurred.
This means you don’t have to manually execute SQL. Lighthouse provides progress indicators and success/error messages in the admin UI. After completion, your entire database (and all eligible tables/columns) will be in utf8mb4_unicode_ci.
Example: How Conversion Works in Code
For developers curious about the process, here’s a simplified example of what the conversion code might look like inside a plugin:
// This code runs during an admin action, e.g. when a button is clicked.
require_once ABSPATH . 'wp-admin/includes/upgrade.php';
global $wpdb;

// Get all table names in the current database.
$tables = $wpdb->get_col('SHOW TABLES');

// Try to convert each table to utf8mb4_unicode_ci.
foreach ($tables as $table) {
    $result = maybe_convert_table_to_utf8mb4($table);
    if ($result) {
        // The table was converted, or it was already utf8mb4.
    } else {
        // Skipped or failed, e.g. the table has non-utf8 columns.
    }
}

echo 'Database conversion to utf8mb4_unicode_ci is complete.';
In the real Lighthouse code, this logic is hooked into the plugin’s settings page. Users don’t need to write any code; the plugin takes care of loading the WP upgrade library and calling maybe_convert_table_to_utf8mb4() on each table. This core function checks the current column and table collations and only runs the ALTER statement when appropriate.
Best Practices for Plugin Developers
When building or updating your own plugins with regard to charsets and collations, keep these tips in mind:
- Use $wpdb->get_charset_collate(): When creating custom tables (e.g. in an activation hook), always append $wpdb->get_charset_collate() to your SQL. This ensures your table uses the current WordPress charset and collation (which should be utf8mb4 if WP is up-to-date). For example:
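global $wpdb;
require_once ABSPATH . 'wp-admin/includes/upgrade.php'; // Provides dbDelta().

$charset_collate = $wpdb->get_charset_collate();
// dbDelta() expects two spaces between PRIMARY KEY and the column list.
$sql = "CREATE TABLE {$wpdb->prefix}mytable (
    id bigint(20) NOT NULL AUTO_INCREMENT,
    name varchar(255) NOT NULL,
    PRIMARY KEY  (id)
) $charset_collate;";
dbDelta($sql);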
This appends a clause such as DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci, based on the site’s own settings.
- Avoid hard-coding “utf8” or other charsets: Don’t manually put CHARSET=utf8 or COLLATE=utf8_general_ci in your plugin’s SQL. That can conflict if the site is on utf8mb4. Using $wpdb->get_charset_collate() handles this for you.
- Respect DB_CHARSET and DB_COLLATE: WordPress’s constants (DB_CHARSET, DB_COLLATE) reflect the site’s chosen settings (often set to utf8mb4 defaults). Plugins should rely on these settings and not override them without good reason.
- Check before converting: If your plugin ever needs to change an existing table’s collation, check its current charset first (e.g. with SHOW TABLE STATUS) and only convert if it’s not utf8mb4. maybe_convert_table_to_utf8mb4() already avoids unnecessary ALTERs this way.
- Back up before major changes: Converting charsets affects your data. It’s best to back up the database (or try on a staging site) before running any migration. This is true even with Lighthouse’s tool.
- Test with sample data: After conversion, test inserting and retrieving special characters (e.g. emojis, non-Latin scripts) to ensure everything works. Also check your site’s frontend for any unexpected layout or content changes.
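The “check before converting” tip can be sketched as a small decision helper. This is an illustration only: conversion_action() and charset_from_collation() are hypothetical names, and sending anything other than utf8 or utf8mb4 to “review” is an assumption modeled on core’s habit of leaving tables with non-utf8 columns alone. The collation string would typically come from the Collation column of a SHOW TABLE STATUS row.

```php
<?php
// Sketch only: these helpers are hypothetical and not part of WordPress core.

// "utf8mb4_unicode_ci" -> "utf8mb4": the charset is the part before the first underscore.
function charset_from_collation(string $collation): string
{
    return strtolower(explode('_', $collation)[0]);
}

// Returns 'skip', 'convert', or 'review' for a table's collation.
function conversion_action(string $tableCollation): string
{
    $charset = charset_from_collation($tableCollation);

    if ($charset === 'utf8mb4') {
        return 'skip'; // Already converted; no ALTER needed.
    }
    if ($charset === 'utf8' || $charset === 'utf8mb3') {
        return 'convert'; // Legacy 3-byte charset (newer MySQL versions report it as utf8mb3).
    }
    return 'review'; // Assumption: anything else (e.g. latin1) gets a manual look first.
}

foreach (['utf8mb4_unicode_ci', 'utf8_general_ci', 'latin1_swedish_ci'] as $collation) {
    printf("%s => %s\n", $collation, conversion_action($collation));
}
```

Keeping the decision separate from the ALTER statement makes it easy to log what would happen before touching any data.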
In short, rely on WordPress APIs and defaults rather than guessing character sets. The WordPress ecosystem now standardizes on utf8mb4, and using the provided helper functions will keep your plugins compatible and data safe.
Ready to Convert?
We hope this new Lighthouse feature makes it easy to modernize your WordPress database. Upgrading to utf8mb4_unicode_ci unlocks emoji support and better international character handling, with minimal downsides. Go ahead, update Lighthouse and try the conversion on a test site. You’ll be stepping into a more future-proof encoding setup – and you’ll get a green light, knowing your database is ready for anything!