What are the key differences between utf8_general_ci and utf8_unicode_ci collations in MySQL/MariaDB, and how do they affect performance?
The key differences between utf8_general_ci and utf8_unicode_ci collations in MySQL/MariaDB primarily relate to sorting accuracy versus performance. utf8_unicode_ci provides more linguistically accurate sorting by treating certain special characters as their expanded equivalents (like sorting “ß” as “ss”), while utf8_general_ci offers faster performance through simpler sorting rules that treat all characters as individual units.
Contents
- Core Technical Differences
- Performance Comparison
- Practical Sorting Examples
- When to Use Each Collation
- Modern Alternatives
- Migration Considerations
Core Technical Differences
The fundamental distinction between utf8_general_ci and utf8_unicode_ci lies in their sorting algorithms and character comparison rules:
- utf8_general_ci: Uses a simplified, locale-independent sorting algorithm that treats each Unicode character as a standalone unit. This approach provides faster but less linguistically accurate sorting.
- utf8_unicode_ci: Implements the Unicode Collation Algorithm (UCA), which provides more linguistically correct sorting by handling character expansions and contractions. For example, it recognizes that certain characters can be equivalent to combinations of other characters.
According to the MySQL 8.0 Reference Manual, comparisons for the utf8mb4_general_ci collation are faster, but slightly less correct, than comparisons for utf8mb4_unicode_ci. The reason is that utf8mb4_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters.
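A quick way to observe the expansion behavior is to compare literals directly. The following is a minimal sketch, assuming a utf8mb4 connection and the utf8mb4 variants of both collations, which apply the same comparison rules as their utf8 counterparts. The column aliases are illustrative only.

```sql
-- Assumes the session character set is utf8mb4, so the literals below
-- are valid operands for utf8mb4 collations.
SET NAMES utf8mb4;

SELECT
  'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_sharp_s_eq_ss,
  'ß' = 'ss' COLLATE utf8mb4_general_ci AS general_sharp_s_eq_ss,
  'ß' = 's'  COLLATE utf8mb4_general_ci AS general_sharp_s_eq_s;
```

Going by the manual's description, the first and third comparisons should return 1 and the second 0, because only the unicode collation expands “ß” to “ss” while the general collation compares it one-to-one with “s”.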
The collation names follow a specific convention: they start with the character set name, may include a language identifier, and end with sensitivity indicators like _ci (case insensitive), _cs (case sensitive), or _bin (binary).
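To see the naming convention on a real server, the available collations can be listed by pattern. This is a small sketch; the exact set of names returned depends on the server version.

```sql
-- All utf8mb4 collations; each name starts with the character set.
SHOW COLLATION LIKE 'utf8mb4%';

-- Only the case-insensitive ones (names ending in "ci").
SHOW COLLATION LIKE 'utf8mb4%ci';

-- The binary collation for the same character set.
SHOW COLLATION LIKE 'utf8mb4%bin';
```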
Performance Comparison
Performance differences between these collations are measurable, though modest in the tests cited here:
Benchmark Results
- Speed Difference: utf8_unicode_ci was slower than utf8_general_ci by roughly 7-8% on average in the reported tests
- Test Environment: MySQL 5.6.12 on Windows showed performance differences of 10%, 4%, and 8% for different query types
- Resource Impact: The more complex Unicode algorithms require additional processing time, especially for large datasets and complex sorting operations
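These figures come from one setup, so it is worth measuring on your own data. The following is a minimal sketch, assuming a hypothetical table `words` with a utf8mb4 text column `word` and enough rows for sorting cost to be visible. Compare the execution times your client reports for the two statements.

```sql
-- Hypothetical table: words(word VARCHAR(...) CHARACTER SET utf8mb4).
-- Overriding the collation in ORDER BY makes the server sort with
-- that collation's comparison rules.
SELECT word FROM words
ORDER BY word COLLATE utf8mb4_general_ci
LIMIT 100;

SELECT word FROM words
ORDER BY word COLLATE utf8mb4_unicode_ci
LIMIT 100;
```

Running each statement several times and discarding the first run helps keep cache warm-up from dominating the comparison.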
Performance Factors
The performance gap stems from several technical factors:
- Algorithm Complexity: utf8_unicode_ci is based on the Unicode Collation Algorithm, which handles:
  - Character expansions (one character = multiple characters)
  - Contractions (multiple characters = one character)
  - Weight normalization for proper linguistic ordering
- Memory Usage: Unicode collations typically require more memory for sorting operations due to their complex comparison rules
- Index Operations: The performance difference becomes more pronounced in indexed operations and ORDER BY clauses
- Modern Hardware: As noted by flokoe.de, the performance gap was more significant in the past when computers were much weaker. The difference is less noticeable on modern hardware but still measurable.
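One index-related effect is easy to check with EXPLAIN. The sketch assumes a hypothetical table `words` whose `word` column is indexed and declared with utf8mb4_general_ci. An index is ordered by the column's own collation, so an ORDER BY that requests a different collation typically cannot use it for sorting.

```sql
-- Hypothetical: words.word is indexed and declared with utf8mb4_general_ci.
-- Sorting by the column's own collation can read rows in index order.
EXPLAIN SELECT word FROM words ORDER BY word;

-- Requesting another collation usually shows "Using filesort"
-- in the Extra column of the plan.
EXPLAIN SELECT word FROM words ORDER BY word COLLATE utf8mb4_unicode_ci;
```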
Practical Sorting Examples
The most visible differences between these collations appear in how they handle special characters and linguistic conventions:
Character Expansion Examples
- German “ß” (sharp s):
  - utf8_unicode_ci: Sorts as equivalent to “ss”
  - utf8_general_ci: Sorts as the individual character “s”
- Ligature “Œ” (OE ligature):
  - utf8_unicode_ci: Sorts as equivalent to “OE”
  - utf8_general_ci: Sorts as an individual character
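The ordering differences can be reproduced with a small throwaway table. This is a sketch under assumptions: the table name `collation_demo` and the sample words are made up for illustration, and the utf8mb4 variants of the collations are used so the example runs on current servers.

```sql
SET NAMES utf8mb4;

-- Temporary table; it disappears when the session ends.
CREATE TEMPORARY TABLE collation_demo (
  word VARCHAR(20)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

INSERT INTO collation_demo (word) VALUES
  ('Strasse'), ('Straße'), ('Strasze'),
  ('Œuvre'), ('Oeuvre'), ('Ofen');

-- One-to-one comparison rules.
SELECT word FROM collation_demo ORDER BY word COLLATE utf8mb4_general_ci;

-- UCA-based rules with expansions.
SELECT word FROM collation_demo ORDER BY word COLLATE utf8mb4_unicode_ci;
```

Comparing the two result sets side by side shows where “ß” and “Œ” land relative to their spelled-out forms. Rows that compare as equal under a collation may appear in either relative order.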
European Language Impact
These differences are particularly relevant for European languages where such character expansions are linguistically correct. The Unicode collation produces results that match native speaker expectations, while the general collation may place words in unexpected positions.
Other Unicode Features
The Unicode collation also properly handles:
- Accented characters with proper normalization
- Case folding according to Unicode standards
- Normalization forms for equivalent characters
When to Use Each Collation
Choose utf8_general_ci When:
- Performance is critical and you can tolerate minor sorting inaccuracies
- Your data is mostly unaccented Latin text, where the two collations rarely disagree
- You have large datasets with frequent sorting operations
- You’re working with legacy systems that were designed for the older utf8 encoding
- Your application doesn’t require perfect linguistic sorting
Choose utf8_unicode_ci When:
- Linguistic accuracy is important for your user base
- You need proper sorting for multilingual applications
- Your application handles special characters that should expand to multiple characters
- You’re implementing search functionality where correct ordering matters
- You’re building international applications with diverse language support
As Nilesh Patil notes, the preferred option is not general, but unicode when accuracy is important.
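The choice does not have to be global, since a collation can be set per column and overridden per query. The following is a minimal sketch, assuming a hypothetical `products` table. The table and column names are illustrative and not taken from any cited source.

```sql
CREATE TABLE products (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  -- Machine-oriented code: simple one-to-one comparison is enough.
  sku VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  -- User-facing text: linguistically aware ordering matters more.
  display_name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- A single query can also override the column collation when needed.
SELECT display_name FROM products
ORDER BY display_name COLLATE utf8mb4_general_ci;
```

Note that comparing or joining columns with different collations may require an explicit COLLATE clause, so mixing collations within one schema is a trade-off rather than a free choice.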
Modern Alternatives
The utf8mb4 Migration
Modern MySQL versions introduced utf8mb4, which is the proper UTF-8 implementation supporting the full Unicode character set (including emojis and rare characters):
- utf8mb4_general_ci: The utf8mb4 counterpart of utf8_general_ci, with the same simplified rules and full Unicode support
- utf8mb4_unicode_ci: The utf8mb4 counterpart of utf8_unicode_ci, based on the same UCA 4.0.0 rules
- utf8mb4_0900_ai_ci: Based on UCA 9.0.0 and the default utf8mb4 collation in MySQL 8.0; recommended for MySQL 8.0+
- uca1400_ai_ci: Based on UCA 14.0.0; recommended for MariaDB 10.11+
Version-Specific Recommendations
According to CodeRed’s guide:
- MySQL 8.0+: Use utf8mb4_0900_ai_ci
- MariaDB 10.11+: Use uca1400_ai_ci
These modern collations provide better Unicode support while maintaining good performance characteristics.
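Before adopting one of these, it helps to confirm that the server actually provides it. This is a small sketch using the standard information_schema views. The LIKE pattern is an assumption intended to match the MariaDB collation whether or not it is listed with a character set prefix.

```sql
-- Report the server version (MariaDB includes "MariaDB" in the string).
SELECT VERSION();

-- Look for the collations named above; an empty result means the
-- server does not provide them.
SELECT COLLATION_NAME
FROM information_schema.COLLATIONS
WHERE COLLATION_NAME = 'utf8mb4_0900_ai_ci'
   OR COLLATION_NAME LIKE '%uca1400_ai_ci';
```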
Migration Considerations
Upgrading from utf8 to utf8mb4
The transition from utf8 to utf8mb4 is straightforward since utf8mb4 is a true superset:
- Data Integrity: utf8mb4 supports all characters that utf8 does, plus additional ones
- Performance: Similar performance characteristics when using the same collation type
- Compatibility: Seamless upgrade path without data loss
Practical Migration Steps
- Assess Impact: Test sorting behavior changes in development
- Update Schema: Change character set and collation for affected tables
- Index Rebuilding: Rebuild indexes to use the new collation
- Application Testing: Ensure application logic handles any sorting changes
- Performance Monitoring: Compare query performance before and after migration
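The schema-related steps can be sketched in SQL. This assumes a hypothetical schema `app_db` with a table `customers`, and utf8mb4_unicode_ci as the target collation. Substitute your own names and target, and rehearse on a copy of the data first.

```sql
-- Assess impact: find columns still using the 3-byte utf8 character set
-- (reported as utf8 or utf8mb3 depending on the server version).
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'app_db'
  AND CHARACTER_SET_NAME IN ('utf8', 'utf8mb3');

-- Change the default for tables created in the future.
-- This does not touch existing tables.
ALTER DATABASE app_db
  CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Convert an existing table: text columns are rewritten in the new
-- character set and the table's indexes are rebuilt with the new collation.
ALTER TABLE app_db.customers
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```

Because the conversion rebuilds the table, it can take a long time and hold locks on large tables, so scheduling it in a maintenance window is a reasonable precaution.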
Conclusion
Choosing between utf8_general_ci and utf8_unicode_ci involves balancing performance requirements against linguistic accuracy needs. The key takeaways are:
- Performance vs. Accuracy: utf8_general_ci offers ~7-8% faster performance but less linguistically correct sorting, while utf8_unicode_ci provides more accurate sorting at the cost of some performance.
- Character Expansion: The Unicode collation properly handles special characters like “ß” (equivalent to “ss”) and “Œ” (equivalent to “OE”), making it preferable for multilingual applications.
- Modern Migration: For new projects, prefer utf8mb4 variants with utf8mb4_unicode_ci or the latest utf8mb4_0900_ai_ci for MySQL 8.0+ applications.
- Selective Use: Consider utf8_general_ci only when performance is absolutely critical and you can accept minor sorting inaccuracies.
- Future-Proofing: As hardware improves, the performance difference becomes less significant, making the more accurate Unicode collation increasingly attractive for most applications.
For most modern applications, the improved accuracy of Unicode collations justifies the minor performance penalty, especially as database hardware continues to improve. The best approach is to test both options with your specific data and workload to make an informed decision.