10.9. Character Sets and Collations That MySQL Supports

MySQL 5.0

MySQL supports 70+ collations for 30+ character sets. This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the allowable collations are listed.

You can always list the available character sets and their default collations with the statement:

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset  | Description                 | Default collation   |
+----------+-----------------------------+---------------------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |
| dec8     | DEC West European           | dec8_swedish_ci     |
| cp850    | DOS West European           | cp850_general_ci    |
| hp8      | HP West European            | hp8_english_ci      |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |
| latin1   | cp1252 West European        | latin1_swedish_ci   |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |
| ascii    | US ASCII                    | ascii_general_ci    |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |
| cp1250   | Windows Central European    | cp1250_general_ci   |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |
| cp866    | DOS Russian                 | cp866_general_ci    |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |
| macce    | Mac Central European        | macce_general_ci    |
| macroman | Mac West European           | macroman_general_ci |
| cp852    | DOS Central European        | cp852_general_ci    |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |
| cp1256   | Windows Arabic              | cp1256_general_ci   |
| cp1257   | Windows Baltic              | cp1257_general_ci   |
| binary   | Binary pseudo charset       | binary              |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
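To narrow the list, or to see every collation available for one character set rather than only its default, the SHOW statements accept a LIKE pattern. The following is a brief sketch; the patterns shown are only examples:

```sql
-- List only the character sets whose names begin with "latin".
SHOW CHARACTER SET LIKE 'latin%';

-- List all collations that belong to the latin1 character set.
SHOW COLLATION LIKE 'latin1%';
```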

10.9.1. Unicode Character Sets

MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.

  • ucs2 (UCS-2 Unicode) collations:

    • ucs2_general_ci (default)

  • utf8 (UTF-8 Unicode) collations:

    • utf8_general_ci (default)

Note that in the ucs2_roman_ci and utf8_roman_ci collations, I and J compare as equals, and U and V compare as equals.

The ucs2_esperanto_ci and utf8_esperanto_ci collations were added in MySQL 5.0.13. The ucs2_hungarian_ci and utf8_hungarian_ci collations were added in MySQL 5.0.19.

MySQL implements the utf8_unicode_ci collation according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The following discussion uses utf8_unicode_ci, but it is also true for ucs2_unicode_ci.

Currently, the utf8_unicode_ci collation has only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese and some minority languages in Russia such as Udmurt, Tatar, Bashkir, and Mari.

The most significant feature in utf8_unicode_ci is that it supports expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages ‘ß’ is equal to ‘ss’.

utf8_general_ci is a legacy collation that does not support expansions. It can make only one-to-one comparisons between characters. This means that comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci.

For example, the following equalities hold in both utf8_general_ci and utf8_unicode_ci:

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci:

ß = ss
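The difference between utf8_general_ci and utf8_unicode_ci can be checked directly with the COLLATE operator. The sketch below writes ß as the UTF-8 byte sequence 0xC39F behind a _utf8 introducer, so that the result does not depend on the encoding the client happens to send; the column aliases are arbitrary.

```sql
-- Compare sharp s against 's' and 'ss' under each collation.
SELECT
  _utf8 0xC39F = _utf8 's'  COLLATE utf8_general_ci AS general_s,
  _utf8 0xC39F = _utf8 'ss' COLLATE utf8_general_ci AS general_ss,
  _utf8 0xC39F = _utf8 'ss' COLLATE utf8_unicode_ci AS unicode_ss;
```

Going by the rules stated here, general_s and unicode_ss should come back as 1 (equal) and general_ss as 0.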

MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German and French, so there is no need to create special utf8 collations for these two languages.

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

utf8_swedish_ci, like other utf8 language-specific collations, is derived from utf8_unicode_ci with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:
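Because a collation can be assigned per column, the choice between speed and accuracy does not have to be made once for a whole table. A minimal sketch, assuming a hypothetical table named customer whose columns are invented for illustration:

```sql
-- Hypothetical table: the name column uses the expansion-aware
-- collation, while the short code column favors the faster one.
CREATE TABLE customer (
  name VARCHAR(60) CHARACTER SET utf8 COLLATE utf8_unicode_ci,
  code VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_general_ci
);
```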

Ü = Y < Ö

The utf8_spanish_ci and utf8_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, ‘ñ’ (n-tilde) is a separate letter between ‘n’ and ‘o’. In addition, for traditional Spanish, ‘ch’ is a separate letter between ‘c’ and ‘d’, and ‘ll’ is a separate letter between ‘l’ and ‘m’.
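A language-specific collation can also be applied at query time with COLLATE in an ORDER BY clause. The sketch below assumes a hypothetical table named words and sorts the same rows under the modern (utf8_spanish_ci) and traditional (utf8_spanish2_ci) Spanish collations:

```sql
-- Hypothetical table and data, for illustration only.
CREATE TABLE words (w VARCHAR(20)) CHARACTER SET utf8;
INSERT INTO words VALUES ('luz'), ('llama'), ('mano');

-- Modern Spanish: 'll' is treated as two ordinary letters.
SELECT w FROM words ORDER BY w COLLATE utf8_spanish_ci;

-- Traditional Spanish: 'll' is a separate letter after 'l'.
SELECT w FROM words ORDER BY w COLLATE utf8_spanish2_ci;
```

If the traditional rules are applied as described, ‘llama’ should sort after ‘luz’ in the second query but before it in the first.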

10.9.2. West European Character Sets

Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.

  • ascii (US ASCII) collations:

    • ascii_general_ci (default)

  • cp850 (DOS West European) collations:

    • cp850_general_ci (default)

  • dec8 (DEC Western European) collations:

    • dec8_swedish_ci (default)

  • hp8 (HP Western European) collations:

    • hp8_english_ci (default)

  • latin1 (cp1252 West European) collations:

    • latin1_swedish_ci (default)

    latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, but IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

    The latin1_swedish_ci collation is the default that probably is used by the majority of MySQL customers. Although it is frequently said that it is based on the Swedish/Finnish collation rules, there are Swedes and Finns who disagree with this statement.

    The latin1_german1_ci and latin1_german2_ci collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut für Normung (the German equivalent of ANSI). DIN-1 is called the “dictionary collation” and DIN-2 is called the “phone book collation.”

    • latin1_german1_ci (dictionary) rules:

      Ä = A
      Ö = O
      Ü = U
      ß = s
      
    • latin1_german2_ci (phone-book) rules:

      Ä = AE
      Ö = OE
      Ü = UE
      ß = ss
      

    In the latin1_spanish_ci collation, ‘ñ’ (n-tilde) is a separate letter between ‘n’ and ‘o’.

  • macroman (Mac West European) collations:

    • macroman_general_ci (default)

  • swe7 (7bit Swedish) collations:

    • swe7_swedish_ci (default)
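Two of the latin1 points above, the cp1252 code points and the DIN-1/DIN-2 rules, can be explored from the client. The following is a sketch only: it uses hexadecimal literals with a _latin1 introducer so that the client's own encoding does not matter, and the column aliases are arbitrary.

```sql
-- 0x80 is the Euro sign in MySQL's latin1; 0x81 is one of the
-- "undefined" positions. Show the ucs2 code each one converts to.
SELECT HEX(CONVERT(_latin1 0x80 USING ucs2)) AS euro,
       HEX(CONVERT(_latin1 0x81 USING ucs2)) AS undefined_81;

-- 0xC4 is Ä in latin1. Compare it under the two German collations.
SELECT
  _latin1 0xC4 = _latin1 'A'  COLLATE latin1_german1_ci AS din1_a,
  _latin1 0xC4 = _latin1 'AE' COLLATE latin1_german2_ci AS din2_ae;
```

Per the description above, the first query should report 20AC and 0081, and both comparisons in the second query should be true (1).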

10.9.3. Central European Character Sets

MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, and Poland.

  • cp1250 (Windows Central European) collations:

    • cp1250_general_ci (default)

  • cp852 (DOS Central European) collations:

    • cp852_general_ci (default)

  • keybcs2 (DOS Kamenicky Czech-Slovak) collations:

    • keybcs2_general_ci (default)

  • latin2 (ISO 8859-2 Central European) collations:

    • latin2_general_ci (default)

  • macce (Mac Central European) collations:

    • macce_general_ci (default)

10.9.4. South European and Middle East Character Sets

South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.

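  • armscii8 (ARMSCII-8 Armenian) collations:

    • armscii8_general_ci (default)

  • cp1256 (Windows Arabic) collations:

    • cp1256_general_ci (default)

  • geostd8 (GEOSTD8 Georgian) collations:

    • geostd8_general_ci (default)

  • greek (ISO 8859-7 Greek) collations:

    • greek_general_ci (default)

  • hebrew (ISO 8859-8 Hebrew) collations:

    • hebrew_general_ci (default)

  • latin5 (ISO 8859-9 Turkish) collations:

    • latin5_turkish_ci (default)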

10.9.5. Baltic Character Sets

The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.

  • cp1257 (Windows Baltic) collations:

    • cp1257_general_ci (default)

  • latin7 (ISO 8859-13 Baltic) collations:

    • latin7_general_ci (default)

10.9.6. Cyrillic Character Sets

The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, and Ukrainian languages.

  • cp1251 (Windows Cyrillic) collations:

    • cp1251_general_ci (default)

  • cp866 (DOS Russian) collations:

    • cp866_general_ci (default)

  • koi8r (KOI8-R Relcom Russian) collations:

    • koi8r_general_ci (default)

  • koi8u (KOI8-U Ukrainian) collations:

    • koi8u_general_ci (default)

10.9.7. Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters. See Section 10.9.7.1, “The cp932 Character Set”, for additional information about the cp932 and sjis character sets.

  • big5 (Big5 Traditional Chinese) collations:

    • big5_chinese_ci (default)

  • cp932 (SJIS for Windows Japanese) collations:

    • cp932_japanese_ci (default)

  • eucjpms (UJIS for Windows Japanese) collations:

    • eucjpms_japanese_ci (default)

  • euckr (EUC-KR Korean) collations:

    • euckr_korean_ci (default)

  • gb2312 (GB2312 Simplified Chinese) collations:

    • gb2312_chinese_ci (default)

  • gbk (GBK Simplified Chinese) collations:

    • gbk_chinese_ci (default)

  • sjis (Shift-JIS Japanese) collations:

    • sjis_japanese_ci (default)

  • tis620 (TIS620 Thai) collations:

    • tis620_thai_ci (default)

  • ujis (EUC-JP Japanese) collations:

    • ujis_japanese_ci (default)

10.9.7.1. The cp932 Character Set

Why is cp932 needed?

In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)

However, the meaning of “SHIFT JIS” as a descriptive term has become very vague and it often includes the extensions to Shift_JIS that are defined by various vendors.

For example, “SHIFT JIS” used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected IBM extended characters, and IBM extended characters.

Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:

  • MySQL automatically converts character sets.

  • Character sets are converted via Unicode (ucs2).

  • The sjis character set does not support the conversion of these extension characters.

  • There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).

The MySQL cp932 character set is designed to solve these problems. It is available as of MySQL 5.0.3.

Because MySQL supports character set conversion, it is important to separate IANA Shift_JIS and cp932 into two different character sets because they provide different conversion rules.

How does cp932 differ from sjis?

The cp932 character set differs from sjis in the following ways:

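For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.

Conversion to ucs2:

+------------------+-------------------------+--------------------------+
| sjis/cp932 Value | sjis -> ucs2 Conversion | cp932 -> ucs2 Conversion |
+------------------+-------------------------+--------------------------+
| 5C               | 005C                    | 005C                     |
| 7E               | 007E                    | 007E                     |
| 815C             | 2015                    | 2015                     |
| 815F             | 005C                    | FF3C                     |
| 8160             | 301C                    | FF5E                     |
| 8161             | 2016                    | 2225                     |
| 817C             | 2212                    | FF0D                     |
| 8191             | 00A2                    | FFE0                     |
| 8192             | 00A3                    | FFE1                     |
| 81CA             | 00AC                    | FFE2                     |
+------------------+-------------------------+--------------------------+

Conversion from ucs2:

+------------+-------------------------+--------------------------+
| ucs2 value | ucs2 -> sjis Conversion | ucs2 -> cp932 Conversion |
+------------+-------------------------+--------------------------+
| 005C       | 815F                    | 5C                       |
| 007E       | 7E                      | 7E                       |
| 00A2       | 8191                    | 3F                       |
| 00A3       | 8192                    | 3F                       |
| 00AC       | 81CA                    | 3F                       |
| 2015       | 815C                    | 815C                     |
| 2016       | 8161                    | 3F                       |
| 2212       | 817C                    | 3F                       |
| 2225       | 3F                      | 8161                     |
| 301C       | 8160                    | 3F                       |
| FF0D       | 3F                      | 817C                     |
| FF3C       | 3F                      | 815F                     |
| FF5E       | 3F                      | 8160                     |
| FFE0       | 3F                      | 8191                     |
| FFE1       | 3F                      | 8192                     |
| FFE2       | 3F                      | 81CA                     |
+------------+-------------------------+--------------------------+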
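Individual rows of these conversion tables can be reproduced with CONVERT() and HEX(). The sketch below takes the byte sequence 0x8160 as its example and assumes a server recent enough to include cp932 (MySQL 5.0.3 or later); the column aliases are arbitrary.

```sql
-- The same two bytes converted to ucs2 under each character set.
SELECT HEX(CONVERT(_sjis  0x8160 USING ucs2)) AS from_sjis,
       HEX(CONVERT(_cp932 0x8160 USING ucs2)) AS from_cp932;

-- The cp932 result (FF5E) converted back through each character set.
SELECT HEX(CONVERT(_ucs2 0xFF5E USING sjis))  AS to_sjis,
       HEX(CONVERT(_ucs2 0xFF5E USING cp932)) AS to_cp932;
```

According to the tables, the first query should give 301C and FF5E, and the second 3F (a question mark, meaning the character could not be converted) and 8160. This is the kind of loss that occurs when cp932 data passes through sjis.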

Users of any Japanese character sets should be aware that using --character-set-client-handshake (or --skip-character-set-client-handshake) has an important effect. See Section 5.2.1, “mysqld Command Options”.
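Whatever server options are in force, it can be useful to check which character sets a connection has actually ended up with before storing Japanese text. One way to do so, shown here only as a sketch:

```sql
-- Shows character_set_client, character_set_connection,
-- character_set_results, and the other character set variables.
SHOW VARIABLES LIKE 'character_set%';
```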