10.9. Character Sets and Collations That MySQL Supports

MySQL 5.0

previous page next page

10.9. Character Sets and Collations That MySQL Supports

MySQL supports 70+ collations for 30+ character sets. This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the allowable collations are listed.

You can always list the available character sets and their default collations with the SHOW CHARACTER SET statement:

mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset  | Description                 | Default collation   |
+----------+-----------------------------+---------------------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |
| dec8     | DEC West European           | dec8_swedish_ci     |
| cp850    | DOS West European           | cp850_general_ci    |
| hp8      | HP West European            | hp8_english_ci      |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |
| latin1   | cp1252 West European        | latin1_swedish_ci   |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |
| ascii    | US ASCII                    | ascii_general_ci    |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |
| cp1250   | Windows Central European    | cp1250_general_ci   |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |
| cp866    | DOS Russian                 | cp866_general_ci    |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |
| macce    | Mac Central European        | macce_general_ci    |
| macroman | Mac West European           | macroman_general_ci |
| cp852    | DOS Central European        | cp852_general_ci    |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |
| cp1256   | Windows Arabic              | cp1256_general_ci   |
| cp1257   | Windows Baltic              | cp1257_general_ci   |
| binary   | Binary pseudo charset       | binary              |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+

10.9.1. Unicode Character Sets

MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.

ucs2 (UCS-2 Unicode) collations:
- ucs2_bin
- ucs2_czech_ci
- ucs2_danish_ci
- ucs2_esperanto_ci
- ucs2_estonian_ci
- ucs2_general_ci (default)
- ucs2_hungarian_ci
- ucs2_icelandic_ci
- ucs2_latvian_ci
- ucs2_lithuanian_ci
- ucs2_persian_ci
- ucs2_polish_ci
- ucs2_roman_ci
- ucs2_romanian_ci
- ucs2_slovak_ci
- ucs2_slovenian_ci
- ucs2_spanish2_ci
- ucs2_spanish_ci
- ucs2_swedish_ci
- ucs2_turkish_ci
- ucs2_unicode_ci
utf8 (UTF-8 Unicode) collations:
- utf8_bin
- utf8_czech_ci
- utf8_danish_ci
- utf8_esperanto_ci
- utf8_estonian_ci
- utf8_general_ci (default)
- utf8_hungarian_ci
- utf8_icelandic_ci
- utf8_latvian_ci
- utf8_lithuanian_ci
- utf8_persian_ci
- utf8_polish_ci
- utf8_roman_ci
- utf8_romanian_ci
- utf8_slovak_ci
- utf8_slovenian_ci
- utf8_spanish2_ci
- utf8_spanish_ci
- utf8_swedish_ci
- utf8_turkish_ci
- utf8_unicode_ci

Note that in the ucs2_roman_ci and utf8_roman_ci collations, I and J compare as equals, and U and V compare as equals.

The ucs2_esperanto_ci and utf8_esperanto_ci collations were added in MySQL 5.0.13. The ucs2_hungarian_ci and utf8_hungarian_ci collations were added in MySQL 5.0.19.

MySQL implements the utf8_unicode_ci collation according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The following discussion uses utf8_unicode_ci, but it is also true for ucs2_unicode_ci.

Currently, the utf8_unicode_ci collation has only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese and some minority languages in Russia such as Udmurt, Tatar, Bashkir, and Mari.

The most significant feature in utf8_unicode_ci is that it supports expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages ‘ß’ is equal to ‘ss’.

utf8_general_ci is a legacy collation that does not support expansions. It can make only one-to-one comparisons between characters. This means that comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci.

For example, the following equalities hold in both utf8_general_ci and utf8_unicode_ci:

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci:

ß = ss

MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German and French, so there is no need to create special utf8 collations for these two languages.

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

utf8_swedish_ci, like other utf8 language-specific collations, is derived from utf8_unicode_ci with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

The utf8_spanish_ci and utf8_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, ‘ñ’ (n-tilde) is a separate letter between ‘n’ and ‘o’. In addition, for traditional Spanish, ‘ch’ is a separate letter between ‘c’ and ‘d’, and ‘ll’ is a separate letter between ‘l’ and ‘m’

10.9.2. West European Character Sets

Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.

ascii (US ASCII) collations:
- ascii_bin
- ascii_general_ci (default)
cp850 (DOS West European) collations:
- cp850_bin
- cp850_general_ci (default)
dec8 (DEC Western European) collations:
- dec8_bin
- dec8_swedish_ci (default)
hp8 (HP Western European) collations:
- hp8_bin
- hp8_english_ci (default)
latin1 (cp1252 West European) collations:
- latin1_bin
- latin1_danish_ci
- latin1_general_ci
- latin1_general_cs
- latin1_german1_ci
- latin1_german2_ci
- latin1_spanish_ci
- latin1_swedish_ci (default)
latin1 is the default character set. MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, but IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

The latin1_swedish_ci collation is the default that probably is used by the majority of MySQL customers. Although it is frequently said that it is based on the Swedish/Finnish collation rules, there are Swedes and Finns who disagree with this statement.

The latin1_german1_ci and latin1_german2_ci collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut für Normung (the German equivalent of ANSI). DIN-1 is called the “dictionary collation” and DIN-2 is called the “phone book collation.”
- latin1_german1_ci (dictionary) rules:
```
Ä = A
Ö = O
Ü = U
ß = s
```
- latin1_german2_ci (phone-book) rules:
```
Ä = AE
Ö = OE
Ü = UE
ß = ss
```
In the latin1_spanish_ci collation, ‘ñ’ (n-tilde) is a separate letter between ‘n’ and ‘o’.
macroman (Mac West European) collations:
- macroman_bin
- macroman_general_ci (default)
swe7 (7bit Swedish) collations:
- swe7_bin
- swe7_swedish_ci (default)

10.9.3. Central European Character Sets

MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, and Poland.

cp1250 (Windows Central European) collations:
- cp1250_bin
- cp1250_croatian_ci
- cp1250_czech_cs
- cp1250_general_ci (default)
cp852 (DOS Central European) collations:
- cp852_bin
- cp852_general_ci (default)
keybcs2 (DOS Kamenicky Czech-Slovak) collations:
- keybcs2_bin
- keybcs2_general_ci (default)
latin2 (ISO 8859-2 Central European) collations:
- latin2_bin
- latin2_croatian_ci
- latin2_czech_cs
- latin2_general_ci (default)
- latin2_hungarian_ci
macce (Mac Central European) collations:
- macce_bin
- macce_general_ci (default)

10.9.4. South European and Middle East Character Sets

South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.

armscii8 (ARMSCII-8 Armenian) collations:
- armscii8_bin
- armscii8_general_ci (default)
cp1256 (Windows Arabic) collations:
- cp1256_bin
- cp1256_general_ci (default)
geostd8 (GEOSTD8 Georgian) collations:
- geostd8_bin
- geostd8_general_ci (default)
greek (ISO 8859-7 Greek) collations:
- greek_bin
- greek_general_ci (default)
hebrew (ISO 8859-8 Hebrew) collations:
- hebrew_bin
- hebrew_general_ci (default)
latin5 (ISO 8859-9 Turkish) collations:
- latin5_bin
- latin5_turkish_ci (default)

10.9.5. Baltic Character Sets

The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.

cp1257 (Windows Baltic) collations:
- cp1257_bin
- cp1257_general_ci (default)
- cp1257_lithuanian_ci
latin7 (ISO 8859-13 Baltic) collations:
- latin7_bin
- latin7_estonian_cs
- latin7_general_ci (default)
- latin7_general_cs

10.9.6. Cyrillic Character Sets

The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, and Ukrainian languages.

cp1251 (Windows Cyrillic) collations:
- cp1251_bin
- cp1251_bulgarian_ci
- cp1251_general_ci (default)
- cp1251_general_cs
- cp1251_ukrainian_ci
cp866 (DOS Russian) collations:
- cp866_bin
- cp866_general_ci (default)
koi8r (KOI8-R Relcom Russian) collations:
- koi8r_bin
- koi8r_general_ci (default)
koi8u (KOI8-U Ukrainian) collations:
- koi8u_bin
- koi8u_general_ci (default)

10.9.7. Asian Character Sets

The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated. For example, the Chinese sets must allow for thousands of different characters. See Section 10.9.7.1, “The cp932 Character Set”, for additional information about the cp932 and sjis character sets.

big5 (Big5 Traditional Chinese) collations:
- big5_bin
- big5_chinese_ci (default)
cp932 (SJIS for Windows Japanese) collations:
- cp932_bin
- cp932_japanese_ci (default)
eucjpms (UJIS for Windows Japanese) collations:
- eucjpms_bin
- eucjpms_japanese_ci (default)
euckr (EUC-KR Korean) collations:
- euckr_bin
- euckr_korean_ci (default)
gb2312 (GB2312 Simplified Chinese) collations:
- gb2312_bin
- gb2312_chinese_ci (default)
gbk (GBK Simplified Chinese) collations:
- gbk_bin
- gbk_chinese_ci (default)
sjis (Shift-JIS Japanese) collations:
- sjis_bin
- sjis_japanese_ci (default)
tis620 (TIS620 Thai) collations:
- tis620_bin
- tis620_thai_ci (default)
ujis (EUC-JP Japanese) collations:
- ujis_bin
- ujis_japanese_ci (default)

10.9.7.1. The `cp932` Character Set

Why is cp932 needed?

In MySQL, the sjis character set corresponds to the Shift_JIS character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)

However, the meaning of “SHIFT JIS” as a descriptive term has become very vague and it often includes the extensions to Shift_JIS that are defined by various vendors.

For example, “SHIFT JIS” used in Japanese Windows environments is a Microsoft extension of Shift_JIS and its exact name is Microsoft Windows Codepage : 932 or cp932. In addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC selected — IBM extended characters, and IBM extended characters.

Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:

MySQL automatically converts character sets.
Character sets are converted via Unicode (ucs2).
The sjis character set does not support the conversion of these extension characters.
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).

The MySQL cp932 character set is designed to solve these problems. It is available as of MySQL 5.0.3.

Because MySQL supports character set conversion, it is important to separate IANA Shift_JIS and cp932 into two different character sets because they provide different conversion rules.

How does cp932 differ from sjis?

The cp932 character set differs from sjis in the following ways:

cp932 supports NEC special characters, NEC selected — IBM extended characters, and IBM selected characters.
Some cp932 characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back to cp932, one of the code points must be selected. For this “round trip conversion,” the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)

The conversion rule works like this:
- If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
- If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
- If the character is in both IBM selected characters and NEC selected — IBM extended characters, use the code point of IBM extended characters.
The table shown at http://www.microsoft.com/globaldev/reference/dbcs/932.htm provides information about the Unicode values of cp932 characters. For cp932 table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2) encoding. For table entries with an underlined two-digit value appears, there is a range of cp932 character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of the cp932 characters that begin with those digits.

The following links are of special interest. They correspond to the encodings for the following sets of characters:
- NEC special characters:
```
http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm
```
- NEC selected — IBM extended characters:
```
http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm
http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm
```
- IBM selected characters:
```
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm
```
Starting from version 5.0.3, cp932 supports conversion of user-defined characters in combination with eucjpms, and solves the problems with sjis/ujis conversion. For details, please refer to http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.

For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate these differences.

Conversion to ucs2:

`sjis`/`cp932` Value	`sjis` -> `ucs2` Conversion	`cp932` -> `ucs2` Conversion
5C	005C	005C
7E	007E	007E
815C	2015	2015
815F	005C	FF3C
8160	301C	FF5E
8161	2016	2225
817C	2212	FF0D
8191	00A2	FFE0
8192	00A3	FFE1
81CA	00AC	FFE2

Conversion from ucs2:

`ucs2` value	`ucs2` -> `sjis` Conversion	`ucs2` -> `cp932` Conversion
005C	815F	5C
007E	7E	7E
00A2	8191	3F
00A3	8192	3F
00AC	81CA	3F
2015	815C	815C
2016	8161	3F
2212	817C	3F
2225	3F	8161
301C	8160	3F
FF0D	3F	817C
FF3C	3F	815F
FF5E	3F	8160
FFE0	3F	8191
FFE1	3F	8192
FFE2	3F	81CA

Users of any Japanese character sets should be aware that using --character-set-client-handshake (or --skip-character-set-client-handshake) has an important effect. See Section 5.2.1, “mysqld Command Options”.

previous page start next page

10.9. Character Sets and Collations That MySQL Supports

MySQL 5.0