MySQL supports 70+ collations for 30+ character sets. This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the allowable collations are listed.
You can always list the available character sets and their default
collations with the SHOW CHARACTER SET
statement:
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset | Description | Default collation |
+----------+-----------------------------+---------------------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci |
| dec8 | DEC West European | dec8_swedish_ci |
| cp850 | DOS West European | cp850_general_ci |
| hp8 | HP West European | hp8_english_ci |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |
| latin1 | cp1252 West European | latin1_swedish_ci |
| latin2 | ISO 8859-2 Central European | latin2_general_ci |
| swe7 | 7bit Swedish | swe7_swedish_ci |
| ascii | US ASCII | ascii_general_ci |
| ujis | EUC-JP Japanese | ujis_japanese_ci |
| sjis | Shift-JIS Japanese | sjis_japanese_ci |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci |
| tis620 | TIS620 Thai | tis620_thai_ci |
| euckr | EUC-KR Korean | euckr_korean_ci |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci |
| greek | ISO 8859-7 Greek | greek_general_ci |
| cp1250 | Windows Central European | cp1250_general_ci |
| gbk | GBK Simplified Chinese | gbk_chinese_ci |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci |
| utf8 | UTF-8 Unicode | utf8_general_ci |
| ucs2 | UCS-2 Unicode | ucs2_general_ci |
| cp866 | DOS Russian | cp866_general_ci |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci |
| macce | Mac Central European | macce_general_ci |
| macroman | Mac West European | macroman_general_ci |
| cp852 | DOS Central European | cp852_general_ci |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci |
| cp1251 | Windows Cyrillic | cp1251_general_ci |
| cp1256 | Windows Arabic | cp1256_general_ci |
| cp1257 | Windows Baltic | cp1257_general_ci |
| binary | Binary pseudo charset | binary |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.
-
ucs2
(UCS-2 Unicode) collations:-
ucs2_bin
-
ucs2_czech_ci
-
ucs2_danish_ci
-
ucs2_esperanto_ci
-
ucs2_estonian_ci
-
ucs2_general_ci
(default) -
ucs2_hungarian_ci
-
ucs2_icelandic_ci
-
ucs2_latvian_ci
-
ucs2_lithuanian_ci
-
ucs2_persian_ci
-
ucs2_polish_ci
-
ucs2_roman_ci
-
ucs2_romanian_ci
-
ucs2_slovak_ci
-
ucs2_slovenian_ci
-
ucs2_spanish2_ci
-
ucs2_spanish_ci
-
ucs2_swedish_ci
-
ucs2_turkish_ci
-
ucs2_unicode_ci
-
-
utf8
(UTF-8 Unicode) collations:-
utf8_bin
-
utf8_czech_ci
-
utf8_danish_ci
-
utf8_esperanto_ci
-
utf8_estonian_ci
-
utf8_general_ci
(default) -
utf8_hungarian_ci
-
utf8_icelandic_ci
-
utf8_latvian_ci
-
utf8_lithuanian_ci
-
utf8_persian_ci
-
utf8_polish_ci
-
utf8_roman_ci
-
utf8_romanian_ci
-
utf8_slovak_ci
-
utf8_slovenian_ci
-
utf8_spanish2_ci
-
utf8_spanish_ci
-
utf8_swedish_ci
-
utf8_turkish_ci
-
utf8_unicode_ci
-
Note that in the ucs2_roman_ci
and
utf8_roman_ci
collations,
I
and J
compare as equals,
and U
and V
compare as
equals.
The ucs2_esperanto_ci
and
utf8_esperanto_ci
collations were added in
MySQL 5.0.13. The ucs2_hungarian_ci
and
utf8_hungarian_ci
collations were added in
MySQL 5.0.19.
MySQL implements the utf8_unicode_ci
collation according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The following discussion uses
utf8_unicode_ci
, but it is also true for
ucs2_unicode_ci
.
Currently, the utf8_unicode_ci
collation has
only partial support for the Unicode Collation Algorithm. Some
characters are not supported yet. Also, combining marks are not
fully supported. This affects primarily Vietnamese and some
minority languages in Russia such as Udmurt, Tatar, Bashkir, and
Mari.
The most significant feature in
utf8_unicode_ci
is that it supports
expansions; that is, when one character compares as equal to
combinations of other characters. For example, in German and
some other languages ‘ß
’ is
equal to ‘ss
’.
utf8_general_ci
is a legacy collation that
does not support expansions. It can make only one-to-one
comparisons between characters. This means that comparisons for
the utf8_general_ci
collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci
.
For example, the following equalities hold in both
utf8_general_ci
and
utf8_unicode_ci
:
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci
:
ß = s
Whereas this is true for utf8_unicode_ci
:
ß = ss
MySQL implements language-specific collations for the
utf8
character set only if the ordering with
utf8_unicode_ci
does not work well for a
language. For example, utf8_unicode_ci
works
fine for German and French, so there is no need to create
special utf8
collations for these two
languages.
utf8_general_ci
also is satisfactory for both
German and French, except that
‘ß
’ is equal to
‘s
’, and not to
‘ss
’. If this is acceptable for
your application, then you should use
utf8_general_ci
because it is faster.
Otherwise, use utf8_unicode_ci
because it is
more accurate.
utf8_swedish_ci
, like other
utf8
language-specific collations, is derived
from utf8_unicode_ci
with additional language
rules. For example, in Swedish, the following relationship
holds, which is not something expected by a German or French
speaker:
Ü = Y < Ö
The utf8_spanish_ci
and
utf8_spanish2_ci
collations correspond to
modern Spanish and traditional Spanish, respectively. In both
collations, ‘ñ
’ (n-tilde) is a
separate letter between ‘n
’ and
‘o
’. In addition, for traditional
Spanish, ‘ch
’ is a separate
letter between ‘c
’ and
‘d
’, and
‘ll
’ is a separate letter between
‘l
’ and
‘m
’
Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.
-
ascii
(US ASCII) collations:-
ascii_bin
-
ascii_general_ci
(default)
-
-
cp850
(DOS West European) collations:-
cp850_bin
-
cp850_general_ci
(default)
-
-
dec8
(DEC Western European) collations:-
dec8_bin
-
dec8_swedish_ci
(default)
-
-
hp8
(HP Western European) collations:-
hp8_bin
-
hp8_english_ci
(default)
-
-
latin1
(cp1252 West European) collations:-
latin1_bin
-
latin1_danish_ci
-
latin1_general_ci
-
latin1_general_cs
-
latin1_german1_ci
-
latin1_german2_ci
-
latin1_spanish_ci
-
latin1_swedish_ci
(default)
latin1
is the default character set. MySQL'slatin1
is the same as the Windowscp1252
character set. This means it is the same as the officialISO 8859-1
or IANA (Internet Assigned Numbers Authority)latin1
, but IANAlatin1
treats the code points between0x80
and0x9f
as “undefined,” whereascp1252
, and therefore MySQL'slatin1
, assign characters for those positions. For example,0x80
is the Euro sign. For the “undefined” entries incp1252
, MySQL translates0x81
to Unicode0x0081
,0x8d
to0x008d
,0x8f
to0x008f
,0x90
to0x0090
, and0x9d
to0x009d
.The
latin1_swedish_ci
collation is the default that probably is used by the majority of MySQL customers. Although it is frequently said that it is based on the Swedish/Finnish collation rules, there are Swedes and Finns who disagree with this statement.The
latin1_german1_ci
andlatin1_german2_ci
collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut für Normung (the German equivalent of ANSI). DIN-1 is called the “dictionary collation” and DIN-2 is called the “phone book collation.”-
latin1_german1_ci
(dictionary) rules:Ä = A Ö = O Ü = U ß = s
-
latin1_german2_ci
(phone-book) rules:Ä = AE Ö = OE Ü = UE ß = ss
In the
latin1_spanish_ci
collation, ‘ñ
’ (n-tilde) is a separate letter between ‘n
’ and ‘o
’. -
-
macroman
(Mac West European) collations:-
macroman_bin
-
macroman_general_ci
(default)
-
-
swe7
(7bit Swedish) collations:-
swe7_bin
-
swe7_swedish_ci
(default)
-
MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, and Poland.
-
cp1250
(Windows Central European) collations:-
cp1250_bin
-
cp1250_croatian_ci
-
cp1250_czech_cs
-
cp1250_general_ci
(default)
-
-
cp852
(DOS Central European) collations:-
cp852_bin
-
cp852_general_ci
(default)
-
-
keybcs2
(DOS Kamenicky Czech-Slovak) collations:-
keybcs2_bin
-
keybcs2_general_ci
(default)
-
-
latin2
(ISO 8859-2 Central European) collations:-
latin2_bin
-
latin2_croatian_ci
-
latin2_czech_cs
-
latin2_general_ci
(default) -
latin2_hungarian_ci
-
-
macce
(Mac Central European) collations:-
macce_bin
-
macce_general_ci
(default)
-
South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.
-
armscii8
(ARMSCII-8 Armenian) collations:-
armscii8_bin
-
armscii8_general_ci
(default)
-
-
cp1256
(Windows Arabic) collations:-
cp1256_bin
-
cp1256_general_ci
(default)
-
-
geostd8
(GEOSTD8 Georgian) collations:-
geostd8_bin
-
geostd8_general_ci
(default)
-
-
greek
(ISO 8859-7 Greek) collations:-
greek_bin
-
greek_general_ci
(default)
-
-
hebrew
(ISO 8859-8 Hebrew) collations:-
hebrew_bin
-
hebrew_general_ci
(default)
-
-
latin5
(ISO 8859-9 Turkish) collations:-
latin5_bin
-
latin5_turkish_ci
(default)
-
The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.
-
cp1257
(Windows Baltic) collations:-
cp1257_bin
-
cp1257_general_ci
(default) -
cp1257_lithuanian_ci
-
-
latin7
(ISO 8859-13 Baltic) collations:-
latin7_bin
-
latin7_estonian_cs
-
latin7_general_ci
(default) -
latin7_general_cs
-
The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, and Ukrainian languages.
-
cp1251
(Windows Cyrillic) collations:-
cp1251_bin
-
cp1251_bulgarian_ci
-
cp1251_general_ci
(default) -
cp1251_general_cs
-
cp1251_ukrainian_ci
-
-
cp866
(DOS Russian) collations:-
cp866_bin
-
cp866_general_ci
(default)
-
-
koi8r
(KOI8-R Relcom Russian) collations:-
koi8r_bin
-
koi8r_general_ci
(default)
-
-
koi8u
(KOI8-U Ukrainian) collations:-
koi8u_bin
-
koi8u_general_ci
(default)
-
The Asian character sets that we support include Chinese,
Japanese, Korean, and Thai. These can be complicated. For
example, the Chinese sets must allow for thousands of different
characters. See Section 10.9.7.1, “The cp932
Character Set”, for additional
information about the cp932
and
sjis
character sets.
-
big5
(Big5 Traditional Chinese) collations:-
big5_bin
-
big5_chinese_ci
(default)
-
-
cp932
(SJIS for Windows Japanese) collations:-
cp932_bin
-
cp932_japanese_ci
(default)
-
-
eucjpms
(UJIS for Windows Japanese) collations:-
eucjpms_bin
-
eucjpms_japanese_ci
(default)
-
-
euckr
(EUC-KR Korean) collations:-
euckr_bin
-
euckr_korean_ci
(default)
-
-
gb2312
(GB2312 Simplified Chinese) collations:-
gb2312_bin
-
gb2312_chinese_ci
(default)
-
-
gbk
(GBK Simplified Chinese) collations:-
gbk_bin
-
gbk_chinese_ci
(default)
-
-
sjis
(Shift-JIS Japanese) collations:-
sjis_bin
-
sjis_japanese_ci
(default)
-
-
tis620
(TIS620 Thai) collations:-
tis620_bin
-
tis620_thai_ci
(default)
-
-
ujis
(EUC-JP Japanese) collations:-
ujis_bin
-
ujis_japanese_ci
(default)
-
Why is cp932
needed?
In MySQL, the sjis
character set
corresponds to the Shift_JIS
character set
defined by IANA, which supports JIS X0201 and JIS X0208
characters. (See
http://www.iana.org/assignments/character-sets.)
However, the meaning of “SHIFT JIS” as a
descriptive term has become very vague and it often includes
the extensions to Shift_JIS
that are
defined by various vendors.
For example, “SHIFT JIS” used in Japanese Windows
environments is a Microsoft extension of
Shift_JIS
and its exact name is
Microsoft Windows Codepage : 932
or
cp932
. In addition to the characters
supported by Shift_JIS
,
cp932
supports extension characters such as
NEC special characters, NEC selected — IBM extended
characters, and IBM extended characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
-
MySQL automatically converts character sets.
-
Character sets are converted via Unicode (
ucs2
). -
The
sjis
character set does not support the conversion of these extension characters. -
There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL cp932
character set is designed
to solve these problems. It is available as of MySQL 5.0.3.
Because MySQL supports character set conversion, it is
important to separate IANA Shift_JIS
and
cp932
into two different character sets
because they provide different conversion rules.
How does cp932
differ
from sjis
?
The cp932
character set differs from
sjis
in the following ways:
-
cp932
supports NEC special characters, NEC selected — IBM extended characters, and IBM selected characters. -
Some
cp932
characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back tocp932
, one of the code points must be selected. For this “round trip conversion,” the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)The conversion rule works like this:
-
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
-
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
-
If the character is in both IBM selected characters and NEC selected — IBM extended characters, use the code point of IBM extended characters.
The table shown at http://www.microsoft.com/globaldev/reference/dbcs/932.htm provides information about the Unicode values of
cp932
characters. Forcp932
table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2
) encoding. For table entries with an underlined two-digit value appears, there is a range ofcp932
character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of thecp932
characters that begin with those digits.The following links are of special interest. They correspond to the encodings for the following sets of characters:
-
NEC special characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm
-
NEC selected — IBM extended characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm
-
IBM selected characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm
-
-
Starting from version 5.0.3,
cp932
supports conversion of user-defined characters in combination witheucjpms
, and solves the problems withsjis
/ujis
conversion. For details, please refer to http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
For some characters, conversion to and from
ucs2
is different for
sjis
and cp932
. The
following tables illustrate these differences.
Conversion to ucs2
:
sjis /cp932
Value
|
sjis ->
ucs2 Conversion
|
cp932 ->
ucs2 Conversion
|
5C | 005C | 005C |
7E | 007E | 007E |
815C | 2015 | 2015 |
815F | 005C | FF3C |
8160 | 301C | FF5E |
8161 | 2016 | 2225 |
817C | 2212 | FF0D |
8191 | 00A2 | FFE0 |
8192 | 00A3 | FFE1 |
81CA | 00AC | FFE2 |
Conversion from ucs2
:
ucs2 value
|
ucs2 ->
sjis Conversion
|
ucs2 ->
cp932 Conversion
|
005C | 815F | 5C |
007E | 7E | 7E |
00A2 | 8191 | 3F |
00A3 | 8192 | 3F |
00AC | 81CA | 3F |
2015 | 815C | 815C |
2016 | 8161 | 3F |
2212 | 817C | 3F |
2225 | 3F | 8161 |
301C | 8160 | 3F |
FF0D | 3F | 817C |
FF3C | 3F | 815F |
FF5E | 3F | 8160 |
FFE0 | 3F | 8191 |
FFE1 | 3F | 8192 |
FFE2 | 3F | 81CA |
Users of any Japanese character sets should be aware that
using --character-set-client-handshake
(or
--skip-character-set-client-handshake
) has an
important effect. See Section 5.2.1, “mysqld Command Options”.