This Frequently-Asked-Questions section comes from the experiences of MySQL's Support and Development groups, after handling many enquiries about CJK (Chinese Japanese Korean) issues.
You inserted CJK characters with INSERT
, but
when you do a SELECT
, they all look like
“?”. It usually is a setting in MySQL that doesn't
match the settings for the application program or the operating
system. These are common troubleshooting steps:
-
Find out: what version do you have? The statement
SELECT VERSION();
will tell you. This FAQ is for MySQL version 5, so some of the answers here will not apply to you if you have version 4.0 or 4.1. -
Find out: what character set is the database column really in? Too frequently, people think that the character set will be the same as the server's set (false), or the set used for display purposes (false). Make sure, by saying
SHOW CREATE TABLE tablename
, or better yet by saying this:SELECT character_set_name, collation_name FROM information_schema.columns WHERE table_schema = your_database_name AND table_name = your_table_name AND column_name = your_column_name;
-
Find out: what is the hexadecimal value?
SELECT HEX(your_column_name) FROM your_table_name;
If you see
3F
, then that really is the encoding for?
, so no wonder you see “?”. Probably this happened because of a problem converting a particular character from your client character set to the target character set. -
Find out: is a literal round trip possible, that is, if you select “literal” (or “_introducer hexadecimal-value”) do you get “literal” as a result? For example, with the Japanese Katakana Letter Pe, which looks like
ペ'
, and which exists in all CJK character sets, and which has the code point value (hexadecimal coding)0x30da
, enter:SELECT 'ペ' AS `ペ`; /* or SELECT _ucs2 0x30da; */
If the result doesn't look like
ペ
, a round trip failed. For bug reports, we might ask people to follow up withSELECT hex('ペ');
. Then we can see whether the client encoding is right. -
Find out: is it the browser or application? Just use mysql (the MySQL client program, which on Windows will be mysql.exe). If mysql displays correctly but your application doesn't, then your problem is probably “Settings”, but consult also the question about “Troubles with Access (or Perl) (or PHP) (etc.)” much later in this FAQ.
To find your settings, the statement you need here is
SHOW VARIABLES
. For example:mysql>
SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+ | Variable_name | Value | +--------------------------+----------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/local/mysql/share/mysql/charsets/ | +--------------------------+----------------------------------------+ 8 rows in set (0.03 sec)The above are typical character-set settings for an international-oriented client (notice the use of
utf8
Unicode) connected to a server in the West (latin1
is a West Europe character set and a default for MySQL).Although Unicode (usually the
utf8
variant on Unix, usually theucs2
variant on Windows) is better than “latin”, it's often not what your operating system utilities support best. Many Windows users find that a Microsoft character set, such ascp932
for Japanese Windows, is what's suitable.If you can't control the server settings, and you have no idea what your underlying computer is about, then try changing to a common character set for the country that you're in (
euckr
= Korea,gb2312
orgbk
= People's Republic of China,big5
= other China,sjis
orujis
orcp932
oreucjpms
= Japan,ucs2
orutf8
= anywhere). Usually it is only necessary to change the client and connection and results settings, and there is a simple statement which changes all three at once, namelySET NAMES
. For example:SET NAMES 'big5';
Once you get the correct setting, you can make it permanent by editing
my.cnf
ormy.ini
. For example you might add lines looking like this:[mysqld] character-set-server=big5 [client] default-character-set=big5
MySQL supports the two common variants of the GB (“Guojia
Biaozhun” or “National Standard”) character
sets which are official in the People's Republic of China:
gb2312
and gbk
. Sometimes
people try to insert gbk
characters into
gb2312
, and it works most of the time because
gbk
is a superset of gb2312
.
But eventually they try to insert a rarer Chinese character and it
doesn't work. (Example: bug #16072 in our bugs database,
http://bugs.mysql.com/bug.php?id=16072). So we'll
try to clarify here exactly what characters are legitimate in
gb2312
or gbk
, with
reference to the official documents. Please check these references
before reporting gb2312
or
gbk
bugs. We now have a graphic listing of the
gbk
characters, currently on the site of Mr
Alexander Barkov (MySQL's principal programmer for character set
issues). The chart is in order according to the
gb2312_chinese_ci
collation:
http://d.udm.net/bar/~bar/charts/gb2312_chinese_ci.html.
MySQL's gbk
is in reality “Microsoft code
page 936”. This differs from the official
gbk
for characters A1A4
(middle dot), A1AA
(em dash),
A6E0-A6F5
, and A8BB-A8C0
.
For a listing of the differences, see
http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h.
For a listing of gbk/Unicode mappings, see
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT.
For MySQL's listing of gbk characters, see
http://d.udm.net/bar/~bar/charts/gbk_chinese_ci.html.
MySQL supports the Big5 character set which is common in Hong Kong
and the Republic of China (Taiwan). MySQL's
big5
is in reality “Microsoft code page
950”, which is very similar to the original
big5
character set. This is a recent change,
starting with MySQL version 4.1.16 / 5.0.16. We made the change as
a result of a bug report, bug #12476 in our bugs database,
http://bugs.mysql.com/bug.php?id=12476 (title:
“Some big5 codes are still missing ...”). For
example, the following statements work in the current version of
MySQL, but not in old versions:
mysql>create table big5 (big5 char(1) character set big5);
Query OK, 0 rows affected (0.13 sec) mysql>insert into big5 values (0xf9dc);
Query OK, 1 row affected (0.00 sec) mysql>select * from big5;
+------+ | big5 | +------+ | 嫺 | +------+ 1 row in set (0.02 sec)
There is a feature request for adding HKSCS extensions (bug #13577 in our bugs database, http://bugs.mysql.com/bug.php?id=13577). People who need the extension may find the suggested patch for bug #13577 is of interest.
MySQL supports the sjis
,
ujis
, cp932
, and eucjpms
character sets, as well as Unicode. A common need is to convert
between character sets. For example, there might be a Unix server
(typically with sjis
or
ujis
) and a Windows client (typically with
cp932
). But conversions can seem to fail.
Here's why. In this conversion table, the ucs2
column is the source, and the
sjis
/cp932
/ujis
/eucjpms
columns are the destination, that is, what the hexadecimal result
would be if we used CONVERT(ucs2)
or if we
assigned a ucs2
column containing the value to
an
sjis
/cp932
/ujis
/eucjpms
column.
character name ucs2 sjis cp932 ujis eucjpms -------------- ---- ---- ---- ---- ------- BROKEN BAR 00A6 3F 3F 8FA2C3 3F FULLWIDTH BROKEN BAR FFE4 3F FA55 3F 8FA2 YEN SIGN 00A5 3F 3F 20 3F FULLWIDTH YEN SIGN FFE5 818F 818F A1EF 3F TILDE 007E 7E 7E 7E 7E OVERLINE 203E 3F 3F 20 3F HORIZONTAL BAR 2015 815C 815C A1BD A1BD EM DASH 2014 3F 3F 3F 3F REVERSE SOLIDUS 005C 815F 5C 5C 5C FULLWIDTH "" FF3C 3F 815F 3F A1C0 WAVE DASH 301C 8160 3F A1C1 3F FULLWIDTH TILDE FF5E 3F 8160 3F A1C1 DOUBLE VERTICAL LINE 2016 8161 3F A1C2 3F PARALLEL TO 2225 3F 8161 3F A1C2 MINUS SIGN 2212 817C 3F A1DD 3F FULLWIDTH HYPHEN-MINUS FF0D 3F 817C 3F A1DD CENT SIGN 00A2 8191 3F A1F1 3F FULLWIDTH CENT SIGN FFE0 3F 8191 3F A1F1 POUND SIGN 00A3 8192 3F A1F2 3F FULLWIDTH POUND SIGN FFE1 3F 8192 3F A1F2 NOT SIGN 00AC 81CA 3F A2CC 3F FULLWIDTH NOT SIGN FFE2 3F 81CA 3F A2CC
For example, consider this extract from the table:
ucs2 sjis cp932 ---- ---- ----- NOT SIGN 00AC 81CA 3F FULLWIDTH NOT SIGN FFE2 3F 81CA
It means “for NOT SIGN which is Unicode U+00AC, MySQL
converts to sjis code point 0x81CA and to cp932 code point
3F”. (3F
is question mark
(“?”) and is what we always use when we can't
convert.) Now, what should we do if we want to convert
sjis 81CA
to cp932
? Our
answer is: “?”. There are serious complaints about
this, many people would prefer a “loose” conversion,
so that 81CA (NOT SIGN)
in
sjis
becomes 81CA (FULLWIDTH NOT
SIGN)
in cp932
. We are considering
changing.
In SJIS the code for Yen Sign (¥
) is
5C
. In SJIS the code for Reverse Solidus
(\
) is 5C
. Since the above
statements are contradictory, confusion often results. Well, to
put it more seriously, some versions of Japanese character sets
(both sjis
and euc
) have
treated 5C
as a reverse solidus, also known as
a backslash, and others have treated it as a yen sign. There's
nothing we can do, except take sides: MySQL follows only one
version of the JIS (Japanese Industrial Standards) standard
description, and 5C is Reverse Solidus,
always. Should we make a separate character set where
5C
is Yen Sign, as another DBMS (Oracle) does?
We haven't decided. Certainly not in version 5.1 or 5.2. But if
people keep complaining about The Great Yen Sign Problem, that's
one possible solution.
MySQL supports the euckr
(Extended Unix Code
Korea) character set which is common in South Korea. In theory,
problems could arise because there have been several versions of
this character set. So far, only one problem has been noted, for
Korea's currency symbol. We use the “ASCII” variant
of EUC-KR, in which the code point 0x5c
is
REVERSE SOLIDUS, that is \
, instead of the
“KS-Roman” variant of EUC-KR, in which the code point
0x5c
is WON SIGN, that is “₩”.
You can't convert Unicode U+20A9
WON SIGN to
euckr
:
mysql>SELECT CONVERT('₩' USING euckr) AS euckr,
->HEX(CONVERT('₩' USING euckr)) AS hexeuckr;
+-------+----------+ | euckr | hexeuckr | +-------+----------+ | ? | 3F | +-------+----------+ 1 row in set (0.00 sec)
MySQL's graphic Korean chart is here: http://d.udm.net/bar/~bar/charts/euckr_korean_ci.html.
For illustration, we'll make a table with one Unicode
(ucs2
) column and one Chinese
(gb2312
) column.
mysql>CREATE TABLE ch
->(ucs2 CHAR(3) CHARACTER SET ucs2,
->gb2312 CHAR(3) CHARACTER SET gb2312);
Query OK, 0 rows affected (0.05 sec)
We'll try to place the rare character 汌
in
both columns.
mysql> INSERT INTO ch VALUES ('A汌B','A汌B');
Query OK, 1 row affected, 1 warning (0.00 sec)
Ah, there's a warning. Let's see what it is.
mysql> SHOW WARNINGS;
+---------+------+---------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------+
| Warning | 1265 | Data truncated for column 'gb2312' at row 1 |
+---------+------+---------------------------------------------+
1 row in set (0.00 sec)
So it's a warning about the gb2312 column only.
mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch; +-------+--------------+--------+-------------+ | ucs2 | HEX(ucs2) | gb2312 | HEX(gb2312) | +-------+--------------+--------+-------------+ | A汌B | 00416C4C0042 | A?B | 413F42 | +-------+--------------+--------+-------------+ 1 row in set (0.00 sec)
There are several things that need explanation here.
-
The fact that it's a “warning” rather than an “error” is characteristic of MySQL. We like to try to do what we can, to get the best fit, rather than give up.
-
The
汌
character isn't in thegb2312
character set. We described that problem earlier. -
Admittedly the message is misleading. We didn't “truncate” in this case, we replaced with a question mark. We've had a complaint about this message (bug #9337). But until we come up with something better, just accept that error/warning code 2165 can mean a variety of things.
-
With
SQL_MODE=TRADITIONAL
, there would be an error message, but instead of error 2165 you would see:ERROR 1406 (22001): Data too long for column 'gb2312' at row 1
.
You can't get things to look right with your special program for a
GUI front end or browser? Get a direct connection to the server
(with mysql on Unix or with
mysql.exe on Windows) and try the same query
there. If mysql is okay, then the trouble is probably that your
application interface needs some initializing. Use
mysql to tell you what character set(s) it
uses, by saying SHOW VARIABLES LIKE 'char%';
.
If it's Access, you're probably connecting with MyODBC. So you'll
want to check out the Reference Manual page for configuring an
ODBC DSN, and pay attention particularly to the illustrations for
“SQL command on connect”. You should enter
SET NAMES 'big5'
(supposing that you use
big5
) (you don't need a ;
here). If it's ASP, you might need to add SET
NAMES
in the code. Here is an example that has worked in
the past:
<%
Session.CodePage=0
Dim strConnection
Dim Conn
strConnection="driver={MySQL ODBC 3.51 Driver};server=yourserver;uid=yourusername;" \
& "pwd=yourpassword;database=yourdatabase;stmt=SET NAMES 'big5';"
Set Conn = Server.CreateObject(“ADODB.Connection”)
Conn.Open strConnection
%>
If it's PHP, here's a slightly different user suggestion:
<?php $link = mysql_connect($host,$usr,$pwd); mysql_select_db($db); if (mysql_error()) { print "Database ERROR: " . mysql_error(); } mysql_query("SET CHARACTER SET utf8", $link); mysql_query("SET NAMES 'utf8'", $link); ?>
In this case, the tipper used SET CHARACTER SET
statement to change character_set_client
and
character_set_result
, and used SET
NAMES
to change character_set_client
and character_set_connection
and
character_set_results
. So actually the
SET CHARACTER SET
statement is redundant.
(Incidentally, MySQL people encourage the use of the
mysqli
extension, rather than the
mysql
example that this example uses.) Another
thing to check with PHP is the browser assumptions. Sometimes a
meta tag change in the heading area suffices, for example:
<meta http-equiv="Content-Type" content="text/html;
charset=utf-8">
For Connector/J tips, see the manual section in the Connectors chapter titled “Using Character Sets and Unicode”.
In the old days, with MySQL Version 4.0, there was a single “global” character set for both server and client sides, and the decision was made by the server administrator. We changed that starting with MySQL Version 4.1. What happens now is a “handshake”. The MySQL Reference Manual describes it thus:
When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the
character_set_client
,character_set_results
, andcharacter_set_connection
system variables. In effect, the server performs aSET NAMES
operation using the character set name.
The effect of this is: you can't control the client character set
by saying mysqld --character-set-server=utf8
.
But some Asian customers said that they don't like that, they want
the MySQL 4.0 behaviour. So we added a mysqld
switch, --character-set-client-handshake
, which
(and this is the interesting part) can be turned off with
--skip-character-set-client-handshake
. If you
start mysqld with
--skip-character-set-client-handshake
, then the
behaviour is like this: When a client connects, it sends to the
server the name of the character set that it wants to use. The
server ignores it! Here is an illustration with the handshake
switch on or off. Pretend that your favourite server character set
is latin1
(of course that's unlikely in a CJK
area but it's MySQL's default if there's no
my.ini
or my.cnf
file).
Pretend that the client operates with utf8
because that's what the client's operating system supports. Start
the server with a default character set,
latin1
:
mysqld --character-set-server=latin1
Start the client with a default character set,
utf8
:
mysql --default-character-set=utf8
Show what the current settings are:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
Stop the client. Stop the server with mysqladmin. Start the server again but this time say “skip the handshake”:
mysqld --character-set-server=utf8 --skip-character-set-client-handshake
Start the client with a default character set,
utf8
, again. Show what the current settings
are, again:
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------+
| Variable_name | Value |
+--------------------------+----------------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
As you can see by comparing the SHOW VARIABLES
results, the server ignores the client's initial settings if the
--skip-character-set-client-handshake
is used.
There is a simple problem with LIKE
searches on
BINARY
and BLOB
columns: we
need to know the end of a character. With multi-byte character
sets, different characters might have different octet lengths. For
example, in utf8
, A
requires
one byte but ペ
requires three bytes.
Illustration:
+-------------------------+---------------------------+ | octet_length(_utf8 'A') | octet_length(_utf8 'ペ') | +-------------------------+---------------------------+ | 1 | 3 | +-------------------------+---------------------------+ 1 row in set (0.00 sec)
If we don't know where the first character ends, then we don't
know where the second character begins, and even simple-looking
searches like LIKE '_A%'
will fail. The
solution is to use a regular CJK character set in the first place,
or convert to a CJK character character set before comparing.
Incidentally, this is one reason why MySQL cannot allow encodings
of nonexistent characters: It must be strict about rejecting bad
input, or it won't know where characters end. There is a simple
problem with FULLTEXT
: we need to know the end
of a word. With Western writing this is rarely a problem because
there are spaces between words. With Asian writing this is not the
case. We could use half-good solutions, like saying that all Han
characters represent words, or depending on (Japanese) changes
from Katakana to Hiragana which are due to grammatical endings.
But the only good solution requires a dictionary, and we haven't
found a good open-source dictionary.
The list of CJK character sets may vary depending on version. For
example, the eucjpms
character set is a recent
addition. But the language name appears in the
DESCRIPTION
column for every entry in
information_schema.character_sets
. Therefore,
to get a current list of all the non-Unicode CJK character sets,
say:
mysql>SELECT character_set_name, description
->FROM information_schema.character_sets
->WHERE description LIKE '%Chinese%'
->OR description LIKE '%Japanese%'
->OR description LIKE '%Korean%'
->ORDER BY character_set_name;
+--------------------+---------------------------+ | character_set_name | description | +--------------------+---------------------------+ | big5 | Big5 Traditional Chinese | | cp932 | SJIS for Windows Japanese | | eucjpms | UJIS for Windows Japanese | | euckr | EUC-KR Korean | | gb2312 | GB2312 Simplified Chinese | | gbk | GBK Simplified Chinese | | sjis | Shift-JIS Japanese | | ujis | EUC-JP Japanese | +--------------------+---------------------------+ 8 rows in set (0.01 sec)
The majority of everyday-use Chinese/Japanese characters (simplified Chinese and basic non-halfwidth Kana Japanese) appear in all CJK character sets. Here is a stored procedure which accepts a UCS-2 Unicode character, converts it to all other character sets, and displays the results in hexadecimal.
DELIMITER // CREATE PROCEDURE p_convert (ucs2_char CHAR(1) CHARACTER SET ucs2) BEGIN CREATE TABLE tj (ucs2 CHAR(1) character set ucs2, utf8 CHAR(1) character set utf8, big5 CHAR(1) character set big5, cp932 CHAR(1) character set cp932, eucjpms CHAR(1) character set eucjpms, euckr CHAR(1) character set euckr, gb2312 CHAR(1) character set gb2312, gbk CHAR(1) character set gbk, sjis CHAR(1) character set sjis, ujis CHAR(1) character set ujis); INSERT INTO tj (ucs2) VALUES (ucs2_char); UPDATE tj SET utf8=ucs2, big5=ucs2, cp932=ucs2, eucjpms=ucs2, euckr=ucs2, gb2312=ucs2, gbk=ucs2, sjis=ucs2, ujis=ucs2; /* If there's a conversion problem, UPDATE will produce a warning. */ SELECT hex(ucs2) AS ucs2, hex(utf8) AS utf8, hex(big5) AS big5, hex(cp932) AS cp932, hex(eucjpms) AS eucjpms, hex(euckr) AS euckr, hex(gb2312) AS gb2312, hex(gbk) AS gbk, hex(sjis) AS sjis, hex(ujis) AS ujis FROM tj; DROP TABLE tj; END//
The input can be any single ucs2
character, or
it can be the code point value (hexadecimal representation) of
that character. Here's an example of what
P_CONVERT()
can do. An earlier answer said
that the character “Katakana Letter Pe” appears in
all CJK character sets. We know that the code point value of
Katakana Letter Pe is 0x30da
. (By the way, we
got the name from Unicode's list of ucs2 encodings and names:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.)
So we'll say:
mysql> CALL P_CONVERT(0x30da)//
+------+--------+------+-------+---------+-------+--------+------+------+------+
| ucs2 | utf8 | big5 | cp932 | eucjpms | euckr | gb2312 | gbk | sjis | ujis |
+------+--------+------+-------+---------+-------+--------+------+------+------+
| 30DA | E3839A | C772 | 8379 | A5DA | ABDA | A5DA | A5DA | 8379 | A5DA |
+------+--------+------+-------+---------+-------+--------+------+------+------+
1 row in set (0.04 sec)
Since none of the column values is 3F
, we know
that every conversion worked.
Sometimes people observe that the result of a
utf8_unicode_ci
or
ucs2_unicode_ci
search or ORDER
BY
sort is not what they think a native would expect.
Although we never rule out the chance that there is a bug, we have
found in the past that people are not correctly reading the
standard table of weights for the Unicode Collation Algorithm. So,
here's how to check whether we're using the right collation. The
correct table for MySQL is this one:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
This is different from the first table you will find by navigating
from the unicode.org
home page. MySQL
deliberately uses the older 4.0.0 “allkeys” table,
instead of the current 4.1.0 table. We are very wary about
changing ordering which affects indexes. Here is an example of a
problem that we handled recently, for a complaint in our bugs
database, http://bugs.mysql.com/bug.php?id=16526:
mysql>CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
Query OK, 0 rows affected (0.05 sec) mysql>INSERT INTO tj VALUES ('が'),('か');
Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 mysql>SELECT * FROM tj WHERE s1 = 'か';
+------+ | s1 | +------+ | が | | か | +------+ 2 rows in set (0.00 sec)
If your eyes are sharp, you'll see that the character in the first
result row isn't the one that we searched for. Why did MySQL
retrieve it? First we look for the Unicode code point value, which
is possible by reading the hexadecimal number for the
ucs2
version of the characters:
mysql> SELECT s1,HEX(CONVERT(s1 USING ucs2)) FROM tj;
+------+-----------------------------+
| s1 | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
| が | 304C |
| か | 304B |
+------+-----------------------------+
2 rows in set (0.03 sec)
Now let's search for 304B
and
304C
in the 4.0.0 allkeys table. We'll find
these lines:
304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA 304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM
The official Unicode names (following the “#” mark)
are informative; they tell us the Japanese syllabary (Hiragana),
the informal classification (letter instead of digit or
punctuation), and the Western identifier (KA
or
GA
, which happen to be voiced/unvoiced
components of the same letter pair). More importantly, the Primary
Weight (the first hexadecimal number inside the square brackets)
is 1E57
on both lines. For comparisons in both
searching and sorting, MySQL pays attention only to the Primary
Weight, it ignores all the other numbers. So now we know that
we're sorting が
and か
correctly according to the Unicode specification. If we wanted to
distinguish them, we'd have to use a
non-Unicode-Collation-Algorithm collation
(utf8_unicode_bin
or
utf8_general_ci
), or compare the
HEX()
values, or say ORDER BY
CONVERT(s1 USING sjis)
. Being correct “according
to Unicode” isn't enough, of course: the person who
submitted the bug was equally correct. We plan to add another
collation for Japanese according to the JIS X 4061 standard, where
voiced/unvoiced letters like KA/GA are distinguishable for
ordering purposes.
You're using Unicode (ucs2
or
utf8
), and you know what the Unicode sort order
is (see the previous question and answer), but MySQL still seems
to sort your table wrong? This might be easy.
mysql> SHOW CREATE TABLE t\G
******************** 1. row ******************
Table: t
Create Table: CREATE TABLE `t` (
`s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Hmm, the character set looks okay. Let's look at the
information_schema
for this column.
mysql>SELECT column_name, character_set_name, collation_name
->FROM information_schema.columns
->WHERE column_name = 's1'
->AND table_name = 't';
+-------------+--------------------+-----------------+ | column_name | character_set_name | collation_name | +-------------+--------------------+-----------------+ | s1 | ucs2 | ucs2_general_ci | +-------------+--------------------+-----------------+ 1 row in set (0.01 sec)
Oops, the collation is ucs2_general_ci
instead
of ucs2_unicode_ci
! Here's why:
mysql> SHOW CHARSET LIKE 'ucs2%';
+---------+---------------+-------------------+--------+
| Charset | Description | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)
For ucs2
and utf8
, the
“general” collation is the default. To specify that
you wanted a “unicode” collation, you should have
specified COLLATE ucs2_unicode_ci
.
Right. MySQL doesn't support supplementary characters (characters
which need more than 3 bytes with UTF-8). We support only what
Unicode calls the Basic Multilingual Plane / Plane
0. Only a few very rare Han characters are
supplementary; support for them is uncommon. This has led to bug
#12600 (http://bugs.mysql.com/bug.php?id=12600)
which we rejected as “not a bug”. With
utf8
, we must truncate an input string when we
encounter bytes that we don't understand. Otherwise, we wouldn't
know how long the bad multi-byte character is. A workaround is: if
you use ucs2
instead of
utf8
, then the bad characters will change to
question marks, but there will be no truncation. Or change the
data type to BLOB
or BINARY
,
which have no validity checking. In our bugs database, bug #14052
(http://bugs.mysql.com/bug.php?id=14052) is a
feature request for Wikipedia, asking us to support supplementary
characters extending ucs2
as well as
utf8
.
No. The term CJKV (Chinese Japanese Korean Vietnamese) refers to character sets which contain Han (originally Chinese) characters. MySQL has no plan to support the old Vietnamese script using Han characters. MySQL does of course support the modern Vietnamese script with Western characters. Another question that has come up (once) is a request for specialized Vietnamese collation, see http://bugs.mysql.com/bug.php?id=4745. We might do something about it someday, if many more requests arise.
Yes. We're changing the names of files and directories. Here's an
example, using mysql as root
under Linux:
-
Create a table with a name containing a Han character:
mysql>
CREATE TABLE tab_楮 (s1 INT);
Query OK, 0 rows affected (0.07 sec) -
Find out where MySQL stores database files:
mysql>
SHOW VARIABLES LIKE 'datadir';
+---------------+-----------------------+ | Variable_name | Value | +---------------+-----------------------+ | datadir | /usr/local/mysql/var/ | +---------------+-----------------------+ 1 row in set (0.00 sec) -
Look at the directory to see the MyISAM table files:
# cd /usr/local/mysql/var/dba # dir tab_* -rw-rw---- 1 root root 0 2006-05-16 10:22 [email protected] -rw-rw---- 1 root root 1024 2006-05-16 10:22 [email protected] -rw-rw---- 1 root root 8556 2006-05-16 10:22 [email protected]
Notice that MySQL has converted the Han character to
@
+ (Unicode value of Han character), that is,
to a purely ASCII representation. This solves an old problem, that
database files weren't portable, because some computers wouldn't
allow 楮
in a file name. Conversion to the new
file names will be automatic when you upgrade to version 5.1. This
should take care of bug #6313 in our bugs database,
http://bugs.mysql.com/bug.php?id=6313.
A Beijing-based group has produced a Simplified Chinese version for us under contract. It's complete and can be found on http://dev.mysql.com/doc/#chinese-5.1. It's up to date as of version 5.1.2. The Japanese manual can be downloaded from http://dev.mysql.com/doc/#japanese-4.1. It is still for version 4.1.
Check http://dev.mysql.com/user-groups/ to see if there is a MySQL user group near you. If there isn't: why not start one yourself? To contact a sales engineer in MySQL KK's Japan office:
Tel: +81(0)3-5326-3133 Fax: +81(0)3-5326-3001 Email: [email protected]
To see feature requests about language issues:
-
Go to http://bugs.mysql.com.
-
Click Advanced Search.
-
In the Severity dropdown box, click
S4 (Feature Request)
. -
In the list box beside Category, click
Character Sets
. -
Click the Search button.
You can post CJK questions, or see previous answers, on MySQL's “Character Sets, Collation, Unicode” forum: http://forums.mysql.com/list.php?103. MySQL plans to add native-language forums on http://forums.mysql.com/ very soon.