10.10. FAQ: MySQL Chinese, Japanese, and Korean Character Sets

MySQL 5.0

10.10. FAQ: MySQL Chinese, Japanese, and Korean Character Sets

This Frequently-Asked-Questions section comes from the experiences of MySQL's Support and Development groups, after handling many enquiries about CJK (Chinese Japanese Korean) issues.

10.10.1. SELECT shows non-Latin characters as "?"s. Why?

You inserted CJK characters with , but when you do a , they all look like “?”. It usually is a setting in MySQL that doesn't match the settings for the application program or the operating system. These are common troubleshooting steps:

  • Find out: what version do you have? The statement will tell you. This FAQ is for MySQL version 5, so some of the answers here will not apply to you if you have version 4.0 or 4.1.

  • Find out: what character set is the database column really in? Too frequently, people think that the character set will be the same as the server's set (false), or the set used for display purposes (false). Make sure, by saying , or better yet by saying this:

    SELECT character_set_name, collation_name
    FROM   information_schema.columns WHERE  table_schema = your_database_name
    AND    table_name = your_table_name
    AND    column_name = your_column_name;
    

  • Find out: what is the hexadecimal value?

    SELECT HEX(your_column_name)
    FROM your_table_name;
    

    If you see , then that really is the encoding for , so no wonder you see “?”. Probably this happened because of a problem converting a particular character from your client character set to the target character set.

  • Find out: is a literal round trip possible, that is, if you select “literal” (or “_introducer hexadecimal-value”) do you get “literal” as a result? For example, with the Japanese Katakana Letter Pe, which looks like , and which exists in all CJK character sets, and which has the code point value (hexadecimal coding) , enter:

    SELECT 'ペ' AS `ペ`;         /* or SELECT _ucs2 0x30da; */
    

    If the result doesn't look like , a round trip failed. For bug reports, we might ask people to follow up with . Then we can see whether the client encoding is right.

  • Find out: is it the browser or application? Just use mysql (the MySQL client program, which on Windows will be mysql.exe). If mysql displays correctly but your application doesn't, then your problem is probably “Settings”, but consult also the question about “Troubles with Access (or Perl) (or PHP) (etc.)” much later in this FAQ.

    To find your settings, the statement you need here is . For example:

    mysql> 
    +--------------------------+----------------------------------------+
    | Variable_name            | Value                                  |
    +--------------------------+----------------------------------------+
    | character_set_client     | utf8                                   |
    | character_set_connection | utf8                                   |
    | character_set_database   | latin1                                 |
    | character_set_filesystem | binary                                 |
    | character_set_results    | utf8                                   |
    | character_set_server     | latin1                                 |
    | character_set_system     | utf8                                   |
    | character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
    +--------------------------+----------------------------------------+
    8 rows in set (0.03 sec)
    

    The above are typical character-set settings for an international-oriented client (notice the use of Unicode) connected to a server in the West ( is a West Europe character set and a default for MySQL).

    Although Unicode (usually the variant on Unix, usually the variant on Windows) is better than “latin”, it's often not what your operating system utilities support best. Many Windows users find that a Microsoft character set, such as for Japanese Windows, is what's suitable.

    If you can't control the server settings, and you have no idea what your underlying computer is about, then try changing to a common character set for the country that you're in ( = Korea, or = People's Republic of China, = other China, or or or = Japan, or = anywhere). Usually it is only necessary to change the client and connection and results settings, and there is a simple statement which changes all three at once, namely . For example:

    SET NAMES 'big5';
    

    Once you get the correct setting, you can make it permanent by editing or . For example you might add lines looking like this:

    [mysqld]
    character-set-server=big5
    [client]
    default-character-set=big5
    

10.10.2. Troubles with GB character sets (Chinese)

MySQL supports the two common variants of the GB (“Guojia Biaozhun” or “National Standard”) character sets which are official in the People's Republic of China: and . Sometimes people try to insert characters into , and it works most of the time because is a superset of . But eventually they try to insert a rarer Chinese character and it doesn't work. (Example: bug #16072 in our bugs database, http://bugs.mysql.com/bug.php?id=16072). So we'll try to clarify here exactly what characters are legitimate in or , with reference to the official documents. Please check these references before reporting or bugs. We now have a graphic listing of the characters, currently on the site of Mr Alexander Barkov (MySQL's principal programmer for character set issues). The chart is in order according to the collation: http://d.udm.net/bar/~bar/charts/gb2312_chinese_ci.html. MySQL's is in reality “Microsoft code page 936”. This differs from the official for characters (middle dot), (em dash), , and . For a listing of the differences, see http://recode.progiciels-bpi.ca/showfile.html?name=dist/libiconv/gbk.h. For a listing of gbk/Unicode mappings, see http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT. For MySQL's listing of gbk characters, see http://d.udm.net/bar/~bar/charts/gbk_chinese_ci.html.

10.10.3. Troubles with big5 character set (Chinese)

MySQL supports the Big5 character set which is common in Hong Kong and the Republic of China (Taiwan). MySQL's is in reality “Microsoft code page 950”, which is very similar to the original character set. This is a recent change, starting with MySQL version 4.1.16 / 5.0.16. We made the change as a result of a bug report, bug #12476 in our bugs database, http://bugs.mysql.com/bug.php?id=12476 (title: “Some big5 codes are still missing ...”). For example, the following statements work in the current version of MySQL, but not in old versions:

mysql> 
Query OK, 0 rows affected (0.13 sec)

mysql> 
Query OK, 1 row affected (0.00 sec)

mysql> 
+------+
| big5 |
+------+
| 嫺  |
+------+
1 row in set (0.02 sec)

There is a feature request for adding HKSCS extensions (bug #13577 in our bugs database, http://bugs.mysql.com/bug.php?id=13577). People who need the extension may find the suggested patch for bug #13577 is of interest.

10.10.4. Troubles with character-set conversions (Japanese)

MySQL supports the , , , and eucjpms character sets, as well as Unicode. A common need is to convert between character sets. For example, there might be a Unix server (typically with or ) and a Windows client (typically with ). But conversions can seem to fail. Here's why. In this conversion table, the column is the source, and the /// columns are the destination, that is, what the hexadecimal result would be if we used or if we assigned a column containing the value to an /// column.

character name         ucs2 sjis  cp932 ujis   eucjpms
--------------         ---- ----  ----  ----   -------

BROKEN BAR             00A6   3F    3F  8FA2C3   3F
FULLWIDTH BROKEN BAR   FFE4   3F  FA55    3F   8FA2

YEN SIGN               00A5   3F    3F    20     3F
FULLWIDTH YEN SIGN     FFE5 818F  818F  A1EF     3F

TILDE                  007E   7E    7E    7E     7E
OVERLINE               203E   3F    3F    20     3F

HORIZONTAL BAR         2015 815C  815C  A1BD   A1BD
EM DASH                2014   3F    3F    3F     3F

REVERSE SOLIDUS        005C 815F    5C    5C     5C
FULLWIDTH ""           FF3C   3F  815F    3F   A1C0

WAVE DASH              301C 8160    3F  A1C1     3F
FULLWIDTH TILDE        FF5E   3F  8160    3F   A1C1

DOUBLE VERTICAL LINE   2016 8161    3F  A1C2     3F
PARALLEL TO            2225   3F  8161    3F   A1C2

MINUS SIGN             2212 817C    3F  A1DD     3F
FULLWIDTH HYPHEN-MINUS FF0D   3F  817C    3F   A1DD

CENT SIGN              00A2 8191    3F  A1F1     3F
FULLWIDTH CENT SIGN    FFE0   3F  8191    3F   A1F1

POUND SIGN             00A3 8192    3F  A1F2     3F
FULLWIDTH POUND SIGN   FFE1   3F  8192    3F   A1F2

NOT SIGN               00AC 81CA    3F  A2CC     3F
FULLWIDTH NOT SIGN     FFE2   3F  81CA    3F   A2CC

For example, consider this extract from the table:

                       ucs2 sjis cp932
                       ---- ---- -----
NOT SIGN               00AC 81CA    3F
FULLWIDTH NOT SIGN     FFE2   3F  81CA

It means “for NOT SIGN which is Unicode U+00AC, MySQL converts to sjis code point 0x81CA and to cp932 code point 3F”. ( is question mark (“?”) and is what we always use when we can't convert.) Now, what should we do if we want to convert to ? Our answer is: “?”. There are serious complaints about this, many people would prefer a “loose” conversion, so that in becomes in . We are considering changing.

10.10.5. The Great Yen Sign problem (Japanese)

In SJIS the code for Yen Sign () is . In SJIS the code for Reverse Solidus () is . Since the above statements are contradictory, confusion often results. Well, to put it more seriously, some versions of Japanese character sets (both and ) have treated as a reverse solidus, also known as a backslash, and others have treated it as a yen sign. There's nothing we can do, except take sides: MySQL follows only one version of the JIS (Japanese Industrial Standards) standard description, and 5C is Reverse Solidus, always. Should we make a separate character set where is Yen Sign, as another DBMS (Oracle) does? We haven't decided. Certainly not in version 5.1 or 5.2. But if people keep complaining about The Great Yen Sign Problem, that's one possible solution.

10.10.6. Troubles with euckr character set (Korean)

MySQL supports the (Extended Unix Code Korea) character set which is common in South Korea. In theory, problems could arise because there have been several versions of this character set. So far, only one problem has been noted, for Korea's currency symbol. We use the “ASCII” variant of EUC-KR, in which the code point is REVERSE SOLIDUS, that is , instead of the “KS-Roman” variant of EUC-KR, in which the code point is WON SIGN, that is “”. You can't convert Unicode WON SIGN to :

mysql> 
-> 
+-------+----------+
| euckr | hexeuckr |
+-------+----------+
| ?     | 3F       |
+-------+----------+
1 row in set (0.00 sec)

MySQL's graphic Korean chart is here: http://d.udm.net/bar/~bar/charts/euckr_korean_ci.html.

10.10.7. The “Data truncated” message

For illustration, we'll make a table with one Unicode () column and one Chinese () column.

mysql> 
    -> 
    -> 
Query OK, 0 rows affected (0.05 sec) 

We'll try to place the rare character in both columns.

mysql> 
Query OK, 1 row affected, 1 warning (0.00 sec) 

Ah, there's a warning. Let's see what it is.

mysql> 
+---------+------+---------------------------------------------+
| Level   | Code | Message                                     |
+---------+------+---------------------------------------------+
| Warning | 1265 | Data truncated for column 'gb2312' at row 1 |
+---------+------+---------------------------------------------+
1 row in set (0.00 sec)

So it's a warning about the gb2312 column only.

mysql> SELECT ucs2,HEX(ucs2),gb2312,HEX(gb2312) FROM ch;
+-------+--------------+--------+-------------+
| ucs2  | HEX(ucs2)    | gb2312 | HEX(gb2312) |
+-------+--------------+--------+-------------+
| A汌B | 00416C4C0042 | A?B    | 413F42      |
+-------+--------------+--------+-------------+
1 row in set (0.00 sec)

There are several things that need explanation here.

  1. The fact that it's a “warning” rather than an “error” is characteristic of MySQL. We like to try to do what we can, to get the best fit, rather than give up.

  2. The character isn't in the character set. We described that problem earlier.

  3. Admittedly the message is misleading. We didn't “truncate” in this case, we replaced with a question mark. We've had a complaint about this message (bug #9337). But until we come up with something better, just accept that error/warning code 2165 can mean a variety of things.

  4. With , there would be an error message, but instead of error 2165 you would see: .

10.10.8. Troubles with Access, Perl, PHP, etc.

You can't get things to look right with your special program for a GUI front end or browser? Get a direct connection to the server (with mysql on Unix or with mysql.exe on Windows) and try the same query there. If mysql is okay, then the trouble is probably that your application interface needs some initializing. Use mysql to tell you what character set(s) it uses, by saying . If it's Access, you're probably connecting with MyODBC. So you'll want to check out the Reference Manual page for configuring an ODBC DSN, and pay attention particularly to the illustrations for “SQL command on connect”. You should enter (supposing that you use ) (you don't need a here). If it's ASP, you might need to add in the code. Here is an example that has worked in the past:

<%
Session.CodePage=0
Dim strConnection
Dim Conn
strConnection="driver={MySQL ODBC 3.51 Driver};server=yourserver;uid=yourusername;" \ 
        & "pwd=yourpassword;database=yourdatabase;stmt=SET NAMES 'big5';"
Set Conn = Server.CreateObject(“ADODB.Connection”) 
Conn.Open strConnection
%> 

If it's PHP, here's a slightly different user suggestion:

<?php 
  $link = mysql_connect($host,$usr,$pwd); 
  mysql_select_db($db); 
  if (mysql_error()) { print "Database ERROR: " . mysql_error(); } 
  mysql_query("SET CHARACTER SET utf8", $link); 
  mysql_query("SET NAMES 'utf8'", $link); 
?>

In this case, the tipper used statement to change and , and used to change and and . So actually the statement is redundant. (Incidentally, MySQL people encourage the use of the extension, rather than the example that this example uses.) Another thing to check with PHP is the browser assumptions. Sometimes a meta tag change in the heading area suffices, for example:

For Connector/J tips, see the manual section in the Connectors chapter titled “Using Character Sets and Unicode”.

10.10.9. How can I get old MySQL 4.0 behaviour back?

In the old days, with MySQL Version 4.0, there was a single “global” character set for both server and client sides, and the decision was made by the server administrator. We changed that starting with MySQL Version 4.1. What happens now is a “handshake”. The MySQL Reference Manual describes it thus:

When a client connects, it sends to the server the name of the character set that it wants to use. The server uses the name to set the , , and system variables. In effect, the server performs a operation using the character set name.

The effect of this is: you can't control the client character set by saying . But some Asian customers said that they don't like that, they want the MySQL 4.0 behaviour. So we added a mysqld switch, , which (and this is the interesting part) can be turned off with . If you start mysqld with , then the behaviour is like this: When a client connects, it sends to the server the name of the character set that it wants to use. The server ignores it! Here is an illustration with the handshake switch on or off. Pretend that your favourite server character set is (of course that's unlikely in a CJK area but it's MySQL's default if there's no or file). Pretend that the client operates with because that's what the client's operating system supports. Start the server with a default character set, :

mysqld --character-set-server=latin1

Start the client with a default character set, :

mysql --default-character-set=utf8

Show what the current settings are:

mysql> 
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)

Stop the client. Stop the server with mysqladmin. Start the server again but this time say “skip the handshake”:

mysqld --character-set-server=utf8 --skip-character-set-client-handshake

Start the client with a default character set, , again. Show what the current settings are, again:

mysql> 
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | latin1                                 |
| character_set_connection | latin1                                 |
| character_set_database   | latin1                                 |
| character_set_filesystem | binary                                 |
| character_set_results    | latin1                                 |
| character_set_server     | latin1                                 |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)

As you can see by comparing the results, the server ignores the client's initial settings if the is used.

10.10.10. Why do some LIKE and FULLTEXT searches fail?

There is a simple problem with searches on and columns: we need to know the end of a character. With multi-byte character sets, different characters might have different octet lengths. For example, in , requires one byte but requires three bytes. Illustration:

        +-------------------------+---------------------------+
        | octet_length(_utf8 'A') | octet_length(_utf8 'ペ') |
        +-------------------------+---------------------------+
        |                       1 |                         3 |
        +-------------------------+---------------------------+
        1 row in set (0.00 sec)
      

If we don't know where the first character ends, then we don't know where the second character begins, and even simple-looking searches like will fail. The solution is to use a regular CJK character set in the first place, or convert to a CJK character character set before comparing. Incidentally, this is one reason why MySQL cannot allow encodings of nonexistent characters: It must be strict about rejecting bad input, or it won't know where characters end. There is a simple problem with : we need to know the end of a word. With Western writing this is rarely a problem because there are spaces between words. With Asian writing this is not the case. We could use half-good solutions, like saying that all Han characters represent words, or depending on (Japanese) changes from Katakana to Hiragana which are due to grammatical endings. But the only good solution requires a dictionary, and we haven't found a good open-source dictionary.

10.10.11. What CJK character sets are available?

The list of CJK character sets may vary depending on version. For example, the character set is a recent addition. But the language name appears in the column for every entry in . Therefore, to get a current list of all the non-Unicode CJK character sets, say:

mysql> 
    -> 
    -> 
    -> 
    -> 
    -> 
+--------------------+---------------------------+
| character_set_name | description               |
+--------------------+---------------------------+
| big5               | Big5 Traditional Chinese  |
| cp932              | SJIS for Windows Japanese |
| eucjpms            | UJIS for Windows Japanese |
| euckr              | EUC-KR Korean             |
| gb2312             | GB2312 Simplified Chinese |
| gbk                | GBK Simplified Chinese    |
| sjis               | Shift-JIS Japanese        |
| ujis               | EUC-JP Japanese           |
+--------------------+---------------------------+
8 rows in set (0.01 sec)

10.10.12. Is character X available in all character sets?

The majority of everyday-use Chinese/Japanese characters (simplified Chinese and basic non-halfwidth Kana Japanese) appear in all CJK character sets. Here is a stored procedure which accepts a UCS-2 Unicode character, converts it to all other character sets, and displays the results in hexadecimal.

DELIMITER //

CREATE PROCEDURE p_convert (ucs2_char CHAR(1) CHARACTER SET ucs2)
BEGIN

CREATE TABLE tj
             (ucs2 CHAR(1) character set ucs2,
              utf8 CHAR(1) character set utf8,
              big5 CHAR(1) character set big5,
              cp932 CHAR(1) character set cp932,
              eucjpms CHAR(1) character set eucjpms,
              euckr CHAR(1) character set euckr,
              gb2312 CHAR(1) character set gb2312,
              gbk CHAR(1) character set gbk,
              sjis CHAR(1) character set sjis,
              ujis CHAR(1) character set ujis);

INSERT INTO tj (ucs2) VALUES (ucs2_char);

UPDATE tj SET utf8=ucs2,
              big5=ucs2,
              cp932=ucs2,
              eucjpms=ucs2,
              euckr=ucs2,
              gb2312=ucs2,
              gbk=ucs2,
              sjis=ucs2,
              ujis=ucs2;

/* If there's a conversion problem, UPDATE will produce a warning. */

SELECT hex(ucs2) AS ucs2,
       hex(utf8) AS utf8,
       hex(big5) AS big5,
       hex(cp932) AS cp932,
       hex(eucjpms) AS eucjpms,
       hex(euckr) AS euckr,
       hex(gb2312) AS gb2312,
       hex(gbk) AS gbk,
       hex(sjis) AS sjis,
       hex(ujis) AS ujis
FROM tj;

DROP TABLE tj;

END//

The input can be any single character, or it can be the code point value (hexadecimal representation) of that character. Here's an example of what can do. An earlier answer said that the character “Katakana Letter Pe” appears in all CJK character sets. We know that the code point value of Katakana Letter Pe is . (By the way, we got the name from Unicode's list of ucs2 encodings and names: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.) So we'll say:

mysql> 
+------+--------+------+-------+---------+-------+--------+------+------+------+
| ucs2 | utf8   | big5 | cp932 | eucjpms | euckr | gb2312 | gbk  | sjis | ujis |
+------+--------+------+-------+---------+-------+--------+------+------+------+
| 30DA | E3839A | C772 | 8379  | A5DA    | ABDA  | A5DA   | A5DA | 8379 | A5DA |
+------+--------+------+-------+---------+-------+--------+------+------+------+
1 row in set (0.04 sec)

Since none of the column values is , we know that every conversion worked.

10.10.13. Strings don't sort correctly in Unicode (I)

Sometimes people observe that the result of a or search or sort is not what they think a native would expect. Although we never rule out the chance that there is a bug, we have found in the past that people are not correctly reading the standard table of weights for the Unicode Collation Algorithm. So, here's how to check whether we're using the right collation. The correct table for MySQL is this one: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. This is different from the first table you will find by navigating from the home page. MySQL deliberately uses the older 4.0.0 “allkeys” table, instead of the current 4.1.0 table. We are very wary about changing ordering which affects indexes. Here is an example of a problem that we handled recently, for a complaint in our bugs database, http://bugs.mysql.com/bug.php?id=16526:

mysql> 
Query OK, 0 rows affected (0.05 sec)

mysql> 
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> 
+------+
| s1   |
+------+
| が  |
| か  |
+------+
2 rows in set (0.00 sec)

If your eyes are sharp, you'll see that the character in the first result row isn't the one that we searched for. Why did MySQL retrieve it? First we look for the Unicode code point value, which is possible by reading the hexadecimal number for the version of the characters:

mysql> 
+------+-----------------------------+
| s1   | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
| が  | 304C                        |
| か  | 304B                        |
+------+-----------------------------+
2 rows in set (0.03 sec)

Now let's search for and in the 4.0.0 allkeys table. We'll find these lines:

304B  ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA
304C  ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM

The official Unicode names (following the “#” mark) are informative; they tell us the Japanese syllabary (Hiragana), the informal classification (letter instead of digit or punctuation), and the Western identifier ( or , which happen to be voiced/unvoiced components of the same letter pair). More importantly, the Primary Weight (the first hexadecimal number inside the square brackets) is on both lines. For comparisons in both searching and sorting, MySQL pays attention only to the Primary Weight, it ignores all the other numbers. So now we know that we're sorting and correctly according to the Unicode specification. If we wanted to distinguish them, we'd have to use a non-Unicode-Collation-Algorithm collation ( or ), or compare the values, or say . Being correct “according to Unicode” isn't enough, of course: the person who submitted the bug was equally correct. We plan to add another collation for Japanese according to the JIS X 4061 standard, where voiced/unvoiced letters like KA/GA are distinguishable for ordering purposes.

10.10.14. Strings don't sort correctly in Unicode (II)

You're using Unicode ( or ), and you know what the Unicode sort order is (see the previous question and answer), but MySQL still seems to sort your table wrong? This might be easy.

mysql> 
******************** 1. row ******************
Table: t
Create Table: CREATE TABLE `t` (
`s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Hmm, the character set looks okay. Let's look at the for this column.

mysql> 
    -> 
    -> 
    -> 
+-------------+--------------------+-----------------+
| column_name | character_set_name | collation_name  |
+-------------+--------------------+-----------------+
| s1          | ucs2               | ucs2_general_ci |
+-------------+--------------------+-----------------+
1 row in set (0.01 sec)

Oops, the collation is instead of ! Here's why:

mysql> 
+---------+---------------+-------------------+--------+
| Charset | Description   | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| ucs2    | UCS-2 Unicode | ucs2_general_ci   |      2 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)

For and , the “general” collation is the default. To specify that you wanted a “unicode” collation, you should have specified .

10.10.15. My supplementary characters get rejected

Right. MySQL doesn't support supplementary characters (characters which need more than 3 bytes with UTF-8). We support only what Unicode calls the Basic Multilingual Plane / Plane 0. Only a few very rare Han characters are supplementary; support for them is uncommon. This has led to bug #12600 (http://bugs.mysql.com/bug.php?id=12600) which we rejected as “not a bug”. With , we must truncate an input string when we encounter bytes that we don't understand. Otherwise, we wouldn't know how long the bad multi-byte character is. A workaround is: if you use instead of , then the bad characters will change to question marks, but there will be no truncation. Or change the data type to or , which have no validity checking. In our bugs database, bug #14052 (http://bugs.mysql.com/bug.php?id=14052) is a feature request for Wikipedia, asking us to support supplementary characters extending as well as .

10.10.16. Shouldn't it be CJKV (V for Vietnamese)?

No. The term CJKV (Chinese Japanese Korean Vietnamese) refers to character sets which contain Han (originally Chinese) characters. MySQL has no plan to support the old Vietnamese script using Han characters. MySQL does of course support the modern Vietnamese script with Western characters. Another question that has come up (once) is a request for specialized Vietnamese collation, see http://bugs.mysql.com/bug.php?id=4745. We might do something about it someday, if many more requests arise.

10.10.17. Will MySQL fix any CJK problems in version 5.1?

Yes. We're changing the names of files and directories. Here's an example, using mysql as under Linux:

  1. Create a table with a name containing a Han character:

    mysql> 
    Query OK, 0 rows affected (0.07 sec)
    

  2. Find out where MySQL stores database files:

    mysql> 
    +---------------+-----------------------+
    | Variable_name | Value                 |
    +---------------+-----------------------+
    | datadir       | /usr/local/mysql/var/ |
    +---------------+-----------------------+
    1 row in set (0.00 sec)
    

  3. Look at the directory to see the MyISAM table files:

    # cd /usr/local/mysql/var/dba
    # dir tab_*
    -rw-rw----  1 root root    0 2006-05-16 10:22 [email protected]
    -rw-rw----  1 root root 1024 2006-05-16 10:22 [email protected]
    -rw-rw----  1 root root 8556 2006-05-16 10:22 [email protected]
    

Notice that MySQL has converted the Han character to + (Unicode value of Han character), that is, to a purely ASCII representation. This solves an old problem, that database files weren't portable, because some computers wouldn't allow in a file name. Conversion to the new file names will be automatic when you upgrade to version 5.1. This should take care of bug #6313 in our bugs database, http://bugs.mysql.com/bug.php?id=6313.

10.10.18. When will MySQL translate the manual again?

A Beijing-based group has produced a Simplified Chinese version for us under contract. It's complete and can be found on http://dev.mysql.com/doc/#chinese-5.1. It's up to date as of version 5.1.2. The Japanese manual can be downloaded from http://dev.mysql.com/doc/#japanese-4.1. It is still for version 4.1.

10.10.19. Whom can I talk to?

Check http://dev.mysql.com/user-groups/ to see if there is a MySQL user group near you. If there isn't: why not start one yourself? To contact a sales engineer in MySQL KK's Japan office:

Tel: +81(0)3-5326-3133
Fax: +81(0)3-5326-3001
Email: [email protected]

To see feature requests about language issues:

  • Go to http://bugs.mysql.com.

  • Click Advanced Search.

  • In the Severity dropdown box, click .

  • In the list box beside Category, click .

  • Click the Search button.

You can post CJK questions, or see previous answers, on MySQL's “Character Sets, Collation, Unicode” forum: http://forums.mysql.com/list.php?103. MySQL plans to add native-language forums on http://forums.mysql.com/ very soon.