10.5. Collation Issues

MySQL 5.0

The following sections discuss various aspects of character set collations.

10.5.1. Using COLLATE in SQL Statements

With the COLLATE clause, you can override whatever the default collation is for a comparison. COLLATE may be used in various parts of SQL statements. Here are some examples:

  • With ORDER BY:

    SELECT k
    FROM t1
    ORDER BY k COLLATE latin1_german2_ci;
    
  • With AS:

    SELECT k COLLATE latin1_german2_ci AS k1
    FROM t1
    ORDER BY k1;
    
  • With GROUP BY:

    SELECT k
    FROM t1
    GROUP BY k COLLATE latin1_german2_ci;
    
  • With aggregate functions:

    SELECT MAX(k COLLATE latin1_german2_ci)
    FROM t1;
    
  • With DISTINCT:

    SELECT DISTINCT k COLLATE latin1_german2_ci
    FROM t1;
    
  • With WHERE:

    SELECT *
    FROM t1
    WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
    
    SELECT *
    FROM t1
    WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
    
  • With HAVING:

    SELECT k
    FROM t1
    GROUP BY k
    HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
    

10.5.2. COLLATE Clause Precedence

The COLLATE clause has high precedence (higher than ||), so the following two expressions are equivalent:

x || y COLLATE z
x || (y COLLATE z)
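
The same binding rule can be sketched with a comparison operator, since COLLATE also ranks above = in MySQL's operator precedence. The _latin1 introducers are an assumption added here so that the literals are valid for latin1_bin whatever the connection character set is.

```sql
-- COLLATE binds to the literal on its right, not to the whole comparison.
SELECT _latin1 'a' = _latin1 'A' COLLATE latin1_bin;

-- The same comparison with the grouping written out explicitly.
SELECT _latin1 'a' = (_latin1 'A' COLLATE latin1_bin);
```

Both statements should return 0, because latin1_bin compares character codes and therefore distinguishes 'a' from 'A'.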

10.5.3. BINARY Operator

The BINARY operator casts the string following it to a binary string. This is an easy way to force a comparison to be done byte by byte rather than character by character. BINARY also causes trailing spaces to be significant.

mysql> SELECT 'a' = 'A';
        -> 1
mysql> SELECT BINARY 'a' = 'A';
        -> 0
mysql> SELECT 'a' = 'a ';
        -> 1
mysql> SELECT BINARY 'a' = 'a ';
        -> 0

BINARY str is shorthand for CAST(str AS BINARY).
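
The longhand form can be tried directly. This is a small sketch rather than one of the original examples; it assumes a connection character set in which 'a' and 'A' are single-byte characters, such as latin1 or utf8.

```sql
-- CAST(... AS BINARY) yields a binary string, so the comparison is bytewise.
SELECT CAST('a' AS BINARY) = 'A';

-- The shorthand form of the same comparison.
SELECT BINARY 'a' = 'A';
```

Both statements should return 0.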

The BINARY attribute in character column definitions has a different effect. A character column defined with the BINARY attribute is assigned the binary collation of the column's character set. Every character set has a binary collation. For example, the binary collation for the latin1 character set is latin1_bin, so if the table default character set is latin1, these two column definitions are equivalent:

CHAR(10) BINARY
CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin
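
To see how the column attribute differs from the operator, the following sketch uses a hypothetical table named bin_attr_demo. It assumes that the column receives the latin1_bin collation, and that latin1_bin, like the other collations in MySQL 5.0, ignores trailing spaces in comparisons.

```sql
-- Hypothetical demo table; the table default character set is latin1.
CREATE TABLE bin_attr_demo (
    c CHAR(10) BINARY
) DEFAULT CHARACTER SET latin1;

INSERT INTO bin_attr_demo VALUES ('a');

-- Lettercase is significant under a binary collation.
SELECT c = _latin1 'A' FROM bin_attr_demo;

-- Trailing spaces are not significant here, unlike with the BINARY operator.
SELECT c = _latin1 'a ' FROM bin_attr_demo;

-- Reports the collation assigned to the column.
SELECT COLLATION(c) FROM bin_attr_demo;
```

The first comparison should return 0 and the second 1, and COLLATION(c) should report latin1_bin.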

The effect of BINARY as a column attribute differs from its effect prior to MySQL 4.1. Formerly, BINARY resulted in a column that was treated as a binary string. A binary string is a string of bytes that has no character set or collation, which differs from a non-binary character string that has a binary collation. For both types of strings, comparisons are based on the numeric values of the string unit, but for non-binary strings the unit is the character and some character sets allow multi-byte characters. See Section 11.4.2, “The BINARY and VARBINARY Types”.

The use of CHARACTER SET binary in the definition of a CHAR, VARCHAR, or TEXT column causes the column to be treated as a binary data type. For example, the following pairs of definitions are equivalent:

CHAR(10) CHARACTER SET binary
BINARY(10)

VARCHAR(10) CHARACTER SET binary
VARBINARY(10)

TEXT CHARACTER SET binary
BLOB
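
One way to check these equivalences on a given server is to create a table using the CHARACTER SET binary forms and inspect the definition that MySQL stores. The table name bin_type_demo is hypothetical and exists only for this sketch.

```sql
-- Each column uses the character-type spelling with CHARACTER SET binary.
CREATE TABLE bin_type_demo (
    c CHAR(10) CHARACTER SET binary,
    v VARCHAR(10) CHARACTER SET binary,
    t TEXT CHARACTER SET binary
);

-- Displays the column types as MySQL recorded them.
SHOW CREATE TABLE bin_type_demo;
```

If the equivalences hold, the reported column types should be the binary ones: BINARY(10), VARBINARY(10), and BLOB.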

10.5.4. Some Special Cases Where the Collation Determination Is Tricky

In the great majority of statements, it is obvious what collation MySQL uses to resolve a comparison operation. For example, in the following cases, it should be clear that the collation is the collation of column x:

SELECT x FROM T ORDER BY x;
SELECT x FROM T WHERE x = x;
SELECT DISTINCT x FROM T;

However, when multiple operands are involved, there can be ambiguity. For example:

SELECT x FROM T WHERE x = 'Y';

Should this query use the collation of the column x, or of the string literal 'Y'?

Standard SQL resolves such questions using what used to be called “coercibility” rules. Basically, this means: Both x and 'Y' have collations, so which collation takes precedence? This can be difficult to resolve, but the following rules cover most situations:

  • An explicit COLLATE clause has a coercibility of 0. (Not coercible at all.)

  • The concatenation of two strings with different collations has a coercibility of 1.

  • The collation of a column or a stored routine parameter or local variable has a coercibility of 2.

  • A “system constant” (the string returned by functions such as USER() or VERSION()) has a coercibility of 3.

  • A literal's collation has a coercibility of 4.

  • NULL or an expression that is derived from NULL has a coercibility of 5.

The preceding coercibility values are current as of MySQL 5.0.3. In MySQL 5.0 prior to 5.0.3, there is no system constant or ignorable coercibility. Functions such as USER() have a coercibility of 2 rather than 3, and literals have a coercibility of 3 rather than 4.

Those rules resolve ambiguities in the following manner:

  • Use the collation with the lowest coercibility value.

  • If both sides have the same coercibility, then it is an error if the collations aren't the same.

Examples:

column1 = 'A'                        Use collation of column1
column1 = 'A' COLLATE x              Use collation of 'A' COLLATE x
column1 COLLATE x = 'A' COLLATE y    Error
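
The error case, and the case where a single explicit collation wins, can be sketched with literals alone. The specific collations chosen here are illustrative assumptions; any two different collations of the same character set would serve for the first statement.

```sql
-- Both operands carry an explicit COLLATE (coercibility 0) with different
-- collations, so MySQL should reject this comparison with an
-- "Illegal mix of collations" error.
SELECT _latin1 'a' COLLATE latin1_swedish_ci = _latin1 'A' COLLATE latin1_bin;

-- Only the right-hand operand is explicit (coercibility 0 versus the
-- literal's higher value), so latin1_swedish_ci is used for the comparison.
SELECT _latin1 'a' = _latin1 'A' COLLATE latin1_swedish_ci;
```

The second statement should return 1, because latin1_swedish_ci is not case sensitive.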

The COERCIBILITY() function can be used to determine the coercibility of a string expression:

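mysql> SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
        -> 0
mysql> SELECT COERCIBILITY(VERSION());
        -> 3
mysql> SELECT COERCIBILITY('A');
        -> 4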

See Section 12.9.3, “Information Functions”.

10.5.5. Collations Must Be for the Right Character Set

Each character set has one or more collations, but each collation is associated with one and only one character set. Therefore, the following statement causes an error message because the latin2_bin collation is not legal with the latin1 character set:

mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
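
A quick way to avoid this error is to look up which collations belong to a character set before naming one. The sketch below relies on the convention that a collation name begins with the name of its character set.

```sql
-- List the collations whose names begin with latin1.
SHOW COLLATION LIKE 'latin1%';

-- latin1_bin belongs to latin1, so this statement should be accepted.
SELECT _latin1 'x' COLLATE latin1_bin;
```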

10.5.6. An Example of the Effect of Collation

Suppose that column X in table T has these column values:

Muffler
Müller
MX Systems
MySQL

Suppose also that the column values are retrieved using the following statement:

SELECT X FROM T ORDER BY X COLLATE collation_name;

The following table shows the resulting order of the values if we use ORDER BY with different collations:

latin1_swedish_ci    latin1_german1_ci    latin1_german2_ci
Muffler              Muffler              Müller
MX Systems           Müller               Muffler
Müller               MX Systems           MX Systems
MySQL                MySQL                MySQL

The character that causes the different sort orders in this example is the U with two dots over it (ü), which the Germans call “U-umlaut.”

  • The first column shows the result of the SELECT using the Swedish/Finnish collating rule, which says that U-umlaut sorts with Y.

  • The second column shows the result of the SELECT using the German DIN-1 rule, which says that U-umlaut sorts with U.

  • The third column shows the result of the SELECT using the German DIN-2 rule, which says that U-umlaut sorts with UE.
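
The comparison can be reproduced with a short script. This is a sketch under two assumptions: the column is declared as latin1 (the definition of T is not given above), and the client's connection character set matches the encoding in which the script is saved, so that the ü in 'Müller' reaches the server intact. The three collations are the latin1 collations for the Swedish/Finnish, German DIN-1, and German DIN-2 rules, respectively.

```sql
-- Assumed definition for the example table.
CREATE TABLE T (
    X VARCHAR(20) CHARACTER SET latin1
);

INSERT INTO T (X) VALUES ('Muffler'), ('Müller'), ('MX Systems'), ('MySQL');

-- Swedish/Finnish rule: U-umlaut sorts with Y.
SELECT X FROM T ORDER BY X COLLATE latin1_swedish_ci;

-- German DIN-1 rule: U-umlaut sorts with U.
SELECT X FROM T ORDER BY X COLLATE latin1_german1_ci;

-- German DIN-2 rule: U-umlaut sorts with UE.
SELECT X FROM T ORDER BY X COLLATE latin1_german2_ci;
```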