The following sections various aspects of character set collations.
With the COLLATE
clause, you can override
whatever the default collation is for a comparison.
COLLATE
may be used in various parts of SQL
statements. Here are some examples:
-
With
ORDER BY
:SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
-
With
AS
:SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
-
With
GROUP BY
:SELECT k FROM t1 GROUP BY k COLLATE latin1_german2_ci;
-
With aggregate functions:
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
-
With
DISTINCT
:SELECT DISTINCT k COLLATE latin1_german2_ci FROM t1;
-
With
WHERE
:SELECT * FROM t1 WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
-
With
HAVING
:SELECT k FROM t1 GROUP BY k HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
The COLLATE
clause has high precedence
(higher than ||
), so the following two
expressions are equivalent:
x || y COLLATE z x || (y COLLATE z)
The BINARY
operator casts the string
following it to a binary string. This is an easy way to force a
comparison to be done byte by byte rather than character by
character. BINARY
also causes trailing spaces
to be significant.
mysql>SELECT 'a' = 'A';
-> 1 mysql>SELECT BINARY 'a' = 'A';
-> 0 mysql>SELECT 'a' = 'a ';
-> 1 mysql>SELECT BINARY 'a' = 'a ';
-> 0
BINARY
str
is
shorthand for CAST(
str
AS
BINARY).
The BINARY
attribute in character column
definitions has a different effect. A character column defined
with the BINARY
attribute is assigned the
binary collation of the column's character set. Every character
set has a binary collation. For example, the binary collation
for the latin1
character set is
latin1_bin
, so if the table default character
set is latin1
, these two column definitions
are equivalent:
CHAR(10) BINARY CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin
The effect of BINARY
as a column attribute
differs from its effect prior to MySQL 4.1. Formerly,
BINARY
resulted in a column that was treated
as a binary string. A binary string is a string of bytes that
has no character set or collation, which differs from a
non-binary character string that has a binary collation. For
both types of strings, comparisons are based on the numeric
values of the string unit, but for non-binary strings the unit
is the character and some character sets allow multi-byte
characters. Section 11.4.2, “The BINARY
and VARBINARY
Types”.
The use of CHARACTER SET binary
in the
definition of a CHAR
,
VARCHAR
, or TEXT
column
causes the column to be treated as a binary data type. For
example, the following pairs of definitions are equivalent:
CHAR(10) CHARACTER SET binary BINARY(10) VARCHAR(10) CHARACTER SET binary VARBINARY(10) TEXT CHARACTER SET binary BLOB
In the great majority of statements, it is obvious what
collation MySQL uses to resolve a comparison operation. For
example, in the following cases, it should be clear that the
collation is the collation of column x
:
SELECT x FROM T ORDER BY x; SELECT x FROM T WHERE x = x; SELECT DISTINCT x FROM T;
However, when multiple operands are involved, there can be ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
Should this query use the collation of the column
x
, or of the string literal
'Y'
?
Standard SQL resolves such questions using what used to be
called “coercibility” rules. Basically, this means:
Both x
and 'Y'
have
collations, so which collation takes precedence? This can be
difficult to resolve, but the following rules cover most
situations:
-
An explicit
COLLATE
clause has a coercibility of 0. (Not coercible at all.) -
The concatenation of two strings with different collations has a coercibility of 1.
-
The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
-
A “system constant” (the string returned by functions such as
USER()
orVERSION()
) has a coercibility of 3. -
A literal's collation has a coercibility of 4.
-
NULL
or an expression that is derived fromNULL
has a coercibility of 5.
The preceding coercibility values are current as of MySQL 5.0.3.
In MySQL 5.0 prior to 5.0.3, there is no system
constant or ignorable coercibility. Functions such as
USER()
have a coercibility of 2 rather than
3, and literals have a coercibility of 3 rather than 4.
Those rules resolve ambiguities in the following manner:
-
Use the collation with the lowest coercibility value.
-
If both sides have the same coercibility, then it is an error if the collations aren't the same.
Examples:
column1 = 'A'
|
Use collation of column1 |
column1 = 'A' COLLATE x
|
Use collation of 'A' COLLATE x |
column1 COLLATE x = 'A' COLLATE y
|
Error |
The COERCIBILITY()
function can be used to
determine the coercibility of a string expression:
mysql>SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
-> 0 mysql>SELECT COERCIBILITY(VERSION());
-> 3 mysql>SELECT COERCIBILITY('A');
-> 4
Each character set has one or more collations, but each
collation is associated with one and only one character set.
Therefore, the following statement causes an error message
because the latin2_bin
collation is not legal
with the latin1
character set:
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
Suppose that column X
in table
T
has these latin1
column
values:
Muffler Müller MX Systems MySQL
Suppose also that the column values are retrieved using the following statement:
SELECT X FROM T ORDER BY X COLLATE collation_name
;
The following table shows the resulting order of the values if
we use ORDER BY
with different collations:
latin1_swedish_ci
|
latin1_german1_ci
|
latin1_german2_ci
|
Muffler | Muffler | Müller |
MX Systems | Müller | Muffler |
Müller | MX Systems | MX Systems |
MySQL | MySQL | MySQL |
The character that causes the different sort orders in this
example is the U with two dots over it
(ü
), which the Germans call
“U-umlaut.”
-
The first column shows the result of the
SELECT
using the Swedish/Finnish collating rule, which says that U-umlaut sorts with Y. -
The second column shows the result of the
SELECT
using the German DIN-1 rule, which says that U-umlaut sorts with U. -
The third column shows the result of the
SELECT
using the German DIN-2 rule, which says that U-umlaut sorts with UE.