Collation and International Terminology

SQL Server Setup

Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data. Collations that are used with character data types such as char and varchar dictate the code page and corresponding characters that can be represented for that data type. Whether you are installing a new instance of SQL Server, restoring a database backup, or connecting server to client databases, it is important that you understand the locale requirements, sorting order, and case and accent sensitivity of the data you will be working with.

When you select a collation for your server, database, column, or expression, you are assigning certain characteristics to your data that will affect the results of many operations in your database. For example, when you construct a query by using ORDER BY, the sort order of your result set might be dependent on the collation that is applied to the database or dictated in a COLLATE clause at the expression level of the query.

A collation can contain any or all of the following characteristics:

  • Case sensitivity
  • Accent sensitivity
  • Kana sensitivity
  • Width sensitivity

To best use collation support in SQL Server, you must understand the terms that are defined in this topic, and how they relate to the characteristics of your data.

Terms

Collation

A collation specifies the bit patterns that represent each character in a data set. Collations also determine the rules that sort and compare data. SQL Server supports storing objects with different collations in a single database. For non-Unicode columns, the collation setting specifies the code page for the data and which characters can be represented. Data that is moved between non-Unicode columns must be converted from the source code page to the destination code page.

Transact-SQL statement results can vary when the statement is run in the context of different databases that have different collation settings. If it is possible, use a standardized collation for your organization. This way, you do not have to explicitly specify the collation in every character or Unicode expression. If you must work with objects that have different collation and code page settings, code your queries to consider the rules of collation precedence. For more information, see Collation Precedence (Transact-SQL).

The characteristics of a collation are language sensitivity, case sensitivity, accent sensitivity, Kana-sensitive, and width sensitivity.

SQL Server collations include the following collation sets:

Windows collations

Windows collations define rules for storing character data that are based on an associated Windows system locale. For a Windows collation, comparison of non-Unicode data is implemented by using the same algorithm as Unicode data. The base Windows collation rules specify which alphabet or language is used when dictionary sorting is applied, and the code page that is used to store non-Unicode character data. Both Unicode and non-Unicode sorting are compatible with string comparisons in a particular version of Windows. This provides consistency across data types within SQL Server, and it also lets developers sort strings in their applications by using the same rules that are used by SQL Server. For more information, see Collation Settings in Setup.

Binary collations

Binary collations sort data based on the sequence of coded values that are defined by the locale and data type. They are case sensitive. A binary collation in SQL Server defines the locale and the ANSI code page to be used. This enforces a binary sort order. Because they are relatively simple, binary collations help improve application performance. For non-Unicode data types, data comparisons are based on the code points that are defined in the ANSI code page. For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN yield identical sorting results when they are used on Unicode data.

Binary collations in previous versions of SQL Server compared the first character as WCHAR, followed by a byte-by-byte comparison. For backwards compatibility reasons, existing binary collation semantics will not be changed.

Binary collations in this release of SQL Server include a set of pure code-point comparison collations. You can migrate to the binary collations to take advantage of true code-point comparisons. Youo should use binary collations to develop new applications. The BIN2 suffix identifies collation names that implement the new code-point collation semantics. There is also a comparison flag that corresponds to BIN2 for the new binary sort. For more information, see Guidelines for Using BIN and BIN2 Collations.

SQL Server

SQL Server collations (SQL_*) provide sort order compatibility with earlier versions of SQL Server. The dictionary sorting rules for non-Unicode data are not compatible with any sorting routine that is provided by Windows operating systems. However, sorting Unicode data is compatible with a particular version of Windows sorting rules. Because SQL Server collations use different comparison rules for non-Unicode and Unicode data, you will see different results for comparisons of the same data, depending on the underlying data type. For more information, see Using SQL Server Collations.

Note:
When you upgrade an English-language instance of SQL Server, SQL Server collations (SQL_*) can be specified for compatibility with existing instances of SQL Server. Because the default collation for an instance of SQL Server is defined during setup, it is important to specify collation settings carefully when the following are true:

  • Your application code depends on the behavior of previous SQL Server collations.
  • You are going to use SQL Server replication with existing installations of SQL Server 6.5 or SQL Server 7.0.
  • You must store character data that reflects multiple languages.

Setting collations are supported at the following levels of an instance of SQL Server:

Server-level collations

The default collation is set during SQL Server setup, and also becomes the default collation of the system databases. Note that Unicode-only collations cannot be selected during SQL Server setup because they are not supported as server-level collations.

After a collation has been assigned to any object other than a column or a database, you cannot change the collation except by dropping and re-creating the object. Instead of changing the default collation of an instance of SQL Server, you can specify the collation at the time you create a new database or database column.

To query the server collation for an instance of SQL Server, use the following Transact-SQL SERVERPROPERTY function:

Copy Code
SELECT CONVERT (varchar, SERVERPROPERTY('collation'))

To query the server for all available collations, use the following fn_helpcollations() built-in function:

Copy Code
SELECT * from ::fn_helpcollations()
Database-level collations

When a database is created, you can use the COLLATE clause of the CREATE DATABASE statement to specify the default database collation. If no collation is specified, the database is assigned the default collation of the model database. The collation of the model database is the same as the default collation of the instance of SQL Server.

The collation of a user database can be changed by using an ALTER DATABASE statement similar to the following:

Copy Code
ALTER DATABASE myDB COLLATE Greek_CS_AI

The current collation of a database can be retrieved by using a statement similar to the following:

Copy Code
SELECT CONVERT (varchar, DATABASEPROPERTYEX('database_name','collation'))
Note:
Altering the database-level collation does not affect column-level or expression-level collations.

Column-level collations

When you create a table, you can specify collations for each character-string column by using the COLLATE clause of the CREATE TABLE statement. If no collation is specified, the column is assigned the default collation of the database.

The collation of a column can be changed by using an ALTER TABLE statement similar to the following:

Copy Code
ALTER TABLE myTable ALTER COLUMN mycol NVARCHAR(10) COLLATE Greek_CS_AI
Expression-level collations

Expression-level collations are set when a statement is run, and they affect the way a result set is returned. This enables ORDER BY sort results to be locale-specific. Use a COLLATE clause such as the following to implement expression-level collations:

Copy Code
SELECT name FROM customer ORDER BY name COLLATE Latin1_General_CS_AI

Back to Top

Locale

A locale is a set of information that is associated with a place or a culture. This can include the name and identifier of the spoken language, the script that is used to write the language, and cultural conventions. Collations can be associated with one or more locales.

Back to Top

Unicode

If you store character data that reflects multiple languages, always use Unicode data types (nchar, nvarchar, and ntext) instead of the non-Unicode data types (char, varchar, and text).

Significant limitations are associated with non-Unicode data types. This is because a non-Unicode computer will be limited to use of a single code page. You might experience performance gain by using Unicode because fewer code-page conversions are required. Unicode collations must be selected individually at the database, column or expression level because they are not supported at the server level.

The code pages that a client uses are determined by your operating system settings. To set client code pages on the Windows 2000, Windows XP, Windows Server 2003, or Windows Server 2008 operating systems, use Regional Settings in Control Panel.

When you move data from a server to a client, your server collation might not be recognized by older client drivers. This can happen when you move data from a Unicode server to a non-Unicode client. Your best option might be to upgrade the client operating system so that the underlying system collations are updated. If your client has database client software installed, you might consider applying a service update to the database client software.

You can also try to use a different collation for the data on the server. Choose a collation that will map to a code page on the client. For more information, see the "Setting and Changing Collations" topic in SQL Server Books Online.

To evaluate issues that are related to using Unicode or non-Unicode data types, test your scenario to measure performance differences in your environment. It is a good practice to standardize the collation that is used on systems across your organization, and deploy Unicode servers and clients wherever possible. For more information about Unicode, see the Unicode Consortium Web site.

In many situations, SQL Server will interact with other servers or clients, and your organization might use multiple data access standards between applications and server instances. SQL Server clients are one of two main types:

  • Unicode clients that use OLE DB and Open Database Connectivity (ODBC) version 3.7 or a later version.
  • Non-Unicode clients that use DB-Library and ODBC version 3.6 or an earlier version.

The following table provides information about using multilingual data with various combinations of Unicode and non-Unicode servers.

Server Client Benefits or Limitations

Unicode

Unicode

Because Unicode data will be used throughout the system, this scenario provides the best performance and protection from corruption of retrieved data. This is the situation with ActiveX Data Objects (ADO), OLE DB, and ODBC version 3.7 or a later version.

Unicode

Non-Unicode

In this scenario, especially with connections between a server that is running a newer operating system and a client that is running an older version of SQL Server, or on an older operating system, there can be limitations or errors when you move data to a client computer. Unicode data on the server will try to map to a corresponding code page on the non-Unicode client to convert the data.

Non-Unicode

Unicode

This is not an ideal configuration for using multilingual data. You cannot write Unicode data to the non-Unicode server. Problems are likely to occur when data is sent to servers that are outside the server's code page.

Non-Unicode

Non-Unicode

This is a very limiting scenario for multilingual data. You can use only a single code page.

If you experience errors or difficulties when moving data from a server to a client, specifically from a Unicode server to a non-Unicode client, your server collation might not be recognized by older client drivers. In this situation, your best option might be to upgrade the client operating system so that the underlying system collations are updated. If your client has database client software installed, you might consider applying a service update to the database client software.

You can also try to use a different collation for the data on the server. Choose a collation that will map to a code page on the client. For more information about changing collations, see the "Setting and Changing Collations in SQL Server" topic in SQL Server Books Online. For more information about changing collations, see the "SQL Server Best Practices Collation Change" white paper. For more information about migrating non-Unicode datatypes to Unicode, see the "SQL Server Best Practices Migration to Unicode" whitepape.

Related Topics: Unicode Basics

Back to Top

Code Page

A code page is an ordered set of characters of a given script in which a numeric index, or code point value, is associated with each character. A Windows code page is commonly referred to as a character set or charset. Code pages are used to provide support for the character sets and keyboard layouts that are used by different Windows system locales.

All Windows Server 2008 Unicode collations are Unicode 5.0-based.

Back to Top

Data Types

A data type is a definition that specifies a range of values, the operations that can be performed on the values, and how the values are stored in memory. Defining data types lets SQL Server predictably manipulate data. Non-Unicode character data types are char, varchar, and text. Unicode data types are nchar, nvarchar, and ntext. We recommend that you use Unicode data types in your applications, especially if you store character data that reflects multiple languages.

For more information about migrating non-Unicode datatypes to Unicode, see the "SQL Server Best Practices Migration to Unicode" whitepaper.

Related Topics: Data Types (Database Engine), Data Types (Transact-SQL), Integration Services Data Types

Back to Top

Sort Order

Sort order specifies how data values are sorted. This affects the results of data comparison. Data is sorted by using collations, and it can be optimized by using indexes.

Related Topics: Windows Collation Sorting Styles, Indexes

Back to Top

See Also