Collations
The physical storage of character strings in Microsoft® SQL Server™ 2000 is controlled by collations. A collation specifies the bit patterns that represent each character and the rules by which characters are sorted and compared.
SQL Server 2000 supports objects that have different collations being stored in a single database. Separate SQL Server 2000 collations can be specified down to the level of columns. Each column in a table can be assigned different collations. Earlier versions of SQL Server support only one collation for each instance of SQL Server. All databases and database objects created in an instance of SQL Server 7.0 or earlier have the same collation.
How Character Data Is Stored
In a computer, characters are represented by different patterns of bits being either ON or OFF. There are 8 bits in a byte, and the 8 bits can be turned ON and OFF in 256 different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns. There are 16 bits in 2 bytes, and 16 bits can be turned ON and OFF in 65,536 unique patterns. A program that uses 2 bytes to represent each character can represent up to 65,536 characters.
Single-byte code pages are definitions of the characters mapped to each of the 256 bit patterns possible in a byte. Code pages define bit patterns for uppercase and lowercase characters, digits, symbols, and special characters such as !, @, #, or %. Each European language, such as German or Spanish, has its own single-byte code page. Although the bit patterns used to represent the Latin alphabet characters A through Z are the same for all the code pages, the bit patterns used to represent accented characters such as 'é' and 'á' vary from one code page to the next. If data is exchanged between computers running different code pages, all character data must be converted from the code page of the sending computer to the code page of the receiving computer. If the source data has extended characters that are not defined in the code page of the receiving computer, data is lost. When a database serves clients from many different countries, it is difficult to pick a code page for the database that contains all the extended characters required by all the client computers. Also, there is a lot of processing time spent doing the constant conversions from one code page to another.
Single-byte character sets are also inadequate to store all the characters used by many languages. For example, some Asian languages have thousands of characters, so must use two bytes per character. Double-byte character sets have been defined for these languages. Still, each of these languages have their own code page, and there are difficulties in transferring data from a computer running one double-byte code page to a computer running another.
SQL Server 2000 supports these code pages.
Code page | Description |
1258 | Vietnamese |
1257 | Baltic |
1256 | Arabic |
1255 | Hebrew |
1254 | Turkish |
1253 | Greek |
1252 | Latin1 (ANSI) |
1251 | Cyrillic |
1250 | Central European |
950 | Chinese (Traditional) |
949 | Korean |
936 | Chinese (Simplified) |
932 | Japanese |
874 | Thai |
850 | Multilingual (MS-DOS Latin1) |
437 | MS-DOS U.S. English |
To address the character conversion and interpretation problems that occur when trying to support multiple code pages in a network, the ISO standards organization and a group called the Unicode Consortium defined the Unicode standard. Unicode uses two bytes to store each character. Because 65,536 characters are enough to cover all the commonly used characters from all the languages of the world, all major languages are covered by the Unicode standard. If all the computers and programs in a network use Unicode, there is no need for any character conversions, each user will see exactly the same characters as all other users, and no loss of characters will occur.
On computers running Microsoft Windows® operating systems, the code page used by the operating system and Windows applications is defined by the Windows locale. The locale is selected when the operating system is installed. Windows applications interpret character data using the code page defined by the Windows locale. Windows applications also support wide character, or Unicode, data.
SQL Server 2000 supports two categories of character data types:
- The Unicode data types nchar, nvarchar, and ntext. These data types use the Unicode character representation. Code pages do not apply to these data types.
- The non-Unicode character data types char, varchar, and text. These data types use the character representation scheme defined in a single or double-byte code page.
For more information about how character data is stored and the operation of code pages, Unicode, and sort orders, see Developing International Software for Windows 95 and Windows NT 4.0 in the MSDN® page at http://msdn.microsoft.com.
International Data and Unicode
Storing data in multiple languages within one database is difficult to manage when using only character data and code pages. It is difficult to find one code page for the database that can store all the required language-specific characters. It is also difficult to ensure the proper translation of special characters when being read or updated by different clients running various code pages. Databases that support international clients should always use Unicode data types instead of non-Unicode data types.
For example, a database of customers in North America has to handle three major languages:
- Spanish names and addresses for Mexico.
- French names and addresses for Quebec.
- English names and addresses for the rest of Canada and the United States.
When you use only character columns and code pages, care has to be taken to ensure the database is installed with a code page that will handle the characters of all three languages. More care must be taken to ensure the proper translation of characters from one of the languages when read by clients running a code page for another language.
With the growth of the Internet, it is becoming more important than ever before to support many client computers running different locales. It is difficult to pick a code page for character data types that will support all of the characters required by a worldwide audience.
The easiest way to manage character data in international databases is to always use the Unicode nchar, nvarchar, and ntext data types in place of their non-Unicode equivalents (char, varchar, and text). If all the applications that work with international databases also use Unicode variables instead of non-Unicode variables, character translations do not have to be performed anywhere in the system. All clients will see exactly the same characters in data as all other clients.
For systems that could use single-byte code pages, the fact that Unicode data needs twice as much storage space as non-Unicode character data is at least partially offset by eliminating the need to convert extended characters between code pages. Systems using double-byte code pages do not have this issue.
SQL Server 2000 stores all textual system catalog data in columns having Unicode data types. The names of database objects such as tables, views, and stored procedures are stored in Unicode columns. This allows applications to be developed using only Unicode, which avoids all issues with code page conversions.
Sort Order
A sort order specifies the rules used by SQL Server to interpret, collate, compare, and present character data. For example, a sort order defines whether 'a' is less than, equal to, or greater than 'b'. A sort order defines whether the collation is case-sensitive, for example whether 'm' is equal or not equal to 'M'. It also defines if the collation is accent-sensitive, for example whether 'á' is equal or not equal to 'ä'.
SQL Server 2000 uses two sort orders with each collation, one for Unicode data and another for the character code page.
Many SQL Server collations use the same code page, but have a different sort order for the code page. This allows sites to choose:
- Whether characters will simply be sorted based on the numeric value represented by their bit patterns. Binary sorting is fastest because SQL Server does not have to make any adjustments and can use fast, simple sorting algorithms. Binary sort orders are always case-sensitive. Because the bit patterns in a code page may not be arranged in the same sequence as defined by the dictionary rules for a specific language, binary sorting sometimes does not sort characters in a sequence users who speak that language might expect.
- Between case-sensitive or case-insensitive behavior.
- Between accent-sensitive or accent-insensitive behavior.
See Also
Collation Options for International Support