7.3.2 Unicode Objects
These are the basic Unicode object types used for the Unicode implementation in Python:
- This type represents the storage type that Python uses internally as the basis for holding Unicode ordinals. Python's default builds use a 16-bit type for Py_UNICODE and store Unicode values internally as UCS2. It is also possible to build a UCS4 version of Python (most recent Linux distributions come with UCS4 builds of Python). These builds use a 32-bit type for Py_UNICODE and store Unicode data internally as UCS4. On platforms where wchar_t is available and compatible with the chosen Python Unicode build variant, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for either unsigned short (UCS2) or unsigned long (UCS4).
Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep this in mind when writing extensions or interfaces.
- This subtype of PyObject represents a Python Unicode object.
- This instance of PyTypeObject represents the Python Unicode type.
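The binary incompatibility between the two build variants follows from the width of the storage unit. The sketch below is only an illustration in standard C and does not use the Python headers: `ucs2_unit`, `ucs4_unit`, `buffer_bytes` and `element_offset` are hypothetical names, and the fixed-width types are assumed stand-ins for the 16-bit and 32-bit variants of Py_UNICODE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two build variants of Py_UNICODE. */
typedef uint16_t ucs2_unit;   /* UCS2 build: 16-bit storage unit */
typedef uint32_t ucs4_unit;   /* UCS4 build: 32-bit storage unit */

/* Bytes occupied by a buffer holding `length` units of width `unit_size`. */
size_t buffer_bytes(size_t length, size_t unit_size)
{
    return length * unit_size;
}

/* Byte offset of the unit at position `index` for a given unit width. */
size_t element_offset(size_t index, size_t unit_size)
{
    return index * unit_size;
}
```

For a buffer of five units, `buffer_bytes(5, sizeof(ucs2_unit))` is 10 while `buffer_bytes(5, sizeof(ucs4_unit))` is 20, and every element after the first sits at a different byte offset. An extension compiled against one variant would therefore misread buffers produced by the other.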
The following APIs are really C macros and can be used to do fast checks and to access internal read-only data of Unicode objects:
- Return true if the object o is a Unicode object or an instance of a Unicode subtype. Changed in version 2.2: Allowed subtypes to be accepted.
- Return true if the object o is a Unicode object, but not an instance of a subtype. New in version 2.2.
- Return the size of the object. o has to be a PyUnicodeObject (not checked).
- Return the size of the object's internal buffer in bytes. o has to be a PyUnicodeObject (not checked).
- Return a pointer to the internal Py_UNICODE buffer of the object. o has to be a PyUnicodeObject (not checked).
- Return a pointer to the internal buffer of the object. o has to be a PyUnicodeObject (not checked).
Unicode provides many different character properties. The most commonly needed ones are available through these macros, which are mapped to C functions depending on the Python configuration.
- Return 1 or 0 depending on whether ch is a whitespace character.
- Return 1 or 0 depending on whether ch is a lowercase character.
- Return 1 or 0 depending on whether ch is an uppercase character.
- Return 1 or 0 depending on whether ch is a titlecase character.
- Return 1 or 0 depending on whether ch is a linebreak character.
- Return 1 or 0 depending on whether ch is a decimal character.
- Return 1 or 0 depending on whether ch is a digit character.
- Return 1 or 0 depending on whether ch is a numeric character.
- Return 1 or 0 depending on whether ch is an alphabetic character.
- Return 1 or 0 depending on whether ch is an alphanumeric character.
These APIs can be used for fast direct character conversions:
- Return the character ch converted to lower case.
- Return the character ch converted to upper case.
- Return the character ch converted to title case.
- Return the character ch converted to a decimal positive integer. Return -1 if this is not possible. This macro does not raise exceptions.
- Return the character ch converted to a single digit integer. Return -1 if this is not possible. This macro does not raise exceptions.
- Return the character ch converted to a (positive) double. Return -1.0 if this is not possible. This macro does not raise exceptions.
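Because the numeric conversion macros signal failure with a sentinel value instead of raising an exception, the caller has to test each result itself. The following sketch shows that calling pattern in standard C, without the Python headers. `to_decimal_ascii` is a hypothetical stand-in for the decimal conversion macro that only recognizes the ASCII digits, and `parse_decimal` is likewise an illustrative helper, not part of the API.

```c
#include <stddef.h>

/* Hypothetical ASCII-only stand-in for the decimal conversion macro.
   Returns the digit value, or -1 if ch is not a decimal digit. */
int to_decimal_ascii(unsigned int ch)
{
    return (ch >= '0' && ch <= '9') ? (int)(ch - '0') : -1;
}

/* Parse a run of decimal digits into *out.
   Returns 0 on success, or -1 if the buffer is empty or contains a
   non-digit. Overflow is not checked in this sketch. */
int parse_decimal(const unsigned int *buf, size_t len, long *out)
{
    long value = 0;
    size_t i;

    if (len == 0)
        return -1;
    for (i = 0; i < len; i++) {
        int d = to_decimal_ascii(buf[i]);
        if (d < 0)          /* sentinel: nothing was raised, so check here */
            return -1;
        value = value * 10 + d;
    }
    *out = value;
    return 0;
}
```

The same check applies to the double-returning conversion, where the sentinel is -1.0 rather than -1.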
To create Unicode objects and access their basic sequence properties, use these APIs:
- Return value: New reference. Create a Unicode object from the Py_UNICODE buffer u of the given size. u may be NULL, which leaves the contents undefined; it is then the user's responsibility to fill in the needed data. The buffer is copied into the new object. If the buffer is not NULL, the return value might be a shared object. Therefore, modification of the resulting Unicode object is only allowed when u is NULL.
- Return a read-only pointer to the Unicode object's internal Py_UNICODE buffer, or NULL if unicode is not a Unicode object.
- Return the length of the Unicode object.
- Return value: New reference. Coerce an encoded object obj to a Unicode object and return a reference with incremented refcount.
  Coercion is done in the following way:
  - Unicode objects are passed back as-is with incremented refcount. Note: These cannot be decoded; passing a non-NULL value for encoding will result in a TypeError.
  - String and other char buffer compatible objects are decoded according to the given encoding and using the error handling defined by errors. Both can be NULL to have the interface use the default values (see the next section for details).
  - All other objects cause an exception.
  The API returns NULL if there was an error. The caller is responsible for decref'ing the returned objects.
- Return value: New reference. Shortcut for PyUnicode_FromEncodedObject(obj, NULL, "strict"), which is used throughout the interpreter whenever coercion to Unicode is needed.
If the platform supports wchar_t and provides a header file wchar.h, Python can interface directly to this type using the following functions. Support is optimized if Python's own Py_UNICODE type is identical to the system's wchar_t.
- Return value: New reference. Create a Unicode object from the wchar_t buffer w of the given size. Return NULL on failure.
- Copy the Unicode object contents into the wchar_t buffer w. At most size wchar_t characters are copied (excluding a possibly trailing 0-termination character). Return the number of wchar_t characters copied or -1 in case of an error. Note that the resulting wchar_t string may or may not be 0-terminated. It is the responsibility of the caller to make sure that the wchar_t string is 0-terminated in case this is required by the application.
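Since the copied wchar_t string is not guaranteed to be 0-terminated, a caller that needs a C wide string can reserve one extra slot and write the terminator itself. The sketch below models that pattern in standard C without the Python headers. `copy_wide` is a hypothetical stand-in that mimics the documented contract (copy at most size characters, return the count, no terminator guaranteed), and `copy_wide_terminated` is an illustrative wrapper, not part of the API.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the copy function described above: copies at
   most `size` characters from src (of length `len`) into w and returns
   the number copied, or -1 on error. It never writes a terminator. */
long copy_wide(const wchar_t *src, size_t len, wchar_t *w, size_t size)
{
    size_t n = (len < size) ? len : size;
    size_t i;

    if (src == NULL || w == NULL)
        return -1;
    for (i = 0; i < n; i++)
        w[i] = src[i];
    return (long)n;
}

/* Caller-side pattern: `cap` is the total capacity of w in wchar_t units.
   One slot is held back so the result can always be terminated. */
long copy_wide_terminated(const wchar_t *src, size_t len,
                          wchar_t *w, size_t cap)
{
    long n;

    if (cap == 0)
        return -1;
    n = copy_wide(src, len, w, cap - 1);   /* leave room for L'\0' */
    if (n < 0)
        return -1;
    w[n] = L'\0';                          /* terminate explicitly */
    return n;
}
```

With a four-slot buffer and a five-character source, the wrapper copies three characters and terminates the result, so truncation never leaves an unterminated string behind.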