Selene Unicode
There are four string-like ctype closures: unicode.ascii,
latin1, utf8 and grapheme.
ascii and latin1 are single-byte, like string,
but use the unicode table for upper/lower and character classes.
ascii does not touch bytes > 127 on upper/lower.
ascii or latin1 can be used as a locale-independent
replacement for string. (There is a compile switch to do this
automatically for ascii.)
utf8 operates on UTF-8 sequences as of RFC 3629:
1 byte 0-7F, 2 byte 80-7FF, 3 byte 800-FFFF,
4 byte 10000-10FFFF (not excluding UTF-16 surrogate characters).
Any byte not part of such a sequence is treated as its (Latin-1)
value.
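The decoding rule above can be sketched in a few lines. This is an illustration in Python, not the library's code: the name decode_lenient is hypothetical, and rejecting overlong forms is an assumption read off the per-length ranges.

```python
def decode_lenient(data):
    """Decode bytes to code points; stray bytes keep their Latin-1 value."""
    # (lead-byte mask, lead-byte pattern, sequence length, smallest code point)
    forms = [
        (0x80, 0x00, 1, 0x0),
        (0xE0, 0xC0, 2, 0x80),
        (0xF0, 0xE0, 3, 0x800),
        (0xF8, 0xF0, 4, 0x10000),
    ]
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        code, length = b, 1  # fallback: the byte's own (Latin-1) value
        for mask, pattern, n, low in forms:
            if (b & mask) != pattern:
                continue
            tail = data[i + 1:i + n]
            if len(tail) == n - 1 and all((t & 0xC0) == 0x80 for t in tail):
                value = b & ~mask & 0xFF
                for t in tail:
                    value = (value << 6) | (t & 0x3F)
                # Assumption: values outside the range for this length are
                # rejected. Surrogates (D800-DFFF) are deliberately kept.
                if low <= value <= 0x10FFFF:
                    code, length = value, n
            break
        out.append(code)
        i += length
    return out
```

A lone byte E9 therefore decodes to code point E9, the same value it has in Latin-1.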
grapheme takes care of grapheme clusters, which are characters
followed by “grapheme extension” characters (Mn+Me) like combining diacritical marks.
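A minimal Python sketch of this cluster definition, assuming a cluster is simply one character plus any directly following Mn or Me characters. The function name clusters is hypothetical, and the library's own segmentation may differ in edge cases.

```python
import unicodedata

def clusters(text):
    """Split text into a base character plus following Mn/Me characters."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Me"):
            out[-1] += ch  # extension: attach to the preceding cluster
        else:
            out.append(ch)  # anything else starts a new cluster
    return out
```

With this definition, 'o' followed by U+0308 (combining diaeresis) forms one cluster, while an extension at the very start of a string stands alone.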
Calls are:
len(str)
sub(str, start [,end=-1])
byte(str, start [,end=-1])
lower(str)
upper(str)
char(i [,j...])
reverse(str)
Same as in string: rep, format, dump
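To illustrate how the same call can give different answers per ctype, the following Python sketch counts one string three ways. It assumes that len counts bytes for ascii/latin1, code points for utf8 and clusters for grapheme. The helper lengths is hypothetical and is not part of the library.

```python
import unicodedata

def lengths(text):
    """Return (bytes, code points, clusters) for one string."""
    byte_len = len(text.encode("utf-8"))   # ascii / latin1 view
    code_len = len(text)                   # utf8 view
    cluster_len = sum(                     # grapheme view: count cluster starts
        1 for i, ch in enumerate(text)
        if i == 0 or unicodedata.category(ch) not in ("Mn", "Me")
    )
    return byte_len, code_len, cluster_len
```

For 'o' followed by U+0308 this yields 3 bytes, 2 code points and 1 cluster.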
TODO: use char count with %s in format? (sub does the job)
TODO: grapheme.byte: only first code of any cluster?
find, gfind, gsub: done, but need thorough testing …:
- ascii does not match them on any %class (but on ., literals and ranges).
- Behaviour of %class with class not ASCII is undefined.
- Frontier %f currently disabled -- should we?
Character classes are:
%a L* (Lu+Ll+Lt+Lm+Lo)
%c Cc
%d 0-9
%l Ll
%n N* (Nd+Nl+No, new)
%p P* (Pc+Pd+Ps+Pe+Pi+Pf+Po)
%s Z* (Zs+Zl+Zp) plus the controls 9-13 (HT,LF,VT,FF,CR)
%u Lu (also Lt ?)
%w %a+%n+Pc (e.g. '_')
%x 0-9A-Fa-f
%z the 0 byte
cf.
http://www.unicode.org/reports/tr44/tr44-6.html#Property_Values
http://unicode.org/Public/UNIDATA/UnicodeData.txt
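The category-based classes in the table can be approximated with Python's unicodedata module. This is only a sketch: in_class and CLASSES are hypothetical names, %x and %z are left out, and Python's Unicode tables may be newer than the ones the library was built against.

```python
import unicodedata

# Each entry gets the character and its general category (e.g. "Lu", "Nd").
CLASSES = {
    "a": lambda ch, cat: cat.startswith("L"),
    "c": lambda ch, cat: cat == "Cc",
    "d": lambda ch, cat: "0" <= ch <= "9",
    "l": lambda ch, cat: cat == "Ll",
    "n": lambda ch, cat: cat.startswith("N"),
    "p": lambda ch, cat: cat.startswith("P"),
    "s": lambda ch, cat: cat.startswith("Z") or 9 <= ord(ch) <= 13,
    "u": lambda ch, cat: cat == "Lu",
    "w": lambda ch, cat: cat[0] in "LN" or cat == "Pc",
}

def in_class(letter, ch):
    """Return True if the single character ch belongs to %letter."""
    return CLASSES[letter](ch, unicodedata.category(ch))
```

Note the difference between %d and %n: a non-ASCII digit such as U+0663 is in %n but not in %d.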
NOTE:
- find positions are in bytes for all ctypes!
- use ascii.sub to cut found ranges! This is a) faster, b) more reliable
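Why byte positions must be cut with a byte-oriented sub can be shown with a Python analogy, where bytes plays the role of the ascii ctype and str the role of utf8. The helper byte_span is hypothetical, and Python offsets are 0-based where Lua's are 1-based.

```python
def byte_span(haystack, needle):
    """Return the (start, end) byte offsets of needle in haystack, or None."""
    raw = haystack.encode("utf-8")
    pat = needle.encode("utf-8")
    start = raw.find(pat)
    if start < 0:
        return None
    return start, start + len(pat)

text = "na\u00efve caf\u00e9"          # two characters take 2 bytes each
start, end = byte_span(text, "caf\u00e9")

right = text.encode("utf-8")[start:end].decode("utf-8")  # cut by bytes
wrong = text[start:end]                                  # cut by code points
```

Here right is the word that was searched for, while wrong starts one character too late because the byte offset was applied to code points.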
utf8 behaviour:
- match is by codes, code ranges are supported
grapheme behaviour:
- any %class, ‘.’ and range match includes any following grapheme extensions.
- Ranges apply to single code points only.
- If a [] enumeration contains a grapheme cluster, this matches only the exact same cluster.
- However, a literal single ‘o’ standalone or in an [] enumeration will match just that ‘o’, even if it has an extension in the string. Consequently, grapheme match positions are not always cluster positions.
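The last point can be made concrete with a small Python sketch. It assumes the simple "base plus Mn/Me" cluster definition used in this document. The helpers cluster_starts and splits_cluster are hypothetical, and indices here are code point offsets rather than bytes.

```python
import unicodedata

def cluster_starts(text):
    """Indices at which a grapheme cluster (base + Mn/Me) begins."""
    return [i for i, ch in enumerate(text)
            if i == 0 or unicodedata.category(ch) not in ("Mn", "Me")]

def splits_cluster(text, literal):
    """True if the first literal match starts or ends inside a cluster."""
    pos = text.find(literal)
    if pos < 0:
        return False
    end = pos + len(literal)
    boundaries = cluster_starts(text) + [len(text)]
    return pos not in boundaries or end not in boundaries
```

For the string 'x', 'o', U+0308, 'y' a literal 'o' matches at index 1 and ends at index 2, which lies between the 'o' and its extension, so the match end is not a cluster boundary.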