Selene Unicode
There are four string-like ctype closures: unicode.ascii, latin1, utf8 and grapheme.
ascii and latin1 are single-byte like string, but use the unicode table for upper/lower and character classes. ascii does not touch bytes > 127 on upper/lower.
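
The difference between the two single-byte ctypes can be illustrated with a small Python sketch. This is not the library's implementation; the function names ascii_upper and latin1_upper are hypothetical, and leaving a byte unchanged when its uppercase form does not fit into one Latin-1 byte is an assumption.

```python
def ascii_upper(data: bytes) -> bytes:
    """Uppercase a-z only; bytes > 127 pass through untouched."""
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


def latin1_upper(data: bytes) -> bytes:
    """Uppercase every byte using the Unicode case table."""
    out = bytearray()
    for b in data:
        up = chr(b).upper()
        # Assumption: keep the byte when the uppercase form is not a
        # single character in the Latin-1 range (e.g. 0xDF or 0xFF).
        out.append(ord(up) if len(up) == 1 and ord(up) <= 0xFF else b)
    return bytes(out)
```

For the Latin-1 bytes of "café", ascii_upper leaves the final byte 0xE9 alone, while latin1_upper maps it to 0xC9.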
ascii or latin1 can be used as locale-independent string replacement. (There is a compile switch to do this automatically for ascii.)
utf8 operates on UTF-8 sequences as of RFC 3629: 1 byte 0-7F, 2 byte 80-7FF, 3 byte 800-FFFF, 4 byte 10000-10FFFF (not excluding UTF-16 surrogate characters). Any byte not part of such a sequence is treated as its (Latin-1) value.
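
The decoding rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the library's code: the name decode_lenient is hypothetical, and rejecting overlong forms (values below the minimum for their length) is an assumption derived from the ranges listed above.

```python
def decode_lenient(data: bytes) -> list[int]:
    """Decode bytes to code points; stray bytes count as Latin-1 values."""
    # (lead-byte mask, lead-byte pattern, continuation bytes, minimum code)
    forms = [
        (0xE0, 0xC0, 1, 0x80),
        (0xF0, 0xE0, 2, 0x800),
        (0xF8, 0xF0, 3, 0x10000),
    ]
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        code, width = b, 1  # default: ASCII or a stray byte taken as Latin-1
        for mask, pattern, extra, minimum in forms:
            if b & mask != pattern:
                continue
            tail = data[i + 1 : i + 1 + extra]
            if len(tail) == extra and all(t & 0xC0 == 0x80 for t in tail):
                value = b & ~mask & 0xFF  # payload bits of the lead byte
                for t in tail:
                    value = (value << 6) | (t & 0x3F)
                # Surrogates are deliberately not excluded here.
                if minimum <= value <= 0x10FFFF:
                    code, width = value, 1 + extra
            break
        out.append(code)
        i += width
    return out
```

With this sketch the UTF-8 bytes C3 A9 decode to the single code E9, and a lone byte E9 also yields E9 through the Latin-1 fallback.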
grapheme takes care of grapheme clusters, which are characters followed by “grapheme extension” characters (Mn+Me) like combining diacritical marks.
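
A minimal Python sketch of this cluster definition, using the standard unicodedata module. The function name clusters is hypothetical, and treating an extension character at the very start of a string as a cluster of its own is an assumption.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Split text into a base character plus its following Mn/Me marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch) in ("Mn", "Me"):
            out[-1] += ch  # attach the extension to the preceding cluster
        else:
            out.append(ch)  # start a new cluster
    return out
```

For example, the string "o" + U+0308 + "k" has three code points but splits into two clusters.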
Calls are:
len(str)
sub(str, start [,end=-1])
byte(str, start [,end=-1])
lower(str)
upper(str)
char(i [,j...])
reverse(str)
Same as in string: rep, format, dump
TODO: use char count with %s in format? (sub does the job)
TODO: grapheme.byte: only first code of any cluster?
find, gfind, gsub: done, but need thorough testing …:
- ascii does not match them on any %class (but on ., literals and ranges).
- Behaviour of %class with class not ASCII is undefined.
- Frontier %f currently disabled; should we?
Character classes are:
%a L* (Lu+Ll+Lt+Lm+Lo)
%c Cc
%d 0-9
%l Ll
%n N* (Nd+Nl+No, new)
%p P* (Pc+Pd+Ps+Pe+Pi+Pf+Po)
%s Z* (Zs+Zl+Zp) plus the controls 9-13 (HT,LF,VT,FF,CR)
%u Lu (also Lt ?)
%w %a+%n+Pc (e.g. '_')
%x 0-9A-Za-z
%z the 0 byte
cf.
http://www.unicode.org/reports/tr44/tr44-6.html#Property_Values
http://unicode.org/Public/UNIDATA/UnicodeData.txt
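
The category-based classes in the table above can be approximated in Python with unicodedata.category. This is only a sketch: the names CLASSES and matches are hypothetical, %x and %z are left out, and %u is taken as Lu only, since the table leaves the Lt question open.

```python
import unicodedata

# Each predicate receives the character and its general category.
CLASSES = {
    "a": lambda ch, cat: cat.startswith("L"),
    "c": lambda ch, cat: cat == "Cc",
    "d": lambda ch, cat: "0" <= ch <= "9",
    "l": lambda ch, cat: cat == "Ll",
    "n": lambda ch, cat: cat.startswith("N"),
    "p": lambda ch, cat: cat.startswith("P"),
    "s": lambda ch, cat: cat.startswith("Z") or 9 <= ord(ch) <= 13,
    "u": lambda ch, cat: cat == "Lu",  # assumption: Lt is not included
    "w": lambda ch, cat: cat[0] in "LN" or cat == "Pc",
}


def matches(cls: str, ch: str) -> bool:
    """Return True if the single character ch belongs to class %cls."""
    return CLASSES[cls](ch, unicodedata.category(ch))
```

Note that %d covers only the ASCII digits, so an Arabic-Indic digit such as U+0663 matches %n but not %d.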
NOTE: find positions are in bytes for all ctypes!
- Use ascii.sub to cut found ranges! This is a) faster and b) more reliable.
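
Why byte positions call for a byte-oriented cut can be shown with plain Python bytes. This is an analogy rather than the library's API: byte_span is a hypothetical helper, and Python offsets are 0-based with an exclusive end, whereas Lua positions are 1-based and inclusive.

```python
def byte_span(haystack: str, needle: str) -> tuple[int, int]:
    """Return the (start, end) byte offsets of needle in UTF-8 encoded haystack."""
    data = haystack.encode("utf-8")
    pattern = needle.encode("utf-8")
    start = data.find(pattern)
    return start, start + len(pattern)


text = "naïve café"
start, end = byte_span(text, "café")
# Cut by bytes, as a byte-oriented sub would; slicing by characters
# with these offsets would select the wrong range.
found = text.encode("utf-8")[start:end].decode("utf-8")
```

Here the byte offset of "café" is 7, while its character offset is 6, because "ï" occupies two bytes.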
utf8 behaviour:
- Match is by codes; code ranges are supported.
grapheme behaviour:
- Any %class, ‘.’ and range match includes any following grapheme extensions.
- Ranges apply to single code points only.
- If a [] enumeration contains a grapheme cluster, this matches only the exact same cluster.
- However, a literal single ‘o’ standalone or in an [] enumeration will match just that ‘o’, even if it has an extension in the string. Consequently, grapheme match positions are not always cluster positions.
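
The contrast between a ‘.’ match and a literal match can be sketched in Python. The helpers match_dot and match_literal are hypothetical, and positions here count code points for simplicity, whereas the library reports bytes.

```python
import unicodedata


def is_extension(ch: str) -> bool:
    """True for grapheme extension characters (categories Mn and Me)."""
    return unicodedata.category(ch) in ("Mn", "Me")


def match_dot(text: str, pos: int):
    """'.' consumes one character plus any following extensions."""
    if pos >= len(text):
        return None
    end = pos + 1
    while end < len(text) and is_extension(text[end]):
        end += 1
    return end


def match_literal(text: str, pos: int, literal: str):
    """A literal consumes exactly its own code points, nothing more."""
    return pos + len(literal) if text.startswith(literal, pos) else None
```

On the string "o" + U+0308 + "x", match_dot ends after the whole cluster at position 2, while match_literal with "o" ends at position 1, inside the cluster.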