Selene Unicode

LuaFAR 3


There are four string-like ctype closures: unicode.ascii, unicode.latin1, unicode.utf8 and unicode.grapheme.

ascii and latin1 are single-byte, like string, but use the Unicode table for upper/lower and character classes. ascii leaves bytes > 127 untouched on upper/lower.

ascii or latin1 can be used as a locale-independent replacement for string. (There is a compile switch to do this automatically for ascii.)

utf8 operates on UTF-8 sequences as defined in RFC 3629: 1 byte for 0-7F, 2 bytes for 80-7FF, 3 bytes for 800-FFFF, 4 bytes for 10000-10FFFF (not excluding UTF-16 surrogate characters). Any byte that is not part of such a sequence is treated as its (Latin-1) value.
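
The byte-length ranges of RFC 3629 can be made concrete with a small pure-Lua sketch. The function utf8_encode below is a hypothetical helper written for illustration, not part of the library; it assumes a plain Lua 5.1 or later interpreter and does no special handling of surrogates.

```lua
-- Hypothetical helper (not part of the library): encode one code point
-- as a UTF-8 byte sequence, following the RFC 3629 length ranges.
local function utf8_encode(cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then                       -- 1 byte: 0-7F
    return char(cp)
  elseif cp < 0x800 then                  -- 2 bytes: 80-7FF
    return char(0xC0 + floor(cp / 0x40),
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then                -- 3 bytes: 800-FFFF
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  elseif cp <= 0x10FFFF then              -- 4 bytes: 10000-10FFFF
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
  error("code point out of range")
end

print(#utf8_encode(0x41), #utf8_encode(0xE9),
      #utf8_encode(0x20AC), #utf8_encode(0x10000))
```

The printed values are the sequence lengths for one code point from each of the four ranges.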

grapheme takes care of grapheme clusters, which are characters followed by “grapheme extension” characters (Mn+Me) like combining diacritical marks.
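
To illustrate how bytes, code points and grapheme clusters differ, here is a minimal pure-Lua sketch. It is not the library's implementation: the helpers codepoints and count_clusters are hypothetical, the input is assumed to be well-formed UTF-8, and only the combining diacritical marks U+0300 to U+036F are treated as grapheme extensions, whereas the library is described as using the full Mn+Me categories.

```lua
-- Decode well-formed UTF-8 into a list of code points (no validation).
local function codepoints(str)
  local out, i = {}, 1
  while i <= #str do
    local b = str:byte(i)
    local n, cp
    if b < 0x80 then n, cp = 1, b
    elseif b < 0xE0 then n, cp = 2, b - 0xC0
    elseif b < 0xF0 then n, cp = 3, b - 0xE0
    else n, cp = 4, b - 0xF0 end
    for k = 1, n - 1 do
      cp = cp * 0x40 + (str:byte(i + k) - 0x80)
    end
    out[#out + 1] = cp
    i = i + n
  end
  return out
end

-- Simplification: only U+0300..U+036F count as grapheme extensions.
local function count_clusters(cps)
  local n = 0
  for _, cp in ipairs(cps) do
    if not (cp >= 0x300 and cp <= 0x36F) then n = n + 1 end
  end
  return n
end

local s = "e\204\129"   -- "e" followed by U+0301 COMBINING ACUTE ACCENT
local cps = codepoints(s)
print(#s, #cps, count_clusters(cps))
```

For this string the three counts are 3 bytes, 2 code points and 1 cluster, which is the kind of difference one would expect between ascii.len, utf8.len and grapheme.len.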

Calls are:

  len(str)
  sub(str, start [,end=-1])
  byte(str, start [,end=-1])
  lower(str)
  upper(str)
  char(i [,j...])
  reverse(str)

Same as in string: rep, format, dump
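
A short usage sketch of the calls listed above. It assumes the library is exposed as a global table named unicode, as the names unicode.ascii etc. suggest; that is an assumption about the host environment, and the function demo is hypothetical. Outside such a host the sketch returns nil instead of failing.

```lua
-- Assumption: the host exposes the library as the global `unicode`.
local function demo(s)
  local u = rawget(_G, "unicode")
  if type(u) ~= "table" then return nil end   -- library not available
  return {
    bytes  = u.ascii.len(s),       -- length in bytes
    chars  = u.utf8.len(s),        -- length in code points
    upper  = u.utf8.upper(s),      -- upper-cased via the Unicode table
    first2 = u.utf8.sub(s, 1, 2),  -- first two code points
  }
end

local r = demo("gr\195\188n")      -- "grün" encoded as UTF-8
if r then
  print(r.bytes, r.chars, r.upper, r.first2)
end
```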

TODO: use char count with %s in format? (sub does the job)

TODO: grapheme.byte: only first code of any cluster?

  • find, gfind, gsub: done, but need thorough testing …: ascii does not match bytes > 127 on any %class (only on ., literals and ranges).
  • Behaviour of %class is undefined when the class character is not ASCII.
  • Frontier %f currently disabled — should we?

Character classes are:

  %a L* (Lu+Ll+Lt+Lm+Lo)
  %c Cc
  %d 0-9
  %l Ll
  %n N* (Nd+Nl+No, new)
  %p P* (Pc+Pd+Ps+Pe+Pi+Pf+Po)
  %s Z* (Zs+Zl+Zp) plus the controls 9-13 (HT,LF,VT,FF,CR)
  %u Lu (also Lt ?)
  %w %a+%n+Pc (e.g. '_')
  %x 0-9A-Fa-f
  %z the 0 byte

cf.
http://www.unicode.org/reports/tr44/tr44-6.html#Property_Values
http://unicode.org/Public/UNIDATA/UnicodeData.txt
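
The class table above can be read as a mapping from pattern classes to Unicode general categories. The sketch below transcribes it into a Lua table with a lookup function; CLASS and in_class are hypothetical names used for illustration only and say nothing about how the library stores this data. %d, %x and %z are defined by code value rather than category and are left out, and the open question about Lt in %u is not resolved here.

```lua
-- General categories per class, transcribed from the table above.
local CLASS = {
  a = { "Lu", "Ll", "Lt", "Lm", "Lo" },
  c = { "Cc" },
  l = { "Ll" },
  n = { "Nd", "Nl", "No" },
  p = { "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po" },
  s = { "Zs", "Zl", "Zp" },   -- plus the controls 9-13, handled below
  u = { "Lu" },
}

-- Hypothetical helper: does a character with general category `cat`
-- and code point `cp` belong to the class %<class>?
local function in_class(class, cat, cp)
  if class == "w" then        -- %w is %a + %n + Pc
    return in_class("a", cat, cp) or in_class("n", cat, cp) or cat == "Pc"
  end
  if class == "s" and cp and cp >= 9 and cp <= 13 then
    return true               -- HT, LF, VT, FF, CR
  end
  for _, c in ipairs(CLASS[class] or {}) do
    if c == cat then return true end
  end
  return false
end

print(in_class("w", "Pc", 0x5F))   -- '_' has category Pc
print(in_class("a", "Nd", 0x31))   -- '1' is a digit, not a letter
```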

NOTE:

  • find positions are in bytes for all ctypes!
  • use ascii.sub to cut found ranges! This is a) faster, b) more reliable
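
The point about byte positions can be shown with the stock string library, which is byte-based like ascii. This is a sketch with string.find and string.sub standing in for the library's find and ascii.sub; the stand-in is an assumption made so the example runs in any Lua interpreter.

```lua
local s = "na\195\175ve caf\195\169"   -- "naïve café" encoded as UTF-8
local needle = "caf\195\169"           -- "café"

-- Plain (non-pattern) search; the returned positions are byte offsets.
local i, j = string.find(s, needle, 1, true)

-- A byte-based sub cuts out exactly the matched range.
local hit = string.sub(s, i, j)

-- Number of code points: count every byte that is not a UTF-8
-- continuation byte (80-BF).
local nchars = select(2, string.gsub(s, "[^\128-\191]", ""))

print(i, j, nchars, hit == needle)
```

Here the match ends at byte 12 although the string has only 10 characters, so the positions cannot be used as character indices; a byte-based sub is the matching tool.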

utf8 behaviour:

  • Matching is by code points; code point ranges are supported.

grapheme behaviour:

  • A match of any %class, ‘.’ or range includes any following grapheme extensions.
  • Ranges apply to single code points only.
  • If a [] enumeration contains a grapheme cluster, this matches only the exact same cluster.
  • However, a literal single ‘o’, standalone or in an [] enumeration, will match just that ‘o’, even if it has an extension in the string. Consequently, grapheme match positions are not always cluster positions.