utf8 support in pure lua

Posted: Fri Nov 09, 2012 12:09 am
by TsT
Hello,

I have started a pure Lua module to support operations on UTF-8 data.

See lua-utf8

The first goals have been reached:
- be able to get the correct length
- be able to get a substring
- automatic conversion

Regards,
TsT

Re: utf8 support in pure lua

Posted: Sun Nov 11, 2012 4:24 pm
by spir
Hello TsT,

Pleased to see someone else interested in Unicode. I have had a look at your online repo. However, are you aware that the following example you give will not work in general?

Code:

Sample of use

local data = "àbcdéêèf"
local u = require("utf8")
local udata = u(data)

print(type(data), data)   -- the original
print(type(udata), udata) -- automatic conversion to string

print(#data)  -- is not the number of characters printed on screen
print(#udata) -- is the number of characters printed on screen

print(udata:sub(4,5)) -- be able to use sub() like a string

I will not give you a Lua example, because you cannot even type Unicode strings in Lua, but here is the best illustration I can show, in Python:

Code:

# coding:utf8
s = u"\u0041\u0302\u0020\u0041\u032D"
print(s)          # "Â A̭"   (3 chars!)
print(repr(s))    # u'A\u0302 A\u032d'
print (len(s))    # 5
The point is that what Unicode folks call "abstract characters", which are represented by "Unicode code points", are not what you, I, or anyone else would call "characters", but just whatever they chose to list in their set. In particular, composite characters like Â are basically represented by 2 codes: one for the base 'A', one for the combining '^'. Which is a very good thing, imo: simple, informative, efficient.

But there are also "precomposed characters", with single codes representing whole composite characters. These are the ones that most (if not all) Unicode-aware editors and other text-producing software use, so that everyone (even programmers working on Unicode) thinks "abstract characters" are just characters and codes just represent characters. But this is not true.

A single character is represented by a sequence of codes (1 or more; there is in fact no formal limit). And each code is 1 number in UTF-32 and 1 to 4 bytes in UTF-8 (or up to 6 in the older definition), as you know. Thus, decoding UTF-8 gives you an array of codes, but not an array of character representations in the everyday or programming sense of "character". As a consequence, your #udata on my example will give 5, not 3.
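
To make the 5-versus-3 point concrete in Lua, here is a minimal sketch of a decoder. It is not the code from lua-utf8; the function name utf8_codepoints is made up for this illustration, the input is assumed to be well-formed UTF-8 (no validation at all), and the example string is spelled with decimal byte escapes since the code points cannot be written directly.

```lua
-- Hypothetical helper (not part of lua-utf8): decode a UTF-8 byte string
-- into an array of code points. Assumes well-formed input; no validation.
local function utf8_codepoints(s)
  local cps = {}
  local i = 1
  while i <= #s do
    local b = s:byte(i)
    local cp, extra
    if b < 0x80 then
      cp, extra = b, 0            -- 0xxxxxxx: ASCII, 1 byte
    elseif b < 0xE0 then
      cp, extra = b - 0xC0, 1     -- 110xxxxx: 2-byte sequence
    elseif b < 0xF0 then
      cp, extra = b - 0xE0, 2     -- 1110xxxx: 3-byte sequence
    else
      cp, extra = b - 0xF0, 3     -- 11110xxx: 4-byte sequence
    end
    for j = 1, extra do
      -- each continuation byte (10xxxxxx) contributes 6 more bits
      cp = cp * 64 + (s:byte(i + j) - 0x80)
    end
    cps[#cps + 1] = cp
    i = i + extra + 1
  end
  return cps
end

-- The Python example above as raw UTF-8 bytes:
-- A, U+0302 (0xCC 0x82), space, A, U+032D (0xCC 0xAD)
local s = "A\204\130 A\204\173"
local cps = utf8_codepoints(s)
print(#s)    -- 7 bytes
print(#cps)  -- 5 code points, although only 3 characters are displayed
```

So the byte count, the code point count, and the number of characters a reader sees are three different numbers for this string.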

Anyway, it's still very, very nice to have utf-8 <--> unicode encoding and decoding routines, and I may reuse them if you don't mind.

Regards,
Denis

Re: utf8 support in pure lua

Posted: Tue Nov 13, 2012 1:11 pm
by TsT
Hello spir,

Thanks for your feedback.
I also appreciate meeting someone who cares about Unicode!

Unfortunately my current utf8.lua takes a simple approach.

I tried to support more advanced stuff like lower/upper case on Unicode, but finally I thought it was too complicated...
I think that if someone wants true and full support for UTF-8 (or Unicode), they must use a better solution, like:

You were looking for a way to create a Unicode sequence from numerical codes.
You may use string.char:

Code:

> a = "Â A̭"

> print(a:byte(1,-1))
195	130	32	65	204	173

> for i,v in ipairs({a:byte(1,-1)}) do print(i,v, ("0x%x"):format(v)) end
1	195	0xc3
2	130	0x82
3	32	0x20
4	65	0x41
5	204	0xcc
6	173	0xad

> b=string.char(195,	130,	32,	65,	204,	173)
> b=string.char(0xc3, 0x82, 0x20, 0x41, 0xcc, 0xad)
> print(b)
Â A̭
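
If you would rather start from code point numbers than from raw bytes, the same string.char call can be wrapped in a small encoder. This is only a sketch: utf8_char and utf8_from_codepoints are hypothetical names, not functions of utf8.lua, and nothing checks for surrogates or values above 0x10FFFF.

```lua
-- Hypothetical helper: encode one code point as a UTF-8 byte string.
-- Plain arithmetic is used instead of bit operations so that it runs
-- on Lua 5.1 as well as later versions. No range checks are made.
local function utf8_char(cp)
  local floor = math.floor
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + floor(cp / 64),
                       0x80 + cp % 64)
  elseif cp < 0x10000 then
    return string.char(0xE0 + floor(cp / 4096),
                       0x80 + floor(cp / 64) % 64,
                       0x80 + cp % 64)
  else
    return string.char(0xF0 + floor(cp / 262144),
                       0x80 + floor(cp / 4096) % 64,
                       0x80 + floor(cp / 64) % 64,
                       0x80 + cp % 64)
  end
end

-- Hypothetical helper: build a string from a list of code points.
local function utf8_from_codepoints(...)
  local parts = {}
  for i, cp in ipairs({...}) do
    parts[i] = utf8_char(cp)
  end
  return table.concat(parts)
end

-- U+00C2 (precomposed Â), space, A, U+032D (combining circumflex below)
local b = utf8_from_codepoints(0xC2, 0x20, 0x41, 0x32D)
print(b:byte(1, -1))  -- 195 130 32 65 204 173
```

These are the same six bytes as in the session above.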
If you understand how to manage composite Unicode characters, I will be happy to include changes to support them.

Regards,

EDIT: I discovered the ValidateUnicodeString page.

Re: utf8 support in pure lua

Posted: Wed Nov 14, 2012 7:06 pm
by spir
TsT wrote: Hello spir,
Unfortunately my current utf8.lua takes a simple approach.
I tried to support more advanced stuff like lower/upper case on Unicode, but finally I thought it was too complicated...
I think that if someone wants true and full support for UTF-8 (or Unicode), they must use a better solution, like:
Well, in fact, as long as people understand the (theoretical) issue with composite characters, any support for Unicode code points can be pretty useful. The point is that most text will be made of precomposed characters anyway. So if one knows the software that produced it, or is ready to take the risk... In any case, it is good to be able to point to or select parts of a byte string while knowing that we are at the borders of valid code points.
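
Checking for such borders is cheap, because UTF-8 continuation bytes always have the form 10xxxxxx (0x80 to 0xBF). A rough sketch, with made-up names is_boundary and safe_sub, assuming the string is valid UTF-8 to begin with:

```lua
-- Hypothetical helper: true if byte index i is where a code point starts,
-- or is one past the end of the string. Assumes s is valid UTF-8.
local function is_boundary(s, i)
  if i == #s + 1 then return true end
  local b = s:byte(i)
  -- anything that is not a continuation byte (0x80..0xBF) starts a code point
  return b ~= nil and (b < 0x80 or b >= 0xC0)
end

-- Hypothetical helper: byte-level sub() that refuses to cut through
-- the middle of a code point.
local function safe_sub(s, i, j)
  assert(is_boundary(s, i), "start is inside a code point")
  assert(is_boundary(s, j + 1), "end is inside a code point")
  return s:sub(i, j)
end

local a = "\195\130 A\204\173"   -- "Â A̭" spelled as raw bytes
print(is_boundary(a, 1))   -- true  (lead byte 0xC3)
print(is_boundary(a, 2))   -- false (continuation byte 0x82)
print(#safe_sub(a, 1, 2))  -- 2, the two bytes of the precomposed Â
```

Note that this only protects code points. It will still happily separate a base character from a combining mark that follows it.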

About full Unicode support: if you mean building a representation which really is a sequence of characters, it is doable, but costly. (You essentially need to produce a normalised decomposed form.) If you mean Unicode support in the sense of providing tools like universal casing or locale-aware sorting, or giving information about characters (is it a scripting char? a base or a combining one? is it written right-to-left?), then it is another story ==> ICU, as you say.
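
To give an idea of what "a sequence of characters" could look like without the full machinery, here is a deliberately crude sketch. It does no normalisation at all, and it treats only the Combining Diacritical Marks block (U+0300 to U+036F) as combining, which is an assumption standing in for the real Unicode property tables. The names decode, is_combining and clusters are invented for this example.

```lua
-- Hypothetical helper: decode one UTF-8 sequence (1 to 4 bytes) into
-- its code point. Assumes the sequence is well-formed.
local function decode(ch)
  local b = ch:byte(1)
  local cp = b
  if b >= 0xF0 then cp = b - 0xF0
  elseif b >= 0xE0 then cp = b - 0xE0
  elseif b >= 0xC0 then cp = b - 0xC0 end
  for k = 2, #ch do
    cp = cp * 64 + (ch:byte(k) - 0x80)
  end
  return cp
end

-- Assumption: only U+0300..U+036F counts as combining here. Real text
-- needs the full Unicode data (many more ranges and scripts).
local function is_combining(cp)
  return cp >= 0x0300 and cp <= 0x036F
end

-- Split s into base characters with their combining marks attached.
local function clusters(s)
  local out = {}
  -- one non-continuation byte plus its continuation bytes = one code point
  for ch in s:gmatch("[^\128-\191][\128-\191]*") do
    if is_combining(decode(ch)) and #out > 0 then
      out[#out] = out[#out] .. ch   -- glue the mark onto its base
    else
      out[#out + 1] = ch            -- start a new character
    end
  end
  return out
end

local s = "A\204\130 A\204\173"   -- A + U+0302, space, A + U+032D
print(#clusters(s))               -- 3
```

Even this toy version has to decode every code point and allocate a string per character, which hints at where the cost comes from.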
TsT wrote: You were looking for a way to create a Unicode sequence from numerical codes.
You may use string.char

If you understand how to manage composite Unicode characters, I will be happy to include changes to support them.
Yes, thank you!
About composite Unicode characters: no, at least not now; I don't have time for that. (But I have a lib for that in D; I also had a prototype in Lua, but cannot find it anymore.) However, it is probably not worth the pain and the cost (in time and memory).

Denis