[LIB] utf8string

raidho36 · Post by **raidho36** » Mon Feb 03, 2020 5:39 am

This is a merged and tweaked "utf8" and "string" library. It works exactly the same, except that it handles UTF-8 strings* properly in addition to standard ASCII strings. For many reasons, most of which have nothing to do with Lua itself, Unicode pattern matching generally doesn't work, but all basic string manipulation functions should** exhibit exactly the same behavior as the standard functions.

It contains the functions from both the "string" and "utf8" standard libraries. "utf8string.codepoint" is aliased to "utf8string.code" for convenience; this is what you should use to get a character code instead of "utf8string.byte", because the latter fetches the specified byte, not a text character. A "utf8.bytelen" function has been added to get the string length in bytes rather than in text characters. Uppercase and lowercase conversion works with just about every writing system that has character casing (Japanese writing, for example, doesn't have casing, and neither do emojis). In addition to "utf8string.charpattern" there are "utf8string.upperpattern" and "utf8string.lowerpattern", which contain exhaustive lists of uppercase and lowercase characters.
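The difference between a byte and a text character can be shown with the stock string library alone. This is only a minimal sketch, not the library's code: it assumes well-formed UTF-8 input, and `count_chars` is a hypothetical helper made up for the illustration.

```lua
-- "año" written with byte escapes so the file encoding does not matter:
-- a (1 byte), ñ = U+00F1 (2 bytes: 195 177), o (1 byte)
local s = "a\195\177o"

-- Hypothetical helper: counts code points by counting every byte that is
-- not a UTF-8 continuation byte (128..191). Assumes well-formed UTF-8.
local function count_chars(str)
  local _, n = str:gsub("[^\128-\191]", "")
  return n
end

local byte_length = #s              -- length in bytes
local char_length = count_chars(s)  -- length in text characters

-- string.byte fetches a single byte, not the character code:
local second_byte = s:byte(2)       -- lead byte of ñ

-- Decoding the two-byte sequence by hand yields the actual code point:
local b1, b2 = s:byte(2, 3)
local code = (b1 - 192) * 64 + (b2 - 128)

print(byte_length, char_length, second_byte, code) --> 4  3  195  241
```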

*on the assumption that you're using normalized Unicode strings with no composite characters
**message or email me about any instance where the utf8string library doesn't work the same way as the standard string library

String handling functions will readily break when using composite Unicode characters because they can't tell what constitutes a complete, full character.
Pattern matching will break when mixing Unicode characters with pattern classes and items:

Code:

[ ] * + - ? . %b
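The composite-character caveat can be illustrated without the library. The following is a sketch that assumes well-formed UTF-8; `chars` is a hypothetical helper that splits a string into code points, which is the granularity that code-point-based functions work at.

```lua
-- Two spellings of the same visible character "ñ":
local precomposed = "\195\177"   -- U+00F1, one code point
local decomposed  = "n\204\131"  -- "n" followed by U+0303 COMBINING TILDE

-- Hypothetical helper: split a well-formed UTF-8 string into code points
-- (one lead byte followed by any continuation bytes).
local function chars(str)
  local out = {}
  for ch in str:gmatch("[^\128-\191][\128-\191]*") do
    out[#out + 1] = ch
  end
  return out
end

local a, b = chars(precomposed), chars(decomposed)
print(#a, #b)       --> 1  2
print(b[1] == "n")  --> true: the first "character" is the bare base letter
```

Taking the first "character" of the decomposed form therefore drops the tilde, even though both strings look identical on screen.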

dusoft · Post by **dusoft** » Mon Feb 03, 2020 12:14 pm

Hey, good job! Have you got some unit tests? UTF-8 handling is a messy job, so I would like to be sure your library is error free. Have you tested it with different UTF-8 strings?

raidho36 · Post by **raidho36** » Mon Feb 03, 2020 12:43 pm

dusoft wrote: ↑Mon Feb 03, 2020 12:14 pm Hey, good job! Have you got some unit tests? UTF-8 handling is a messy job, so I would like to be sure your library is error free. Have you tested it with different UTF-8 strings?

I did exhaustive testing, but as ad-hoc runs rather than unit tests. I tried to make sure that every single function produces exactly the same output given exactly the same input, with the Unicode variants also being force-fed Unicode-spliced strings. There's little reason it shouldn't work, as it basically just uses UTF-8 character count functions on top of the binary string handlers. I had to work out some edge cases, particularly that the native UTF-8 library uses a different convention for handling out-of-bounds indices, that a dot in a pattern matches any single byte rather than a whole Unicode character, and that the substring function with a negative second argument would corrupt the last Unicode character in the substring. That should be it for the quirks that need fixing, but you can never be 100% sure, and if there's anything I missed, it's because I didn't think to test against it (I'm pretty sure I tested against everything).
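To make the substring edge cases concrete, here is a rough sketch of a character-indexed `sub` built on the stock string library. It is not the library's implementation: `usub` is a hypothetical name, the input is assumed to be well-formed UTF-8, and the approach (splitting into a table first) favors clarity over speed.

```lua
-- Hypothetical character-indexed substring; indices count code points,
-- negative indices count from the end, out-of-range indices are clamped
-- the same way string.sub clamps byte indices.
local function usub(str, i, j)
  local t = {}
  for ch in str:gmatch("[^\128-\191][\128-\191]*") do
    t[#t + 1] = ch
  end
  local n = #t
  j = j or -1
  if i < 0 then i = n + i + 1 end
  if j < 0 then j = n + j + 1 end
  if i < 1 then i = 1 end
  if j > n then j = n end
  if i > j then return "" end
  return table.concat(t, "", i, j)
end

local s = "a\195\177o"   -- "año": 3 characters, 4 bytes

print(s:sub(1, 2) == "a\195")        --> true: byte indexing cuts ñ in half
print(usub(s, 1, 2) == "a\195\177")  --> true: "añ"
print(usub(s, -2) == "\195\177o")    --> true: "ño"
print(usub(s, 0, 10) == s)           --> true: out-of-bounds indices are clamped
```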

pgimeno · Post by **pgimeno** » Mon Feb 03, 2020 1:40 pm

It seems patterns are not UTF-8 aware.

Code:

print((string.gsub('ao', 'añ?o', 'x'))) -- prints 'ao', expected 'x'
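The reason can be seen at the byte level with the stock string library. In the sketch below the pattern is spelled with byte escapes; `matches` is a hypothetical helper used only for the demonstration.

```lua
-- "añ?o" as bytes: a, 195, 177, ?, o
-- The ? quantifier binds to the single byte 177, not to the whole "ñ".
local pattern = "a\195\177?o"

local function matches(subject)
  return subject:find(pattern) ~= nil
end

print(matches("a\195\177o")) --> true   ("año")
print(matches("ao"))         --> false  (byte 195 is still required)
print(matches("a\195o"))     --> true   (not valid UTF-8, yet it matches)
```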

(Edit: I have a port to Lua of the Lua interpreter's pattern matching code, except gsub: https://notabug.org/pgimeno/patlua in case you want to hack it in - it passes Lua's internal unit tests, another pattern matching library's, and my own unit tests)

Besides, it would be nice if you provided a way to replace the string metatable.

Code:

string = require ( "utf8string" )
print(("año"):sub(1, 2)) -- prints aÃ in my terminal
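A likely explanation for this output is that method-call syntax does not look at the global `string` variable at all. The sketch below shows this with a hypothetical stand-in table (`replacement`) instead of the actual library.

```lua
local stock = string

-- Hypothetical stand-in for a replacement library: inherits everything
-- from the stock table but overrides sub.
local replacement = setmetatable({}, { __index = stock })
function replacement.sub(s, i, j)
  return "<replacement sub>"
end

-- Method calls on strings are resolved through the shared string
-- metatable, whose __index refers to the stock table:
local mt = getmetatable("")
print(mt.__index == stock)           --> true
print(("abc"):sub(1, 2))             --> ab
print(replacement.sub("abc", 1, 2))  --> <replacement sub>
```

Assigning a new table to the name `string` rebinds that name only; `mt.__index` keeps pointing at the original table until it is changed explicitly.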

dusoft · Post by **dusoft** » Mon Feb 03, 2020 1:45 pm

You can try running your library through this test file (as it is older, you can add a set of random emojis to it):
https://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html

Then run all the UTF-8 functions (plus random character positions for substrings etc.) on the test file and print the output for manual checks. If it passes, then all good!
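Part of such a check can be automated. The sketch below compares a candidate `sub` implementation against `string.sub` for every index pair on a few ASCII samples, where the two are expected to agree. `check_sub`, `samples`, and `buggy` are hypothetical names; in practice the candidate would be the library's own function.

```lua
-- Compare candidate(s, i, j) with string.sub for all index pairs,
-- including out-of-range and negative indices. Returns a list of
-- human-readable mismatch descriptions.
local function check_sub(candidate, samples)
  local failures = {}
  for _, s in ipairs(samples) do
    local n = #s
    for i = -n - 1, n + 1 do
      for j = -n - 1, n + 1 do
        local expected, got = s:sub(i, j), candidate(s, i, j)
        if expected ~= got then
          failures[#failures + 1] = string.format(
            "sub(%q, %d, %d): expected %q, got %q", s, i, j, expected, got)
        end
      end
    end
  end
  return failures
end

local samples = { "", "a", "ab", "hello" }

-- A deliberately broken candidate that mishandles negative end indices:
local function buggy(s, i, j)
  return s:sub(i, math.abs(j))
end

print(#check_sub(string.sub, samples))  --> 0
print(#check_sub(buggy, samples) > 0)   --> true
```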

raidho36 · Post by **raidho36** » Mon Feb 03, 2020 11:37 pm

pgimeno wrote: ↑Mon Feb 03, 2020 1:40 pm It seems patterns are not UTF-8 aware.

Should've expected that, since the entire system operates on single byte characters.

Yeah, for pattern matching a simple wrapper will not do; what was I thinking.

I can't think of a way to automatically replace the strings' metatable when, and only when, the string module is replaced with utf8string. Calling an initializer function is always an option though.

pgimeno · Post by **pgimeno** » Fri Feb 07, 2020 9:28 am

raidho36 wrote: ↑Mon Feb 03, 2020 11:37 pm I can't think of a way to automatically replace the strings' metatable when, and only when, the string module is replaced with utf8string. Calling an initializer function is always an option though.

Never mind, this is easy enough: getmetatable("").__index = string
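Wrapped in an explicit initializer, the one-liner could look roughly like this. It is a sketch only: `install` and the stand-in `replacement` table are hypothetical and not part of the library.

```lua
-- Hypothetical installer: point string method calls at `lib` and return
-- a function that restores the previous table.
local function install(lib)
  local mt = getmetatable("")
  local previous = mt.__index
  mt.__index = lib
  return function() mt.__index = previous end
end

-- Stand-in for the library: inherits from the stock table, overrides upper.
local replacement = setmetatable({}, { __index = string })
function replacement.upper(s)
  return "<replacement upper>"
end

local restore = install(replacement)
local during = ("abc"):upper()  -- resolved through the replacement table
restore()
local after = ("abc"):upper()   -- stock behavior again

print(during, after) --> <replacement upper>  ABC
```

Returning a restore function keeps the change reversible, which matters because the string metatable is shared by every string in the program.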

raidho36 · Post by **raidho36** » Fri Feb 07, 2020 11:36 am

pgimeno wrote: ↑Fri Feb 07, 2020 9:28 am Never mind, this is easy enough: getmetatable("").__index = string

Well yeah, this is what you do, but that's not automatic.

HDPLocust · Post by **HDPLocust** » Sun Feb 16, 2020 9:23 am

Functions like "lower" and "upper" can be seriously optimized like this:

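Code:

-- `tolower` (the lowercase mapping table) and `utf8_charpattern` come from the library's internals
local _l = function(c) return tolower[c] or c end
local newlower = function(str)
  -- the extra parentheses drop gsub's second result (the match count)
  return (str:gsub(utf8_charpattern, _l))
end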

A simple benchmark with this optimisation gives me these results:

Code:

-- `lower` is the library's current implementation; `newlower` is the version above
local function check(msg, func)
  local t = os.clock()
  for i = 1, 1000000 do
    func("Привет, я - кириллический текст!")
  end
  print(msg, (os.clock() - t) .. "sec")
end

-- jit.off() and jit.on() are only available under LuaJIT
jit.off()
check("lower no jit", lower)
check("newlower no jit", newlower)
jit.on()
check("lower jit", lower)
check("newlower jit", newlower)

> lower no jit    5.74sec
> newlower no jit 4.738sec
> lower jit       5.783sec
> newlower jit    3.961sec

It also gives a larger speedup on short strings and lower memory consumption on long ones.

raidho36 · Post by **raidho36** » Sun Feb 16, 2020 6:25 pm

Interesting find. I also found that using the substitution table directly as an argument is marginally faster than using a callback that looks up values from that table. Unfortunately, the library is generally broken for composite characters, which are pretty common in some languages (e.g. Korean), and there's nothing stopping people from using composition with any characters (e.g. zalgo text). It's possible to implement this functionality, but at this time I don't wish to undertake such an endeavor.
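The two variants of the substitution can be sketched as follows. The mapping table here is a tiny hypothetical one (the library's real table is far larger), and `CHAR` is a simplified character pattern that assumes well-formed UTF-8.

```lua
-- One lead byte followed by any continuation bytes.
local CHAR = "[^\128-\191][\128-\191]*"

-- Tiny hypothetical mapping, for illustration only.
local tolower = {
  A = "a",
  ["\208\159"] = "\208\191",  -- П (U+041F) -> п (U+043F)
}

-- Callback variant: a function looks up each matched character.
local function lower_callback(s)
  return (s:gsub(CHAR, function(c) return tolower[c] or c end))
end

-- Table variant: gsub does the lookup itself and keeps the original
-- match whenever the table has no entry for it.
local function lower_table(s)
  return (s:gsub(CHAR, tolower))
end

local input = "\208\159Ax"  -- "ПAx"
print(lower_table(input) == lower_callback(input)) --> true
```

Both functions return the same string; the table variant simply avoids one Lua function call per matched character.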
