Validate UTF8 strings?

wan-may
Prole
Posts: 5
Joined: Thu Sep 14, 2023 5:06 am

Post by wan-may »

I read, via the FFI, some nasty untrusted binary sludge sent from who-knows-where.

Sometimes this sludge contains a possibly strange UTF-8 string I might want to display - maybe I want an Elvish localisation when 12.0 adds all that custom ligature support.

The utf8 library is only concerned with encoding, so it doesn't keep text:add from choking on things:

Code:

love.load = function()
  local utf8 = assert( require 'utf8' )
  local s = utf8.char( 62835, 55592 ) --cognitohazardous sequence: U+F573 (private use) plus U+D928 (a lone surrogate)
  assert( utf8.len( s ) ) --This should return fail (nil) if len encounters 'any invalid byte sequence' 
  love.graphics.newText( love.graphics.getFont() ):set( s ) --Throws 'invalid code point' error when decoding anyways
end
I guess in the worst case I can get away with pcalling text:add or something. But:

is there a right way to do this? Is there a function that will decide if my string is acceptable utf8, before I actually pass it to a text object?
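For reference, the pcall fallback mentioned above might look roughly like the sketch below. trySetText is a hypothetical helper name, not part of LÖVE; it only assumes the object it receives has a set method (as a Text object does) and that a decoding failure surfaces as an ordinary Lua error that pcall can catch.

```lua
-- Hypothetical helper: attempt text:set(s) without letting a decoding
-- error propagate. Returns true on success, or false plus the error value.
local function trySetText(text, s)
  -- pcall(text.set, text, s) is the protected equivalent of text:set(s)
  local ok, err = pcall(text.set, text, s)
  if not ok then
    return false, err
  end
  return true
end
```

In love.load this would be called as trySetText( love.graphics.newText( love.graphics.getFont() ), s ), falling back to a placeholder string when it returns false. It only reports the failure after the fact; it does not say which part of the string was bad.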
pgimeno
Party member
Posts: 3582
Joined: Sun Oct 18, 2015 2:58 pm

Re: Validate UTF8 strings?

Post by pgimeno »

I don't think there's one, and unfortunately Lua doesn't have regular expressions, which would have been a solution.

The best I've found is this:

Code:

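local utf8 = require("utf8")

-- Raises an error if s contains a surrogate (U+D800..U+DFFF), U+FFFE or U+FFFF.
-- utf8.codes itself raises an error on malformed byte sequences.
local function validate(s)
  for _, c in utf8.codes(s) do
    if (c >= 0xD800 and c <= 0xDFFF) or c == 0xFFFE or c == 0xFFFF then
      error("invalid UTF-8 codepoint")
    end
  end
  return true
end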
utf8.codes already catches overlong sequences and codes > U+10FFFF, so that's covered.
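If a yes/no answer is more convenient than an error, the same checks could be arranged as in the sketch below. isValidUtf8 is a hypothetical name, and this is one possible arrangement rather than an official API. It leans on the documented behaviour of utf8.len, which returns a false value plus the position of the first invalid byte instead of raising, and it assumes utf8.codes will not raise on a string that utf8.len has already accepted.

```lua
local utf8 = require("utf8")

-- Hypothetical helper: returns true if s is well-formed UTF-8 and contains
-- no surrogates, U+FFFE or U+FFFF; otherwise returns false plus a reason.
local function isValidUtf8(s)
  -- Byte-level check: utf8.len returns nil and a byte position on failure.
  local len, badPos = utf8.len(s)
  if not len then
    return false, "invalid byte sequence at byte " .. badPos
  end
  -- Code point check: same ranges as the validate function above.
  for pos, c in utf8.codes(s) do
    if (c >= 0xD800 and c <= 0xDFFF) or c == 0xFFFE or c == 0xFFFF then
      return false, ("disallowed code point U+%04X at byte %d"):format(c, pos)
    end
  end
  return true
end
```

A caller can then write if isValidUtf8(s) then text:set(s) end and show a placeholder otherwise. Whether this matches exactly what the text object accepts is not guaranteed, so keeping a pcall around the actual set call is still a reasonable safety net.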
wan-may
Prole
Posts: 5
Joined: Thu Sep 14, 2023 5:06 am

Re: Validate UTF8 strings?

Post by wan-may »

I see, thank you!