UTF8 strings and characters

From Free Pascal wiki
Revision as of 01:55, 1 February 2015 by JuhaManninen

Note that iterating over the Chars of a string as if it were an array of characters does not work for Unicode text. This is not specific to UTF-8: in Unicode a character cannot be assumed to have a fixed size. If you want to iterate over a UTF-8 string, there are basically two ways:

  • iterate over the bytes - useful for searching for a substring or when only the ASCII characters of the UTF-8 string matter, for example when parsing XML files.
  • iterate over the characters - useful for graphical components like SynEdit, for example when you need to know the third character printed on the screen.
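
The byte-wise approach can be sketched as follows. This is a minimal illustration that uses only the RTL; the helper name CountAsciiChar is made up for this example and is not part of LazUTF8. It relies on the fact that bytes below $80 never occur inside a multi-byte UTF-8 sequence, so ASCII characters such as '<' can be found without decoding anything.

```pascal
program CountAsciiDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts how often the ASCII character C occurs in S
// by comparing bytes. No UTF-8 decoding is needed, because every byte of a
// multi-byte sequence has its high bit set and so can never equal C.
function CountAsciiChar(const S: string; C: Char): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if S[i] = C then
      Inc(Result);
end;

begin
  // '<b>', then a-umlaut as its UTF-8 bytes $C3 $A4, then '</b>'
  writeln(CountAsciiChar('<b>'#$C3#$A4'</b>', '<')); // prints 2
end.
```

The a-umlaut is written as explicit byte values so that the example does not depend on the encoding of the source file.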

Searching a substring

Due to the design of UTF-8 you can use the normal string functions for searching a substring. Although UTF-8 is a multi-byte encoding, the first byte of a character can never be confused with a following byte, because lead bytes and continuation bytes use disjoint value ranges. So searching for a valid UTF-8 string with Pos always returns a byte position at the start of a character:

uses lazutf8;
...
procedure Where(SearchFor, aText: string);
var
  BytePos: LongInt;
  CharacterPos: LongInt;
begin
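  BytePos:=Pos(SearchFor,aText);
  if BytePos=0 then
  begin
    writeln('The substring "',SearchFor,'" is not in the text "',aText,'"');
    exit;
  end;
  // count the characters in front of the match, then add 1 so that
  // CharacterPos is 1-based like BytePos
  CharacterPos:=UTF8Length(PChar(aText),BytePos-1)+1;
  writeln('The substring "',SearchFor,'" is in the text "',aText,'"',
    ' at byte position ',BytePos,' and at character position ',CharacterPos);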
end;
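
To illustrate why a byte position found by Pos always falls on a character boundary, here is a small RTL-only sketch. The names IsContinuationByte and CharIndexOfBytePos are hypothetical helpers for this example, not LazUTF8 functions. Continuation bytes always have the bit pattern 10xxxxxx, so counting the bytes that do not match this pattern counts characters.

```pascal
program BytePosToCharPos;
{$mode objfpc}{$H+}

// True for bytes of the form 10xxxxxx, which only occur after a lead byte.
function IsContinuationByte(B: Byte): Boolean;
begin
  Result := (B and $C0) = $80;
end;

// Hypothetical helper: returns the 1-based character index of the
// character that starts at the 1-based byte position BytePos,
// or 0 if BytePos is out of range.
function CharIndexOfBytePos(const S: string; BytePos: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  if (BytePos < 1) or (BytePos > Length(S)) then
    Exit;
  for i := 1 to BytePos do
    if not IsContinuationByte(Ord(S[i])) then
      Inc(Result);
end;

const
  Text = 'x'#$C3#$A4'y'; // 'x', a-umlaut (2 bytes), 'y'
begin
  writeln(Pos('y', Text));                           // prints 4
  writeln(CharIndexOfBytePos(Text, Pos('y', Text))); // prints 3
end.
```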

Because Unicode can represent the same text in more than one way, Pos() (like any comparison) can behave unexpectedly, e.g. when one of the strings contains decomposed characters while the other uses the precomposed code point for the same letter. The RTL does not handle this automatically.
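
A short RTL-only sketch of this pitfall, using the two encodings of a-umlaut. The function name ContainsBytes is made up for this example. Both constants show the same letter on screen, but their bytes differ, so a plain Pos does not find one in the other.

```pascal
program NormalizationPitfall;
{$mode objfpc}{$H+}

const
  Precomposed = #$C3#$A4;    // U+00E4, a-umlaut as a single code point
  Decomposed  = 'a'#$CC#$88; // U+0061 'a' followed by U+0308 combining diaeresis

// Hypothetical helper: plain byte-wise substring test built on Pos.
function ContainsBytes(const Needle, Haystack: string): Boolean;
begin
  Result := Pos(Needle, Haystack) > 0;
end;

begin
  // Searching for the precomposed form in a text that uses the decomposed form
  writeln(Pos(Precomposed, 'B' + Decomposed + 'r')); // prints 0
  // The same search succeeds when both sides use the same form
  writeln(Pos(Precomposed, 'B' + Precomposed + 'r')); // prints 2
end.
```

If both forms can occur in the input, the strings have to be normalized to the same form before comparing; that step is outside the scope of this sketch.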

Accessing UTF8 characters

Unicode characters vary in length, so when you need the characters in the order in which they appear, the best way to access them is to iterate. To iterate through the characters use this code:

uses lazutf8;
...
procedure DoSomethingWithString(AnUTF8String: string);
var
  p: PChar;
  CharLen: integer;
  FirstByte, SecondByte, ThirdByte: Char;
begin
  p:=PChar(AnUTF8String);
  // test for the terminating #0 first, so that an empty string is handled too
  while p^ <> #0 do
  begin
    CharLen := UTF8CharacterLength(p);
    if CharLen = 0 then break;

    // Here you have a pointer to the char and its length
    // You can access the bytes of the UTF-8 Char like this:
    if CharLen >= 1 then FirstByte := p[0];
    if CharLen >= 2 then SecondByte := p[1];
    if CharLen >= 3 then ThirdByte := p[2];

    inc(p,CharLen);
  end;
end;

Accessing the Nth UTF8 character

Besides iterating, one might also want random access to UTF-8 characters. UTF8Copy takes a 1-based character index, like Copy:

uses lazutf8;
...
function NthUTF8Char(const AnUTF8String: string; N: PtrInt): string;
begin
  // N counts characters, not bytes, and starts at 1
  Result := UTF8Copy(AnUTF8String, N, 1);
end;

Showing character codepoints with UTF8CharacterToUnicode

The following demonstrates how to show the 32-bit code point value of each character in a UTF-8 string:

uses lazutf8;
...
procedure IterateUTF8Characters(const AnUTF8String: string);
var
  p: PChar;
  unicode: Cardinal;
  CharLen: integer;
begin
  p:=PChar(AnUTF8String);
  // loop until the terminating #0, which is not printed
  while p^ <> #0 do
  begin
    unicode:=UTF8CharacterToUnicode(p,CharLen);
    if CharLen=0 then break;
    writeln('Unicode=',unicode);
    inc(p,CharLen);
  end;
end;
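
For readers who want to see what such a decoding step involves, here is a simplified RTL-only sketch. DecodeUTF8Sketch is a hypothetical name and this is not the LazUTF8 implementation: it assumes a #0-terminated buffer, it does not reject overlong encodings or surrogates, and it passes an invalid or truncated sequence through as a single byte.

```pascal
program DecodeSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: decodes the UTF-8 sequence at p into a code point
// and reports the number of bytes used in CharLen.
function DecodeUTF8Sketch(p: PChar; out CharLen: Integer): Cardinal;
var
  b: Byte;
  i: Integer;
begin
  b := Ord(p^);
  // The lead byte gives the sequence length and the top payload bits
  if b < $80 then
  begin
    CharLen := 1; Result := b;            // 0xxxxxxx: ASCII
  end
  else if (b and $E0) = $C0 then
  begin
    CharLen := 2; Result := b and $1F;    // 110xxxxx
  end
  else if (b and $F0) = $E0 then
  begin
    CharLen := 3; Result := b and $0F;    // 1110xxxx
  end
  else if (b and $F8) = $F0 then
  begin
    CharLen := 4; Result := b and $07;    // 11110xxx
  end
  else
  begin
    CharLen := 1; Result := b;            // invalid lead byte: pass through
    Exit;
  end;
  // Each continuation byte (10xxxxxx) contributes six more bits
  for i := 1 to CharLen - 1 do
  begin
    if (Ord(p[i]) and $C0) <> $80 then
    begin
      CharLen := 1; Result := b;          // truncated sequence: pass through
      Exit;
    end;
    Result := (Result shl 6) or (Ord(p[i]) and $3F);
  end;
end;

var
  Len: Integer;
begin
  writeln(DecodeUTF8Sketch(PChar(#$C3#$A4), Len));    // prints 228 (U+00E4)
  writeln(DecodeUTF8Sketch(PChar(#$E2#$82#$AC), Len)); // prints 8364 (U+20AC)
end.
```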

Mac OS X

The file functions of the FileUtil unit also take care of Mac OS X specific behaviour: OS X normalizes filenames. For example, the filename 'ä.txt' can be encoded in Unicode in two different ways: with the precomposed two-byte sequence #$C3#$A4 or with the decomposed three-byte sequence 'a'#$CC#$88. Under Linux and BSD you can create a filename with either encoding. OS X automatically converts the a-umlaut to the three-byte sequence. This means:

if Filename1 = Filename2 then ... // is not sufficient under OS X
if AnsiCompareFileName(Filename1, Filename2) = 0 then ... // not sufficient under FPC 2.2.2, not even with cwstring
if CompareFilenames(Filename1, Filename2) = 0 then ... // this always works (unit FileUtil or FileProcs)

See also