Unicode Support in Lazarus

From Free Pascal wiki
Jump to navigationJump to search

Template:Better Unicode Support in Lazarus

Introduction

This page covers Unicode support in Lazarus programs (console or server, no GUI) and applications (GUI with LCL) using features of FPC 3.0+.

The old way to support UTF-8 in LCL using FPC versions up to 2.6.4 is explained here: LCL Unicode Support

RTL with default codepage UTF-8

By default RTL uses the system codepage for AnsiStrings (e.g. FileExists and TStringList.LoadFromFile). On Windows this is a non Unicode encoding, so you can only use characters from your language group (at most 256 characters). LCL on the other hand works with UTF-8 encoding which is the full Unicode range. On Linux and Mac OS X UTF-8 is typically the system codepage, so the RTL uses here by default CP_UTF8.

FPC since version 3.0 provides an API to change the default system codepage of RTL to something else. Lazarus (actually its LazUtils package) takes advantage of that API and changes it to UTF-8 (CP_UTF8). It means also Windows users can now use UTF-8 strings in the RTL.

  • For example FileExists and aStringList.LoadFromFile(Filename) now support full Unicode. See here for the complete list of functions that already support full Unicode:

RTL changes

  • SysToUTF8, UTF8ToSys have no effect. They were mainly used for the above RTL functions, which no longer need a conversion. For WinAPI functions see below.
  • Many UTF8Encode and UTF8Decode calls are no longer needed, because when assigning UnicodeString to String and vice versus the compiler does it automatically for you.
  • When accessing the WinAPI you must use the "W" functions or use the functions UTF8ToWinCP and WinCPToUTF8. The same is true for libraries that still use Ansi WinAPI function. For example in FPC 3.0 and below the unit registry needs this.
  • "String" and "UTF8String" are different types. If you assign a String to an UTF8String the compiler adds code to check if the encoding is the same. This costs unnecessary time and increases code size. Simply use String instead of UTF8String.
  • The Windows console uses a different encoding. writeln of FPC 3.0+ automatically converts the UTF-8 to the console codepage. Some Windows console programs expects as input the console codepage and outputs the console codepage. You can use UTF8ToConsole and ConsoleToUTF8 to convert.


More information about the new FPC Unicode Support: FPC Unicode support

Usage in Lazarus

The new mode is enabled automatically when compiling with FPC 3.0+. It can be disabled by defining -dDisableUTF8RTL, see page Lazarus with FPC3.0 without UTF-8 mode for details.

If you use string literals with the new mode, your sources must have UTF-8 encoding. However -FcUTF8 is not typically needed. See "String Literals" below for more details.

What actually happens in the new mode? These 2 FPC functions are called in an early initialization section, setting the default String encoding in FPC to UTF-8 :

 SetMultiByteConversionCodePage(CP_UTF8);
 SetMultiByteRTLFileSystemCodePage(CP_UTF8);

Also the UTF8...() functions in LazUTF8 (LazUtils) are set as backends for RTL's Ansi...() functions.

For console programs (no LCL) a dependency for LazUtils must be added manually. (LCL applications already have it through LCL dependency.)

Using UTF-8 in non LCL programs

If you want to use the new UTF-8 strings in your non LCL programs, add dependency for LazUtils package. Then add LazUTF8 unit in the uses section of the main program file. It must be near the beginning, just after the critical memorymanagers and threading stuff (e.g. cmem, heaptrc, cthreads).

Compatibility with Unicode Delphi

For console programs LazUTF8 unit must be in the uses section of main program file. Delphi has no such unit.

RTL functions in ASCII area

RTL functions that work in ASCII area (e.g. UpperCase) are compatible, but they work faster in the UTF-8 RTL. In Delphi all string functions became slower after they switched to UTF-16.

RTL Ansi...() Unicode functions

RTL Ansi...() functions that work with codepages / Unicode (e.g. AnsiUpperCase) are compatible.

Reading individual codepoints

Not compatible, although it is quite easy to make source code that works with both encodings.

Delphi has functions like NextCharIndex, IsHighSurrogate, IsLowSurrogate to deal with UTF-16 surrogate pairs, codepoints consisting of 2 UnicodeChar(*) (WideChar, Word, 2 bytes). However those functions are not used much in example code and tutorials. Most tutorials say that Copy() function works just as it did with Delphi versions before D2009. No, a codepoint can now be 2 UnicodeChar(*) and Copy() may return half of it.

UTF-8 has an advantage here. It must be done always right because multi-byte codepoints are so common.

See section : Dealing with UTF8 strings and characters in code below for examples of how to use UTF-8 and how to make code that works with both encodings.

(*)

  • "UnicodeString" and "UnicodeChar" names for UTF-16 types was a very unfortunate choice from Borland.
  • A Unicode codepoint is a "real" character definition in Unicode which can be encoded differently and its length depends on the encoding.
  • A Unicode character is either one codepoint or a decomposed character of multiple codepoints. Yes, this is complex ...

Calling Windows API

Only the "W" versions of Windows API functions should be called. It is like in Delphi except that you must assign strings to/from API calls to UnicodeString variables or typecast with UnicodeString(). In Delphi the default string type already is UnicodeString and is not needed explicitly. Note: WideString is used traditionally with WinAPI but it is only needed with COM/OLE programming where the OS takes care of memory management. Use UnicodeString instead.

Compatibility with LCL in Lazarus 1.x

Many Lazarus LCL applications will continue to work without changes. However the handling of Unicode has become simpler and it makes sense to clean the code. Code that reads or writes data from/to streams, files or DBs with non-UTF-8 encoding, breaks and must be changed. (See below for examples).

Explicit conversion functions are only needed when calling Windows Ansi functions. Otherwise FPC takes care of converting encodings automatically. Empty conversion functions are provided to make your old code compile.

  • UTF8Decode, UTF8Encode - Almost all can be removed.
  • UTF8ToAnsi, AnsiToUTF8 - Almost all can be removed.
  • UTF8ToSys, SysToUTF8 - All can be removed. They are now dummy no-ops and only return their parameter.

File functions in RTL now take care of file name encoding. All (?) file name related ...UTF8() functions can be replaced with the Delphi compatible function without UTF8 suffix. For example FileExistsUTF8 can be replaced with FileExists.

Most UTF8...() string functions can be replaced with the Delphi compatible Ansi...() functions. The UTF8...() functions in LazUTF8 are registered as callback functions for the Ansi...() functions in SysUtils.

UTF-8 works in non-GUI programs, too. It only requires a dependency for LazUtils and placing LazUTF8 unit into the uses section of main program file.

Reading / writing text file with Windows codepage

This is not compatible with former Lazarus code. In practice you must encapsulate the code dealing with system codepage and convert the data to UTF-8 as quickly as possible.

Use RawByteString and do an explicit conversion

 uses ... , LConvEncoding;
 ...
 var
   StrIn: RawByteString;
   StrOut: String;
 ...
 StrOut := CP1252ToUTF8(StrIn,true);  // Uses fixed codepage
 // or
 StrOut := WinCPToUTF8(StrIn,true);  // Uses system codepage in this particular computer

Set the right codepage for an exising string

 var
   StrIn, StrOut: String;
 ...
   SetCodePage(RawByteString(StrIn), 1252, false);  // Fixed 1252 (or Windows.GetACP())

Note: there must be some text in the string variable. An empty string is actually a Nil pointer and you cannot set its codepage.

Windows.GetACP() returns the Windows system codepage.

ToDo ...

Code that depends very much on Windows codepage

Sometimes program code depends so much on system codepage that using the new UTF-8 mode is not practical. There are 2 choices then :

  • Continue using Lazarus with FPC 2.6.4. This is a good solution for code that is in maintenance mode. Lazarus can still be compiled with FPC 2.6.4 for some time to come and the old UTF8...() functions will be there.
  • Use FPC 3.0 without the new UTF-8 mode by defining DisableUTF8RTL. This causes some nasty problems which are explained here : Lazarus with FPC3.0 without UTF-8 mode.

Calling Windows API

See above Delphi compatibility section. Use only the "W" versions of Windows API and assign to/from UnicodeString. The old Ansi Windows API functions are not recommended, this is a fully Unicode aware system after all.

Helper functions for CodePoints

LazUtils will have special functions for dealing with codepoints. They will use the old UTF8...() functions in LCL now but can be made alias to functions using other encoding in Delphi and in FPC's {$mode DelphiUnicode}.

  • CodePointCopy() - Like UTF8Copy()
  • CodePointLength() - Like UTF8Length()
  • CodePointPos() - Like UTF8Pos()
  • CodePointToWinCP() - Like UTF8ToWinCP()
  • WinCPToCodePoint() - Like WinCPToUTF8()
  • CodePointByteCount() - Like UTF8CharacterLength()

An interesting question is how CodePointCopy, CodePointLength and CodePointPos should be implemented in Delphi which does not provide such functions for UTF-16. (Or does it?) Practically all Delphi code uses plain Copy, Length and Pos when codepoint aware functions should be used.

Dealing with UTF8 strings and characters in code

See details in UTF8_strings_and_characters.

String Literals

Sources should be saved in UTF-8 encoding. Lazarus creates such files by default. You can change the encoding of imported files via right click in source editor / File Settings / Encoding.

In most cases {$codepage utf8} / -FcUTF8 is not needed. This is rather counter-intuitive because the meaning of that flag is to treat string literals as UTF-8. However the new UTF-8 mode switches the encoding at run-time, yet the constants are evaluated at compile-time.

So, without -FcUTF8 the compiler (wrongly) thinks the constant string is encoded with system code page. Then it sees a String variable with default encoding (which will be changed to UTF-8 at run-time but the compiler does not know it). Thus, same default encodings, no conversion needed, the compiler happily copies the characters and everything goes right, while actually it was fooled twice during the process.

Examples:

  • AnsiString/String literals work with and without {$codepage utf8} / -FcUTF8.
const s: string = 'äй';
  • ShortString literals work only without {$codepage utf8} / -FcUTF8. You can
unit unit1;
{$Mode ObjFPC}{$H+}
{$modeswitch systemcodepage} // disable -FcUTF8
interface
const s: string[15] = 'äй';
end.

Alternatively you can use shortstring with $codepage via codes:

unit unit1;
{$Mode ObjFPC}{$H+}
{$codepage utf8}
interface
const s: String[15] = #$C3#$A4; // ä
end.
  • WideString/UnicodeString/UTF8String literals only work with {$codepage utf8} / -FcUTF8.
unit unit1;
{$Mode ObjFPC}{$H+}
{$codepage utf8}
interface
const ws: WideString = 'äй';
end.

Writing to console

Some console programs use the box-drawing characters like '╩'. For example in codepage CP437 this was one byte #202 and was often used like this:

write('╩');

When you convert the code to UTF-8, for example by using Lazarus' source editor popup menu item File Settings / Encoding / UTF-8 and clicking on the dialog button "Change file on disk", the becomes 3 bytes (#226#149#169), so the literal becomes a string. The procedures write and writeln convert the UTF-8 string to the current console codepage. So your console program now outputs the '╩' on Windows with any codepage (i.e. not only CP437) and it even works on Linux and Mac OS X. You can also use the '╩' in LCL strings, for instance Memo1.Lines.Add('╩');

FPC codepages

The compiler (FPC) supports specifying the code page in which the source code has been written via the command option -Fc (e.g. -Fcutf8) and the equivalent codepage directive (e.g. {$codepage utf8}). In this case, rather than literally copying the bytes that represent the string constants in your program, the compiler will interpret all character data according to that codepage. There are two things to watch out for though:

  • on Unix platforms, a widestring manager must be included by adding the cwstring unit to uses-clause. Without it, the program will not be able to convert all character data correctly when running.

It is included by default with the new UTF-8 mode although it makes the program dependent on libc and makes cross-compilation harder.

  • The compiler converts all string constants that contain non-ASCII characters to widestring constants. These are automatically converted back to ansistring (either at compile time or at run time), but this can cause one caveat if you try to mix both characters and ordinal values in a single string constant:

For example:

program project1;
{$codepage utf8}
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring;
{$endif}
var
  a,b,c: string;
begin
  a:='ä';
  b:='='#$C3#$A4; // #$C3#$A4 is UTF-8 for ä
  c:='ä='#$C3#$A4; // after non ascii 'ä' the compiler interprets #$C3 as widechar.
  writeln(a,b); // writes ä=ä
  writeln(c);   // writes ä=ä
end.

When compiled and executed, this will write:

ä=ä
ä=ä

The reason is once the ä is encountered, as mentioned above the rest of the constant string assigned to 'c' will be parsed as a widestring. As a result the #$C3 and #$A4 are interpreted as widechar(#$C3) and widechar(#$A4), rather than as ansichars.

Open issues

  • TFormatSettings char (bug 27086): for example: ThousandSeparator, DecimalSeparator, DateSeparator, TimeSeparator, ListSeparator. These should be replaced with string to support UTF-8. For example under Linux with LC_NUMERIC=ru_RU.utf8 the thousand separator is the two byte nbsp/160.
    • Workaround: use single space characters instead as done in patch for bug 27099


WinAPI function calls in FPC libs

  • Unit registry, TRegistry - this unit uses Windows Ansi functions and therefore you need to use UTF8ToWinCP, WinCPToUTF8. Formerly it needed UTF8ToSys.
  • All Windows API Ansi function calls in FPC's libraries must be replaced with the "W" function version. This must be done in any case for the future UTF-16 support, thus there is no conflict of interest here.
  • TProcess - under Windows TProcess FPC 3.0 only supports system codepage. Use either TProcessUTF8 of unit utf8process or patch FPC, see bug 29136

ToDo: List all related FPC bug tracker issues and patches that should be applied.

Future

The goal of FPC project is to create a Delphi compatible UnicodeString (UTF-16) based solution, but it is not ready yet. It may take some time to be ready.

This UTF-8 solution of LCL in its current form can be considered temporary. In the future, when FPC supports UnicodeString fully in RTL and FCL, Lazarus project will provide a solution for LCL that uses it. At the same time the goal is to preserve UTF-8 support although it may require changes to string types or something. Nobody know the details yet. We will tell when we know...

In essence LCL will probably have 2 versions, one for UTF-8 and one for UTF-16.

FAQ

What about Mode DelphiUnicode?

The {$mode delphiunicode} was added in FPC 2.7.1 and is like {$Mode Delphi} with {ModeSwitch UnicodeStrings}. See the next question about ModeSwitch UnicodeStrings.

What about ModeSwitch UnicodeStrings?

The {$ModeSwitch UnicodeStrings} was added in FPC 2.7.1 and defines "String" as "UnicodeString" (UTF-16), "Char" as "WideChar", "PChar" as "PWideChar" and so forth. This affects only the current unit. Other units including those used by this unit have their own "String" definition. Many RTL strings and types (e.g. TStringList) uses 8-bit strings, which require conversions from/to UnicodeString, which are added automatically by the compiler. The LCL uses UTF-8 strings. It is recommended to use UTF-8 sources with or without "-FcUTF8".

Why not use UTF8String in Lazarus?

Short answer: Because the FCL does not use it.

Long answer: UTF8String is defined in the system unit as

UTF8String = type AnsiString(CP_UTF8);

The compiler always assumes it has UTF-8 encoding (CP_UTF8), which is a multi byte encoding (i.e. 1-4 bytes per codepoint). Note that the [] operator accesses bytes, not characters nor codepoints. Same for UnicodeString, but words instead of bytes. On the other hand a String is assumed at compile time to have DefaultSystemCodePage (CP_ACP). DefaultSystemCodePage is defined at run time, so the compiler conservatively assumes that String and UTF8String have different encodings. When you assign or combine String and UTF8String the compiler inserts conversion code. Same for ShortString and UTF8String.

Lazarus uses the FCL, which uses String, so using UTF8String would add conversions. If DefaultSystemCodePage is not UTF-8 you lose characters. If it is UTF-8 then there is no point to use UTF8String.

UTF8String becomes useful when eventually there is an UTF-16 FCL.