Unicode Support in Lazarus

From Free Pascal wiki
Revision as of 17:49, 31 January 2015 by JuhaManninen (talk | contribs) (FAQ)


This page covers Unicode support in LCL starting from version 2.0 which uses features of FPC 3.0+.


Note: The feature and this page are under construction. Please test it with Lazarus trunk and report your findings in Lazarus mailing list or in bug tracker.

The old way to support UTF-8 in LCL using FPC versions up to 2.6.4 is explained here: LCL_Unicode_Support

RTL with default codepage UTF-8

Usually the RTL uses the system codepage for strings (e.g. FileExists and TStringList.LoadFromFile). On Windows this is a non Unicode encoding, so you can only use characters from your language group. The LCL works with UTF-8 encoding, which is the full Unicode range. On Linux and Mac OS X UTF-8 is typically the system codepage, so the RTL uses here by default CP_UTF8.

Since FPC 2.7.1 the default system codepage of the RTL can be changed to UTF-8 (CP_UTF8). So Windows users can now use UTF-8 strings in the RTL. You can test it by adding "-dEnableUTF8RTL" to the Lazarus build options and recompiling your project (you might want to add -FcUTF8 to your project and packages).

  • For example FileExists and aStringList.LoadFromFile(Filename) now support full Unicode. See here for the complete list of functions that already support full Unicode:


  • AnsiToUTF8, UTF8ToAnsi, SysToUTF8, UTF8ToAnsi have no effect. They were mainly used for the above RTL functions, which no longer need a conversion. For WinAPI functions see below.
  • Many UTF8Encode and UTF8Decode calls are no longer needed, because when assigning UnicodeString to String and vice versus the compiler does it automatically for you.
  • When accessing the WinAPI you must use the "W" functions or use the functions UTF8ToWinCP and WinCPToUTF8.
  • You can enable the new mode by compiling Lazarus clean with -dEnableUTF8RTL.
  • If you use string literals with WideString, UnicodeString or UTF8String, your sources now must have the right encoding. For example you can use UTF-8 source files (Lazarus default) and pass -FcUTF8 to the compiler.
  • "String" and "UTF8String" are different types. If you assign a String to an UTF8String the compiler adds code to check if the encoding is the same. This costs unnecessary time and increases code size. Simply use String instead of UTF8String.

More information about the new FPC Unicode Support: http://wiki.freepascal.org/FPC_Unicode_support

Open issues

  • TFormatSettings char: for example: ThousandSeparator, DecimalSeparator, DateSeparator, TimeSeparator, ListSeparator. These should be replaced with string to support UTF-8. For example under Linux with LC_NUMERIC=ru_RU.utf8 the thousand separator is the two byte nbsp/160. Workaround: use space instead of nbsp.

Compatibility with Unicode Delphi

RTL functions in ASCII area

RTL functions that work in ASCII area (e.g. UpperCase) are compatible, but they work faster in the UTF-8 RTL. In Delphi all string functions became slower after they switched to UTF-16.

RTL Ansi...() Unicode functions

RTL Ansi...() functions that work with codepages / Unicode (e.g. AnsiUpperCase) are compatible.

Reading individual codepoints

Code that needs to read individual codepoints inside a string must be different for UTF-16 and UTF-8.

ToDo: Examples for both.

Helper functions for CodePoints

  • LCL will have CodePointCopy() and CodePointLength() functions ...

Compatibility with LCL in Lazarus 1.x

ToDo ...


What about Mode DelphiUnicode?

The {$mode delphiunicode} was added in FPC 2.7.1 and is like {$Mode Delphi} with {ModeSwitch UnicodeStrings}. See the next question about ModeSwitch UnicodeStrings.

What about ModeSwitch UnicodeStrings?

The {$ModeSwitch UnicodeStrings} was added in FPC 2.7.1 and defines "String" as "UnicodeString" (UTF-16), "Char" as "WideChar", "PChar" as "PWideChar" and so forth. This affects only the current unit. Other units including those used by this unit have their own "String" definition. Many RTL strings and types (e.g. TStringList) uses 8-bit strings, which require conversions from/to UnicodeString, which are added automatically by the compiler. The LCL uses UTF-8 strings. It is recommended to use UTF-8 sources and compile with "-FcUTF8".