Difference between revisions of "Unicode Support in Lazarus"

From Free Pascal wiki
Revision as of 01:04, 6 February 2015

Introduction

This page covers Unicode support in Lazarus + LCL starting from version 2.0 which uses features of FPC 3.0+.

Note: The feature and this page are under construction. Please test it with Lazarus trunk and report your findings on the Lazarus mailing list or in the bug tracker.

The old way to support UTF-8 in LCL using FPC versions up to 2.6.4 is explained here: LCL_Unicode_Support

RTL with default codepage UTF-8

Usually the RTL uses the system codepage for strings (e.g. FileExists and TStringList.LoadFromFile). On Windows this is a non-Unicode encoding, so you can only use characters from your language group. The LCL works with UTF-8 encoding, which covers the full Unicode range. On Linux and Mac OS X, UTF-8 is typically the system codepage, so the RTL uses CP_UTF8 there by default.

Since FPC 2.7.1 the default system codepage of the RTL can be changed to UTF-8 (CP_UTF8). So Windows users can now use UTF-8 strings in the RTL.

  • For example FileExists and aStringList.LoadFromFile(Filename) now support full Unicode. See here for the complete list of functions that already support full Unicode:

http://wiki.freepascal.org/FPC_Unicode_support#RTL_changes

  • AnsiToUTF8, UTF8ToAnsi, SysToUTF8, UTF8ToSys have no effect. They were mainly used for the above RTL functions, which no longer need a conversion. For WinAPI functions see below.
  • Many UTF8Encode and UTF8Decode calls are no longer needed, because the compiler converts automatically when you assign a UnicodeString to a String and vice versa.
  • When accessing the WinAPI you must use the "W" functions or use the functions UTF8ToWinCP and WinCPToUTF8.
  • "String" and "UTF8String" are different types. If you assign a String to an UTF8String the compiler adds code to check if the encoding is the same. This costs unnecessary time and increases code size. Simply use String instead of UTF8String.
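The automatic conversion mentioned in the list above can be sketched as follows. This is a minimal illustration, not LCL code: it switches the default codepage by hand with the System routine SetMultiByteConversionCodePage so that the program stands alone, and the program name is made up for the example.

```pascal
program autoconv_demo;
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring; // widestring manager, needed for conversions on Unix
{$endif}
var
  u: UnicodeString;
  s: String;
begin
  // Assumption: set the default String codepage to UTF-8 by hand for this demo.
  SetMultiByteConversionCodePage(CP_UTF8);

  SetLength(u, 1);
  u[1] := WideChar($00E4);   // U+00E4 (ä), a single UTF-16 code unit

  s := u;                    // implicit UTF-16 -> UTF-8 conversion, no UTF8Encode
  writeln(Length(u), ' UTF-16 code unit(s), ', Length(s), ' UTF-8 byte(s)');

  u := s;                    // implicit UTF-8 -> UTF-16 conversion, no UTF8Decode
  writeln(Length(u), ' UTF-16 code unit(s) after the round trip');
end.
```

With the codepage set to UTF-8, the String should hold two bytes for ä and the round trip should give back one UTF-16 code unit.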

More information about the new FPC Unicode Support: http://wiki.freepascal.org/FPC_Unicode_support

Testing with Lazarus

You can enable the new mode by adding -dEnableUTF8RTL on the "Additions and Overrides" page of your project options. This way it affects both the project and all its dependent packages, including LCL and LazUtils.

If you use string literals with WideString, UnicodeString or UTF8String, your sources now must have the right encoding. For example you can use UTF-8 source files (Lazarus default) and pass -FcUTF8 to the compiler.

So, the flags are:

-dEnableUTF8RTL
-FcUTF8

There is a shortcut button "Support UTF-8 RTL" at the top of the Project Options dialog, next to the Build Mode selection. It adds these flags to the active build mode with a single click. (Remember: the Compiler Options section must be selected to see the Build Mode controls.)

ToDo: A screenshot.

What actually happens then? These two FPC functions are called in an early initialization section, setting the default String encoding in FPC to UTF-8:

 SetMultiByteConversionCodePage(CP_UTF8);
 SetMultiByteRTLFileSystemCodePage(CP_UTF8);

Also the UTF8...() functions in LCL are set as backends for RTL's Ansi...() functions.
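To see the effect of the two calls in isolation, a small stand-alone program can print the RTL codepage variables before and after. This is only a sketch with a made-up program name. It uses the System variables DefaultSystemCodePage and DefaultRTLFileSystemCodePage, and it does not reproduce the Ansi...() backend replacement that the LCL performs.

```pascal
program codepage_demo;
{$mode objfpc}{$H+}
begin
  // Values before the switch depend on the platform and locale.
  writeln('before: ', DefaultSystemCodePage, ' / ', DefaultRTLFileSystemCodePage);

  SetMultiByteConversionCodePage(CP_UTF8);
  SetMultiByteRTLFileSystemCodePage(CP_UTF8);

  writeln('after:  ', DefaultSystemCodePage, ' / ', DefaultRTLFileSystemCodePage);
  writeln('UTF-8 active: ', DefaultSystemCodePage = CP_UTF8);
end.
```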

Compatibility with Unicode Delphi

RTL functions in ASCII area

RTL functions that work in the ASCII area (e.g. UpperCase) are compatible, but they work faster in the UTF-8 RTL. In Delphi all string functions became slower after they switched to UTF-16.

RTL Ansi...() Unicode functions

RTL Ansi...() functions that work with codepages / Unicode (e.g. AnsiUpperCase) are compatible.

Reading individual codepoints

Not compatible. Code that must read individual codepoints inside a string must be different for UTF-16 and UTF-8. Fortunately it is not needed very often in user code because it is encapsulated in libraries and because often the characters of interest are in ASCII area. For example many XML and HTML parsers continue to work with both encodings.

Delphi has functions like NextCharIndex to deal with codepoints consisting of 2 UnicodeChars. It also has functions for surrogate pairs. However those functions are not used much in example code and tutorials. Most tutorials say that Copy() function works just as it did with Delphi versions before D2009. No, it now works with UnicodeChar resolution and it is possible to get half of a codepoint.
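The "half of a codepoint" case can be reproduced with a UnicodeString in FPC. The sketch below builds U+1F600 from its two surrogates (the codepoint is just an example) and shows that Copy() cuts between them.

```pascal
program surrogate_demo;
{$mode objfpc}{$H+}
var
  s, half: UnicodeString;
begin
  SetLength(s, 2);
  s[1] := WideChar($D83D);   // high surrogate of U+1F600
  s[2] := WideChar($DE00);   // low surrogate of U+1F600

  half := Copy(s, 1, 1);     // Copy counts UTF-16 code units, not codepoints

  writeln('Length(s) = ', Length(s));
  writeln('Copy(s,1,1) is a lone high surrogate: ',
    (Length(half) = 1) and (Ord(half[1]) >= $D800) and (Ord(half[1]) <= $DBFF));
end.
```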

UTF-8 has an advantage here. Codepoint handling must always be done right because multi-byte codepoints are so common.
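As a rough illustration of why byte length and codepoint count differ in UTF-8, the following sketch counts codepoints by skipping continuation bytes. CountCodePoints is a hypothetical helper written for this example and assumes valid UTF-8 input. It is not an LCL function.

```pascal
program utf8_count_demo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts every byte that is not a UTF-8
// continuation byte (bit pattern 10xxxxxx). Assumes valid UTF-8.
function CountCodePoints(const s: String): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: String;
begin
  s := 'a'#$C3#$A4'b';                 // "aäb", with ä written as UTF-8 bytes
  writeln('bytes:      ', Length(s));          // 4
  writeln('codepoints: ', CountCodePoints(s)); // 3
  // A byte-based Copy can cut ä in half: this keeps 'a' plus the lead byte only.
  writeln('Copy(s,1,2) has ', CountCodePoints(Copy(s, 1, 2)),
    ' lead byte(s) in ', Length(Copy(s, 1, 2)), ' byte(s)');
end.
```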

See the section "Dealing with UTF8 strings and characters in code" below for examples of how to use UTF-8.

Compatibility with LCL in Lazarus 1.x

Lazarus LCL applications will continue to work without changes. However, handling Unicode has become simpler, and it makes sense to clean up the code.

Explicit conversion functions are not needed. FPC takes care of converting encodings automatically when needed. Empty conversion functions are provided to make your old code compile.

  • UTF8Decode, UTF8Encode - Almost all can be removed.
  • UTF8ToSys, SysToUTF8 - All can be removed.
  • UTF8ToAnsi, AnsiToUTF8 - All can be removed.

File functions in RTL now take care of file name encoding. All (?) file name related ...UTF8() functions can be replaced with the Delphi compatible function without UTF8 suffix. For example FileExistsUTF8 can be replaced with FileExists.

Most UTF8...() string functions can be replaced with the Delphi compatible Ansi...() functions. ToDo: explain more.
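The file name cleanup described above might look like the sketch below. It is stand-alone for illustration only: it sets the codepages by hand instead of relying on -dEnableUTF8RTL, and the file name "demo_ä.txt" is a made-up example, spelled as UTF-8 bytes so the result does not depend on the source file encoding.

```pascal
program filename_demo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}
  SysUtils, Classes;
var
  FileName: String;
  List: TStringList;
begin
  // Assumption: done by hand here so the program runs without LazUtils.
  SetMultiByteConversionCodePage(CP_UTF8);
  SetMultiByteRTLFileSystemCodePage(CP_UTF8);

  FileName := 'demo_'#$C3#$A4'.txt';   // hypothetical name "demo_ä.txt"
  List := TStringList.Create;
  try
    List.Add('hello');
    List.SaveToFile(FileName);

    // Lazarus 1.x style used FileExistsUTF8(FileName) here.
    if FileExists(FileName) then
    begin
      List.Clear;
      List.LoadFromFile(FileName);     // no UTF8ToSys(FileName) needed
      writeln(List.Count, ' line(s) read');
    end;

    DeleteFile(FileName);              // clean up the demo file
  finally
    List.Free;
  end;
end.
```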

Dealing with UTF8 strings and characters in code

See details in UTF8_strings_and_characters.

FPC codepages

The compiler (FPC) supports specifying the code page in which the source code has been written via the command option -Fc (e.g. -Fcutf8) and the equivalent codepage directive (e.g. {$codepage utf8}). In this case, rather than literally copying the bytes that represent the string constants in your program, the compiler will interpret all character data according to that codepage. There are two things to watch out for though:

  • On Unix platforms, make sure you include a widestring manager by adding the cwstring unit to your uses-clause. Without it, the program will not be able to convert all character data correctly when running. It's not included by default because this unit makes your program dependent on libc, which makes cross-compilation harder.
  • The compiler converts all string constants that contain non-ASCII characters to widestring constants. These are automatically converted back to ansistring (either at compile time or at run time), but this can cause one caveat if you try to mix both characters and ordinal values in a single string constant:

For example:

program project1;
{$codepage utf8}
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring;
{$endif}
var
  a,b,c: string;
begin
  a:='ä';
  b:='='#$C3#$A4; // #$C3#$A4 is UTF-8 for ä
  c:='ä='#$C3#$A4;
  writeln(a,b); // writes ä=ä
  writeln(c);   // writes ä=Ã¤
end.

When compiled and executed, this will write:

ä=ä
ä=Ã¤

The reason is that once the ä is encountered, the rest of the string constant assigned to 'c' is parsed as a widestring, as mentioned above. As a result, #$C3 and #$A4 are interpreted as widechar(#$C3) and widechar(#$A4), that is, as the characters Ã and ¤, rather than as ansichars.
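Following that explanation, one way to keep a mixed constant consistent is to write the ordinal as a codepoint rather than as UTF-8 bytes. The sketch below assumes the behaviour described above and a source file saved as UTF-8. It has not been checked against every FPC version.

```pascal
program widechar_ordinal;
{$codepage utf8}
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring;
{$endif}
var
  c: string;
begin
  // The constant contains a non-ASCII character, so it is parsed as a
  // widestring; #$00E4 is therefore read as the codepoint U+00E4 (ä).
  c := 'ä='#$00E4;
  writeln(c);
end.
```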

Open issues

  • TFormatSettings char: for example: ThousandSeparator, DecimalSeparator, DateSeparator, TimeSeparator, ListSeparator. These should be replaced with string to support UTF-8. For example under Linux with LC_NUMERIC=ru_RU.utf8 the thousand separator is the two byte nbsp/160. Workaround: use space instead of nbsp.
  • ToDo: Other open issues.

FAQ

What about Mode DelphiUnicode?

The {$mode delphiunicode} was added in FPC 2.7.1 and is like {$Mode Delphi} with {$ModeSwitch UnicodeStrings}. See the next question about ModeSwitch UnicodeStrings.

What about ModeSwitch UnicodeStrings?

The {$ModeSwitch UnicodeStrings} was added in FPC 2.7.1 and defines "String" as "UnicodeString" (UTF-16), "Char" as "WideChar", "PChar" as "PWideChar" and so forth. This affects only the current unit. Other units, including those used by this unit, have their own "String" definition. Many RTL strings and types (e.g. TStringList) use 8-bit strings, which require conversions from/to UnicodeString; the compiler adds these automatically. The LCL uses UTF-8 strings. It is recommended to use UTF-8 sources and compile with "-FcUTF8".
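The per-unit effect of the modeswitch can be checked with SizeOf. This is a small sketch with a made-up program name; AnsiString is used as the 8-bit counterpart for comparison.

```pascal
program modeswitch_demo;
{$mode objfpc}{$H+}
{$modeswitch unicodestrings}
var
  s: String;       // UnicodeString in this unit because of the modeswitch
  a: AnsiString;   // still an 8-bit string
begin
  s := 'abc';
  a := 'abc';
  writeln('SizeOf(Char) = ', SizeOf(Char));   // 2, Char is WideChar here
  writeln('SizeOf(s[1]) = ', SizeOf(s[1]));   // 2
  writeln('SizeOf(a[1]) = ', SizeOf(a[1]));   // 1
end.
```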