{{LCL Unicode Support}}
== Introduction ==
 
 
 
As of version 0.9.25, Lazarus has full Unicode support on all platforms except GTK1. On this page you can find instructions for Lazarus users, roadmaps, descriptions of basic concepts and implementation details.
 
 
 
== Instructions for users ==
 
 
 
Even though Lazarus has Unicode widgetsets, it is important to note that not everything is Unicode. It is the developer's responsibility to know which encoding their strings use and to convert properly between libraries that expect different encodings.
 
 
 
Usually the encoding is uniform per library (e.g. a dynamic link library (DLL) or a Lazarus package). Each library expects one kind of encoding, which is usually either Unicode (UTF-8 for Lazarus) or ANSI (the system encoding, which may or may not be UTF-8). The RTL and the FCL of FPC 2.4-2.6 expect ANSI strings.
 
 
 
You can convert between Unicode and ANSI using the '''UTF8ToAnsi''' and '''AnsiToUTF8''' functions from the System unit or the '''UTF8ToSys''' and '''SysToUTF8''' from the FileUtil unit. The latter two are smarter (faster) but pull more code into your program.
 
 
 
===FPC is not Unicode aware===
 
The Free Pascal Runtime Library (RTL), and the Free Pascal Free Component Library (FCL) in current FPC versions (up to 2.6.1) are ANSI, so you will need to convert strings coming from Unicode libraries or going to Unicode libraries (e.g. the LCL).
 
 
 
===Converting between ANSI and Unicode===
 
Examples:
 
 
 
Say you get a string from a TEdit and you want to give it to some RTL file routine:
 
 
 
<syntaxhighlight>var
  MyString: string; // UTF-8 encoded
begin
  MyString := MyTEdit.Text;
  SomeRTLRoutine(UTF8ToAnsi(MyString));
end;</syntaxhighlight>
 
 
 
And for the opposite direction:
 
 
 
<syntaxhighlight>var
  MyString: string; // ANSI encoded
begin
  MyString := SomeRTLRoutine;
  MyTEdit.Text := AnsiToUTF8(MyString);
end;</syntaxhighlight>
 
 
 
'''Important''': UTF8ToAnsi will return an empty string if the UTF8 string contains invalid characters.
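A defensive sketch (the variable names are illustrative; ValidUTF8String is from the lazutf8 unit listed below):

<syntaxhighlight>uses lazutf8;
...
var
  Input, S: string;
begin
  Input := MyTEdit.Text;
  S := UTF8ToAnsi(Input);
  if (S = '') and (Input <> '') then
    // Input contained invalid UTF-8: replace the broken bytes, then retry
    S := UTF8ToAnsi(ValidUTF8String(Input));
end;</syntaxhighlight>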
 
 
 
'''Important''': AnsiToUTF8 and UTF8ToAnsi require a widestring manager under Linux, BSD and Mac OS X. You can either use the SysToUTF8 and UTF8ToSys functions (unit FileUtil) or add a widestring manager by putting cwstring among the first units in your program's uses clause.
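For example, a minimal sketch of pulling in the widestring manager (everything besides cwstring is illustrative):

<syntaxhighlight>program Project1;
uses
  {$IFDEF UNIX}cwstring,{$ENDIF} // widestring manager; keep it near the front
  Classes, SysUtils;
begin
  // AnsiToUTF8 and UTF8ToAnsi now work under Linux, BSD and Mac OS X
end.</syntaxhighlight>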
 
 
 
===Widestrings and Ansistrings===
 
A widestring is a string type whose elements are 2 bytes in size. Widestrings almost always hold data in the UTF-16 encoding. See [[Widestrings]].
 
 
 
Note that while each element of a widestring, accessed as an array, is 2 bytes wide, in UTF-16 a character may consist of 1 or 2 such elements, occupying 2 or 4 bytes. This means that accessing a widestring as an array and expecting to obtain UTF-16 characters that way is '''completely wrong''' and will fail as soon as a 4-byte character is present in the string. Note also that UTF-16, like UTF-8, may contain decomposed characters: the character "Á", for example, might be encoded as a single code point or as two, "A" plus a combining accent. Thus in Unicode a text involving accented letters can often be encoded in multiple ways, and Lazarus and FPC do not handle this automatically.
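A small sketch of the first pitfall, with the character written as explicit UTF-8 bytes (UTF8ToUTF16 comes from the lazutf8 unit described below):

<syntaxhighlight>uses lazutf8;
...
var
  w: widestring;
begin
  // U+1D11E (MUSICAL SYMBOL G CLEF) written as its four UTF-8 bytes
  w := UTF8ToUTF16(#$F0#$9D#$84#$9E);
  writeln(Length(w)); // 2: one character, but two 16-bit elements
end;</syntaxhighlight>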
 
 
 
When assigning Ansistrings to Widestrings (or vice versa) you have to convert the encoding.
 
 
 
<syntaxhighlight>var
  w: widestring;
begin
  w:='Über'; // wrong, because FPC will convert from the system codepage to UTF-16
  w:=UTF8ToUTF16('Über'); // correct
  Button1.Caption:=UTF16ToUTF8(w);
end;</syntaxhighlight>
 
 
 
===Dealing with UTF8 strings and characters===
 
 
 
Until Lazarus 0.9.30 the UTF-8 handling routines were in the LCL, in the unit LCLProc. In Lazarus 0.9.31+ the routines in LCLProc are still available for backwards compatibility, but the real code for dealing with UTF-8 is located in the LazUtils package, in the unit lazutf8.
 
 
 
To perform operations on UTF-8 strings, use the routines from the unit lazutf8 instead of the routines from Free Pascal's SysUtils, because SysUtils is not yet prepared to deal with Unicode while lazutf8 is. Simply substitute the SysUtils routines with their lazutf8 equivalents, which have the same name except for an added "UTF8" prefix.
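For instance (a sketch; Edit1 stands for any source of UTF-8 text):

<syntaxhighlight>uses lazutf8;
...
var
  S: string;
begin
  S := UTF8UpperCase(Edit1.Text); // instead of SysUtils.UpperCase
  writeln(UTF8Length(S));         // counts characters, not bytes
end;</syntaxhighlight>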
 
 
 
Also note that simply iterating over chars as if the string were an array does not work in Unicode. This is not specific to UTF-8: one simply cannot assume that a character has a fixed size in Unicode. If you want to iterate over the characters of a UTF-8 string, there are basically two ways:
 
 
 
*iterate over the bytes - useful for searching a substring or when looking only at the ASCII characters of the UTF8 string. For example when parsing XML files.
 
*iterate over the characters - useful for graphical components like SynEdit. For example when you want to know the third printed character on the screen.
 
 
 
====Searching a substring====
 
 
 
Due to the design of UTF-8 you can simply use the normal string functions for searching a substring. Even though UTF-8 is a multi-byte encoding, a first byte can never be confused with a trailing byte, so searching for a valid UTF-8 string with Pos will always return a valid UTF-8 position:
 
 
 
<syntaxhighlight>uses lazutf8; // LCLProc for Lazarus 0.9.30 or lower
...
procedure Where(SearchFor, aText: string);
var
  BytePos: LongInt;
  CharacterPos: LongInt;
begin
  BytePos:=Pos(SearchFor,aText);
  // count the characters in front of the match (assumes the substring was
  // found); +1 makes the result 1-based, like Pos
  CharacterPos:=UTF8Length(PChar(aText),BytePos-1)+1;
  writeln('The substring "',SearchFor,'" is in the text "',aText,'"',
    ' at byte position ',BytePos,' and at character position ',CharacterPos);
end;</syntaxhighlight>
 
 
 
Due to the ambiguity of Unicode, Pos() (just like any comparison) might show unexpected behavior when, for example, one of the strings contains decomposed characters while the other uses the precomposed codes for the same letters. This is not automatically handled by the RTL.
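A sketch of the pitfall, with both encodings of "Á" written as explicit bytes:

<syntaxhighlight>// #$C3#$81 is UTF-8 for the precomposed U+00C1;
// 'A'#$CC#$81 is "A" followed by the combining acute accent U+0301
if Pos(#$C3#$81, 'A'#$CC#$81) = 0 then
  writeln('same letter on screen, but different byte sequences');</syntaxhighlight>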
 
 
 
====Accessing UTF8 characters====
 
 
 
Unicode characters can vary in length, so the best solution for accessing them is iteration, when one intends to access the characters in the sequence in which they appear. For iterating through the characters use code like this:
 
 
 
<syntaxhighlight>uses lazutf8; // LCLProc for Lazarus 0.9.30 or lower
...
procedure DoSomethingWithString(AnUTF8String: string);
var
  p: PChar;
  CharLen: integer;
  FirstByte, SecondByte, ThirdByte: Char;
begin
  p:=PChar(AnUTF8String);
  while p^ <> #0 do // the PChar of a string is always #0-terminated
  begin
    CharLen := UTF8CharacterLength(p);

    // Here you have a pointer to the char and its length
    // You can access the bytes of the UTF-8 char like this:
    if CharLen >= 1 then FirstByte := P[0];
    if CharLen >= 2 then SecondByte := P[1];
    if CharLen >= 3 then ThirdByte := P[2];

    inc(p,CharLen);
  end;
end;</syntaxhighlight>
 
 
 
====Accessing the Nth UTF8 character====
 
 
 
Besides iterating, one might also want random access to the characters of a UTF-8 string.
 
 
 
<syntaxhighlight>uses lazutf8; // LCLProc for Lazarus 0.9.30 or lower
...
var
  AnUTF8String, NthChar: string;
  N: Integer;
begin
  NthChar := UTF8Copy(AnUTF8String, N, 1);
end;</syntaxhighlight>
 
 
 
====Showing character codepoints with UTF8CharacterToUnicode====
 
 
 
The following demonstrates how to show the 32-bit code point value of each character in a UTF-8 string:
 
 
 
<syntaxhighlight>uses lazutf8; // LCLProc for Lazarus 0.9.30 or lower
...
procedure IterateUTF8Characters(const AnUTF8String: string);
var
  p: PChar;
  unicode: Cardinal;
  CharLen: integer;
begin
  p:=PChar(AnUTF8String);
  repeat
    unicode:=UTF8CharacterToUnicode(p,CharLen);
    writeln('Unicode=',unicode);
    inc(p,CharLen);
  until (CharLen=0) or (unicode=0);
end;</syntaxhighlight>
 
 
 
====UTF-8 String Copy, Length, LowerCase, etc====
 
 
 
Nearly all operations one might want to perform on UTF-8 strings are covered by the routines in the unit lazutf8 (unit LCLProc for Lazarus 0.9.30 or lower). See the following list of routines taken from lazutf8.pas:
 
 
 
<syntaxhighlight>
function UTF8CharacterLength(p: PChar): integer;
function UTF8Length(const s: string): PtrInt;
function UTF8Length(p: PChar; ByteCount: PtrInt): PtrInt;
function UTF8CharacterToUnicode(p: PChar; out CharLen: integer): Cardinal;
function UnicodeToUTF8(u: cardinal; Buf: PChar): integer; inline;
function UnicodeToUTF8SkipErrors(u: cardinal; Buf: PChar): integer;
function UnicodeToUTF8(u: cardinal): shortstring; inline;
function UTF8ToDoubleByteString(const s: string): string;
function UTF8ToDoubleByte(UTF8Str: PChar; Len: PtrInt; DBStr: PByte): PtrInt;
function UTF8FindNearestCharStart(UTF8Str: PChar; Len: integer;
                                  BytePos: integer): integer;
// find the n-th UTF8 character, ignoring BIDI
function UTF8CharStart(UTF8Str: PChar; Len, CharIndex: PtrInt): PChar;
// find the byte index of the n-th UTF8 character, ignoring BIDI (byte len of substr)
function UTF8CharToByteIndex(UTF8Str: PChar; Len, CharIndex: PtrInt): PtrInt;
procedure UTF8FixBroken(P: PChar);
function UTF8CharacterStrictLength(P: PChar): integer;
function UTF8CStringToUTF8String(SourceStart: PChar; SourceLen: PtrInt) : string;
function UTF8Pos(const SearchForText, SearchInText: string): PtrInt;
function UTF8Copy(const s: string; StartCharIndex, CharCount: PtrInt): string;
procedure UTF8Delete(var s: String; StartCharIndex, CharCount: PtrInt);
procedure UTF8Insert(const source: String; var s: string; StartCharIndex: PtrInt);

function UTF8LowerCase(const AInStr: string; ALanguage: string=''): string;
function UTF8UpperCase(const AInStr: string; ALanguage: string=''): string;
function FindInvalidUTF8Character(p: PChar; Count: PtrInt;
                                  StopOnNonASCII: Boolean = false): PtrInt;
function ValidUTF8String(const s: String): String;

procedure AssignUTF8ListToAnsi(UTF8List, AnsiList: TStrings);

//compare functions

function UTF8CompareStr(const S1, S2: string): Integer;
function UTF8CompareText(const S1, S2: string): Integer;
</syntaxhighlight>
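For example, the difference between byte count and character count (assuming the source file is saved as UTF-8 without BOM, as Lazarus does):

<syntaxhighlight>uses lazutf8;
...
var
  s: string;
begin
  s := 'über';            // the ü occupies two bytes in UTF-8
  writeln(Length(s));     // 5: Length counts bytes
  writeln(UTF8Length(s)); // 4: UTF8Length counts characters
end;</syntaxhighlight>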
 
 
 
===Dealing with directory and filenames===
 
 
 
Lazarus controls and functions expect filenames and directory names in UTF-8 encoding, but the RTL uses ANSI strings for directories and filenames.
 
 
 
For example, consider a button which sets the Directory property of a TFileListBox to the current directory. The RTL function [[doc:rtl/sysutils/getcurrentdir.html|GetCurrentDir]] is ANSI, not Unicode, so a conversion is needed:
 
 
 
<syntaxhighlight>procedure TForm1.Button1Click(Sender: TObject);
begin
  FileListBox1.Directory:=SysToUTF8(GetCurrentDir);
  // or use the functions from the FileUtil unit
  FileListBox1.Directory:=GetCurrentDirUTF8;
end;</syntaxhighlight>
 
 
 
The unit FileUtil defines common file functions with UTF-8 strings:
 
 
 
<syntaxhighlight>// basic functions similar to the RTL but working with UTF-8 instead of the
// system encoding

// AnsiToUTF8 and UTF8ToAnsi need a widestring manager under Linux, BSD, Mac OS X
// but normally these OS use UTF-8 as system encoding so the widestringmanager
// is not needed.
function NeedRTLAnsi: boolean;// true if system encoding is not UTF-8
procedure SetNeedRTLAnsi(NewValue: boolean);
function UTF8ToSys(const s: string): string;// as UTF8ToAnsi but more independent of widestringmanager
function SysToUTF8(const s: string): string;// as AnsiToUTF8 but more independent of widestringmanager

// file operations
function FileExistsUTF8(const Filename: string): boolean;
function FileAgeUTF8(const FileName: string): Longint;
function DirectoryExistsUTF8(const Directory: string): Boolean;
function ExpandFileNameUTF8(const FileName: string): string;
function ExpandUNCFileNameUTF8(const FileName: string): string;
{$IFNDEF VER2_2_0}
function ExtractShortPathNameUTF8(Const FileName : String) : String;
{$ENDIF}
function FindFirstUTF8(const Path: string; Attr: Longint; out Rslt: TSearchRec): Longint;
function FindNextUTF8(var Rslt: TSearchRec): Longint;
procedure FindCloseUTF8(var F: TSearchrec);
function FileSetDateUTF8(const FileName: String; Age: Longint): Longint;
function FileGetAttrUTF8(const FileName: String): Longint;
function FileSetAttrUTF8(const Filename: String; Attr: longint): Longint;
function DeleteFileUTF8(const FileName: String): Boolean;
function RenameFileUTF8(const OldName, NewName: String): Boolean;
function FileSearchUTF8(const Name, DirList : String): String;
function FileIsReadOnlyUTF8(const FileName: String): Boolean;
function GetCurrentDirUTF8: String;
function SetCurrentDirUTF8(const NewDir: String): Boolean;
function CreateDirUTF8(const NewDir: String): Boolean;
function RemoveDirUTF8(const Dir: String): Boolean;
function ForceDirectoriesUTF8(const Dir: string): Boolean;

// environment
function ParamStrUTF8(Param: Integer): string;
function GetEnvironmentStringUTF8(Index : Integer): String;
function GetEnvironmentVariableUTF8(const EnvVar: String): String;
function GetAppConfigDirUTF8(Global: Boolean): string;</syntaxhighlight>
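For instance, a small directory-listing sketch (the path is a placeholder):

<syntaxhighlight>uses SysUtils, FileUtil;
...
var
  Info: TSearchRec;
begin
  if FindFirstUTF8('/home/user/*', faAnyFile, Info) = 0 then
  begin
    repeat
      writeln(Info.Name); // Info.Name is UTF-8 encoded
    until FindNextUTF8(Info) <> 0;
    FindCloseUTF8(Info);
  end;
end;</syntaxhighlight>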
 
 
 
====Mac OS X====
 
 
 
The file functions of the FileUtil unit also take care of Mac OS X specific behaviour: OS X normalizes filenames. For example, the filename 'ä.txt' can be encoded in Unicode with two different byte sequences (#$C3#$A4 and 'a'#$CC#$88). Under Linux and BSD you can create a filename with either encoding. OS X automatically converts the ä to the decomposed three-byte sequence. This means:
 
 
 
<syntaxhighlight>if Filename1 = Filename2 then ... // is not sufficient under OS X
if AnsiCompareFileName(Filename1, Filename2) = 0 then ... // not sufficient under fpc 2.2.2, not even with cwstring
if CompareFilenames(Filename1, Filename2) = 0 then ... // this always works (unit FileUtil or FileProcs)</syntaxhighlight>
 
 
 
===East Asian languages on Windows===
 
 
 
The default font (Tahoma) for user interface controls under Windows XP is capable of correctly displaying several scripts/alphabets/languages, including Arabic, Russian (Cyrillic alphabet) and Western languages (Latin/Greek alphabets), but not East Asian languages, like Chinese, Japanese and Korean.
 
 
 
Simply go to the Control Panel, choose Regional Settings, click on the Languages tab and install the East Asian language pack; the standard user interface font will then show those languages correctly. Obviously, Windows XP versions localized for those languages already have this language pack installed. Extended instructions can be found [http://newton.uor.edu/Departments&Programs/AsianStudiesDept/Language/asianlanguageinstallation_XP.html here].
 
 
 
Later Windows versions presumably have support for these languages out of the box.
 
 
 
== Free Pascal Particularities ==
 
 
 
===UTF8 and source files - the missing BOM===
 
 
 
When you create source files with Lazarus and type some non-ASCII characters, the file is saved in UTF-8. It does '''not''' use a '''BOM''' (Byte Order Mark).
You can change the encoding via right click on the source editor / File Settings / Encoding. Apart from the fact that UTF-8 files are not supposed to have BOMs, the reason for the missing BOM is how FPC treats Ansistrings: for compatibility the LCL uses Ansistrings, and for portability the LCL uses UTF-8.
 
 
 
Note: Some MS Windows text editors might treat such files as encoded with the system codepage (OEM codepage) and show the non-ASCII parts as invalid characters. Do not add the BOM. If you add the BOM you have to change all string assignments.
 
 
 
For example:
 
 
 
<syntaxhighlight>Button1.Caption := 'Über';</syntaxhighlight>
 
 
 
When no BOM is given (and no codepage parameter was passed), the compiler treats the string as being in the system encoding and copies each byte unconverted into the string. This is how the LCL expects strings.
 
 
 
<syntaxhighlight>// source file saved as UTF-8 without BOM
if FileExists('Über.txt') then ; // wrong, because FileExists expects system encoding
if FileExistsUTF8('Über.txt') then ; // correct</syntaxhighlight>
 
 
 
 
 
== Unicode essentials ==
 
 
 
The Unicode standard maps integers from 0 to 10FFFFh to characters. Each such integer is called a code point. In other words, Unicode characters are in principle defined for code points from U+000000 to U+10FFFF (0 to 1,114,111).
 
 
 
There are three major schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. Conversions between all of them are possible. Here are their basic properties:
 
 
 
{| class="wikitable"
!                           !! UTF-8  !! UTF-16 !! UTF-32
|-
| Smallest code point [hex] || 000000 || 000000 || 000000
|-
| Largest code point [hex]  || 10FFFF || 10FFFF || 10FFFF
|-
| Code unit size [bits]     || 8      || 16     || 32
|-
| Minimal bytes/character   || 1      || 2      || 4
|-
| Maximal bytes/character   || 4      || 4      || 4
|}
 
 
 
'''UTF-8''' has several important and useful properties:
*It is interpreted as a sequence of bytes, so the concept of low- and high-order bytes does not exist.
*Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
*All characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
*No byte sequence of one character is contained within a longer byte sequence of another character. This allows easy searching for substrings.
*The first byte of a multibyte sequence (representing a non-ASCII character) is always in the range C0h to FDh and indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.
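A quick way to see the byte layout (assuming the source file is saved as UTF-8 without BOM, as Lazarus does):

<syntaxhighlight>const
  s = 'aä'; // 'a' is one byte ($61), 'ä' is the two-byte sequence $C3 $A4
var
  i: Integer;
begin
  for i := 1 to Length(s) do
    write(HexStr(ord(s[i]), 2), ' '); // prints: 61 C3 A4
  writeln;
end;</syntaxhighlight>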
 
 
 
'''UTF-16''' has the following most important properties: it uses a single 16-bit word to encode characters from U+0000 to U+D7FF and from U+E000 to U+FFFF (the Basic Multilingual Plane minus the surrogate range), and a pair of 16-bit words, a so-called surrogate pair, to encode any of the remaining Unicode characters.
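A sketch of the pairing arithmetic (the procedure name is made up); for example, U+1D11E becomes the pair $D834 $DD1E that appears later on this page:

<syntaxhighlight>procedure EncodeSurrogatePair(CodePoint: Cardinal; out HiUnit, LoUnit: Word);
var
  v: Cardinal;
begin
  v := CodePoint - $10000;        // 20 bits remain
  HiUnit := $D800 + (v shr 10);   // the high surrogate carries the upper 10 bits
  LoUnit := $DC00 + (v and $3FF); // the low surrogate carries the lower 10 bits
end;
// EncodeSurrogatePair($1D11E, HiUnit, LoUnit) yields $D834 and $DD1E</syntaxhighlight>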
 
 
 
Finally, any Unicode character can be represented as a single 4 byte/32-bit unit in '''UTF-32'''.
 
 
 
For more, see:
 
[http://www.unicode.org/faq/basic_q.html Unicode FAQ - Basic questions],
 
[http://www.unicode.org/faq/utf_bom.html Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM],
 
[http://en.wikipedia.org/wiki/UTF-8 Wikipedia: UTF-8],
[http://en.wikipedia.org/wiki/ISO-8859 Wikipedia: ISO-8859]
 
 
 
= Implementation Details  =
 
 
 
== Lazarus/LCL generally uses only UTF-8 ==
 
Since the GTK1 interface was declared obsolete in Lazarus 0.9.31, all LCL interfaces are Unicode capable, and Lazarus and the LCL use and accept only UTF-8 encoded strings, except in routines explicitly marked as accepting other encodings.
 
 
 
== Unicode-enabling the win32 interface  ==
 
 
 
=== Overview ===
 
First, and most importantly, all Unicode patches for the Win32 interface must be enclosed in IFDEF WindowsUnicodeSupport, to avoid breaking the existing ANSI interface. Once this stabilizes, all ifdefs will be removed and only the Unicode part will remain. At that point all existing programs that use ANSI characters will need migration to Unicode.
 
 
 
=== No Unicode support on Win9x ===
 
Windows platforms up to and including Win9x are based on ISO codepage standards and only partially support Unicode. Windows platforms starting with Windows NT, as well as Windows CE, fully support Unicode.
 
 
 
Win9x and NT offer two parallel sets of API functions: the old ANSI-enabled *A functions and the new Unicode-enabled *W functions. *W functions accept wide strings, i.e. UTF-16 encoded strings, as parameters. Windows 9x has all the *W functions, but most are empty implementations that do nothing; the exceptions, which are fully implemented even on 9x, are listed below in the section "Wide functions present on Windows 9x". This is relevant because it allows a single application to run on both Win9x and WinNT and to detect at runtime which set of APIs to use.
 
 
 
Windows CE only uses Wide API functions.
 
 
 
====Wide functions present on Windows 9x====
 
Some Wide API functions are present on Windows 9x. Here is a list of such functions: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp
 
 
 
Conversion example:
 
 
 
<syntaxhighlight>GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption),
  Length(ButtonCaption), TextSize);</syntaxhighlight>
 
 
 
Becomes:
 
 
 
<syntaxhighlight>{$ifdef WindowsUnicodeSupport}
  WideCaption := Utf8Decode(ButtonCaption); // WideCaption: WideString
  GetTextExtentPoint32W(hdcNewBitmap, PWideChar(WideCaption), Length(WideCaption), TextSize);
{$else}
  GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);
{$endif}</syntaxhighlight>
 
 
 
====Functions that need Ansi and Wide versions====
 
 
 
First Conversion example:
 
 
 
<syntaxhighlight>function TGDIWindow.GetTitle: String;
var
  l: Integer;
begin
  l := Windows.GetWindowTextLength(Handle);
  SetLength(Result, l);
  // l + 1: GetWindowText needs room for the terminating #0
  Windows.GetWindowText(Handle, @Result[1], l + 1);
end;</syntaxhighlight>
 
 
 
Becomes:
 
 
 
<syntaxhighlight>function TGDIWindow.GetTitle: String;
var
  l: Integer;
  AnsiBuffer: string;
  WideBuffer: WideString;
begin
{$ifdef WindowsUnicodeSupport}
  if UnicodeEnabledOS then
  begin
    l := Windows.GetWindowTextLengthW(Handle);
    SetLength(WideBuffer, l);
    // l + 1: GetWindowTextW needs room for the terminating #0
    l := Windows.GetWindowTextW(Handle, @WideBuffer[1], l + 1);
    SetLength(WideBuffer, l);
    Result := Utf8Encode(WideBuffer);
  end
  else
  begin
    l := Windows.GetWindowTextLength(Handle);
    SetLength(AnsiBuffer, l);
    l := Windows.GetWindowText(Handle, @AnsiBuffer[1], l + 1);
    SetLength(AnsiBuffer, l);
    Result := AnsiToUtf8(AnsiBuffer);
  end;
{$else}
  l := Windows.GetWindowTextLength(Handle);
  SetLength(Result, l);
  Windows.GetWindowText(Handle, @Result[1], l + 1);
{$endif}
end;</syntaxhighlight>
 
 
 
=== Screenshots ===
 
 
 
[[Image:Lazarus Unicode Test.png]]
 
 
 
= FPC codepages =
 
 
 
Why does the LCL not use a codepage for its sources?
 
 
 
The -Fcutf8 command line option and the {$codepage utf8} directive have existed for ages.
 
 
 
The following programs require FPC 2.7.1, but the described problems exist with older compilers too.
 
 
 
There are some traps with -Fcutf8 and {$codepage utf8}: they only work if the RTL's DefaultSystemCodePage is CP_UTF8. Otherwise your string constants are converted by the compiler.
For example, under Linux the RTL default is CP_ACP, which in turn defaults to ISO 8859-1. The RTL does '''not''' read your environment language on its own, so the default stays ISO 8859-1. This means your UTF-8 string constants are converted by the compiler:
 
 
 
Compile this with -Fcutf8 and run it on a Linux system with LANG set to a UTF-8 locale:
 
 
 
<syntaxhighlight>
program project1;
{$mode objfpc}{$H+}
begin
  writeln(DefaultSystemCodePage,' ',CP_UTF8);
  writeln('ä');
end.
</syntaxhighlight>
 
 
 
This results in:
<pre>
0 65001
ä
</pre>
 
 
 
The LCL uses a widestringmanager (at the moment cwstring), which sets DefaultSystemCodePage. You can do the same in your non-LCL programs:
 
 
 
<syntaxhighlight>
program project1;
{$mode objfpc}{$H+}
uses cwstring;
begin
  writeln(DefaultSystemCodePage,' ',CP_UTF8);
  writeln('ä');
end.
</syntaxhighlight>
 
 
 
This results in:
<pre>
65001 65001
ä
</pre>
 
 
 
The above is a simplification though, a lie for children.
 
See the program below:
 
 
 
<syntaxhighlight>
program project1;
{$mode objfpc}{$H+}
uses cwstring;
var
  a,b,c: string;
begin
  writeln(DefaultSystemCodePage,' ',CP_UTF8);
  a:='ä'; b:='='#$C3#$A4; // #$C3#$A4 is UTF-8 for ä
  c:=      'ä='#$C3#$A4;
  writeln(a,b); // writes ä=ä
  writeln(c);   // writes ä=Ã¤
end.
</syntaxhighlight>
 
 
 
<pre>
65001 65001
ä=ä
ä=Ã¤
</pre>
 
 
 
You can see that a UTF-8 string constant works, and a string constant built from UTF-8 byte codes works too, but the combination does not. The above was compiled with -Fcutf8 and uses cwstring to set DefaultSystemCodePage to CP_UTF8.
 
 
 
So what went wrong?
 
 
 
The compiler treats any string constant containing non-ASCII characters (here: the ä) as a widestring (UCS-2, not UTF-16).
 
 
 
You can not fool the compiler with 'ä='+#$C3#$A4. You must define two separate string constants.
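A sketch of that workaround, mirroring the program above (still compiled with -Fcutf8 and using cwstring):

<syntaxhighlight>var
  a, c: string;
begin
  a := 'ä=';         // non-ASCII constant: parsed as widestring, converted back correctly
  c := a + #$C3#$A4; // the byte constant appended at run time stays untouched
  writeln(c);        // writes ä=ä
end;</syntaxhighlight>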
 
 
 
Using any character outside the UCS-2 range results in:
<pre>
Fatal: illegal character "'�'" ($F0)
</pre>
 
 
 
You can specify such characters with UTF-16 codes: ''#$D834#$DD1E''. Yes, you read that right: specifying the codepage with -Fcutf8 or {$codepage utf8} actually gives you a mix of UTF-8 and UTF-16.
 
 
 
Now compile the above without -Fcutf8:
<pre>
65001 65001
ä=ä
ä=ä
</pre>
 
 
 
Wow, everything looks as expected. You can even mix ASCII and non-ASCII string constants.
Without the codepage switch the compiler stores string constants as plain byte sequences, which is exactly what UTF-8 is.
 
 
 
That's one of the reasons why LCL applications do not use the codepage flags.
 
 
 
msegui has implemented an ecosystem of widestrings, so it works better with a codepage.
 
 
 
= See Also =
 
 
 
* [[UTF-8]] - Description of UTF-8 strings
 
 
 
[[Category:LCL]]
 
