Unicode Support in Lazarus/ja

From Free Pascal wiki
Revision as of 18:50, 19 June 2017 by Miyatakejiro (talk | contribs)

Introduction

This page explains Unicode support in Lazarus for programs (console, server and other non-GUI) and for applications (GUI built with the LCL), using the features of FPC 3.0+.

The old way of handling UTF-8 with FPC up to 2.6.4 is explained in LCL Unicode Support/ja.

RTL with default codepage UTF-8

By default the RTL uses the system codepage for AnsiStrings (e.g. in FileExists and TStringList.LoadFromFile). On Windows this is not Unicode, so you can only use characters from your language group (at most 256 characters). The LCL on the other hand works with UTF-8, which covers the whole Unicode range. On Linux and Mac OS X, UTF-8 is usually the system codepage, so there the RTL can use UTF-8 by default.

FPC 3.0 and later provide an API to change the RTL's default system codepage. Lazarus (actually its LazUtils package) uses that API to change it to UTF-8 (CP_UTF8). This means that Windows users can now use UTF-8 strings in the RTL as well.

  • For example, FileExists and StringList.LoadFromFile(Filename) are now fully Unicode aware. For a list of functions that are fully Unicode aware, see:

RTL changes

  • SysToUTF8 and UTF8ToSys have no effect any more. They were mainly used around RTL functions, which no longer need a conversion. For WinAPI functions see below.
  • Most calls of UTF8Encode and UTF8Decode are no longer needed, because the compiler inserts the conversion automatically when you assign a UnicodeString to a String and vice versa.
  • When calling the WinAPI, either use the "W" functions or convert with UTF8ToWinCP and WinCPToUTF8. The same applies to libraries that use the Ansi WinAPI functions; for example the registry unit needs this in FPC 3.0 and below.
  • "String" and "UTF8String" are distinct types. If you assign a String to a UTF8String, the compiler adds code to check whether the encodings are equal, which wastes time and increases code size. Simply use String instead of UTF8String.
  • The Windows console uses yet another encoding. With FPC 3.0+, writeln converts UTF-8 to the console codepage automatically. If a Windows console program must read and write in the console codepage, convert with UTF8ToConsole and ConsoleToUTF8.
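A minimal sketch of the last point, assuming a Windows console program with the LazUtils package available (the converter names come from the LazUTF8 unit):

```pascal
program ConsoleCP;
{$mode objfpc}{$H+}
uses
  LazUTF8; // from LazUtils: switches the RTL to UTF-8 and provides the converters
var
  Line, Utf8Line: string;
begin
  ReadLn(Line);                      // raw input arrives in the console codepage
  Utf8Line := ConsoleToUTF8(Line);   // convert to UTF-8 for processing
  // ... process Utf8Line here ...
  WriteLn(UTF8ToConsole(Utf8Line));  // convert back before writing
end.
```

With FPC 3.0+ the explicit UTF8ToConsole on output is often unnecessary, since writeln converts automatically; the sketch shows the fully explicit variant.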


For more details about Unicode support in the new FPC, see: FPC Unicode support

Usage in Lazarus

The new mode is enabled automatically when compiling with FPC 3.0+. It can be disabled by defining -dDisableUTF8RTL. See Lazarus with FPC3.0 without UTF-8 mode for details.

If you use string literals in the new mode, your sources must be encoded in UTF-8. However, -FcUTF8 is mostly not needed. See "String Literals" for details.

What actually happens in the new mode? To set FPC's default string encoding to UTF-8, the following two FPC functions are called early in an initialization section:

 SetMultiByteConversionCodePage(CP_UTF8);
 SetMultiByteRTLFileSystemCodePage(CP_UTF8);

In addition, the UTF8...() functions of LazUTF8 (LazUtils) are set as backends for the RTL's Ansi...() functions.

In console programs (non-LCL) you must add LazUtils to the dependencies manually. (LCL applications already have the needed dependency through the LCL.)

Using UTF-8 in non-LCL programs

If you want to use UTF-8 strings in a non-LCL program, add the LazUtils package to its dependencies and put the LazUTF8 unit into the uses section of the main program file. It must be as close to the beginning as possible, right after critical memory managers and threading units (e.g. cmem, heaptrc, cthreads).
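A minimal sketch of such a main program file, assuming the LazUtils dependency has been added:

```pascal
program NonLclUtf8;
{$mode objfpc}{$H+}
uses
  {$IFDEF UNIX}cthreads,{$ENDIF} // threading / memory manager units come first
  LazUTF8,                       // from LazUtils: right after them, before anything else
  SysUtils;
begin
  // With the UTF-8 backends installed, Ansi...() handles non-ASCII letters too
  WriteLn(AnsiUpperCase('ähnlich'));
end.
```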

Compatibility with Unicode Delphi

In console programs the LazUTF8 unit must be in the uses section of the main program file. Delphi has no such unit.

RTL functions in the ASCII range

RTL functions that work in the ASCII range are compatible. They even work faster with the UTF-8 RTL, whereas all of Delphi's string functions became slower after the switch to UTF-16.

RTL Ansi...() Unicode functions

The RTL Ansi...() functions that work with codepages or Unicode (e.g. AnsiUpperCase) are compatible.

Reading individual codepoints

This is not compatible, but it is very easy to make source code work with both encodings.

Delphi has functions such as NextCharIndex, IsHighSurrogate and IsLowSurrogate for dealing with UTF-16 surrogate pairs, where one codepoint consists of two UnicodeChars(*) (WideChar, Word, 2 bytes). However those functions are not used much in example code and tutorials. Most tutorials say that the Copy() function works just as it did in Delphi versions before D2009. No: a codepoint can now be 2 UnicodeChars(*), and Copy() may return half of it.

UTF-8 has an advantage here: the code must always be written correctly, because multi-byte codepoints are so common.

See the section Dealing with UTF8 strings and characters in code below for examples of how to use UTF-8 and how to write code that works with both encodings.

(*)

  • The "UnicodeString" and "UnicodeChar" names for UTF-16 types were a very unfortunate choice by Borland.
  • A Unicode codepoint is a "real" character definition in Unicode, which can be encoded in different ways; its length depends on the encoding.
  • A Unicode character is either one codepoint or a decomposed character of multiple codepoints. Yes, this is complex ...
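To illustrate the byte/codepoint difference, a small sketch using the UTF8...() functions from LazUTF8 (LazUtils); it assumes a UTF-8 encoded source file:

```pascal
uses LazUTF8; // from LazUtils

procedure CodepointDemo;
var
  s: string;
begin
  s := 'αβγ';                  // 3 codepoints, 6 bytes in UTF-8
  WriteLn(Length(s));          // Length counts bytes: 6
  WriteLn(UTF8Length(s));      // UTF8Length counts codepoints: 3
  WriteLn(Copy(s, 1, 3));      // byte-based: 'α' plus half of 'β' (garbage)
  WriteLn(UTF8Copy(s, 1, 2));  // codepoint-based: 'αβ'
end;
```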

Calling Windows API

Only the "W" versions of Windows API functions should be called. This is like Delphi, except that you must assign strings to/from API calls via UnicodeString variables or typecast with UnicodeString(). In Delphi the default string type already is UnicodeString, so nothing explicit is needed. Note: WideString is traditionally used with the WinAPI, but it is only needed for COM/OLE programming, where the OS takes care of memory management. Use UnicodeString instead.
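A minimal sketch, assuming a Windows target; ShowUtf8Message is a hypothetical helper name:

```pascal
uses Windows;

procedure ShowUtf8Message(const Utf8Text, Utf8Caption: string);
var
  WText, WCaption: UnicodeString;
begin
  // FPC 3+ converts UTF-8 -> UTF-16 automatically on these assignments
  WText := Utf8Text;
  WCaption := Utf8Caption;
  // Call the "W" variant, never MessageBoxA
  MessageBoxW(0, PWideChar(WText), PWideChar(WCaption), MB_OK);
end;
```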

Compatibility with LCL in Lazarus 1.x

Many Lazarus LCL applications will continue to work without changes. However the handling of Unicode has become simpler and it makes sense to clean up the code. Code that reads or writes data from/to streams, files or DBs with a non-UTF-8 encoding breaks and must be changed (see below for examples).

Explicit conversion functions are only needed when calling Windows Ansi functions. Otherwise FPC takes care of converting encodings automatically. Empty conversion functions are provided to make your old code compile.

  • UTF8Decode, UTF8Encode - Almost all can be removed.
  • UTF8ToAnsi, AnsiToUTF8 - Almost all can be removed.
  • UTF8ToSys, SysToUTF8 - All can be removed. They are now dummy no-ops and only return their parameter.

File functions in the RTL now take care of file name encoding. All (?) file name related ...UTF8() functions can be replaced with the Delphi compatible functions without the UTF8 suffix. For example FileExistsUTF8 can be replaced with FileExists.

Most UTF8...() string functions can be replaced with the Delphi compatible Ansi...() functions. The UTF8...() functions in LazUTF8 are registered as callback functions for the Ansi...() functions in SysUtils.

UTF-8 works in non-GUI programs, too. It only requires a dependency on LazUtils and placing the LazUTF8 unit into the uses section of the main program file.

Reading / writing text file with Windows codepage

This is not compatible with former Lazarus code. In practice you must encapsulate the code dealing with the system codepage and convert the data to UTF-8 as early as possible.

Use RawByteString and do an explicit conversion

 uses ... , LConvEncoding;
 ...
 var
   StrIn: RawByteString;
   StrOut: String;
 ...
 StrOut := CP1252ToUTF8(StrIn,true);  // Uses fixed codepage
 // or
 StrOut := WinCPToUTF8(StrIn,true);  // Uses system codepage in this particular computer

Set the right codepage for an existing string

 var
   StrIn, StrOut: String;
 ...
   SetCodePage(RawByteString(StrIn), 1252, false);  // Fixed 1252 (or Windows.GetACP())

Note: there must be some text in the string variable. An empty string is actually a Nil pointer and you cannot set its codepage.

Windows.GetACP() returns the Windows system codepage.

ToDo ...

Code that depends very much on Windows codepage

Sometimes program code depends so much on the system codepage that using the new UTF-8 mode is not practical. There are two choices then:

  • Continue using Lazarus with FPC 2.6.4. This is a good solution for code that is in maintenance mode. Lazarus can still be compiled with FPC 2.6.4 for some time to come and the old UTF8...() functions will be there.
  • Use FPC 3.0 without the new UTF-8 mode by defining DisableUTF8RTL. This causes some nasty problems which are explained here : Lazarus with FPC3.0 without UTF-8 mode.

Calling Windows API

See the Delphi compatibility section above. Use only the "W" versions of the Windows API and assign to/from UnicodeString. The old Ansi Windows API functions are not recommended; this is a fully Unicode aware system after all.

With earlier versions, the Ansi Windows API functions were often called and the string data was converted explicitly to/from UTF-8.

Helper functions for CodePoints

LazUtils will have special functions for dealing with codepoints. For now they use the old UTF8...() functions of the LCL, but they can be made aliases for functions using another encoding in Delphi and in FPC's {$mode DelphiUnicode}.

  • CodePointCopy() - Like UTF8Copy()
  • CodePointLength() - Like UTF8Length()
  • CodePointPos() - Like UTF8Pos()
  • CodePointToWinCP() - Like UTF8ToWinCP()
  • WinCPToCodePoint() - Like WinCPToUTF8()
  • CodePointByteCount() - Like UTF8CharacterLength()

An interesting question is how CodePointCopy, CodePointLength and CodePointPos should be implemented in Delphi, which does not provide such functions for UTF-16. (Or does it?) Practically all Delphi code uses plain Copy, Length and Pos where codepoint aware functions should be used.

Unicode characters and codepoints in code

See details for UTF-8 in UTF8_strings_and_characters.

CodePoint functions for encoding agnostic code

LazUtils package has unit LazUnicode with special functions for dealing with codepoints, regardless of encoding. They use the UTF8...() functions from LazUTF8 when used in the UTF-8 mode, and UTF16...() functions from LazUTF16 when used in FPC's {$ModeSwitch UnicodeStrings} or in Delphi (yes, Delphi is supported!).

Currently the {$ModeSwitch UnicodeStrings} variant can be tested by defining "UseUTF16". There is also a test program LazUnicodeTest in the components/lazutils/test directory. It has two build modes, UTF8 and UTF16, for easy testing. The test program also supports Delphi, in which case the UTF-16 mode is obviously used.

LazUnicode allows one source code to work between :

  • Lazarus with its UTF-8 solution.
  • Future FPC and Lazarus with Delphi compatible UTF-16 solution.
  • Delphi, where String = UnicodeString.

It provides these encoding agnostic functions:

  • CodePointCopy() - Like UTF8Copy()
  • CodePointLength() - Like UTF8Length()
  • CodePointPos() - Like UTF8Pos()
  • CodePointSize() - Like UTF8CharacterLength()
  • UnicodeToWinCP() - Like UTF8ToWinCP()
  • WinCPToUnicode() - Like WinCPToUTF8()

It also provides an enumerator for CodePoints which the compiler uses for its for-in loop. As a result, regardless of encoding, this code works:

 var s, ch: String;
 ...
 for ch in s do
   writeln('ch=',ch);

Delphi does not provide similar functions for codepoints in its UTF-16 solution. Practically most Delphi code treats UTF-16 as a fixed-width encoding, which has led to lots of broken UTF-16 code out there. This means using LazUnicode also improves code quality in Delphi!

Both units LazUnicode and LazUTF16 are needed for Delphi usage.


String Literals

Sources should be saved in UTF-8 encoding. Lazarus creates such files by default. You can change the encoding of imported files via right click in source editor / File Settings / Encoding.

Usually {$codepage utf8} / -FcUTF8 is not needed. This is rather counter-intuitive because the meaning of that flag is to treat string literals as UTF-8. However the new UTF-8 mode switches the encoding at run-time, yet constants are evaluated at compile-time.

So, without -FcUTF8 the compiler (wrongly) thinks the constant string is encoded in the system codepage. Then it sees a String variable with the default encoding (which will be changed to UTF-8 at run time, but the compiler does not know that). Thus: same default encodings, no conversion needed; the compiler happily copies the characters and everything goes right, while it was actually fooled twice in the process.

Example:

As a rule of thumb: use the "String" type, and assigning literals works. Note: a UTF-8 string can also be composed from byte values, as done for s2:

const s1: string = 'äй';
const s2: string = #$C3#$A4; // ä

Assigning a string literal to other string types than plain "String" is more tricky. See the tables for what works and what doesn't.

Assign string literals to different string types

Here working means correct codepage and correct codepoints. Codepage 0 and codepage 65001 are both correct; they both mean UTF-8.

Without {$codepage utf8} or compilerswitch -FcUTF8

String Type (UTF-8 source) | Example const (in source) | Const (in source) | Assigned to String | Assigned to UTF8String | Assigned to UnicodeString | Assigned to CP1252String | Assigned to RawByteString | Assigned to ShortString | Assigned to PChar
const | const s = 'äöü'; | working | working | wrong | wrong | wrong | working | working | working
String | const s: String = 'äöü'; | working | working | working | working | working | working | working | working
ShortString | const s: String[15] = 'äöü'; | working | working | working | working | wrong encoding | working | working | not available
UTF8String | const s: UTF8String = 'äöü'; | wrong | wrong | wrong | wrong | wrong | wrong | wrong | wrong
UnicodeString | const s: UnicodeString = 'äöü'; | wrong | wrong | wrong | wrong | wrong | wrong | wrong | wrong
String with declared codepage | type CP1252String = type AnsiString(1252); | wrong | wrong | wrong | wrong | wrong | wrong | wrong | wrong
RawByteString | const s: RawByteString = 'äöü'; | working | working | working | working | changed to codepage 0 | working | working | working
PChar | const c: PChar = 'äöü'; | working | working | working | working | wrong | working | working | working

With {$codepage utf8} or compilerswitch -FcUTF8

String Type (UTF-8 source) | Example const (in source) | Const (in source) | Assigned to String | Assigned to UTF8String | Assigned to UnicodeString | Assigned to CP1252String | Assigned to RawByteString | Assigned to ShortString | Assigned to PChar
const | const s = 'äöü'; | UTF-16 encoded | working | working | working | working | working | working | working
String | const s: String = 'äöü'; | working | working | working | working | working | working | working | working
ShortString | const s: String[15] = 'äöü'; | wrong | wrong | wrong | wrong | wrong | wrong | wrong | not available
UTF8String | const s: UTF8String = 'äöü'; | working | working | working | working | working | working | working | working
UnicodeString | const s: UnicodeString = 'äöü'; | working | working | working | working | working | working | working | wrong
String with declared codepage | type CP1252String = type AnsiString(1252); | working | working | working | working | working | working | wrong | wrong
RawByteString | const s: RawByteString = 'äöü'; | working | working | working | working | changed to codepage 0 | working | working | working
PChar | const c: PChar = 'äöü'; | wrong | wrong | wrong | wrong | wrong | wrong | wrong | wrong

Remember, assignment between variables of different string types always works thanks to their dynamic encoding in FPC 3+. The data is converted automatically when needed.
Only string literals are a challenge.


Coming from older Lazarus + LCL versions

Earlier (before 1.6.0) LCL supported Unicode with dedicated UTF8 functions. The code was not at all compatible with Delphi.

Now many old LCL applications continue to work without changes. However it makes sense to clean up the code, making it simpler and more Delphi compatible. Code that reads/writes data in the Windows system codepage breaks and must be changed (see #Reading / writing text file with Windows codepage).

Explicit conversion functions are only needed when calling Windows Ansi functions. Otherwise FPC takes care of converting encodings automatically. Empty conversion functions are provided to make your old code compile.

  • UTF8Decode, UTF8Encode - Almost all can be removed.
  • UTF8ToAnsi, AnsiToUTF8 - Almost all can be removed.
  • UTF8ToSys, SysToUTF8 - All can be removed. They are now dummy no-ops and only return their parameter.

File functions in the RTL now take care of file name encoding. All (?) file name related ...UTF8() functions can be replaced with the Delphi compatible functions without the UTF8 suffix. For example FileExistsUTF8 can be replaced with FileExists.

Most UTF8...() string functions can be replaced with the Delphi compatible Ansi...() functions. For example UTF8UpperCase() -> AnsiUpperCase().

Now Unicode works in non-GUI programs, too. It only requires a dependency on LazUtils and placing the LazUTF8 unit into the uses section of the main program file.

For historical reference, this was the old Unicode support in LCL: Old LCL Unicode Support

Technical implementation

What actually happens in the Unicode system? These 2 FPC functions are called in an early initialization section, setting the default String encoding in FPC to UTF-8 :

 SetMultiByteConversionCodePage(CP_UTF8);
 SetMultiByteRTLFileSystemCodePage(CP_UTF8);

Under Windows the UTF8...() functions in LazUTF8 (LazUtils) are set as backends for RTL's Ansi...() string functions. Thus those functions work in a Delphi compatible way.


Open issues

  • TFormatSettings char fields (bug 27086): for example ThousandSeparator, DecimalSeparator, DateSeparator, TimeSeparator, ListSeparator. These should be changed to string to support UTF-8. For example under Linux with LC_NUMERIC=ru_RU.utf8 the thousand separator is the two-byte nbsp (#160).
    • Workaround: use single space characters instead, as done in the patch here: 27099

WinAPI function calls in FPC libs

  • Unit registry, TRegistry - this unit uses the Windows Ansi functions, therefore you need to use UTF8ToWinCP and WinCPToUTF8. Formerly it needed UTF8ToSys.
  • All Windows API Ansi function calls in FPC's libraries must be replaced with the "W" function version. This must be done in any case for the future UTF-16 support, thus there is no conflict of interest here.
  • TProcess - under Windows, TProcess in FPC 3.0 only supports the system codepage. Use either TProcessUTF8 from unit utf8process, or use FPC trunk where this is fixed, see issue 29136

ToDo: List all related FPC bug tracker issues and patches that should be applied.


Future

The goal of the FPC project is to create a Delphi compatible UnicodeString (UTF-16) based solution, but it is not ready yet. It may take some time.

This UTF-8 solution of the LCL in its current form can be considered temporary. In the future, when FPC supports UnicodeString fully in the RTL and FCL, the Lazarus project will provide an LCL solution that uses it. At the same time the goal is to preserve UTF-8 support, although that may require changes to string types. Nobody knows the details yet. We will tell when we know...

In essence LCL will probably have 2 versions, one for UTF-8 and one for UTF-16.


FAQ

What about Mode DelphiUnicode?

The {$mode delphiunicode} was added in FPC 2.7.1 and is like {$Mode Delphi} with {$ModeSwitch UnicodeStrings}. See the next question about ModeSwitch UnicodeStrings.

What about ModeSwitch UnicodeStrings?

The {$ModeSwitch UnicodeStrings} was added in FPC 2.7.1 and defines "String" as "UnicodeString" (UTF-16), "Char" as "WideChar", "PChar" as "PWideChar" and so forth. This affects only the current unit. Other units, including those used by this unit, have their own "String" definition. Many RTL strings and types (e.g. TStringList) use 8-bit strings, which requires conversions from/to UnicodeString; these are added automatically by the compiler. The LCL uses UTF-8 strings. It is recommended to use UTF-8 sources, with or without "-FcUTF8".

Why not use UTF8String in Lazarus?

Short answer: Because the FCL does not use it.

Long answer: UTF8String is defined in the system unit as

UTF8String = type AnsiString(CP_UTF8);

The compiler always assumes it has UTF-8 encoding (CP_UTF8), which is a multi-byte encoding (i.e. 1-4 bytes per codepoint). Note that the [] operator accesses bytes, not characters or codepoints. The same is true for UnicodeString, but with words instead of bytes. On the other hand a String is assumed at compile time to have DefaultSystemCodePage (CP_ACP). DefaultSystemCodePage is only defined at run time, so the compiler conservatively assumes that String and UTF8String have different encodings. When you assign or combine String and UTF8String, the compiler inserts conversion code. The same goes for ShortString and UTF8String.

Lazarus uses the FCL, which uses String, so using UTF8String would add conversions. If DefaultSystemCodePage is not UTF-8, you lose characters. If it is UTF-8, there is no point in using UTF8String.

UTF8String becomes useful when eventually there is an UTF-16 FCL.
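A minimal sketch of the compile-time view described above (variable names are illustrative):

```pascal
program MixDemo;
{$mode objfpc}{$H+}
var
  s: string;      // compile-time codepage CP_ACP, resolved only at run time
  u: UTF8String;  // compile-time codepage CP_UTF8
begin
  u := 'abc';
  s := u;  // the compiler inserts a codepage check / possible conversion here
  u := s;  // and here, even when both actually hold UTF-8 at run time
end.
```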

Why does UTF8String show strange characters while String works

For example:

var s: UTF8String = 'ä';  // with default flags (i.e. no -Fc) this creates garbage,
                          // even on a UTF-8 Linux system

Question: Is it a bug? Answer: No, because it works as documented.

To produce the same result on every system, FPC ignores the LANG variable. For historical reasons and Delphi compatibility it uses ISO-8859-1 as the default. Lazarus prefers UTF-8.

  • UTF-8 sources work with String, because
    1. FPC does not add conversion code for normal String literals by default.
    2. The source codepage is equal to the runtime codepage. On Windows LazUTF8 sets it to CP_UTF8.
  • UTF8String requires UTF-8 sources. Since FPC 3.0, for UTF8String you must tell the compiler that the source is UTF-8 (-FcUTF8, {$codepage UTF8}, or save the file as UTF-8 with BOM).
    • Note: If you tell the compiler that the source encoding is UTF-8, it changes all non-ASCII string literals of the unit to UTF-16, increasing the size of the binary and adding some overhead, and PChar on literals then requires an explicit conversion. That's why Lazarus does not add it by default.

What happens when I use $codepage utf8?

FPC has very limited UTF-8 support. In fact, FPC only supports storing literals either as "default" encoded 8-bit strings or as widestrings. Any non-default codepage is converted to widestring, even if it is the system codepage. For example most Linux/Mac/BSD systems use UTF-8 as the system codepage; passing -Fcutf8 to the compiler will still store the string literals as widestrings.

At run time the widestring literal is converted. When you assign the literal to an AnsiString, it is converted to the system encoding by the widestringmanager. The default widestringmanager under Unix simply converts the widechars to chars, destroying every non-ASCII character. You must use a real widestringmanager such as the one in unit cwstring to get a correct conversion. Unit LazUTF8 does that.
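A minimal sketch, assuming a Unix system; the literal should survive only because cwstring installs a proper widestringmanager:

```pascal
program WideLiteral;
{$mode objfpc}{$H+}
{$codepage utf8}  // the literal below is stored as a widestring
uses
  {$IFDEF UNIX}cwstring,{$ENDIF} // installs a real widestring manager on Unix
  SysUtils;
var
  a: AnsiString;
begin
  a := 'äöü';  // converted at run time by the widestringmanager
  WriteLn(a);
end.
```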


See also

Information about Unicode, codepages, string types and the RTL in FPC can be found here.
Note: Not all of this information applies to the UTF-8 system of Lazarus, because from FPC's point of view changing the default codepage is a hack.
FPC Unicode support/ja