Difference between revisions of "LCL Unicode Support/de"

Revision as of 21:56, 31 July 2007


Introduction

Lazarus support of the Unicode standard needs further development, mostly in regard to the Windows platform. Here is some basic information for those who would like to further develop the Lazarus Unicode support. Please correct, extend and update this page.

It will help if you have already heard of the Unicode standard and perhaps have some experience with WideStrings under Delphi. Previous use of non-(western) Latin scripts and their various character sets will help too.

Note: Implementation details are still being discussed, and the contents of this document may change.

Implementation guidelines

Requirements

The spirit of Lazarus is: "Write once, compile everywhere." This means that, ideally, a Unicode-enabled application should have only one Unicode-supporting source code version, without any conditional defines for the various target platforms.

The "interface" part of the LCL should support Unicode on the target platforms that support it themselves, while concealing all platform peculiarities from the application programmer.

As far as Lazarus is concerned, the internal string communication at the boundaries "Application code <--> LCL" and "LCL <--> Widgetsets" is based on classical (byte-oriented) strings. Logically, their contents should be encoded in UTF-8.


Migration to Unicode

Most existing Lazarus applications use ANSI encodings, because that is the default for the Gtk1 and win32 interfaces today. This will change in the future and all widgetsets will support UTF-8, so all applications that pass strings directly to the interface (whether written in code or in the Object Inspector) will need to be converted to UTF-8.
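To give an idea of what such a conversion involves at the byte level, here is a minimal sketch that re-encodes an ISO-8859-1 string as UTF-8. It assumes the source text is ISO-8859-1 specifically (other ANSI code pages would need a lookup table), and the name Latin1ToUtf8 is hypothetical, not an LCL routine.

```pascal
program Latin1ToUtf8Demo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: re-encodes ISO-8859-1 bytes as UTF-8.
// Bytes below 80h are copied unchanged; every other byte becomes two bytes.
function Latin1ToUtf8(const S: AnsiString): AnsiString;
var
  i, n: Integer;
  B: Byte;
begin
  Result := '';
  SetLength(Result, 2 * Length(S));  // worst case: every byte doubles
  n := 0;
  for i := 1 to Length(S) do
  begin
    B := Ord(S[i]);
    if B < $80 then
    begin
      Inc(n);
      Result[n] := Chr(B);
    end
    else
    begin
      Inc(n);
      Result[n] := Chr($C0 or (B shr 6));    // lead byte: C2h or C3h
      Inc(n);
      Result[n] := Chr($80 or (B and $3F));  // continuation byte: 80h to BFh
    end;
  end;
  SetLength(Result, n);
end;

var
  Latin1Text, Utf8Text: AnsiString;
  i: Integer;
begin
  Latin1Text := 'cafe';
  Latin1Text[4] := Chr($E9);         // E9h is "e acute" in ISO-8859-1
  Utf8Text := Latin1ToUtf8(Latin1Text);
  for i := 1 to Length(Utf8Text) do
    Write(IntToHex(Ord(Utf8Text[i]), 2), ' ');
  WriteLn;                           // prints: 63 61 66 C3 A9
end.
```

Pure ASCII strings pass through unchanged, which is why only applications that actually use non-ASCII characters are affected by the migration.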

When people develop software for widgetsets that are not yet fully working, such as Gtk 2, Qt, WinCE (and, in the future, Win32U), they use an IDE compiled for a more stable widgetset, such as Gtk or win32. To avoid inconsistencies (such as passing ISO characters to a UTF-8 widgetset), it is necessary to use an IDE that works in the same encoding as the target widgetset. This means that a stable UTF-8 IDE is needed before the migration to Unicode can be completed.


Currently we have various groups of widgetsets, according to the encoding:

  • Interfaces that use ANSI encoding: win32 and gtk (1) interfaces.
  • Interfaces that use UTF-8 encoding: gtk (1), gtk2, qt, fpGUI, carbon
  • Interfaces that currently use ANSI encoding, but need migration to UTF-8: win32, wince


Notice that gtk 1 is in both the ANSI and the UTF-8 group. That is because on Gtk 1 the encoding is controlled by an environment variable.

As Lazarus is today, existing software will work if recompiled for the win32, wince or gtk interfaces, but will face encoding issues when compiled for any other widgetset. New software that uses UTF-8 will work when recompiled for any of the widgetsets in the Unicode group.

One very important note is that you must use an IDE compiled for the same group you are targeting. This is because the IDE writes LFM and LRS files in the encoding of the widgetset it was compiled for, not in that of the target widgetset.

Roadmap

Now that we have guidelines, it is time to create a roadmap and put it into practice. For this, the following realistic plan was created. The plan splits the tasks into two groups: primary tasks and secondary tasks.

All primary tasks must be fully implemented before we can say Lazarus is Unicode enabled, and as such they will be the main focus of our effort.

Secondary tasks are desirable, but will not be implemented unless someone volunteers for them or posts a bounty for them.


Primary tasks

Make Win32 Widgetset support UTF-8

Notes: In this step we will target all 32-bit Windows versions at the same time. All code produced in this effort will be isolated from the current win32 interface by IFDEFs, to avoid introducing bugs in this primary interface. After the transition period, the IFDEFs will be removed and only the Unicode code will remain.

Status: Partially implemented


Update Gtk 2 keyboard functions so they work with UTF-8

Notes:

Status: Almost complete. Some pre-editing features of gtk2 are not yet supported in custom controls. It is not yet known which languages need them.


Make sure the Lazarus IDE runs correctly with Win32 Unicode widgetset and supports UTF-8

Notes:

Status: Complete, except for the character map, which still shows only 255 characters. All modern operating systems provide good Unicode character maps anyway.


Make sure the Lazarus IDE runs correctly with Gtk 2 widgetset and supports UTF-8

Notes:

Status: Complete. There are gtk2 interface bugs, but they have nothing to do with UTF-8.

Secondary tasks

Update Windows CE widgetset so it uses UTF-8

Notes: The string conversion routines are concentrated in the winceproc.pp file. Many tests are needed.

Status: Not implemented


Update Gtk 1 keyboard functions so they work with UTF-8

Notes:

Status: Not implemented


Complete RTL in synedit

Notes: RTL means right to left, as used, for example, by Arabic.

Status: Not implemented.

Unicode essentials

The Unicode standard maps integers from 0 to 10FFFF(h) to characters. Each such integer is called a code point. In other words, Unicode characters are in principle defined for code points from U+000000 to U+10FFFF (0 to 1 114 111).

There are three schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. Conversion between any two of them is possible. Here are their basic properties:

                           UTF-8 UTF-16 UTF-32
Smallest code point [hex] 000000 000000 000000
Largest code point  [hex] 10FFFF 10FFFF 10FFFF
Code unit size [bits]          8     16     32
Minimal bytes/character        1      2      4
Maximal bytes/character        4      4      4

UTF-8 has several important and useful properties:

  • It is interpreted as a sequence of bytes, so the concept of low- and high-order byte does not exist.
  • Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). Files and strings that contain only 7-bit ASCII characters therefore have the same encoding under both ASCII and UTF-8.
  • All characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
  • No byte sequence of one character is contained within a longer byte sequence of another character. This allows easy search for substrings.
  • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range C2h to F4h and indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.
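These byte ranges are enough to walk through a UTF-8 string without decoding it. The following is a minimal sketch, assuming well-formed input and the current definition of UTF-8 (code points up to U+10FFFF, so lead bytes fall between C2h and F4h). The function names are made up for this illustration and are not LCL routines.

```pascal
program Utf8ScanDemo;
{$mode objfpc}{$H+}

// Number of bytes in the sequence started by LeadByte, or 0 if the byte
// is a continuation byte or cannot start a sequence at all.
function Utf8SequenceLength(LeadByte: Byte): Integer;
begin
  if LeadByte <= $7F then
    Result := 1
  else if (LeadByte >= $C2) and (LeadByte <= $DF) then
    Result := 2
  else if (LeadByte >= $E0) and (LeadByte <= $EF) then
    Result := 3
  else if (LeadByte >= $F0) and (LeadByte <= $F4) then
    Result := 4
  else
    Result := 0;
end;

// Counts code points by skipping continuation bytes (80h to BFh).
// Assumes S holds well-formed UTF-8.
function Utf8CodePointCount(const S: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

const
  // "a", U+00E9 and U+20AC encoded as UTF-8
  SampleBytes: array[0..5] of Byte = ($61, $C3, $A9, $E2, $82, $AC);
var
  Sample: AnsiString;
  i: Integer;
begin
  Sample := '';
  SetLength(Sample, Length(SampleBytes));
  for i := 0 to High(SampleBytes) do
    Sample[i + 1] := Chr(SampleBytes[i]);
  WriteLn(Utf8SequenceLength($E2));     // prints 3
  WriteLn(Utf8CodePointCount(Sample));  // prints 3 (six bytes, three characters)
end.
```

Note that Length(Sample) is 6 while the string holds 3 characters, which is the main thing application code has to keep in mind once its strings are UTF-8.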

UTF-16 has the following most important properties: It uses a single 16-bit word to encode characters from U+0000 to U+D7FF and from U+E000 to U+FFFF, and a pair of 16-bit words (a surrogate pair) to encode the remaining Unicode characters, U+10000 to U+10FFFF.

Finally, any Unicode character can be represented as a single 32-bit unit in UTF-32.
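As an illustration of how the 16-bit pairs relate to the 32-bit value, here is a small sketch that splits a code point above U+FFFF into its UTF-16 surrogate pair. The procedure name ToSurrogatePair is hypothetical, and the input is assumed to lie between U+10000 and U+10FFFF.

```pascal
program SurrogatePairDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Splits a code point in the range U+10000..U+10FFFF into the two
// 16-bit words UTF-16 uses for it. No range checking is done here.
procedure ToSurrogatePair(CodePoint: Cardinal; out High16, Low16: Word);
var
  Offset: Cardinal;
begin
  Offset := CodePoint - $10000;               // 20 significant bits remain
  High16 := Word($D800 + (Offset shr 10));    // upper 10 bits
  Low16 := Word($DC00 + (Offset and $3FF));   // lower 10 bits
end;

var
  H, L: Word;
begin
  ToSurrogatePair($1D11E, H, L);              // U+1D11E, musical symbol G clef
  WriteLn(IntToHex(H, 4), ' ', IntToHex(L, 4));  // prints: D834 DD1E
end.
```

In UTF-32 the same character is simply the single unit 0001D11Eh.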

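For more, see:

  • Unicode FAQ - Basic questions: http://www.unicode.org/faq/basic_q.html
  • Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM: http://www.unicode.org/faq/utf_bom.html
  • Wikipedia: UTF-8: http://en.wikipedia.org/wiki/UTF-8
  • Wikipedia: ISO-8859: http://en.wikipedia.org/wiki/ISO-8859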

Lazarus component library architecture essentials

The LCL consists of two parts:

  1. A target platform independent part, which implements a class hierarchy analogous to Delphi VCL;
  2. "Interfaces" - a part that implements the interface to APIs of each target platform.

The two parts communicate through an abstract class, TWidgetset. Each widgetset is implemented by its own class derived from TWidgetset.
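The shape of this arrangement can be sketched as follows. This is only an illustration of the pattern: the class and method names (TSketchWidgetset, NativeEncoding and so on) are invented for the example and do not reflect the real TWidgetset declaration.

```pascal
program WidgetsetSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for the abstract TWidgetset class.
  TSketchWidgetset = class
  public
    function NativeEncoding: string; virtual; abstract;
  end;

  // One derived class per target platform interface.
  TSketchGtk2Widgetset = class(TSketchWidgetset)
  public
    function NativeEncoding: string; override;
  end;

  TSketchQtWidgetset = class(TSketchWidgetset)
  public
    function NativeEncoding: string; override;
  end;

function TSketchGtk2Widgetset.NativeEncoding: string;
begin
  Result := 'UTF-8';
end;

function TSketchQtWidgetset.NativeEncoding: string;
begin
  Result := 'UTF-16';
end;

// Platform independent code sees only the abstract class.
procedure Describe(Widgetset: TSketchWidgetset);
begin
  WriteLn(Widgetset.ClassName, ' -> ', Widgetset.NativeEncoding);
end;

var
  Widgetset: TSketchWidgetset;
begin
  Widgetset := TSketchQtWidgetset.Create;
  try
    Describe(Widgetset);   // prints: TSketchQtWidgetset -> UTF-16
  finally
    Widgetset.Free;
  end;
end.
```

Because the platform independent part only ever talks to the abstract class, each interface is free to convert strings to whatever encoding its native API expects.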

The GTK 1 widgetset is the oldest. In this widgetset the string encoding is determined by the LANG environment variable, which usually selects one of the ISO-8859-n group of single-byte encodings. Recently (as of Mandriva 2007, for example), many distributions have been shipping Gtk 1 configured for UTF-8. Our Gtk 1 interface lacks proper support for UTF-8 in the keyboard handling routines, so this is a big problem that increases the need for Lazarus to implement cross-platform Unicode support.

Gtk2 widgetset only works with UTF-8 encoding and supports UTF-8 completely.

The win32 interface is set up with ANSI widgets. UTF-8 support has been started, but it is not yet complete and is therefore disabled by default. So it is currently not possible to use Unicode with win32 in a default build.

The Qt interface is prepared for UTF-8. Qt itself uses UTF-16 as its native encoding, but the Lazarus interface for Qt converts from UTF-8 to UTF-16.

Windows CE supports only UTF-16 as its character encoding, but our interface for it currently converts strings from ISO to UTF-16 before calling the Windows API. This is very easy to fix, as all the conversion code is concentrated in a few routines in the winceproc.pp file.
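The UTF-8 to UTF-16 conversion that such interfaces perform at the API boundary can be tried out with the UTF8Encode and UTF8Decode functions of the Free Pascal RTL. The sketch below only demonstrates the round trip; it is not taken from the Qt or Windows CE interface code, whose routines may look different.

```pascal
program Utf8WideRoundTrip;
{$mode objfpc}{$H+}

var
  WideText, Back: WideString;
  Utf8Text: UTF8String;
begin
  // Build "Gr", U+00FC, U+00DF, "e" directly as UTF-16 code units.
  WideText := 'Gr';
  WideText := WideText + WideChar($00FC) + WideChar($00DF) + 'e';

  Utf8Text := UTF8Encode(WideText);  // what the LCL side would hold
  Back := UTF8Decode(Utf8Text);      // what a UTF-16 API would receive

  WriteLn(Length(WideText));  // prints 5 (UTF-16 code units)
  WriteLn(Length(Utf8Text));  // prints 7 (bytes: two characters take 2 bytes each)
  WriteLn(Back = WideText);   // prints TRUE
end.
```

The differing lengths show why a length computed on the UTF-8 string must not be passed to an API that expects a count of UTF-16 units.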

For more, see: Internals of the LCL

Unicode-enabling the win32 interface

Compiling LCL-Win32 with Unicode

To enable Unicode in the LCL for Windows, go to the menu "Tools" --> "Configure Build Lazarus".

Put -dWindowsUnicodeSupport in the "Options" field. Set all targets to NONE and only LCL to Clean+Build. Select win32 as the target widgetset. Click on "Build".

Now you can recompile your existing applications and they will have Unicode mode enabled. Note that at the moment only a few parts of the software are really Unicode enabled, and you may find bugs in those parts.

Guidelines

First, and most importantly, all Unicode patches for the Win32 interface must be enclosed in IFDEF WindowsUnicodeSupport, to avoid breaking the existing ANSI interface. After this stabilizes, all IFDEFs will be removed and only the Unicode part will remain. At that point all existing programs that use ANSI characters will need to be migrated to Unicode.

Windows platforms up to and including Win9x are based on code pages (ANSI) and only partially support Unicode. Windows platforms starting with WinNT, as well as Windows CE, fully support Unicode. Win 9x and NT offer two parallel sets of API functions: the old ANSI-enabled *A functions and the new, Unicode-enabled *W functions. The *W functions accept wide strings, i.e. UTF-16 encoded strings, as parameters. Windows CE uses only the Wide API functions.

Wide functions present on Windows 9x

Some Wide API functions are present on Windows 9x. Here is a list of such functions: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mslu/winprog/other_existing_unicode_support.asp

Conversion example:

  GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);

Becomes:

  {$ifdef WindowsUnicodeSupport}
    // WideCaption is a local WideString that holds the UTF-16 form of the
    // caption, so the length passed is counted in UTF-16 units, not bytes.
    WideCaption := Utf8Decode(ButtonCaption);
    GetTextExtentPoint32W(hdcNewBitmap, PWideChar(WideCaption), Length(WideCaption), TextSize);
  {$else}
    GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);
  {$endif}

Functions that need ANSI and Wide versions

First conversion example:

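<delphi>
function TGDIWindow.GetTitle: String;
var
  l: Integer;
begin
  l := Windows.GetWindowTextLength(Handle);
  SetLength(Result, l + 1);  // GetWindowText counts the terminating #0 in its size argument
  l := Windows.GetWindowText(Handle, @Result[1], l + 1);
  SetLength(Result, l);      // keep only the characters actually copied
end;
</delphi>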

Becomes:

<delphi>
function TGDIWindow.GetTitle: String;
var
  l: Integer;
  AnsiBuffer: string;
  WideBuffer: WideString;
begin
{$ifdef WindowsUnicodeSupport}
  if UnicodeEnabledOS then
  begin
    // Wide API: read the title as UTF-16, then convert it to UTF-8.
    l := Windows.GetWindowTextLengthW(Handle);
    SetLength(WideBuffer, l + 1);  // room for the terminating #0
    l := Windows.GetWindowTextW(Handle, @WideBuffer[1], l + 1);
    SetLength(WideBuffer, l);      // keep only the characters actually copied
    Result := Utf8Encode(WideBuffer);
  end
  else
  begin
    // ANSI API (Win9x): read the title in the system code page, then convert it to UTF-8.
    l := Windows.GetWindowTextLength(Handle);
    SetLength(AnsiBuffer, l + 1);
    l := Windows.GetWindowText(Handle, @AnsiBuffer[1], l + 1);
    SetLength(AnsiBuffer, l);
    Result := AnsiToUtf8(AnsiBuffer);
  end;
{$else}
  // Existing ANSI interface: no conversion.
  l := Windows.GetWindowTextLength(Handle);
  SetLength(Result, l + 1);
  l := Windows.GetWindowText(Handle, @Result[1], l + 1);
  SetLength(Result, l);
{$endif}
end;
</delphi>

Roadmap

What should already be working with Unicode:

  • TForm, TButton, TLabel
  • Most controls
  • Menus
  • LCLIntf.ExtTextOut and most other text-related Windows API calls
  • TStrings based controls. Examples: TComboBox, TListBox, etc

Known problems with Unicode support:

  • SynEdit does not support RTL (right to left)
  • MessageBox does not seem to work with Unicode

List of units to be checked:

  • "win32callback.inc"
  • "win32def.pp"
  • "win32int.pp"
  • "win32lclintf.inc"
  • "win32lclintfh.inc"
  • "win32listsl.inc"
  • "win32listslh.inc"
  • "win32memostrings.inc"
  • "win32object.inc"
  • "win32proc.pp"
  • "win32winapi.inc"
  • "win32winapih.inc"
  • "win32wsactnlist.pp"
  • "win32wsarrow.pp"
  • "win32wsbuttons.pp"
  • "win32wscalendar.pp"
  • "win32wschecklst.pp"
  • "win32wsclistbox.pp"
  • "win32wscomctrls.pp"
  • "win32wscontrols.pp"
  • "win32wscustomlistview.inc"
  • "win32wsdbctrls.pp"
  • "win32wsdbgrids.pp"
  • "win32wsdialogs.pp"
  • "win32wsdirsel.pp" - Felipe
  • "win32wseditbtn.pp" - Felipe
  • "win32wsextctrls.pp" - Felipe
  • "win32wsextdlgs.pp" - Felipe
  • "win32wsfilectrl.pp" - Felipe
  • "win32wsforms.pp" - Felipe
  • "win32wsgrids.pp" - Felipe
  • "win32wsimglist.pp" - Felipe
  • "win32wsmaskedit.pp" - Felipe
  • "win32wsmenus.pp" - Felipe
  • "win32wspairsplitter.pp" - Felipe
  • "win32wsspin.pp" - Felipe
  • "win32wsstdctrls.pp" - Felipe
  • "win32wstoolwin.pp" - Felipe
  • "winext.pas" - Felipe

Screenshots

Lazarus Unicode Test.png

See also

  • UTF-8 - description of UTF-8 strings