Difference between revisions of "LCL Unicode Support/de"

Revision as of 10:56, 16 October 2006


Introduction

Lazarus support for the Unicode standard needs further development, mostly with regard to the Windows platform. Here is some basic information for those who would like to develop the Lazarus Unicode support further. Please correct, extend and update this page.

It will help if you have already heard of the Unicode standard and perhaps have some experience with WideStrings under Delphi. Previous use of non-(western)Latin scripts and their various character sets will help too.

Implementation guidelines

One big problem in implementing Unicode support in Lazarus is backward compatibility for existing Lazarus programs. Lazarus development today (pre-1.0 era) concentrates on stabilizing the win32 and gtk interfaces, neither of which is Unicode capable. When people develop software for widgetsets that are not yet fully working, such as Gtk 2, Qt, WinCE (and in the future Win32U), they use the IDE compiled for a more stable widgetset, such as Gtk or win32. This means that if we break compatibility for old software now, there will be a period of time (possibly a long one) when Lazarus works neither for the ISO group nor for Unicode, until we have interfaces stable enough for the IDE to fully support Unicode.

To avoid this problem, a smooth transition that keeps old software running is the best way to go. This can be done by dividing the widgetsets into two groups: one for widgetsets that support ANSI and another for widgetsets that support Unicode.

The ANSI group will consist of: win32 and gtk (1) interfaces.

The UNICODE group will consist of most other interfaces: win32u, gtk2, qt, carbon, wince

It is also possible to add Unicode support for the gtk 1 interface in the future.

With this division, existing software will work if recompiled for the win32 or gtk interfaces, and new software using UTF-8 will work when recompiled for any of the widgetsets in the Unicode group. This will satisfy both new and old users.

One very important note: you must use an IDE compiled for the same group you are targeting. This is because the IDE writes LFM and LRS files in the encoding of the widgetset it was compiled for, not in that of the target widgetset.
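
To see why the encoding group matters for stored form data, consider a caption containing non-ASCII characters. The following is only an illustrative sketch in Python: the caption text is a made-up example rather than the contents of a real LFM file, and ISO-8859-1 is assumed as the ANSI-group encoding. It writes the same string in both encodings and reads each back under the other assumption.

```python
# Illustrative only: a hypothetical caption as it might be stored in a form file.
caption = "Größe"

ansi_bytes = caption.encode("iso-8859-1")  # assumed output of an ANSI-group IDE
utf8_bytes = caption.encode("utf-8")       # assumed output of a Unicode-group IDE

# Reading UTF-8 bytes as ISO-8859-1 never fails, but the text comes out garbled.
misread = utf8_bytes.decode("iso-8859-1")

# Reading ISO-8859-1 bytes as UTF-8 fails here, because the lone bytes F6h and
# DFh do not form valid UTF-8 sequences.
try:
    ansi_bytes.decode("utf-8")
    ansi_is_valid_utf8 = True
except UnicodeDecodeError:
    ansi_is_valid_utf8 = False

print(len(ansi_bytes), len(utf8_bytes))
print(misread == caption)
print(ansi_is_valid_utf8)
```

In both directions the text is damaged or rejected, which is the reason for keeping the IDE and the target widgetset in the same group.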

Roadmap

Now that we have guidelines, it is time to create a roadmap and put it into practice. For this, the following realistic plan was created. The plan splits the tasks into two groups: primary tasks and secondary tasks.

All primary tasks must be fully implemented before we can say Lazarus is Unicode enabled, and as such they will be the main focus of our effort.

Secondary tasks are desirable, but will not be implemented unless someone volunteers for them or posts a bounty for them.


Primary tasks

Create a Unicode enabled widgetset that descends from Win32

Notes: In this step we will only target Windows NT and Windows 9x using the Unicode Layer, simply because most people today use NT already. Those who need Win 9x support without the Unicode Layer can volunteer for the secondary task of implementing it.

Status: Not implemented


Update Gtk 2 keyboard functions so they work with UTF-8

Notes:

Status: Not implemented


Make sure the Lazarus IDE runs correctly with the Win32U widgetset and supports UTF-8

Notes:

Status: Not implemented


Make sure the Lazarus IDE runs correctly with the Gtk 2 widgetset and supports UTF-8

Notes:

Status: Not implemented


Secondary tasks

Update Windows CE widgetset so it uses Utf-8

Notes: String conversion routines are concentrated in the winceproc.pp file. Many tests are needed.

Status: Not implemented


Implement support for Windows 9x on the Win32U widgetset

Notes: TNT Components for Delphi use some tricks that allow Win 9x support. We may use the same tricks, but a volunteer is needed to implement them.

Status: Not implemented


Update Gtk 1 keyboard functions so they work with Utf-8

Notes:

Status: Not implemented


Unicode essentials

The Unicode standard maps integers from 0 to 10FFFF(h) to characters. Each such integer is called a code point. In other words, Unicode characters are in principle defined for code points from U+000000 to U+10FFFF (0 to 1 114 111).

There are three schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. The conversions between all of them are possible. Here are their basic properties:

                           UTF-8 UTF-16 UTF-32
Smallest code point [hex] 000000 000000 000000
Largest code point  [hex] 10FFFF 10FFFF 10FFFF
Code unit size [bits]          8     16     32
Minimal bytes/character        1      2      4
Maximal bytes/character        4      4      4
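
The byte counts in the table can be checked with a short sketch. The sample characters below are an arbitrary illustrative choice covering one-, two-, three- and four-byte UTF-8 sequences; Python's built-in codecs are used, with the "-le" variants so that no byte order mark is added to the count.

```python
# Sample characters: U+0041, U+00E9, U+20AC and U+1D11E (illustrative choice).
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

for ch in samples:
    print(f"U+{ord(ch):06X}", encoded_sizes(ch))
```

UTF-8 grows from one to four bytes across the samples, UTF-16 uses two bytes for everything up to U+FFFF and four beyond it, and UTF-32 always uses four.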

UTF-8 has several important and useful properties: It is interpreted as a sequence of bytes, so the concept of low- and high-order byte does not exist. Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. All characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. No byte sequence of one character is contained within a longer byte sequence of another character. This allows easy search for substrings. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range C2h to F4h and indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.
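
The byte ranges make it possible to classify any single byte without context, which is what resynchronization relies on. The sketch below is a minimal illustration, not a validating decoder. It assumes the four-byte limit given in the table above, under which lead bytes fall in the range C2h to F4h; the function names are invented for this example.

```python
def classify(byte):
    """Classify one UTF-8 byte by its value alone."""
    if byte <= 0x7F:
        return "ascii"
    if byte <= 0xBF:
        return "continuation"   # 80h..BFh
    if 0xC2 <= byte <= 0xF4:
        return "lead"
    return "invalid"            # C0h, C1h and F5h..FFh never occur in UTF-8

def sequence_length(first):
    """Number of bytes in the sequence that starts with this byte."""
    if first <= 0x7F:
        return 1
    if 0xC2 <= first <= 0xDF:
        return 2
    if 0xE0 <= first <= 0xEF:
        return 3
    if 0xF0 <= first <= 0xF4:
        return 4
    raise ValueError("not a valid first byte")

def resync(data, pos):
    """Advance pos to the next character boundary at or after pos."""
    while pos < len(data) and classify(data[pos]) == "continuation":
        pos += 1
    return pos

data = "a\u00e9\u20ac".encode("utf-8")   # bytes 61 C3 A9 E2 82 AC
print([classify(b) for b in data])
print(resync(data, 2))                   # index 2 is inside the second character
```

Starting in the middle of a character, at most three bytes have to be skipped before a boundary is found.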

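UTF-16 has the following most important properties: It uses a single 16-bit word to encode characters from U+0000 to U+D7FF and from U+E000 to U+FFFF, and a pair of 16-bit words (a surrogate pair, drawn from the reserved range D800h to DFFFh) to encode the remaining Unicode characters, U+10000 to U+10FFFF.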
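
The arithmetic behind the pair of 16-bit words is short enough to write out. This is a sketch of the standard surrogate pair calculation; the function name is invented for this example, and the result is compared against Python's built-in UTF-16 codec as a cross-check.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values do not encode characters")
    if code_point < 0x10000:
        return [code_point]              # one 16-bit word
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return [high, low]

units = to_utf16_units(0x1D11E)
print([f"{u:04X}" for u in units])

# Cross-check against the built-in codec (big-endian, no byte order mark).
packed = b"".join(u.to_bytes(2, "big") for u in units)
print(packed == "\U0001D11E".encode("utf-16-be"))
```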

Finally, any Unicode character can be represented as a single 32-bit unit in UTF-32.

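For more information, see: Unicode FAQ - Basic questions (http://www.unicode.org/faq/basic_q.html), Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM (http://www.unicode.org/faq/utf_bom.html), Wikipedia: UTF-8 (http://de.wikipedia.org/wiki/UTF-8), Wikipedia: ISO-8859 (http://de.wikipedia.org/wiki/ISO-8859)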

Lazarus Component Library architecture essentials

The LCL consists of two parts:

  1. A target platform independent part, which implements a class hierarchy analogous to Delphi VCL;
  2. "Interfaces" - a part that implements the interface to APIs of each target platform.

The communication between the two parts is done by an abstract class TWidgetset. Each widgetset is implemented by its own class derived from TWidgetset.
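
The division of labour can be pictured with a small sketch. It is written in Python purely for illustration and does not reproduce the actual LCL declarations: the class and method names below are hypothetical stand-ins, and the only idea taken from the text is that platform-independent code talks to an abstract class while each widgetset supplies its own derived class.

```python
from abc import ABC, abstractmethod

class Widgetset(ABC):
    """Stand-in for the abstract TWidgetset; the method name is hypothetical."""

    @abstractmethod
    def to_native(self, text: str) -> bytes:
        """Convert text to whatever the native toolkit expects."""

class AnsiWidgetset(Widgetset):
    """Stand-in for a widgetset in the ANSI group (code page is an assumption)."""

    def __init__(self, code_page: str = "iso-8859-1"):
        self.code_page = code_page

    def to_native(self, text: str) -> bytes:
        # Characters missing from the code page are replaced by "?".
        return text.encode(self.code_page, errors="replace")

class Utf8Widgetset(Widgetset):
    """Stand-in for a widgetset in the Unicode group."""

    def to_native(self, text: str) -> bytes:
        return text.encode("utf-8")

def set_caption(widgetset: Widgetset, text: str) -> bytes:
    """Platform-independent code only sees the abstract interface."""
    return widgetset.to_native(text)

print(set_caption(AnsiWidgetset(), "\u00e9"))
print(set_caption(Utf8Widgetset(), "\u00e9"))
```

Because the caller never names a concrete class, the encoding decision stays inside each widgetset.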

The GTK 1 widgetset is the oldest. In this widgetset the string encoding is determined by the LANG environment variable, which usually selects one of the ISO-8859-n group of single-byte encodings. Recently (as of Mandriva 2007, for example), many distributions have been shipping Gtk 1 configured for UTF-8. Our Gtk 1 interface lacks proper support for UTF-8 in the keyboard handling routines, so this is a big problem that increases the need for Lazarus to implement cross-platform Unicode support.

The Gtk2 widgetset only works with UTF-8 encoding, but the keyboard code of the interface is still based on the old gtk 1 code, so it does not support UTF-8 completely.

The win32 interface is set up with ANSI widgets, so it is currently not possible to use Unicode with win32.

The Qt interface is prepared for UTF-8. Qt itself uses UCS-2 as its native encoding, but the Lazarus interface for Qt converts from UTF-8 to UCS-2.

Windows CE only supports UCS-2 as character encoding, but our interface for it currently converts strings from ISO to UCS-2 before calling the Windows API. This is very easy to fix, as all conversion code is concentrated in a few routines in the winceproc.pp file.

For more information, see: Internals of the LCL

Unicode-enabling the win32 interface

Requirements

The spirit of Lazarus is: "Write once, compile everywhere." This means that, ideally, a Unicode enabled application should have only one Unicode supporting source code version, without any conditional defines for the various target platforms.

The "interface" part of the LCL should support Unicode for the target platforms which support it themselves, concealing at the same time all peculiarities from the application programmer.

Windows platforms <=Win9x are based on ISO code page standards and do not support Unicode. Windows platforms starting with WinNT support Unicode. To do so, these platforms offer two parallel sets of API functions: the old ANSI enabled *A and the new, Unicode enabled *W. *W functions accept wide strings, i.e. UTF-16 encoded strings, as parameters.

As far as Lazarus is concerned, the internal string communication at the boundaries "Application code <--> LCL" and "LCL <--> Widgetsets" is based on classical (byte oriented) strings. Logically, their contents should be encoded in UTF-8.

It is reasonable to assume that the existing WinXX application base does not use UTF-8 encoded strings internally, but ISO code page based ones. Any Unicode enabling changes to the LCL and the widgetsets for win32 must not break the existing application base. At the same time they should support applications which are internally based on UTF-8 encoded strings, both on older Win9x platforms and on Unicode based >=WinNT platforms.
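
The practical difference between the two function families is what has to be handed over as the string parameter. The sketch below does not call any Windows API. It only prepares the byte buffers that a UTF-8 string would have to be converted into, and it assumes code page 1252 as the ANSI code page purely for the sake of the example. The function names are invented.

```python
def for_ansi_api(utf8: bytes, code_page: str = "cp1252") -> bytes:
    """Buffer for an *A function: code page bytes plus a one-byte terminator."""
    text = utf8.decode("utf-8")
    # Characters missing from the code page cannot be passed on; they become "?".
    return text.encode(code_page, errors="replace") + b"\x00"

def for_wide_api(utf8: bytes) -> bytes:
    """Buffer for a *W function: UTF-16 little-endian plus a two-byte terminator."""
    return utf8.decode("utf-8").encode("utf-16-le") + b"\x00\x00"

latin = "\u00e9".encode("utf-8")      # a character present in code page 1252
cyrillic = "\u0416".encode("utf-8")   # a character absent from code page 1252

print(for_ansi_api(latin), for_wide_api(latin))
print(for_ansi_api(cyrillic), for_wide_api(cyrillic))
```

Only the wide path carries the Cyrillic character through unchanged, which is why a Unicode enabled interface has to go through the *W functions wherever they are available.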
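
Keeping UTF-8 inside byte oriented strings has one consequence that application code needs to keep in mind: the length of the string in bytes is no longer the number of characters, and cutting at an arbitrary byte index can split a character. A minimal sketch of both points follows, with invented function names and a made-up sample string.

```python
def char_count(utf8: bytes) -> int:
    """Count code points: every byte that is not a continuation byte starts one."""
    return sum(1 for b in utf8 if (b & 0xC0) != 0x80)

def safe_truncate(utf8: bytes, max_bytes: int) -> bytes:
    """Cut to at most max_bytes without splitting a multibyte sequence."""
    if len(utf8) <= max_bytes:
        return utf8
    end = max_bytes
    # If the byte just after the cut is a continuation byte, the cut falls
    # inside a character, so move back to the start of that character.
    while end > 0 and (utf8[end] & 0xC0) == 0x80:
        end -= 1
    return utf8[:end]

s = "Gr\u00f6\u00dfe".encode("utf-8")   # 5 characters, 7 bytes
print(len(s), char_count(s))
print(safe_truncate(s, 3))
print(safe_truncate(s, 4).decode("utf-8"))
```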


Guidelines

Making progress