Difference between revisions of "LCL Unicode Support/de"

From Free Pascal wiki

Revision as of 12:59, 1 August 2006


Introduction

Lazarus support for the Unicode standard needs further development, mainly on the Windows platform. Here is some basic information for those who want to develop Lazarus Unicode support further. Please correct, extend and update this page.

It helps if you have already heard of the Unicode standard and perhaps have some experience with WideStrings under Delphi. Previous use of non-(western) Latin scripts and their various character sets will help too.

Unicode essentials

The Unicode standard maps integers from 0 to 10FFFF(h) to characters. Each such integer is called a code point. In other words, Unicode characters are in principle defined for code points from U+000000 to U+10FFFF (0 to 1 114 111).

There are three schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. Conversion between any two of them is possible. Here are their basic properties:

                              UTF-8  UTF-16  UTF-32
Smallest code point [hex]    000000  000000  000000
Largest code point [hex]     10FFFF  10FFFF  10FFFF
Code unit size [bits]             8      16      32
Minimum bytes/character           1       2       4
Maximum bytes/character           4       4       4
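
The byte counts in the table can be checked with a few lines of Python, using only the standard codecs. This is a minimal sketch; the sample code points and the helper name encoded_sizes are chosen here for illustration and are not part of the original page.

```python
# Encode a few sample code points in each Unicode transformation format
# and count the bytes each one needs.
# Samples: 'A', e-acute, euro sign, a code point above U+FFFF, the largest code point.
SAMPLES = [0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF]


def encoded_sizes(code_point):
    """Return the byte length of one code point in (UTF-8, UTF-16, UTF-32)."""
    ch = chr(code_point)
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # "-le" variant: no byte order mark is added
        len(ch.encode("utf-32-le")),  # "-le" variant: no byte order mark is added
    )


if __name__ == "__main__":
    for cp in SAMPLES:
        print(f"U+{cp:06X}", encoded_sizes(cp))
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 needs 2 or 4 bytes, and UTF-32 always needs 4.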

UTF-8 has several important and useful properties: It is interpreted as a sequence of bytes, so the concept of lo- and hi-order byte does not exist. Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. All characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. The byte sequence of one character is never contained within a longer byte sequence of another character. This allows easy search for substrings. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range C2h to F4h (given the 4-byte maximum shown in the table above) and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and makes the encoding robust.
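
The bit layout behind these properties can be sketched as a small hand-written encoder. This is an illustration only, assuming the 4-byte form of UTF-8 from the table above; the function names utf8_encode, is_continuation and next_char_start are hypothetical, and real code would use the conversion routines of its runtime library.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point (0..0x10FFFF) as UTF-8 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # ASCII range: a single byte 00h..7Fh
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def is_continuation(byte):
    """True for bytes 80h..BFh, which never start a character."""
    return (byte & 0xC0) == 0x80


def next_char_start(data, pos):
    """Resynchronize: move forward from pos to the next character boundary."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos
```

For example, utf8_encode(0x20AC) yields the three bytes E2h 82h ACh, and next_char_start applied to a position in the middle of that sequence skips the continuation bytes and lands on the start of the following character.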

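UTF-16 has the following most important properties: It uses a single 16-bit word to encode characters from U+0000 to U+D7FF and from U+E000 to U+FFFF, and a pair of 16-bit words (a surrogate pair) to encode the remaining Unicode characters, U+10000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for the surrogates themselves.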
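
The split into single words and word pairs can be sketched as follows. This is a minimal illustration of the standard surrogate arithmetic; the function name utf16_units is hypothetical.

```python
def utf16_units(code_point):
    """Return the list of 16-bit code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        # One 16-bit word holds the code point directly.
        return [code_point]
    # Above U+FFFF: spread the remaining 20 bits over two words.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # top 10 bits, range D800h..DBFFh
    low = 0xDC00 | (offset & 0x3FF)   # bottom 10 bits, range DC00h..DFFFh
    return [high, low]
```

With this sketch, U+20AC stays a single word (20ACh), while U+1F600 becomes the pair D83Dh DE00h.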

Finally, any Unicode character can be represented as a single 32-bit unit in UTF-32.

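For more information, see: Unicode FAQ - Basic questions (http://www.unicode.org/faq/basic_q.html), Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM (http://www.unicode.org/faq/utf_bom.html), Wikipedia: UTF-8 (http://en.wikipedia.org/wiki/UTF-8)

Lazarus Component Library architecture essentials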

(This part is based on a mail by Marc Weustink.) The LCL consists of two parts:

  1. A target platform independent part, which implements a class hierarchy analogous to Delphi VCL;
  2. "Interfaces" - a part that implements the interface to APIs of each target platform.

The two parts communicate through an abstract class, TWidgetset. Each widgetset is implemented by its own class derived from TWidgetset.
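
The two-part structure can be pictured with a short sketch. It is written in Python purely for illustration and does not reproduce the real LCL classes: Widgetset stands in for TWidgetset, and the method set_text, the classes GtkWidgetset, Win32Widgetset and Button, and the returned log strings are all hypothetical.

```python
from abc import ABC, abstractmethod


class Widgetset(ABC):
    """Stand-in for the abstract TWidgetset; the method name is hypothetical."""

    @abstractmethod
    def set_text(self, handle, text):
        """Pass a control's text to the native toolkit; return a log line."""


class GtkWidgetset(Widgetset):
    """Hypothetical interface part for GTK."""

    def set_text(self, handle, text):
        return f"gtk: widget {handle} <- {text!r}"


class Win32Widgetset(Widgetset):
    """Hypothetical interface part for win32."""

    def set_text(self, handle, text):
        return f"win32: window {handle} <- {text!r}"


class Button:
    """Platform-independent part: it only knows the abstract interface."""

    def __init__(self, widgetset, handle):
        self.widgetset = widgetset
        self.handle = handle

    def set_caption(self, caption):
        # The byte string crosses the "LCL <--> widgetset" boundary here.
        return self.widgetset.set_text(self.handle, caption)
```

The point of the design is that Button never names a platform: swapping GtkWidgetset for Win32Widgetset changes the native calls without touching the platform-independent code.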

The GTK widgetset is the oldest. In this widgetset the string encoding is determined by the LANG environment variable. If it names a UTF-8 variant, all strings passed to and from native controls/widgets are UTF-8 encoded. However, UTF-8 may affect keyboard handling under gtk1. gtk2 solves this problem, but the solution is not implemented yet: the keyboard routines there still rely on gtk1 code.

The win32 interface is set up with ANSI widgets, so it is currently not possible to use Unicode with win32.

For more information, see: Internals of the LCL
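
As a rough illustration of what "a UTF-8 variant" of LANG means, the following sketch inspects the codeset part of a locale string. The parsing rule and the function name lang_uses_utf8 are assumptions made for this example; the page does not say how the GTK widgetset itself evaluates LANG.

```python
import os


def lang_uses_utf8(lang):
    """Guess whether a LANG value such as 'de_DE.UTF-8' selects UTF-8.

    Assumes the common layout language_TERRITORY.codeset@modifier.
    """
    codeset = lang.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    current = os.environ.get("LANG", "")
    print(repr(current), "->", lang_uses_utf8(current))
```

Values such as "de_DE.UTF-8" or "en_US.utf8" are treated as UTF-8, while "de_DE.ISO-8859-1", "C" or an unset variable are not.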

Unicode-enabling the win32 interface

Requirements

The spirit of Lazarus is: "Write once, compile everywhere." This means that, ideally, a Unicode-enabled application should have only one Unicode-supporting source code version, without any conditional defines for the various target platforms.

The "interface" part of the LCL should support Unicode on those target platforms which support it themselves, while hiding all platform peculiarities from the application programmer.

Windows platforms <=Win9x are based on ISO code page standards and do not support Unicode. Windows platforms starting with WinNT support Unicode. To do so, these platforms offer two parallel sets of API functions: the old ANSI-enabled *A functions and the new, Unicode-enabled *W functions. The *W functions accept wide strings, i.e. UTF-16 encoded strings, as parameters.

As far as Lazarus is concerned, the internal string communication at the boundaries "Application code <--> LCL" and "LCL <--> Widgetsets" is based on classical (byte-oriented) strings. Logically, their contents should be encoded as UTF-8.
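
To make the difference between the two string forms concrete, the sketch below converts a UTF-8 byte string into the buffers that a *W or an *A function would receive. It only shows the encodings and calls no Windows API; the function names to_wide and to_ansi, and the choice of code page 1252 as the ANSI code page, are assumptions for illustration.

```python
def to_wide(utf8_bytes):
    """UTF-8 -> UTF-16LE bytes plus a 16-bit terminating zero (for *W)."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le") + b"\x00\x00"


def to_ansi(utf8_bytes, code_page="cp1252"):
    """UTF-8 -> code page bytes plus a terminating zero (for *A).

    Characters missing from the code page are replaced by '?',
    so this direction can lose information.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(code_page, errors="replace") + b"\x00"
```

Converting the UTF-8 form of "Grüße" gives 12 bytes for the wide buffer and 6 bytes for the ANSI buffer. A Japanese string survives the wide conversion but degrades to question marks under code page 1252, which is why only the *W path gives full Unicode support.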

It is reasonable to assume that the existing WinXX application base does not use UTF-8 encoded strings internally, but ISO code page based ones. Any Unicode-enabling changes to the LCL and the widgetsets for win32 must not break this existing application base. At the same time they should support applications which are internally based on Unicode UTF-8 encoded strings, both on the older Win9x platforms and on the Unicode-based >=WinNT platforms.
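
Why the two kinds of application cannot simply share one code path is easy to demonstrate: the same text produces different bytes in a code page and in UTF-8, and each byte string is wrong when read the other way. The sketch below assumes code page 1252 as the legacy encoding; the helper name is_valid_utf8 and the sample text are chosen for illustration.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


TEXT = "Gr\u00fc\u00dfe"            # "Grüße"
legacy = TEXT.encode("cp1252")      # what an existing code page application passes
modern = TEXT.encode("utf-8")       # what a UTF-8 based application passes

if __name__ == "__main__":
    print(legacy, is_valid_utf8(legacy))
    print(modern, is_valid_utf8(modern))
    # Reading UTF-8 bytes as code page text silently yields garbage:
    print(modern.decode("cp1252"))
```

The legacy bytes are rejected as UTF-8, and the UTF-8 bytes read as code page 1252 produce different, garbled text. Any change to the win32 interface therefore has to know which of the two encodings an application's strings use.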


A solution approach

Making progress