{{LCL Unicode Support}}
#REDIRECT [[Unicode Support in Lazarus]]

== Introduction ==

Lazarus support for the Unicode standard needs further development, mostly on
the Windows platform. This page collects some basic information for those who
would like to develop the Lazarus Unicode support further.
Please correct, extend and update this page.

It will help if you have already heard of the Unicode standard and perhaps have
some experience with WideStrings under Delphi. Previous use of non-(Western)
Latin scripts and their various character sets will help too.

== Unicode essentials ==

The Unicode standard maps integers from 0 to 10FFFF (hexadecimal) to characters.
Each such integer is called a code point. In other words, Unicode characters are
in principle defined for code points from U+000000 to U+10FFFF (0 to 1 114 111).

There are three schemes for representing Unicode code points as unique byte
sequences. These schemes are called Unicode transformation formats: UTF-8,
UTF-16 and UTF-32. Conversion between any two of them is possible.
Here are their basic properties:
{| class="wikitable"
! !! UTF-8 !! UTF-16 !! UTF-32
|-
| Smallest code point [hex] || 000000 || 000000 || 000000
|-
| Largest code point [hex] || 10FFFF || 10FFFF || 10FFFF
|-
| Code unit size [bits] || 8 || 16 || 32
|-
| Minimal bytes/character || 1 || 2 || 4
|-
| Maximal bytes/character || 4 || 4 || 4
|}

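The byte counts in the table can be checked with a few lines of code. The following is only a minimal sketch in Python (chosen here for brevity, not because Lazarus uses it); the helper name <code>encoded_sizes</code> and the sample characters are assumptions made for this illustration.

```python
# Sample characters chosen for illustration: one each that needs
# 1, 2, 3 and 4 bytes in UTF-8.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codec variants are used so that no byte order mark
    # is prepended to the result.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for ch in SAMPLES:
    # Print the code point followed by the three byte counts.
    print(f"U+{ord(ch):06X}", encoded_sizes(ch))
```

Characters up to U+FFFF take 2 bytes in UTF-16 and everything above takes 4, while UTF-32 always takes 4 bytes.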
'''UTF-8''' has several important and useful properties: It is interpreted as a
sequence of bytes, so the concept of low- and high-order byte does not exist.
Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to
7Fh (ASCII compatibility). This means that files and strings which contain only
7-bit ASCII characters have the same encoding under both ASCII and UTF-8. All
characters above U+007F are encoded as a sequence of several bytes, each of
which has the most significant bit set. The byte sequence of one character is
never contained within the longer byte sequence of another character, which
allows easy searching for substrings. The first byte of a multibyte sequence
that represents a non-ASCII character is always in the range C2h to F4h and
indicates how many bytes follow for this character. All further bytes in a
multibyte sequence are in the range 80h to BFh. This allows easy
resynchronization and makes the encoding robust.

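The byte ranges described above are enough to classify any byte of a UTF-8 stream without looking at its neighbours. The sketch below illustrates this in Python, assuming the current definition of UTF-8 (RFC 3629), in which lead bytes lie in the range C2h to F4h. The function names <code>classify</code>, <code>sequence_length</code> and <code>resync</code> are hypothetical and exist only for this example.

```python
def classify(byte):
    """Classify a single byte of a UTF-8 stream by its value alone."""
    if byte <= 0x7F:
        return "ascii"          # one-byte character, identical to ASCII
    if 0x80 <= byte <= 0xBF:
        return "continuation"   # second or later byte of a multibyte sequence
    if 0xC2 <= byte <= 0xF4:
        return "lead"           # first byte of a multibyte sequence
    return "invalid"            # C0h, C1h and F5h..FFh never occur in UTF-8


def sequence_length(lead):
    """Return the total length in bytes of the sequence started by lead."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("not a valid first byte of a UTF-8 sequence")


def resync(data, pos):
    """Advance pos to the next character boundary in data."""
    # Skipping continuation bytes is all that is needed to find the
    # start of the next character after landing mid-sequence.
    while pos < len(data) and classify(data[pos]) == "continuation":
        pos += 1
    return pos
```

For example, in the bytes of <code>"a€b"</code> (61h E2h 82h ACh 62h), calling <code>resync</code> at index 2 skips the two continuation bytes and returns index 4, the start of <code>b</code>.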
'''UTF-16''' has the following most important properties: It uses a single
16-bit word to encode characters from U+0000 to U+FFFF (the range U+D800 to
U+DFFF is reserved for surrogates and contains no characters), and a pair of
16-bit words (a surrogate pair) to encode any of the remaining Unicode
characters, U+10000 to U+10FFFF.

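The split into one or two 16-bit words follows the standard surrogate arithmetic. The Python sketch below shows it; the function name <code>to_utf16_units</code> is an assumption made for this illustration.

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units (as integers) for code point cp."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Surrogate values are reserved and do not denote characters.
        raise ValueError("not a valid Unicode character code point")
    if cp <= 0xFFFF:
        # Basic Multilingual Plane: a single 16-bit word.
        return [cp]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]
```

For instance, U+1F600 yields the pair D83Dh DE00h, which matches what a standard UTF-16 encoder produces.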
Finally, any Unicode character can be represented as a single 32-bit unit in
'''UTF-32'''.

For more, see:
[http://www.unicode.org/faq/basic_q.html Unicode FAQ - Basic questions],
[http://www.unicode.org/faq/utf_bom.html Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM],
[http://en.wikipedia.org/wiki/UTF-8 Wikipedia: UTF-8]

== Lazarus component library architecture essentials ==

(This part is based on a mail by Marc Weustink.)

The LCL consists of two parts:
# A target platform independent part, which implements a class hierarchy analogous to the Delphi VCL;
# "Interfaces" - a part that implements the interface to the APIs of each target platform.

The communication between the two parts is done through an abstract class,
TWidgetset. Each widgetset is implemented by its own class derived from
TWidgetset.

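The two-part structure can be pictured with a small model. The sketch below is written in Python purely for illustration and is not LCL code: <code>Widgetset</code> stands in for TWidgetset, and the names <code>set_text</code>, <code>RecordingWidgetset</code> and <code>Control</code> are hypothetical.

```python
from abc import ABC, abstractmethod


class Widgetset(ABC):
    """Hypothetical stand-in for the abstract TWidgetset class."""

    @abstractmethod
    def set_text(self, handle, text):
        """Pass a byte string to the native widget identified by handle."""


class RecordingWidgetset(Widgetset):
    """Toy 'platform' that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def set_text(self, handle, text):
        self.calls.append((handle, text))


class Control:
    """Platform-independent part: it only talks to the abstract class."""

    def __init__(self, widgetset, handle):
        self.widgetset = widgetset
        self.handle = handle

    def set_caption(self, caption):
        # The control never calls a platform API directly.
        self.widgetset.set_text(self.handle, caption)
```

Because the control depends only on the abstract class, string encoding concerns can in principle be handled inside each concrete widgetset.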
The GTK widgetset is the oldest. In this widgetset the string encoding is
determined by the LANG environment variable. If it is a UTF-8 variant, all
strings passed to and from native controls/widgets are UTF-8 encoded. However,
UTF-8 may affect keyboard handling for gtk1. On gtk2 this problem is solved in
principle, but the solution is not implemented yet: the keyboard routines there
still rely on gtk1 code.

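How a LANG value can be tested for a UTF-8 variant is sketched below in Python. This is an assumption-laden illustration, not the check the GTK widgetset actually performs: it looks only at LANG (real locale selection also involves variables such as LC_ALL and LC_CTYPE), and the function name <code>lang_is_utf8</code> is hypothetical.

```python
def lang_is_utf8(lang):
    """Return True if a LANG value such as 'en_US.UTF-8' names UTF-8."""
    # LANG has the general form language[_territory][.codeset][@modifier].
    # Take what follows the first dot, then drop an optional @modifier.
    codeset = lang.partition(".")[2].partition("@")[0]
    # Accept the common spellings "UTF-8", "utf8" and "UTF8".
    return codeset.replace("-", "").lower() == "utf8"
```

A caller would typically pass <code>os.environ.get("LANG", "")</code> to this function.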
The win32 interface is set up with ANSI widgets, so it is currently not
possible to use Unicode with win32.

For more, see: [[LCL Internals#Internals of the LCL|Internals of the LCL]]

== Unicode-enabling the win32 interface ==

=== Requirements ===

The spirit of Lazarus is: "Write once, compile everywhere." This means that,
ideally, a Unicode-enabled application should have only one Unicode-supporting
source code version, without any conditional defines for the various target
platforms.

The "interface" part of the LCL should support Unicode on the target platforms
which support it themselves, while concealing all platform peculiarities from
the application programmer.

Windows platforms up to and including Win9x are based on ANSI code pages and do
not support Unicode. Windows platforms starting with WinNT support Unicode. To
do so, these platforms offer two parallel sets of API functions: the old
ANSI-enabled *A functions and the new, Unicode-enabled *W functions. The *W
functions accept wide strings, i.e. UTF-16 encoded strings, as parameters.

As far as Lazarus is concerned, the internal string communication at the
boundaries "Application code <--> LCL" and "LCL <--> Widgetsets" is based on
classical (byte-oriented) strings. Logically, their contents should be encoded
as UTF-8.

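If the boundary strings are UTF-8 and the *W functions expect UTF-16, a conversion has to take place somewhere in between. The Python sketch below shows only the conversion itself, under the assumption that it would happen inside the widgetset; the function names <code>utf8_to_wide</code> and <code>wide_to_utf8</code> are hypothetical and do not refer to existing LCL routines.

```python
def utf8_to_wide(utf8_bytes):
    """Convert a UTF-8 byte string to a list of UTF-16 code units."""
    text = utf8_bytes.decode("utf-8")
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    # Group the raw bytes into 16-bit words.
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]


def wide_to_utf8(units):
    """Convert a list of UTF-16 code units back to a UTF-8 byte string."""
    raw = b"".join(unit.to_bytes(2, "little") for unit in units)
    return raw.decode("utf-16-le").encode("utf-8")
```

The round trip is lossless for any valid UTF-8 input, so the application side can stay byte-oriented while the native side sees wide strings.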
It is sound to assume that the existing WinXX application base does not use
UTF-8 encoded strings internally, but code page based ones. Any Unicode-enabling
changes to the LCL and the widgetsets for win32 must not break the existing
application base. At the same time they should support applications which are
internally based on UTF-8 encoded strings, both on the older Win9x platforms
and on the Unicode-based platforms from WinNT onwards.

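One way to picture the last requirement is a single conversion point that chooses between the two API families at run time. The Python sketch below is only an illustration under stated assumptions, not a proposed design: the function name <code>prepare_for_api</code>, the <code>unicode_platform</code> flag and the default code page <code>cp1252</code> are all hypothetical, and the sketch does not address legacy applications that already pass code page based strings.

```python
def prepare_for_api(utf8_bytes, unicode_platform, ansi_codepage="cp1252"):
    """Return (api_suffix, encoded_bytes) for a UTF-8 encoded string.

    unicode_platform -- True on WinNT-like systems that offer *W functions
    ansi_codepage    -- code page assumed for Win9x-like systems
    """
    text = utf8_bytes.decode("utf-8")
    if unicode_platform:
        # *W functions take UTF-16, so nothing is lost.
        return "W", text.encode("utf-16-le")
    # *A functions take code page strings; characters the code page
    # cannot represent are replaced with "?".
    return "A", text.encode(ansi_codepage, errors="replace")
```

On a Win9x-like system the conversion is lossy for characters outside the active code page, which is one reason the requirement is hard to satisfy fully.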
=== A solution approach ===

=== Making progress ===