Revision as of 10:44, 11 April 2015

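Introduction

As of version 0.9.25, Lazarus has full Unicode support on all platforms except Gtk1. This page covers Unicode support up to Lazarus 1.4 with FPC 2.6.4. Starting with 2.0, Lazarus uses features of FPC 3.0+ to improve its Unicode support. For details, see: Better Unicode Support in Lazarus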

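Instructions for users

Although Lazarus has Unicode widgetsets, not everything is Unicode. It is the developer's responsibility to know which encoding a string is in and to convert properly between libraries that expect different encodings.

Encodings are usually decided per library (for example a dynamic library (DLL) or a Lazarus package). Each library expects one kind of encoding, usually either Unicode (UTF-8 in Lazarus) or ANSI (which actually means the system encoding, and that may or may not be UTF-8). The RTL and FCL of FPC 2.6 and earlier expect ANSI strings.

You can convert between Unicode and ANSI with the following functions:

UTF8ToAnsi and AnsiToUTF8 from the (FPC) System unit
or UTF8ToSys and SysToUTF8 from the (Lazarus) FileUtil unit

The latter pair is faster and smarter, but pulls more code into your program.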

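FPC is not Unicode aware

In current FPC versions (2.6.x and earlier), the Free Pascal Runtime Library (RTL) and the Free Pascal Component Library (FCL) are ANSI, so strings coming from or going to Unicode libraries (e.g. the LCL) must be converted.

The FPC 2.7.1 development branch brings significant improvements to string handling, including the String type. For RawByteString and UTF8String, see FPC Unicode support.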

Converting between ANSI and Unicode

  Note: AnsiToUTF8 and UTF8ToAnsi require a widestring manager under Linux, BSD and Mac OS X. You can use the SysToUTF8 and UTF8ToSys functions (unit FileUtil) instead, or install a widestring manager by adding cwstring as one of the first units in your program's uses section.

Examples:

Say you get a string from a TEdit and want to pass it to some RTL file routine:

var
  MyString: string; // encoded in UTF-8
begin
  MyString := MyTEdit.Text;
  SomeRTLRoutine(UTF8ToAnsi(MyString));
end;

And in the opposite direction:

var
  MyString: string; // encoded in ANSI
begin
  MyString := SomeRTLRoutine;
  MyTEdit.Text := AnsiToUTF8(MyString);
end;

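Widestrings and Ansistrings

A widestring is a string type whose basic data elements are 2 bytes in size. Widestrings almost always hold data encoded as UTF-16. See widestring.

Each data point (code unit) obtained by indexing a widestring as an array takes 2 bytes, but in UTF-16 a character may consist of 1 or 2 data points and therefore occupy 2 or 4 bytes. Indexing a Widestring as an array and expecting to obtain UTF-16 characters that way is therefore definitely wrong: it fails as soon as the string contains a 4-byte character. Like UTF-8, UTF-16 can also contain decomposed characters. For example, the character "Á" can be encoded as a single character or as two characters: "A" + a combining accent. A Unicode text with accented characters can thus be encoded in several ways, and neither Lazarus nor FPC handles this automatically.

When passing Ansistrings to Widestrings you have to convert the encoding.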
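The indexing pitfall described above can be made concrete with a small sketch. The function name Utf16CodePointCount is hypothetical (it is not part of the RTL or the LCL); it only illustrates that the number of 2-byte elements and the number of code points differ once a surrogate pair is present. It does not merge combining accents, which remain separate code points.

```pascal
program CountUtf16CodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points in a UTF-16 string, treating a
// high surrogate followed by a low surrogate as one code point.
function Utf16CodePointCount(const w: WideString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(w) do
  begin
    if (Ord(w[i]) >= $D800) and (Ord(w[i]) <= $DBFF) and
       (i < Length(w)) and
       (Ord(w[i + 1]) >= $DC00) and (Ord(w[i + 1]) <= $DFFF) then
      Inc(i, 2)   // surrogate pair: two code units, one code point
    else
      Inc(i);     // BMP character (or an unpaired surrogate)
    Inc(Result);
  end;
end;

var
  w: WideString;
begin
  // 'A' followed by U+1F600, which UTF-16 stores as D83D DE00
  SetLength(w, 3);
  w[1] := WideChar($0041);
  w[2] := WideChar($D83D);
  w[3] := WideChar($DE00);
  WriteLn(Length(w));               // 3 code units
  WriteLn(Utf16CodePointCount(w));  // 2 code points
end.
```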

var 
  w: widestring;
begin
  w:='Über'; // wrong: FPC will convert from the system codepage to UTF-16
  w:=UTF8ToUTF16('Über'); // correct
  Button1.Caption:=UTF16ToUTF8(w);
end;

UTF-8 strings and characters

Up to Lazarus 0.9.30, the UTF-8 handling routines were in the LCL, in unit LCLProc. In Lazarus 0.9.31+ the routines in LCLProc remain available for compatibility, but the actual UTF-8 code lives in the lazutils package, in unit lazutf8. To operate on UTF-8 strings, use the routines from unit lazutf8 rather than the SysUtils routines from Free Pascal, because SysUtils is not prepared to handle Unicode while lazutf8 is. Simply replace each SysUtils routine with its lazutf8 equivalent, which has the same name apart from the "UTF8" prefix.

UTF-8 String Copy, Length, LowerCase, etc.

Nearly every operation you might want to perform on UTF-8 strings is covered by a routine in unit lazutf8 (unit LCLProc for Lazarus 0.9.30 or earlier). See the following list of routines taken from lazutf8.pas:

function UTF8CharacterLength(p: PChar): integer;  // length of a character
function UTF8Length(const s: string): PtrInt;     // length of a string
function UTF8Length(p: PChar; ByteCount: PtrInt): PtrInt;  
function UTF8CharacterToUnicode(p: PChar; out CharLen: integer): Cardinal;
function UnicodeToUTF8(u: cardinal; Buf: PChar): integer; inline;
function UnicodeToUTF8SkipErrors(u: cardinal; Buf: PChar): integer;
function UnicodeToUTF8(u: cardinal): shortstring; inline;
function UTF8ToDoubleByteString(const s: string): string;
function UTF8ToDoubleByte(UTF8Str: PChar; Len: PtrInt; DBStr: PByte): PtrInt;
function UTF8FindNearestCharStart(UTF8Str: PChar; Len: integer;
                                  BytePos: integer): integer;
// find the n-th UTF8 character, ignoring BIDI
function UTF8CharStart(UTF8Str: PChar; Len, CharIndex: PtrInt): PChar;
// find the byte index of the n-th UTF8 character, ignoring BIDI (byte length of substr)
function UTF8CharToByteIndex(UTF8Str: PChar; Len, CharIndex: PtrInt): PtrInt;
procedure UTF8FixBroken(P: PChar);
function UTF8CharacterStrictLength(P: PChar): integer;
function UTF8CStringToUTF8String(SourceStart: PChar; SourceLen: PtrInt) : string;
function UTF8Pos(const SearchForText, SearchInText: string): PtrInt;
function UTF8Copy(const s: string; StartCharIndex, CharCount: PtrInt): string;
procedure UTF8Delete(var s: String; StartCharIndex, CharCount: PtrInt);
procedure UTF8Insert(const source: String; var s: string; StartCharIndex: PtrInt);

function UTF8LowerCase(const AInStr: string; ALanguage: string=''): string;
function UTF8UpperCase(const AInStr: string; ALanguage: string=''): string;
function FindInvalidUTF8Character(p: PChar; Count: PtrInt;
                                  StopOnNonASCII: Boolean = false): PtrInt;
function ValidUTF8String(const s: String): String;

procedure AssignUTF8ListToAnsi(UTF8List, AnsiList: TStrings);

// comparison functions

function UTF8CompareStr(const S1, S2: string): Integer;
function UTF8CompareText(const S1, S2: string): Integer;
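As a minimal sketch of how the byte-based and character-based routines differ, the following program uses only UTF8Length and UTF8Copy from the list above. It assumes the lazutils package is on the unit search path and that the source file is saved as UTF-8 without a BOM and without a codepage directive, as described elsewhere on this page.

```pascal
program Utf8RoutinesDemo;
{$mode objfpc}{$H+}
uses
  LazUTF8; // from the lazutils package

var
  s: string;
begin
  s := 'Über';             // source file saved as UTF-8 without BOM
  WriteLn(Length(s));      // 5: bytes, because 'Ü' takes two bytes in UTF-8
  WriteLn(UTF8Length(s));  // 4: characters (code points)
  s := UTF8Copy(s, 2, 3);  // characters 2..4, counted in code points
  WriteLn(s);              // ber
end.
```

Using Copy(s, 2, 3) here instead would start in the middle of the two-byte 'Ü' and return a broken sequence, which is why the UTF8-prefixed routines are preferred for UTF-8 data.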

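Dealing with directory and file names

Lazarus controls and functions expect file names and directory names in UTF-8 encoding, but the RTL uses ANSI strings for directories and file names.

For example, consider a button that sets the Directory property of a TFileListBox to the current directory. The RTL function GetCurrentDir returns ANSI, not Unicode, so a conversion is needed: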

procedure TForm1.Button1Click(Sender: TObject);
begin
  FileListBox1.Directory:=SysToUTF8(GetCurrentDir);
  // or use the function from unit FileUtil
  FileListBox1.Directory:=GetCurrentDirUTF8;
end;

Unit FileUtil defines common file functions that work with UTF-8 strings:

// basic functions similar to the RTL but working with UTF-8 instead of the system encoding

// AnsiToUTF8 and UTF8ToAnsi need a widestring manager under Linux, BSD and Mac OS X,
// but normally these OSes use UTF-8 as the system encoding, so the widestring manager is not needed.
function NeedRTLAnsi: boolean;// true if the system encoding is not UTF-8
procedure SetNeedRTLAnsi(NewValue: boolean);
function UTF8ToSys(const s: string): string;// like UTF8ToAnsi but more independent of the widestring manager
function SysToUTF8(const s: string): string;// like AnsiToUTF8 but more independent of the widestring manager

// file operations
function FileExistsUTF8(const Filename: string): boolean;
function FileAgeUTF8(const FileName: string): Longint;
function DirectoryExistsUTF8(const Directory: string): Boolean;
function ExpandFileNameUTF8(const FileName: string): string;
function ExpandUNCFileNameUTF8(const FileName: string): string;
{$IFNDEF VER2_2_0}
function ExtractShortPathNameUTF8(Const FileName : String) : String;
{$ENDIF}
function FindFirstUTF8(const Path: string; Attr: Longint; out Rslt: TSearchRec): Longint;
function FindNextUTF8(var Rslt: TSearchRec): Longint;
procedure FindCloseUTF8(var F: TSearchrec);
function FileSetDateUTF8(const FileName: String; Age: Longint): Longint;
function FileGetAttrUTF8(const FileName: String): Longint;
function FileSetAttrUTF8(const Filename: String; Attr: longint): Longint;
function DeleteFileUTF8(const FileName: String): Boolean;
function RenameFileUTF8(const OldName, NewName: String): Boolean;
function FileSearchUTF8(const Name, DirList : String): String;
function FileIsReadOnlyUTF8(const FileName: String): Boolean;
function GetCurrentDirUTF8: String;
function SetCurrentDirUTF8(const NewDir: String): Boolean;
function CreateDirUTF8(const NewDir: String): Boolean;
function RemoveDirUTF8(const Dir: String): Boolean;
function ForceDirectoriesUTF8(const Dir: string): Boolean;

// environment
function ParamStrUTF8(Param: Integer): string;
function GetEnvironmentStringUTF8(Index : Integer): String;
function GetEnvironmentVariableUTF8(const EnvVar: String): String;
function GetAppConfigDirUTF8(Global: Boolean): string;
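The file functions listed above can be combined as in the following sketch, which counts the entries of the current directory using only the UTF-8 variants. It assumes the FileUtil unit as described on this page (Lazarus up to 1.4); later Lazarus versions are understood to provide the same routines in a different unit, so the uses clause may need adjusting.

```pascal
program ListDirUtf8;
{$mode objfpc}{$H+}
uses
  SysUtils, // TSearchRec, faAnyFile, PathDelim
  FileUtil; // UTF-8 file routines as listed above

var
  Info: TSearchRec;
  Count: Integer;
begin
  Count := 0;
  // Path and file names stay in UTF-8 throughout; no UTF8ToSys call is needed
  if FindFirstUTF8(GetCurrentDirUTF8 + PathDelim + '*', faAnyFile, Info) = 0 then
  begin
    try
      repeat
        Inc(Count);
      until FindNextUTF8(Info) <> 0;
    finally
      FindCloseUTF8(Info);
    end;
  end;
  WriteLn(Count);
end.
```

Info.Name is a UTF-8 string here, so it can be assigned directly to LCL properties such as a caption, but it would need conversion before being passed to a plain RTL routine.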

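East Asian languages on Windows

The default font (Tahoma) for user interface controls under Windows XP can display several scripts/alphabets/languages correctly. These include Arabic, Russian (Cyrillic alphabet) and Western languages (Latin/Greek alphabets), but not East Asian languages such as Korean, Chinese and Japanese.

Simply go to the Control Panel, choose Regional Settings, click the Languages tab and install the East Asian language pack; the standard user interface font will then display these languages correctly. Windows XP versions localized for these languages already have this language pack installed. More detailed instructions here

Later Windows versions include support for these languages.

Using UTF-8

For details, see UTF-8 strings and characters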

Free Pascal Particularities

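UTF8 and source files - the missing BOM

When you create source files with Lazarus and type non-ASCII characters, the file is saved in UTF-8. It does not use a BOM (Byte Order Mark). You can change the encoding by right-clicking in the source editor and choosing File Settings / Encoding. Apart from the fact that UTF-8 files are not supposed to have a BOM, the BOM is missing because of how FPC treats Ansistrings. For compatibility the LCL uses Ansistrings, and for portability the LCL uses UTF-8.

Note: Some MS Windows text editors may treat these files as encoded in the system codepage (OEM codepage) and show invalid characters. Do not add a BOM. If you add a BOM, you have to change all string assignments.

For example: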

Button1.Caption := 'Über';

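Without a BOM (and without a codepage setting), the compiler treats the string as system-encoded and copies each byte unconverted into the string. This is how the LCL expects strings.

// source file saved as UTF-8 without BOM
if FileExists('Über.txt') then ; // wrong, because FileExists expects the system encoding
if FileExistsUTF8('Über.txt') then ; // correct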
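Because an accidentally added BOM changes how string constants are compiled, it can be useful to check a source file for one. The following is a minimal sketch; StartsWithUTF8BOM is a hypothetical helper name, and the only fact it relies on is that the UTF-8 BOM is the byte sequence EF BB BF. A memory stream stands in for a source file to keep the example self-contained.

```pascal
program BomCheck;
{$mode objfpc}{$H+}
uses
  Classes;

// Hypothetical helper: true if the stream begins with the UTF-8 BOM (EF BB BF).
function StartsWithUTF8BOM(Stream: TStream): Boolean;
var
  Buf: array[0..2] of Byte;
begin
  Stream.Position := 0;
  Result := (Stream.Read(Buf, SizeOf(Buf)) = 3) and
            (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF);
end;

const
  Bom: array[0..2] of Byte = ($EF, $BB, $BF);
var
  M: TMemoryStream;
begin
  M := TMemoryStream.Create;
  try
    M.WriteBuffer(Bom, SizeOf(Bom));
    WriteLn(StartsWithUTF8BOM(M)); // TRUE
  finally
    M.Free;
  end;
end.
```

To check a real file, the same function can be given a TFileStream opened with fmOpenRead.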

Unicode essentials

The Unicode standard maps integers from 0 to 10FFFF(h) to characters. Each such mapping is called a code point. In other words, Unicode characters are in principle defined for code points from U+000000 to U+10FFFF (0 to 1 114 111).

There are three major schemes for representing Unicode code points as unique byte sequences. These schemes are called Unicode transformation formats: UTF-8, UTF-16 and UTF-32. Conversions between all of them are possible. Here are their basic properties:

                           UTF-8 UTF-16 UTF-32
Smallest code point [hex] 000000 000000 000000
Largest code point  [hex] 10FFFF 10FFFF 10FFFF
Code unit size [bits]          8     16     32
Minimal bytes/character        1      2      4
Maximal bytes/character        4      4      4

UTF-8 has several important and useful properties:

It is interpreted as a sequence of bytes, so the concept of lo- and hi-order byte does not exist.
Unicode characters U+0000 to U+007F (ASCII) are encoded simply as bytes 00h to 7Fh (ASCII compatibility). Files and strings that contain only 7-bit ASCII characters therefore have the same encoding under both ASCII and UTF-8.
All characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
No byte sequence of one character is contained within a longer byte sequence of another character. This allows easy searching for substrings.
The first byte of a multibyte sequence (representing a non-ASCII character) is always in the range C2h to F4h and indicates how many bytes follow for this character.
All further bytes in a multibyte sequence are in the range 80h to BFh. This allows easy resynchronization and robustness.
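The byte layout described above can be sketched as a small encoder. CodePointToUTF8 and BytesAsHex are hypothetical names used only for illustration (lazutf8 offers UnicodeToUTF8 for real use), and the sketch does not reject surrogates or values above U+10FFFF.

```pascal
program EncodeUtf8Demo;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: encodes one code point as 1 to 4 UTF-8 bytes.
function CodePointToUTF8(cp: Cardinal): string;
begin
  if cp <= $7F then
    Result := Chr(Byte(cp))                          // ASCII: one byte
  else if cp <= $7FF then
    Result := Chr(Byte($C0 or (cp shr 6))) +         // lead byte 110xxxxx
              Chr(Byte($80 or (cp and $3F)))         // continuation 10xxxxxx
  else if cp <= $FFFF then
    Result := Chr(Byte($E0 or (cp shr 12))) +        // lead byte 1110xxxx
              Chr(Byte($80 or ((cp shr 6) and $3F))) +
              Chr(Byte($80 or (cp and $3F)))
  else
    Result := Chr(Byte($F0 or (cp shr 18))) +        // lead byte 11110xxx
              Chr(Byte($80 or ((cp shr 12) and $3F))) +
              Chr(Byte($80 or ((cp shr 6) and $3F))) +
              Chr(Byte($80 or (cp and $3F)));
end;

// Hypothetical helper: shows the bytes of a string as hexadecimal.
function BytesAsHex(const s: string): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do
  begin
    if i > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(s[i]), 2);
  end;
end;

begin
  WriteLn(BytesAsHex(CodePointToUTF8($41)));    // 41
  WriteLn(BytesAsHex(CodePointToUTF8($E4)));    // C3 A4
  WriteLn(BytesAsHex(CodePointToUTF8($20AC)));  // E2 82 AC
  WriteLn(BytesAsHex(CodePointToUTF8($1F600))); // F0 9F 98 80
end.
```

Every byte after the first in each multibyte result falls in 80h to BFh, which is what makes resynchronization in the middle of a stream possible.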

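UTF-16 has the following most important properties: It uses a single 16-bit word to encode characters from U+0000 to U+FFFF (the range U+D800 to U+DFFF is reserved for surrogates and holds no characters), and a pair of 16-bit words (a surrogate pair) to encode the remaining Unicode characters, U+10000 to U+10FFFF.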
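How a character above U+FFFF is split into a pair of 16-bit words can be sketched as follows. HighSurrogate and LowSurrogate are hypothetical helper names; they assume the input lies in the range U+10000 to U+10FFFF and perform no validation.

```pascal
program SurrogatePairDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: first word of the pair, in the range D800h..DBFFh.
function HighSurrogate(cp: Cardinal): Word;
begin
  Result := Word($D800 + ((cp - $10000) shr 10));
end;

// Hypothetical helper: second word of the pair, in the range DC00h..DFFFh.
function LowSurrogate(cp: Cardinal): Word;
begin
  Result := Word($DC00 + ((cp - $10000) and $3FF));
end;

begin
  // U+1F600 is stored in UTF-16 as the two words D83D DE00
  WriteLn(IntToHex(HighSurrogate($1F600), 4), ' ',
          IntToHex(LowSurrogate($1F600), 4));
end.
```

The 20 bits left after subtracting 10000h are split evenly: the upper 10 bits go into the first word and the lower 10 bits into the second.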

Finally, any Unicode character can be represented as a single 4 byte/32-bit unit in UTF-32.

For more, see: Unicode FAQ - Basic questions, Unicode FAQ - UTF-8, UTF-16, UTF-32 & BOM, Wikipedia: UTF-8

Implementation Details

Lazarus/LCL generally uses only UTF-8

Since the GTK1 interface was declared obsolete in Lazarus 0.9.31, all LCL interfaces are Unicode capable and the LCL uses and accepts only UTF-8 encoded strings, except in routines explicitly marked as accepting other encodings.

Unicode-enabling the win32 interface

Overview

First, and most importantly, all Unicode patches for the Win32 interface must be enclosed by IFDEF WindowsUnicodeSupport, to avoid breaking the existing ANSI interface. After this stabilizes, all ifdefs will be removed and only the Unicode part will remain. At that point all existing programs that use ANSI characters will need to be migrated to Unicode.

No Unicode support on Win9x

Windows platforms <=Win9x are based on ISO code page standards and only partially support Unicode. Windows platforms starting with Windows NT (e.g. Windows 2000, XP, Vista, 7, 8) and Windows CE fully support Unicode.

Win 9x and NT offer two parallel sets of API functions: the old ANSI enabled *A and the new, Unicode enabled *W. *W functions accept wide strings - UTF-16 encoded strings - as parameters.

Windows 9x has all *W functions, but most of them have empty implementations and do nothing. Only some *W functions are fully implemented in 9x; these are listed below in the section "Wide functions present on Windows 9x". This matters because it allows a single application to serve both Win9x and WinNT and to detect at runtime which set of APIs to use.

Windows CE only uses Wide API functions.

Wide functions present on Windows 9x

Some Wide API functions are present on Windows 9x. Here is a list of such functions: http://support.microsoft.com/kb/210341

Conversion example:

GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption),
Length(ButtonCaption), TextSize);

Becomes:

{$ifdef WindowsUnicodeSupport}
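  // WideCaption is assumed to be a WideString variable declared in the enclosing routine
  WideCaption := Utf8Decode(ButtonCaption);
  GetTextExtentPoint32W(hdcNewBitmap, PWideChar(WideCaption), Length(WideCaption), TextSize);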
{$else}
  GetTextExtentPoint32(hdcNewBitmap, LPSTR(ButtonCaption), Length(ButtonCaption), TextSize);
{$endif}

Functions that need Ansi and Wide versions

First Conversion example:

function TGDIWindow.GetTitle: String;
var
 l: Integer;
begin
   l := Windows.GetWindowTextLength(Handle);
   SetLength(Result, l);
   Windows.GetWindowText(Handle, @Result[1], l);
end;

Becomes:

function TGDIWindow.GetTitle: String;
var
  l: Integer;
  AnsiBuffer: string;
  WideBuffer: WideString;
begin

{$ifdef WindowsUnicodeSupport}

if UnicodeEnabledOS then
begin
  l := Windows.GetWindowTextLengthW(Handle);
  SetLength(WideBuffer, l);
  // nMaxCount counts the terminating null, so pass l + 1 to keep the last character
  if l > 0 then
    l := Windows.GetWindowTextW(Handle, @WideBuffer[1], l + 1);
  SetLength(WideBuffer, l);
  Result := Utf8Encode(WideBuffer);
end
else
begin
  l := Windows.GetWindowTextLength(Handle);
  SetLength(AnsiBuffer, l);
  if l > 0 then
    l := Windows.GetWindowText(Handle, @AnsiBuffer[1], l + 1);
  SetLength(AnsiBuffer, l);
  Result := AnsiToUtf8(AnsiBuffer);
end;

{$else}

   l := Windows.GetWindowTextLength(Handle);
   SetLength(Result, l);
   if l > 0 then
     Windows.GetWindowText(Handle, @Result[1], l + 1); // nMaxCount includes the terminating null

{$endif}

end;

FPC codepages

The compiler (FPC) supports specifying the code page in which the source code has been written via the command option -Fc (e.g. -Fcutf8) and the equivalent codepage directive (e.g. {$codepage utf8}). In this case, rather than literally copying the bytes that represent the string constants in your program, the compiler will interpret all character data according to that codepage. There are two things to watch out for though:

On Unix platforms, make sure you include a widestring manager by adding the cwstring unit to your uses clause. Without it, the program will not be able to convert all character data correctly when running. It is not included by default because this unit makes your program dependent on libc, which makes cross-compilation harder.
The compiler converts all string constants that contain non-ASCII characters to widestring constants. These are automatically converted back to ansistring (either at compile time or at run time), but this leads to a pitfall if you mix both characters and ordinal values in a single string constant:

For example:

program project1;
{$codepage utf8}
{$mode objfpc}{$H+}
{$ifdef unix}
uses cwstring;
{$endif}
var
  a,b,c: string;
begin
  a:='ä';
  b:='='#$C3#$A4; // #$C3#$A4 is UTF-8 for ä
  c:='ä='#$C3#$A4;
  writeln(a,b); // writes ä=ä
  writeln(c);   // writes ä=Ã¤
end.

When compiled and executed, this will write:

ä=ä
ä=Ã¤

The reason is that, as mentioned above, once the ä is encountered the constant string assigned to 'c' is parsed as a widestring. As a result, #$C3 and #$A4 are interpreted as widechar(#$C3) and widechar(#$A4) rather than as ansichars.

Difference between revisions of "LCL Unicode Support/ko"