Difference between revisions of "FPC Unicode support"

From Free Pascal wiki

Revision as of 12:41, 24 September 2013

Introduction

The Free Pascal compiler and the RTL/FCL should support Unicode natively. Delphi has supported Unicode for several releases, and FPC must be compatible with Delphi's Unicode support.


FPC 2.7.x Unicode plans

Runtime Libraries

There will be a Unicode RTL and an ANSI/legacy compatibility RTL. See [1]

Current support via merged cpstrrtl branch

There is some support for Unicode in the RTL in current FPC trunk.

From http://www.mail-archive.com/fpc-devel@lists.freepascal.org/msg29827.html

  • merged cpstrrtl branch (includes unicode branch). In general, this adds support for arbitrarily encoded ansistrings to many routines related to file system access (and some others).

WARNING: while the parameters of many routines have been changed from "ansistring" to "rawbytestring" to avoid data loss due to conversions, this is not a panacea. If you pass a string concatenation to such a parameter and not all strings in this concatenation have the same code page, all strings and the result will be converted to DefaultSystemCodePage (= ansi code page by default). In particular, concatenating e.g. a Utf8String with a constant string and passing the result to a RawByteString parameter will convert the result into the DefaultSystemCodePage (unless the source code is compiled with {$modeswitch systemcodepage} or {$mode delphiunicode} *and* the ansi code page on the system you are compiling *on* happens to be UTF-8).

You can define and use alternative routines that explicitly accept Utf8String parameters to avoid this pitfall. Internally, all of these routines ensure that they never trigger this condition and ensure that no unnecessary/unwanted code page conversions occur.
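The pitfall and the suggested workaround can be sketched as follows. This is a minimal illustration, not code from the RTL: ShowPath and ShowPathUtf8 are hypothetical names, and the code pages actually reported depend on the compiler settings and on the system the program is compiled and run on.

```pascal
program concatdemo;
{$mode objfpc}{$H+}

{ Hypothetical routine with a RawByteString parameter, standing in for
  an RTL file system routine. It only reports the code page it receives. }
procedure ShowPath(const FileName: RawByteString);
begin
  WriteLn('code page seen by callee: ', StringCodePage(FileName));
end;

{ Hypothetical wrapper that explicitly accepts a Utf8String. The result of
  a concatenation passed here is converted to UTF-8 on the way in, and a
  single Utf8String is then handed to the RawByteString parameter without
  any further conversion. }
procedure ShowPathUtf8(const FileName: UTF8String);
begin
  ShowPath(FileName);
end;

var
  Dir: UTF8String;
begin
  Dir := 'dir';

  { Mixed concatenation passed directly: per the warning above, the
    result is normally converted to DefaultSystemCodePage. }
  ShowPath(Dir + '/file.txt');

  { Same concatenation routed through the Utf8String wrapper. }
  ShowPathUtf8(Dir + '/file.txt');
end.
```

Comparing the two reported code pages on a system whose ansi code page is not UTF-8 should make the difference visible.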

  • DefaultFileSystemCodePage variable that holds the code page used for communicating with the OS single byte file system APIs, and for the strings returned by those same APIs. Initialized with
    • the result of GetACP in the system unit of Windows platforms, except for WinCE which uses UTF-8 since its file system OS API calls already use the UTF-16 versions
    • CP_UTF8 on Unix platforms with FPCRTL_FILESYSTEM_UTF8 defined, and with DefaultSystemCodePage on other Unix platforms
    • DefaultSystemCodePage on Java/Android JVM targets
  • DefaultRTLFileSystemCodePage variable that holds the code page used to encode strings returned by RTL routines that return filenames obtained from OS API calls. By default the same as DefaultFileSystemCodePage on all platforms. Separate from DefaultFileSystemCodePage for clarity on platforms that may use either utf-16 or single byte OS API calls to send/receive file names (such as most Windows platforms)
  • new scpFileSystemSingleByte enum that can be passed to GetStandardCodePage() to get the default code page for OS single byte file system APIs, with implementations for Unix and Windows
  • SetMultiByteFileSystemCodePage() procedure to override the value of DefaultFileSystemCodePage
  • ToSingleByteFileSystemEncodedFileName() function to convert a string to DefaultFileSystemCodePage (does *not* take care of OS-specific quirks like Darwin always returning file names in decomposed UTF-8)
  • support for CP_OEMCP
  • textrec/filerec now store the filename by default using widechar. It is possible to switch back to ansichars using the FPC_ANSI_TEXTFILEREC define. In that case, from now on the filename will always be stored in DefaultFileSystemEncoding
  • fixed potential buffer overflows and non-null-terminated file names in textrec/filerec
  • when concatenating ansistrings, do not map CP_NONE (rawbytestring) to CP_ACP (defaultsystemcodepage), because if all input strings have the same code page then the result should also have that code page if it's assigned to a rawbytestring rather than getting defaultsystemcodepage
  • do not consider empty strings to determine the code page of the result in fpc_AnsiStr_Concat_multi(), because that will cause a different result than when using a sequence of fpc_AnsiStr_Concat() calls (it ignores empty strings to determine the result code page) and it's also slower
  • do not consider the run time code page of the destination string in fpc_AnsiStr_Concat(_multi)() because Delphi does not do so either. This was introduced in r19118, probably to hide another bug + test
    • never change the code page of a non-empty string when calling setlength on it
  • handle the fact that GetEnvironmentStringsA returns the environment in the OEM instead of in the Ansi code page (mantis #22524, #15233)
  • don't truncate environment variable strings in GetEnvironmentString(), its result is now ansistring/unicodestring depending on whether the RTL was compiled with FPC_RTL_UNICODE
  • unix:
    • made the ansistring parameters of the fp*() file system routine overloads constant, changed them to rawbytestring and added DefaultFileSystemCodePage conversions
    • unicodestring support for POpen(), and DefaultFileSystemCodePage support for POpen(RawByteString)
  • DefaultFileSystemCodePage support for dynlibs unit
  • rawbytestring/unicodestring overloads for:
    • system: fexpand, lowercase, uppercase, getdir, mkdir, chdir, rmdir, assign, erase, rename
    • objpas: AssignFile
    • sysutils: FileCreate, FileOpen, FileExists, DirectoryExists, FileSetDate, FileGetAttr, FileSetAttr, DeleteFile, RenameFile, FileSearch, ExeSearch, FindFirst, FindNext, FindClose, FileIsReadOnly, GetCurrentDir, SetCurrentDir, ChangeFileExt, ExtractFilePath, ExtractFileDrive, ExtractFileName, ExtractFileExt, ExtractFileDir, ExtractShortPathName, ExpandFileName, ExpandFileNameCase, ExpandUNCFileName, ExtractRelativepath, IncludeTrailingPathDelimiter, IncludeTrailingBackslash, ExcludeTrailingBackslash, ExcludeTrailingPathDelimiter, IncludeLeadingPathDelimiter, ExcludeLeadingPathDelimiter, IsPathDelimiter, DoDirSeparators, SetDirSeparators, GetDirs, ConcatPaths, GetEnvironmentVariable
      • the default string type used by FindFirst/Next depends on whether the RTL was compiled with FPC_RTL_UNICODE. To force the RawByteString version pass a TRawByteSearchRec, for the UnicodeString version pass a TUnicodeSearchRec.
  • paramstr(longint):unicodestring available for {$modeswitch unicodestrings}
  • pwidechar versions in sysutils of strecopy, strend, strcat, strcomp, strlcomp, stricomp, strlcat, strrscan, strlower, strupper, strlicomp, strpos, WideStrAlloc, StrBufSize, StrDispose + tests
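As a rough illustration of the file system items in the list above, the following sketch prints the two code page variables and then selects each FindFirst overload by the type of the search record, as described for TRawByteSearchRec and TUnicodeSearchRec. It assumes an RTL that already contains the merged cpstrrtl changes; the program name and the '*' pattern are only illustrative.

```pascal
program searchdemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

var
  RawRec: TRawByteSearchRec;
  UniRec: TUnicodeSearchRec;
begin
  { Code pages used for OS single byte file system APIs and for
    file names returned by RTL routines. }
  WriteLn('DefaultFileSystemCodePage    = ', DefaultFileSystemCodePage);
  WriteLn('DefaultRTLFileSystemCodePage = ', DefaultRTLFileSystemCodePage);

  { Passing a TRawByteSearchRec forces the RawByteString overload. }
  if FindFirst('*', faAnyFile, RawRec) = 0 then
  begin
    repeat
      WriteLn(RawRec.Name, ' (code page ', StringCodePage(RawRec.Name), ')');
    until FindNext(RawRec) <> 0;
    FindClose(RawRec);
  end;

  { Passing a TUnicodeSearchRec forces the UnicodeString overload. }
  if FindFirst('*', faAnyFile, UniRec) = 0 then
  begin
    repeat
      WriteLn(UniRec.Name, ' (', Length(UniRec.Name), ' UTF-16 code units)');
    until FindNext(UniRec) <> 0;
    FindClose(UniRec);
  end;
end.
```

Naming the record type explicitly keeps the behaviour independent of whether the RTL was compiled with FPC_RTL_UNICODE.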

Other libraries

The string architecture for the FCL and other libraries has not yet been decided. See [2].

Old/obsolete sections

Warning: This section has not been updated for a long time. Since FPC 2.7 (current development version), extensive Unicode support has been implemented.

Please update this page with the latest status, e.g. from this post on the FPC dev list

These sections are kept for historical reference - please update the sections above with this information if it is still applicable.

Tiburon Unicode support

Currently we have some information about Tiburon's Unicode support implementation.

http://blogs.codegear.com/abauer/2008/01/09/38845

http://blogs.codegear.com/abauer/2008/07/16/38864

FPC Unicode support

FPC must have the following string types with transparent conversion between them (like current AnsiString <-> WideString conversion):

  • shortstring
  • ansistring
  • widestring
  • utf8string
  • utf16string
  • utf32string
  • ucs2string (?)
  • ucs4string (?)

Development and further maintenance of these string types must be as simple as possible. It must be easy to add new string types in the future if needed.

The compiler uses a generic structure and helper routines to handle all reference-counted string types.

String header:

type
  TRefStringRec = packed record
    Encoding: word;    // encoding of string
    ElementSize: byte; // size in bytes of string's element (1-4)
    Ref: SizeInt;      // number of references
    Len: SizeInt;      // number of elements in string
  end;

Helper routines determine how to handle a string from its header.
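As a small sketch of what "handling a string from its header" could look like, the hypothetical helper below derives the payload size in bytes from the Len and ElementSize fields alone. PayloadBytes is not an RTL routine, and the header values are filled in by hand purely for illustration.

```pascal
program headerdemo;
{$mode objfpc}

type
  { Header layout as proposed above. }
  TRefStringRec = packed record
    Encoding: word;    // encoding of string
    ElementSize: byte; // size in bytes of string's element (1-4)
    Ref: SizeInt;      // number of references
    Len: SizeInt;      // number of elements in string
  end;

{ Hypothetical helper: size of the string data in bytes,
  computed only from information stored in the header. }
function PayloadBytes(const Header: TRefStringRec): SizeInt;
begin
  Result := Header.Len * Header.ElementSize;
end;

var
  H: TRefStringRec;
begin
  H.Encoding := 1200;  // UTF-16 code page number, used as an example value
  H.ElementSize := 2;  // two bytes per element
  H.Ref := 1;
  H.Len := 5;          // five elements
  WriteLn(PayloadBytes(H)); // 5 elements * 2 bytes = 10
end.
```

Because the element size travels with the string, one generic routine can serve 1-, 2- and 4-byte string types.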

An extra parameter with string type information is passed to some routines (like fpc_RefString_SetLength) so that new strings can be initialized properly.

The widestring type on Windows targets remains non-refcounted and OLE compatible. A minimal number of helper routines is used for it. On non-Windows targets widestring is an alias for utf16string.

The compiler uses helpers for string type conversions like this:

procedure fpc_ansistring_to_utf16string(out dst: utf16string; const src: ansistring);
procedure fpc_utf32string_to_utf16string(out dst: utf16string; const src: utf32string);

The compiler generates the helper procedure names from the type names. The compiler does not perform any string conversion handling by itself.

Status of Unicode support in FPC so far

Currently FPC 2.3.x has a new type called UnicodeString. It is similar to the WideString type, the difference being that UnicodeString is reference counted on all platforms.

All implementation work is currently done in a separate svn branch: http://svn.freepascal.org/svn/fpc/branches/cpstrnew

User visible changes

Full support of code page aware strings is not possible without breaking some existing code. The following list tries to summarize the most important user visible changes.

  • The string header has two new fields: encoding and element size. This increases the header size by 4 bytes on 32-bit platforms and by 8 bytes on 64-bit platforms.
  • WideCharLenToString, UnicodeCharLenToString, WideCharToString, UnicodeCharToString and OleStrToString now return a UnicodeString instead of an AnsiString.
  • the type of the dest parameter of WideCharLenToString and UnicodeCharLenToString has been changed from Ansistring to Unicodestring
  • UTF8ToAnsi and AnsiToUTF8 take a RawByteString now

Roadmap of RTL Unicode support with UnicodeString

| Topic | Status | Comments | Assigned To |
|---|---|---|---|
| Locale Variables | Not implemented | Variables are all 1 byte in size and can't hold UnicodeChar size values. e.g.: The Russian thousand separator is a no-break space $00A0 which doesn't fit in the ThousandSeparator (standard Char type) variable. | |
| TStrings | Not implemented | There is no UnicodeString version of TStrings | |
| TStringList | Not implemented | There is no UnicodeString version of TStringList | |
| Pos() | Working | | |

Roadmap of RTL Unicode support with UTF8String

| Topic | Status | Comments | Assigned To |
|---|---|---|---|
| UTF8String | Not implemented | Needs a real implementation. Is currently just an alias for ansistring. | |
| TStrings | Not implemented | There is no UTF8String version of TStrings | |
| TStringList | Not implemented | There is no UTF8String version of TStringList | |

See Also