UTF-8
Latest revision as of 10:52, 26 June 2017
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Unicode characters U+0000 to U+007F are encoded simply as bytes 00h to 7Fh. This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All characters above U+007F are encoded as a sequence of two to four bytes, each of which has its most significant bit set. The byte sequence of one character is never contained within the longer byte sequence of another character, which makes substring searches straightforward. The first byte of a multibyte sequence is always in the range C2h to F4h and indicates how many bytes follow for this character. All further bytes in the sequence are in the range 80h to BFh. Because lead bytes and continuation bytes occupy disjoint ranges, a decoder can easily resynchronize after a damaged or truncated sequence.
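The byte ranges described above can be illustrated with a minimal sketch. It is written in Python purely for illustration, and the helper names `is_continuation`, `sequence_length` and `resync` are hypothetical, not part of any library. It assumes the lead-byte ranges C2h..DFh, E0h..EFh and F0h..F4h for 2-, 3- and 4-byte sequences.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, i.e. 80h..BFh."""
    return (byte & 0xC0) == 0x80


def sequence_length(lead: int) -> int:
    """Number of bytes in the sequence announced by a lead byte.

    Returns 0 if the byte cannot start a sequence (assumed ranges,
    see the text above this block).
    """
    if lead <= 0x7F:
        return 1          # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0              # continuation byte or invalid lead byte


def resync(data: bytes, pos: int) -> int:
    """Step backwards from pos to the first byte of the character it is in."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


# 'a', U+00E9, U+20AC and U+1F600 take 1, 2, 3 and 4 bytes respectively.
data = "a\u00e9\u20ac\U0001F600".encode("utf-8")

print(len(data))                  # 10
print(resync(data, 5))            # 3: index 5 is inside the 3-byte sequence
print(sequence_length(data[6]))   # 4: F0h starts a 4-byte sequence
```

Since a continuation byte can never be mistaken for a lead byte, stepping back at most three bytes from any position is enough to find a character boundary.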
Code points | 1st byte | 2nd byte | 3rd byte | 4th byte | Leading bits of the first byte | Notes
---|---|---|---|---|---|---
U+0000..U+007F | 00..7F | | | | 0 | ASCII
U+0080..U+07FF | C2..DF | 80..BF | | | 110 | UTF-8 Latin characters
U+0800..U+0FFF | E0 | A0..BF | 80..BF | | 1110 |
U+1000..U+FFFF | E1..EF | 80..BF | 80..BF | | 1110 | UTF-8_subscripts_and_superscripts
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF | 11110 |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF | 11110 |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF | 11110 |
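As a rough illustration of how the leading bits in the table combine with the bits of the code point, the following Python sketch encodes a single code point by hand and compares the result with Python's built-in encoder. The function name `utf8_encode` is hypothetical. The rejection of the surrogate range U+D800..U+DFFF is an assumption taken from the UTF-8 standard rather than from the table.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    # Assumption: surrogates are not valid scalar values and are rejected.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % cp)
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Boundary values of each row of the table, checked against str.encode.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFF, 0x1000, 0xFFFF,
           0x10000, 0x3FFFF, 0x40000, 0xFFFFF, 0x100000, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")

print(utf8_encode(0x20AC).hex())    # e282ac
print(utf8_encode(0x10FFFF).hex())  # f48fbfbf
```

Each continuation byte carries six bits of the code point, which is why the shifts go in steps of six.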
UTF8 functions
FreePascal
The system unit contains some basic functions:
- UnicodeToUtf8
- Utf8ToUnicode
- UTF8Encode
- UTF8Decode
- AnsiToUtf8
- Utf8ToAnsi
Lazarus
Lazarus also contains UTF8 functions. For more details, see LCL Unicode Support.
See also
- Dealing with directory and filenames - UTF8 functions for files
- LCL Unicode Support - UTF8 in graphical applications
- Console mode Pascal: Unicode (UTF8) output - Showing UTF8 output in console mode/text mode programs
- UTF8 strings and characters