Title:

UTF-8, a transformation format of ISO 10646

Description:  This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements.
Author: The Internet Society
 
Network Working Group
Request for Comments: 3629
STD: 63
Obsoletes: 2279
Category: Standards Track
F. Yergeau
Alis Technologies
November 2003

UTF-8, a transformation format of ISO 10646


Status of this Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

ISO/IEC 10646-1 defines a large character set called the Universal Character Set (UCS) which encompasses most of the world's writing systems. The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8, the object of this memo. UTF-8 has the characteristic of preserving the full US-ASCII range, providing compatibility with file systems, parsers and other software that rely on US-ASCII values but are transparent to other values. This memo obsoletes and replaces RFC 2279.

Table of Contents

1. Introduction
2. Notational conventions
3. UTF-8 definition
4. Syntax of UTF-8 Byte Sequences
5. Versions of the standards
6. Byte order mark (BOM)
7. Examples
8. MIME registration
9. IANA Considerations
10. Security Considerations
11. Acknowledgements
12. Changes from RFC 2279
13. Normative References
14. Informative References
15. URI's
16. Intellectual Property Statement
17. Author's Address
18. Full Copyright Statement

1. Introduction

ISO/IEC 10646 [ISO.10646] defines a large character set called the Universal Character Set (UCS), which encompasses most of the world's writing systems. The same set of characters is defined by the Unicode standard [UNICODE], which further defines additional character properties and other application details of great interest to implementers. Up to the present time, changes in Unicode and amendments and additions to ISO/IEC 10646 have tracked each other, so that the character repertoires and code point assignments have remained in sync. The relevant standardization committees have committed to maintain this very useful synchronism.

ISO/IEC 10646 and Unicode define several encoding forms of their common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an encoding form, each character is represented as one or more encoding units. All standard UCS encoding forms except UTF-8 have an encoding unit larger than one octet, making them hard to use in many current applications and protocols that assume 8-bit or even 7-bit characters.
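
The difference in encoding unit size can be observed directly. The following is a minimal sketch, not part of this memo, that counts the encoding units used for a few characters. It assumes Python's standard codecs; the names FORMS, code_units and describe are hypothetical helpers introduced only for illustration. UCS-2 and UCS-4 are left out because Python provides no codecs under those names.

```python
# Illustrative sketch only; helper names are assumptions, not part of the memo.
# Each entry maps an encoding form to (Python codec name, encoding unit size in octets).
FORMS = {
    "UTF-8": ("utf-8", 1),
    "UTF-16": ("utf-16-le", 2),   # "-le" variants avoid a prepended byte order mark
    "UTF-32": ("utf-32-le", 4),
}

def code_units(text: str, codec: str, unit_size: int) -> int:
    """Number of encoding units needed to represent text in the given codec."""
    return len(text.encode(codec)) // unit_size

def describe(char: str) -> dict:
    """Map each encoding form to (unit size in octets, number of units used)."""
    return {
        name: (unit_size, code_units(char, codec, unit_size))
        for name, (codec, unit_size) in FORMS.items()
    }

for char in ("A", "\u20ac", "\U000233b4"):
    print(f"U+{ord(char):04X}", describe(char))
```

Only UTF-8 reports a unit size of one octet; the other forms never produce fewer than two octets per character, which is what makes them awkward for octet-oriented software.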

UTF-8, the object of this memo, has a one-octet encoding unit. It uses all bits of an octet, but has the quality of preserving the full US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for a US-ASCII character, and nothing else.
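
Both halves of this property can be checked with a short sketch. The code below is illustrative only and relies on Python's built-in "utf-8" and "ascii" codecs; ascii_octets and the sample strings are assumptions made for the example.

```python
# Illustrative sketch only; ascii_octets and the sample strings are assumptions.

def ascii_octets(data: bytes) -> bytes:
    """Keep only the octets in the US-ASCII range 00..7F."""
    return bytes(b for b in data if b <= 0x7F)

# A pure US-ASCII string has the same octets in both encodings.
plain = "Hello, world"
assert plain.encode("utf-8") == plain.encode("ascii")

# In mixed text, every octet in the range 00..7F stands for a US-ASCII
# character; multi-octet sequences never contribute such octets.
text = "price: 100 \u20ac, caf\u00e9 open"
encoded = text.encode("utf-8")
ascii_chars = "".join(c for c in text if ord(c) <= 0x7F)
assert ascii_octets(encoded).decode("ascii") == ascii_chars
```

Filtering the encoded octets and filtering the original characters yield the same US-ASCII text, so software that looks only for US-ASCII delimiters is not misled by multi-octet sequences.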

UTF-8 encodes UCS characters as a varying number of octets, where the number of octets, and the value of each, depend on the integer value assigned to the character in ISO/IEC 10646 (the character number, a.k.a. code position, code point or Unicode scalar value). This encoding form has the following characteristics (all values are in hexadecimal):

o Character numbers from U+0000 to U+007F (US-ASCII repertoire) correspond to octets 00 to 7F (7-bit US-ASCII values). A direct consequence is that a plain ASCII string is also a valid UTF-8 string.

o US-ASCII octet values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems or other software (e.g., the printf() function in C libraries) that parse based on US-ASCII values but are transparent to other values.

o Round-trip conversion is easy between UTF-8 and other encoding forms.

o The first octet of a multi-octet sequence indicates the number of octets in the sequence.

o The octet values C0, C1, F5 to FF never appear.

o Character boundaries are easily found from anywhere in an octet stream.

o The byte-value lexicographic sorting order of UTF-8 strings is the same as if ordered by character numbers. Of course this is of limited interest since a sort order based on character numbers is almost never culturally valid.

o The Boyer-Moore fast search algorithm can be used with UTF-8 data.

o UTF-8 strings can be fairly reliably recognized as such by a simple algorithm, i.e., the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length.
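
Several of the characteristics listed above can be exercised in a few lines. The following is a hedged sketch, not a conforming decoder: sequence_length reports only the length announced by a first octet and does not validate the octets that follow, and looks_like_utf8 simply delegates to Python's strict "utf-8" codec. All function and variable names are hypothetical.

```python
# Illustrative sketch only; every name below is an assumption for the example.

# Octet values that never appear in UTF-8: C0, C1 and F5..FF.
NEVER_VALID = {0xC0, 0xC1, *range(0xF5, 0x100)}

def sequence_length(first_octet: int) -> int:
    """Length announced by a first octet; 0 if it cannot start a sequence."""
    if first_octet <= 0x7F:
        return 1
    if 0xC2 <= first_octet <= 0xDF:
        return 2
    if 0xE0 <= first_octet <= 0xEF:
        return 3
    if 0xF0 <= first_octet <= 0xF4:
        return 4
    return 0  # continuation octet (80..BF) or a never-valid octet

def is_continuation(octet: int) -> bool:
    """True for octets of the form 10xxxxxx."""
    return (octet & 0xC0) == 0x80

def character_start(data: bytes, index: int) -> int:
    """Back up from index to the first octet of the character containing it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index

def announced_lengths(data: bytes) -> list:
    """Walk the data using only the length announced by each first octet."""
    lengths, i = [], 0
    while i < len(data):
        n = sequence_length(data[i])
        if n == 0:
            raise ValueError(f"octet {data[i]:02X} cannot start a sequence")
        lengths.append(n)
        i += n
    return lengths

def looks_like_utf8(data: bytes) -> bool:
    """Simple recognizer: accept the data if it decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

sample = "A\u00e9\u20ac\U000233b4".encode("utf-8")   # 1 + 2 + 3 + 4 octets
print(announced_lengths(sample))                      # [1, 2, 3, 4]
print(character_start(sample, 5))                     # 3
print(NEVER_VALID.isdisjoint(sample))                 # True

# Sorting by octet values gives the same order as sorting by character numbers.
words = ["z", "\u00e9", "\u20ac", "a", "\U000233b4", "\uff5e"]
print(sorted(words) == sorted(words, key=lambda w: w.encode("utf-8")))  # True

# Text in another encoding is unlikely to pass as UTF-8.
print(looks_like_utf8(sample), looks_like_utf8("caf\u00e9".encode("latin-1")))  # True False
```

Because first octets and continuation octets occupy disjoint ranges, character_start needs to back up by at most three octets to resynchronize on valid data.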

UTF-8 was devised in September 1992 by Ken Thompson, guided by design criteria specified by Rob Pike, with the objective of defining a UCS transformation format usable in the Plan9 operating system in a non-disruptive manner. Thompson's design was stewarded through standardization by the X/Open Joint Internationalization Group XOJIG (see [FSS_UTF]), bearing the names FSS-UTF (variant FSS/UTF), UTF-2 and finally UTF-8 along the way.

  