<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc []>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
ipr="trust200902"
docName="draft-bray-unichars-04"
category="std" consensus="true"
submissionType="IETF" tocInclude="true"
sortRefs="true"
symRefs="true"
version="3">

<front>
<title abbrev="Specifying Unicode">Specifying Unicode Character Repertoires in RFCs</title>

<author initials="T." surname="Bray" fullname="Tim Bray">
<organization>Textuality Services</organization>
<address>
<email>tbray@textuality.com</email>
</address>
</author>

<author initials="P." surname="Hoffman" fullname="Paul Hoffman">
<organization>ICANN</organization>
<address>
<email>paul.hoffman@icann.org</email>
</address>
</author>

<date/>
<keyword>Internet-Draft</keyword>

<abstract>
<t>This document describes four subsets of Unicode characters and their use in protocols and data formats.</t>
</abstract>
</front>

<middle>

<section anchor="intro" title="Introduction">

<t>When a protocol or data format has text fields, that text is normally composed of Unicode <xref target="UNICODE"/> characters, to support use by speakers of many languages.
Because of the way the Unicode Standard defines the term "Unicode character", the "set of all Unicode characters" is not always useful for technical specifications.
Instead, subsets such as those defined in this document are typically used.</t>

<t>Protocols and data formats usually need to describe exactly which selection of the available Unicode characters are to be used.
The term "character repertoire" is a well-understood concept when applied to an encoding standard; in this document it describes selected subsets of the Unicode characters.
Authors should have a way to concisely and exactly reference a stable specification that identifies a protocol or data format's character repertoire.</t>

<t>This document describes and names several subsets that have been popular choices for the character repertoires of specifications, and suggests one new subset. The goal is to provide a convenient target for cross-reference from other specifications which discuss character repertoires.</t>

<section anchor="notation" title="Notation">

<t>In this document, the numeric values assigned to Unicode characters are provided in hexadecimal. In the text, Unicode’s standard "U+", zero-padded to four places <xref target="RFC5137"/>, is used. For example, "A", decimal 65, would be expressed as U+0041, and "😉" (Winking Face), decimal 128,521, would be U+1F609.</t>

<t>Groups of numeric values described in <xref target="unicode-definitions"/> and <xref target="other-subsets"/> are given in ABNF <xref target="RFC5234"/>.
In ABNF, the hexadecimal values for characters are preceded by "%x" rather than "U+".</t>

<t>All the numeric ranges in this document are inclusive.</t>
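
<t>As an informal illustration of the two notations, the following Python sketch formats a code point in both the "U+" form and the ABNF "%x" form. The function names "u_plus" and "abnf_value" are hypothetical and are not defined by this document or by any library.</t>

<sourcecode type="python"><![CDATA[
def u_plus(code_point: int) -> str:
    """Format a code point as U+XXXX, zero-padded to four places."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return f"U+{code_point:04X}"


def abnf_value(code_point: int) -> str:
    """Format the same number as an ABNF hexadecimal value (%x...)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    return f"%x{code_point:X}"


print(u_plus(65))            # U+0041
print(u_plus(128521))        # U+1F609
print(abnf_value(0x10FFFF))  # %x10FFFF
]]></sourcecode>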

</section>

</section>

<section anchor="char-concepts" title="Character Concepts">

<t>The Unicode Standard's definition of "Unicode character" is conceptual.
However, each Unicode character is assigned an integer identifier in the range U+0000-U+10FFFF. These numbers are used to represent the characters in computer memory and storage systems and, in specifications, to specify the allowed repertoires of Unicode characters.</t>

<t>The numbers assigned to Unicode characters are called “code points”; there are potentially 1,114,112 of them.
As of 2023, fewer than 150,000 characters have had code points assigned.
It is difficult to specify that unassigned code points should be avoided, because they regularly become assigned as new characters are added to Unicode.</t>

<section anchor="transformation" title="Transformation Formats">

<t>Unicode describes a variety of "transformation formats", ways to encode code points in bytes of computer memory.
A survey of transformation formats is beyond the scope of this document.
However, it is useful to note that the "UTF-16" format represents each code point with one or two 16-bit code units, and the "UTF-8" format uses sequences of one to four bytes.</t>

<t>The "IETF Policy on Character Sets and Languages", BCP 18 <xref target="RFC2277"/>, says "Protocols MUST be able to use the UTF-8 charset", which becomes a mandate to use UTF-8 for any protocol or data format that specifies a single transformation format.
UTF-8 is widely used for interoperable data formats such as JSON, YAML, and XML.</t>
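
<t>The difference between the two transformation formats mentioned above can be seen with a short Python sketch that encodes U+1F609 (Winking Face) both ways. It uses only the standard "str.encode" method; the variable names are illustrative.</t>

<sourcecode type="python"><![CDATA[
wink = "\U0001F609"  # U+1F609, Winking Face

utf8 = wink.encode("utf-8")       # variable-length byte sequence
utf16 = wink.encode("utf-16-be")  # two 16-bit units, big-endian

print(utf8.hex(" "))   # f0 9f 98 89
print(utf16.hex(" "))  # d8 3d de 09

# "A" (U+0041) needs one byte in UTF-8 and one 16-bit unit in UTF-16.
print(len("A".encode("utf-8")), len("A".encode("utf-16-be")))  # 1 2
]]></sourcecode>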

</section>

<section anchor="problematic" title="Problematic Code Point Types">

<t>Definition D10a in section 3.4 of <xref target="UNICODE"/> defines seven code point types. Three types of code points are assigned to constructs which are not actually characters or whose value as Unicode characters in text fields is questionable: "Control", "Surrogate", and "Noncharacter".</t>

<section anchor="surrogates" title="Surrogates">

<t>A total of 2,048 code points, in the range U+D800-U+DFFF, are divided into two blocks called "high surrogates" and "low surrogates"; collectively the 2,048 code points are referred to as "surrogates".
Surrogates may only be used in Unicode texts encoded in UTF-16, where a high-surrogate/low-surrogate pair represents a code point greater than U+FFFF.</t>

<t>A surrogate which occurs in text encoded in any transformation format other than UTF-16 has no meaning and may cause malfunction in software that encounters it.
In particular, it is impossible to represent a surrogate in well-formed UTF-8.</t>
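
<t>The following Python sketch illustrates both points: how a high-surrogate/low-surrogate pair maps to a code point greater than U+FFFF, and how a lone surrogate is refused by a UTF-8 encoder. The helper name "combine_surrogates" is hypothetical; the refusal shown is the behavior of Python's built-in encoder, and other environments may behave differently.</t>

<sourcecode type="python"><![CDATA[
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high-surrogate/low-surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD83D, 0xDE09)))  # 0x1f609

lone = chr(0xDEAD)  # an unpaired low surrogate
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("a lone surrogate cannot be encoded as well-formed UTF-8")
]]></sourcecode>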

</section>

<section anchor="controls" title="Control Codes">

<t>Section 23.1 of <xref target="UNICODE"/> introduces the "Control Codes", for compatibility with legacy pre-Unicode standards. They comprise 65 code points in the ranges U+0000-U+001F ("C0 Controls") and U+0080-U+009F ("C1 Controls"), plus U+007F, "DEL".</t>

<section anchor="useful-controls" title="Useful Controls">
<t>The C0 Controls include newline (U+000A), carriage return (U+000D), and tab (U+0009); this document refers to these three characters as the "useful controls".</t>
</section>

<section anchor="legacy-controls" title="Legacy Controls">

<t>Aside from the useful controls, the control codes are mostly obsolete and generally lack interoperable semantics. This document uses the phrase "legacy controls" to describe control codes that are not useful controls.</t>


<t>Since the code points for C0 Controls include the 32 smallest integers including zero, they are likely to occur in data as a result of programming errors.</t>
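
<t>The classification of control codes described above can be sketched in Python as follows. The names "USEFUL_CONTROLS", "is_control", and "is_legacy_control" are hypothetical and used only for illustration.</t>

<sourcecode type="python"><![CDATA[
# Tab, newline, and carriage return.
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control(cp: int) -> bool:
    """True for the C0 Controls, DEL, and the C1 Controls."""
    return 0x00 <= cp <= 0x1F or 0x7F <= cp <= 0x9F


def is_legacy_control(cp: int) -> bool:
    """True for control codes that are not useful controls."""
    return is_control(cp) and cp not in USEFUL_CONTROLS


# All control codes lie below U+0100, so a small scan suffices.
print(sum(is_control(cp) for cp in range(0x100)))         # 65
print(sum(is_legacy_control(cp) for cp in range(0x100)))  # 62
]]></sourcecode>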

</section>

</section>

<section anchor="noncharacters" title="Noncharacters">

<t>Certain code points are classified as "noncharacters", and <xref target="UNICODE"/> asserts repeatedly that they are not designed or used for open interchange.</t>

<t>Code points are organized into 17 "planes", each containing 2<sup>16</sup> code points.
The last two code points in each plane are noncharacters: U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so on, up to U+10FFFE, U+10FFFF.</t>

<t>The code points in the range U+FDD0-U+FDEF are noncharacters.</t>
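
<t>Both groups of noncharacters can be tested with one small predicate. The Python sketch below is illustrative only; the name "is_noncharacter" is hypothetical. The last two code points of every plane are exactly those whose low 16 bits are FFFE or FFFF.</t>

<sourcecode type="python"><![CDATA[
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0-U+FDEF and the last two code points per plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# 32 code points in U+FDD0-U+FDEF plus 2 for each of the 17 planes.
print(sum(is_noncharacter(cp) for cp in range(0x110000)))  # 66
]]></sourcecode>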

</section>
</section>
</section>

<section anchor="unicode-definitions" title="Subsets Defined in the Unicode Standard">

<t>This section describes subsets of the code points that are defined in <xref target="UNICODE"/>.
Specifications can refer to these repertoires by the names "Unicode Code Points" and "Unicode Scalar Values".</t>

<section anchor="codepoints" title="Unicode Code Points">

<t>Definition D9 in section 3.4 of <xref target="UNICODE"/> defines the term "Unicode codespace" as "a range of integers from 0 to 10FFFF<sub>16</sub>".
Definition D10 defines the term "Code point" as "Any value in the Unicode codespace".</t>

<t>The "Unicode Code Points" subset can be expressed as an ABNF production:</t>

<sourcecode>
unicode-code-points =
   %x0-10FFFF
</sourcecode>

<t>This subset is notable for including all possible code points, including those of the problematic types discussed above. It is the default repertoire of JSON <xref target="RFC8259"/>.</t>

</section>

<section anchor="scalars" title="Unicode Scalar Values">

<t>Definition D76 in section 3.9 of <xref target="UNICODE"/> defines the term "Unicode scalar value" as "Any Unicode code point except high-surrogate and low-surrogate code points."</t>

<t>The "Unicode Scalar Values" subset can be expressed as an ABNF production:</t>

<sourcecode>
unicode-scalar-values =
   %x0-D7FF / %xE000-10FFFF  ; exclude surrogates
</sourcecode>

<t>This subset is the default character repertoire for I-JSON <xref target="RFC7493"/> and CBOR <xref target="RFC8949"/>, and has the advantage of excluding surrogates. However, it includes legacy controls and noncharacters.</t>
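
<t>As an informal illustration, the two subsets defined by the Unicode Standard can be written as Python predicates. The function names are hypothetical; the ranges are those of the ABNF productions in this section.</t>

<sourcecode type="python"><![CDATA[
def is_code_point(n: int) -> bool:
    """True for any value in the Unicode codespace."""
    return 0x0 <= n <= 0x10FFFF


def is_scalar_value(n: int) -> bool:
    """True for any code point that is not a surrogate."""
    return is_code_point(n) and not 0xD800 <= n <= 0xDFFF


# The codespace holds 1,114,112 code points; 2,048 are surrogates.
print(sum(is_scalar_value(n) for n in range(0x110000)))  # 1112064

# Scalar values still include legacy controls and noncharacters.
print(is_scalar_value(0x0000), is_scalar_value(0xFFFE))  # True True
]]></sourcecode>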

</section>
</section>

<section anchor="other-subsets" title="Other Subsets">

<t>This section describes other ways to specify subsets of the code points beyond those provided by the Unicode Standard itself.
Specifications can refer to these repertoires by the names "XML Characters" and "Useful Assignables".</t>

<section anchor="xml" title="XML Characters">

<t>The XML 1.0 Specification <xref target="XML"/>, in its grammar production labeled "Char", specifies a subset of Unicode code points that excludes surrogates, legacy C0 Controls, and the noncharacters U+FFFE and U+FFFF.</t>

<t>The "XML Characters" subset can be expressed as an ABNF production:</t>

<!--
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
-->
<sourcecode>
xml-chars =
   %x9 / %xA / %xD /   ; useful controls
   %x20-D7FF /         ; exclude surrogates
   %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
   %x10000-10FFFF
</sourcecode>

<t>While this subset does not exclude all the problematic code points, the C1 Controls are less likely than the C0 Controls to appear erroneously in data, and have not been observed to be a frequent source of problems. Also, the noncharacters greater in value than U+FFFF are rarely encountered.</t>
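
<t>A membership test for this subset might be sketched in Python as follows. The ranges are transcribed from the "Char" production of the XML 1.0 Specification; the names "XML_CHAR_RANGES" and "is_xml_char" are hypothetical.</t>

<sourcecode type="python"><![CDATA[
# Inclusive ranges from the XML 1.0 "Char" production.
XML_CHAR_RANGES = (
    (0x9, 0xA),           # tab and newline
    (0xD, 0xD),           # carriage return
    (0x20, 0xD7FF),       # stops before the surrogates
    (0xE000, 0xFFFD),     # stops before U+FFFE and U+FFFF
    (0x10000, 0x10FFFF),  # all supplementary planes
)


def is_xml_char(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)


print(is_xml_char(0x00), is_xml_char(0xFFFE))     # False False
# C1 Controls and higher-plane noncharacters are still admitted.
print(is_xml_char(0x89), is_xml_char(0x1FFFE))    # True True
]]></sourcecode>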

</section>

<section anchor="useful-assignables" title="Useful Assignables">

<t>For convenience, this document defines the "Useful Assignables" subset as the Unicode code points, excluding the legacy controls, surrogates, and noncharacters. This comprises all code points that are currently assigned, or might in future be assigned, to characters that are not legacy control codes, plus the useful controls.</t>

<t>Useful Assignables can be expressed as an ABNF production:</t>

<sourcecode>
useful-assignables =
   %x9 / %xA / %xD /               ; useful controls
   %x20-7E /                       ; exclude C1 Controls and DEL
   %xA0-D7FF /                     ; exclude surrogates
   %xE000-FDCF /                   ; exclude FDD0 nonchars
   %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
   %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
   %x30000-3FFFD / %x40000-4FFFD /
   %x50000-5FFFD / %x60000-6FFFD /
   %x70000-7FFFD / %x80000-8FFFD /
   %x90000-9FFFD / %xA0000-AFFFD /
   %xB0000-BFFFD / %xC0000-CFFFD /
   %xD0000-DFFFD / %xE0000-EFFFD /
   %xF0000-FFFFD / %x100000-10FFFD
</sourcecode>

<t>This subset excludes all code points whose types are identified as problematic above.</t>
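
<t>The same subset can be written down as a table of inclusive ranges. The Python sketch below is an illustration only, with hypothetical names; it also checks the size of the subset against the counts given earlier (2,048 surrogates, 66 noncharacters, and 62 legacy controls removed from 1,114,112 code points).</t>

<sourcecode type="python"><![CDATA[
USEFUL_ASSIGNABLE_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),  # useful controls
    (0x20, 0x7E),                         # stops before DEL and C1
    (0xA0, 0xD7FF),                       # stops before surrogates
    (0xE000, 0xFDCF),                     # stops before U+FDD0
    (0xFDF0, 0xFFFD),                     # stops before U+FFFE
]
# Planes 1 through 16, each without its last two code points.
USEFUL_ASSIGNABLE_RANGES += [
    (plane * 0x10000, plane * 0x10000 + 0xFFFD)
    for plane in range(1, 17)
]


def is_useful_assignable(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in USEFUL_ASSIGNABLE_RANGES)


total = sum(hi - lo + 1 for lo, hi in USEFUL_ASSIGNABLE_RANGES)
print(total)                               # 1111936
print(total == 0x110000 - 2048 - 66 - 62)  # True
]]></sourcecode>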

</section>
</section>

<section anchor="dealing" title="Dealing With Problematic Code Points">

<t>Noncharacters and legacy controls are unlikely to cause software failures, but they cannot usefully be displayed to humans, and can be used in attacks based on misleading human readers of text that displays them <xref target="TR36"/>.</t>

<t>Surrogate characters have been observed to cause software failures. The behavior of software which encounters them is unpredictable and differs in programming-language implementations, even between different API calls in the same language.</t>

<t>Section 3.9 of <xref target="UNICODE"/> makes it clear that a UTF-8 byte sequence which would map to a surrogate is ill-formed. Thus, in theory, if a specification requires that input data be encoded with UTF-8, implementors should never have to concern themselves with surrogates.</t>

<t>Unfortunately, industry experience teaches that problematic code points, including surrogates, can and do occur in program input where the source of input data is not controlled by the implementor. For example, the following is a legal JSON document:</t>

<sourcecode>{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}</sourcecode>

<t>The value of the "example" field contains the C0 Control NUL, the C1 Control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two escaped UTF-16 surrogate code points.  It is unlikely to be useful as the value of a text field. It cannot be serialized into well-formed UTF-8, but the behavior of libraries asked to parse the sample is unpredictable; some will silently parse this and generate an ill-formed UTF-8 string.</t>
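
<t>As one concrete data point, the sketch below feeds the sample to the "json" module of Python's standard library, with every escape written in the lowercase "\u" form that JSON requires. The outcome shown is that of this one parser and should not be assumed of others: it accepts the text, keeps the unpaired surrogate, and only the later attempt to encode the value as UTF-8 fails.</t>

<sourcecode type="python"><![CDATA[
import json

# Raw string, so the backslash escapes reach the JSON parser intact.
sample = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'

value = json.loads(sample)["example"]
print([f"U+{ord(ch):04X}" for ch in value])
# ['U+0000', 'U+0089', 'U+DEAD', 'U+7FFFF']

try:
    value.encode("utf-8")
except UnicodeEncodeError:
    print("the decoded value cannot be serialized as UTF-8")
]]></sourcecode>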

<t>Implementors who follow the guidance of <xref target="RFC9413"/>, "Maintaining Robust Protocols", will need to deal with problematic code points. A variety of options are reasonable. RFC 9413 recommends, by default, discarding ill-formed data silently without returning an error message, unless this is required by the specification; and further, that error messages, if specified, should be explicit. Section 3.2 of <xref target="UNICODE"/> recommends dealing with ill-formed byte sequences by signaling an error, or by replacing problematic code points with � (U+FFFD, REPLACEMENT CHARACTER).</t>

<t>The discussion of error-handling options in <xref target="RFC9413"/> is thorough and very helpful in choosing a strategy for dealing with problematic code points.</t>
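
<t>Two of the strategies mentioned above, replacement with U+FFFD and signaling an explicit error, might be sketched in Python as follows. All names are hypothetical, and the choice between the two functions is left to the specification that applies; this is an illustration, not a recommended implementation.</t>

<sourcecode type="python"><![CDATA[
REPLACEMENT = "\uFFFD"  # REPLACEMENT CHARACTER


def is_problematic(cp: int) -> bool:
    """True for legacy controls, surrogates, and noncharacters."""
    if cp in (0x09, 0x0A, 0x0D):
        return False                      # useful controls
    return (
        cp <= 0x1F or 0x7F <= cp <= 0x9F  # legacy controls
        or 0xD800 <= cp <= 0xDFFF         # surrogates
        or 0xFDD0 <= cp <= 0xFDEF         # noncharacters
        or (cp & 0xFFFE) == 0xFFFE        # last two per plane
    )


def replace_problematic(text: str) -> str:
    """Substitute U+FFFD for every problematic code point."""
    return "".join(
        REPLACEMENT if is_problematic(ord(ch)) else ch for ch in text
    )


def reject_problematic(text: str) -> str:
    """Return the text unchanged, or raise an explicit error."""
    for index, ch in enumerate(text):
        if is_problematic(ord(ch)):
            raise ValueError(
                f"problematic code point U+{ord(ch):04X} at index {index}"
            )
    return text


print(replace_problematic("a\x00b") == "a\uFFFDb")  # True
]]></sourcecode>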

</section>

<section anchor="restricting" title="Restricting Character Repertoires">

<t>Many IETF specifications rely on well-known data formats such as JSON, I-JSON, CBOR, YAML, and XML. These formats have default character repertoires. For example, JSON allows object member names and string values to include any Unicode code points, including all the problematic types.</t>

<t>It is unlikely that anyone specifying a new data format would choose to allow the Unicode Code Points character repertoire.</t>

<t>A protocol based on JSON can be made more robust and implementor-friendly by restricting the contents of object member names and string values to Useful Assignables (see <xref target="useful-assignables"/>).
An equivalent restriction is possible for other packaging formats such as I-JSON, XML, YAML, and CBOR.</t>

<t>Note that escaping techniques such as those in the JSON example above cannot be used to circumvent this sort of character-repertoire restriction, which applies to data content, not textual representation in packaging formats. If a specification restricted a JSON field value to the Useful Assignables, the example would remain a legal JSON Text but the data it represents would not constitute Useful Assignable code points.</t>
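
<t>The point that the restriction applies to decoded data, not to its escaped textual form, can be illustrated with a Python sketch that parses a JSON text first and only then checks every object member name and string value. The names "is_useful_assignable" and "uses_only_useful_assignables" are hypothetical, and the standard "json" module stands in for whatever parser an implementation uses.</t>

<sourcecode type="python"><![CDATA[
import json


def is_useful_assignable(cp: int) -> bool:
    if cp in (0x09, 0x0A, 0x0D):
        return True                       # useful controls
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return False                      # legacy controls
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # surrogates
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False                      # noncharacters
    return cp <= 0x10FFFF


def uses_only_useful_assignables(json_text: str) -> bool:
    """Check decoded member names and string values, recursively."""
    def ok(node) -> bool:
        if isinstance(node, str):
            return all(is_useful_assignable(ord(ch)) for ch in node)
        if isinstance(node, dict):
            return all(ok(k) and ok(v) for k, v in node.items())
        if isinstance(node, list):
            return all(ok(item) for item in node)
        return True  # numbers, true, false, null

    return ok(json.loads(json_text))


escaped = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'
print(uses_only_useful_assignables(escaped))              # False
print(uses_only_useful_assignables(r'{"a": "\u0041\n"}'))  # True
]]></sourcecode>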

</section>

<section anchor="iana-considerations" title="IANA Considerations">

<t>This document makes no requests of IANA.</t>

</section>

<section anchor="security-considerations" title="Security Considerations">

<t>Unicode Security Considerations <xref target="TR36"/> is a wide-ranging survey of the issues implementors should consider while writing software to process Unicode text.
Many of the exploits it discusses are aimed at deceiving human readers, but vulnerabilities involving issues such as surrogates and noncharacters are also covered, and in fact can contribute to human-deceiving exploits.</t>

<t>Note that the subsets specified in this document include a successively decreasing number of problematic code points (surrogates, legacy controls, and noncharacters), and thus should be less and less susceptible to vulnerabilities. The "Useful Assignables" subset (<xref target="useful-assignables"/>) excludes all of them.</t>

</section>

</middle>

<back>
<references title="Normative References">

<!--
<?rfc include="reference.RFC.2119.xml" ?>
<?rfc include="reference.RFC.8174.xml" ?>
-->

<reference anchor="UNICODE" target="http://www.unicode.org/versions/latest/">
<front>
<title abbrev="Unicode">The Unicode Standard</title>
<author><organization>The Unicode Consortium</organization><address /></author>
</front>
<annotation>Note that this reference is to the latest version of
Unicode, rather than to a specific release. It is not expected that
future changes in the Unicode Standard will affect the referenced
definitions.</annotation>
</reference>

<reference anchor="TR36" target="https://www.unicode.org/reports/tr36/">
<front>
<title abbrev="Unicode Security Considerations">Unicode Security Considerations</title>
<author><organization>The Unicode Consortium</organization><address /></author>
</front>
<annotation>Note that this reference is to the latest version of
this document, rather than to a specific release. It is not expected that
future updates will affect the referenced discussions.</annotation>
</reference>

<?rfc include="reference.RFC.5234.xml" ?>

</references>

<references title="Informative References">

<reference anchor="XML" target="http://www.w3.org/TR/2008/REC-xml-20081126/">
<front>
<title abbrev="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth Edition)</title>
<author fullname="Tim Bray" surname="Bray"><organization>Textuality and Netscape</organization></author>
<author fullname="Jean Paoli" surname="Paoli"><organization>Microsoft</organization></author>
<author fullname="C.M. Sperberg-McQueen" initials="C.M." surname="Sperberg-McQueen"><organization>W3C</organization></author>
<author fullname="Eve Maler" surname="Maler"><organization>Sun Microsystems, Inc.</organization></author>
<author fullname="François Yergeau" surname="Yergeau"></author>
<date year='2008' month='November' day='26'/>
</front>
<annotation>Note that this reference is to a specific release, based on a history of previous "Edition" releases having changed this production.</annotation>
</reference>

<?rfc include="reference.RFC.2277.xml" ?>
<?rfc include="reference.RFC.5137.xml" ?>
<?rfc include="reference.RFC.8259.xml" ?>
<?rfc include="reference.RFC.7493.xml" ?>
<?rfc include="reference.RFC.8949.xml" ?>
<?rfc include="reference.RFC.9413.xml" ?>

</references>

<section numbered="false" anchor="acknowledgements" title="Acknowledgements">

<t>Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata Report against RFC 8259, The JavaScript Object Notation, noting frequent references to "Unicode characters", when in fact the RFC formally specifies the use of Unicode Code Points.</t>
<t>Thanks also to Asmus Freytag for careful review and many constructive suggestions aimed at making the language more consistent with the structure of the Unicode Standard.</t>
<t>Thanks also to James Manger for the correctness of the ABNF and JSON samples.</t>

</section>

</back>
</rfc>
