<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc []>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>

<rfc xmlns:xi="http://www.w3.org/2001/XInclude"
ipr="trust200902"
docName="draft-bray-unichars-00"
category="std" consensus="true"
submissionType="IETF" tocInclude="true"
sortRefs="true"
symRefs="true"
version="3">

<front>
<title abbrev="Specifying Unicode">Specifying Unicode Character Repertoires in RFCs</title>
<seriesInfo name="Internet-Draft" value="draft-bray-unichars-00"/>

<author initials="T." surname="Bray" fullname="Tim Bray">
<organization>Textuality Services</organization>
<address>
<email>tbray@textuality.com</email>
</address>
</author>

<author initials="P." surname="Hoffman" fullname="Paul Hoffman">
<organization>ICANN</organization>
<address>
<email>paul.hoffman@icann.org</email>
</address>
</author>

<date/>
<keyword>Internet-Draft</keyword>

<abstract>
<t>This document describes how to specify the use of Unicode characters in a helpful and unambiguous way.</t>
</abstract>
</front>

<middle>

<section anchor="intro" title="Introduction">

<t>When a protocol or data format has text fields, that text is normally composed of Unicode <xref target="UNICODE"/> characters, to support use by speakers of all the world's languages.
Unfortunately, the Unicode Standard does not define the term "Unicode character" in a way that is useful for technical specifications.</t>

<t>Protocols and data formats <bcp14>SHOULD</bcp14> describe exactly which selection of the available Unicode characters are to be used.
This document uses the term "character repertoire" to describe such a subset of the Unicode characters.
Authors should have a way to concisely and exactly reference a stable specification that identifies a protocol or data format's character repertoire.</t>

<t>Several subsets of the Unicode characters have been popular choices as character repertoires in code and in specifications.
This document describes and names them, and suggests one new one.
The goal of this document is to provide a convenient target for cross-reference from other specifications that wish to use one of these character repertoires.</t>

<section anchor="terminology" title="Terminology">

<t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>", "<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they appear in all capitals, as shown here.</t>
</section>

<section anchor="notation" title="Notation">

<t>In this document, the numeric values assigned to Unicode characters are provided in hexadecimal. In the text, Unicode’s standard "U+" notation <xref target="RFC5137"/> is used. For example, "A", decimal 65, would be expressed as U+0041, and "😉" (Winking Face), decimal 128,521, would be U+1F609.</t>
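
<t>As an informal illustration of the two notations, the following Python sketch formats an integer as a "U+" value and as an ABNF hexadecimal terminal. The function names u_plus and abnf_value are hypothetical and are not defined by this document or by any referenced specification.</t>

<sourcecode type="python">
def u_plus(cp):
    """Format an integer code point in U+ notation (4 or more hex digits)."""
    return "U+{:04X}".format(cp)


def abnf_value(cp):
    """Format the same integer as an ABNF hexadecimal terminal."""
    return "%x{:X}".format(cp)


print(u_plus(65))         # U+0041
print(u_plus(ord("😉")))  # U+1F609
print(abnf_value(65))     # %x41
</sourcecode>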

<t>Certain groups of numeric values described in <xref target="unicode-definitions"/> and <xref target="other-definitions"/> are given in ABNF <xref target="RFC5234"/>.
In ABNF, the hexadecimal values for characters are preceded by "%x" rather than "U+".</t>

<t>All the numeric ranges in this document are inclusive.</t>

</section>

</section>

<section anchor="char-concepts" title="Character Concepts">

<t>The Unicode Standard's definition of "Unicode character" is conceptual.
However, each Unicode character is assigned an integer identifier in the range U+0000 through U+10FFFF, and these numbers are used to specify the allowed repertoires of Unicode characters in code and specifications.</t>

<t>The numbers assigned to Unicode characters are called “code points”; there are potentially 1,114,112 of them.
As of 2023, fewer than 150,000 characters have been assigned code points.
While the inclusion of unassigned code points in text data is undesirable, it is difficult to specify that it should be avoided, because unassigned code points regularly become assigned as new characters are added to Unicode.
Fortunately, the occurrence of unassigned code points in texts is generally unlikely to cause software to malfunction.</t>

<section anchor="transformation" title="Transformation Formats">

<t>Unicode describes a variety of "transformation formats", ways to encode code points in bytes of computer memory.
A survey of transformation formats is beyond the scope of this document.
However, it is useful to note that the "UTF-16" transformation format represents each code point with one or two 16-bit chunks, and the "UTF-8" transformation format uses variable-length byte sequences.</t>

<t>The UTF-8 transformation format is very widely used for interoperable data formats such as JSON, YAML, and XML.</t>
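
<t>As a small illustration, the following Python sketch encodes a single code point, U+1F609, in both transformation formats using the standard str.encode method. It is intended only to show the difference in shape between the two formats, not to survey them.</t>

<sourcecode type="python">
wink = chr(0x1F609)  # Winking Face

utf16 = wink.encode("utf-16-be")  # two 16-bit chunks (a surrogate pair)
utf8 = wink.encode("utf-8")       # a four-byte sequence

print(utf16.hex())  # d83dde09
print(utf8.hex())   # f09f9889
</sourcecode>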

</section>

<section anchor="problematic" title="Problematic Code Points">

<t>Some code points are assigned to constructs which are not actually characters or whose value as Unicode characters is questionable.</t>

<section anchor="surrogates" title="Surrogates">

<t>A total of 2,048 code points, in the range U+D800-U+DFFF, are divided into two blocks called "high surrogates" and "low surrogates"; collectively the 2,048 code points are referred to as "surrogates".
Surrogates can be used in high-surrogate/low-surrogate pairs to represent code points greater than 65,535 in the UTF-16 transformation format.</t>

<t>A surrogate which occurs as a singleton, or which is in an improperly-composed pair, or which occurs in UTF-8-encoded text, has no meaning and may cause malfunctions in software which encounters it.</t>
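
<t>One possible way to detect surrogates is sketched below in Python, whose strings are sequences of code points and can therefore hold a lone surrogate. The function name has_surrogate is hypothetical. The sketch also shows that Python's strict UTF-8 encoder refuses to encode such a string.</t>

<sourcecode type="python">
def has_surrogate(text):
    """Return True if any code point in text lies in U+D800-U+DFFF."""
    return any(ord(ch) in range(0xD800, 0xE000) for ch in text)


lone = "abc" + chr(0xD800)   # Python strings may hold lone surrogates

print(has_surrogate("abc"))  # False
print(has_surrogate(lone))   # True

try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 encoding rejected the lone surrogate")
</sourcecode>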

</section>

<section anchor="controls" title="Control Codes">

<t>Section 23.1 in chapter 23 of <xref target="UNICODE"/>, "Special Areas and Format Characters", introduces the concept of "Control Codes". They comprise 65 code points in the ranges U+0000-U+001F ("C0 Controls") and U+0080-U+009F ("C1 Controls"), plus U+007F, "DEL".</t>

<t>The C0 controls include the newline (U+000A), carriage return (U+000D), and Tab (U+0009); this document refers to these three characters as the "useful controls".
Aside from these, the control codes are mostly obsolete and generally lack interoperable semantics.
This document uses the phrase "useless controls" to describe control codes that are not useful controls.</t>

<t>Since the C0 controls comprise the 32 smallest code points, including zero, they are likely to occur in data as a result of programming errors.</t>
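
<t>The classification above can be sketched in Python as follows. The names USEFUL_CONTROLS, is_control, and is_useless_control are hypothetical and chosen for illustration only.</t>

<sourcecode type="python">
USEFUL_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, newline, carriage return


def is_control(cp):
    """True for the 65 control codes: C0, DEL, and C1."""
    return cp in range(0x00, 0x20) or cp == 0x7F or cp in range(0x80, 0xA0)


def is_useless_control(cp):
    """True for control codes other than the three useful controls."""
    return is_control(cp) and cp not in USEFUL_CONTROLS


# All control codes lie below U+0100, so a short scan suffices.
print(sum(is_control(cp) for cp in range(0x100)))          # 65
print(sum(is_useless_control(cp) for cp in range(0x100)))  # 62
</sourcecode>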

</section>

<section anchor="noncharacters" title="Noncharacters">

<t>Certain code points are permanently reserved by <xref target="UNICODE"/> for internal use and are referred to as "noncharacters".</t>

<t>Code points are organized into 17 "planes", each containing 2<sup>16</sup> code points.
The last two code points in each plane are noncharacters: U+00FFFE, U+00FFFF, U+01FFFE, U+01FFFF, U+02FFFE, U+02FFFF, and so on, up to U+10FFFE, U+10FFFF.</t>

<t>The code points in the range U+FDD0 to U+FDEF are noncharacters.</t>
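
<t>Taken together, the two rules above identify 66 noncharacters: 32 in the range U+FDD0 to U+FDEF and 2 in each of the 17 planes. A minimal Python sketch, using the hypothetical function name is_noncharacter, is:</t>

<sourcecode type="python">
def is_noncharacter(cp):
    """True for U+FDD0-U+FDEF and the last two code points of each plane."""
    in_fdd0_block = cp in range(0xFDD0, 0xFDF0)
    last_two_in_plane = cp % 0x10000 in (0xFFFE, 0xFFFF)
    return in_fdd0_block or last_two_in_plane


nonchars = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(nonchars))  # 66
</sourcecode>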

</section>
</section>
</section>

<section anchor="unicode-definitions" title="Subsets Defined in the Unicode Standard">

<t>This section describes popular subsets of the code points that are defined in <xref target="UNICODE"/>.
Specifications can refer to these repertoires by the names "Unicode Code Points" and "Unicode Scalar Values".</t>

<section anchor="codepoints" title="Unicode Code Points">

<t>Definition D9 in chapter 3 of <xref target="UNICODE"/>, "Conformance", defines the term "Unicode codespace" as "a range of integers from 0 to 10FFFF<sub>16</sub>".
Definition D10 defines the term "Code point" as "Any value in the Unicode codespace".</t>

<t>The "Unicode Code Points" subset can be expressed as an ABNF production:</t>

<sourcecode>
unicode-code-points =
   %x0-10FFFF
</sourcecode>

<t>This subset has the advantage of including all possible code points. It has been adopted by JSON <xref target="RFC8259"/>.</t>

<t>However, this subset includes all of the problematic code points listed above, and implementors must be prepared to deal with meaningless code points such as those assigned to surrogates, useless controls, and noncharacters.</t>

</section>

<section anchor="scalars" title="Unicode Scalar Values">

<t>Definition D76 in chapter 3 of <xref target="UNICODE"/> defines the term "Unicode scalar value" as "Any Unicode code point except high-surrogate and low-surrogate code points."</t>

<t>The "Unicode Scalar Values" subset can be expressed as an ABNF production:</t>

<sourcecode>
unicode-scalar-values =
   %x0-D7FF / %xE000-10FFFF  ; exclude surrogates
</sourcecode>

<t>This subset has the advantage of excluding surrogates, which can never add any value and have the potential to cause problems.
This subset has been adopted by I-JSON <xref target="RFC7493"/>.</t>

<t>However, this subset still includes the useless controls and the noncharacters.</t>
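
<t>The two subsets in this section can be sketched as Python membership tests. The function names is_code_point and is_scalar_value are hypothetical. Note that a noncharacter such as U+FFFE is still a Unicode scalar value.</t>

<sourcecode type="python">
def is_code_point(cp):
    """True for any integer in the Unicode codespace, 0 to 10FFFF hex."""
    return cp in range(0x110000)


def is_scalar_value(cp):
    """True for any code point that is not a surrogate."""
    return is_code_point(cp) and cp not in range(0xD800, 0xE000)


print(is_code_point(0xD800))    # True
print(is_scalar_value(0xD800))  # False
print(is_scalar_value(0xFFFE))  # True
print(0x110000 - 0x800)         # 1112064 scalar values in total
</sourcecode>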

</section>
</section>

<section anchor="other-definitions" title="Other Definitions">

<t>This section lists other ways to specify subsets of the code points beyond those provided by the Unicode Standard itself.
These subsets may serve as more appropriate character repertoires for some protocols and data formats than those in <xref target="unicode-definitions"/>, depending on their needs.
Specifications can refer to these repertoires by the names "XML Characters" and "Basic Unicode Characters".</t>

<section anchor="xml" title="XML Characters">

<t>The XML 1.0 Specification <xref target="XML"/>, in its grammar production labeled "Char", specifies a range of Unicode code points that excludes surrogates, useless C0 control codes, and the noncharacters U+FFFE and U+FFFF.</t>

<t>The "XML Characters" subset can be expressed as an ABNF production:</t>

<!--
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
-->
<sourcecode>
xml-chars =
   %x9 / %xA / %xD /   ; useful controls
   %x20-D7FF /         ; exclude surrogates
   %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
   %x10000-10FFFF
</sourcecode>

<t>While this subset does not exclude all the problematic code points, the C1 controls are less likely than the C0 controls to appear erroneously in data, and have not been observed to be a frequent source of problems. Also, the noncharacters greater in value than U+FFFF are rarely encountered.</t>

<t>This subset may be especially appropriate for data formats which may be represented in either JSON or XML.</t>
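
<t>A membership test for this subset might look like the following Python sketch, which follows the ranges of the XML 1.0 "Char" production. The names XML_RANGES and is_xml_char are hypothetical.</t>

<sourcecode type="python">
# Python ranges exclude their upper bound, so each stop is one past the end.
XML_RANGES = (
    range(0x9, 0xB),           # tab and newline
    range(0xD, 0xE),           # carriage return
    range(0x20, 0xD800),       # up to, but not including, the surrogates
    range(0xE000, 0xFFFE),     # stops before U+FFFE and U+FFFF
    range(0x10000, 0x110000),  # all code points above U+FFFF
)


def is_xml_char(cp):
    """True if cp falls in one of the XML 1.0 Char ranges."""
    return any(cp in r for r in XML_RANGES)


print(is_xml_char(0x0))      # False: useless C0 control
print(is_xml_char(0x85))     # True: C1 controls are not excluded
print(is_xml_char(0xFFFE))   # False: excluded noncharacter
print(is_xml_char(0x1FFFE))  # True: noncharacter above U+FFFF
</sourcecode>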

</section>

<section anchor="basic-characters" title="Basic Unicode Characters">

<t>For convenience, this document defines the "Basic Unicode Characters" subset as the Unicode code points, excluding the useless controls, surrogates, and noncharacters.</t>

<t>The "Basic Unicode Characters" subset can be expressed as an ABNF production:</t>

<sourcecode>
basic-unichars =
   %x9 / %xA / %xD /               ; useful controls
   %x20-7E /                       ; exclude C1 controls and DEL
   %xA0-D7FF /                     ; exclude surrogates
   %xE000-FDCF /                   ; exclude FDD0 nonchars
   %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
   %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
   %x30000-3FFFD / %x40000-4FFFD /
   %x50000-5FFFD / %x60000-6FFFD /
   %x70000-7FFFD / %x80000-8FFFD /
   %x90000-9FFFD / %xA0000-AFFFD /
   %xB0000-BFFFD / %xC0000-CFFFD /
   %xD0000-DFFFD / %xE0000-EFFFD /
   %xF0000-FFFFD / %x100000-10FFFD
</sourcecode>
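
<t>As a cross-check on the definition of this subset, the following Python sketch tests membership by exclusion rather than by enumerating ranges. The function name is_basic_unichar is hypothetical. Counting the members gives 1,111,936, which is the 1,114,112 code points less 2,048 surrogates, 66 noncharacters, and 62 useless controls.</t>

<sourcecode type="python">
def is_basic_unichar(cp):
    """True if cp is a code point that is not a surrogate, a
    noncharacter, or a useless control."""
    if cp not in range(0x110000):
        return False  # outside the Unicode codespace
    if cp in range(0xD800, 0xE000):
        return False  # surrogates
    if cp in range(0xFDD0, 0xFDF0) or cp % 0x10000 in (0xFFFE, 0xFFFF):
        return False  # noncharacters
    if cp in (0x09, 0x0A, 0x0D):
        return True   # useful controls
    # The remaining C0 controls, DEL, and the C1 controls are useless.
    return not (cp in range(0x20) or cp in range(0x7F, 0xA0))


print(sum(is_basic_unichar(cp) for cp in range(0x110000)))  # 1111936
</sourcecode>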

</section>
</section>

<section anchor="iana-considerations" title="IANA Considerations">

<t>This document makes no requests of IANA.</t>

</section>

<section anchor="security-considerations" title="Security Considerations">

<t>Unicode Security Considerations <xref target="TR36"/> is a wide-ranging survey of the issues implementors should consider while writing software to process Unicode text.
Many of the exploits it discusses are aimed at deceiving human readers, but vulnerabilities involving issues such as surrogates and noncharacters are also covered, and in fact can contribute to human-deceiving exploits.</t>

<t>Note that the Unicode-character subsets specified in this document include a successively decreasing number of surrogates and noncharacters, and thus should be less and less susceptible to vulnerabilities. The <xref target="basic-characters"/> subset, "Basic Unicode Characters", excludes all of them.</t>

</section>

</middle>

<back>
<references title="Normative References">

<?rfc include="reference.RFC.2119.xml" ?>
<?rfc include="reference.RFC.8174.xml" ?>

<reference anchor="XML" target="http://www.w3.org/TR/2008/REC-xml-20081126/">
<front>
<title abbrev="XML 1.0">Extensible Markup Language (XML) 1.0 (Fifth Edition)</title>
<author fullname="Tim Bray" surname="Bray"><organization>Textuality and Netscape</organization></author>
<author fullname="Jean Paoli" surname="Paoli"><organization>Microsoft</organization></author>
<author fullname="C.M. Sperberg-McQueen" initials="C.M." surname="Sperberg-McQueen"><organization>W3C</organization></author>
<author fullname="Eve Maler" surname="Maler"><organization>Sun Microsystems, Inc.</organization></author>
<author fullname="François Yergeau" surname="Yergeau"></author>
<date year='2008' month='November' day='26'/>
</front>
<annotation>Note that this reference is to a specific release, based on a history of previous "Edition" releases having changed this production.</annotation>
</reference>

<reference anchor="UNICODE" target="http://www.unicode.org/versions/latest/">
<front>
<title abbrev="Unicode">The Unicode Standard</title>
<author><organization>The Unicode Consortium</organization><address /></author>
</front>
<annotation>Note that this reference is to the latest version of
Unicode, rather than to a specific release. It is not expected that
future changes in the Unicode Standard will affect the referenced
definitions.</annotation>
</reference>

<reference anchor="TR36" target="https://www.unicode.org/reports/tr36/">
<front>
<title abbrev="Unicode Security Considerations">Unicode Security Considerations</title>
<author><organization>The Unicode Consortium</organization><address /></author>
</front>
<annotation>Note that this reference is to the latest version of
this document, rather than to a specific release. It is not expected that
future updates will affect the referenced discussions.</annotation>
</reference>

</references>

<references title="Informative References">

<?rfc include="reference.RFC.5137.xml" ?>
<?rfc include="reference.RFC.5234.xml" ?>
<?rfc include="reference.RFC.8259.xml" ?>
<?rfc include="reference.RFC.7493.xml" ?>

</references>

<section numbered="false" anchor="acknowledgements" title="Acknowledgements">

<t>Thanks are due to Guillaume Fortin-Debigaré, who filed an Errata Report against RFC8259, The JavaScript Object Notation, noting frequent references to "Unicode characters", when in fact the RFC formally specifies the use of Unicode code points.</t>

</section>

</back>
</rfc>
