<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="lib/rfc2629.xslt"?>
<?rfc toc="yes" ?>
<?rfc symrefs="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc compact="yes"?>
<?rfc subcompact="no" ?>
<?rfc linkmailto="no" ?>
<?rfc editing="no" ?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc rfcedstyle="yes"?>
<?rfc-ext allow-markup-in-artwork="yes" ?>
<?rfc-ext include-index="no" ?>

<rfc ipr="trust200902"
     category="exp"
     submissionType="IETF"
     docName="draft-thierry-bulk-05">
  <front>
    <title abbrev="BULK1">Binary Uniform Language Kit 1.0</title>

    <author initials="P." surname="Thierry" fullname="Pierre Thierry">
      <organization>Comonad Dev</organization>
      <address>
        <email>pierre@comonad.dev</email>
      </address>
    </author>

    <date day="20" month="10" year="2025" />
    <keyword>binary</keyword>

    <abstract>
      <t>
        This specification describes a uniform, decentrally extensible and efficient format for
        data serialization.
      </t>
    </abstract>

  </front>

  <middle>
    <section anchor="intro" title="Introduction">
      <section title="Rationale">
        <t>
          This specification aims at finding an original trade-off between uniformity, generality,
          extensibility, decentralization, compactness and processing speed for a data format. It is
          our opinion that every widely used existing format occupies a different position than this
          one in the solution space for formats, that none is better on all axes, and that this one
          is the current best on several axes, hence this new design. It is also our opinion that
          some of those existing formats constitute an optimal solution for their specific use case,
          either in an absolute sense, or at least at the time of their design. But the ever-changing
          field of IT now faces new challenges that call for a new approach.
        </t>
	<t>
	  In particular, whereas the previous trend for Internet and Web standards and programming
	  tools has been to create human-readable syntaxes for data and protocols, the advent of
	  technologies like <xref target="protobuf">protocol buffers</xref>, <xref
	  target="Thrift">Thrift</xref>, the various binary serializations for JSON like <xref
	  target="Avro">Avro</xref> or <xref target="Smile">Smile</xref>, or the binary <xref
	  target="HTTP2">HTTP/2</xref> seems to indicate that the time is ripe for a generalized use
	  of binary, reserved until now for the low-level protocols. The lessons about flexibility
	  learnt in the previous switch from binary to plain text can now be applied to efficient
	  binary syntaxes.
	</t>
	<section title="Definitions">
	  <t>
	    By uniformity, we mean the property of a syntax that can be parsed even by an
	    application that doesn't understand the semantics of every part of the processed
	    data. Of course, almost all syntaxes that feature uniformity contain a limited number
	    of non-uniform elements. Also, uniformity really only has value in the face of
	    extension, as a fixed syntax doesn't need uniformity (it only makes the implementation
	    simpler).
	  </t>
	  <t>
	    Almost all extensible syntaxes have their extensible part uniform to a great degree. In
	    this specification, uniformity is hence evaluated on two criteria: first, the number of
	    non-uniform elements (and, incidentally, their diversity), second, the fact that the
	    uniformity of the extensible part is not a limitation to the users (i.e. that the
	    temptation to extend the format in a non-uniform way is as absent as possible).
	  </t>
	  <t>
	    A good counter-example is found in most programming languages. Adding a new branching
	    construct cannot be done in a terse way without modifying the underlying
	    implementation. Such a construct either cannot be defined by user code (because of
	    evaluation rules) or can in a terribly verbose and inconvenient way (with lots of
	    boilerplate code). Notable exceptions to this limitation of programming languages are
	    Lisp, Haskell and stack programming languages.
	  </t>
	  <t>
	    On the other hand, a stack programming language is the canonical example of a
	    non-uniform language. Each operator takes a number of operands from the stack. Not
	    knowing the arity of an operator makes it impossible to continue parsing, even when its
	    evaluation was optional to the final processing. In the design space, stack programming
	    languages completely sacrifice uniformity to achieve one of the highest combinations of
	    extensibility, compactness and speed of processing.
	  </t>
	  <t>
	    By generality, we mean the ability of a syntax to lend itself to describe any kind of
	    data with a reasonable (or better yet, high) level of compactness and simplicity. For
	    example, although both arrays and linked lists could be considered very general as they
	    are both able to store any kind of data, they do so at the respective costs of
	    complexity (arrays need the embedding of data structure in the data or in the
	    processing logic) and size (in-memory linked lists can waste as much as half or two
	    thirds of the space for the overhead of the data structure).
	  </t>
	  <t>
	    By decentralization, we mean the ability to extend the syntax in a way that avoids
	    naming collisions without the use of a central registry. Note that the DNS, as we use
	    it, is NOT decentralized in this sense, but distributed, as it cannot work without its
	    root servers and prior knowledge of their location.
	  </t>
	</section>
	<section title="State of the art">
	  <t>
	    Uniformity, generality and extensibility are usually highly valued traits in format
	    design. Programming languages obviously feature them foremost, although their
	    generality usually stops at what they are supposed to express: procedures. Most of them
	    are ill-suited to represent arbitrary data, but notable exceptions include Lisp (where
	    "code is data") and JavaScript, from which a subset has been extracted to exchange
	    data, JSON, which has seen a tremendous success for this purpose. JSON may lack in
	    generality and compactness, but its design makes its parsing really straightforward and
	    fast. All of them, though, lack decentralization. Some of them make it possible to
	    extend them in a distributed way if some discipline is followed (for example, by naming
	    modules after domain names), but the discipline is not mandatory (and even with domain
	    names, a change of ownership makes name collisions possible).
	  </t>
	  <t>
	    The SGML/XML family of formats also features uniformity, generality and extensibility
	    and actually fare much better than programming languages on the three fronts. XML
	    namespaces also make XML naming distributed and there have been attempts at making it
	    compact (e.g. EXI from W3C, Fast Infoset from ISO/ITU or EBML).
	  </t>
	  <t>
	    All the previously cited formats clearly lack compactness, although just applying
	    standard compression techniques would sacrifice only very little processing time to
	    gain huge size reductions on most of their intended use cases, but compression may not
	    address their ineffectiveness at storing arbitrary bytes.
	  </t>
	  <t>
	    So-called binary formats pretty much exhibit the opposite trade-offs. Most of them are
	    not uniform to achieve better compactness. Some are specifically designed for a great
	    generality, but many lack extensibility. When they are extensible, it's never in a
	    decentralized way, again for reasons that have to do with compactness. They are usually
	    extremely fast to parse.
	  </t>
	  <t>
	    Actually, many binary formats are not so much formats as they are format frameworks,
	    and exclude extensibility by design. For each use case, an IDL compiler creates a brand
	    new format that is essentially incompatible with all other formats created by the same
	    compiler (EBML specifically cites this property among its own disadvantages). If the
	    IDL compiler and framework are correctly designed, such a format usually represents an
	    optimum in compactness and speed of processing, as the compiler can also automatically
	    generate an ad-hoc optimized parser.
	  </t>
	  <t>
	    Where extensibility has been planned in existing formats, it often doesn't get used
	    that much or at all because of the complications around it. Many binary formats include
	    reserved values meant to extend them to future uses, like the <spanx
	    style="verb">CM</spanx> field in the ZIP format. A case like this one faces a
	    chicken-and-egg problem: if you don't write and get a specification officially adopted,
	    implementations might not want to include your extension, but if your extension is
	    purely theoretical and hasn't been tested in the wild, you may face resistance to get
	    it officially adopted. This is probably why even though most compression formats
	    include the ability to later encode other compression methods, each new compression
	    method usually comes with its own format.
	  </t>
	  <t>
	    When extensions are managed with any form of registry, another issue is that you
	    usually need to reserve a large set of values for free experimentation, and once an
	    extension gains any traction while in experimentation, its authors face the difficulty
	    of switching all existing implementations to the definitive values they'll get. And how
	    experimenters choose their temporary values makes them vulnerable to conflicts with
	    others.
	  </t>
	</section>
      </section>
      <section title="Format overview">
	<t>
	  A BULK stream is a stream of 8-bit bytes, in big-endian order. Parsing a BULK stream
	  yields a sequence of expressions, which can be either atoms or forms, which are sequences
	  of expressions. The syntax of forms is entirely uniform, without a single exception: a
	  starting byte marker, a sequence of expressions and an ending byte marker. Among atoms,
	  only nil (the null byte) and arrays have a special syntax, for efficiency purposes. Even
	  booleans and floating-point numbers follow the uniform syntax that every other expression
	  follows.
	</t>
	<t>
	  Non-uniform atoms start with a marker byte, followed by a static or dynamic number of
	  bytes, depending on the type.
	</t>
	<t>
	  Any other atom is a reference, which consists of a namespace marker (in almost all cases,
	  a single byte) followed by an identifier within this namespace (a single byte). All in
	  all, only a small sacrifice is made in compactness for the benefit of a very simple
	  syntax: apart from nil and small integers, nothing is smaller than 2 bytes, and as most
	  forms involve a reference followed by some content, a form is usually 4 bytes + its
	  content.
	</t>
	<t>
	  A namespace marker in a BULK stream is associated to a namespace identified by some
	  identifier guaranteed to be unique without coordination (like a UUID or a cryptographic
	  hash), thus ensuring decentralized extensibility. The stream can be processed even if the
	  application doesn't recognize the namespace. Parsing remains possible thanks to the
	  uniform syntax.
	</t>
	<t>
	  Combining BULK namespaces, BULK streams and even other formats doesn't need any
	  content transformation to work. Here are some examples:
	  <list style="symbols">
	    <t>
	      The content of a BULK stream, enclosed in list starting and ending byte markers,
	      constitutes a valid BULK expression. Thus BULK streams can be packed or annotated
	      within a BULK stream without modification. Annotation use cases include adding
	      metadata or a cryptographic signature.
	    </t>
	    <t>
	      A BULK format could specify in its syntax the place for an expression holding
	      metadata. Whether the specification provides its own metadata forms or not, an
	      application could use a BULK serialization for MARC, TEI Header, XML or RDF for this
	      metadata expression. The vocabulary selected would be univocally expressed by the
	      namespace and every vocabulary would be parsed by the same mechanisms.
	    </t>
	    <t>
	      Whenever content must be stored as-is instead of serialized, or a highly-optimized
	      ad hoc serialization exists for some data, anything can always be stored within an
	      array. Arrays can contain arbitrary bytes and there is no limit to their size.
	    </t>
	  </list>
	</t>
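	<t>
	  As a non-normative illustration of the first example above, the following Python sketch
	  packs an existing BULK stream inside another one by enclosing its bytes, unchanged,
	  between the starting and ending marker bytes. The function name <spanx
	  style="verb">wrap_stream</spanx> is hypothetical and only serves this example.
	</t>
	<figure><artwork><![CDATA[
```python
START, END = 0x01, 0x02  # the "(" and ")" marker bytes


def wrap_stream(stream: bytes) -> bytes:
    """Enclose a whole BULK stream in a form, leaving its bytes untouched."""
    return bytes([START]) + stream + bytes([END])


# A stream made of two nil atoms, packed inside a stream that starts with nil.
inner = bytes([0x00, 0x00])
outer = bytes([0x00]) + wrap_stream(inner)
```
]]></artwork></figure>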
	<t>
	  Furthermore, BULK expressions can be evaluated. Most expressions evaluate to themselves,
	  but some evaluate by default to the result of a pure function call, making it possible to
	  serialize data in an even more compact form, by eliminating boilerplate data and repeated
	  patterns.
	</t>
      </section>
      <section title="Conventions and Terminology">
        <t>
          The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD
          NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as
          described in <xref target="RFC2119">RFC 2119</xref>.
        </t>
        <t>
          Literal numerical values are provided in decimal or hexadecimal as appropriate.
          Hexadecimal literals are prefixed with <spanx style="verb">0x</spanx> to distinguish them
          from decimal literals.
        </t>
	<t>
	   The text notation of the BULK stream uses mnemonics for some byte sequences. Mnemonics
	   are series of characters, excluding all capital letters and white space, like <spanx
	   style="verb">this-is-one-mnemonic</spanx> or <spanx
	   style="verb">what-the-%§!?#-is-that?</spanx>. They are always separated by white
	   space. Outside the use of mnemonics, a sequence of bytes (of one or more bytes) can be
	   represented by its hexadecimal value as an unsigned integer prefixed by <spanx
	   style="verb">0x</spanx> (e.g. <spanx style="verb">0x3F</spanx> or <spanx
	   style="verb">0x3A0B770F</spanx>). Such a sequence of bytes can include dashes to make it
	   more readable (e.g. <spanx
	   style="verb">0xDDA37D36-85E6-4E6D-9B51-959E1CCE366C</spanx>). Some types in this
	   specification define a special syntax for their representation in the text notation.
	</t>
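	<t>
	  The hexadecimal part of the text notation can be read mechanically. The Python sketch
	  below is a non-normative illustration that turns such a literal into the bytes it
	  denotes. The function name is hypothetical, and rejecting a literal with an odd number
	  of digits is an assumption of this sketch, as this document does not cover that case.
	</t>
	<figure><artwork><![CDATA[
```python
def hex_literal_to_bytes(literal: str) -> bytes:
    """Convert a text-notation literal such as 0x3A0B770F into bytes."""
    if not literal.startswith("0x"):
        raise ValueError("not a hexadecimal literal: " + literal)
    # Dashes are only there for readability and carry no value.
    digits = literal[2:].replace("-", "")
    # bytes.fromhex raises ValueError on an odd number of digits.
    return bytes.fromhex(digits)


uuid_bytes = hex_literal_to_bytes("0xDDA37D36-85E6-4E6D-9B51-959E1CCE366C")
```
]]></artwork></figure>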
	<t>
	  In the grammar, a shape is a pattern of bytes, following the rules of the text notation
	  for a BULK stream. Apart from mnemonics and fixed sequences of bytes, a shape can
	  contain:
	  <list style="symbols">
	    <t>an arbitrary sequence of a fixed number of bytes, represented by its size, i.e. a
	    number of bytes in decimal immediately followed by a B uppercase letter (e.g. <spanx
	    style="verb">4B</spanx>)</t>
	    <t>a typed sequence of bytes, represented by the name of its type, a capitalized word
	    (e.g.  <spanx style="verb">Foo</spanx>); this means a sequence of bytes whose specific
	    yield (cf. <xref target="parsing"/>) has this type</t>
	    <t>a named sequence of bytes (of zero or more bytes), represented by a series of any
	    character excluding '{}' between '{' and '}' (e.g. <spanx style="verb">{quux}</spanx>);
	    a named sequence can be typed or sized, in which case it is immediately followed by ':'
	    and a type or size (e.g. <spanx style="verb">{quux}:Bar</spanx> or <spanx
	    style="verb">{quux}:12B</spanx>)</t>
	  </list>
	</t>
	<t>
	  When an entire shape describes the byte sequence of an atom, it is the normative
	  specification for parsing it, but shapes of forms are only normative with respect to
	  their default evaluation. A reference defined with a form shape can be used in different
	  shapes, albeit with different semantics and value, and even when used in its default
	  shape, a processing application MAY give it alternative semantics.
	</t>
	<t>
	  For example, this specification defines a way to specify a string encoding with forms of
	  the shape <spanx style="verb">( stringenc {enc}:Expr )</spanx>. But the shapes <spanx
	  style="verb">( stringenc {arg1}:Int {arg2}:Int )</spanx> or <spanx style="verb">(
	  {arg1}:Int stringenc {arg2}:Int )</spanx> are syntactically valid. They just have unspecified
	  semantics, as far as this specification is concerned.
	</t>
	<t>
	  Some identifiers are expected to be verifiable against a byte sequence. This means that
	  there must be an algorithm that, given the byte sequence as input, produces the
	  identifier as output and, given a different byte sequence, would produce a different
	  identifier. Because this verification has security implications, the algorithm used
	  should have the same guarantees as a cryptographic hash function in terms of
	  collisions.
	</t>
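	<t>
	  A minimal, non-normative Python sketch of such a verification follows. It assumes
	  SHA-256 as the algorithm, which is a choice made for this example only, and the function
	  names are hypothetical.
	</t>
	<figure><artwork><![CDATA[
```python
import hashlib
import hmac


def identifier_for(data: bytes) -> bytes:
    """Derive an identifier from a byte sequence (SHA-256 is assumed here)."""
    return hashlib.sha256(data).digest()


def verify(identifier: bytes, data: bytes) -> bool:
    """Check that the identifier is the one the byte sequence produces."""
    return hmac.compare_digest(identifier, identifier_for(data))
```
]]></artwork></figure>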
      </section>
    </section>

    <section title="BULK syntax">
      <t>
	A BULK stream is a sequence of 8-bit bytes. Bits and bytes are in big-endian order. The
	result of parsing a BULK stream is a list of abstract data, called the abstract yield. BULK
	parsing is deterministic but not injective: a BULK stream has only one abstract yield, but different BULK streams
	can have the same abstract yield (if they associate namespaces to different markers, see
	<xref target="namespaces">namespaces</xref>).
      </t>
      <t>
	A processing application is not expected to actually produce the abstract yield, but an
	adaptation of the abstract yield to its own implementation, called the concrete yield. Also,
	some expressions in a BULK stream may have the semantics of a transformation of the abstract
	yield. A processing application MAY thus not produce or retain the concrete yield but the
	result of its transformation. This specification deals mainly with the byte sequence and the
	abstract yield and occasionally provides guidelines about the concrete yield. Of course, a
	processing application MAY not produce any concrete yield at all but produce various data
	structures and side effects from parsing the BULK stream (for example, an event sourced
	application may read its event log from a BULK stream and build its application state by
	applying the events, discarding each of them as soon as it has been applied).
      </t>
      <t>
	The abstract yield is a list of expressions. Expressions can be atoms or forms. Forms
	are lists of expressions. If a byte sequence is parsed as an expression, this byte
	sequence is said to encode this expression.
      </t>
      <t>
	When a sequence of bytes is named in a shape, its name can be used in this specification to
	designate either the byte sequence, or the expression or list of expressions it
	encodes. When there could be ambiguity, this specification specifies which is designated.
      </t>

      <section anchor="parsing" title="Parsing algorithm">
	<t>
	  The parser operates with a context, which is a list of expressions. Each time an
	  expression is parsed, it is appended at the end of the context. The initial context is the
	  abstract yield.
	</t>
	<t>
	  At the beginning of a BULK stream and after having consumed the byte sequence encoding a
	  complete expression, the parser is at the dispatch stage. At this stage, the next byte is
	  a marker byte, which tells the parser what kind of expression comes next (the marker byte
	  is the first byte of the sequence that encodes an expression). The expression appended to
	  the context after reading a byte sequence is called the specific yield of the byte
	  sequence.
	</t>
	<t>
	  The <spanx style="verb">0x01</spanx> and <spanx style="verb">0x02</spanx> marker bytes are
	  special cases. When the parser reads <spanx style="verb">0x01</spanx>, it immediately
	  appends an empty list to the current context. This list becomes the new context. This new
	  context has the previous context as parent. Then the parser returns to its dispatch
	  stage. When the parser reads <spanx style="verb">0x02</spanx>, it appends nothing to the
	  context, but instead the parent of the current context becomes the new context and the
	  parser returns to the dispatch stage. Thus it is a parsing error to read <spanx
	  style="verb">0x02</spanx> when the context is the abstract yield.
	</t>
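	<t>
	  The handling of contexts can be sketched as follows in Python. This non-normative
	  illustration only recognizes the <spanx style="verb">0x00</spanx>, <spanx
	  style="verb">0x01</spanx> and <spanx style="verb">0x02</spanx> marker bytes. Representing
	  nil by <spanx style="verb">None</spanx>, forms by Python lists, and rejecting a stream
	  that ends inside an open form are assumptions of this sketch.
	</t>
	<figure><artwork><![CDATA[
```python
NIL, START, END = 0x00, 0x01, 0x02


def parse_forms(stream: bytes) -> list:
    """Parse a stream made only of nil atoms and form markers."""
    root = []        # the abstract yield, which is the initial context
    context = root   # the current context
    parents = []     # the chain of parent contexts
    for byte in stream:
        if byte == START:
            new_context = []
            context.append(new_context)  # the empty list joins the context
            parents.append(context)
            context = new_context        # and becomes the new context
        elif byte == END:
            if not parents:
                raise ValueError("0x02 read in the abstract yield")
            context = parents.pop()      # back to the parent context
        elif byte == NIL:
            context.append(None)
        else:
            raise NotImplementedError(
                f"marker byte 0x{byte:02X} is outside this sketch")
    if parents:
        raise ValueError("stream ended inside an open form")
    return root
```
]]></artwork></figure>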
	<t>
	  Some forms have side-effects in their semantics. Those side-effects MUST NOT affect the
	  parsing of any expression. They can affect evaluation, in which case they MUST only affect
	  the evaluation of expressions in the scope of the form. The outer scope of an expression
	  is the part of its context that follows the expression. Some forms MAY define an inner
	  scope in their shape. The scope of an expression is the union of the outer and inner
	  scopes. This makes BULK lexically scoped.
	</t>
	<t>
	  Whenever a parsing error is encountered, parsing of the BULK stream MUST stop.
	</t>
	<section title="Summary of marker bytes">
	  <table>
	    <thead><tr><th>marker</th><th>shape</th></tr></thead>
	    <tbody>
	      <tr><td><spanx style="verb">00</spanx></td><td><xref target="nil"><spanx style="verb">nil</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">01</spanx></td><td><xref target="start"><spanx style="verb">(</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">02</spanx></td><td><xref target="end"><spanx style="verb">)</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">03</spanx></td><td><xref target="array"><spanx style="verb"># Nat {content}</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">04–0F</spanx></td><td><xref target="reserved"><spanx style="verb">reserved</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">10–7F</spanx></td><td><xref target="ref"><spanx style="verb">references</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">80–BF</spanx></td><td><xref target="smallint"><spanx style="verb">w6[value]</spanx></xref></td></tr>
	      <tr><td><spanx style="verb">C0–FF</spanx></td><td><xref target="smallarray"><spanx style="verb">#[size] {content}</spanx></xref></td></tr>
	    </tbody>
	  </table>
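	  <t>
	    The dispatch stage implied by this table can be sketched as a simple classification of
	    the marker byte. The Python function below is a non-normative illustration; its name
	    and the strings it returns are hypothetical.
	  </t>
	  <figure><artwork><![CDATA[
```python
def classify_marker(byte: int) -> str:
    """Tell which kind of expression a marker byte announces."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("a marker byte is a single 8-bit value")
    if byte == 0x00:
        return "nil"
    if byte == 0x01:
        return "form start"
    if byte == 0x02:
        return "form end"
    if byte == 0x03:
        return "generic array"
    if byte <= 0x0F:
        return "reserved"
    if byte <= 0x7F:
        return "reference"
    if byte <= 0xBF:
        return "small unsigned integer"
    return "small array"
```
]]></artwork></figure>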
	</section>
	<section anchor="eval" title="Evaluation">
	  <t>
	    A processing application MAY implement evaluation of BULK expressions and streams. When
	    evaluating a BULK stream, when the parser gets to the dispatch stage and the context is
	    the abstract yield, the last expression in the context is replaced by what it evaluates
	    to. This description is only meant to provide the semantics of BULK
	    evaluation; a processing application MAY implement evaluation with a different
	    algorithm as long as it provides the same semantics.
	  </t>
	  <t>
	    The default evaluation rule is that an expression evaluates to itself. A name within a
	    namespace can have a value, which is what a reference associated to this name evaluates
	    to. A reference whose marker value is associated to no namespace or whose name has no
	    value evaluates to itself. How self-evaluating BULK expressions are represented in the
	    concrete yield is application-dependent, but future specifications MAY define a
	    standard API to access it, similar to the Document Object Model for XML.
	  </t>
	  <t>
	    The evaluation of a form obeys a special rule, though: if the first expression of the
	    form has type <spanx style="verb">Function</spanx>, that function is called with an
	    argument list and the form evaluates to the return value if it's an atom or the
	    evaluation of the return value if it is a form. If the function has type <spanx
	    style="verb">LazyFunction</spanx>, the argument list is the rest of the form. If the
	    function has type <spanx style="verb">EagerFunction</spanx>, the argument list is the
	    rest of the form, where each expression is replaced by what it evaluates to. Any
	    expression that has type <spanx style="verb">LazyFunction</spanx> or <spanx
	    style="verb">EagerFunction</spanx> also has type <spanx style="verb">Function</spanx>.
	  </t>
	  <t>
	    A form whose first expression doesn't have type <spanx style="verb">Function</spanx>
	    evaluates to itself.
	  </t>
	  <t>
	    When an application evaluates a BULK expression, it MUST verify that evaluation will
	    terminate in a finite number of evaluation steps. An application MAY verify finite
	    termination statically or dynamically. For example, an application MAY stop evaluation
	    in error after a predetermined number of steps.
	  </t>
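	  <t>
	    The rules of this section can be sketched as follows in Python. This is a non-normative
	    illustration under several assumptions: forms are Python lists, anything else is an
	    atom, functions appear directly as the first element of a form (references and their
	    values are left out), and finite termination is enforced dynamically with a step
	    budget. The class names and the default budget are hypothetical.
	  </t>
	  <figure><artwork><![CDATA[
```python
class StepLimitExceeded(Exception):
    """Raised when the evaluation budget is exhausted."""


class LazyFunction:
    def __init__(self, fn):
        self.fn = fn


class EagerFunction:
    def __init__(self, fn):
        self.fn = fn


class Evaluator:
    def __init__(self, max_steps=256):
        self.steps_left = max_steps

    def evaluate(self, expr):
        if self.steps_left <= 0:
            raise StepLimitExceeded("step budget exhausted")
        self.steps_left -= 1
        if not isinstance(expr, list) or not expr:
            return expr  # atoms and the empty form evaluate to themselves
        head, rest = expr[0], expr[1:]
        if isinstance(head, EagerFunction):
            args = [self.evaluate(arg) for arg in rest]  # evaluated arguments
        elif isinstance(head, LazyFunction):
            args = rest                                  # arguments as-is
        else:
            return expr  # no Function in first position: self-evaluating
        result = head.fn(*args)
        # An atom is returned directly, a form is evaluated in turn.
        return self.evaluate(result) if isinstance(result, list) else result


add = EagerFunction(lambda a, b: a + b)
total = Evaluator().evaluate([add, 1, [add, 2, 3]])
```
]]></artwork></figure>
	  <t>
	    In this sketch, a function that keeps returning a form calling itself ends with an
	    error once the budget is spent, which is one way to stop evaluation in error after a
	    predetermined number of steps.
	  </t>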
	</section>
      </section>

      <section title="Forms">
	<section anchor="start" title="starting marker byte">
	  <t>
	    <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x01</spanx></t>
	      <t hangText="mnemonic"><spanx style="verb">(</spanx></t>
	    </list>
	  </t>
	</section>
	<section anchor="end" title="ending marker byte">
	  <t>
	    <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x02</spanx></t>
	      <t hangText="mnemonic"><spanx style="verb">)</spanx></t>
	    </list>
	  </t>
	</section>

	<section title="Difference between sequence and form">
	  <t>
	    There is a difference between a byte sequence encoding several expressions among the
	    current context and a byte sequence encoding a form (i.e. a single expression that is a
	    list of expressions). As an example, let's examine several forms of the shape <spanx
	    style="verb">( foo {bar} )</spanx>.
	  </t>
	  <t>
	    <list style="symbols">
	      <t>In the form <spanx style="verb">( foo nil nil nil )</spanx>, {bar} encodes 3
	      expressions, and they are three atoms in the yield.</t>

	      <t>In the form <spanx style="verb">( foo nil )</spanx>, {bar} is a single expression
	      in the yield, and that expression is an atom.</t>

	      <t>In the form <spanx style="verb">( foo ( nil nil nil ) )</spanx>, {bar} is also a
	      single expression in the yield, and that expression is a form, a list in the
	      yield.</t>
	    </list>
	  </t>
	  <t>
	    In a shape, when a byte sequence must yield a single expression, it has the type <spanx
	    style="verb">Expr</spanx>. So the last two examples fit the shape <spanx style="verb">(
	    foo {seq}:Expr )</spanx> but not the first. When a byte sequence must yield a form, it
	    has type <spanx style="verb">Form</spanx>. Thus the shape <spanx style="verb">( foo
	    {bar}:Form )</spanx> is equivalent to <spanx style="verb">( foo ( {bar} )
	    )</spanx>. Either one MAY be used.
	  </t>
	</section>
      </section>

      <section title="Atoms">
	<section anchor="nil" title="nil">
	  <t>
	    <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x00</spanx> (mnemonic: <spanx
	      style="verb">nil</spanx>)</t>
	      <t hangText="shape"><spanx style="verb">nil</spanx></t>
	    </list>
	  </t>
	  <t>
	    Apart from being a possible short marker value, the fact that the <spanx
	    style="verb">0x00</spanx> byte represents a valid atom means that a series of null bytes
	    is a valid part of a BULK stream, thus making the format less fragile. In a network
	    communication, nil atoms can be sent to keep the channel open. They can also be used as
	    padding at the end of a form or between forms.
	  </t>
	</section>

	<section title="Arrays">
	  <t>
	    Arrays can be used to store arbitrary bytes.
	  </t>
	  <t>
	    An array can be interpreted either as a bit sequence or as an unsigned integer in
	    binary notation. The choice depends on the context and the application. Actually, many
	    processing applications may not need to make any choice, as most programming language
	    implementations actually also confuse unsigned integers and bit sequences to some
	    extent. Expressions that are unsigned integers (that is, natural numbers) have type
	    <spanx style="verb">Nat</spanx> (whether they are encoded as an array or not).
	  </t>
	  <t>
	    Big arrays typically store the content of a file or a binary message of another
	    format. They can also be used to store a vector or matrix of fixed-size elements.
	  </t>
	  <t>
	    In any case, the semantics of the content must be inferred by the processing
	    application; where ambiguity can appear, an application SHOULD enclose the array in a
	    form that makes the semantics explicit (e.g. <xref target="string"><spanx
	    style="verb">string</spanx></xref>, <xref target="string*"><spanx
	    style="verb">string*</spanx></xref>, <xref target="blob"><spanx
	    style="verb">blob</spanx></xref>, or <xref target="unsigned-int"><spanx
	    style="verb">unsigned-int</spanx></xref>).
	  </t>
	  <t>
	    Because BULK arrays have no end markers, the payload of a BULK array can constitute the
	    end of the stream.
	  </t>
	  <t>
	    The start and end of an array are known without reading its content, which means that
	    its content can be skipped in constant time and mapped in memory (or read lazily by any
	    other means).
	  </t>
	  <t>
	    Because BULK can use integers with arbitrary size to store the size of an array, BULK
	    arrays have no limit in size.
	  </t>
	  
	  <section anchor="array" title="Generic array">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0x03</spanx> (mnemonic: <spanx
		style="verb">#</spanx>)</t>
		<t hangText="shape"><spanx style="verb"># Nat {content}</spanx></t>
	      </list>
	    </t>
	    <t>
	      Arrays have a special parsing rule. After consuming the marker byte, the parser
	      returns to the dispatch stage. It is a parser error if the parsed expression is not
	      of type <spanx style="verb">Nat</spanx> or if its value cannot be recognized. This
	      integer is not added to any context, but the parser consumes as many bytes as this
	      integer and they constitute the content of this array.
	    </t>
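	    <t>
	      This parsing rule can be sketched as follows in Python. The sketch is a non-normative
	      illustration: it only accepts a size encoded as a small unsigned integer, a small
	      array or a generic array, it reads array contents as big-endian unsigned integers,
	      and it treats a truncated stream as an error. These choices and the function names
	      are assumptions of the sketch.
	    </t>
	    <figure><artwork><![CDATA[
```python
def take(stream: bytes, start: int, size: int):
    """Return (content, next_position) for size bytes starting at start."""
    end = start + size
    if end > len(stream):
        raise ValueError("stream ends before the end of the array content")
    return stream[start:end], end


def read_nat(stream: bytes, pos: int):
    """Read the expression at pos as a Nat: (value, next_position)."""
    if pos >= len(stream):
        raise ValueError("stream ends where an array size was expected")
    marker = stream[pos]
    if 0x80 <= marker <= 0xBF:       # small unsigned integer
        return marker & 0x3F, pos + 1
    if marker >= 0xC0:               # small array read as an integer
        content, end = take(stream, pos + 1, marker & 0x3F)
        return int.from_bytes(content, "big"), end
    if marker == 0x03:               # generic array read as an integer
        content, end = read_generic_array(stream, pos)
        return int.from_bytes(content, "big"), end
    raise ValueError("the expression after the array marker is not a Nat")


def read_generic_array(stream: bytes, pos: int = 0):
    """Read the generic array whose marker is at pos."""
    if pos >= len(stream) or stream[pos] != 0x03:
        raise ValueError("no generic array marker at this position")
    size, content_start = read_nat(stream, pos + 1)
    # The size is not added to any context; it only delimits the content.
    return take(stream, content_start, size)
```
]]></artwork></figure>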
	    <t>
	      In the text notation, a quoted string is the notation for a generic array containing
	      the encoding of that string in the <xref target="stringenc">current encoding</xref>,
	      except if the size of the encoding is below 64 bytes, cf. <xref
	      target="smallarray">small arrays</xref>.
	    </t>
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
	    <t>
	      In a shape, the type <spanx style="verb">String</spanx> is synonymous with <spanx
	      style="verb">Bytes</spanx>, but means that the content of the array is supposed to be
	      taken as a string in the current encoding.
	    </t>
	  </section>

	  <section anchor="smallarray" title="Small array">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0xC0–0xFF</spanx> (mnemonic: <spanx
		style="verb">#[size]</spanx>)</t>
		<t hangText="shape"><spanx style="verb">#[size] {content}</spanx></t>
	      </list>
	    </t>
	    <t>
	      Small arrays have a special parsing rule. The 6 least significant bits of the marker
	      byte are treated as an unsigned integer. This integer is not added to any context, but
	      the parser consumes as many bytes as this integer and they constitute the content of
	      this array.
	    </t>
	    <t>
	      In the text notation, the notation of the marker byte of a small array of size X is
	      <spanx style="verb">#[X]</spanx>. For example, <spanx style="verb">#[2]
	      0x1234</spanx> is a notation for the bytes <spanx style="verb">0xC2-1234</spanx>.
	    </t>
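	    <t>
	      The following Python sketch, given as a non-normative illustration, encodes and
	      decodes a small array and reproduces the example above. The function names are
	      hypothetical, and treating a truncated stream as an error is an assumption.
	    </t>
	    <figure><artwork><![CDATA[
```python
def encode_small_array(content: bytes) -> bytes:
    """Prefix the content with a marker byte carrying its size."""
    if len(content) > 0x3F:
        raise ValueError("a small array holds at most 63 bytes")
    return bytes([0xC0 | len(content)]) + content


def decode_small_array(stream: bytes, pos: int = 0):
    """Return (content, next_position) for the small array at pos."""
    marker = stream[pos]
    if marker < 0xC0:
        raise ValueError("not a small array marker")
    size = marker & 0x3F  # the 6 least significant bits
    end = pos + 1 + size
    if end > len(stream):
        raise ValueError("stream ends before the end of the array content")
    return stream[pos + 1:end], end


encoded = encode_small_array(bytes.fromhex("1234"))
```
]]></artwork></figure>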
	    <t>
	      In the text notation, a quoted string is the notation for a small array containing the
	      encoding of that string in the current encoding if the size of the encoding is below
	      64 bytes.
	    </t>
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
	  </section>

	  <section anchor="smallint" title="Small unsigned integers">
	    <t>
	      <list style="hanging">
		<t hangText="marker"><spanx style="verb">0x80–0xBF</spanx> (mnemonic: <spanx
		style="verb">w6[value]</spanx>)</t>
		<t hangText="shape"><spanx style="verb">w6[value]</spanx></t>
	      </list>
	    </t>
	    <t>
	      Small unsigned integers have a special parsing rule. The 6 least significant bits of
	      the marker byte are the value encoded by this byte (as bits or as an unsigned integer
	      in binary notation).
	    </t>
	    <t>
	      In the text notation, the notation of the marker byte of a small unsigned integer of
	      value X is <spanx style="verb">w6[X]</spanx>. For example, <spanx
	      style="verb">w6[11]</spanx> is a notation for the byte <spanx
	      style="verb">0x8B</spanx> (as is <spanx style="verb">11</spanx>, cf. <xref
	      target="arithmetic"/>).
	    </t>
	    <t>Types: <spanx style="verb">Bytes</spanx>, <spanx style="verb">Nat</spanx></t>
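	    <t>
	      As an illustration of this rule, the following sketch converts between a marker byte
	      and the value it carries. The helper names <spanx style="verb">decode_w6</spanx> and
	      <spanx style="verb">encode_w6</spanx> are hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
def decode_w6(marker: int) -> int:
    """Value carried by a small unsigned integer marker byte (0x80 to 0xBF)."""
    if not 0x80 <= marker <= 0xBF:
        raise ValueError("not a small unsigned integer marker")
    return marker & 0x3F          # the 6 least significant bits


def encode_w6(value: int) -> int:
    """Marker byte for a value that fits in 6 bits."""
    if not 0 <= value <= 63:
        raise ValueError("value does not fit in 6 bits")
    return 0x80 | value
```
]]></artwork></figure>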
	  </section>

	  <section anchor="nat-repr" title="Representing natural numbers">
	    <t>
	      When the syntax of a BULK form mandates that an expression can only be a <spanx
	      style="verb">Nat</spanx>, an application SHOULD encode it as the smallest possible
	      array using one of the following sizes: 6, 8, 16, 32, or any multiple of 64 bits.
	    </t>
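	    <t>
	      A possible encoder for this rule is sketched below. It assumes big-endian byte order,
	      which is inferred from the <spanx style="verb">( 31 256 )</spanx> example in <xref
	      target="arithmetic"/>, and it only covers values that fit in a small unsigned integer
	      or a small array (at most 63 bytes). The function name <spanx
	      style="verb">encode_nat</spanx> is hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
def encode_nat(n: int) -> bytes:
    """Smallest encoding of n among 6, 8, 16, 32 or a multiple of 64 bits.

    Assumptions: big-endian content; sizes above 63 bytes are out of scope.
    """
    if n < 0:
        raise ValueError("natural numbers only")
    if n < 64:
        return bytes([0x80 | n])                      # small unsigned integer
    for size in (1, 2, 4):                            # 8, 16 and 32 bits
        if n < 1 << (8 * size):
            return bytes([0xC0 | size]) + n.to_bytes(size, "big")
    size = 8                                          # multiples of 64 bits
    while n >= 1 << (8 * size):
        size += 8
    if size > 63:
        raise ValueError("needs a general array, outside this sketch")
    return bytes([0xC0 | size]) + n.to_bytes(size, "big")
```
]]></artwork></figure>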
	  </section>
	</section>

	<section anchor="reserved" title="Reserved marker bytes">
	  <t>
	    Marker bytes <spanx style="verb">0x04−0x0F</spanx> are reserved for future major
	    versions of BULK. It is a parser error if a BULK stream with major version 1 contains
	    such a marker byte.
	  </t>
	</section>

	<section anchor="ref" title="References">
	  <t><list style="hanging">
	    <t hangText="marker"><spanx style="verb">0x10−0x7F</spanx></t>
	    <t hangText="shape">
	      <spanx style="verb">{ns}:1B {name}:1B</spanx>
	      <vspace/>
  	      <spanx style="verb">0x7F {ns'} {name}:1B</spanx>
	    </t>
	  </list>
	  </t>
	  <t>
	    The <spanx style="verb">{ns}</spanx> byte is a value associated with a namespace, called
	    the namespace marker. Values <spanx style="verb">0x10−0x13</spanx> are reserved for
	    namespaces defined by BULK specifications. Greater values can be associated with
	    namespaces identified by a unique identifier.
	  </t>
	  <t>
	    The <spanx style="verb">{name}</spanx> byte is the name within the
	    namespace. Vocabularies with more than 256 names thus need to be spread across several
	    namespaces.
	  </t>
	  <t>
	    The specification of a namespace SHOULD include a mnemonic for the namespace and for
	    each defined name. When descriptions use several namespaces, the mnemonic of a reference
	    SHOULD be the concatenation of the namespace mnemonic, ":" and the name mnemonic if
	    there can be an ambiguity. For example, the <spanx style="verb">fp</spanx> name in
	    namespace <spanx style="verb">math</spanx> becomes <spanx style="verb">math:fp</spanx>.
	  </t>
	  <t>Type: <spanx style="verb">Ref</spanx></t>
	  <section title="Special case">
	    <t>
	      References have a special parsing rule. In case a BULK stream needs a large number
	      of namespaces, if the marker byte is <spanx style="verb">0x7F</spanx>, the parser
	      continues to read bytes until it finds a byte different from <spanx
	      style="verb">0xFF</spanx>. The sum of the marker byte and each of those bytes, taken
	      as unsigned integers, is the namespace marker. For example,
	      the reference encoded by the bytes <spanx style="verb">0x7F 0xFF 0x8C 0x1A</spanx> is
	      the name 26 in the namespace associated with 522.
	    </t>
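	    <t>
	      The following sketch shows one way to read a reference, including the special case.
	      It follows the example above, where the marker byte <spanx
	      style="verb">0x7F</spanx> is part of the sum (127 + 255 + 140 = 522). The function
	      name and its return convention are assumptions made for the example.
	    </t>
	    <figure><artwork><![CDATA[
```python
def parse_reference(data: bytes, pos: int = 0) -> tuple[int, int, int]:
    """Return (namespace_marker, name, next_pos) for a reference at data[pos].

    Hypothetical helper; truncated input raises IndexError.
    """
    marker = data[pos]
    if not 0x10 <= marker <= 0x7F:
        raise ValueError("not a reference marker")
    ns = marker
    pos += 1
    if marker == 0x7F:
        # Keep adding bytes until one differs from 0xFF (that one is added too).
        while True:
            byte = data[pos]
            pos += 1
            ns += byte
            if byte != 0xFF:
                break
    name = data[pos]
    return ns, name, pos + 1
```
]]></artwork></figure>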
	  </section>
	</section>

      </section>

    </section>

    <section title="Standard namespaces">
      <t>
	Standard namespaces have a fixed marker value and are not identified by a unique
	identifier.
      </t>

      <section title="BULK core namespace">
	<t>
	  <list style="hanging">
	      <t hangText="marker"><spanx style="verb">0x10</spanx> (mnemonic: <spanx
	      style="verb">bulk</spanx>)</t>
	  </list>
	</t>

	<section title="Version">
	  <t>
	    <list style="hanging">
	      <t hangText="name"><spanx style="verb">0x00</spanx> (mnemonic: <spanx
	      style="verb">version</spanx>)</t>
	      <t hangText="shape"><spanx style="verb">( version {major}:Nat {minor}:Nat
	      )</spanx></t>
	    </list>
	  </t>
	  <t>
	    When parsing a BULK stream, a processing application MUST determine explicitly the
	    major and minor version of the BULK specification that the stream obeys. This
	    information MAY be exchanged out-of-band, if BULK is used to exchange a number of very
	    small messages, where repeated headers of 6 bytes might become too big an overhead. A
	    processing application MUST NOT assume a default version.
	  </t>
	  <t>
	    If the version is expressed within a BULK stream, this form MUST be the first in the
	    stream. In any other place, this form has no semantics attached to it. This
	    specification defines BULK 1.0. When writing a BULK stream, an application MUST encode
	    <spanx style="verb">{major}</spanx> and <spanx style="verb">{minor}</spanx> by the
	    smallest byte sequence as described in <xref target="nat-repr"></xref>.
	  </t>
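	  <t>
	    As an illustration, the 6-byte header for BULK 1.0 can be assembled as sketched below.
	    The form delimiters <spanx style="verb">0x01</spanx> and <spanx
	    style="verb">0x02</spanx> are taken from the <spanx style="verb">( 31 256 )</spanx>
	    example in <xref target="arithmetic"/>. The function name is hypothetical and the
	    sketch only handles version numbers that fit in a small unsigned integer.
	  </t>
	  <figure><artwork><![CDATA[
```python
FORM_START, FORM_END = 0x01, 0x02   # as in the ( 31 256 ) example
BULK_NS, VERSION_NAME = 0x10, 0x00  # bulk namespace marker, version name


def version_form(major: int, minor: int) -> bytes:
    """Bytes of ( version {major} {minor} ) for numbers below 64 (sketch)."""
    for n in (major, minor):
        if not 0 <= n < 64:
            raise ValueError("this sketch only handles 6-bit version numbers")
    return bytes([FORM_START, BULK_NS, VERSION_NAME,
                  0x80 | major, 0x80 | minor, FORM_END])


header = version_form(1, 0)
```
]]></artwork></figure>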
	  <t>
	    An application writing a BULK stream to long-term storage (e.g. in a file or a database
	    record) SHOULD include a <spanx style="verb">version</spanx> form.
	  </t>
	  <t>
	    Two BULK versions with the same major version MUST share the same parsing rules and the
	    same definitions of marker bytes. Changing the syntax or semantics of existing marker
	    bytes or using marker bytes in the reserved interval warrants a new major
	    version, as does changing the syntax or semantics of existing names in standard
	    namespaces.
	  </t>
	  <t>
	    Adding standard namespaces or adding names in existing standard namespaces warrants a
	    new minor version.
	  </t>
	</section>

	<section title="Booleans">
	  <section title="true">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x01</spanx> (mnemonic: <spanx
		style="verb">true</spanx>)</t>
		<t hangText="shape"><spanx style="verb">true</spanx></t>
	      </list>
	    </t>
	    <t>
	      Type: <spanx style="verb">Boolean</spanx>.
	    </t>
	  </section>

	  <section title="false">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x02</spanx> (mnemonic: <spanx
		style="verb">false</spanx>)</t>
		<t hangText="shape"><spanx style="verb">false</spanx></t>
	      </list>
	    </t>
	    <t>
	      Type: <spanx style="verb">Boolean</spanx>.
	    </t>
	  </section>
	</section>

	<section anchor="namespaces" title="Namespaces">
	  <section title="New namespace">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x03</spanx> (mnemonic: <spanx
		style="verb">ns</spanx>)</t>
		<t hangText="shape">
		  <spanx style="verb">( ns {marker}:Nat {id}:Expr )</spanx>
		</t>
	      </list>
	    </t>
	    <t>
	      This associates the namespace identified by <spanx style="verb">{id}</spanx> to the
	      namespace marker <spanx style="verb">{marker}</spanx>, within the scope of this
	      expression.
	    </t>
	  </section>

	  <section title="Package">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x04</spanx> (mnemonic: <spanx
		style="verb">package</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( package {id}:Expr {namespaces}
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This creates a package identified by <spanx style="verb">{id}</spanx>. Packages are
	      immutable, <spanx style="verb">{id}</spanx> MUST be verifiable against the byte
	      sequence <spanx style="verb">{namespaces}</spanx>. <spanx
	      style="verb">{namespaces}</spanx> must be a series of expressions each identifying a
	      BULK namespace.
	    </t>
	  </section>

	  <section title="Import">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x05</spanx> (mnemonic: <spanx
		style="verb">import</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( import {base}:Nat {count}:Nat {id}:Expr
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This associates the first <spanx style="verb">{count}</spanx> namespaces in the
	      package identified by <spanx style="verb">{id}</spanx> with a continuous range of
	      marker bytes starting at <spanx style="verb">{base}</spanx> within the scope of this
	      expression.
	    </t>
	    <t>
	      Example: <spanx style="verb">( import 28 3 0x0123456789ABCDEF )</spanx> associates
	      the first 3 namespaces of the package identified by <spanx
	      style="verb">0x0123456789ABCDEF</spanx> to the markers 28, 29 and 30.
	    </t>
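	    <t>
	      The marker assignment can be pictured with the short sketch below, which maps each
	      marker to the position of a namespace in the package. The function name and the use
	      of zero-based positions are assumptions made for the example.
	    </t>
	    <figure><artwork><![CDATA[
```python
def import_markers(base: int, count: int) -> dict[int, int]:
    """Map each marker to the index of the package namespace it designates."""
    return {base + index: index for index in range(count)}


# ( import 28 3 ... ) assigns markers 28, 29 and 30.
markers = import_markers(28, 3)
```
]]></artwork></figure>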
	  </section>
	</section>

	<section title="Definitions">
	  <t>
	    To define a reference is to change the value of its name in its namespace (as
	    identified by its unique identifier, not the marker value) within a certain scope.
	  </t>
	  <t>
	    If a BULK stream is not evaluated, the semantics of a definition are entirely
	    application-dependent.
	  </t>
	  <t>
	    When a BULK stream containing definitions for a namespace comes from a trusted source
	    (e.g. in configuration files of the application, or in the communication with an agent
	    that has been granted the relevant authority), an application MAY give those
	    definitions long-lasting semantics (i.e. keep the values of the names at the end of
	    parsing). This is the preferred mechanism for bulk namespace definition when the
	    semantics of the defined expressions can be expressed completely by BULK forms.
	  </t>

	  <section title="Simple definition">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x06</spanx> (mnemonic: <spanx
		style="verb">define</spanx>)</t>
		<t hangText="shape">
		  <spanx style="verb">( define {ref}:Ref {value}:Expr )</spanx>
		  <vspace/>
		  <spanx style="verb">( define nil {value}:Expr )</spanx>
		  </t>
	      </list>
	    </t>
	    <t>
	      This defines the reference <spanx style="verb">{ref}</spanx> to the yield of <spanx
	      style="verb">{value}</spanx> in the outer scope of this form.
	    </t>
	    <t>
	      In any context where there is a default namespace where definitions are made,
	      e.g. <xref target="verifiable"><spanx style="verb">verifiable-ns</spanx></xref>, the
	      second shape defines the smallest name that is not yet defined to <spanx
	      style="verb">{value}</spanx>.
	    </t>
	  </section>

	  <section title="Named definition">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x07</spanx> (mnemonic: <spanx
		style="verb">mnemonic/def</spanx>)</t>
		<t hangText="shape">
		  <spanx style="verb">( mnemonic/def {ref}:Ref {mnemonic}:String
		  {doc}:Expr {value} )</spanx>
		  <vspace/>
		  <spanx style="verb">( mnemonic/def nil {mnemonic}:String
		  {doc}:Expr {value} )</spanx>
		</t>
	      </list>
	    </t>
	    <t>
	      This suggests <spanx style="verb">{mnemonic}</spanx> as the mnemonic of the name
	      designated by <spanx style="verb">{ref}</spanx> in its namespace. If <spanx
	      style="verb">{value}</spanx> is of type Expr, this defines the reference <spanx
	      style="verb">{ref}</spanx> to <spanx style="verb">{value}</spanx> in the scope of this
	      form.
	    </t>
	    <t>
	      <spanx style="verb">{doc}</spanx> is any expression that provides documentation for
	      this reference. If it has type Bytes, it MUST be a string. It could be any kind of
	      metadata or document type.
	    </t>
	    <t>
	      In any context where there is a default namespace where definitions are made,
	      e.g. <xref target="verifiable"><spanx style="verb">verifiable-ns</spanx></xref>, the
	      second shape defines the smallest name that is not yet defined to <spanx
	      style="verb">{value}</spanx>.
	    </t>
	  </section>

	  <section title="Namespace description">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x08</spanx> (mnemonic: <spanx
		style="verb">ns-mnemonic</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( ns-mnemonic {ns}:Expr {mnemonic}:String
		{doc} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This suggests <spanx style="verb">{mnemonic}</spanx> as the mnemonic of the namespace
	      designated by <spanx style="verb">{ns}</spanx> (which can be the integer to which
	      this namespace is associated, a reference in this namespace or the unique identifier
	      of this namespace).
	    </t>
	  </section>

	  <section anchor="verifiable" title="Verifiable namespace definition">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x09</spanx> (mnemonic: <spanx
		style="verb">verifiable-ns</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( verifiable-ns {id}:Expr {marker}:Nat
		{data}:Expr {mnemonic}:Expr {doc}:Expr {definitions} )</spanx></t>
		<t hangText="inner scope"><spanx style="verb">{id} {data} {mnemonic} {doc} {definitions}</spanx></t>
	      </list>
	    </t>
	    <t>
	      This associates the namespace identified by <spanx style="verb">{id}</spanx> to the
	      namespace marker <spanx style="verb">{marker}</spanx>, within the scope of this
	      form. Verifiable namespaces are immutable, <spanx style="verb">{id}</spanx> MUST be
	      verifiable against the byte sequence <spanx style="verb">{marker} {data} {mnemonic}
	      {doc} {definitions}</spanx>. The semantics of this form is to define in its scope any
	      definition made in the designated namespace within <spanx
	      style="verb">{definitions}</spanx>.
	    </t>
	    <t>
	      If <spanx style="verb">{mnemonic}</spanx> is of type <spanx
	      style="verb">String</spanx>, then this suggests it as the mnemonic of the
	      namespace. Else it MUST be <spanx style="verb">nil</spanx>.
	    </t>
	    <t>
	      If more data than <spanx style="verb">{id}</spanx> is needed to verify <spanx
	      style="verb">{id}</spanx> against <spanx style="verb">{definitions}</spanx> (like the
	      salt of a hash function, or the namespace of a UUID), this data should be provided by
	      <spanx style="verb">{data}</spanx>. Else <spanx style="verb">{data}</spanx> MUST be
	      <spanx style="verb">nil</spanx>.
	    </t>
	    <t>
	      A verifiable namespace wouldn't really be immutable if it used definitions from other
	      namespaces that aren't immutable. To that effect, an application SHOULD stop
	      processing this form with an error when <spanx style="verb">{definitions}</spanx>
	      contains references from namespaces that cannot be determined to be immutable
	      themselves. The goal is to prevent a user or system from being unwittingly vulnerable, so
	      an application MAY provide an option to accept a specific verifiable namespace, but an
	      application MUST NOT provide an option to accept any vulnerable verifiable
	      namespace. That is, an option like <spanx style="verb">--accept-ns
	      8f82849556d74466</spanx> is acceptable but <spanx
	      style="verb">--disable-immutability-check</spanx> is not.
	    </t>
	  </section>

	  <section title="Array concatenation">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x0A</spanx> (mnemonic: <spanx
		style="verb">concat</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( concat {array1}:Bytes {array2}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      <list style="hanging">
		<t hangText="Name's type">EagerFunction</t>
		<t hangText="Form's type">Bytes</t>
		<t hangText="Form's value">the concatenation of {array1} and {array2}.</t>
	      </list>
	    </t>
	    <t>
	      The value of this form is an array that contains the bytes in array1 followed by the
	      bytes in array2.
	    </t>
	  </section>

	  <section title="Substitution">
	    <section title="Substitution function">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x0B</spanx> (mnemonic: <spanx
		  style="verb">subst</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( subst {code} )</spanx></t>
		</list>
	      </t>
	      <t>
		<list style="hanging">
		  <t hangText="Name's type">LazyFunction</t>
		  <t hangText="Form's type">EagerFunction</t>
		  <t hangText="Form's value">A substitution function whose return value is the
		  value of <spanx style="verb">{code}</spanx>. Within <spanx
		  style="verb">{code}</spanx>'s specific yield, the names <spanx
		  style="verb">arg</spanx> and <spanx style="verb">rest</spanx> are defined:</t>
		</list>
	      </t>
	    </section>
	    <section title="Argument">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x0C</spanx> (mnemonic: <spanx
		  style="verb">arg</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( arg {n}:Nat )</spanx></t>
		</list>
	      </t>
	      <t>
		<list style="hanging">
		  <t hangText="Name's type">EagerFunction</t>
		  <t hangText="Form's type">Expr</t>
		  <t hangText="Form's value">the element number <spanx style="verb">{n}</spanx>
		  (starting at zero) of the substitution function's arguments list</t>
		</list>
	      </t>
	    </section>
	    <section title="Rest of arguments list">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x0D</spanx> (mnemonic: <spanx
		  style="verb">rest</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( rest {n}:Nat )</spanx></t>
		</list>
	      </t>
	      <t>
		<list style="hanging">
		  <t hangText="Name's type">EagerFunction</t>
		  <t hangText="Form's type">Expr</t>
		  <t hangText="Form's value">the substitution function's arguments list without its
		  first <spanx style="verb">{n}</spanx> elements.</t>
		</list>
	      </t>
	    </section>
	    <section title="Examples">
	      <t>Here is a definition of the inverse followed by the numbers 1/2, 1/3 and 1/4:</t>
	      <t><spanx style="verb">( define inverse ( subst ( frac 1 ( arg 0 ) ) ) ) ( inverse 2
	      ) ( inverse 3 ) ( inverse 4 )</spanx></t>
	      <t>Substitution will splice multiple expressions in place:</t>
	      <t>The evaluation of <spanx style="verb">( ( subst 1 ( rest 0 ) 2 ) 3 4 )</spanx>
	      must yield the same as <spanx style="verb">( 1 3 4 2 )</spanx>.</t>
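	      <t>
		The expansion of a substitution body can be sketched as follows. This is only a
		model under stated assumptions: forms are represented as Python lists, references
		as strings, and <spanx style="verb">arg</spanx> and <spanx
		style="verb">rest</spanx> forms as two-element lists. The function name <spanx
		style="verb">expand</spanx> is hypothetical, and the sketch covers neither nested
		substitution functions nor how the result is placed in the surrounding yield.
	      </t>
	      <figure><artwork><![CDATA[
```python
def expand(body: list, args: list) -> list:
    """Replace ( arg n ) and ( rest n ) in body using the arguments list.

    Model only: lists stand for forms, strings for references.
    """
    out = []
    for expr in body:
        if isinstance(expr, list):
            if len(expr) == 2 and expr[0] == "arg":
                out.append(args[expr[1]])        # one element, counted from zero
            elif len(expr) == 2 and expr[0] == "rest":
                out.extend(args[expr[1]:])       # spliced in place
            else:
                out.append(expand(expr, args))   # recurse into nested forms
        else:
            out.append(expr)
    return out


spliced = expand([1, ["rest", 0], 2], [3, 4])
inverse_of_2 = expand([["frac", 1, ["arg", 0]]], [2])
```
]]></artwork></figure>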
	    </section>
	  </section>
	</section>

	<section title="Strings and other typed byte arrays">
	  <section anchor="stringenc" title="Current encoding">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x10</spanx> (mnemonic: <spanx
		style="verb">stringenc</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( stringenc {enc}:Encoding )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This tells the processing application that, in the scope of this expression, all
	      expressions that are understood by the application as character strings will be
	      encoded with the encoding designated by <spanx style="verb">{enc}</spanx>.
	    </t>
	    <t>
	      As the abstract yield doesn't contain strings but expressions that will be used as
	      strings by the application, it is not a parsing error if the application doesn't
	      recognize <spanx style="verb">{enc}</spanx>. In this situation, it is a parsing error
	      when the application actually needs to decode a byte sequence as a string. It is not
	      a parsing error when a processing application only transmits a byte sequence encoding
	      a string, if it can accurately convey the encoding to the receiving application.
	    </t>
	  </section>

	  <section title="IANA registered character set">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x11</spanx> (mnemonic: <spanx
		style="verb">iana-charset</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( iana-charset {id}:Nat )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This designates the string encoding registered among the <xref
	      target="IANA-Charsets">IANA Character Sets</xref> whose MIBenum is <spanx
	      style="verb">{id}</spanx>.
	    </t>
	    <t>
	      Type: <spanx style="verb">Encoding</spanx>.
	    </t>
	  </section>

	  <section title="Windows code page">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x12</spanx> (mnemonic: <spanx
		style="verb">code-page</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( code-page {id}:Nat )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This designates the string encoding among Windows code pages whose identifier is
	      <spanx style="verb">{id}</spanx>.
	    </t>
	    <t>
	      Type: <spanx style="verb">Encoding</spanx>.
	    </t>
	  </section>


	    <section anchor="string" title="String">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x13</spanx> (mnemonic: <spanx
		  style="verb">string</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( string {string}:Bytes )</spanx></t>
		</list>
	      </t>
	      <t>This form indicates that the bytes encoded by <spanx style="verb">{string}</spanx>
	      are meant to be interpreted as a string encoded with the current string encoding.
	      </t>
	    </section>

	    <section anchor="string*" title="String with explicit encoding">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x14</spanx> (mnemonic: <spanx
		  style="verb">string*</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( string* {enc}:Encoding {string}:Bytes )</spanx></t>
		</list>
	      </t>
	      <t>This form indicates that the bytes encoded by <spanx style="verb">{string}</spanx>
	      are meant to be interpreted as a string encoded with the encoding designated by the
	      expression <spanx style="verb">{enc}</spanx>.
	      </t>
	    </section>

	    <section anchor="blob" title="Blob">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x15</spanx> (mnemonic: <spanx
		  style="verb">blob</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( blob {blob}:Bytes )</spanx></t>
		</list>
	      </t>
	      <t>
		This form indicates that the bytes encoded by <spanx style="verb">{blob}</spanx>
		are meant to be interpreted as just a raw sequence of bytes, not to be decoded.
	      </t>
	    </section>

	    <section title="Nested BULK stream">
	      <t>
		<list style="hanging">
		  <t hangText="name"><spanx style="verb">0x16</spanx> (mnemonic: <spanx
		  style="verb">nested-bulk</spanx>)</t>
		  <t hangText="shape"><spanx style="verb">( nested-bulk {embedded}:Boolean {bulk}:Bytes )</spanx></t>
		</list>
	      </t>
	      <t>
		This form indicates that the bytes encoded by <spanx style="verb">{bulk}</spanx>
		are meant to be interpreted as a BULK stream. If the stream doesn't start with a
		<spanx style="verb">version</spanx> form, the stream MUST be assumed to have the
		same version as the parent stream.
	      </t>
	      <t>
		This form can be useful to let the application reading a BULK stream skip parsing a
		large section.
	      </t>
	      <t>
		If <spanx style="verb">{embedded}</spanx> is <spanx style="verb">true</spanx>, the
		default semantics of this form is the same as the semantics of the BULK stream in
		<spanx style="verb">{bulk}</spanx>, with the exception described after the following
		example. These two forms have the same semantics:
		<list style="symbols">
		  <t><spanx style="verb">( 4 5 )</spanx></t>
		  <t><spanx style="verb">( nested-bulk true #[2] 4 5 )</spanx></t>
		</list>
	      </t>
	      <t>
		It could be a security risk if a single BULK stream could be parsed into two
		different abstract yields by two conformant applications, so the semantics of the
		whole stream cannot change whether <spanx style="verb">{bulk}</spanx> is decoded or
		not. For that reason, any effects in the nested stream that affect how BULK
		expressions are parsed or evaluated (like namespace associations or definitions)
		MUST be isolated within that form.
	      </t>
	      <t>
		For the same security reason, there isn't a <spanx style="verb">( bulk-with-size
		Nat Expr )</spanx> form because it would open up the same risk when the size given
		is not the size of the enclosed expression, accidentally or maliciously.
	      </t>
	    </section>
	</section>

	<section anchor="arithmetic" title="Arithmetic">
	  <t>
	    A processing application must recognize the type of all expressions defined in this
	    specification that have the type Number, but an application MAY consider a number as
	    having an unknown value if it has no adequate data type to store it.
	  </t>
	  <t>
	    In the text notation of a BULK stream, a decimal integer is the notation for the
	    smallest byte sequence that yields this integer as described in <xref
	    target="nat-repr"></xref>. For example, <spanx style="verb">( 31 256 )</spanx> is a
	    notation for the bytes <spanx style="verb">0x01 0x9F 0xC2-0100 0x02</spanx>.
	  </t>

	  <section anchor="unsigned-int" title="Unsigned integer">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x20</spanx> (mnemonic: <spanx
		style="verb">unsigned-int</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( unsigned-int {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      The bits contained in <spanx style="verb">{bits}</spanx> are the value of this integer
	      in binary notation. This form exists in case disambiguation of the semantics is
	      necessary.
	    </t>
	    <t>
	      Type: <spanx style="verb">Number</spanx>, <spanx style="verb">Int</spanx>, <spanx
	      style="verb">Nat</spanx>.
	    </t>
	  </section>

	  <section title="Signed integer">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x21</spanx> (mnemonic: <spanx
		style="verb">signed-int</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( signed-int {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      The bits contained in <spanx style="verb">{bits}</spanx> are the value of this integer
	      in two's-complement notation.
	    </t>
	    <t>
	      Type: <spanx style="verb">Number</spanx>, <spanx style="verb">Int</spanx>.
	    </t>
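	    <t>
	      The difference between the signed and unsigned readings of the same bytes can be
	      illustrated with the sketch below. Big-endian byte order is an assumption, inferred
	      from the <spanx style="verb">( 31 256 )</spanx> example above, and the helper names
	      are hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
def decode_unsigned(bits: bytes) -> int:
    """Value of {bits} read in plain binary notation (big-endian assumed)."""
    return int.from_bytes(bits, "big")


def decode_signed(bits: bytes) -> int:
    """Value of {bits} read in two's-complement notation (big-endian assumed)."""
    return int.from_bytes(bits, "big", signed=True)


same_bytes = b"\xff\xfe"
as_unsigned = decode_unsigned(same_bytes)
as_signed = decode_signed(same_bytes)
```
]]></artwork></figure>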
	  </section>

	  <section title="Fraction">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x22</spanx> (mnemonic: <spanx
		style="verb">frac</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( frac {num}:Int {div}:Int )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is the number <spanx style="verb">{num}</spanx>/<spanx
	      style="verb">{div}</spanx>.
	    </t>
	    <t>
	      Type: <spanx style="verb">Number</spanx>.
	    </t>
	  </section>

	  <section title="Binary floating-point number">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x23</spanx> (mnemonic: <spanx
		style="verb">binary-float</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( binary-float {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a floating-point number expressed in IEEE 754-2008 binary interchange
	      format. <spanx style="verb">{bits}</spanx> can be of size 16, 32, 64, 128 or any
	      bigger multiple of 32 bits, as per IEEE 754-2008 rules.
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Float</spanx>.
	    </t>
	  </section>

	  <section title="Decimal floating-point number">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x24</spanx> (mnemonic: <spanx
		style="verb">decimal-float</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( decimal-float {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a floating-point number expressed in IEEE 754-2008 decimal interchange
	      format. <spanx style="verb">{bits}</spanx> can be of size 32, 64, 128 or any bigger
	      multiple of 32 bits, as per IEEE 754-2008 rules.
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Float</spanx>.
	    </t>
	  </section>

	  <section title="Binary fixed point number">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x25</spanx> (mnemonic: <spanx
		style="verb">binary-fixed</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( binary-fixed {point}:Nat {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a fixed point binary number. <spanx style="verb">{bits}</spanx> contains an
	      integer in two's complement. That integer divided by 2^point is the value of this
	      form. For example, <spanx style="verb">( binary-fixed 2 15 )</spanx> has value <spanx
	      style="verb">3.75<sub>10</sub></spanx> (<spanx
	      style="verb">11.11<sub>2</sub></spanx>).
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Float</spanx>.
	    </t>
	  </section>

	  <section title="Decimal fixed point number">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x26</spanx> (mnemonic: <spanx
		style="verb">decimal-fixed</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( decimal-fixed {point}:Nat {bits}:Bytes )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a fixed point decimal number. <spanx style="verb">{bits}</spanx> contains an
	      integer in two's complement. That integer divided by 10^point is the value of this
	      form. For example, <spanx style="verb">( decimal-fixed 2 123 )</spanx> has value
	      <spanx style="verb">1.23</spanx>.
	    </t>
	    <t>
	      Types: <spanx style="verb">Number</spanx>, <spanx style="verb">Float</spanx>.
	    </t>
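	    <t>
	      The binary and decimal fixed point forms share the same arithmetic, differing only in
	      the base of the divisor. The sketch below computes the exact value as a fraction. It
	      assumes that <spanx style="verb">{bits}</spanx> is available as raw big-endian bytes.
	      The function name <spanx style="verb">fixed_value</spanx> is hypothetical.
	    </t>
	    <figure><artwork><![CDATA[
```python
from fractions import Fraction


def fixed_value(bits: bytes, point: int, base: int) -> Fraction:
    """Two's-complement integer in bits, divided by base ** point.

    Use base=2 for binary-fixed and base=10 for decimal-fixed.
    Big-endian byte order is an assumption of this sketch.
    """
    integer = int.from_bytes(bits, "big", signed=True)
    return Fraction(integer, base ** point)


binary_example = fixed_value(b"\x0f", 2, 2)     # 15 / 2^2
decimal_example = fixed_value(b"\x7b", 2, 10)   # 123 / 10^2
```
]]></artwork></figure>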
	  </section>

	  <section title="Decimal fixed point number with 2 decimal places">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x27</spanx> (mnemonic: <spanx
		style="verb">decimal2</spanx>)</t>
		<t hangText="value"><spanx style="verb">( subst ( decimal-fixed 2 ( arg 0 ) ) )</spanx></t>
	      </list>
	    </t>
	  </section>
	</section>

	<section title="Compact formats">
	  <t>
	    This specification and other specifications in the official BULK suite take the option
	    to use as their basic building block a form with a distinguishing reference as first
	    element (basically, they are a binary representation of an abstract syntax tree). As
	    noted previously, this means that most representations weigh 4 bytes plus their actual
	    content, which will in turn have some overhead because of one or several marker bytes.
	  </t>
	  <t>
	    But when there is a special need for compactness, BULK makes it possible to design
	    protocols and formats with different trade-offs, while retaining its property of being
	    parseable by processing applications not knowing the protocol in its entirety.
	  </t>
	  <t>
	    On one end of the spectrum, a format might choose to use an array to encapsulate an ad
	    hoc binary format. An extreme use of this scheme would be to use BULK just to make
	    explicit the binary format used. With a known profile (for example with a file extension
	    and/or media type for such explicitly typed BLOBs), such a BULK stream can consist
	    solely of the version form, a reference that describes the binary format and an array,
	    which would amount to an overhead between 11 bytes and 20 bytes depending on the size of
	    the content (11, 13, 14, 16 and 20 bytes for contents of no more than 63B, 255B, 65kB,
	    4GB and 18EB respectively). Without a profile, with the namespaces associations, the
	    overhead is between 28 and 37 bytes (the difference is a single <spanx
	    style="verb">import</spanx> form to import two namespaces: the one providing a form used
	    to identify namespaces, and the one for the format used in the stream).
	  </t>
	  <t>
	    Still, even this extreme in the design space retains the ability to insert expressions
	    in the BULK stream, whatever their type. Thus metadata can be added about data that is
	    represented in a format that doesn't allow for metadata or for limited metadata.
	  </t>
	  <t>
	    Between these two extremes, several options are available to produce a format that
	    leverages the BULK parser a lot more while being more compact than a basic BULK
	    format. The following forms provide a standard way to create such formats.
	  </t>
	  <t>
	    A flat list of operators and operands is called a BULK bytecode. Prefix bytecodes are
	    those where operators come before operands, postfix bytecodes are those where operators
	    come after operands. In the following forms, operators MUST be references.
	  </t>
	  <t>
	    The default semantics of a bytecode form is to transform it to an abstract syntax tree
	    of its content and then evaluate the resulting expression with the normal BULK
	    evaluation rules. When evaluating a bytecode form that doesn't provide arities, a
	    processing application MUST abort this transformation as soon as it encounters a
	    reference for which it cannot determine if it is an operator or its arity. When
	    evaluating a bytecode form that provides arities, any reference that is not known to be
	    an operator MUST be determined to be an operand.
	  </t>
	  <t>
	    To transform a prefix bytecode form, a processing application creates an alternate
	    context. If the first expression of the bytecode can be determined to be an operand, it
	    is removed from the beginning of the bytecode and appended at the end of the alternate
	    context. If the first expression of the bytecode is a reference that can be determined
	    to be an operator, it is removed from the beginning of the bytecode and a list is
	    created with the operator as the first expression, then as many next expressions as its
	    arity are removed from the beginning of the bytecode and appended at the end of this
	    list. Then that resulting list is appended at the end of the alternate context. The
	    transformation continues until the bytecode is empty, in which case the alternate
	    context replaces the bytecode form and the transformation is complete. The resulting
	    form can then be evaluated in turn.
	  </t>
	  <t> Example: the default semantics of </t>
	  <t><spanx style="verb">( prefix* ( ( 2 go:black ) ) go:game go:black 1 2
	  go:black 3 4 go:black 5 6 )</spanx> </t>
	  <t>is that it's transformed into</t>
	  <t><spanx style="verb">( go:game ( go:black 1 2 ) ( go:black 3 4 ) ( go:black 5 6 )
	  )</spanx> </t>
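	  <t>
	    As a minimal, non-normative sketch of the prefix transformation described above, the
	    following Python function assumes that references are modelled by a hypothetical Ref
	    class, that other expressions are plain Python values, and that arities are given as a
	    dictionary. None of these names are defined by this specification. It follows the
	    behaviour of a form that provides arities: any reference without a declared arity is
	    treated as an operand.
	  </t>
	  <t><figure><artwork><![CDATA[
```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a BULK reference such as go:black."""
    name: str


def transform_prefix(bytecode, arities):
    """Turn a flat prefix bytecode into the list of expressions it denotes.

    `arities` maps operator references to their arity; any other
    expression is treated as an operand.
    """
    remaining = deque(bytecode)
    context = []  # the alternate context being built
    while remaining:
        expr = remaining.popleft()
        arity = arities.get(expr) if isinstance(expr, Ref) else None
        if arity is None:
            context.append(expr)  # operand: appended as is
            continue
        if len(remaining) < arity:
            raise ValueError(f"{expr.name} needs {arity} operands")
        # Operator: group it with the next `arity` expressions.
        context.append([expr] + [remaining.popleft() for _ in range(arity)])
    return context


black, game = Ref("go:black"), Ref("go:game")
moves = transform_prefix(
    [game, black, 1, 2, black, 3, 4, black, 5, 6], {black: 2}
)
# moves is [game, [black, 1, 2], [black, 3, 4], [black, 5, 6]]
```
]]></artwork></figure></t>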
	  <t>
	    To transform a postfix bytecode form, a processing application creates a data stack. If
	    the first expression of the bytecode can be determined to be an operand, it is removed
	    from the beginning of the bytecode and pushed on top of the stack. If the first
	    expression of the bytecode can be determined to be an operator, it is removed from the
	    beginning of the bytecode and a list is created with the operator as the first
	    expression, then as many next expressions as its arity are popped from the stack and
	    appended at the end of this list (with the top of the stack as the last element). Then
	    that resulting list is pushed on top of the stack. The transformation continues until
	    the bytecode is empty, in which case the list of the elements on the stack (with the
	    top of the stack as the last element) replaces the bytecode form and the transformation
	    is complete. The resulting form can then be evaluated in turn.
	  </t>
	  <t> Example: the default semantics of </t>
	  <t><figure><artwork>( bulk:postfix*
  ( ( 2 go:black go:white go:comment go:alternative ) )
  go:game
  1 2 go:black
  "white tried an unorthodox opening" 3 4 go:white go:comment
  "a more classical opening would be" 8 9 go:white go:comment
  go:alternative
  2 3 go:black
  4 5 go:white )</artwork></figure></t>
	  <t>is that it's transformed into</t>
	  <t><figure><artwork>( go:game
  ( go:black 1 2 )
  ( go:alternative
    ( go:comment "white tried an unorthodox opening" ( go:white 3 4 ) )
    ( go:comment "a more classical opening would be" ( go:white 8 9 ) ) )
  ( go:black 2 3 )
  ( go:white 4 5 ) )</artwork></figure>
	  </t>
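	  <t>
	    The postfix transformation can be sketched in the same non-normative way. The Ref class
	    and the dictionary of arities are again assumptions made for illustration, not part of
	    this specification. A Python list plays the role of the data stack, with its last
	    element as the top of the stack.
	  </t>
	  <t><figure><artwork><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a BULK reference such as go:white."""
    name: str


def transform_postfix(bytecode, arities):
    """Turn a flat postfix bytecode into the list of expressions it denotes."""
    stack = []
    for expr in bytecode:
        arity = arities.get(expr) if isinstance(expr, Ref) else None
        if arity is None:
            stack.append(expr)  # operand: pushed as is
            continue
        if len(stack) < arity:
            raise ValueError(f"{expr.name} needs {arity} operands")
        split = len(stack) - arity
        operands = stack[split:]  # top of the stack ends up last
        del stack[split:]
        stack.append([expr] + operands)
    return stack


game, black, white = Ref("go:game"), Ref("go:black"), Ref("go:white")
comment, alt = Ref("go:comment"), Ref("go:alternative")
arities = {black: 2, white: 2, comment: 2, alt: 2}

tree = transform_postfix(
    [game,
     1, 2, black,
     "first", 3, 4, white, comment,
     "second", 8, 9, white, comment,
     alt,
     2, 3, black,
     4, 5, white],
    arities,
)
```
]]></artwork></figure></t>
	  <t>
	    With the shortened comment strings used here, the result has the same structure as the
	    transformed form shown above: the two comments are nested in a single alternative,
	    between the first and second moves by black.
	  </t>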
          <t>
	    The obvious advantage of postfix bytecode is that it makes it possible to compact
	    nested forms when they have a known arity. When a reference in a vocabulary can be used
	    in a form containing a variable number of expressions, if some arity is used frequently
	    enough, an application can define a specific form for it. The trade-offs for this are
	    explained in <xref target="arityForm"/>.
	  </t>
          <t>
	    If the overhead of several marker bytes in the operands of some operators is too much,
	    even more compactness can be achieved by packing together small operands. For example,
	    instead of an operator with two integers as its operands, one could specify an operator
	    to take a single word as operand and extract the integers from it (while still retaining
	    the ability to operate on many sizes of integers, because it can still deduce the size
	    of the integers by dividing the size of the word by two).
	  </t>
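	  <t>
	    As a hedged illustration of this packing, the following Python sketch extracts two
	    integers of equal width from a single packed operand. The function name is hypothetical,
	    and the sketch assumes unsigned integers in big-endian byte order.
	  </t>
	  <t><figure><artwork><![CDATA[
```python
def split_packed(word: bytes) -> tuple[int, int]:
    """Split one packed operand into two unsigned integers of equal width."""
    if not word or len(word) % 2:
        raise ValueError("packed operand must have a non-zero, even size")
    half = len(word) // 2  # each integer uses half of the word
    # Assumption: unsigned, big-endian integers.
    return (int.from_bytes(word[:half], "big"),
            int.from_bytes(word[half:], "big"))


# A 2-byte word holds two 1-byte coordinates.
x, y = split_packed(bytes([0x41, 0x5A]))
# A 4-byte word holds two 2-byte coordinates.
wide = split_packed(bytes([0x01, 0x00, 0x02, 0x00]))
```
]]></artwork></figure></t>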
	  <t>
	    For example, a BULK format representing player moves with a pair of coordinates on a
	    large board might represent a single move with the following shapes:
	  </t>
	  <t>
	    <list style="hanging">
	      <t hangText="basic (8 bytes)"><spanx style="verb">( game:black/2 #[1] 0x41 #[1] 0x5A
	      )</spanx></t>
	      <t hangText="packed basic (7 bytes)"><spanx style="verb">( game:black/1 #[2] 0x41 0x5A
	      )</spanx></t>
	      <t hangText="bytecode (6 bytes)"><spanx style="verb">game:black/2 #[1] 0x41 #[1]
	      0x5A</spanx></t>
	      <t hangText="packed bytecode (5 bytes)"><spanx style="verb">game:black/1 #[2] 0x41
	      0x5A</spanx></t>
	    </list>
	  </t>
	  <t>
	    The transformation defined for the bytecode forms makes it possible to mix literal
	    expressions and operations represented by a sequence of operators and operands. In the
	    previous scenario, for example, one might represent each alternating move by the two
	    players as two integers, lowering the weight of each move to 2 bytes as coordinates are
	    below 64:
	  </t>
	  <t><figure><artwork>( bulk:postfix*
  ( ( 2 go:white go:comment go:alternative ) )
  go:game
  1 2
  "white tried an unorthodox opening" 3 4 go:white go:comment
  "a more classical opening would be" 8 9 go:white go:comment
  go:alternative
  2 3
  4 5 )</artwork></figure></t>
          <t>
            The difference between all these schemes and an array is that you keep the ability to
            insert other forms, for example here to represent comments on the game or variants.
	  </t>
	  <t>
	    The cost of the bytecode format is that if it contains operators whose arity is unknown
	    to a processing application, the whole list after the first occurrence of them is
	    unreadable to that processing application, whereas in the basic format, the processing
	    application can still process all the forms it understands, and that requires no
	    anticipation by the application creating the BULK stream.
	  </t>

	  <section title="Prefix bytecode">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x30</spanx> (mnemonic: <spanx
		style="verb">prefix</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( prefix {bytecode} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a prefix bytecode form that doesn't provide arities.
	    </t>
	  </section>

	  <section title="Prefix bytecode with arities">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x31</spanx> (mnemonic: <spanx
		style="verb">prefix*</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( prefix* {arities}:Expr {bytecode}
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a prefix bytecode form that provides arities.
	    </t>
	    <t>
	      <spanx style="verb">{arities}</spanx> MUST be a list of shapes <spanx style="verb">(
	      {arity}:Nat {refs} )</spanx>. <spanx style="verb">{refs}</spanx> MUST be a series of
	      references. It indicates that all references in this series are operators of arity
	      <spanx style="verb">{arity}</spanx>. <spanx style="verb">{arities}</spanx> can be a
	      form or a reference defined to a list.
	    </t>
	    <t>
	      Within the prefix bytecode of this form, if there is a <spanx
	      style="verb">prefix</spanx> form, the arities declared in the outside form still
	      apply.
	    </t>
	  </section>

	  <section title="Postfix bytecode">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x32</spanx> (mnemonic: <spanx
		style="verb">postfix</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( postfix {bytecode} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a postfix bytecode form that doesn't provide arities.
	    </t>
	  </section>

	  <section title="Postfix bytecode with arities">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x33</spanx> (mnemonic: <spanx
		style="verb">postfix*</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( postfix* {arities}:Expr {bytecode}
		)</spanx></t>
	      </list>
	    </t>
	    <t>
	      This is a postfix bytecode form that provides arities.
	    </t>
	    <t>
	      <spanx style="verb">{arities}</spanx> MUST be a list of shapes <spanx style="verb">(
	      {arity}:Nat {refs} )</spanx>. <spanx style="verb">{refs}</spanx> MUST be a series of
	      references. It indicates that all references in this series are operators of arity
	      <spanx style="verb">{arity}</spanx>. <spanx style="verb">{arities}</spanx> can be a
	      form or a reference defined to a list.
	    </t>
	    <t>
	      Within the postfix bytecode of this form, if there is a <spanx
	      style="verb">postfix</spanx> form, the arities declared in the outside form still
	      apply.
	    </t>
	  </section>

	  <section title="Arity declaration">
	    <t>
	      <list style="hanging">
		<t hangText="name"><spanx style="verb">0x34</spanx> (mnemonic: <spanx
		style="verb">arity</spanx>)</t>
		<t hangText="shape"><spanx style="verb">( arity {arity}:Nat {refs} )</spanx></t>
	      </list>
	    </t>
	    <t>
	      <spanx style="verb">{refs}</spanx> MUST be a series of references. It indicates that
	      all references in this series are operators of arity <spanx
	      style="verb">{arity}</spanx>.
	    </t>
	    <t>
	      Whenever arities have been provided by this form for some references in a namespace,
	      all references in that namespace whose arities aren't provided MUST be determined to
	      be operands by a processing application.
	    </t>
	  </section>

	</section>
      </section>
    </section>

    <section title="Extension namespaces">
      <t>
	Extension namespaces are defined with a unique identifier, to be associated with a marker
	value.
      </t>
      <t>
	By its decentralized nature, as far as a processing application is concerned, apart from
	standard namespaces, there is no difference between a namespace defined as part of the
	official BULK suite and a user-defined one.
      </t>
    </section>

    <section title="Profiles">
      <t>
	A profile is a byte sequence parsed by a processing application just after the <spanx
	style="verb">version</spanx> form or before the first expression if there is no <spanx
	style="verb">version</spanx> form. Thus a parser SHOULD look ahead at the beginning of a
	stream to see if the first three bytes are <spanx style="verb">( bulk:version</spanx>. With
	respect to the BULK stream, the profile is out-of-band information, usually implicit.
      </t>
      <t>
	A processing application doesn't need to include the profile in the concrete yield, as long
	as the semantics of the abstract yield are maintained.
      </t>
      <t>
	The same BULK stream might be processed with different profiles.
      </t>
      <t>
	A processing application MUST NOT deduce the profile from the content of a BULK stream.
      </t>

      <section title="Profile redundancy">
	<t>
	  A processing application SHOULD only rely on the use of a profile when it is a safe
	  assumption that the profile is known, for example within a communication where the
	  protocol dictates the profile.
	</t>
	<t>
	  In particular, long-term storage of a BULK stream SHOULD preserve profile information, for
	  example with a media type that dictates the profile.
	</t>
	<t>
	  Otherwise, an application writing a BULK stream in a long-term storage SHOULD include the
	  profile after the version form. For this reason, the expressions in a profile SHOULD have
	  idempotent semantics.
	</t>
      </section>

      <section title="Standard profile">
	<t>
	  This specification defines the default profile that a processing application MUST use when
	  it is not using a specific profile:
	</t>
	<t>
	  <spanx style="verb">( bulk:stringenc ( bulk:iana-charset 106 ) )</spanx>
	</t>
	<t>
	  This means that the default string encoding in a BULK stream is UTF-8.
	</t>
      </section>
    </section>

    <section title="Security Considerations" anchor="sec">
      <section title="Parsing">
	<t>
	  Parsing a BULK stream is designed to be free of side-effects for the processing application,
	  apart from storing the parsed results.
	</t>
	<t>
	  Arrays in BULK carry their size, so that the application knows in advance the size of
	  the data to read and store, which makes it easier to build robust code. Malicious
	  software, however, may announce an array with a size chosen to get an application to
	  exhaust its available memory. When a BULK stream has been completely received, an array
	  bigger than the remaining data SHOULD trigger an error. When a BULK stream's size is not
	  known in advance, the application SHOULD use a growable data structure.
	</t>
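	<t>
	  The two recommendations above can be sketched as follows. This is a non-normative Python
	  illustration: the function name, its parameters and the chunk size are assumptions, and
	  the stream is assumed to be any object with a read method returning bytes.
	</t>
	<t><figure><artwork><![CDATA[
```python
import io


def read_array(stream, declared_size, remaining=None, chunk_size=64 * 1024):
    """Read an array payload without trusting its declared size.

    `remaining` is the number of bytes left in the stream when the stream
    has been completely received, or None when that is unknown.
    """
    if remaining is not None and declared_size > remaining:
        raise ValueError("array is bigger than the remaining data")
    payload = bytearray()  # grows only as data actually arrives
    while len(payload) < declared_size:
        chunk = stream.read(min(chunk_size, declared_size - len(payload)))
        if not chunk:
            raise EOFError("stream ended before the end of the array")
        payload += chunk
    return bytes(payload)


data = read_array(io.BytesIO(b"abcdef"), 4)
```
]]></artwork></figure></t>
	<t>
	  Because the buffer grows with the data received rather than with the announced size, an
	  oversized announcement in a truncated stream ends in an error instead of a large
	  allocation made up front.
	</t>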
	<t>
	  Evaluation opens up some known attacks that appear whenever a format provides a way to
	  express abstraction, like the billion laughs attack. As explained in <xref
	  target="eval" format="title"/>, an implementation MAY stop evaluation after a predefined
	  number of evaluation steps. As this has been demonstrated not to be sufficient to prevent
	  attacks based on expansion, an implementation SHOULD also put predefined limits on the
	  space that the abstract yield can take on disk or in memory.
	</t>
	<t>
	  Applications MAY use out-of-band information to select size limits (like HTTP
	  attributes), or a BULK namespace MAY provide hints. An implementation SHOULD emit
	  warnings when the size of the abstract yield would exceed the size limits set by such
	  out-of-band or in-band information.
	</t>
      </section>
      <section title="Forwarding">
	<t>
	  When a processing application forwards all or part of the data in a BULK stream to another
	  application, care must be taken if part of the forwarded data was not entirely recognized,
	  as it could be used by an attacker to benefit from the authority the forwarding
	  application has on the recipient of the data.
	</t>
      </section>
      <section title="Definitions">
	<t>
	  The architecture of a processing application SHOULD ensure that a malicious agent cannot
	  abuse authority given to it to define a namespace in order to modify associations in other
	  namespaces. Depending on the use of data structures storing BULK expressions, this could
	  amount to giving an attacker a way to manipulate the application's state. See <xref
	  target="robustNS"/> for an example of architecture that is resistant to that kind of
	  attack.
	</t>
      </section>
    </section>

    <section title="IANA Considerations">
      <t>
	This specification defines a new media type, application/bulk. Here is the information for
	its registration with IANA:
      </t>
      <t>
	<list style="hanging">
	  <t hangText="Type name">application</t>
	  <t hangText="Subtype name">bulk</t>
	  <t hangText="Required parameters">none</t>
	  <t hangText="Optional parameters">none</t>
	  <t hangText="Encoding considerations">none, content is self-describing</t>
	  <t hangText="Security considerations">cf. <xref target="sec"/></t>
	  <t hangText="Interoperability considerations">the constraint to start any BULK file with a
	  version form has the side-effect that classes of BULK streams can be identified by a
	  sequence of bytes acting as "magic number", at offset 0:
	  <list style="hanging">
	    <t hangText="0x011000">any BULK stream</t>
	    <t hangText="0x01100081">a BULK stream of major version 1</t>
	    <t hangText="0x011000818002">a BULK stream of version 1.0</t>
	  </list>
	  </t>
	  <t hangText="Published specification">this document</t>
	  <t hangText="Applications that use this media type">none so far</t>
	  <t hangText="Fragment identifier considerations">this specification defines no semantics
	  for addressing the data with a fragment identifier; a future specification MAY define
	  fragment identifier syntaxes to address the content by byte offset or the parsed results
	  by their position in the yielded list</t>
	  <t hangText="Additional information">a future specification MAY define a naming convention
	  for media types based on bulk with a +bulk suffix, as for XML with +xml</t>
	</list>
      </t>
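      <t>
	As a non-normative illustration of the magic numbers listed above, the following Python
	sketch classifies a byte sequence by its first bytes. The function and table names are
	hypothetical. The longest magic number is tested first so that the most specific class is
	reported.
      </t>
      <t><figure><artwork><![CDATA[
```python
# Magic numbers at offset 0, from most to least specific.
MAGIC_NUMBERS = [
    (bytes.fromhex("011000818002"), "BULK stream of version 1.0"),
    (bytes.fromhex("01100081"), "BULK stream of major version 1"),
    (bytes.fromhex("011000"), "BULK stream"),
]


def identify(data: bytes):
    """Return the most specific class matching `data`, or None."""
    for magic, label in MAGIC_NUMBERS:
        if data.startswith(magic):
            return label
    return None


kind = identify(bytes.fromhex("011000818002") + b"rest of the stream")
```
]]></artwork></figure></t>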
    </section>

    <section title="Acknowledgements">
      <t>
	The original author of this specification read <eref
	target="http://www.schnada.de/grapt/eriknaggum-xmlrant.html">Erik Naggum's famous rant
	about XML</eref> several years before, and while forgotten as such for a time, it clearly
	was the seed that slowly bloomed into the design of BULK. This format is dedicated to Erik.
      </t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <reference anchor="RFC2119">
        <front>
          <title>
            Key words for use in RFCs to Indicate Requirement Levels
          </title>
          <author initials="S." surname="Bradner" fullname="Scott Bradner">
            <organization>Harvard University</organization>
            <address><email>sob@harvard.edu</email></address>
          </author>
          <date month="March" year="1997"/>
        </front>
        <seriesInfo name="BCP" value="14"/>
        <seriesInfo name="RFC" value="2119"/>
      </reference>

      <reference anchor="IANA-Charsets" target="http://www.iana.org/assignments/character-sets">
        <front>
          <title>
	    IANA Charset Registry (archived at):
          </title>
	  <author/>
	  <date/>
        </front>
      </reference>

    </references>


    <references title="Informative references">

      <reference anchor="HTTP2">
	<front>
	  <title>Hypertext Transfer Protocol version 2 (HTTP/2)</title>
	  <author initials="M." surname="Belshe" fullname="Mike Belshe">
	    <organization>BitGo</organization>
	    <address>
	      <email>mike@belshe.com</email>
	    </address>
	  </author>

	  <author initials="R." surname="Peon" fullname="Roberto Peon">
	    <organization>Google, Inc</organization>
	    <address>
	      <email>fenix@google.com</email>
	    </address>
	  </author>

	  <author initials="M." surname="Thomson" fullname="Martin Thomson" role="editor">
	    <organization>Mozilla</organization>
	    <address>
	      <postal>
		<street>331 E Evelyn Street</street>
		<city>Mountain View, CA</city>
		<code>94041</code>
		<country>US</country>
	      </postal>
	      <email>martin.thomson@gmail.com</email>
	    </address>
	  </author>
	  <date month="May" year="2015"/>
	</front>
        <seriesInfo name="RFC" value="7540"/>
      </reference>

      <reference anchor="Avro" target="http://avro.apache.org/docs/1.7.4/spec.html">
	<front>
	  <title>Apache Avro™ 1.7.4 Specification</title>
	  <author initials="D." surname ="Cutting" fullname="Doug Cutting">
            <organization>Cloudera</organization>
	  </author>
	  <date month="February" year="2013"/>
	</front>
      </reference>

      <reference anchor="protobuf" target="https://developers.google.com/protocol-buffers/">
	<front>
	  <title>Protocol Buffers</title>
	  <author/>
	  <date month="July" year="2008"/>
	</front>
      </reference>

      <reference anchor="Smile" target="http://wiki.fasterxml.com/SmileFormat">
	<front>
	  <title>Smile Data Format</title>
	  <author initials="T." surname ="Saloranta" fullname="Tatu Saloranta">
	    <address><email>tsaloranta@gmail.com</email></address>
	  </author>
	  <date month="September" year="2010"/>
	</front>
      </reference>

      <reference anchor="Thrift" target="http://thrift.apache.org/static/files/thrift-20070401.pdf">
	<front>
	  <title>Thrift: Scalable Cross-Language Services Implementation</title>
	  <author initials="M." surname ="Slee" fullname="Mark Slee">
	    <organization>Facebook</organization>
	    <address><email>mcslee@facebook.com</email></address>
	  </author>
	  <author initials="A." surname ="Agarwal" fullname="Aditya Agarwal">
	    <organization>Facebook</organization>
	    <address><email>aditya@facebook.com</email></address>
	  </author>
	  <author initials="M." surname ="Kwiatkowski" fullname="Marc Kwiatkowski">
	    <organization>Facebook</organization>
	    <address><email>marc@facebook.com</email></address>
	  </author>
	  <date month="April" year="2007"/>
	</front>
      </reference>

    </references>

    <section anchor="robustNS" title="Robust namespace definition">
      <t>
	This constitutes a suggestion of architecture for a BULK processing application. It has the
	advantage that an agent cannot modify the values of names to which it has not specifically
	been given authority. This architecture doesn't ensure this property by checking the
	validity of definitions but by adhering to the Principle Of Least Authority, thus ensuring
	no false positives or TOCTOU race conditions.
      </t>
      <t>
	For each new context (including the abstract yield when parsing starts), the parser creates
	a new copy of each known namespace. These copies are available in this context to retrieve
	and define values. It implements the lexical scoping of definitions on top of providing the
	robustness properties discussed here.
      </t>
      <t>
	By default, all namespaces created in a context are discarded at the end of this context.
      </t>
      <t>
	Of course, an implementation of the architecture presented here can be optimized compared to
	the abstract algorithm, for example by using copy-on-demand.
      </t>
      <t>
	Any namespace that is not a copy made for its context, but the object retained by the
	application afterwards, gives authority to make long-lasting definitions. Such a namespace
	is called lasting here.
      </t>
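      <t>
	The copying scheme described above can be sketched as follows. This is a non-normative
	Python illustration under simplifying assumptions: namespaces are modelled as dictionaries
	from names to values, values are assumed to be immutable, and the function name is
	hypothetical.
      </t>
      <t><figure><artwork><![CDATA[
```python
def new_context(parent, lasting=frozenset()):
    """Return the namespaces visible in a new context.

    `parent` maps namespace identifiers to {name: value} tables.
    Namespaces listed in `lasting` are shared, so definitions persist;
    every other namespace is copied, so definitions end with the context.
    """
    return {
        ns_id: (table if ns_id in lasting else dict(table))
        for ns_id, table in parent.items()
    }


known = {"go": {"black": "original"}, "app": {}}

scratch = new_context(known)
scratch["go"]["black"] = "overridden"   # only affects the copy

trusted = new_context(known, lasting={"app"})
trusted["app"]["greeting"] = "hello"    # written to the lasting namespace
```
]]></artwork></figure></t>
      <t>
	In this sketch, the definition made through the copy is lost when the copy is dropped,
	while the definition made in the lasting namespace remains visible to the application.
      </t>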
      <section title="Selective authority">
	<t>
	  A number of lasting namespaces are included for the abstract yield. Their unique
	  identifiers are agreed out-of-band. The disadvantage of this solution is that it needs
	  prior agreement on the definable namespaces.
	</t>
      </section>
      <section title="Open authority">
	<t>
	  Any <spanx style="verb">ns</spanx> form for a unique identifier unknown to the processing
	  application triggers the creation of a lasting namespace.
	</t>
	<t>
	  The disadvantage of this solution is that it opens a denial of service vulnerability. If
	  Bob is a processing application and Carol and Dave are agents communicating with Bob with
	  an open authority, Dave can prevent Carol from defining a namespace if it manages to learn
	  the unique identifier and to start a communication with Bob before Carol does.
	</t>
	<t>
	  If an agent uses a secure way to create unique identifiers, this solution is both flexible
	  and safe (the burden is not on the BULK processing application).
	</t>
      </section>
    </section>

    <section anchor="arityForm" title="Arity-carrying forms">
      <t>
	Sometimes a vocabulary will include forms that can contain an arbitrary number of
	expressions. When such a form is used in postfix bytecode, the simplest solution is just to
	use a nested <spanx style="verb">postfix</spanx> form:
      </t>
      <t><figure><artwork>( bulk:postfix*
  ( ( 2 go:black go:white go:comment ) )
  go:game
  1 2 go:black
  ( bulk:postfix go:alternative
    "white tried an unorthodox opening" 3 4 go:white go:comment
    "a more classical opening would be" 8 9 go:white go:comment )
  2 3 go:black
  ( bulk:postfix go:alternative 
    "white played a bad move" 4 5 go:white go:comment
    "white could have played a decent move" 5 6 go:white go:comment
    "white could have played a great move" 5 7 go:white go:comment ) )</artwork></figure></t>
      <t>
        The nested <spanx style="verb">postfix</spanx> form costs 4 bytes, compared to an
	equivalent postfix bytecode.
      </t>
      <t>
	If those 4 bytes add up to too much space through repetition, an application could define a
	form for the sole purpose of assigning it an arity, while the evaluation of the
	arity-carrying form would just replace it with the original one. For example, after
	evaluating the postfix bytecode transformation and the resulting form of the last
	expression of
      </t>
      <t><figure><artwork>( bulk:ns-mnemonic 0x1800 "go2" )
( bulk:mnemonic/def 0x1801 "alt/2" nil ( bulk:subst ( alternative ( rest 0 ) ) ) )
( bulk:mnemonic/def 0x1802 "alt/3" nil ( bulk:subst ( alternative ( rest 0 ) ) ) )

( bulk:postfix*
  ( ( 2 go:black go:white go:comment go2:alt/2 ) ( 3 go2:alt/3 ) )
  go:game
  1 2 go:black
  "white tried an unorthodox opening" 3 4 go:white go:comment
  "a more classical opening would be" 8 9 go:white go:comment
  go2:alt/2
  2 3 go:black
  "white played a bad move" 4 5 go:white go:comment
  "white could have played a decent move" 5 6 go:white go:comment
  "white could have played a great move" 5 7 go:white go:comment
  go2:alt/3
  )</artwork></figure></t>
      <t>it would be transformed into</t>
      <t><figure><artwork>( go:game
  ( go:black 1 2 )
  ( go:alternative
    ( go:comment "white tried an unorthodox opening" ( go:white 3 4 ) )
    ( go:comment "a more classical opening would be" ( go:white 8 9 ) ) )
  ( go:black 2 3 )
  ( go:alternative
    ( go:comment "white played a bad move" ( go:white 4 5 ) )
    ( go:comment "white could have played a decent move" ( go:white 5 6 ) )
    ( go:comment "white could have played a great move" ( go:white 5 7 ) ) ) )</artwork></figure>
      </t>
      <t>
	Without the mnemonics, such an arity-carrying form basically costs 24 or 27 bytes to be
	usable, which means that, compared to the nested <spanx style="verb">postfix</spanx> form,
	it pays for itself if it is used at least 7 times.
      </t>
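      <t>
	The break-even point can be checked with a short calculation. The following Python sketch
	is non-normative and assumes that each use of an arity-carrying form saves exactly the 4
	bytes that a nested postfix form would cost. The function name is hypothetical.
      </t>
      <t><figure><artwork><![CDATA[
```python
def break_even_uses(definition_cost: int, saving_per_use: int = 4) -> int:
    """Smallest number of uses whose total saving covers the definition cost."""
    return -(-definition_cost // saving_per_use)  # ceiling division


uses = [break_even_uses(cost) for cost in (24, 27)]
# uses is [6, 7]
```
]]></artwork></figure></t>
      <t>
	Under that assumption, a 24-byte definition breaks even at 6 uses and a 27-byte definition
	needs 7, so 7 uses cover both cases.
      </t>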
    </section>

  </back>
</rfc>

