<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>
<?xml-stylesheet type="text/xsl" href="rfc2629.xslt" ?>
<!-- generated by https://github.com/cabo/kramdown-rfc version 1.7.19 (Ruby 3.0.2) -->
<rfc xmlns:xi="http://www.w3.org/2001/XInclude" ipr="trust200902" docName="draft-illyes-repext-01" category="info" consensus="true" submissionType="IETF" tocInclude="true" sortRefs="true" symRefs="true" version="3">
  <!-- xml2rfc v2v3 conversion 3.23.2 -->
  <front>
    <title abbrev="REPext for URI level">Robots Exclusion Protocol Extension for URI Level Control</title>
    <seriesInfo name="Internet-Draft" value="draft-illyes-repext-01"/>
    <author fullname="Gary Illyes">
      <organization>Google LLC.</organization>
      <address>
        <email>garyillyes@google.com</email>
      </address>
    </author>
    <date year="2024" month="October" day="18"/>
    <keyword>robots.txt</keyword>
    <abstract>
      <?line 42?>

<t>This document extends RFC9309 by specifying additional URI-level controls through an application-level header and HTML meta tags originally developed in 1996. Additionally, it moves the response header out of the experimental header space (i.e., "X-") and defines the combinability of multiple headers, which was previously not possible.</t>
    </abstract>
    <note removeInRFC="true">
      <name>About This Document</name>
      <t>
        The latest revision of this draft can be found at <eref target="https://garyillyes.github.io/ietf-rep-ext/draft-illyes-repext.html"/>.
        Status information for this document may be found at <eref target="https://datatracker.ietf.org/doc/draft-illyes-repext/"/>.
      </t>
      <t>Source for this draft and an issue tracker can be found at
        <eref target="https://github.com/garyillyes/ietf-rep-ext"/>.</t>
    </note>
  </front>
  <middle>
    <?line 46?>

<section anchor="introduction">
      <name>Introduction</name>
      <t>While the Robots Exclusion Protocol enables service owners to control how, if at all, automated clients known as crawlers may access the URIs on their services as defined by <xref target="RFC8288"/>, the protocol does not provide controls over how the data returned by their service may be used once access is allowed.</t>
      <t>That use case is left to URI-level controls, originally developed in 1996 and widely adopted since, which are implemented in response headers or, in the case of HTML, as a meta tag. This document specifies these control tags and, in the case of the response header field, brings it into compliance with <xref target="RFC9110"/>.</t>
      <t>Application developers are requested to honor these tags. The tags are not, however, a form of access authorization.</t>
    </section>
    <section anchor="conventions-and-definitions">
      <name>Conventions and Definitions</name>
      <t>The key words "<bcp14>MUST</bcp14>", "<bcp14>MUST NOT</bcp14>", "<bcp14>REQUIRED</bcp14>", "<bcp14>SHALL</bcp14>", "<bcp14>SHALL
NOT</bcp14>", "<bcp14>SHOULD</bcp14>", "<bcp14>SHOULD NOT</bcp14>", "<bcp14>RECOMMENDED</bcp14>", "<bcp14>NOT RECOMMENDED</bcp14>",
"<bcp14>MAY</bcp14>", and "<bcp14>OPTIONAL</bcp14>" in this document are to be interpreted as
described in BCP 14 <xref target="RFC2119"/> <xref target="RFC8174"/> when, and only when, they
appear in all capitals, as shown here.</t>
      <?line -18?>

</section>
    <section anchor="specification">
      <name>Specification</name>
      <section anchor="robots-control">
        <name>Robots control</name>
        <t>URI-level crawler controls are key-value pairs that can be specified in two ways:</t>
        <ul spacing="normal">
          <li>
            <t>an application-level response header.</t>
          </li>
          <li>
            <t>in the case of HTML, one or more meta tags as defined by the HTML specification.</t>
          </li>
        </ul>
        <section anchor="application-layer-response-header">
          <name>Application Layer Response Header</name>
          <t>The application-level response header field name is "robots-tag". Its value contains rules that apply either to all accessors or to specifically named ones. For historical reasons, implementors should also support the experimental field name "x-robots-tag".</t>
          <t>The value is a semicolon-separated (";", 0x3B, 0x20) list of key-value pairs, each of which carries a comma-separated list of rules. The rules apply either to a single product token as defined by <xref target="RFC9309"/> or to the global identifier "*". The global identifier may be omitted. A product token is separated from its rules by "=".</t>
          <t>Duplicate product tokens must be merged and the rules deduplicated.</t>
          <artwork><![CDATA[
; key-value definition for the robots-tag response header field
robots-tag = "robots-tag" ":" OWS robots-tag-values
robots-tag-values = *( value ";" OWS )
value = [ ( global-product-token / product-token ) "=" ] [ rules ]
global-product-token = "*"
product-token = 1*( %x2D / %x41-5A / %x5F / %x61-7A )
rules = rule *( "," OWS rule )
rule = "noindex" / "nosnippet"
OWS = *( SP / HTAB )
]]></artwork>
          <t>For example, the following response header field specifies the "noindex" and "nosnippet" rules for all accessors, but specifies no rules for the product token "ExampleBot":</t>
          <artwork><![CDATA[
Robots-Tag: *=noindex, nosnippet; ExampleBot=;
]]></artwork>
          <t>The global product identifier "*" in the value may be omitted; for example, this field is equivalent to the previous example:</t>
          <artwork><![CDATA[
Robots-Tag: noindex, nosnippet; ExampleBot=;
]]></artwork>
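          <t>As a non-normative illustration, the following Python sketch shows one way a consumer might turn such a field value into a mapping from product token to rules. The function name parse_robots_tag, the use of "*" as the key for an omitted global identifier, and the lowercasing of product tokens are assumptions of this sketch and are not defined by this document. The sketch merges duplicate product tokens, deduplicates rules, and ignores unsupported rules.</t>
          <sourcecode type="python"><![CDATA[
SUPPORTED_RULES = {"noindex", "nosnippet"}

def parse_robots_tag(value):
    """Map each product token ("*" if global) to its set of rules."""
    rules_by_token = {}
    for pair in value.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # empty segment, e.g. after a trailing ";"
        if "=" in pair:
            token, _, rules_text = pair.partition("=")
            token = token.strip().lower() or "*"
        else:
            # No "=": the global identifier was omitted.
            token, rules_text = "*", pair
        # setdefault merges duplicate product tokens into one entry.
        rules = rules_by_token.setdefault(token, set())
        for rule in rules_text.split(","):
            rule = rule.strip().lower()
            if rule in SUPPORTED_RULES:  # ignore unsupported rules
                rules.add(rule)
    return rules_by_token
]]></sourcecode>
          <t>With this sketch, both example field values above produce the same mapping: "*" maps to the rules "noindex" and "nosnippet", and "examplebot" maps to an empty set of rules.</t>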
          <t>Implementors <bcp14>SHOULD</bcp14> impose a parsing limit on the field value to protect their systems. The parsing limit <bcp14>MUST</bcp14> be at least 8 kibibytes <xref target="KiB"/>.</t>
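          <t>A minimal, non-normative Python sketch of such a limit check follows. The names MAX_FIELD_VALUE_BYTES and within_parsing_limit are hypothetical, and the sketch assumes the field value is available as raw bytes. What an implementation does with a field value that exceeds its limit is not specified here and is left to the implementor.</t>
          <sourcecode type="python"><![CDATA[
MIN_ALLOWED_LIMIT = 8 * 1024  # 8 kibibytes
MAX_FIELD_VALUE_BYTES = MIN_ALLOWED_LIMIT  # assumed local choice

def within_parsing_limit(raw_value, limit=MAX_FIELD_VALUE_BYTES):
    """Return True if the raw field value fits within the limit."""
    if limit < MIN_ALLOWED_LIMIT:
        raise ValueError("parsing limit must be at least 8 KiB")
    return len(raw_value) <= limit
]]></sourcecode>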
        </section>
        <section anchor="html-meta-element">
          <name>HTML meta element</name>
          <t>For historical reasons the robots-tag header may be specified by service owners as an HTML meta tag. In case of the meta tag, the name attribute is used to specify the product token, and the content attribute to specify the comma separated robots-tag rules.</t>
          <t>As with the header, the product token may be a global token, "robots", which signifies that the rules apply to all requestors, or a specific product token applicable to a single requestor. For example:</t>
          <artwork><![CDATA[
<meta name="robots" content="noindex">
<meta name="examplebot" content="nosnippet">
]]></artwork>
          <t>Multiple robots meta elements may appear in a single HTML document. Requestors must obey the combined set of rules specified for their own product token and for the global product token.</t>
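          <t>The combination of rules across several meta elements can be sketched in Python as follows. This is a non-normative illustration: the function name effective_rules and its input format, a list of (name, content) attribute pairs already extracted from the document, are assumptions of this sketch, as is the reading of the combination as a set union.</t>
          <sourcecode type="python"><![CDATA[
def effective_rules(metas, product_token):
    """Combine the rules for "robots" and for product_token.

    metas is an iterable of (name, content) attribute pairs taken
    from the meta elements of one HTML document.
    """
    wanted = {"robots", product_token.lower()}
    rules = set()
    for name, content in metas:
        if name.strip().lower() not in wanted:
            continue  # addressed to a different requestor
        for rule in content.split(","):
            rule = rule.strip().lower()
            if rule:
                rules.add(rule)
    return rules
]]></sourcecode>
          <t>For the two meta elements in the example above, this sketch yields "noindex" and "nosnippet" for the product token "examplebot", and only "noindex" for any other product token.</t>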
        </section>
        <section anchor="robots-controls-rules">
          <name>Robots controls rules</name>
          <t>The possible values of the rules are:</t>
          <ul spacing="normal">
            <li>
              <t>noindex - instructs the parser to not store the served data in its publicly accessible index.</t>
            </li>
            <li>
              <t>nosnippet - instructs the parser to not reproduce any stored data as an excerpt snippet.</t>
            </li>
          </ul>
          <t>The values are case-insensitive. Unsupported rules must be ignored.</t>
          <t>Implementors may support other rules as specified in Section 2.2.4 of <xref target="RFC9309"/>.</t>
        </section>
        <section anchor="caching-of-values">
          <name>Caching of values</name>
          <t>The rules specified for a specific product token must be obeyed until the rules have changed. Implementors <bcp14>MAY</bcp14> use standard cache control as defined in <xref target="RFC9110"/> for caching robots-tag rules. Implementors <bcp14>SHOULD</bcp14> refresh their caches within a reasonable time frame.</t>
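          <t>As a non-normative illustration, the following Python sketch derives the freshness of cached rules from the max-age directive of a Cache-Control field value. The function names and the fallback of 86400 seconds are assumptions of this sketch; this document does not define what a reasonable time frame is. Other cache directives are deliberately not handled.</t>
          <sourcecode type="python"><![CDATA[
def max_age_seconds(cache_control):
    """Return the max-age directive in seconds, or None if absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            if value.isascii() and value.isdigit():
                return int(value)
    return None

def is_fresh(fetched_at, now, cache_control, default_ttl=86400):
    """Decide whether rules fetched at fetched_at are still usable.

    fetched_at and now are timestamps in seconds. default_ttl is an
    assumed fallback used when no max-age directive is present.
    """
    ttl = max_age_seconds(cache_control)
    if ttl is None:
        ttl = default_ttl
    return now - fetched_at < ttl
]]></sourcecode>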
        </section>
      </section>
    </section>
    <section anchor="security-considerations">
      <name>Security Considerations</name>
      <t>The robots-tag is not a substitute for valid content security measures. To control access to the URI paths in a robots.txt file, users of the protocol should employ a valid security measure relevant to the application layer on which the robots.txt file is served — for example, in the case of HTTP, HTTP Authentication as defined in <xref target="RFC9110"/>.</t>
      <t>The content of the robots-tag header field is not secure, private or integrity-guaranteed, and due caution should be exercised when using it. Use of Transport Layer Security (TLS) with HTTP (<xref target="RFC9110"/> and <xref target="RFC2817"/>) is currently the only end-to-end way to provide such protection.</t>
      <t>In case of a robots-tag specified in a HTML meta element, implementors should consider only the meta elements specified in the head element of the HTML document, which is generally only accessible to the service owner.</t>
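      <t>A non-normative Python sketch of this restriction, using the standard library's html.parser module, is shown below. The class name HeadMetaCollector is hypothetical. The sketch assumes the document contains an explicit head start tag, and it ignores any further head element that appears after the first one has ended, since such markup may originate from untrusted content in the body.</t>
      <sourcecode type="python"><![CDATA[
from html.parser import HTMLParser

class HeadMetaCollector(HTMLParser):
    """Collect (name, content) pairs of meta elements in the head."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.head_done = False
        self.metas = []

    def _leave_head(self):
        self.in_head = False
        self.head_done = True

    def handle_starttag(self, tag, attrs):
        if tag == "head" and not self.head_done:
            self.in_head = True
        elif tag == "body":
            self._leave_head()
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            name, content = attr.get("name"), attr.get("content")
            if name and content is not None:
                self.metas.append((name.lower(), content))

    def handle_endtag(self, tag):
        if tag == "head":
            self._leave_head()
]]></sourcecode>
      <t>The collected pairs can then be filtered by product token. Meta elements that appear in the body of the document are never collected.</t>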
      <t>To protect against memory overflow attacks, implementors should enforce a limit on how much data they will parse; see <xref target="application-layer-response-header"/> for the lower limit.</t>
    </section>
    <section anchor="iana-considerations">
      <name>IANA Considerations</name>
      <t><tt>
TODO(illyes):
https://www.rfc-editor.org/rfc/rfc9110.html#name-field-name-registry
</tt></t>
    </section>
  </middle>
  <back>
    <references anchor="sec-combined-references">
      <name>References</name>
      <references anchor="sec-normative-references">
        <name>Normative References</name>
        <reference anchor="RFC2817">
          <front>
            <title>Upgrading to TLS Within HTTP/1.1</title>
            <author fullname="R. Khare" initials="R." surname="Khare"/>
            <author fullname="S. Lawrence" initials="S." surname="Lawrence"/>
            <date month="May" year="2000"/>
            <abstract>
              <t>This memo explains how to use the Upgrade mechanism in HTTP/1.1 to initiate Transport Layer Security (TLS) over an existing TCP connection. [STANDARDS-TRACK]</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="2817"/>
          <seriesInfo name="DOI" value="10.17487/RFC2817"/>
        </reference>
        <reference anchor="RFC8288">
          <front>
            <title>Web Linking</title>
            <author fullname="M. Nottingham" initials="M." surname="Nottingham"/>
            <date month="October" year="2017"/>
            <abstract>
              <t>This specification defines a model for the relationships between resources on the Web ("links") and the type of those relationships ("link relation types").</t>
              <t>It also defines the serialisation of such links in HTTP headers with the Link header field.</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="8288"/>
          <seriesInfo name="DOI" value="10.17487/RFC8288"/>
        </reference>
        <reference anchor="RFC9110">
          <front>
            <title>HTTP Semantics</title>
            <author fullname="R. Fielding" initials="R." role="editor" surname="Fielding"/>
            <author fullname="M. Nottingham" initials="M." role="editor" surname="Nottingham"/>
            <author fullname="J. Reschke" initials="J." role="editor" surname="Reschke"/>
            <date month="June" year="2022"/>
            <abstract>
              <t>The Hypertext Transfer Protocol (HTTP) is a stateless application-level protocol for distributed, collaborative, hypertext information systems. This document describes the overall architecture of HTTP, establishes common terminology, and defines aspects of the protocol that are shared by all versions. In this definition are core protocol elements, extensibility mechanisms, and the "http" and "https" Uniform Resource Identifier (URI) schemes.</t>
              <t>This document updates RFC 3864 and obsoletes RFCs 2818, 7231, 7232, 7233, 7235, 7538, 7615, 7694, and portions of 7230.</t>
            </abstract>
          </front>
          <seriesInfo name="STD" value="97"/>
          <seriesInfo name="RFC" value="9110"/>
          <seriesInfo name="DOI" value="10.17487/RFC9110"/>
        </reference>
        <reference anchor="RFC9309">
          <front>
            <title>Robots Exclusion Protocol</title>
            <author fullname="M. Koster" initials="M." surname="Koster"/>
            <author fullname="G. Illyes" initials="G." surname="Illyes"/>
            <author fullname="H. Zeller" initials="H." surname="Zeller"/>
            <author fullname="L. Sassman" initials="L." surname="Sassman"/>
            <date month="September" year="2022"/>
            <abstract>
              <t>This document specifies and extends the "Robots Exclusion Protocol" method originally defined by Martijn Koster in 1994 for service owners to control how content served by their services may be accessed, if at all, by automatic clients known as crawlers. Specifically, it adds definition language for the protocol, instructions for handling errors, and instructions for caching.</t>
            </abstract>
          </front>
          <seriesInfo name="RFC" value="9309"/>
          <seriesInfo name="DOI" value="10.17487/RFC9309"/>
        </reference>
        <reference anchor="RFC2119">
          <front>
            <title>Key words for use in RFCs to Indicate Requirement Levels</title>
            <author fullname="S. Bradner" initials="S." surname="Bradner"/>
            <date month="March" year="1997"/>
            <abstract>
              <t>In many standards track documents several words are used to signify the requirements in the specification. These words are often capitalized. This document defines these words as they should be interpreted in IETF documents. This document specifies an Internet Best Current Practices for the Internet Community, and requests discussion and suggestions for improvements.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="2119"/>
          <seriesInfo name="DOI" value="10.17487/RFC2119"/>
        </reference>
        <reference anchor="RFC8174">
          <front>
            <title>Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words</title>
            <author fullname="B. Leiba" initials="B." surname="Leiba"/>
            <date month="May" year="2017"/>
            <abstract>
              <t>RFC 2119 specifies common key words that may be used in protocol specifications. This document aims to reduce the ambiguity by clarifying that only UPPERCASE usage of the key words have the defined special meanings.</t>
            </abstract>
          </front>
          <seriesInfo name="BCP" value="14"/>
          <seriesInfo name="RFC" value="8174"/>
          <seriesInfo name="DOI" value="10.17487/RFC8174"/>
        </reference>
      </references>
      <references anchor="sec-informative-references">
        <name>Informative References</name>
        <reference anchor="KiB" target="https://simple.wikipedia.org/wiki/Kibibyte">
          <front>
            <title>KibiByte</title>
            <author>
              <organization/>
            </author>
            <date year="2022" month="October" day="14"/>
          </front>
        </reference>
      </references>
    </references>
    <?line 166?>

<section numbered="false" anchor="acknowledgments">
      <name>Acknowledgments</name>
      <t>TODO acknowledge.</t>
    </section>
  </back>

</rfc>
