Internet-Draft SMTPUTF8 address syntax September 2024
Gulbrandsen & Yao Expires 22 March 2025
Workgroup:
mailmaint
Internet-Draft:
draft-gulbrandsen-smtputf8-syntax-00
Published:
Intended Status:
Standards Track
Expires:
22 March 2025
Authors:
A. Gulbrandsen
ICANN
J. Yao
CNNIC

SMTPUTF8 address syntax

Abstract

This document specifies rules for email addresses that are flexible enough to express the addresses typically used with SMTPUTF8, while avoiding confusing or risky elements.

This is one of a pair of documents. This one is simple to implement, contains only globally viable rules, and is intended to be usable by software such as an MTA. Its companion has more complex rules, takes regional usage into account, and aims to allow only addresses that are readable and cut-and-pastable to some audience.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 22 March 2025.

1. Introduction

[RFC6530]-[RFC6533] and [RFC6854]-[RFC6858] extend various aspects of the email system to support non-ASCII characters in both localparts and domain parts. In addition, some email software supports Unicode in domain parts by using encoded domain parts in the SMTP transaction ("RCPT TO:info@xn--dmi-0na.fo") and presenting the Unicode version (dømi.fo in this case) in the user interface.
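The conversion between the encoded and the Unicode form of a domain part can be illustrated with a short sketch. It assumes Python's built-in "idna" codec, which implements the older IDNA rules of [RFC3490] rather than [RFC5891]; it is an illustration only, not a description of what any particular email software does.

```python
# Illustration only: round-trip a domain between its Unicode form and its
# ASCII-encoded form with Python's built-in "idna" codec (IDNA2003, RFC 3490).
unicode_domain = "d\u00f8mi.fo"  # dømi.fo

# Encoded form, as used in the SMTP transaction by software without SMTPUTF8.
ascii_domain = unicode_domain.encode("idna").decode("ascii")
print(ascii_domain)  # xn--dmi-0na.fo

# Unicode form, as presented in the user interface.
display_domain = ascii_domain.encode("ascii").decode("idna")
print(display_domain)  # dømi.fo
```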

The email address syntax extension is in [RFC6532], and allows almost all UTF-8 strings as localparts. While this certainly allows everything users want to use, it is also flexible enough to allow many things that users and implementers find surprising and sometimes worrying.

The flexibility has caused considerable reluctance to support the full syntax in contexts such as web form address validation.

This document attempts to describe rules that:

  1. Include the addresses that users generally want to use for themselves and organizations want to provision for their employees.

  2. Exclude things that have been described as security risks.

  3. Look safe at first glance to implementers (including ones with little Unicode expertise) and are fairly easy to use in unit tests.

  4. Contain no regional rules.

These goals are somewhat aspirational.

2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Terminology

Script, in this document, refers to the Unicode script property (see [UAX24]). Each code point is assigned to one script ("a" is Latin), except that some are assigned to "Common" or a few other special values. Fraktur and /etc/rc.local aren't scripts in this document, but Latin is.

Latin refers to those code points that have the script property "Latin" in Unicode. It also refers to combinations of those code points and combining characters, and to strings that contain no code points from other scripts. Orléans in France and Münster in Germany both have Latin names in this document.

Han, Cyrillic etc. refer to those code points that have the respective script property in Unicode, as well as to strings that contain no code points from other scripts.

ASCII refers to the first 128 code points within Unicode, which includes the letters A-Z but not É or Ü. It also refers to strings that contain only ASCII code points.

Non-ASCII refers to Unicode code points except the first 128, and also to strings that contain at least one such code point.

By way of example, the address info@dømi.fo is Latin and non-ASCII, its localpart is Latin and ASCII, and its domain part is Latin and non-ASCII. 中国 is a Han string in this document, but 阿Q正传 is neither a Latin string nor a Han string, because it contains a Latin Q and three Han code points.
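The terms above can be tried out with a small sketch. Python's standard library does not expose the script property of [UAX24], so the hypothetical helper script_of() below guesses the script from the first word of the code point's name. That is an assumption made for illustration; real software would use the Unicode script data itself.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough guess at the script of one code point (NOT the real UAX #24 data)."""
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return "Inherited"  # combining marks take the script of their base
    if not category.startswith("L"):
        return "Common"     # digits, punctuation, symbols and so on
    first_word = unicodedata.name(ch, "UNKNOWN").split()[0]
    return "Han" if first_word == "CJK" else first_word.capitalize()


def scripts_in(s: str) -> set:
    """Scripts used in a string, ignoring the special values."""
    return {script_of(ch) for ch in s} - {"Common", "Inherited"}


print(scripts_in("info@d\u00f8mi.fo"))       # {'Latin'}
print("info".isascii())                      # True: the localpart is ASCII
print("d\u00f8mi.fo".isascii())              # False: the domain part is non-ASCII
print(scripts_in("中国"))                     # {'Han'}
print(sorted(scripts_in("阿Q正传")))          # ['Han', 'Latin']
```

A string is a Latin string or a Han string in the sense of this section when scripts_in() returns exactly that one script.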

4. Rules

Based on the above goals, the following rules are formulated:

  1. An address MUST NOT contain an a-label (e.g. xn--dmi-0na).

  2. An address MUST contain only code points in the PRECIS IdentifierClass.

  3. An address MUST consist entirely of a sequence of composite characters, ZWJ and ZWNJ. ("c" followed by "combining hook below" is an example of a composite character, "d" is another example; see [RFC6365] for the definition.)

  4. An address MUST NOT contain more than one script, disregarding ASCII. (Disregarding ASCII, the word Orléans contains only an é, which is one script, namely Latin.)
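Rules 1 and 3 lend themselves to a short sketch. The function names are hypothetical, and the check for rule 3 is only a rough reading of the composite character definition in [RFC6365]: it rejects control, format and unassigned code points other than ZWJ and ZWNJ, and rejects a combining mark that has nothing before it. Rule 2 is not covered, since it needs the PRECIS tables.

```python
import unicodedata

ZWNJ, ZWJ = "\u200c", "\u200d"


def has_a_label(address: str) -> bool:
    """Rule 1: True if any label of the domain part starts with 'xn--'."""
    domain = address.rpartition("@")[2]
    return any(label.lower().startswith("xn--") for label in domain.split("."))


def only_composite_characters(address: str) -> bool:
    """Rule 3, approximated (an assumption, see the text above)."""
    seen_base = False
    for ch in address:
        if ch in (ZWNJ, ZWJ):
            continue
        category = unicodedata.category(ch)
        if category.startswith("C"):
            return False  # control, format, unassigned, private use, surrogate
        if category.startswith("M") and not seen_base:
            return False  # combining mark with no base character before it
        seen_base = True
    return True


print(has_a_label("info@xn--dmi-0na.fo"))                 # True
print(has_a_label("info@d\u00f8mi.fo"))                   # False
print(only_composite_characters("example@example.com"))   # True
print(only_composite_characters("\u200e@\u200f.\u200e"))  # False
```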

5. Examples

example@example.com is legal, because 1) it does not contain any a-label, 2) it consists entirely of permissible code points, 3) it consists of 19 composite characters, and 4) it contains no non-ASCII code points at all.

The address dømi@dømi.fo is legal, because 1) it does not contain any a-label, 2) it consists entirely of permissible code points, 3) it consists of 12 composite characters, and 4) it consists entirely of 'Latin' and 'Common' code points (and ./@).

The address U+200E '@' U+200F '.' U+200E is not legal, because 3) U+200E and U+200F are not parts of composite characters.

阿Q正传@阿Q正传.example is legal because it contains ASCII and Han, dømi@dømi.fo is legal because it contains ASCII and Latin, but 阿Q正传@dømi.fo is illegal because it contains the Han 阿 and the Latin non-ASCII letter ø.
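The three addresses above can be checked against rule 4 with a sketch along the following lines. The function names are hypothetical, and the script is guessed from the first word of the code point's name because Python's standard library lacks the [UAX24] script property; that shortcut is an assumption which happens to work for these examples.

```python
import unicodedata


def non_ascii_scripts(address: str) -> set:
    """Scripts of the non-ASCII letters in an address (rough guess by name)."""
    scripts = set()
    for ch in address:
        # Rule 4 disregards ASCII; non-letters are treated as Common here.
        if ch.isascii() or not unicodedata.category(ch).startswith("L"):
            continue
        first_word = unicodedata.name(ch, "UNKNOWN").split()[0]
        scripts.add("Han" if first_word == "CJK" else first_word.capitalize())
    return scripts


def uses_single_script(address: str) -> bool:
    """Rule 4: at most one script, disregarding ASCII."""
    return len(non_ascii_scripts(address)) <= 1


print(uses_single_script("阿Q正传@阿Q正传.example"))  # True
print(uses_single_script("dømi@dømi.fo"))            # True
print(uses_single_script("阿Q正传@dømi.fo"))          # False
```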

TODO: add more examples and rationales again.

6. IANA Considerations

This document does not require any actions from the IANA.

7. Security Considerations

When a program renders a Unicode string on-screen or audibly and includes a substring supplied by a potentially malevolent source, the included substring can affect the rendering of a surprisingly large part of the overall string.

This document describes rules that make it difficult for an attacker to use email addresses for such an attack. Implementers should be aware of other possible vectors for the same kind of attack, such as subject fields and email address display-names.

If an address is signed using DKIM and (against the rules of this document) mixes left-to-right and right-to-left writing, parts of both the localpart and the domain part can be rendered on the same side of the '@'. This can create the appearance that a different domain signed the message.
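Software that wants to flag such addresses before display could look for characters with strong left-to-right and strong right-to-left directionality in the same address. The sketch below is an assumption about how such a check might look, using the bidirectional class reported by Python's unicodedata module. It is not one of the rules in Section 4, and the function name is hypothetical.

```python
import unicodedata

STRONG_LTR = {"L"}        # e.g. Latin, Han, Cyrillic letters
STRONG_RTL = {"R", "AL"}  # e.g. Hebrew and Arabic letters


def mixes_directions(address: str) -> bool:
    """True if the address has both strong LTR and strong RTL characters."""
    classes = {unicodedata.bidirectional(ch) for ch in address}
    return bool(classes & STRONG_LTR) and bool(classes & STRONG_RTL)


print(mixes_directions("info@example.com"))             # False
print(mixes_directions("abc\u05d0\u05d1@example.com"))  # True
```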

The rules in this document permit a number of code points that can make it difficult to cut and paste.

8. References

8.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC5322]
Resnick, P., Ed., "Internet Message Format", RFC 5322, DOI 10.17487/RFC5322, <https://www.rfc-editor.org/rfc/rfc5322>.
[RFC6365]
Hoffman, P. and J. Klensin, "Terminology Used in Internationalization in the IETF", BCP 166, RFC 6365, DOI 10.17487/RFC6365, <https://www.rfc-editor.org/rfc/rfc6365>.
[RFC6530]
Klensin, J. and Y. Ko, "Overview and Framework for Internationalized Email", RFC 6530, DOI 10.17487/RFC6530, <https://www.rfc-editor.org/rfc/rfc6530>.
[RFC6532]
Yang, A., Steele, S., and N. Freed, "Internationalized Email Headers", RFC 6532, DOI 10.17487/RFC6532, <https://www.rfc-editor.org/rfc/rfc6532>.
[RFC6533]
Hansen, T., Ed., Newman, C., and A. Melnikov, "Internationalized Delivery Status and Disposition Notifications", RFC 6533, DOI 10.17487/RFC6533, <https://www.rfc-editor.org/rfc/rfc6533>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/rfc/rfc8174>.

8.2. Informative References

[RFC3490]
Faltstrom, P., Hoffman, P., and A. Costello, "Internationalizing Domain Names in Applications (IDNA)", RFC 3490, DOI 10.17487/RFC3490, <https://www.rfc-editor.org/rfc/rfc3490>.
[RFC5891]
Klensin, J., "Internationalized Domain Names in Applications (IDNA): Protocol", RFC 5891, DOI 10.17487/RFC5891, <https://www.rfc-editor.org/rfc/rfc5891>.
[RFC6854]
Leiba, B., "Update to Internet Message Format to Allow Group Syntax in the "From:" and "Sender:" Header Fields", RFC 6854, DOI 10.17487/RFC6854, <https://www.rfc-editor.org/rfc/rfc6854>.
[RFC6858]
Gulbrandsen, A., "Simplified POP and IMAP Downgrading for Internationalized Email", RFC 6858, DOI 10.17487/RFC6858, <https://www.rfc-editor.org/rfc/rfc6858>.
[UAX24]
Whistler, K., "Unicode Script Property", n.d., <https://unicode.org/reports/tr24>.
[UMLAUT]
"Metal Umlaut", n.d., <https://en.wikipedia.org/wiki/Metal_umlaut>.
[TYPE_EMAIL]
"WHATWG input type=email", n.d., <https://html.spec.whatwg.org/multipage/input.html#email-state-(type=email)>.

Appendix A. Acknowledgments

The authors wish to thank John C. Klensin, [your name here, please] [oh wow, the ack section is already outdated]

Dømi.fo and 例子.中国 are reserved by nic.fo and CNNIC for use in examples and documentation.

阿Q正传 is a famous Chinese novella; 阿Q is its main character.

Appendix B. Instructions to the RFC editor

Please remove all mentions of the Protocol Police before publication (including this sentence).

Please remove the Open Issues section.

Appendix C. Open issues

  1. PRECIS IdentifierClass?

  2. More examples.

  3. Wording to identify destiny; I think this should probably become a proposed standard and modify a couple of RFCs, but I'm uncertain about some details and left that open now.

  4. More words on the relationship between this and the companion. There are several parallel differences, maybe this warrants a section of its own.

  5. Should this even mention the requirements placed on domains by IDNA, ICANN, web browsers and others?

Authors' Addresses

Arnt Gulbrandsen
ICANN
6 Rond Point Schumann, Bd. 1
1040 Brussels
Belgium
Jiankang Yao
CNNIC
No.4 South 4th Zhongguancun Street
Beijing
100190
China