Great Circle Associates List-Managers
(June 2003)
 


Subject: Re: standards for iso encoding subject lines?
From: Nick Simicich <njs @ scifi . squawk . com>
Date: Tue, 10 Jun 2003 12:19:54 -0400
To: List Managers <list-managers @ greatcircle . com>
In-reply-to: <v0421010fbb06888a9c64 @ [66 . 92 . 48 . 201]>
References: <877k7zayxc . fsf @ windlord . stanford . edu> <5 . 2 . 1 . 1 . 0 . 20030606035859 . 07cf7d88 @ 199 . 74 . 151 . 1> <877k7zayxc . fsf @ windlord . stanford . edu>

At 11:09 AM 2003-06-06 -0700, Brent Chapman wrote:

At 10:38 AM -0700 6/6/03, Russ Allbery wrote:
Nick Simicich <njs @ scifi . squawk . com> writes:

> Just as a point:  This is a really poorly thought out RFC.  You might
> want to decode those in your MTA or mailing list manager before
> forwarding them to your subscribers.  You *can't* safely do so.

Why do you want to do that?  Certainly you're not allowed to do that
within that RFC because doing so would break the e-mail protocol.  The
whole reason why RFC 2047 exists is because 8-bit characters are not
allowed in RFC 822 and RFC 2822 headers.

Yes, but that does not mean that you have to put the encoded form of them there. Were I writing the standard, I would have required that the old headers contain the best possible rendering in seven-bit characters. If no such rendering existed, leave the header out or leave the clause out. The stuff that was encoded should have been elsewhere. Have a special header for it, or (probably better) put it in a new body section, or something. Mail headers have traditionally been human readable. This RFC violated that basic principle, and that, in and of itself, made it a bad idea.

Let's say I have an archive of email messages to a list, and want to create an index by subject, or to enable searches by subject. How am I supposed to do this if the Subject header is encoded, and I'm not supposed to decode it except for display?

I'm not sure. Consider that all handling of the data must be done in a binary-safe manner. The encoder could encode a binary zero byte, for example, which will screw up most typical string handling. Also, the data could well be in a double-byte character set. A given character might well have no 7-bit representation at all.
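As a minimal sketch of the binary zero problem, the following uses Python's standard email.header.decode_header on a made-up subject (the subject text is purely illustrative). The decoded payload contains a NUL byte, so anything that treats it as a C-style string sees only the part before the NUL.

```python
from email.header import decode_header

# Hypothetical subject whose encoded-word carries a NUL byte (=00).
raw_subject = "=?iso-8859-1?Q?safe=00hidden?="

# decode_header returns a list of (payload, charset) pairs.
payload, charset = decode_header(raw_subject)[0]

# What a NUL-terminated (C-style) string routine would see.
c_string_view = payload.split(b"\x00", 1)[0]

print(payload)        # b'safe\x00hidden'
print(charset)        # iso-8859-1
print(c_string_view)  # b'safe'
```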

A parser that assumes every character fits in a normal (say) C string representation is probably already broken. A parser that tokenizes into words is likely broken too. A space is not a space, punctuation is not punctuation, a word character is not a word character, and a non-word character is not a non-word character, except in the context of the character set... which is not constant for the document, if I remember right.
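One concrete illustration, offered as a sketch rather than anything from the original thread: in Shift_JIS (chosen here only as an example of a double-byte character set) the second byte of some characters is 0x5C, which is a backslash in ASCII. A tokenizer that works on raw bytes will split such a character in half, while decoding with the declared character set first does not.

```python
# "表" is one character, but its Shift_JIS encoding is the two bytes
# 0x95 0x5C, and 0x5C is also the ASCII code for a backslash.
encoded = "表".encode("shift_jis")

# A byte-oriented tokenizer that treats 0x5C as punctuation cuts the
# character in half.
naive_tokens = encoded.split(b"\\")

# Decoding with the declared character set first avoids the problem:
# the resulting text contains no backslash at all.
text = encoded.decode("shift_jis")
has_backslash = "\\" in text
```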

I have considered the simple scheme of looking for this type of encoding in the headers and bouncing such messages back to their senders, for the lists I run which are supposed to be English. The thing that stops me is that, just as in the case of HTML body sections, the people writing the e-mail don't always know what they did to cause the encoding to be generated.

This was the original straw that caused me to write demime, and this is why I would want to do something similar for encoded headers. But the standard makes it impossible.

If your parser can handle all of these things (including, if I remember correctly, character set changes within the line) then it can index things encoded in this manner.
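For an archive index, one possible approach (a sketch, assuming Python's standard email package is acceptable and that the index works on Unicode text) is to decode each encoded-word with its own declared character set and join the results. The function name subject_to_text and the sample subject are hypothetical.

```python
from email.header import decode_header

def subject_to_text(raw_subject):
    """Decode a raw Subject header into Unicode text for indexing.

    Each encoded-word may name a different character set, so every
    piece is decoded on its own before the pieces are joined.
    """
    pieces = []
    for payload, charset in decode_header(raw_subject):
        if isinstance(payload, bytes):
            try:
                pieces.append(payload.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown character set name: keep whatever is ASCII.
                pieces.append(payload.decode("ascii", errors="replace"))
        else:
            # No encoded-words at all: decode_header hands back a str.
            pieces.append(payload)
    return "".join(pieces)

# Two encoded-words in one line, each with a different character set.
raw = "Re: =?iso-8859-1?Q?caf=E9?= =?utf-8?B?4oKs?= menu"
index_terms = subject_to_text(raw).casefold().split()
```

The stored message keeps its encoded header untouched; only the index sees the decoded form.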

Similar problem with Base64-encoded message bodies (jeez, I hate Outlook).

There you have a different solution, in that you can decode them and re-encode in QP. Outlook is not the only guilty party there, though.
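A minimal sketch of that re-encoding step, using Python's standard base64 and quopri modules. The function name base64_body_to_qp is hypothetical, and the sketch assumes the part is text and that the caller separately rewrites the Content-Transfer-Encoding header to quoted-printable.

```python
import base64
import quopri

def base64_body_to_qp(b64_body: bytes) -> bytes:
    """Re-encode a Base64 text body as quoted-printable.

    Only the body bytes are handled here; updating the part's
    Content-Transfer-Encoding header is left to the caller.
    """
    raw = base64.b64decode(b64_body)
    return quopri.encodestring(raw)

# Illustrative body: Latin-1 text that was sent as Base64.
original = "Caf\u00e9 menu".encode("iso-8859-1")
qp_body = base64_body_to_qp(base64.b64encode(original))
```

The ASCII portion of the result stays readable as-is, and only the non-ASCII byte is escaped.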

I agree with Nick here: ISO-encoded subject lines are a "solution" to a non-problem, where the people putting forth the solution apparently didn't think through most of the consequences of what they were proposing.

I believe that the problem was real enough, especially for the people who use multi-byte character sets. However, I don't correspond with people who write in those character sets, because I don't read any language that requires them, so I don't see it as a problem myself.

What I am extremely affronted by is the encoding of ordinary subject lines simply because (I've seen this) someone selected ISO 8859-1 and then didn't use any characters that are not represented in ASCII. More frequently, someone simply picks the wrong tick mark, and this throws the whole header into encoding.
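The needlessly encoded case can at least be detected. The sketch below (hypothetical function name unwrap_if_ascii, with an assumed allow-list of ASCII-compatible character sets) returns the plain text only when every decoded byte is printable ASCII, and otherwise returns the header unchanged. Whether a list manager should rewrite headers this way is a separate question; this only shows the check.

```python
from email.header import decode_header

# Assumption: character sets in which bytes 0x20-0x7E mean plain ASCII.
ASCII_COMPATIBLE = {None, "us-ascii", "iso-8859-1", "iso-8859-15",
                    "windows-1252", "utf-8"}

def unwrap_if_ascii(raw_subject: str) -> str:
    """Return plain text if the encoding was unnecessary, else the input."""
    parts = decode_header(raw_subject)
    if all(isinstance(payload, str) for payload, _ in parts):
        return raw_subject  # nothing was encoded in the first place
    chunks = []
    for payload, charset in parts:
        if charset not in ASCII_COMPATIBLE:
            return raw_subject
        if any(byte < 0x20 or byte > 0x7E for byte in payload):
            return raw_subject  # control or 8-bit data: leave it alone
        chunks.append(payload)
    text = b"".join(chunks).decode("ascii")
    # Do not emit text that would itself look like an encoded-word.
    return raw_subject if "=?" in text else text
```

For example, unwrap_if_ascii("Re: =?iso-8859-1?Q?Hello_world?=") gives back "Re: Hello world", while a subject that really contains an 8-bit character is returned untouched.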

I feel that the standard should have forbidden the encoding of non-printable characters. There is no good reason to encode a newline, for example, or a control character, or a binary zero, nor are there good reasons to have the same in subject lines -- in ASCII, these characters may not be in the text of the subject for a simple reason: they are used by the parsers to delimit the line. If a character does not resolve to a printable code point in the character set used, the standard should permit the message to be bounced/trashed/non-delivered, at the least. And if someone has a handler that can locally deal with eight-bit characters in the headers, the decoded header should not break header parsing that is based on CR/LF or LF ending a "line", a space at the beginning of a line showing a continuation, and a blank line ending the headers. It should be possible to store a decoded header in an eight-bit-safe message store without breaking reparsing of the header. (In other words, there were good reasons for forbidding these things, and the standard should not have made this a free-for-all.)
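A local policy along these lines could be sketched as follows. The function name decoded_subject_is_safe is hypothetical, and "printable" is approximated here (an assumption, not something the standard defines) as "not in the Unicode control, surrogate, or unassigned categories". A list manager could bounce or hold any message for which the check fails.

```python
import unicodedata
from email.header import decode_header

# Assumption: these Unicode general categories count as non-printable.
FORBIDDEN_CATEGORIES = {"Cc", "Cs", "Cn"}

def decoded_subject_is_safe(raw_subject: str) -> bool:
    """True if the decoded subject has no CR, LF, NUL or other controls."""
    for payload, charset in decode_header(raw_subject):
        if isinstance(payload, bytes):
            try:
                text = payload.decode(charset or "ascii")
            except (LookupError, UnicodeDecodeError):
                # Unknown character set, or bytes invalid within it.
                return False
        else:
            text = payload
        if any(unicodedata.category(ch) in FORBIDDEN_CATEGORIES for ch in text):
            return False
    return True

# An encoded CR/LF that would start a new header line once decoded.
injected = "=?iso-8859-1?Q?a=0D=0AX-Injected:_1?="
ok = decoded_subject_is_safe(injected)
```

Here ok is False, because the decoded text contains a carriage return and a line feed.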

--
He said: "There are people from Baath here reporting everything that
goes on. There are cameras here recording our faces. If the Americans
were to withdraw and everything were to return to the way it was before,
we want to make sure that we survive the massacre that would follow
as Baath go house to house killing anyone who voiced opposition to
Saddam. In public, we always pledge our allegiance to Saddam, but in
our hearts we feel something else."
Nick Simicich - njs @ scifi . squawk . com