About two weeks ago I entertained myself a bit with nested trees in databases. To get some experience with them, I imported an mbox file containing postings to a high-traffic mailing list into MySQL (with the help of some PEAR packages).

I soon ran into trouble when I had to decide by which criteria the threading tree should be built. The most obvious choice is of course to use the Message-ID: and In-Reply-To: headers. But this turned out to be impractical, because some mail clients don't set In-Reply-To: correctly when sending a reply.

As a result I additionally took the References: header into account, which usually contains a bunch of Message IDs, the last of which is the Message ID that should be in In-Reply-To:. But even this didn't work out perfectly, because there are clients that set neither In-Reply-To: nor References:. (Not to mention the people who don't hit the "Reply" button to write a response but use the "New mail" button instead.)
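
To make that lookup order concrete, here is a minimal Python sketch (not the PEAR-based import code, which isn't shown here). The function name `parent_id`, the plain-dict representation of the headers, and the choice to take the first ID found in In-Reply-To: are all assumptions made for illustration.

```python
import re

# Matches one angle-bracketed Message ID such as <abc123@example.org>.
MSG_ID = re.compile(r"<[^<>\s]+>")

def parent_id(headers):
    """Return the parent's Message ID, or None if no header reveals it.

    `headers` is assumed to be a plain dict mapping header names to
    their raw string values.
    """
    # First choice: In-Reply-To:, if the client bothered to set it.
    in_reply_to = MSG_ID.findall(headers.get("In-Reply-To", ""))
    if in_reply_to:
        return in_reply_to[0]
    # Fallback: the last Message ID listed in References:.
    references = MSG_ID.findall(headers.get("References", ""))
    if references:
        return references[-1]
    # Neither header is usable: the message looks like a new thread.
    return None
```

A reply that carries neither header, or a response written via "New mail", comes back as `None` here and therefore ends up as the root of its own thread.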

OK, I know that both of these fields are optional (according to RFC 822), but what's so hard about using them nonetheless? It looks like they really do expect everybody to use a combination of In-Reply-To:, References: and the subject to perform threading. And while this is of course technically feasible, it will inevitably result in malformed threads when there are two different threads with the same subject. Great, great, great.
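
The subject collision is easy to reproduce. The following Python sketch assumes a very naive subject fallback: strip reply prefixes, lowercase, and group by what is left. The names `normalize_subject` and `group_by_subject`, the prefix list, and the sample messages are made up for illustration; real clients differ in the details.

```python
import re
from collections import defaultdict

# Leading reply/forward markers such as "Re:", "AW:" or "Fwd:", possibly repeated.
REPLY_PREFIX = re.compile(r"^\s*((re|aw|fwd?)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Strip reply prefixes and case so replies compare equal to the original."""
    return REPLY_PREFIX.sub("", subject).strip().lower()

def group_by_subject(messages):
    """Group (message_id, subject) pairs by normalized subject."""
    groups = defaultdict(list)
    for msg_id, subject in messages:
        groups[normalize_subject(subject)].append(msg_id)
    return dict(groups)

messages = [
    ("<1@a.example>", "Help needed"),      # first thread
    ("<2@a.example>", "Re: Help needed"),  # reply in the first thread
    ("<3@b.example>", "help needed"),      # unrelated thread, same subject
]
groups = group_by_subject(messages)
```

All three Message IDs land in the single group `"help needed"`, so the unrelated third posting gets glued onto the first thread.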
Written on 16 Jul 03 10:46 PM.