Last update: 2004-01-12

Project Mail::Box final report

[Improving Mail::Box — concluded in 2003]

[Mark Overmeer, MARKOV Solutions, 4 January 2004]

Introduction

In early 2002, the foundation Stichting NLnet offered Mark Overmeer a grant to improve the development and user support of the Mail::Box software he had developed. The project started in May 2002, and was extended with a follow-up in January 2003. In December 2003, NLnet's involvement ends with this final project paper.

Mail::Box is an Open Source software library for the Perl programming language. It is designed to help (mainly system administrators) implement automatic e-mail processing. For instance, it can be used to create automatic replies on incoming e-mail, do spam and virus filtering, and implement web-based e-mail clients. The software is released under GPL.

This document describes the contributions to Mail::Box realized with the generous grant of Stichting NLnet. After completion of this sponsored project, the module will continue to be supported on a voluntary basis.

Software Development

One of the areas covered by the project plan was pure software development, targeted to provide more "weight" to the library. Development took place on different aspects, of which the most important are listed below.

Main concerns during the development were

  • What do people want to do with automated e-mail scripts?
  • How to provide the user with an interface which is so simple that mistakes against the rules (the RFCs) are hard to make.
  • How do people want to do things? There are many different ways to construct a reply message, but in the library they should be handled in one place with a uniform and simple interface. Designing this interface is one of the most complex aspects of software development.
  • All actions are without any user interaction: it must be able to run in stand-alone scripts. This requires a flexible approach on the input. For instance, the user expects to create an "in-lined" reply on an arriving text message, however a nested multi-part message with only binary information comes in... Still, the reply should be correct and look nice.
  • The implementation must be platform independent and secure.
  • Everything must be documented, and the source code understandable.

These criteria are the main difference between the other e-mail related Perl modules and Mail::Box: other software only provides low level operations on simple messages, and expects users to create the smarter functionality.

Releases

The software is released under the GNU General Public License (GPL), which means that everybody is free to use and redistribute it. Fixes and extensions made to the software must be made public, which in this case actually resulted in some contributions.

A "release often" policy was chosen as the release model: new releases are made frequently. Each release contains bug-fixes and new features. Feedback about a release is swiftly handled to improve the next release.

For a project with only one developer, it is not practical to use a system with a development branch and a stable branch. When a new release doesn't break existing code, the need for a stable branch is even lower. Mail::Box therefore used a single track release schedule.

When the NLnet project took off, Mail::Box 2.012 was current. At the end of the supported period, release 2.052 was reached. This gives an average of one release every two weeks, forty releases in total. Only a few of these releases were emergency fixes, which is a consequence of the release policy.
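The release average quoted above can be checked with a few lines of arithmetic. This is only a sketch in Python: it assumes the release numbers 2.012 through 2.052 were consecutive (consistent with the "forty releases" stated above) and uses the 19-month project duration mentioned elsewhere in this report.

```python
# Back-of-the-envelope check of the release cadence.
# Assumption: patch numbers were consecutive, so 2.012 -> 2.052 is 40 releases.
first_patch = 12          # Mail::Box 2.012, current when the project started
last_patch = 52           # Mail::Box 2.052, reached at the end of the project
releases = last_patch - first_patch

project_months = 19       # May 2002 through December 2003
project_weeks = project_months * 52 / 12

weeks_per_release = project_weeks / releases
print(f"{releases} releases, one every {weeks_per_release:.1f} weeks")
```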

New features

The next sections will give some details about major developments during the time span of the project funded by NLnet.

New folder types

Folders (also named mailboxes) are used to store groups of e-mail messages. When the project started, only the Mbox and MH folder types were supported. These are the most popular types on Unix systems. However, the target of Mail::Box is to create a folder type independent interface, which is usable on as many different platforms as possible.

Maildir support was the first one to be added. It was already in a rough implementation at the start of the project, and finished in one of the early releases: version 2.015, June 2002.

In a joint effort with Liz Mattijsen, a POP3 connector was implemented. This full implementation of the POP3 protocol has nice extra features, like invisibly reconnecting lost connections to the server. POP3 was first released with version 2.027, October 2002.

Likewise, based on software from Tassilo von Parseval, access to Outlook DBX folders was implemented. The closed nature of this Microsoft format makes it impossible to provide write access to these folders, but read access is enough to move e-mail archives away from Outlook into Open Source folder types. Included since version 2.042, May 2003.

David J. Kernen wrote an IMAP4 protocol handler. That handler was used to create a back-end to that popular folder type, where the messages reside on a remote server. Not all features are implemented yet, but an alpha version was released in December 2003.

The main complication of the implementation is the level of portability and auto-configuration which should take place. In general, with Mail::Box the user's program doesn't need to know how the messages are stored. At the same time, much effort is put in optimizing performance in folder type dependent ways.

Message construction

New ways to create and process messages were added. In all these extensions, the implementation tries to protect the user from mistakes against the official rules of the RFCs. It is hard to grasp the content of the (at least) five e-mail related standards, so people should be kept away from that when possible.

Next to the reply, forward, and build actions --which already existed but saw some improvements--, new read and rebuild actions were added. "Rebuild" can be used to add plain text alternatives to HTML message parts, remove structural complexity, and such. "Rebuild" was added in release 2.041, May 2003.

Unicode

Due to its origin in the United States, e-mail uses 7-bit ASCII. However, the current e-mail standards support the encoding of character sets used in other parts of the world. Mail::Box received a simple way to decode and encode these "foreign" characters, although this has a negative effect on performance. Unicode features have been added over time, and cannot be pinpointed to a certain release.
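To illustrate what decoding and encoding such characters involves, the following sketch uses Python's standard email.header module. It is not the Mail::Box interface; the helper names decode_field and encode_field are made up for this example. It shows how non-ASCII text travels through 7-bit headers as "encoded words".

```python
from email.header import Header, decode_header, make_header

def decode_field(raw):
    """Turn encoded words in a header value into a readable Unicode string."""
    return str(make_header(decode_header(raw)))

def encode_field(text, charset="utf-8"):
    """Produce a 7-bit safe header value for text with non-ASCII characters."""
    return Header(text, charset).encode()

# A name as it may appear on the wire, in ISO-8859-1 with quoted-printable.
subject = decode_field("=?iso-8859-1?q?J=F6rg?=")

# Encoding goes the other way: the result contains only ASCII characters.
wire = encode_field("Grüße")
print(subject, wire, decode_field(wire))
```

The extra decoding step on every header access is one reason such convenience costs performance.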

Field groups

Each (MIME compliant) e-mail message starts with a set of lines describing the content of the message and the transportation process. Mail::Box got ways to handle sets of these lines which are related.

For instance, a header may contain multiple Received lines, showing the intermediate computers which were used to transport the message from the sender to the destination. These lines can be accompanied by a few more lines. The lines are not very useful, and can hence be removed per group or all together, saving disk space. But they can also easily be inspected and constructed. ResentGroups are included since release 2.023, September 2002.

Header fields added by mailing-list software can be recognized (and removed) as well as fields produced by spam-fighting software. A connection to the popular SpamAssassin software was made. More use in the area of spam-fighting software is expected, as a result of the ever growing amount of unsolicited e-mail. ListGroups were introduced with release 2.044 in July 2003. SpamGroups were added in 2.048, released in August 2003.
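The idea of treating related header lines as one group can be sketched with Python's standard email package. This is an illustration only, not how Mail::Box implements its groups: the helpers field_group and remove_group are hypothetical, the sample message is invented, and fields are grouped simply by name prefix.

```python
from email import message_from_string

# Invented sample message with trace, mailing-list, and spam-filter fields.
raw = (
    "Received: from a.example by b.example; Mon, 1 Sep 2003 10:00:00 +0200\n"
    "Received: from c.example by a.example; Mon, 1 Sep 2003 09:59:00 +0200\n"
    "List-Id: <demo.example.org>\n"
    "List-Post: <mailto:demo@example.org>\n"
    "X-Spam-Status: No, hits=0.1\n"
    "From: sender@example.org\n"
    "Subject: demo\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)

def field_group(message, prefix):
    """Names of all header fields whose name starts with the given prefix."""
    prefix = prefix.lower()
    return [name for name in message.keys() if name.lower().startswith(prefix)]

def remove_group(message, prefix):
    """Delete every field in the group; del removes all lines with that name."""
    for name in set(field_group(message, prefix)):
        del message[name]

hops = msg.get_all("Received", [])       # inspect the transport trace first
list_fields = field_group(msg, "List-")  # fields added by list software
remove_group(msg, "Received")            # then drop the trace to save space
```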

Web-based clients

Maybe the most important application for Perl based e-mail scripting has to do with web-based mail applications. To simplify this task for system developers, the module HTML::FromMail was created. It provides template systems to implement a web-mail client, taking care of many complicated tasks. The thing left for the mail-client developer is to add interactivity and layout. HTML::FromMail saw daylight in October 2003.

Performance

Various approaches were taken to improve performance. For one: many methods in the library implement alternatives, and the user can decide which alternative is fastest in his or her case. Usually, the automatic decision is the best.

A good example of this smart behavior is the way a message's body is stored while the program runs. It can be stored as one large string, a set of lines, or as a temporary file on disk. For the functionality it's all the same, but which kind of storage performs best depends on the size of the message and the way it is used. The best way to keep the body data is simply to remember where to find it, and not get it at all unless needed. This lazy behavior has been implemented in many of the library's features.
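The lazy approach can be sketched as follows. This is a minimal Python illustration of the general technique, not the library's actual implementation: the class LazyBody and its attributes are invented for this example. The object only remembers where the body lives in the folder file, and reads it on first use.

```python
import os
import tempfile

class LazyBody:
    """Remembers where a message body lives; reads it only on first use."""

    def __init__(self, path, offset, length):
        self.path, self.offset, self.length = path, offset, length
        self._data = None   # nothing has been read from disk yet
        self.loads = 0      # counts real disk reads, for illustration only

    def lines(self):
        if self._data is None:          # first access: fetch the bytes now
            with open(self.path, "rb") as fh:
                fh.seek(self.offset)
                self._data = fh.read(self.length)
            self.loads += 1
        return self._data.decode("ascii").splitlines()

# Write a tiny one-message "folder" to a temporary file.
header = b"Subject: hello\n\n"
body = b"line one\nline two\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(header + body)
    folder_path = tmp.name

msg_body = LazyBody(folder_path, len(header), len(body))
loads_before = msg_body.loads    # still 0: creating the object read nothing
first = msg_body.lines()         # triggers the single disk read
second = msg_body.lines()        # served from memory
os.unlink(folder_path)
```

A script that only looks at headers never pays for reading the bodies at all.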

Besides the lazy implementation, there have been some real performance improvements. The two best gains (up to 20% gain each) were reached by rewriting existing Perl components. An optional message parser, implemented in C, has a benefit as well. The C parser's first release became available in December 2002.

Mail::Box was developed for automated e-mail handling, but in its current state, and on new PCs, it is fast enough for user applications. Some functionality required for interactive applications could be added to simplify the task of developers of user clients.

Documentation

The best way to improve the acceptance of software by a community, is with better documentation. Mail::Box has grown quite large, and it will never be easy to learn how to use such a large library. But without good documentation you are lost for sure.

The standard way of documenting Perl was not sufficient for a code base of this size. Therefore, Perl's documentation system was extended with new syntax and a new tool to produce a homogeneous set of manual pages. These pages are indexed in various ways, and contain many examples and detailed explanations of concepts. The statistics at the end of the project:

Classes (and manuals)    128
Documented methods       931
Documented diagnostics   165
Shown examples           228

Probably the easiest way to find the right methods to do a certain task, is by browsing through the HTML version of the documentation. This was all realized during the project.

The documentation tool has been released as a separate product, named OODoc: Object Oriented Documentation.

Promotion

Providing a good implementation is one side of the story, but growing a user community is even more important for a successful product. Therefore, quite some promotional activities were initiated.

Mailing-list

In an attempt to get people reacting to each other's use of Mail::Box, a mailing-list was started. There is quite some activity on the list, but this is mainly related to mistakes in user code and bugs in the library. It did not really contribute to joint development.

The mailing-list started in May 2002. In February 2003, it had 70 members. At the end of the project in December 2003, 118 people followed the list.

Month   2002               2003
Jan     -                  132
Feb     -                  138
Mar     -                  92
Apr     -                  104
May     7 (from May 20)    69
Jun     73                 22
Jul     25                 68
Aug     79                 45
Sep     53                 34
Oct     76                 26
Nov     57                 47
Dec     64                 33 (till Dec 6)

Messages posted on the mailing-list, per month.

The usage statistics of the list's archive show a very irregular pattern, mainly driven by major changes in the library.

On the other hand, the mailing-list statistics show only part of all traffic: to avoid boring other list members, bug hunts usually took place off-list. The mailing-list received 1244 messages, while Mark's personal archive adds up to 3500 (about half of which he wrote himself). It is clear that answering messages consumed a lot of time: for the 19 months of the project, this means 8.8 messages per workday, or 22 per sponsored workday.
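These figures can be reproduced with a short calculation. The Python sketch below takes the monthly counts from the table above; the figure of 21 workdays per month is an assumption made here, not a number from the report, and the number of sponsored workdays is only derived from the quoted rate.

```python
# Monthly mailing-list message counts, copied from the table above.
posted_2002 = [7, 73, 25, 79, 53, 76, 57, 64]                     # May-Dec
posted_2003 = [132, 138, 92, 104, 69, 22, 68, 45, 34, 26, 47, 33]  # Jan-Dec
list_total = sum(posted_2002) + sum(posted_2003)

archive_total = 3500        # all project mail, on-list and off-list
project_months = 19
workdays_per_month = 21     # assumption: roughly 21 working days per month

per_workday = archive_total / (project_months * workdays_per_month)

# The quoted 22 messages per sponsored workday implies about this many
# sponsored days (derived here, not stated in the report).
implied_sponsored_days = archive_total / 22

print(list_total, round(per_workday, 1), round(implied_sponsored_days))
```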

Conferences

Mail::Box was promoted in various ways, but mainly by giving talks at various (Perl) conferences. Preparation of the abstracts, papers, and slides consumed more time than planned for the project. NLnet mainly sponsored the travel and stay.

SANE 2002 in Maastricht, The Netherlands
A 45-minute contribution on the use of Mail::Box for system administration, entitled "E-mail with Perl". (UNIX system and network administrators conference)
YAPC::Europe 2002 in Munich, Germany
A three-hour tutorial "E-mail programming with Mail::Box", a 45-minute talk on software development of large libraries, and 7 minutes about the Mail::Box spin-off module Object::Realize::Later. (European Perl conference)
German Perl Workshop 2003 in Bonn
A 45-minute talk on the Mail::Box spin-off User::Identity, and 15 minutes about Unicode e-mail headers.
YAPC::NA 2003 in Boca Raton, Florida, USA
A 90-minute tutorial on Mail::Box, 20 minutes for OODoc, and 5 minutes for Object::Realize::Later. (North-American Perl conference)
YAPC::EU 2003 in Paris, France
A new 95-minute tutorial on Mail::Box.

Attracting external developers

Open Source projects must have a community to be successful. They do not only require a group of users --supporting each other with the use of software--, but should also have a group of developers which can supplement each other in the development process. A project which is developed by only one person, like Mail::Box, may collapse when that one person stops development, for instance by illness or lack of time.

In the ideal situation, a group of active developers with comparable influence is in control. This can be found in the FreeBSD, Gnome, and KDE development teams. However, Linux kernel development has only a very small group of people on top, who do not call themselves a team. But that also proves to work. Many smaller applications depend on the effort of one person. When that developer stops working, the product starts to fade away, which may take many years. For instance, the XV image displayer hasn't been changed since 1994, but is still distributed with the latest SuSE Linux.

During this Mail::Box project, effort was made to attract developers for the module, to try to shape a team. Time was reserved to encourage people to participate in development. However, this did not succeed.

One way to get people to lend a hand is by explicitly tackling their problems with the existing code. That way, a personal relationship is built, which may grow active developers. Every few months, the members of the mailing-list were asked for their needs. This always brought some life to the list, and some ideas to work on, but no code contributions.

Furthermore, each time someone spoke about their own application using Mail::Box, that person was invited to contribute the code as part of the module. People were not unwilling, but the step from an application which suits personal needs to code which is usable for other people is huge: much higher requirements on configuration, documentation, and automated testing. In some cases, the employer did not permit the contribution.

As a test-case, it was planned to find someone to implement IMAP4 support. No fewer than four people offered to implement this, over time. Still, each time the good intentions faded when the complexity of the required code became clear to the volunteer. The POP3 protocol was much easier. Liz Mattijsen offered to implement it, and (once started) there was a full implementation within two weeks.

Another obstacle to spontaneous code contributions is the size of Mail::Box. Combined with its Object Oriented coding style, with up to 5 levels of inheritance, it is not easy to get a good feeling for the internals. It is hard to figure out what the best spot for new functionality is, and often some existing functionality has to be rewritten, redesigned, or relocated.

Many programmers do not feel capable enough to write code which is usable by other people: they hesitate to show their programs. To be honest: usually they are right. Getting them to release code requires a lot of guidance; many long e-mails explaining how to produce better code. Only a few reach a publishable level.

After 19 months, the number of received code patches has increased, but these are all quite small patches: never more than a few lines. No-one has offered to join core code development. Which is a shame.

Deployment

Mail::Box has found deployment in different areas. Most of these applications are hidden from the outside world: in most cases it is part of a company's internal infrastructure. Often, it is used to clean up e-mail archives or handle databases containing messages.

To name a few applications:

  • In Taiwan, a Pen-Pal mailing system has been created to connect secondary school pupils. Mail::Box is used to shape virtual groups.
  • The YMB Antispam Project is one of many experimental spam filter tools which are based on Mail::Box.
  • PerlWebmail implements (as its name implies) a web-based e-mail client. It does not use HTML::FromMail, for one because that module did not exist at the moment of its design.
  • Conversion is under way for tkMail, a Perl/Tk graphical e-mail client. It was based on older Perl e-mail libraries, but needed the Unicode features --new versions of the Tk library have full support for Unicode. (The tkMail command is distributed as part of Perl's Tk library release.)

Spin-offs

The Mail::Box development has resulted in a few modules which can also be used with other applications than purely e-mail related. These modules are

MIME::Types
is a collection of knowledge about MIME types, which can be used to map file-name extensions to types, and vice versa.
OODoc
is a system to document complex (probably large, often Object Oriented) modules.
Object::Realize::Later
is a tricky module to implement lazy (delayed) creation of objects, which improves performance. This attracted a lot of attention from hard-core Perl programmers.
User::Identity
plays smart about user information, like deriving someone's probable language preference from an e-mail address. Or discovering a person's gender from a full name description in multiple languages.
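
As an illustration of the two-way mapping that MIME::Types provides, the sketch below uses Python's standard mimetypes module. It is an analogous facility, not the Perl module itself, and the helper names type_of and extensions_of are invented for this example.

```python
import mimetypes

def type_of(filename):
    """Map a file name (via its extension) to a MIME type, or None."""
    return mimetypes.guess_type(filename)[0]

def extensions_of(mime_type):
    """Map a MIME type back to the file-name extensions known for it."""
    return mimetypes.guess_all_extensions(mime_type)

gif_type = type_of("photo.gif")
gif_extensions = extensions_of("image/gif")
print(gif_type, gif_extensions)
```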

Acknowledgements

Special gratitude to Stichting NLnet for offering me the chance to work on this free software package. With the help of NLnet, the Perl software base is enriched with a powerful library, which in time, may become the basis of the next generation e-mail applications.

The following people contributed. Some contributed documentation, others sent in patches or bug reports. Major contributors are marked with (*).

Adam Augustine Gilles Darold Mike Cudmore
Adam Byrtek 'alpha' Greg Matheson* Mike Mimic
Alan Kelm* James Sanford Nick Ing-Simmons
Albert Schueller Jan Stapel Nik Clayton
Alex Liberman Jason Woodward Paul Simons
alex Jeff Squyres Phil Hagen
Alexander Bauer Jeffrey Friedl Phil Holden
Andre Schultze Jeremy Banks Pjotr Prins
Andreas Fitzner Jerrad Pierce Rob Holland
Andreas M. Riechert Joe Junkin Robin Berjon
Andreas Piper John B Batzel Ron Savage
Anthony D. Urso Jon Thomason Ronnie Paskin
asta Jost Krieger Sebastian Krahmer
Beirne Konarski Karen Craven Sebastian Willert
Benjamin Pineau Kees Dekker Shagren
Bernd Patolla Kingpin Simon Cozens
Bill Moseley Liz Mattijsen* Simon von Janowsky
Blair Zajac* Lutz Gehlen Slaven Rezic
Brian Grossman Marcel Gruenauer Stefan Wolfsheimer
Christoph Dahl Marcel de Boer Steve Lewis
Conrad Heiney Mark Ethan Trostler Steven Benson
Constantin Khatsckevich Mark Weiler Supriya Jagadeesh
Cory Johns Martin Thurn Swapnil Khabiya
Darrell Fuhriman Marty J. Riley Tassilo von Parseval*
David A. Golden* Marty Pauley Terrence Brennon
David Coppit* Matthew Darwin Tim Sellar
David Favor Matthew Lockner Todd Richmond
Dimitris Glynos Matthew Walker Tom Allison
Edward Wildgoose* Max Maischein Tony Bowden
Emmet Cailfield Max Poduhoroff Walery Studennikov
Eric Wheeler Melvyn Sopacua Wiggins d'Anconia
Eugene Eric Kim Michael D Richards Yuval Kojman
Evan Borgstrom Michael Reece
Francois Petillon Michael de Beer