Project Mail::Box final report
software for e-mail handling in Perl
[Mark Overmeer, MARKOV Solutions, 4 January 2004]
Introduction
In early 2002, the foundation Stichting NLnet offered Mark Overmeer a grant to improve the development and user support of the Mail::Box software he had developed. The project started in May 2002 and was extended with a follow-up in January 2003. In December 2003, NLnet's involvement ends with this final project report.
Mail::Box is an Open Source software library for the Perl programming language. It is designed to help people (mainly system administrators) implement automatic e-mail processing. For instance, it can be used to create automatic replies to incoming e-mail, do spam and virus filtering, and implement web-based e-mail clients. The software is released under the GPL.
This document describes the contributions to Mail::Box realized with the generous grant of Stichting NLnet. After completion of this sponsored project, the module will continue to be supported on a voluntary basis.
Software Development
One of the areas covered by the project plan was pure software development, targeted to provide more "weight" to the library. Development took place on different aspects, of which the most important are listed below.
Main concerns during the development were
- What do people want to do with automated e-mail scripts?
- How to provide the user with an interface so simple that violations of the rules (the RFCs) are hard to make?
- How do people want to do things? There are many different ways to construct a reply message, but in the library they should be handled in one place with a uniform and simple interface. Designing this interface is one of the most complex aspects of software development.
- All actions take place without any user interaction: they must be able to run in stand-alone scripts. This requires a flexible approach to the input. For instance, the user expects to create an "in-lined" reply to an arriving text message, but a nested multi-part message with only binary information comes in... Still, the reply should be correct and look nice.
- The implementation must be platform independent and secure.
- Everything must be documented, and the source code understandable.
These criteria are the main difference between Mail::Box and the other e-mail related Perl modules: other software only provides low-level operations on simple messages, and expects users to create the smarter functionality themselves.
Releases
The software is released under the GNU General Public License (GPL), which means that everybody is free to use and redistribute it. Fixes and extensions made to the software must be made public, which in this case actually resulted in some contributions.
A "release often" policy was chosen as the release model: new releases are made frequently. Each release contains bug-fixes and new features. Feedback about a release is handled swiftly to improve the next release.
For a project with only one developer, it is not practical to use a system with a development branch and a stable branch. When a new release doesn't break existing code, the need for a stable branch is even lower. Mail::Box therefore used a single-track release schedule.
When the NLnet project took off, Mail::Box 2.012 was current. At the end of the supported period, release 2.052 was reached. This gives an average of one release every two weeks, forty releases in total. Only a few of these releases were emergency fixes, which is a consequence of the release policy.
New features
The next sections will give some details about major developments during the time span of the project funded by NLnet.
New folder types
Folders (also named mailboxes) are used to store groups of e-mail messages. When the project started, only the Mbox and MH folder types were supported. These are the most popular types on Unix systems. However, the target of Mail::Box is to create a folder type independent interface, which is usable on as many different platforms as possible.
Maildir support was the first one to be added. It was already in a rough implementation at the start of the project, and finished in one of the early releases: version 2.015, June 2002.
In a joint effort with Liz Mattijsen, a POP3 connector was implemented. This full implementation of the POP3 protocol has nice extra features, like transparently re-establishing lost connections to the server. POP3 was first released with version 2.027, October 2002.
Likewise, based on software from Tassilo von Parseval, access to Outlook DBX folders was implemented. The closed nature of this Microsoft format makes it impossible to provide write access to these folders, but read access is enough to move e-mail archives away from Outlook into Open Source folder types. Included since version 2.042, May 2003.
David J. Kernen wrote an IMAP4 protocol handler. That handler was used to create a back-end for this popular folder type, where the messages reside on a remote server. Not all features are implemented yet, but an alpha version was released in December 2003.
The main complication of the implementation is the level of portability and auto-configuration which should take place. In general, with Mail::Box the user's program doesn't need to know how the messages are stored. At the same time, much effort is put into optimizing performance in folder type dependent ways.
Message construction
New ways to create and process messages were added. In all these extensions, the implementation tries to protect the user from mistakes against the official rules of the RFCs. It is hard to grasp the content of the (at least) five e-mail related standards, so people should be kept away from that when possible.
Next to the reply, forward, and build actions (which already existed but saw some improvements), new read and rebuild actions were added. "Rebuild" can be used to add plain text alternatives to HTML message parts, remove structural complexity, and the like. "Rebuild" was added in release 2.041, May 2003.
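To make the "rebuild" idea concrete, here is a minimal sketch of adding a plain text alternative to an HTML-only message. It is written with Python's standard `email` package purely as an illustration: it is not Mail::Box code, the tag-stripping regular expression is a deliberately crude stand-in for real HTML-to-text conversion, and the sample message is invented.

```python
import re
from email.message import EmailMessage

# An HTML-only message, as many mail clients produce them (invented sample).
original = EmailMessage()
original["Subject"] = "Weekly report"
original.set_content("<p>Hello <b>world</b></p>", subtype="html")

html = original.get_content()

# Crude text rendering: drop anything that looks like a tag.
plain = re.sub(r"<[^>]+>", "", html)

# Rebuild as multipart/alternative: plain text first, HTML second.
rebuilt = EmailMessage()
rebuilt["Subject"] = original["Subject"]
rebuilt.set_content(plain)
rebuilt.add_alternative(html, subtype="html")

part_types = [part.get_content_type() for part in rebuilt.iter_parts()]
```

The point of the sketch is that the caller never writes MIME boundaries or content-type headers by hand; the library keeps the result within the rules of the RFCs.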
Unicode
Due to its origin in the United States, e-mail uses 7-bit ASCII. However, the current e-mail standards support the encoding of character sets used in other parts of the world. Mail::Box received a simple way to decode and encode these "foreign" characters, although this has a negative effect on performance. Unicode features have been added over time, and cannot be pinpointed to a particular release.
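As an illustration of what decoding and encoding such characters involves, the sketch below uses Python's standard `email.header` module rather than Mail::Box itself (whose method names are not given in this report). It decodes an "encoded word" as defined by RFC 2047 from a header value, and encodes a non-ASCII string back into a 7-bit safe form. Both sample strings are invented for the example.

```python
from email.header import Header, decode_header, make_header

# A header value as it travels over the wire: pure 7-bit ASCII.
raw_subject = "=?iso-8859-1?q?Caf=E9_menu?="

# Decode the encoded word into a normal Unicode string.
subject = str(make_header(decode_header(raw_subject)))

# Encode a Unicode string so that it is safe to put in a header again.
encoded = Header("Grüße aus Köln", charset="utf-8").encode()

# Decoding the encoded form gives back the original text.
roundtrip = str(make_header(decode_header(encoded)))
```

Every such conversion costs extra work per header field, which is one plausible reason for the performance penalty mentioned above.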
Field groups
Each (MIME compliant) e-mail message starts with a set of lines describing the content of the message and the transportation process. Mail::Box got ways to handle sets of these lines which are related.
For instance, a header may contain multiple Received lines, showing the intermediate computers which were used to transport the message from the sender to the destination. These lines can be accompanied by a few more lines. The lines are not very useful, and can hence be removed per group or altogether, saving disk space. But they can also easily be inspected and constructed. ResentGroups are included since release 2.023, September 2002.
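The idea of treating related header lines as one unit can be sketched as follows. This is a simplified illustration using Python's standard `email` package, not the Mail::Box interface: it treats all Received lines as a single group, whereas the library distinguishes one group per transport step. The host names and addresses are invented.

```python
from email import message_from_string

raw = (
    "Received: from relay2.example.net by mx.example.org; Mon, 1 Dec 2003 10:00:02 +0100\n"
    "Received: from mail.example.com by relay2.example.net; Mon, 1 Dec 2003 10:00:01 +0100\n"
    "From: alice@example.com\n"
    "To: bob@example.org\n"
    "Subject: hello\n"
    "\n"
    "Body text.\n"
)

msg = message_from_string(raw)

# Inspect the group: one entry per intermediate host.
hops = msg.get_all("Received", [])

# Remove the whole group at once; the other fields are left untouched.
del msg["Received"]
remaining = msg.keys()
```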
Header fields added by mailing-list software can be recognized (and removed), as can fields produced by spam-fighting software. A connection to the popular SpamAssassin software was made. More use in the area of spam-fighting software is expected, as a result of the ever-growing amount of unsolicited e-mail. ListGroups were introduced with release 2.044 in July 2003. SpamGroups were added in 2.048, released in August 2003.
Web-based clients
Maybe the most important application for Perl based e-mail scripting has to do with web-based mail applications. To simplify this task for system developers, the module HTML::FromMail was created. It provides template systems to implement a web-mail client, taking care of many complicated tasks. What is left for the mail-client developer is to add interactivity and layout. HTML::FromMail first saw the light of day in October 2003.
Performance
Various approaches were taken to improve performance. For one, many methods in the library implement alternatives, and the user can decide which alternative is fastest in his or her case. Usually, the automatic decision is the best.
A good example of this smart behavior is the way a message's body is stored while the program runs. It can be stored as one large string, as a set of lines, or as a temporary file on disk. Functionally it is all the same, but which form of storage performs best depends on the size of the message and the way it is used. The best way to keep the body data is simply to remember where to find it, and not to get it at all unless needed. This lazy behavior has been implemented in many of the library's features.
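The lazy approach described above can be sketched in a few lines. The class below is a hypothetical illustration in Python, not the library's implementation: the name `LazyBody`, its methods, and the `loads` counter are all invented for this example. It records only where the body lives and reads it on first use.

```python
import io


class LazyBody:
    """Remember where a message body lives; read it only on first use."""

    def __init__(self, source, offset, length):
        self._source = source   # seekable file-like object holding the folder
        self._offset = offset   # byte position where the body starts
        self._length = length   # number of bytes in the body
        self._data = None       # nothing loaded yet
        self.loads = 0          # how many times the folder was actually read

    def _load(self):
        if self._data is None:
            self._source.seek(self._offset)
            self._data = self._source.read(self._length)
            self.loads += 1
        return self._data

    def size(self):
        # Answerable without touching the folder at all.
        return self._length

    def lines(self):
        return self._load().decode("ascii").splitlines()


# A tiny one-message "folder": header, empty line, body.
raw = b"Subject: demo\n\nfirst line\nsecond line\n"
start = raw.index(b"\n\n") + 2
body = LazyBody(io.BytesIO(raw), offset=start, length=len(raw) - start)
```

A script that only looks at sizes or headers never pays for reading the body; the first call to `lines()` triggers a single read, and later calls reuse the cached data.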
Besides the lazy implementation, there have been some real performance improvements. The two best gains (up to 20% gain each) were reached by rewriting existing Perl components. An optional message parser, implemented in C, has a benefit as well. The C parser's first release became available in December 2002.
Mail::Box was developed for automated e-mail handling, but in its current state, and on new PCs, it is fast enough for user applications. Some functionality required for interactive applications could be added to simplify the task of developers of user clients.
Documentation
The best way to improve the acceptance of software by a community is with better documentation. Mail::Box has grown quite large, and it will never be easy to learn how to use such a large library. But without good documentation you are lost for sure.
The standard way of documenting Perl was not sufficient for a code base of this size. Therefore, Perl's documentation system was extended with new syntax and a new tool to produce a homogeneous set of manual pages. These pages are indexed in various ways, and contain many examples and detailed explanations of concepts. The statistics at the end of the project:
Classes (and manuals) | 128 |
Documented methods | 931 |
Documented diagnostics | 165 |
Shown examples | 228 |
Probably the easiest way to find the right methods for a certain task is by browsing through the HTML version of the documentation. This was all realized during the project.
The documentation tool has been released as a separate product, named OODoc: Object Oriented Documentation.
Promotion
Providing a good implementation is one side of the story, but growing a user community is even more important for a successful product. Therefore, quite a few promotional activities were initiated.
Mailing-list
In an attempt to get people to respond to each other's use of Mail::Box, a mailing-list was started. There is quite some activity on the list, but this is mainly related to mistakes in user code and bugs in the library. It did not really contribute to joint development.
The mailing-list started in May 2002. In February 2003, it had 70 members. At the end of the project in December 2003, 118 people followed the list.
| | Jan 2003 | 132 messages |
| | Feb 2003 | 138 messages |
| | Mar 2003 | 92 messages |
| | Apr 2003 | 104 messages |
May 2002 (from May 20) | 7 messages | May 2003 | 69 messages |
Jun 2002 | 73 messages | Jun 2003 | 22 messages |
Jul 2002 | 25 messages | Jul 2003 | 68 messages |
Aug 2002 | 79 messages | Aug 2003 | 45 messages |
Sep 2002 | 53 messages | Sep 2003 | 34 messages |
Oct 2002 | 76 messages | Oct 2003 | 26 messages |
Nov 2002 | 57 messages | Nov 2003 | 47 messages |
Dec 2002 | 64 messages | Dec 2003 (till Dec 6) | 33 messages |
The usage statistics of the list's archive show a very irregular pattern, mainly driven by major changes in the library.
On the other hand, the mailing-list statistics show only a part of all traffic: to avoid boring other list members, bug hunts usually took place off-list. The mailing-list received 1244 messages, while Mark's personal archive adds up to 3500 (about half of which were written by Mark himself). It is clear that answering messages consumed a lot of time: for the 19 months of the project, this means 8.8 messages per workday, or 22 per sponsored workday.
Conferences
Mail::Box was promoted in various ways, but mainly by giving talks at various (Perl) conferences. Preparing the abstracts, papers, and slides consumed more time than was planned for the project. NLnet mainly sponsored the travel and stay.
- SANE 2002 in Maastricht, The Netherlands
  A 45-minute contribution on the use of Mail::Box for system administration, entitled "E-mail with Perl". (UNIX system and network administrators conference)
- YAPC::Europe 2002 in Munich, Germany
  A three-hour tutorial "E-mail programming with Mail::Box", a 45-minute talk on software development of large libraries, and 7 minutes about the Mail::Box spin-off module Object::Realize::Later. (European Perl conference)
- German Perl Workshop 2003 in Bonn
  A 45-minute talk on the Mail::Box spin-off User::Identity, and 15 minutes about Unicode e-mail headers.
- YAPC::NA 2003 in Boca Raton, Florida, USA
  A 90-minute tutorial on Mail::Box, 20 minutes for OODoc, and 5 minutes for Object::Realize::Later. (North-American Perl conference)
- YAPC::EU 2003 in Paris, France
  A new 95-minute tutorial on Mail::Box.
Attracting external developers
Open Source projects must have a community to be successful. They not only require a group of users who support each other in the use of the software, but should also have a group of developers who can complement each other in the development process. A project which is developed by only one person, like Mail::Box, may collapse when that one person stops development, for instance due to illness or lack of time.
In the ideal situation, a group of active developers with comparable influence is in control. This can be found in the FreeBSD, Gnome, and KDE development teams. However, Linux kernel development has only a very small group of people at the top, who do not call themselves "a team". That also appears to work. Many smaller applications depend on the effort of one person. When that developer stops working on it, the product starts to fade away, which may take many years. For instance, the XV image displayer hasn't been changed since 1994, but is still distributed with the latest SuSE Linux.
During this Mail::Box project, effort was made to attract developers for the module, to try to shape "a team". Time was reserved to encourage people to participate in development. However, this did not succeed.
One way to get people to lend a hand is by explicitly tackling their problems with the existing code. That way, a personal relationship is built, which may turn users into active developers. Every few months, the members of the mailing-list were asked for their needs. This always brought some life to the list, and some ideas to work on, but no code contributions.
Furthermore, each time someone spoke about their own application using Mail::Box, that person was invited to contribute the code as part of the module. People were not unwilling, but the step from an application which suits personal needs to code which is usable for other people is huge: much higher requirements on configuration, documentation, and automated testing. In some cases, the employer did not permit the contribution.
As a test-case, it was planned to find someone to implement IMAP4 support. No fewer than four people offered to implement this, over time. Still, each time the good intentions faded when the complexity of the required code became clear to the volunteer. The POP3 protocol was much easier. Liz Mattijsen offered to implement it, and (once started) there was a full implementation within two weeks.
Another obstacle to spontaneous code contributions is the size of Mail::Box. Combined with its Object Oriented coding style, with up to 5 levels of inheritance, it is not easy to get a good feeling for the internals. It is hard to figure out what the best spot for new functionality is, and often some existing functionality has to be rewritten, redesigned, or relocated.
Many programmers do not feel capable enough to write code which is usable by other people: they hesitate to show their programs. To be honest: usually they are right. Getting them to release code requires a lot of guidance; many long e-mails explaining how to produce better code. Only a few reach a publishable level.
After 19 months, the number of received code patches has increased, but these are all quite small patches: never more than a few lines. No one has offered to join core code development, which is a shame.
Deployment
Mail::Box has found deployment in different areas. Most of these applications are hidden from the outside world: in most cases it is part of a company's internal infrastructure. Often, it is used to clean up e-mail archives or handle databases containing messages.
To name a few applications:
- In Taiwan, a Pen-Pal mailing system has been created to connect secondary school pupils. Mail::Box is used to shape virtual groups.
- The YMB Antispam Project is one of many experimental spam filter tools which are based on Mail::Box.
- PerlWebmail implements (as its name implies) a web-based e-mail client. It does not use HTML::FromMail, for one because that module did not exist at the moment of its design.
- Conversion is under way for tkMail, a Perl/Tk graphical e-mail client. It was based on older Perl e-mail libraries, but needed the Unicode features, since new versions of the Tk library have full support for Unicode. (The tkMail command is distributed as part of Perl's Tk library release.)
Spin-offs
The Mail::Box development has resulted in a few modules which can also be used in applications that are not purely e-mail related. These modules are:
- MIME::Types
  A collection of knowledge about MIME types, which can be used to map file-name extensions to types and vice versa.
- OODoc
  A system to document complex (probably large, often Object Oriented) modules.
- Object::Realize::Later
  A tricky module to implement lazy (delayed) creation of objects, which improves performance. This attracted a lot of attention from hard-core Perl programmers.
- User::Identity
  Plays smart about user information, like deriving someone's probable language preference from an e-mail address, or discovering a person's gender from a full name description in multiple languages.
Acknowledgements
Special gratitude to Stichting NLnet for offering me the chance to work on this free software package. With the help of NLnet, the Perl software base is enriched with a powerful library which, in time, may become the basis of the next generation of e-mail applications.
The following people contributed. Some contributed documentation, others sent in patches or bug reports. Major contributors are marked with (*).
Adam Augustine | Gilles Darold | Mike Cudmore |
Adam Byrtek 'alpha' | Greg Matheson* | Mike Mimic |
Alan Kelm* | James Sanford | Nick Ing-Simmons |
Albert Schueller | Jan Stapel | Nik Clayton |
Alex Liberman | Jason Woodward | Paul Simons |
alex | Jeff Squyres | Phil Hagen |
Alexander Bauer | Jeffrey Friedl | Phil Holden |
Andre Schultze | Jeremy Banks | Pjotr Prins |
Andreas Fitzner | Jerrad Pierce | Rob Holland |
Andreas M. Riechert | Joe Junkin | Robin Berjon |
Andreas Piper | John B Batzel | Ron Savage |
Anthony D. Urso | Jon Thomason | Ronnie Paskin |
asta | Jost Krieger | Sebastian Krahmer |
Beirne Konarski | Karen Craven | Sebastian Willert |
Benjamin Pineau | Kees Dekker | Shagren |
Bernd Patolla | Kingpin | Simon Cozens |
Bill Moseley | Liz Mattijsen* | Simon von Janowsky |
Blair Zajac* | Lutz Gehlen | Slaven Rezic |
Brian Grossman | Marcel Gruenauer | Stefan Wolfsheimer |
Christoph Dahl | Marcel de Boer | Steve Lewis |
Conrad Heiney | Mark Ethan Trostler | Steven Benson |
Constantin Khatsckevich | Mark Weiler | Supriya Jagadeesh |
Cory Johns | Martin Thurn | Swapnil Khabiya |
Darrell Fuhriman | Marty J. Riley | Tassilo von Parseval* |
David A. Golden* | Marty Pauley | Terrence Brennon |
David Coppit* | Matthew Darwin | Tim Sellar |
David Favor | Matthew Lockner | Todd Richmond |
Dimitris Glynos | Matthew Walker | Tom Allison |
Edward Wildgoose* | Max Maischein | Tony Bowden |
Emmet Cailfield | Max Poduhoroff | Walery Studennikov |
Eric Wheeler | Melvyn Sopacua | Wiggins d'Anconia |
Eugene Eric Kim | Michael D Richards | Yuval Kojman |
Evan Borgstrom | Michael Reece | |
Francois Petillon | Michael de Beer |