[TAG] mbox selective deletion

Ben Okopnik ben at linuxgazette.net
Tue Feb 16 03:21:47 MSK 2010


On Fri, Feb 12, 2010 at 11:48:28PM -0500, Samuel Bisbee-vonKaufmann wrote:
> On Fri, Feb 05, 2010 at 11:01:37PM -0500, Ben Okopnik wrote:
> > 
> > I do have to note, though, that this isn't great for large mailboxes
> > with lots of messages; it's not the fastest thing in the world. As a
> > baseline, it takes about 3 seconds to process a 10MB mailbox that has 36
> > messages in it, but it takes 22 seconds to process one of the same size
> > but with ~600 messages in it. I suppose you could speed it up by
> > sticking the tempfile into memory (assuming you have enough memory), but
> > you're still spawning _some_ interpreter or parser ~600 times, and that
> > ain't cheap.
> 
> Yeah, coding against "keep everything on one gigantic file" isn't very fun,
> though it makes administration a lot easier.

But it makes processing a lot slower - and the cost climbs steeply as the file grows. In
my experience/best judgement, whenever you expose a static data source
to multiple users, anything over a meg or so in size is a disaster
waiting to happen. At that point, either a database or some sort of a
pointer-based index scheme is a requirement.
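
To make the "pointer-based index" idea concrete, here is a minimal sketch
in Python (not the Bash or Perl code discussed in this thread). It assumes
a conventional mbox in which every message starts with a "From " line that
is either at the top of the file or preceded by a blank line. The names
build_mbox_index and read_message are made up for illustration.

```python
def build_mbox_index(path):
    """Scan the mbox once and return a list of (start, end) byte offsets,
    one pair per message."""
    starts = []
    offset = 0
    prev_blank = True  # the top of the file counts as a message boundary
    with open(path, "rb") as fh:
        for line in fh:
            # Assumption: a separator is a "From " line after a blank line.
            if prev_blank and line.startswith(b"From "):
                starts.append(offset)
            prev_blank = line in (b"\n", b"\r\n")
            offset += len(line)
    ends = starts[1:] + [offset]
    return list(zip(starts, ends))


def read_message(path, span):
    """Fetch one message by seeking straight to its offsets."""
    start, end = span
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)
```

With the index in hand, fetching any one message is a seek and a read
rather than a re-parse of the whole file, and the single scan replaces
spawning a parser once per message.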
 
> I was trying to keep my program all in one language, but the Bash solution you
> provided simply choked and died with large mbox's (ex., currently mine is 365
> megs).

Sam, y'know how I said "anything over a meg or so"? I think 365MB
sorta, um, qualifies. :)

If you just wanted to select and return various emails, there's a bunch
of stuff that allows you to do that (mairix and hyperestraier, for
example, are stunningly good at what they do). However, you actually
want to delete stuff... in my mind, that pretty much defines it as
either a database or a customized caching and indexing solution.
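
As one possible shape for the database route, the sketch below keeps
header values and message byte offsets in SQLite, so that "which messages
have this To address AND this header value" becomes a query instead of a
file scan. It is an assumption-laden illustration in Python: the table
layout and the names create_index, add_message and find_spans are
invented here, matching is exact rather than by regex, and the offsets
are presumed to come from a separate pass over the mbox.

```python
import sqlite3


def create_index(conn):
    """Create the (hypothetical) header table and a lookup index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS header ("
        "msg_start INTEGER, msg_end INTEGER, name TEXT, value TEXT)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS header_lookup ON header (name, value)"
    )


def add_message(conn, start, end, headers):
    """Record one message: its byte span plus (name, value) header pairs."""
    conn.executemany(
        "INSERT INTO header VALUES (?, ?, ?, ?)",
        [(start, end, name.lower(), value) for name, value in headers],
    )


def find_spans(conn, rules):
    """Return the spans of messages matching every rule (AND, not OR)."""
    if not rules:
        return []
    one = ("SELECT DISTINCT msg_start, msg_end FROM header "
           "WHERE name = ? AND value = ?")
    # INTERSECT keeps only the spans present in every per-rule result.
    query = " INTERSECT ".join(one for _ in rules) + " ORDER BY msg_start"
    params = [x for name, value in rules for x in (name.lower(), value)]
    return conn.execute(query, params).fetchall()
```

The catch with deletion is that every span after the removed message
shifts, so the index has to be rebuilt or adjusted after each rewrite of
the mbox.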

> So, with that and needing to match more than one header pair, I give you
> this: http://github.com/ravidgemole/mailp/blob/master/deleteMessage.plx
> 
> [Don't worry, check the THANKS file to see that you're not forgotten.] 
> 
> The tests showed much better results. FYI, this test was with a non-sane e-mail
> so I could make sure the program was doing AND matching, not OR.

Sure. Do note that Mail::MboxParser allows you to create an index file:
take a look at the 'make_index' option in the docs.
 
> ``
> sbisbee at orbital:~/src/mailp$ time ./deleteMessage.plx ./mbox to ".*sbisbee at computervip\.c0om.*" x-mailp ad9d8e35e69f9547a9b3c4a8fb06ad0edbe56d9b > test
> 
> real    0m23.135s
> user    0m20.065s
> sys     0m2.760s
> ''

That's certainly a machine with lots more horsepower than my little
netbook - and with lots more memory. In any case, you could speed it up
significantly with an index.

> Now my main program (a Bash script, though I may convert to Perl for
> homogeneousness) can remove messages from the mbox that have a specific To
> address and a certain header key/value pair.
> 
> Some more things I want to add:
> 
>  - An arg to run through the mbox file in reverse, with the theory that people
>    will often want to deal with recent e-mails at the end of the file instead
>    of old ones. Ex., my program would run this command _a lot_ faster if it
>    could combine this arg with the next one...

If you invert your index, this would be automatic.
 
>  - An arg to stop running through the mbox file when one match is found.
>    Haven't played with Mail::MboxParser enough yet to know whether I can tell
>    it to just dump the rest of the file's contents.

Wouldn't be a problem. The nuclear-powered mechanical dwarves beneath
the surface of this module will do the right thing if you only ask them.
:)
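
For illustration only, the two wishlist items above (newest-first and
stop-at-first-match) can be combined with AND matching roughly as
follows. This is a Python sketch using just the standard library, not
Mail::MboxParser and not the deleteMessage.plx script; every function
name is hypothetical, and it assumes "From " separator lines preceded by
a blank line.

```python
import re
from email.parser import BytesHeaderParser


def message_spans(path):
    """One pass over the mbox: (start, end) byte offsets per message."""
    starts, offset, prev_blank = [], 0, True
    with open(path, "rb") as fh:
        for line in fh:
            if prev_blank and line.startswith(b"From "):
                starts.append(offset)
            prev_blank = line in (b"\n", b"\r\n")
            offset += len(line)
    return list(zip(starts, starts[1:] + [offset]))


def matches_all(raw, rules):
    """True only if every (header, regex) rule matches: AND, not OR."""
    _envelope, _sep, rest = raw.partition(b"\n")  # drop the "From " line
    headers = BytesHeaderParser().parsebytes(rest)
    return all(
        any(re.search(pattern, str(value))
            for value in headers.get_all(name, []))
        for name, pattern in rules
    )


def copy_range(src, dst, start, end, chunk=1 << 20):
    """Copy bytes [start, end) in chunks so large files stay out of RAM."""
    src.seek(start)
    remaining = end - start
    while remaining > 0:
        data = src.read(min(chunk, remaining))
        if not data:
            break
        dst.write(data)
        remaining -= len(data)


def delete_newest_match(src_path, dst_path, rules):
    """Walk the index newest-first, stop at the first match, and write a
    copy of the mbox without that message. Returns False if none match."""
    spans = message_spans(src_path)
    with open(src_path, "rb") as src:
        doomed = None
        for start, end in reversed(spans):  # inverted index: newest first
            src.seek(start)
            if matches_all(src.read(end - start), rules):
                doomed = (start, end)
                break  # one match is enough; leave the rest unparsed
        if doomed is None:
            return False
        with open(dst_path, "wb") as dst:
            copy_range(src, dst, 0, doomed[0])
            copy_range(src, dst, doomed[1], spans[-1][1])
    return True
```

A call might look like delete_newest_match("mbox", "mbox.new",
[("To", r"sam@example\.com"), ("X-Mailp", r"^abc$")]). Once the match is
found, everything else is copied as raw bytes, which is the "just dump
the rest of the file" behavior asked about. Writing to a second file
instead of editing in place is a deliberate choice here, so a crash
midway leaves the original mbox intact.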


-- 
* Ben Okopnik * Editor-in-Chief, Linux Gazette * http://LinuxGazette.NET *




