20041013

In the year two thousand and two...

Hugh Anchor cross the ocean. Blue.

Yesterday was Columbus Day, on which we celebrate the fact that Columbus rediscovered the American continent by not doing anything differently. But for once this holiday has some meaning for me, since it was two years ago (ish) that I landed in New Jersey, on Columbus Day: October 14, 2002. Consequently, I am reminded of the anniversary of my arrival, and I think back on all the pleasures and accomplishments of the past two years.

Well, that didn't take long. So it's time to indulge further in another pastime: the dredging of the email logs. Long-suffering readers will recall that about this time last year I posted a tedious breakdown of the contents of my email. Well, guess what: I'm doing it again this year.

I can't really be bothered to do the full analysis I went into last year, as that requires too much effort. Instead, I'll just use the scripts that I hacked together to process my procmail logs to come up with some similar results.

Last year, I plotted a text graph of real mail versus spam. Here it is, extended to this year:

* 3000
* 2900
* 2800
* 2700
* 2600
* 2500
* 2400
* 2300
* 2200
* 2100
** 2000
** 1900
#** * 1800
#** * * 1700
* #**** * 1600
* *#**** * 1500
** *#****** 1400
** *#****** 1300
* #* **#****** 1200
# *#****##***** 1100
# *#****##***** 1000
**# *#****#####*# 900
**#* * *#***######*# 800
#### **** ***##**######*# 700
####* ***** ***##**######## 600
#####* *#**#***###*#########* 500
***######***####*###############* 400
*#**######**###################### 300
################################## 200
################################## 100
################################## 0
JFMAMJJASONDJFMAMJJASONDJFMAMJJASON
2002 2003 2004


Again, # represents 100 real messages, * represents 100 spam. Categorization is based on the simple, but wrong, assumption that any message from an email address that only occurs once is spam.
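For the curious, the once-only-sender rule can be sketched in a few lines of Python. This is not the actual script I hacked together, just a minimal illustration: the function name `classify_by_sender_count` and the input format (a list of `(month, sender)` pairs already extracted from the logs) are assumptions made for the example.

```python
from collections import Counter


def classify_by_sender_count(messages):
    """Split messages into "real" and "spam" per month.

    `messages` is an iterable of (month, sender) pairs, for example
    ("2004-03", "someone@example.com"). This input format is an
    assumption for the sketch, not the layout of any real log file.

    Heuristic: a sender address seen exactly once in the whole log
    is treated as spam; anything seen more than once is "real".
    """
    messages = list(messages)
    # Count how often each sender appears across the entire log.
    sender_counts = Counter(sender for _, sender in messages)

    monthly = {}
    for month, sender in messages:
        bucket = monthly.setdefault(month, {"real": 0, "spam": 0})
        kind = "spam" if sender_counts[sender] == 1 else "real"
        bucket[kind] += 1
    return monthly
```

Note that the counting is done over the whole log, not month by month, so a correspondent who writes once in January and once in June still counts as real in both months.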

The main observation is that, while spam skyrockets (and is safely caught and dealt with by my carefully crafted spam filters), my real email is also increasing significantly, up to about 1000 messages a month. It could just be that the spam is using more recycled addresses, or I could just be becoming more popular. Being on more mailing lists could also be a factor (maybe I should try to remove those). The spike in March this year could be from the fact that I was in the middle of my job search, although that doesn't really seem like an adequate explanation. It was also reasonably busy, work-wise, but not really any more so than June.

This year, I made some plots of the skewness of the frequency distribution of the counts of messages from each sender. Fans of such things will be delighted that my email exhibits a very clear power law, and that my email this year was skewed with Zipf parameter 1.3, whereas last year it was skewed with Zipf parameter 1.1. This delights me, since I am a big fan of Zipf distributions, especially ones with parameter > 1.
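One common way to get a number like that is to rank the senders by message count and fit a straight line to log(count) against log(rank); the negated slope is the Zipf exponent. The sketch below does this with an ordinary least-squares fit. It is only one of several ways to estimate the parameter and not necessarily how the figures above were produced; the function name `zipf_exponent` is made up for the example.

```python
import math


def zipf_exponent(counts):
    """Estimate s in count(rank) ~ rank**(-s) from per-sender counts.

    Uses a least-squares line through (log rank, log count). This is a
    simple, assumed estimator for illustration; it is known to be
    sensitive to the long tail of senders with very few messages.
    """
    # Rank senders from most to least prolific, ignoring empty entries.
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two senders to fit a line")

    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The fitted slope is negative for a decaying power law; flip the sign.
    return -sxy / sxx
```

Feeding it the per-sender totals for each year separately would give one exponent per year to compare.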

Perhaps more later. I'm starving now, and I want my dinner.
