I started an experiment Friday to try categorizing tweets (Twitter messages) using a Bayesian classifier. That’s a fancier version of the software that’s separating your junk e-mail (“spam”) from the stuff you want to read. It uses word frequency to figure the probability that a given message belongs in one of the categories it knows about.
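
For the curious, the word-frequency idea boils down to something like the Python sketch below. It is not the library used in this experiment, just a minimal illustration; the class name `NaiveBayes`, the `tokenize` helper, and the add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Toy naive Bayes text classifier (illustrative, not the original library)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> messages learned
        self.vocab = set()                       # every word seen so far

    def learn(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def probabilities(self, text):
        """Return {category: probability} for one message."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for category, counts in self.word_counts.items():
            # Prior: how common this category is among learned messages.
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(counts.values())
            for word in words:
                # Add-one smoothing (an assumption) so an unseen word
                # does not drive the whole score to zero.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            log_scores[category] = score
        # Turn log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}

    def classify(self, text):
        probs = self.probabilities(text)
        return max(probs, key=probs.get)
```

Training is a matter of calling `learn(text, category)` once per message; `probabilities(text)` then gives the "how certain is it" figure for each known category, and `classify(text)` picks the most likely one.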

I’m not like Google or Bing, who get access to the whole firehose of real-time Twitter messages. Instead, I had to settle for the 20 latest public tweets from a publicly accessible API, and I only ran my “learn” script once a minute (every other minute toward the end) to be nice to Twitter’s servers. Still, by the time I decided to call it quits, 18,520 tweets had been processed.

I didn’t want to assign categories myself, so I decided to make use of the hashtags that are so popular on Twitter. 3,456 of those 18,000+ messages (~19%) contained hashtags, and between them they used 995 unique hashtags, which my program turned into categories. I had some success initially. I got a big kick when it started putting posts with bad grammar in the “cheezeburger” category. Posts with the words “gay” and “movie” were assumed to be about “New Moon”, the new Twilight sequel. A post that said “#weloveyoujustin #weloveyoujustin #weloveyoujustin…” was correctly categorized as “weloveyoujustin”.
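
Pulling hashtags out of a tweet and treating each one as a category label can be sketched in a few lines of Python. This is an illustration rather than the original script: the `HASHTAG` pattern, the lowercasing of tags, and the helper names are assumptions, and the sample tweets other than the "weloveyoujustin" one are made up.

```python
import re
from collections import Counter

# A '#' followed by word characters. The lookbehind skips things like
# "abc#def" and HTML entities such as "&#39;" (an assumed refinement).
HASHTAG = re.compile(r"(?<![\w&])#(\w+)")


def hashtags(tweet):
    """Return the lowercased hashtags in a tweet, without the '#'."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]


def training_pairs(tweets):
    """Yield one (tweet_text, category) pair per distinct hashtag."""
    for tweet in tweets:
        for tag in sorted(set(hashtags(tweet))):
            yield tweet, tag


sample = [
    "#weloveyoujustin #weloveyoujustin #weloveyoujustin",
    "Cross-posting this everywhere #fb #in",   # made-up example
    "No tags in this one",                      # made-up example
]

pairs = list(training_pairs(sample))
category_counts = Counter(category for _, category in pairs)
print(len(category_counts), "unique categories from", len(sample), "tweets")
```

Untagged tweets produce no training pairs at all, which is why only the hashtagged fraction of the collected messages was usable for learning.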

Aside from a few outliers like that, it was rarely more than 20% certain about the categories it was assigning. The same words appeared in all sorts of posts, and Twitter’s 140-character limit discourages big, specific words. I started running it on a Friday, which meant there were a lot of #followfriday and #ff posts adding noise to the mix; it really liked assigning these categories when it didn’t know what else to do. Cross-posting tags like #fb and #in, which duplicate posts on Facebook and LinkedIn, led to some bad categorizing as well. Aside from #ff and #fb, other tags weren’t used very often. It seems hashtags are mostly used within specific communities, like an inside joke or their own internal categorizing system.

A few posts that my program tried to learn from:

  • Eeeeeeeeeeeeee!! (Doctor Who) #pudsey
  • #Whatdoyoudo when ur not on twitter?
  • I hate wal-mart #fb
  • Thanks @swinmill back at ya! #ff
  • @HankYeomans : – ( #FAIL

As you can see, there wasn’t much to build a database of word frequency from.

I ended up killing off the “learn” program last night when it hit a scaling limit. The classifying library I was using was taking forever to assign categories because it had so many to choose from. Don’t forget, programs like this are usually just dealing with two categories: “spam” and “not spam”. 995 categories required too many calculations for my poor Mac Mini to handle.
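
A back-of-the-envelope way to see the scaling problem: a straightforward Bayesian classifier scores every category for every message, and each score looks at every word. The Python sketch below only counts those lookups; the 15-words-per-tweet figure is an assumed round number, not a measurement, and the real library may have had other overheads on top of this.

```python
def word_lookups(num_categories, words_in_message):
    """Per-word probability lookups needed to score one message."""
    # Every category gets a score, and each score touches every word.
    return num_categories * words_in_message


WORDS_PER_TWEET = 15  # assumed typical length, not measured

spam_filter = word_lookups(2, WORDS_PER_TWEET)
hashtag_classifier = word_lookups(995, WORDS_PER_TWEET)

print(spam_filter, hashtag_classifier, hashtag_classifier / spam_filter)
```

Whatever the tweet length, the ratio stays the same: 995 categories means roughly 500 times the work of a two-category spam filter for each message classified.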

In conclusion, I don’t consider this exercise a failure. I got to play with a Twitter library and a Bayesian classifier, which was cool. I got to play voyeur and read all sorts of strangers’ tweets. And I got a few chuckles out of it along the way. But, if a Bayesian strategy is going to be used to categorize Twitter messages, it’s going to need some serious hardware behind it.

I used these libraries: