I started an experiment Friday to try categorizing tweets (Twitter messages) using a Bayesian classifier. That’s a fancier version of the software that’s separating your junk e-mail (“spam”) from the stuff you want to read. It uses word frequency to figure the probability that a given message belongs in one of the categories it knows about.
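For anyone curious what "uses word frequency to figure the probability" looks like in practice, here is a minimal sketch of a naive Bayes classifier. It is not the library I used. The class name `NaiveBayes`, its methods, the tokenizer, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> messages seen
        self.vocab = set()                       # every word seen in training

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase runs of letters and apostrophes count as words.
        return re.findall(r"[a-z']+", text.lower())

    def learn(self, text, category):
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return (best_category, probability). Requires at least one learn() call."""
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for cat, n_docs in self.doc_counts.items():
            counts = self.word_counts[cat]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # prior for this category
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # smoothed word likelihood
            log_scores[cat] = score
        # Turn log scores into probabilities that sum to 1 across categories.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        best = max(weights, key=weights.get)
        return best, weights[best] / sum(weights.values())


nb = NaiveBayes()
nb.learn("cheap pills buy now", "spam")
nb.learn("lunch at noon tomorrow", "ham")
category, p = nb.classify("buy cheap pills")
print(category, round(p, 3))
```

The probability returned with the category is the "how certain is it" number: with only two categories and clearly different vocabularies it comes out high, which is why spam filtering is the friendly case for this technique.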
I’m not like Google or Bing, who get access to the whole firehose of real-time Twitter messages. Instead, I had to settle for the 20 latest public tweets from a publicly-accessible API, and I only ran my “learn” script every minute (every other minute toward the end) to be nice to Twitter’s servers. Still, at the time I decided to call it quits, 18,520 tweets had been processed.
I didn’t want to assign categories myself, so I decided to make use of the hashtags that are so popular on Twitter. 3,456 of those 18,000+ messages (~19%) contained hashtags, spread across 995 unique hashtags, which my program was able to turn into categories. I had some success initially. I got a big kick when it started putting posts with bad grammar in the “cheezeburger” category. Posts with the words “gay” and “movie” were assumed to be about “New Moon”, the new Twilight sequel. A post that said “#weloveyoujustin #weloveyoujustin #weloveyoujustin…” was correctly categorized as “weloveyoujustin”.
Aside from a few outliers like that, it rarely was more than 20% certain about the categories it was assigning. The same words appeared in all sorts of posts, and Twitter’s 140 character limit discourages big, specific words. I started running it on a Friday, which meant there were a lot of #followfriday and #ff posts adding noise to the mix — it really liked assigning this category when it didn’t know what else to do. And, cross-posting tags like #fb and #in to duplicate posts on Facebook and LinkedIn led to some bad categorizing as well. Aside from #ff and #fb, other tags weren’t used very often. It seems hashtags are mostly used by specific communities like an inside joke or their own internal categorizing system.
A few posts that my program tried to learn from:
Eeeeeeeeeeeeee!! (Doctor Who) #pudsey
#Whatdoyoudo when ur not on twitter?
I hate wal-mart #fb
Thanks @swinmill back at ya! #ff
@HankYeomans : – ( #FAIL
As you can see, there wasn’t much to build a database of word frequency from.
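The hashtag-to-category step can be sketched roughly like this, using a few of the posts above. The function names `hashtags` and `training_pairs` are hypothetical, and lowercasing tags and leaving the hashtags in the training text are my assumptions about one reasonable way to do it, not a record of what my script did.

```python
import re

# Assumption: a hashtag is "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")


def hashtags(tweet):
    """Return the lowercased hashtags in a tweet, in order, without duplicates."""
    seen = []
    for tag in HASHTAG.findall(tweet):
        tag = tag.lower()
        if tag not in seen:
            seen.append(tag)
    return seen


def training_pairs(tweet):
    """Yield one (text, category) pair per hashtag; untagged tweets yield nothing."""
    for tag in hashtags(tweet):
        yield tweet, tag


samples = [
    "Eeeeeeeeeeeeee!! (Doctor Who) #pudsey",
    "#Whatdoyoudo when ur not on twitter?",
    "I hate wal-mart #fb",
    "Thanks @swinmill back at ya! #ff",
]

for tweet in samples:
    for text, category in training_pairs(tweet):
        print(category, "<-", text)
```

Each pair would then be handed to the classifier's "learn" step, which makes the problem obvious: "I hate wal-mart" is all the evidence the "fb" category gets from that post.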
I ended up killing off the “learn” program last night when it hit a scaling limit. The classifying library I was using was taking forever to assign categories because it had so many to choose from. Don’t forget, programs like this are usually just dealing with two categories: “spam” and “not spam”. 995 categories required too many calculations for my poor Mac Mini to handle.
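A back-of-the-envelope sketch of why the category count hurts: a naive Bayes classifier scores a message against every category, looking up one word probability per word per category. The function name `scoring_lookups` and the 20-word message length are assumptions for illustration, and the library I used may well have done more work than this per category.

```python
def scoring_lookups(num_categories, words_in_message):
    """Word-probability lookups needed to score one message against every category."""
    return num_categories * words_in_message


# Assumption: a tweet of about 20 words.
spam_filter = scoring_lookups(2, 20)
hashtag_run = scoring_lookups(995, 20)

print(spam_filter, hashtag_run, hashtag_run / spam_filter)
```

Under those assumptions the hashtag experiment does roughly 500 times the work of a spam filter for every single message.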
In conclusion, I don’t consider this exercise a failure. I got to play with a Twitter library and a Bayesian classifier, which was cool. I got to play voyeur and read all sorts of strangers’ tweets. And I got a few chuckles out of it along the way. But, if a Bayesian strategy is going to be used to categorize Twitter messages, it’s going to need some serious hardware behind it.
O’Reilly Media uploaded a video of Jeff Veen’s presentation at the Web 2.0 Expo, on how designers can tell stories through visualizing data, and presenting the case for letting users find their own stories in the data.
Among the comments at Jeff’s blog, there was a request from someone named BG, who is hard of hearing. He wanted captions…even just some notes…so he could follow along with the presentation. I figured I’m a fast typist, and I had a few free hours this weekend, so I typed up all twenty minutes of Jeff’s presentation for BG and anyone else who could benefit from this presentation.
[Note: Jeff very obviously grew up in California. It wasn't just the picture of him in front of the Golden Gate Bridge that gave it away. His speech includes lots of "like", "sort of", and "right?" used as punctuation. I've left a lot of that out, as it distracts from the content, except for a few times when it felt right to leave it in for stylistic or grammatical reasons.]
Hi everybody. I was just standing backstage looking at Chartbeat. Is that not the coolest thing? I can’t wait to go try my 30 day trial. That was really impressive. Which is good, I want to talk about designing, and the charts, and the stuff that you’ve seen today. We have been talking about data here for the last couple of days: open data in government, data up in the cloud…Tim keeps saying that data is the “Intel Inside” for the web operating system. If that’s the case, I want to sort of pose a question to all of you, which is “what is the ‘Macintosh Human Interface Guidelines’ for this new web-based computer?”
I want to talk about that today by rewinding a little bit and talking about the most important year, I think, in the history of Web 2.0, and that is 1974. How many of you were actually not yet born in 1974? Oh, no. Alright, we’ve got a lot to cover here. So, to me, I think 1974 is interesting ‘cuz it’s kind of conceptually the end of the 60s, right? It didn’t really line up with the calendar that well. But, a lot of stuff that was happening in the counterculture of the 60s became mainstream in the 70s, kind of around 1974. I’ll give you a couple of examples.
Environmental sustainability, that was sort of a hippie thing in the 60s, but really became a mainstream thing in ‘74 when we had the big oil energy crisis, and gas prices…if you can see that on the slide…shot up to 81 cents/gallon. So, it brought home to a lot of people that maybe we won’t have this unlimited supply of energy, right?
Another one was this sort of centralized authority, and this trust in the institutions. That was sort of anti-establishment in the 60s, but 1974: Watergate hearings. The President of the United States, I think here with one of his trusted advisors…the President of the United States quit, and stepped down. It really kind of rocked the faith in a lot of the institutions that we have.
I think entertainment was changing, right? You think 60s, you think Beatles, Bob Dylan. 1974 Kiss came out with their first album. Any of you have this poster in your room when you were a kid? I certainly did. But, in business as well, we see the same sort of emerging trends. AT&T was this sort of naturalized monopoly centralized telecommunication infrastructure in this country. This is when Congress, or actually the House of Representatives, started to try to see if they could break it apart. Which, ten years later they did. And I guess 20 years after that, we kind of put it all back together.
But, it started in 1974, the same year that Vint Cerf and Robert Kahn wrote their seminal article mentioning the Internet for the first time. So, something was happening in that year.
I want to tell you just a brief story about something that happened to me in that year. This is a picture of me and my brother in 1974, when I was six years old, making fun of his ridiculous red plaid pants. But, when I was six, I remember my mom one day taking us shopping to a department store, where we went to have lunch in a cafeteria, which was thrilling for me at that time. The cafeteria was very, very busy, and there weren’t any tables, but we saw one in the back, kind of a weird looking table, and we sat down at it, and it had this glass tabletop. And, inside this glass tabletop was a television, and on the screen of that television was this, and I had never seen a video game before. This was the first time I ever saw it. And, I remember having this little epiphany, as a six year old, thinking “wow, I can control what’s on the screen!” And, you know it wasn’t very meaningful for me then, but it really sort of played out over the course of my life.
This, in fact, is what the machine looked like (I found this on eBay). This is what we were sitting at. But, this is just the first step in a series of increasingly powerful tools that I got to use as a kid for controlling what was on the screen, or participating in sort of the stuff that was happening, the media that I was consuming. And that is why I think that, sort of, 1974 was a big shift. A shift from all of us as being consumers, and trusting authority, and things like that, and this shift into participation and not just consumption.
The other thing that happened in 1974 was the first commercially successful hard drive was shipped by IBM, the Winchester hard drive here. If you were to…they leased them to companies, you couldn’t actually buy one…but if you sort of worked out the lease, and the dollars, and the inflation and stuff, it comes out to about $100,000 per Gigabyte, per month, is what this machine cost. And you can see it was really effective, because look at how productive those people are being there, with all their data and their charts!
If we sort of now zoom forward to where we are now, the price has changed a bit, right? You can lease…and in fact I think it’s now 3 cents/Gigabyte/month for the next three months as Amazon celebrates a birthday of their online storage in the cloud. It’s remarkable, right? Well, if we put these two things together, we can see some trends happening, right? We have now the tools for participation and the scale of data, and these things are coming together, and kind of changing how we make stuff on the web today. I want to talk about that, and I actually want to give you a very simple example of how we deal with this overwhelming amount of data that is in our lives every day.
I’m a designer. I design things, that’s generally my approach to…when things become overwhelming, I see that as an opportunity to make them simpler. And, so if we were to take some data like this and have a look at it, we might not really know what we’re looking at, right? I could, as a designer, add a little metadata to it, right? I could put some labels, and now we can see that this is “Rainfall Total for North American Cities”. That’s good, but it doesn’t make it very accessible. So, I could use some techniques, like for example, typography. I could highlight some things, pull some things back, use a friendlier font, make it a little more accessible. Or, I could take another step, and actually make the data come alive a little bit, and maybe use the value of the colors of the table cells to represent the values of the data that’s in them.
Now, even if you’re in the back of the room, you can get the sense of patterns that are emerging in this data. I could maybe take that one step farther, and do this, maybe. It really…now we’re starting to talk about “what is our audience really expecting?”, right? If you were a bunch of meteorologists, you would probably say “actually, go back — the numbers were a lot better.” But perhaps if we’re just like trying to plan a trip, and deciding whether we should go to San Francisco or Miami, see that maybe in the summertime it’s a little drier in San Francisco.
But, that’s dangerous, right? We get into this situation where we might just be decorating the data which, you know, USA Today has sort of made a career out of. I’ve looked at this chart for months now. I have no idea what they’re talking about with their sprinkler going off. In fact, The Onion picks up on this and does this kind of stuff all the time, like this is…these numbers add up to 143%. I mean, there’s examples like this all over the web, right? This is one of my favorite charts, which is the “percentage of the chart that looks like PacMan”. So it’s dangerous, right? It’s pretty easy to slip into, you know, decorating stuff and not actually finding the meaning of the data and trying to make that accessible to people. And this has been a concern of ours in the design projects that we’ve done over the last few years. We worked on a redesign of Google Analytics that came out a couple years ago. Recently, we just launched this thing called Wikirank, which shows you usage patterns and trends and comparisons of Wikipedia popularity, and things like that.
Our concern was always that we would start to decorate, and that we wouldn’t be making it accessible. So, we did a little research, and let me now rewind all the way back to 1854 and the cholera outbreak in London’s Soho neighborhood, when people were literally getting cholera and dying on the streets. From this came some of the best information design that the world had ever seen at the time. And that was done by a doctor named John Snow, who also had a sort of proclivity for statistical information. And he basically looked at what was happening, and started to make some geo-mashups, taking the data that he saw and applying it to a map to see what was going on. Now, this had been done before but it was really…this is a map from about 50 years earlier, when they believed that disease came through the air, and you can see it looks sort of like a big black cloud swarming over London, when he in fact had a totally different theory, that it was waterborne, and perhaps it was coming from one source. And, he used this map, with this data overlaid on it, to convince the city council that the pump at the corner of Broad Street was the thing that was killing people. And, he used data rather than superstition, or religion, or classism, or any of those other things. He used data and analysis to prove to them…to convince them to take the handle off the pump. And you can still see this pump today if you want to go on sort of a weird data visualization pilgrimage. The pump is still there in Soho. Yes, I’ve totally done this. And, actually, a nice little connection with this, the pub that you see right behind that pump is called the John Snow, named after him for saving the neighborhood.
So data can change things, if the stories are told correctly. This is one of the most famous visualizations of all time, Charles Joseph Minard showing Napoleon’s march to Russia during the winter. There are so many different data points happening all at once here. The size of the line shows the number of troops that he has, and you can see it dwindling as, at the bottom there, the temperature is dropping and soldiers are falling over dead. It’s a really remarkable chart. If you drill in, you see stories throughout this chart. For example, when he forced his troops across the Berezina river when it was about 20 degrees below zero, you can see right there where the arrow is pointing, twelve thousand men died. And that’s a story that is in this chart, right? If you imagine a giant spreadsheet of troop logistics, you’re not going to find those kind of stories as easily as looking at things like that. There are other ways to tell a story like that, right? This is the same event happening, but you can see in the data, by doing the visualization, we can pull that story out.
One final historical example is Harry Beck who, as you can see, was the designer of the London tube system map and… I love this picture ‘cuz he’s got his chart and he’s like “I made this. I’m so happy about this thing.” And he did a great job. This is the chart he was dealing with. This is the map that they had. But, he was an electrical engineer and so he spent his time making circuit boards back in the ’30s and said “wait a minute, why don’t we just apply that to the system? The only things people care about are where the lines connect, not where they are in the city.” So successful, in fact, that it’s still the chart that we use today, not just for London, but for almost every subway map. So, a really impressive example. And, if you draw some conclusions from this, we can see that it’s important that we find a story in the data. It is important that we assign different visual cues to each dimension of that data, like I did with color and rain, and really help people look and visually understand what the data is telling us. And then, just like each one of these designers did, remove everything that’s not telling the story. Boil the story down to just the story that you’re interested in. And that works fine if your data never changes, if you know what the story is, if you have a message as a designer that you wish to communicate.
But that doesn’t kinda match up with what we were talking about with the web, kind of at all. And if we think back about 1974, the one thing that we were trying to get over then was control. We were rebelling against centralized authority, whether it was the government, or our phone system, or whatever, we were trying to get out from under this control. We wanted to be producers instead of consumers. We wanted to control, for our own selves, what was happening, and not let somebody else do that.
I found this in the design world when I worked at Wired back in the 90s. And it was incredibly difficult for us to make that transition from magazine to web site, because we had to give up so much control by moving to the web. We didn’t have ink on paper. We didn’t even know what the paper size was, we didn’t have control of type. We didn’t even know what browser capabilities people had. And, as it turns out, most people want to control their own web experience, and not let some designer do it. And in fact, most of the web sites I read, I read like this, because I have the choice to do that, and I can do that. Some people don’t have a choice. They have to read your web site using this, right? They put their hands on the Braille keyboard, and the page is sort of read through that, and converted into Braille, and they feel it in their fingertips. It’s a very different way, but they need to have that control. They need to have that choice.
So, if we were to apply that to our design work that we’re doing here, where we’re moving and trying to tell this story, we realize that I as a designer, and you as a designer, developer, marketer, anything — have to step back, let go of the control you normally have in telling those stories, and empower people to tell their own stories, or to find and discover their own stories. And perhaps the next step in my rain design here would be something like this, where instead of trying to design the story, I design an application that allows people to find their own story. And this is, of course, an example that I threw together in about ten minutes, but you get the idea, right?
So, perhaps we should enable people to find their own stories, because people have their own data. Here’s a visualization of somebody’s social network on Facebook. Here’s a visualization of the music that somebody was listening to. They scrobbled it off through iTunes to last.fm, and it’s produced in this really sort of amazing pattern of behavior that they’ve been experiencing over the last few months. And, it gets even more personal. Here, we’re looking at stuff that’s happened at EveryBlock.com, stuff that’s happened in their neighborhoods, data that’s being synthesized and displayed in standard, normalized ways, for their neighborhood. Or, even more personal, here’s what you’ve been working on. This is RescueTime, that monitors what you’ve been doing while you’re using your computer, and it gives you charts and graphs that measure your efficiency, which is a bit of a nightmare in my opinion. But, I think a lot of people find a lot of value in getting that realtime feedback. Or, perhaps, one of the most personal things I’ve ever seen is the Nike site, which has a chip on your shoe that talks to your iPod that sends that off to the Internet so you can compare your biodata. And, even more personal than that, is FertilityFriend.com, where women can track their menstrual cycles as they’re trying to get pregnant, and chart, and graph… this is an amazing graph, right? Because when the two red lines cross, you get to have sex. That’s pretty impressive!
So, finding their own story is very important for people. We as designers, developers, product managers, we need to create tools to help people manipulate their data. So, that’s where a lot of this interactivity is coming from. This here is an example of using the data as the navigation source. Trulia is showing houses for sale here in Boston, and I’m scrolling across how much the houses cost, to show me what neighborhoods are popular, using the data as the point of navigation.
Here’s something we did when I was at Google, where we worked with Google Reader, is giving people reports on their consumption of feeds as an interface for filtering through and kind of refining their subscriptions. Again, using the data, using people’s behavior as a navigation source. Dopplr does this with their social network. This is a great interface for visualizing who you’re connected with, who you’re not connected with, and sort of managing that delta. Likewise, here’s a project that we worked on when we were doing sort of realtime data with blogs, about four or five years ago, with MeasureMap, the thing that we sold to Google, where we could use the data as the navigation source.
And then, finally, providing filters allow us to enable clarity for people. So, whereas John Snow and those guys were reducing down the visual clutter of their designs to promote their story, we can give those same tools to people to help them find stories. So, this is another Google product that allows you to look at tremendous amounts of data. And, what we’re looking at here in the vertical axis is life expectancy for countries around the world, and on the bottom is the income per person. Each dot is a different country, the color is the continent that that country is in, and the size of the dot is the amount of population. And, as you see, if we push “play” on this chart, we can scroll through about 40 or 50 years of data. There’s a tremendous amount of data that we’re looking at here. But we can find individual stories that happen in this data by filtering that down. So, if I were to drill in on Botswana and replay the animation over time, we can follow what happened to just that country over the course of the years. And, you can see things are going pretty well, but then when we get to 1990, things go horribly bad. And here we have allowed our users to find a story in this data, in this case the horrible AIDS epidemic in Africa, and the effects that that had had on the population of countries like Botswana.
I think the New York Times has been doing a particularly good job of showing visualizations over time. I like this visualization here — we’re scrolling across casualties during the war in Iraq. The thing that I like about this, not only is that it’s a filter that helps you to understand the story, but it’s also a visualization that they created in 2004 and just left running. So, the story continues to tell itself over time, something that, frankly, the designers of the past never had to concern themselves with.
So, one last example. A bunch of different visualizations, a bunch of different filtering tools from Google Analytics. Again, using the data as a way of navigating through and finding the story. So, if we were to sort of sum this all up, I would say we have taken this shift from storytelling, which authors and artists and designers have always done, and kind of shifted that to put users more in control, and given them tools for discovery. If we thought about the visual cues that designers used to use in visualizations — super important, right? But, now we need to give those as tools, through interactivity, to our users, so that they can have control of that. And editing, that sort of reducing down to the core of the message — sure, that’s really important for communication to happen, and for designers to do. But, now we need to think about that as filtering in the tools that we give our users.
So, I think there is a lot of inspiration that we can pull from the past. These dead white guys did a pretty good job of laying the foundation for us, so that we can take that forward. But, here and now, in the web, and the things that we’re doing, we need to think about tools that empower our users, rather than us trying to have control of our applications, our designs, our web sites. It’s really important, because I think there’s the potential for all of this data that floods us all of the time to really be pretty confusing and overwhelming. But, I have a lot of confidence that we can get through this, because things are a lot different now than they used to be, because now we can control what’s on the screen.