Time-Series of Twitter Status Updates, The Poisson Arrival Process, and How T-Pain Auto-Tuuunnneeesss His Voice

The Poisson arrival process is a stochastic process named after the French mathematician Siméon Denis Poisson. It’s a fundamental tool for most systems engineers, generally used to model arrivals. Look here if you’d like to learn more about the theoretical backing.

It can be difficult to learn dense concepts without an application. So here’s a guide for students who are learning the Poisson model but want to see it applied to something: Twitter.

First: You need to know the basic form of a Poisson model (from the Wikipedia page):

$$P[N(t+\tau) - N(t) = k] = \frac{e^{-\lambda\tau}(\lambda\tau)^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

This says: the probability of having k arrivals between time t and time t + tau is given by the right-hand side, where N(t) counts the arrivals up to time t. We’re using the discrete form, because you can’t have 1.4 status updates in an hour; we’re dealing with discrete items.

In the formula above, tau is the length of the time interval and lambda is the arrival rate. Lambda * tau = the expected number of arrivals in an interval of length tau. So for example, if tau = 1 hour and lambda = 60 calls/hour, the expected number of arrivals is 60 calls/hour * 1 hour = 60 calls. The rate parameter lambda is expressed in inverse time, so 1 / hours, and not hours.
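To make the arithmetic concrete, here is a minimal Python sketch of the Poisson probability for the call example. The function name `poisson_pmf` and the variable names are my own choices for illustration, and the rate of 60 calls per hour is just the example figure from the paragraph above.

```python
import math

def poisson_pmf(k, lam, tau):
    """Probability of exactly k arrivals in an interval of length tau at rate lam."""
    mu = lam * tau  # expected number of arrivals in the interval
    # Work in log space so large k does not overflow: log(mu^k * e^-mu / k!)
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

lam = 60.0  # arrival rate in calls per hour (example value from the text)
tau = 1.0   # interval length in hours

expected_calls = lam * tau
prob_exactly_60 = poisson_pmf(60, lam, tau)
prob_at_most_50 = sum(poisson_pmf(k, lam, tau) for k in range(51))

print(f"expected calls in the interval: {expected_calls}")
print(f"P(exactly 60 calls) = {prob_exactly_60:.4f}")
print(f"P(at most 50 calls) = {prob_at_most_50:.4f}")
```

Note that the units cancel the way they should: calls/hour times hours leaves a plain count of calls.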

Next, I’ll demonstrate where this distribution can appear.

I wrote a script that uses the Twitter API to keep track of my friends’ status updates for the day. For more technical details, download the script here. The script is very simple: it listens for my friends’ status updates and records when each one is posted.

At the end of the day, I ran a distribution analysis on the status updates using the Poisson model and found the following:

[Figure: arrival]

The blue represents the actual arrival process; the red represents a Poisson distribution.

The blue area is a histogram of the time between consecutive updates. So for example, the big blue rectangle on the far left, spanning 0 to 100, denotes that 30% of the status updates came within 100 seconds of the previous update. The next box, from 100 to 200, denotes that about 15% of the arrivals came between 100 and 200 seconds after the previous one. This extends all the way out, and shows that only .01% of the time did it take 1,000 seconds to see a new update.
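Here is a rough sketch of how a histogram like this could be built from recorded timestamps. It is not the script mentioned above; the function names and the timestamps are made up for illustration. One fact worth knowing: in a Poisson process the gaps between arrivals follow an exponential distribution, so that is what the sketch compares each 100-second bin against.

```python
import math

def interarrival_fractions(timestamps, bin_width=100.0):
    """Fraction of gaps between consecutive arrivals that fall in each bin."""
    times = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    counts = {}
    for gap in gaps:
        bin_index = int(gap // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return {i: c / len(gaps) for i, c in sorted(counts.items())}

def exponential_bin_fraction(bin_index, rate, bin_width=100.0):
    """Fraction of gaps an exponential model with this rate puts in one bin."""
    lo = bin_index * bin_width
    hi = lo + bin_width
    return math.exp(-rate * lo) - math.exp(-rate * hi)

# Made-up arrival times in seconds, for illustration only (not the post's data).
timestamps = [0, 40, 95, 130, 310, 350, 420, 650, 700, 1020]
observed = interarrival_fractions(timestamps)

# Estimated rate: number of gaps divided by the total time they span.
rate = (len(timestamps) - 1) / (max(timestamps) - min(timestamps))

for i, frac in observed.items():
    model = exponential_bin_fraction(i, rate)
    print(f"{i * 100:>4}-{(i + 1) * 100:<4} s  observed {frac:.2f}  model {model:.2f}")
```

With real data you would feed in the recorded update times instead of the made-up list, and the observed column corresponds to the blue bars.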

Using this data, we could predict how long it will take for someone to post an update to Twitter. If we’re working with Twitter and trying to manage scaling, we could use this distribution to decide how much server power we need.
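As a sketch of what that prediction might look like, assume the gaps between updates are exponential (which is what a Poisson process implies). The helper names and the list of gaps below are hypothetical, chosen only to show the calculation.

```python
import math

def estimate_rate(gaps):
    """Estimate the arrival rate as 1 / (mean gap), in arrivals per second."""
    return len(gaps) / sum(gaps)

def prob_update_within(seconds, rate):
    """P(next update arrives within `seconds`) under an exponential gap model."""
    return 1.0 - math.exp(-rate * seconds)

def wait_quantile(p, rate):
    """Waiting time (seconds) that a fraction p of the gaps fall under."""
    return -math.log(1.0 - p) / rate

# Made-up gaps between updates in seconds, for illustration only.
gaps = [40, 55, 35, 180, 40, 70, 230, 50, 320]
rate = estimate_rate(gaps)

print(f"estimated rate: {rate:.4f} updates/second")
print(f"P(update within 100 s) = {prob_update_within(100, rate):.2f}")
print(f"90% of waits are under {wait_quantile(0.9, rate):.0f} s")
```

For capacity planning, the same rate can be plugged back into the Poisson formula to ask how likely it is to see more than some number of updates in a given window.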

Poisson arrivals can be used for some pretty cool techniques. One of the most interesting uses it has found is auto-tuning. I know, I was pretty amazed to find stochastic processes in pop music.

Before we dive in, let’s look at another artifact from our Poisson distribution. With the distribution we fit on Twitter status updates, we can also run an auto-correlation analysis to see how random the data is. By taking out some of the randomness, we could make the data more predictable. For example, instead of saying that we had Twitter status updates at times 1, 2, 3, 2, 1, 2, 3, 1, 3, we could just average them and say that all of the arrivals came at time 2. It’s not true, and strictly speaking auto-correlation measures how closely a series matches a delayed copy of itself rather than averaging it, but averaging is the simplification we’ll use here.
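For the curious, here is a small sketch using the nine example values from the paragraph above. It computes the plain average, and next to it the textbook sample autocorrelation, which measures how similar a series is to a shifted copy of itself. The function name is my own, and this is the standard definition rather than anything taken from the script.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum

series = [1, 2, 3, 2, 1, 2, 3, 1, 3]  # the example values from the text

average = sum(series) / len(series)
print(f"average: {average}")

for lag in range(4):
    print(f"lag {lag}: autocorrelation {autocorrelation(series, lag):.3f}")
```

Values near zero at every lag above 0 suggest the series looks random; values near 1 or -1 suggest a pattern that could be predicted.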

When T-Pain sings, his voice comes out at several different frequencies, even within a single word. To simplify things a lot, say a certain segment of a song is 5 seconds long and contains five notes, played sequentially: A, B, C#, D, E. To auto-tune, we could average his voice throughout to get that good auto-tune effect that mars most of today’s rap music.
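Here is a toy sketch of that averaging, and nothing more than a toy: it is not how commercial auto-tune software works. It assumes the five notes start at the A above middle C (440 Hz) and uses the standard equal-temperament formula to turn note names into frequencies. The octave and all the names in the code are my assumptions.

```python
A4_HZ = 440.0  # reference pitch; assumes the segment starts on A4

# Semitone offsets of each note above A, within one octave.
SEMITONES_FROM_A = {"A": 0, "B": 2, "C#": 4, "D": 5, "E": 7}

def note_frequency(name):
    """Equal-temperament frequency in Hz for a note name in the table above."""
    return A4_HZ * 2 ** (SEMITONES_FROM_A[name] / 12)

notes = ["A", "B", "C#", "D", "E"]  # the five notes from the text
frequencies = [note_frequency(n) for n in notes]
average_hz = sum(frequencies) / len(frequencies)

for name, hz in zip(notes, frequencies):
    print(f"{name:<2} {hz:7.2f} Hz")
print(f"average over the segment: {average_hz:.2f} Hz")
```

Playing the whole segment at the single averaged frequency is the musical version of saying every update arrived at time 2.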

The connection isn’t entirely apparent quite yet, but come back next time for the full explanation.
