Latest posts.

Testing divvyshot widget


Side Project: Six Legged Walking Lego Robot

Time-Series of Twitter Status Updates, The Poisson Arrival Process, and How T-Pain Auto-Tuuunnneeesss His Voice

The Poisson arrival process is a stochastic process named after the French mathematician Siméon Denis Poisson. It’s a fundamental tool for most systems engineers, generally used to model arrivals. Look here if you’d like to learn more about the theoretical backing.

It can be difficult to learn dense concepts without an application. So here’s a guide for students who are learning the Poisson model and want a concrete application: Twitter.

First: You need to know the basic form of a Poisson model (from the Wikipedia page):

$$P[N(t+\tau) - N(t) = k] = \frac{e^{-\lambda\tau}(\lambda\tau)^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

This says: the probability of having k arrivals between time t and time t + tau is given by the equation above. We’re using the discrete form, because you can’t have 1.4 status updates in an hour; we’re dealing with discrete items.

In the formula above, tau is the length of the time interval and lambda is the arrival rate. Lambda * tau = the expected number of arrivals in an interval of length tau. So for example, if tau = 1 hour and lambda = 60 calls/hour, the expected value is 60 calls/hour * 1 hour = 60 calls. The rate parameter has units of inverse time, so 1 / hours, and not hours.
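To make the formula concrete, here is a minimal Python sketch that evaluates it directly. The function name `poisson_pmf` and the one-minute window are my own choices for illustration; the 60 calls/hour rate is the figure from the example above.

```python
import math

def poisson_pmf(k, lam, tau):
    """Probability of exactly k arrivals in a window of length tau at rate lam."""
    mean = lam * tau  # expected number of arrivals in the window
    return math.exp(-mean) * mean ** k / math.factorial(k)

# 60 calls per hour, observed over a one-minute window (1/60 of an hour).
rate_per_hour = 60.0
window_hours = 1.0 / 60.0
for k in range(4):
    print(k, round(poisson_pmf(k, rate_per_hour, window_hours), 4))
```

Note that the rate and the window must use the same time unit (hours here), otherwise lambda * tau is not a plain count.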

Next, I’ll demonstrate where this distribution can appear.

I wrote a script that uses the twitter-api to keep track of my friends’ status updates for the day. For more technical details, download the script here. The script is very simple — it listens for and records when my friends post status updates.

At the end of the day, I ran a distribution analysis on the status updates using the Poisson model and found the following:

[Figure: arrival]

The blue represents the actual arrival process; the red represents the fitted Poisson model.

For example, the big blue rectangle on the far left, spanning 0 to 100, shows that 30% of the status updates came within 100 seconds of the previous update. The next box, from 100 to 200, shows that about 15% of the arrivals came between 100 and 200 seconds after the previous one. This pattern continues to the right, where only .01% of the time did it take 1,000 seconds to see a new update.
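The binning described above can be sketched in a few lines of Python. This is not the script from the post: the timestamps are made up for illustration, and the comparison curve assumes the standard result that gaps between arrivals in a Poisson process are exponentially distributed.

```python
import math

def interarrival_gaps(timestamps):
    """Seconds between consecutive updates, given timestamps in seconds."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def bin_fractions(gaps, width, n_bins):
    """Fraction of gaps falling in [0, width), [width, 2*width), and so on."""
    counts = [0] * n_bins
    for g in gaps:
        idx = int(g // width)
        if idx < n_bins:
            counts[idx] += 1
    return [c / len(gaps) for c in counts]

def exponential_bin_fractions(rate, width, n_bins):
    """Model fraction per bin when gaps are exponential with the given rate."""
    return [math.exp(-rate * i * width) - math.exp(-rate * (i + 1) * width)
            for i in range(n_bins)]

# Hypothetical timestamps (seconds since the start of the day), not real data.
timestamps = [0, 40, 95, 260, 300, 520, 610, 905, 950, 1400]
gaps = interarrival_gaps(timestamps)
rate = len(gaps) / sum(gaps)  # maximum-likelihood estimate: 1 / mean gap
observed = bin_fractions(gaps, 100, 5)
expected = exponential_bin_fractions(rate, 100, 5)
for i, (o, e) in enumerate(zip(observed, expected)):
    print(f"{i * 100}-{(i + 1) * 100}s  observed={o:.2f}  model={e:.2f}")
```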

Using this data, we could predict how long it will take for someone to post an update to Twitter. If we were working at Twitter and trying to manage scaling, we could use this distribution to decide how much server power we need.
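As a rough illustration of the capacity idea, the sketch below finds the smallest number of updates per window that covers 99% of windows under a Poisson model. The rate, the window length, and the 99% target are all assumptions picked for the example, not measurements from my data.

```python
import math

def poisson_cdf(k, mean):
    """P(N <= k) for a Poisson count with the given mean."""
    term = math.exp(-mean)  # P(N = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # P(N = i) from P(N = i - 1)
        total += term
    return total

def capacity_for(mean, coverage=0.99):
    """Smallest k such that P(N <= k) >= coverage."""
    k = 0
    while poisson_cdf(k, mean) < coverage:
        k += 1
    return k

rate_per_second = 0.01   # hypothetical: one update every 100 seconds on average
window_seconds = 600     # hypothetical: plan capacity in ten-minute windows
mean = rate_per_second * window_seconds
print(capacity_for(mean))
```

Planning for the mean alone would leave the servers short roughly half the time, which is why the sketch works from the tail of the distribution instead.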

Poisson arrivals can be used for some pretty cool techniques. One of the most interesting uses it has found is auto-tuning. I know, I was pretty amazed to find stochastic processes in pop music.

Before we dive in, let’s look at another artifact from our Poisson distribution. With the distribution we fit on Twitter status updates, we can also run an auto-correlation analysis, which measures how closely the data resembles a delayed copy of itself and so tells us how random it is. By taking out some of the randomness, we could make the data more predictable. For example, instead of saying that we had Twitter status updates at times 1,2,3,2,1,2,3,1,3, we could just average them and say that all of the arrivals came at time 2. It’s not true, but that’s the kind of simplification this analysis leads to.
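For readers who want to try this on the toy sequence, here is a small sketch that computes the sample auto-correlation at a few lags alongside the plain average. The function name `autocorrelation` and the particular normalization (dividing by the total variance) are my choices; other definitions differ slightly.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0 or lag >= n:
        return 0.0
    covariance = sum((series[i] - mean) * (series[i + lag] - mean)
                     for i in range(n - lag))
    return covariance / variance

# The toy sequence from the text.
arrivals = [1, 2, 3, 2, 1, 2, 3, 1, 3]
print([round(autocorrelation(arrivals, lag), 2) for lag in range(1, 5)])
print(sum(arrivals) / len(arrivals))  # the single "average" value
```

Values near zero at every lag suggest there is little structure to exploit; values near 1 or -1 suggest the series repeats itself and can be predicted from its past.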

When T-Pain sings, his voice comes out at different frequencies. Even when he says a single word, T-Pain’s voice comes out with a few different frequencies. To simplify things a lot, say a certain segment of a song lasts 5 seconds and contains five notes, played sequentially: A, B, C#, D, E. To auto-tune, we could average his voice throughout to get that good auto-tune effect that mars most of today’s rap music.

The connection isn’t entirely apparent quite yet, but come back next time for the full explanation.

Forecasting Top-100 Radio Music Success

I’ve been thinking about this project a lot. It started off as my final project for a predictive analysis class, but I’m still thinking about it a fair amount. Here’s the project wiki page: http://bluwiki.com/go/Music_Analysis.

The purpose of this project is to predict, out of a set of songs, which will be successful and which will not. I think it has some pretty good implications in the music industry. A lot of people have done work in the field, but an open-sourced movement with people purely motivated by solving cool problems should yield some interesting results!

Hypothesis: There is a correlation between the content of a song (lyrics, sound, cultural context) and how successful it is throughout its lifetime, and the correlation coefficient is at least .6.

I’m aware that my technical writing can be confusing sometimes. Please make edits to the page if you can think of a way to make it clearer. Don’t hesitate to post any questions here or on the wiki page.

Ben Franklin would support open-source software

Another instance of the constraining effect of republican political ideal is
Benjamin Franklin’s refusal to exploit his inventions for private profit. Thus
Franklin’s reaction when the governor of Pennsylvania urged him to accept a
patent for his successful design of the “Franklin stove”:

    Governor Thomas was so pleased with the construction of this stove
    as described in… [the pamphlet] that… he offered to give me a
    patent for the sole vending of them for a term of years; but I
    declined it from a principle which has ever weighed with me on
    such occasions, namely; viz., that as we enjoy great advantages
    from the invention of others, we should be glad of an opportunity
    to serve others by any invention of ours, and this we should do
    freely and generously [emphasis in original].

From Leo Marx, “Does Improved Technology Mean Progress?”

UVaHousing.com

I applied some finishing touches this morning and have since completed UVaHousing.com. Have a look and find a place to live. Version 2 is on the way soon; leave a comment if there are some missing features you’d like to see!

The site is quite cool under the hood. Anybody can build a real-estate website with a week of time and their favorite web framework. This project is a bit more than that to me; I set out to build a Django site that really abides by the Django design philosophy: no quick hacks, doing everything the proper way while maintaining significant attention to detail. To most Python hackers, this is old news, but it’s pretty cool to me at least.

I can’t really emphasize enough how important it is to learn the style and philosophy behind any tool you use. It’s always exciting to start playing with a new program or framework as soon as you download it, but don’t underestimate the utility of taking your time and learning how to do things the right way. I’ve been using Django for about a year now, but have only begun to realize how useful it actually is.

Globalizing the Arab World

The Romantic Revolutionary has a very interesting post about using technology to globalize the Arab World. Check it out here! If you want to change the world, you should be following great minds like these!

Autumn Mix 2008

Here’s an old mix I did in autumn of last year — contains some fun tracks to get you going, whether your going involves writing, coding, driving, or launching social networks about fashion.

http://dl.getdropbox.com/u/274/autumn-mix-01.mp3

Expect another mix out soon! To keep up with this and other mixes I make, add my RSS feed to your favorite RSS reader. Here’s the link:

http://feeds2.feedburner.com/omarishBlog/.

Using Zip-codes as Networks in your Location-Based App

You can greatly enhance your location-based app by using zip-codes.

Building a location-based application is an interesting ordeal. It involves creating a few basic networks that people can start with, and expanding from there. Facebook, for example, started with Harvard as its only network and expanded from there to allow more college networks, and then the rest of the world.

Craigslist does the same. Your content is based on the URL; so if you go to washingtondc.craigslist.org, you get content tailored for Washington, DC. Equally, if you use charlottesville.craigslist.org, you get content for Charlottesville, VA.

When scaling a site like this, creating networks is a very daunting task. You have to come up with huge lists of cities and scale up. Here’s my solution.

All US locales have zip codes. Zip-codes are unique and most can be traversed within a 5-minute drive. There are software packages available, like GeoDjango, that let you search based on zip code.

Let’s focus on the Craigslist model. Instead of having charlottesville.craigslist.org, you could have craigslist.org/22903 (or 22903.craigslist.org). The URL isn’t quite as pretty, but it contains much more information than the city name alone.
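Either URL shape is easy to read on the server side. Here is a minimal sketch, assuming plain five-digit US zip codes; the helper name `zip_from_url` is hypothetical and this is not how Craigslist actually routes requests.

```python
import re
from urllib.parse import urlparse

ZIP_RE = re.compile(r"\d{5}")  # assumes plain five-digit US zip codes

def zip_from_url(url):
    """Return the zip code from the first path segment or the subdomain, else None."""
    parsed = urlparse(url)
    first_segment = parsed.path.strip("/").split("/")[0]
    if ZIP_RE.fullmatch(first_segment):
        return first_segment
    subdomain = (parsed.hostname or "").split(".")[0]
    if ZIP_RE.fullmatch(subdomain):
        return subdomain
    return None

print(zip_from_url("http://craigslist.org/22903"))
print(zip_from_url("http://22903.craigslist.org/"))
print(zip_from_url("http://charlottesville.craigslist.org/"))
```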

Then, you could replace your search logic. Instead of filtering your query based on network (i.e., network = “charlottesville”), you could order based on distance from your target zip-code.
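Here is a rough sketch of distance-based ordering in plain Python. It deliberately avoids GeoDjango's API and uses the haversine formula instead; the `ZIP_CENTROIDS` table, its approximate coordinates, and the listings are all hypothetical values for illustration.

```python
import math

# Hypothetical zip-code centroids (latitude, longitude); approximate values
# for illustration only. A real app would load these from a zip-code dataset.
ZIP_CENTROIDS = {
    "22903": (38.03, -78.49),   # Charlottesville, VA
    "20001": (38.91, -77.02),   # Washington, DC
    "23220": (37.55, -77.46),   # Richmond, VA
}

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))  # 3958.8 = Earth radius in miles

def nearest_listings(listings, target_zip, radius_miles):
    """Titles of listings within the radius, ordered nearest first."""
    origin = ZIP_CENTROIDS[target_zip]
    scored = []
    for listing in listings:
        d = haversine_miles(origin, ZIP_CENTROIDS[listing["zip"]])
        if d <= radius_miles:
            scored.append((d, listing["title"]))
    return [title for d, title in sorted(scored)]

listings = [
    {"title": "Mustang in DC", "zip": "20001"},
    {"title": "Mustang in Richmond", "zip": "23220"},
    {"title": "Mustang in Charlottesville", "zip": "22903"},
]
print(nearest_listings(listings, "22903", 80))
print(nearest_listings(listings, "22903", 150))
```

Widening the radius from 80 to 150 miles is a single parameter change, which is the point of ordering by distance rather than filtering by network.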

The upsides are countless. Say I’m searching for a 1965 Shelby Mustang GT on Craigslist: I could simply expand my search radius instead of going from network to network. This decreases the time it takes me to find what I’m looking for and enhances the user experience. I can find something that’s hard to find, much faster.

There are a couple of down-sides, though.

1. Installing a geographically-aware package can be hard. I installed GeoDjango for dusoto a few months ago and had a pretty tough time, even with the nicely-written manual.

2. This won’t work as well internationally. Not all countries have a zip-code system that one can use. Great Britain, Canada, and Australia have good postal code systems; countries like Lebanon do not.

3. It’s a bit more computationally complex on the database to do a location-based search. It might take .01 seconds to do the search instead of the .001 it would take to do the keyword-based search.

These things aside, I could see a zip-code based approach overcoming the network approach in the next couple of years, especially as things move toward location-aware devices.

FreeCreditReport.com is a Scam

PAYMENT INFORMATION
When you order your free report here, you will begin your free trial membership in Triple Advantage℠ Credit Monitoring. If you don’t cancel your membership within 9 days of enrollment, you will be billed $14.95 for each month that you continue your membership. If you are not satisfied, you can cancel at any time to discontinue the membership and stop the monthly billing; however, you will not be eligible for a pro-rated refund of your current month’s paid membership fee.