My First Kaggle Contest, part 1

At long last, I decided to enter my first Kaggle contest. For the uninitiated, Kaggle hosts predictive data science competitions. For example, Zillow recently ran a contest on Kaggle to improve their pricing algorithm. Prizes for the competitions can be pretty substantial (the Zillow prize pool was $1.2 million!).

As you can read about in my analysis of their survey, Kaggle is seen as a great resource for learning the tools of data science. My own review is a bit more mixed.

About two weeks ago I entered the TalkingData AdTracking Fraud Detection Challenge. According to the competition overview, TalkingData is China’s “largest independent big data service platform,” and “covers over 70% of active mobile devices nationwide.” The contest is to predict whether a person will download an app after clicking on an ad. They provide 200 million clicks over four days with which to train your model.

That last bit is what makes this competition challenging. The CSV file for the training data is over 7.5 gigs (that is A LOT of commas)! Now, this is no Big Data, but it’s enough to make my little MacBook with a paltry 16 gigs of RAM cough and sputter a bit. What’s more, of those 200 million clicks, only 0.2% result in a download. That makes for a very unbalanced data set, which brings its own challenges.
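
For anyone wrestling with a similarly oversized CSV, one trick that helps is telling pandas up front to use small integer types instead of its default 64-bit ones. Here’s a sketch (the column names come from the competition data; the file name and the specific dtypes are my own choices):

```python
import pandas as pd

# Downcast the encoded ID columns so pandas doesn't default to int64,
# which cuts the in-memory size of the frame by a large factor.
dtypes = {
    'ip': 'uint32',
    'app': 'uint16',
    'device': 'uint16',
    'os': 'uint16',
    'channel': 'uint16',
    'is_attributed': 'uint8',
}

train = pd.read_csv(
    'train.csv',
    dtype=dtypes,
    usecols=list(dtypes) + ['click_time'],  # skip columns you don't need
    parse_dates=['click_time'],
)
```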

The competition has been something of a trial by fire. I had to learn the Kaggle platform, deal with the rawest data I’ve seen yet, and become familiar with packages like XGBoost and LightGBM. Submissions are scored using the area under the curve (AUC) of an ROC plot (more on this in an upcoming post). Roughly, it’s a measurement of how accurate your predictions are, taking into account false negatives and false positives. A score of 1 is perfect. My current best is sitting at 0.9633. Pretty good, right? Ha, wrong! My rank on the leaderboard is 1,244th out of 2,196. The current leader has a score of 0.9815. The competition is pretty fierce.
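
If you’re curious how that score is computed, scikit-learn will do it in one line. A toy example with made-up labels and predicted probabilities:

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth (1 = downloaded the app) and model probabilities
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.10, 0.40, 0.85, 0.05, 0.60, 0.20, 0.30, 0.90]

# 1.0 here, because every download was ranked above every non-download
print(roc_auc_score(y_true, y_prob))
```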

I’m not sure what everyone else is doing that I’m not. I’ve built several models, both from scratch and based on what others have posted, but I still can’t get any higher. I have a few ideas left, and I’ll let you know how they go. But here are some lessons I’ve learned so far:

1. More and fancier features are not better.

I spent a lot of time looking at the data, trying to figure out what might be important. Here’s a screen shot of what it looks like:

[Screenshot: the first few rows of the training data]

One of the reasons I chose this competition is the limited number of features (columns). The first five are encoded, meaning that TalkingData knows what they refer to, but to us they are just categories (i.e. maybe device 1 is an iPhone, maybe app 3 is Chrome). Overall there are just 6 features, with the “is_attributed” column telling us whether the user downloaded the app. There are really only a few new features you can create here. I looked at the total counts for ip, app, os, and channel, and I looked at the mean difference in click_time, thinking that fraudulent clicks will come faster than clicks made by a real person. I also included the hour of the day.

But I think this was overboard. As you can learn from reading my post on overfitting, more features can lead to a low-bias, high-variance model; in other words, I think I’ve overfit the training data, so my model does not generalize well to the test data. I’m considering dialing back some of these features to err on the side of simplicity.
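
For reference, here’s roughly how features like these can be built with pandas. This is a sketch: the column names match the competition data, but the toy rows, feature names, and grouping choices are mine.

```python
import pandas as pd

# Tiny stand-in for the real training frame (same column names as the competition data)
train = pd.DataFrame({
    'ip': [1, 1, 1, 2, 2],
    'app': [3, 3, 7, 3, 9],
    'os': [13, 13, 13, 19, 19],
    'channel': [497, 497, 101, 497, 30],
    'click_time': pd.to_datetime([
        '2017-11-07 09:30:00', '2017-11-07 09:30:02', '2017-11-07 09:30:03',
        '2017-11-07 10:15:00', '2017-11-07 10:45:00',
    ]),
})

# Hour of the day
train['hour'] = train['click_time'].dt.hour

# Count features: total clicks for each ip / app / os / channel
for col in ['ip', 'app', 'os', 'channel']:
    train[f'{col}_count'] = train.groupby(col)[col].transform('size')

# Mean gap (in seconds) between successive clicks from the same ip --
# the hunch being that bot clicks arrive faster than human ones
train = train.sort_values(['ip', 'click_time'])
gaps = train.groupby('ip')['click_time'].diff().dt.total_seconds()
train['ip_mean_gap'] = gaps.groupby(train['ip']).transform('mean')

print(train[['ip', 'hour', 'ip_count', 'app_count', 'ip_mean_gap']])
```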

2. Here are some good tips for dealing with unbalanced data. I’m currently working some of these ideas into my latest model; a sketch of one common approach is below.
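
That approach is to weight the rare positive class more heavily during training, so the model can’t win by always predicting “no download.” Here’s what that might look like with LightGBM’s scale_pos_weight parameter (a sketch on made-up data; in practice X would be the engineered features and y the is_attributed column):

```python
import numpy as np
from lightgbm import LGBMClassifier

# Made-up stand-ins for the real feature matrix and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 6))
y = (rng.random(10_000) < 0.002).astype(int)  # ~0.2% positives, like the real data

# Weight positives by the negative/positive ratio so the rare class
# contributes meaningfully to the loss
ratio = (y == 0).sum() / max((y == 1).sum(), 1)

model = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    scale_pos_weight=ratio,
)
model.fit(X, y)
```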

3. Kaggle can be a time suck.

I definitely have spent too much time on this. I’ve learned a lot, yes, but I also spent a week trying to turn my old desktop into a CUDA-powered GPU machine. I got Ubuntu and all my packages installed just fine, but I wanted to go one step further and use the GPU on my old NVIDIA GeForce 760. I tried for hours and hours. No luck. I kept running into problems with the display driver. And this was all in an effort to model faster and push my score higher. That last bit’s the rub. Maybe it’s more of a personal character flaw, but I wasn’t satisfied with just learning cool new tools…I wanted to be in the top 10% at least. I definitely feel that I’ve neglected my other studies (and probably my kids a bit too). I need to learn to budget my Kaggle time better. Maybe I should walk away with my knowledge and not worry as much about my rank…

4. The Kaggle community is pretty great.

I’m super impressed with the help that people offer each other. The discussions are interesting and useful, and so many people post kernels from which you can build your own. Kaggle’s servers also give you 16 gigs of RAM to work with (i.e. you don’t need any languages or packages installed on your own computer; you can do it all through your browser!). I look forward to doing more competitions, and to using other platforms like DrivenData, where the competitions benefit non-profits.

That’s all for now. More coming in part 2!


Kaggle Data Scientist Survey 2017

To the reader,

The best way to read this report is on my GitHub page here. You can also play around with the code yourself by forking (copying) the kernel on Kaggle itself here. I tried various ways to get the report to render correctly here on WordPress, but to no avail. If anyone has some CSS magic that could make the window below bigger (i.e. without the scrollbar), let me know!

This report got longer than I anticipated. There’s quite a bit of rich data from the survey. I hope to do another report sometime soon that looks into more conditional insights.

Enjoy!

-Chad

Fitting the Noise

I’m still not in a place to produce quality, original analysis of my own, so I thought I’d teach you all about what is probably the most common pitfall in data science: overfitting.

In very broad strokes, machine learning consists of splitting your data set into two chunks: a training set and a test set. You take whatever model you are attempting to use, whether it’s linear regression, k-nearest neighbors, a random forest, or something else, and “train” it on the training set. This means tuning the model’s parameters to minimize whichever error function you’re using; in other words, you are finding the version of your function that best fits the training data. Finally, you take that trained model and see how well it performs on the test data. This type of machine learning is called supervised learning, because we know the answers in our test set and can therefore measure directly how well our model performs.
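
In scikit-learn, that whole workflow is only a few lines. Here’s a sketch on toy data (the random forest is just an arbitrary choice of model):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Labeled data: we know the right answer for every row
X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Split it into a training set and a held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# "Train" the model: fit its parameters to the training set only
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

# Then see how well it does on data it has never seen
print('test accuracy:', model.score(X_test, y_test))
```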

Several interesting themes emerge when attempting to fit models to data like this. Allow me to illustrate with a somewhat lengthy example (credit goes to this talk from Dr. Tal Yarkoni at the Machine Learning Meetup in Austin for the overall structure of the example; the code underlying the analysis is my own).

Here’s some data:

[Figure: the generated data]

I created a simple quadratic function and added some normally distributed noise. Our goal is to find a polynomial that best fits this data. First, we randomly pick out 60% of our data on which to train, leaving the other 40% to test our model. I chose these proportions mainly for illustrative purposes:

[Figure: the 60% training points and the 40% test points]
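
Here’s roughly how data like this can be generated and split. It’s a sketch: my actual coefficients and noise level may differ, but the idea is the same.

```python
import numpy as np

rng = np.random.default_rng(42)

# A simple quadratic signal with normally distributed noise on top
x = np.linspace(-3, 3, 100)
y = 1.5 * x**2 + 2 * x + 3 + rng.normal(scale=3, size=x.size)

# Randomly hold out 40% of the points for testing; train on the other 60%
idx = rng.permutation(x.size)
x_train, y_train = x[idx[:60]], y[idx[:60]]
x_test, y_test = x[idx[60:]], y[idx[60:]]
```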

With the naked eye we can detect an overall upward trend, perhaps with a little bump at the front (your brain is pretty good at finding patterns). Let’s try a linear fit (i.e. a first-degree polynomial):

[Figure: linear (1st-degree) fit on the training and test data]

On the left in blue, we see the training points along with the best-fit line of our model. It doesn’t really capture the overall trend of the data. We call a model like this underfit. The program uses least squares regression, trying to minimize the sum of the squares of the residuals (a residual is the difference between what the model predicts and the actual value of the data—we square it so a positive residual is not canceled out by an equal but negative one). On the right, we see the test data, the same linear model, and, plotted in red, the actual function I used to create the data. Notice the MSE value on each graph. MSE stands for mean squared error: the average of the squared residuals. The gist is this: the closer to zero we can get the MSE, the better our model. There are other (and better) ways to measure the success of a model, but for our purposes the MSE will suffice. Let’s try a quadratic function (2nd-degree polynomial) and see if our MSE decreases.

[Figure: quadratic (2nd-degree) fit on the training and test data]

Pretty good! Notice how close we get to our target function, despite the noise I added (I imagine that normally distributed noise with a small standard deviation averages out in this situation). The MSE for both our training data and our test data is a lot lower than for our linear fit. We should expect this, since the actual function I used is a 2nd-degree function. This model actually captures the underlying structure of the data. Often we don’t have the luxury of knowing this, but we’re learning here, so it’s okay.

Now, let’s make our model more complicated and really try to fit the training data with a 10th-degree polynomial:

[Figure: 10th-degree fit on the training and test data]

Aha! Something interesting has happened. Notice how the MSE on our training data is lower than for our 2nd-degree fit. The extra degrees of freedom allow the model to squiggle up and down, getting close to all the little bumps and dips in our training data. But look what happens when we test our model: the MSE is higher than for our 2nd-degree fit! Here, at last, is the impetus behind the title of this post: we have no longer fit the underlying structure of the data—we have fit the noise instead.
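
For the curious, fits like these and their MSE values take only a few lines of NumPy. This sketch rebuilds the toy data from above and tries the three degrees we’ve looked at (the exact numbers will differ from my plots):

```python
import numpy as np

# Rebuild the toy quadratic data and the 60/40 split from the earlier sketch
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 100)
y = 1.5 * x**2 + 2 * x + 3 + rng.normal(scale=3, size=x.size)
idx = rng.permutation(x.size)
x_train, y_train = x[idx[:60]], y[idx[:60]]
x_test, y_test = x[idx[60:]], y[idx[60:]]

def mse(y_true, y_pred):
    # Mean squared error: the average of the squared residuals
    return np.mean((y_true - y_pred) ** 2)

# Fit a polynomial to the training points only, then score both sets
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    print(f'degree {degree:2d}: '
          f'train MSE = {mse(y_train, np.polyval(coeffs, x_train)):6.2f}, '
          f'test MSE = {mse(y_test, np.polyval(coeffs, x_test)):6.2f}')
```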

The terms signal and noise come from electrical engineering: the signal is the goal, the underlying “truth” of the matter, while the noise is all the extra bits of randomness from various sources. For an accessible introduction to all of this, I highly recommend Nate Silver’s The Signal and the Noise: Why So Many Predictions Fail—but Some Don’t. When we test our model against data it has never seen, it fails because the model was built to satisfy the idiosyncrasies of our training set. Thus, when it encounters the idiosyncrasies of our testing set, it misses its target. The model is overfit.

In the real world, we don’t really know the underlying structure of the data. So how can we guard against overfitting? How do we know if we are fitting the noise or if we are beginning to capture the underlying signal? One way to check ourselves is to run multiple “experiments.” Keeping our overall dataset the same, we can randomly choose different training and test sets, creating models and tests for each one. Let’s do this and plot all of the fits as well as the average fit:

[Figure: 100 resampled fits and their mean, for each of the three models]

Each of the 100 fits I ran is plotted as a faint blue line, while the mean fit is plotted with a dark line. We can learn a lot from these graphs. First, notice that the first two graphs don’t change much. Each model, no matter which 60% of the points we sample, turns out about the same. But take a look at the 10th-degree fit: it’s much wilder, sometimes up, sometimes down, with bumps here one time and not the next. This is a nice illustration of variance. The first two models have low variance: the prediction at any given point doesn’t change a whole lot from model to model. The 10th-degree fit has high variance, with the prediction at a single point having a much wider range of movement.

I gathered all of the error data for each of these fits:

[Figure: histograms of the MSE for each model]
[Figure: boxplots of the MSE for each model]

Here we can see both the accuracy and the variance. Notice the tall peaks in the histograms for the 1st- and 2nd-degree models. The errors are all clustered pretty close together (i.e. each model isn’t changing too much from iteration to iteration). This is especially noticeable in the boxplot of the 2nd-degree fit. The colored areas represent the central 50% of the measurements, also known as the interquartile range or IQR. Notice how narrow it is compared to the other two models.

Directly related to the problem of under- and overfitting your data is the bias-variance trade-off. In a nutshell, bias can be defined as error introduced to your prediction by the assumptions made about the structure of the underlying data. In our case, models with a lower degree have a higher bias. The 10th-degree polynomial has more degrees of freedom, so it is less constrained with respect to the shapes of trends it can model, whereas the linear and quadratic cases can only model straight lines and parabolas, respectively. Combining this with what we’ve already said about the variance of each of the models, we see there is a trade-off: models can generally have high bias and low variance, or low bias and high variance. Often, the job of a data scientist is to find the sweet spot in the middle. Like this:

[Figure: training and test MSE versus polynomial degree]

I have to say I’m proud of this little graph (it was the culmination of four days of coding in between being a dad). I modeled the same 100 training and test sets shown above with every degree of polynomial from 0 to 10, keeping track of the errors along the way. The result is this graph, with the average errors shown as the dark lines. You can see that the blue training line steadily gets closer to zero as we increase the degree of the polynomial. In other words, the more degrees of freedom we give our model (low bias), the better it can fit the training data (in fact, with 60 degrees of freedom, i.e. a 59th-degree polynomial, the training MSE would be exactly zero, since the curve could pass exactly through each of the 60 training points).

However, as we run those models on the testing data, they get substantially worse as our degree increases (the red line). This increased error is due to having fit the noise of the training data. Since this noise changes with each set, each model varies wildly. We have decreased the bias of our underlying assumption at the expense of greater variance and unpredictability. What we are after is the lowest error along that red line: surprise, surprise, a degree of two! Data scientists use a similar technique to tune their models and discover the underlying trends in their data.
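
A stripped-down version of the experiment behind that graph might look like this (a sketch; the real code also holds on to every individual fit for plotting):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed noisy dataset, as in the plots above
x = np.linspace(-3, 3, 100)
y = 1.5 * x**2 + 2 * x + 3 + rng.normal(scale=3, size=x.size)

degrees = range(11)   # polynomial degrees 0 through 10
n_runs = 100          # number of random train/test splits
train_mse = np.zeros((n_runs, len(degrees)))
test_mse = np.zeros((n_runs, len(degrees)))

for run in range(n_runs):
    # A fresh random 60/40 split each run; the data itself stays the same
    idx = rng.permutation(x.size)
    tr, te = idx[:60], idx[60:]
    for d in degrees:
        coeffs = np.polyfit(x[tr], y[tr], deg=d)
        train_mse[run, d] = np.mean((y[tr] - np.polyval(coeffs, x[tr])) ** 2)
        test_mse[run, d] = np.mean((y[te] - np.polyval(coeffs, x[te])) ** 2)

# Training error keeps falling as the degree grows;
# test error bottoms out around degree 2 and then climbs back up
for d in degrees:
    print(f'degree {d:2d}: train {train_mse[:, d].mean():7.2f}   test {test_mse[:, d].mean():7.2f}')
```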

As hinted at above, real-world cases are not this cut and dried. The models are much more complicated, with many more variables and features involved. Often, the scientist has very little information about the underlying structure, and sometimes the model’s accuracy won’t be known until more data is captured. But overfitting—fitting the noise over the signal—is a problem with which every data scientist must contend on a daily basis. And now you know why!

***For all you fellow Python lovers and data heads (or just for the curious), check out the complete code for this post on my GitHub page! I’d love your feedback.***


Afterword

While thinking about and writing this post, I was struck by its use as an analogy for many of our society’s problems. We have evolved to find the patterns around us: this seemingly causing that, event A always following event B, etc. Our brain’s pattern-finding behavior was a distinct evolutionary advantage and probably primarily responsible for our rise to dominant species on the planet. But it can also lead us astray: we stereotype and let those stereotypes invade our social systems at the deepest levels; we tend to think tomorrow will be like today and have difficulty imagining long time spans, leading to doubts about climate change and the like; we let the few speak for the many, with so many of our squabbles revolving around which few get to speak for which many. I can see how these are all like the problem of overfitting: we use too few data points to generalize to the world around us. Keep an eye on your models of the world, friends! Don’t be afraid to let a little more data into your brain!

:o)

My Random Walk

I have had a difficult time figuring out when to stop learning and to start trying to work on something of my own. I feel a pretty strong urge to make something original. I downloaded years of data from the Florida Department of Education, but that whole thing seemed so daunting. Also, I don’t feel like I know enough yet to really discover anything. I can make some graphs and do some exploratory data analysis, but I haven’t dived very deep into the statistical thinking or machine learning required to really find insights.

So here is a little tidbit to go along with my own path which has brought me here: a random walk.

[Figure: 1 walk, 1,000 steps]

From a starting point, in this case the origin, randomly take a step either forward, backward, to the right, or to the left. The probability of each choice is the same. And repeat. The image above shows a random walk with one thousand steps. I generated it with some Python code using a visualization package called Bokeh. You can find the code on my GitHub page.
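
Here’s a minimal version of how a walk like this can be generated (a sketch with NumPy; the Bokeh plotting is left out):

```python
import numpy as np

rng = np.random.default_rng()

# The four possible moves -- right, left, up, down -- each equally likely
moves = np.array([(1, 0), (-1, 0), (0, 1), (0, -1)])

# Pick 1,000 random steps and accumulate them, starting from the origin
choices = rng.integers(0, 4, size=1000)
path = np.vstack([[0, 0], moves[choices]]).cumsum(axis=0)

x, y = path[:, 0], path[:, 1]   # ready to hand to Bokeh (or any plotting library)
print('final position:', path[-1])
```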

Random walks are a common feature in nature. A photon released by fusion in the center of the sun can take over a million years to reach the sun’s surface as it is absorbed and re-emitted in a random direction by billions of hydrogen atoms. Once freed from its walk, the photon reaches Earth only eight and a half minutes later. Brownian motion, described by Einstein in one of his “miracle year” papers of 1905, can also be modeled as a random walk.

I decided to explore random walks with Bokeh because the package can create interactive visualizations. I got it all working on my home machine, but I haven’t figured out how to host them on this site. So for now, instead of creating your own walks, you must find satisfaction in this gallery of randomly colored random walks. I made the lines semi-transparent so you can see the overlap. Once I get the interactive version working online, I’ll update this post (it lets you slide through the walk and watch the whole thing unfold. Very cool!).

[Figure: 100 walks, 100 steps each]
[Figure: 100 walks, 1,000 steps each]
[Figure: 100 walks, 2,000 steps each]
[Figure: detail of 100 walks, 2,000 steps]
[Figure: 100 walks, 10,000 steps each]
[Figure: 1,000 walks, 2,000 steps each]
[Figure: 1,000 walks, 10,000 steps each]

I did some simple analysis of the distribution of the final distances traveled by the walks. Since each walk begins at the origin, the straight-line distance it has traveled is found with your favorite equation from algebra: a² + b² = c². I ran 10,000 walks with 100 steps each. Here are the results:

[Figure: normalized histogram of the distances to the origin]
[Figure: cumulative distribution function of the distances]

In blue are the empirical results, and in green is a normal distribution with identical mean and standard deviation. At first, I thought the heavy counts to the left of the mean might be noise, but after running the simulations several times, the distribution continued to be skewed the same way. The distances are not normally distributed! You have a slightly better chance of ending up closer to the origin than the average distance traveled; your chances of being on the low side of the mean are better than being on the high side.

Perhaps that makes sense: to get far from the origin, your steps overall need to be in the same direction (or at least 2 of the same directions, e.g. if you keep going up and to the right). This is less likely than having a roughly equal number of steps in each direction. Maybe I can do some further analysis and count the steps each walk takes and see where that leads me.
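
If you’d like to poke at that distribution yourself, here’s roughly the simulation behind those histograms (again a sketch, with the plotting omitted):

```python
import numpy as np

rng = np.random.default_rng()
n_walks, n_steps = 10_000, 100

# One random step per walk per time step, drawn from right/left/up/down
moves = np.array([(1, 0), (-1, 0), (0, 1), (0, -1)])
steps = moves[rng.integers(0, 4, size=(n_walks, n_steps))]

# The final position of each walk is the sum of its steps,
# and its distance from the origin comes from the Pythagorean theorem
finals = steps.sum(axis=1)
distances = np.hypot(finals[:, 0], finals[:, 1])

print('mean distance:  ', distances.mean())
print('median distance:', np.median(distances))  # below the mean: the skew described above
```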

I’m sure that this is all well documented in the analytic literature, but it has been fun running through it on my own. Currently, I’m working through some of the stats courses on DataCamp, after which comes the machine learning stuff. Hopefully I’ll have more tools at my disposal soon to actually dive into some real data.

Until then!