My First Kaggle Contest, Part 2

So it’s finally over! I can come out of the Kaggle hole in which I’ve been hiding for the last month or so. It’s been quite a little ride. When we left off, I was struggling to engineer some new features and  was looking at ways to deal with the size of the data set, such as using an Easy Ensemble (which failed miserably). Since then, I had several little break throughs.

First, I started using a much better validation method. This let me really gauge whether the features I was adding and the model parameters I was using were adding to my model’s success or causing it to overfit the training data.

Second, I spent a lot of time reading the forums and realized I was missing a key piece of information. TalkingData had accidentally released a larger test set early on, and were now providing it to everyone so nobody had an unfair advantage. I needed to incorporate this larger set when calculating my features on the test set since many of my features involved counting clicks over certain groupings and measuring the time between clicks. Without the full picture, my calculations had been off.

Third, due to the way TalkingData set up the encoded features, it made sense to create my features on a daily basis rather than for the entire set altogether (i.e. the group counts, etc. would reset each day. See the chart below.)

Screen Shot 2018-05-11 at 9.51.05 AM
One of the charts I made to demonstrate how the day affected model validation.

So not too long after my last post, armed with the above changes, I jumped up to 17th out of about 3000 teams! What!? Huge!

Partially, I think I got a little lucky. I feel like I was thinking through things, but using only my 14 features and getting an AUC score of .9805 seems pretty magical, especially when others were using fancier computers and more features only to be stuck in the high .9700s. But I’ll take a little magic I guess!

My relatively simple and high performing model caught the attention of another team, and they reached out to me. We had fairly different approaches, and their score was pretty close to mine, so we joined forces, hoping that combining our models would prove effective. The story from here is rather long. In a nutshell, we spent a lot of time writing a separate report that we sent directly to TalkingData explaining why we thought the competition setup was a bit flawed. Submissions were evaluated based on the AUC score of predictions about whether the app was downloaded, but the competition was billed as fraud detection. We discovered machines that were obviously fraudulent clickers but which still downloaded the app.

In the meantime, other teams continued to improve their models and our position on the leaderboard dropped. We adjusted our model to its final form, and we were in a good place to finish in the top 4% or so, a fairly satisfactory place for a first time contestant. But, alas, 8 hours before the end of a three-month-long competition, a user posted a kernel which provided a score of .9811. So all of the copy-and-paste aficionados leap frogged us, and we fell down to 12%. I ended up pretty disappointed in how the competition ended.

But, I learned a lot along the way. Here are my takeaways:

The daily work involved in doing well in a Kaggle contest is not the same work a data scientist or analyst does. The models created by data scientists are deployed at scale, so efficiency must be carefully weighted against accuracy, as accuracy often comes from complexity, and complexity comes at the cost of speed. For Kaggle, the only factor that matters is accuracy. In a Kaggle contest, spending hours and hours getting your metric to increase by 0.0005 is time very well spent. This is not the case in the real world. Kaggle can be great fun, and a great learning experience, but the competitions should be approached the same way an RPG should be. Do you want to spend 250 hours leveling up from level 75 to 76? Yeah? Great! Go ahead. But if you’re more interested in just learning the ropes, maybe try to get into the top 25% and move on.

While the Kaggle community is great about sharing and teaching, this ends up being a double-edged sword. I definitely got a lot of help at the beginning through the discussions and the kernels. I would comb through the forums looking for new ideas if I got stuck on my end. But I’m not sure that the ability to run a full kernel and get the exact submission from someone else is beneficial to the community, and I definitely feel it’s a detriment to the competition. The forums should be a place for exploratory data analysis, code snippets, etc, but not for sharing complete solutions to the problem. So much of my and my teammates hard work was undone at the last second from one person posting a kernel (and all those competitors who took advantage of it). If somebody’s Kaggle rank is supposed to mean anything in the real world, copy-and-pasting your way to the top shouldn’t be possible.

I recently listened to an episode of DataFramed where Hugo Bowne-Anderson interviews Anthony Goldbloom, the CEO of Kaggle. Goldbloom says that the competitions on Kaggle are only about a fourth of the activity on the site. Kaggle also offers dataset storage and sharing, and the kernels can be used to write, share,  and run any code you want. Kaggle was recently acquired by Google, and will soon grant access to GPU (graphics processing unit) and TPU (tensor processing unit) machines. I know of no where else the public can have access to this kind of computing power for free. Deep learning with artificial neural networks is one of the areas on which I’d eventually like to focus, so I’m excited to see this feature launched.

Overall, despite how it ended, I’m fairly satisfied with my first Kaggle competition experience. I got to try a nice handful of different algorithms and techniques on a real-world data set and have my results measured against others in the field. If you’re an aspiring analyst or data scientist, but haven’t dived into the world of Kaggle yet, then, to quote the immortal Ms. Frizzle: “Take chances, make mistakes, and get messy!”

PS – has several tutorials on how to get started with Kaggle competitions. Check them out!

My First Kaggle Contest, part 1

At long last, I decided to enter my first Kaggle contest. For the uninitiated, Kaggle hosts predictive, data science competitions. For example, Zillow recently had a contest on Kaggle to better improve their pricing algorithm. Prizes for the competitions can be pretty substantial (the Zillow prize pool was $1.2 million!).

As you can read about in my analysis of their survey, Kaggle is seen as a great resource for learning the tools of data science. I have something of a more mixed review.

About two weeks ago I entered the TalkingData AdTracking Fraud Detection Challenge. According to the competition overview, TalkingData is China’s “largest independent big data service platform,” and “covers over 70% of active mobile devices nationwide.” The contest is to predict whether a person will download an app after clicking on an ad. They provide 200 million clicks over four days with which to train your model.

That last bit is what makes this competition challenging. The csv file for the training data is over 7.5 gigs (that is A LOT of commas)! Now, this is no BigData, but it’s enough to make my little MacBook with a paltry 16 gigs of RAM cough and sputter a bit. What’s more, out of all those 200 million clicks, only 0.2% result in a download. This results in a very unbalanced data set, which brings its own challenges.

The competition has been something of a trial-by-fire. I had to learn the Kaggle platform, how to deal with the most raw data I’ve seen yet, and I had to become familiar with packages like XGBoost and Light GBM. The competition results uses the area-under-the -curve of an ROC plot (more on this in an upcoming post). Roughly, it’s a measurement of how accurate your prediction is, taking into account false negative and false positive predictions. A score of 1 is perfect. My current best is sitting at 0.9633. Pretty good right? Ha, wrong! My rank on the leaderboard is 1,244th out of 2196. The current leader has a score of 0.9815. The competition is pretty fierce.

I’m not sure what everyone else is doing that I’m not. I’ve built several models from scratch and based on what others of posted, but I still can’t get any higher. I have a few ideas left, and I’ll you know how that goes. But here are some lessons I’ve learned so far:

1. More and fancier features are not better.

I spent a lot of time looking at the data, trying to figure out what might be important. Here’s a screen shot of what it looks like:Screen Shot 2018-04-04 at 12.13.47 PM

One of the reasons I chose this competition is the limited number of features (columns). The first five are encoded, meaning that TalkingData knows what they mean by them, but to us they are just categories (i.e. maybe device 1 is an iPhone, maybe app 3 is Chrome). Overall there are just 6 features, with the “is_attributed” column telling us if they downloaded the app or not. There are really only a few new features you can create here. I looked at the total counts for ip, app, os, and channel, and I looked at the mean difference in click_time, thinking that if the clicks are fraudulent, they will happen faster than if a person is doing the clicking. I also included the hour of the day.

But I think this is overboard. As you can learn from reading my post on overfitting, more features can lead to a low-bias, high-variance model, i.e. I think I’ve overfit the training data, so my model does not generalize well to the test data. I’m considering dialing back some of these features to err on the side of simplicity.

2. Here are some good tips for dealing with unbalanced data. I’m currently working on implementing some of these ideas into my latest model.

3. Kaggle can be a time suck.

definitely have spent too much time on this. I’ve learned a lot, yes, but I also spent a week trying to turn my old desktop into a CUDA-powered GPU machine. I got Ubuntu and all my packages installed just fine, but I wanted to go one step further and use the GPU on my old NVIDIA Geforce 760. I tried for hours and hours. No luck. I keep running into problems with the display driver. And this was all in an effort to be able to model faster to get my score higher. That last bit’s the rub. Maybe it’s more of a personal character flaw, but I wasn’t satisfied with just learning cool new tools…I wanted to be in the top 10% at least. I definitely feel that I’ve neglected my other studies (and probably my kids a bit too). I need to learn to budget my Kaggle time better. Maybe I should walk away with my knowledge and not worry as much about my rank…

4. The Kaggle community is pretty great.

I’m super impressed with the help that people offer each other. The discussions are interesting and useful, and so many people post kernels from which you can build your own. They have 16 gigs of RAM at your disposal if you use their servers (i.e. you don’t have to have any languages or packages installed on your computer. You can do it all through your browser!). I look forward to doing more competitions, and to using other platforms like DrivenData where competitions are for non-profits.

That’s all for now. More coming in part 2!