Kaggle Data Scientist Survey 2017

To the reader,

The best way to read this report is on my GitHub page here. You can also play around with the code yourself by forking (copying) the kernel on Kaggle itself here. I tried various ways to get the report to render correctly here on WordPress, but to no avail. If anyone has some CSS magic that could make the window below bigger (i.e. without the scrollbar) let me know!

This report got longer than I anticipated. There’s quite a bit of rich data from the survey. I hope to do another report sometime soon that looks into more conditional insights.

Enjoy!

-Chad

Fitting the Noise

I’m still not in a place to really produce some original, quality analysis of my own yet, so I thought I’d teach you all about what is probably the most common pitfall in data science: overfitting.

In very broad strokes, machine learning consists of splitting your data set into two chunks: a training set and a test set. Then you take whatever model you are attempting to use, whether it’s linear regression, k-nearest-neighbors, a random forest, or something else, and “train” it on the training set. This involves finding the model parameters that minimize whichever error function you’re using. In other words, when you train your model, you are taking your function and finding a way to best fit it to the training data. Finally, you take that trained model and see how well it performs on the test data. This type of machine learning is called supervised machine learning, because we know the correct answers for our data, and can therefore measure directly how well our model performs on the test set.

Several interesting themes emerge when attempting to fit models to data like this. Allow me to illustrate with a somewhat lengthy example (credit for its overall structure goes to this talk from Dr. Tal Yarkoni at the Machine Learning Meetup in Austin; the code underlying this analysis, however, is my own).

Here’s some data:

data

I created a simple quadratic function and added some normally distributed noise. Our goal is to find a polynomial that best fits this data. First, we randomly pick out 60% of our data on which to train, leaving 40% to test our model. I mainly chose these proportions for illustrative purposes:

train_test
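For readers who want to follow along, here is a minimal sketch of this setup in Python with NumPy. It is not the code behind the figures: the quadratic's coefficients, the noise level, the number of points, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def true_function(x):
    # Hypothetical quadratic; the post does not state the actual coefficients.
    return x**2 - 2.0 * x + 1.0


n_points = 100  # assumed size of the data set
x = np.linspace(0.0, 5.0, n_points)
y = true_function(x) + rng.normal(loc=0.0, scale=1.0, size=n_points)

# Shuffle the indices, then use the first 60% for training and the rest for testing.
shuffled = rng.permutation(n_points)
n_train = n_points * 6 // 10
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]

x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

print(f"{len(x_train)} training points, {len(x_test)} test points")
```

Shuffling the indices once and slicing guarantees that no point lands in both sets, which is the whole point of holding out test data.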

With the naked eye we can detect an overall upward trend, perhaps with a little bump at the front (your brain is pretty good at finding patterns). Let’s try a linear fit (i.e. a first degree polynomial):

1d

On the left in blue, we see the training points along with the best fit line of our model. The line doesn’t really capture the overall trend of our data. We call a model like this underfit. The program is using least squares regression, trying to minimize the sum of the squares of the residuals (a residual is the difference between the actual value of the data and what the model predicts; we use the square so a positive residual is not canceled out by an equal but negative one). On the right, we see the test data, the same linear model, and, plotted in red, the actual function I used to create the data. Notice the MSE value on each graph. MSE stands for mean squared error and is the average of the squares of the residuals. The gist is this: the closer to zero we can get the MSE, the better our model. There are many other, and often better, ways to measure the success of models, but for our purposes here the MSE will suffice. Let’s try a quadratic function (2nd degree polynomial) and see if our MSE decreases.

2d

Pretty good! Notice how close to our target function we get, despite the noise I added (I imagine that normally distributed noise with a small standard deviation averages out in this situation). The MSE for both our training data and our test data is a lot lower than with our linear fit. We should expect this, since the actual function I used is a 2nd degree function. This model actually captures the underlying structure of the data. Often we don’t have the luxury of knowing this, but we’re learning here, so it’s okay.
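The least squares fit and the MSE described above can be sketched in a few lines. This is an illustration under assumptions, not the code behind the figures: the data are a made-up quadratic plus noise, and `fit_least_squares` and `mse` are hypothetical helper names.

```python
import numpy as np


def fit_least_squares(x, y, degree):
    """Polynomial coefficients (highest power first) minimizing the sum of squared residuals."""
    design = np.vander(x, degree + 1)  # columns: x**degree, ..., x, 1
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs


def mse(x, y, coeffs):
    """Mean squared error: the average of the squared residuals."""
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals**2))


rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 100)
y = x**2 - 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)  # assumed data

shuffled = rng.permutation(x.size)
train, test = shuffled[:60], shuffled[60:]

scores = {}
for degree in (1, 2):
    coeffs = fit_least_squares(x[train], y[train], degree)
    scores[degree] = (mse(x[train], y[train], coeffs), mse(x[test], y[test], coeffs))
    print(f"degree {degree}: train MSE = {scores[degree][0]:.3f}, "
          f"test MSE = {scores[degree][1]:.3f}")
```

Because every straight line is also a (degenerate) parabola, the quadratic's training MSE can never be higher than the line's. The test MSE carries no such guarantee, which is what makes it the more honest number.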

Now, let’s make our model more complicated, really trying to fit the training data with a 10th degree polynomial:

10d

Aha! Something interesting has happened. Notice how the MSE on our training data is better than our 2nd degree fit. The extra degrees of freedom allow the model to squiggle up and down, getting close to all the little bumps and dips in our training data. But look what happens when we test our model: the MSE is greater than our 2nd degree fit! Here, at last, is the impetus behind the title of this post: we have no longer fit the underlying structure of the data; we have fit the noise instead.
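Here is a hedged sketch of the same comparison, again on assumed data rather than the data in the figures. It uses NumPy's `Polynomial.fit`, which rescales x internally so that a 10th degree fit stays numerically well behaved.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 100)
y = x**2 - 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)  # assumed data

shuffled = rng.permutation(x.size)
train, test = shuffled[:60], shuffled[60:]


def mse(model, x, y):
    """Average squared residual of a fitted model on the given points."""
    return float(np.mean((y - model(x)) ** 2))


train_mse, test_mse = {}, {}
for degree in (2, 10):
    model = Polynomial.fit(x[train], y[train], deg=degree)
    train_mse[degree] = mse(model, x[train], y[train])
    test_mse[degree] = mse(model, x[test], y[test])
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.3f}, "
          f"test MSE = {test_mse[degree]:.3f}")
```

The training MSE of the 10th degree fit is guaranteed to be no worse than the quadratic's. Whether its test MSE comes out worse on any single split depends on the luck of the draw, which is one reason to repeat the experiment over many splits.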

The terms signal and noise come from electrical engineering: the signal is the goal, the underlying “truth” of the matter, while the noise is all the extra bits of randomness from various sources. For an accessible introduction to all of this, I highly recommend Nate Silver’s The Signal and the Noise: Why So Many Predictions Fail—but Some Don’t. When we test our model against data it has never seen, it fails because the model was built to satisfy the idiosyncrasies of our training set. Thus, when it encounters the idiosyncrasies of our testing set, it misses its target. The model is overfit.

In the real world, we don’t really know the underlying structure of the data. So how can we guard against overfitting? How do we know if we are fitting the noise or if we are beginning to capture the underlying signal? One way to check ourselves is to run multiple “experiments.” Keeping our overall dataset the same, we can randomly choose different training and test sets, creating models and tests for each one. Let’s do this and plot all of the fits as well as the average fit:

bootstrapped

Each of the 100 fits I ran is plotted as a faint, blue line, while the mean fit is plotted with the dark line. We can learn a lot from these graphs. First, notice that the first two graphs don’t change much. Each model, no matter which 60% of the points we sample, turns out about the same. But take a look at the 10th degree fit: it’s much more wild, sometimes up, sometimes down, bumps here one time, not there the next. This is a nice illustration of variance. The first two models have a low variance: the prediction at any given point doesn’t change a whole lot from model to model. The 10th degree fit has a high variance, with the prediction at a single point having a much wider range of movement.
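The repeated experiment can be sketched as follows. The data, seed, and evaluation grid are assumptions for illustration; the idea is simply to refit on a fresh random 60% each time and measure how much the fitted curves wobble around their average.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)
x = np.linspace(0.0, 5.0, 100)
y = x**2 - 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)  # assumed data

grid = np.linspace(0.0, 5.0, 50)  # points at which every fitted curve is evaluated
n_runs = 100
degrees = (1, 2, 10)
curves = {d: np.empty((n_runs, grid.size)) for d in degrees}

for run in range(n_runs):
    train = rng.permutation(x.size)[:60]  # a fresh random 60% each run
    for d in degrees:
        curves[d][run] = Polynomial.fit(x[train], y[train], deg=d)(grid)

spread = {}
for d in degrees:
    mean_fit = curves[d].mean(axis=0)             # the dark "average" line
    spread[d] = float(curves[d].std(axis=0).mean())  # typical wobble around it
    print(f"degree {d:2d}: average spread across fits = {spread[d]:.3f}")
```

The `spread` number is a crude stand-in for variance: it averages, over the grid, the standard deviation of the 100 predictions at each point. One would expect it to be noticeably larger for the 10th degree fits than for the quadratic ones.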

I gathered all of the error data for each of these fits:

mse_hists

boxplots_mse

Here we can see the accuracy and the variance. Notice the tall peaks in the histograms for the first and second degree models. The errors are all clustered pretty close together (i.e. each model isn’t changing too much from iteration to iteration). This is especially noticeable in the boxplot of the second degree fit. The colored area represents the central 50% of the measurements, also known as the interquartile range or IQR. Notice how narrow it is compared to the other two models.
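The numbers behind a boxplot like this are easy to compute. The sketch below, on assumed data, collects the test MSE from 100 random splits for each degree and summarizes it with the median and the IQR (the 75th percentile minus the 25th).

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)
x = np.linspace(0.0, 5.0, 100)
y = x**2 - 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)  # assumed data

test_errors = {d: [] for d in (1, 2, 10)}
for _ in range(100):
    shuffled = rng.permutation(x.size)
    train, test = shuffled[:60], shuffled[60:]
    for d in test_errors:
        model = Polynomial.fit(x[train], y[train], deg=d)
        test_errors[d].append(float(np.mean((y[test] - model(x[test])) ** 2)))

summary = {}
for d, errors in test_errors.items():
    q1, median, q3 = np.percentile(errors, [25, 50, 75])
    summary[d] = {"median": float(median), "iqr": float(q3 - q1)}
    print(f"degree {d:2d}: median test MSE = {median:.3f}, IQR = {q3 - q1:.3f}")
```

The IQR ignores the most extreme runs, so it describes the typical spread of the errors rather than the occasional wild fit.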

Directly related to the problem of under/overfitting your data is the bias-variance trade off. In a nutshell, bias can be defined as error introduced to your prediction due to assumptions made about the structure of the underlying data. In our case, models with a lower degree have a higher bias. The 10th degree polynomial has more degrees of freedom, so it is less constrained with respect to the shapes of trends it can model, whereas the linear and quadratic cases can only model straight lines and parabolas, respectively. However, combining this with what we’ve already said about the variance of each of the models, we see there is a trade off. Models can generally have high bias and low variance, or they can have a low bias and a high variance. Often, the job of a data scientist is to find the sweet spot in the middle. Like this:

bias_variance

I have to say I’m proud of this little graph (it was the culmination of four days of coding in between being a dad). I modeled the same 100 training and test sets shown above with every degree of polynomial from 0 to 10, keeping track of the errors along the way. The result is this graph. The average fit is shown with the dark lines. You can see that the blue training line steadily gets closer to zero as we increase the degree of the polynomial. In other words, the more degrees of freedom we give our model (low bias), the better it can fit the training data (and actually, if we gave it 60 degrees of freedom, it would have a training MSE of exactly zero, since it could produce a function that goes exactly through each of the 60 training points).
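The claim about a training MSE of exactly zero is easy to check on a smaller scale. In this sketch (assumed data, scaled down to 6 points so the arithmetic stays numerically stable), a 5th degree polynomial has 6 coefficients, one for each training point.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(5)
x_train = np.linspace(0.0, 5.0, 6)  # 6 distinct training points
y_train = x_train**2 - 2.0 * x_train + 1.0 + rng.normal(0.0, 1.0, size=6)  # assumed data

model = Polynomial.fit(x_train, y_train, deg=5)  # 6 coefficients for 6 points
train_mse = float(np.mean((y_train - model(x_train)) ** 2))
print(f"training MSE = {train_mse:.2e}")
```

The curve passes through every training point, so the training MSE is zero up to floating-point rounding. It has memorized the noise perfectly, which says nothing about how it will do on new points.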

However, as we run those models on the testing data, they get substantially worse as our degree increases (the red line). This increased error is due to having fit the noise of the training data. Since this noise changes with each set, each model varies wildly. We have decreased the bias of our underlying assumption at the expense of greater variance and unpredictability. What we are after is the lowest error along that red line: surprise, surprise, a degree of two! Data scientists use a similar technique to tune their models and discover the underlying trends in their data.
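A sweep like the one in the graph can be sketched as follows. This is an illustration on assumed data, not the code that produced the figure: for each of 100 random splits, fit every degree from 0 to 10 and record the training and test MSE, then average over the splits.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(6)
x = np.linspace(0.0, 5.0, 100)
y = x**2 - 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.size)  # assumed data

degrees = range(0, 11)
n_runs = 100
train_mse = np.empty((n_runs, len(degrees)))
test_mse = np.empty((n_runs, len(degrees)))

for run in range(n_runs):
    shuffled = rng.permutation(x.size)
    train, test = shuffled[:60], shuffled[60:]
    for j, d in enumerate(degrees):
        model = Polynomial.fit(x[train], y[train], deg=d)
        train_mse[run, j] = np.mean((y[train] - model(x[train])) ** 2)
        test_mse[run, j] = np.mean((y[test] - model(x[test])) ** 2)

mean_train = train_mse.mean(axis=0)  # the dark training line
mean_test = test_mse.mean(axis=0)    # the dark test line
best_degree = int(np.argmin(mean_test))

for d in degrees:
    print(f"degree {d:2d}: mean train MSE = {mean_train[d]:.3f}, "
          f"mean test MSE = {mean_test[d]:.3f}")
print("degree with the lowest mean test MSE:", best_degree)
```

The mean training error can only go down as the degree goes up. The mean test error drops sharply until the model is flexible enough to capture the quadratic and then stops improving. With a different seed or noise level the minimum is not guaranteed to land on exactly 2, but it should not land below it.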

As hinted at above, real world cases are not this cut and dried. The models are much more complicated with many more variables and features involved. Often, the scientist has very little information about the underlying structure, and sometimes the model’s accuracy won’t be known until more data is captured. But overfitting, fitting the noise over the signal, is a problem with which every data scientist must contend on a daily basis. And now you know why!

***For all you fellow Python lovers and data heads (or just for the curious), check out the complete code for this post on my GitHub page! I’d love your feedback.***

Afterword

While thinking about and writing this post, I was struck by its use as an analogy for many of our society’s problems. We have evolved to find the patterns around us: this seemingly causing that, event A always following event B, etc. Our brain’s pattern-finding behavior was a distinct evolutionary advantage and probably primarily responsible for our rise to dominant species on the planet. But it can also lead us astray: we stereotype and let those stereotypes invade our social systems to the deepest levels; we tend to think tomorrow will be like today and have difficulty imagining long time spans, leading to doubts about climate change and the like; we let the few speak for the many, with so many of our squabbles revolving around which few get to speak for which many. I can see how these are all like the problem of overfitting: we use too few data points to generalize to the world around us. Keep an eye on your models of the world, friends! Don’t be afraid to let a little more data into your brain!

:o)

Why Data Science?

I realized last night that I have yet to really articulate why I am making this switch from teaching to data science. There’s more to it than just not being fully satisfied with teaching and needing a job to earn a buck. This post is more for me than it is you, but I thought I’d share.

Those of you who know me well know that I spent last fall studying for the LSAT. I was fairly convinced for about a year and a half that I wanted to become a lawyer. I was taken with the idea of using my mind to answer tough questions and to convince others of my argument’s merit, all while fighting for the most vulnerable. I have since abandoned that course, mostly. The salary distribution of lawyers is very bimodal, i.e. there are plenty of lawyers who make a gabazillion dollars, plenty of lawyers who make less than I did as a teacher, and not very many in between. Here is some data gathered by the National Association for Law Placement on the starting salaries of the class of 2014:

DistributionCurve2014
Class of 2014 Lawyer Salary Distribution (source)

I interpret this data to mean that there are generally two types of lawyers, do-gooders and corporate. Yes, those are not mutually exclusive, and yes there are lawyers in between. There is a good chance I could have started somewhere near $50k a year and moved up from there, but I’m not sure that little boost is worth the hassle of law school (I was probably going to have to commute to the University of Florida from Jacksonville, about a 90 minute drive, all while my wife was commuting to Daytona, another 90 minute drive, with both of us somehow being parents during that time; sounds like a nightmare). Also, the preponderance of evidence suggests that lawyers are largely unhappy with their work (a quick Google search returns this, and this). I came across a book called The Destruction of Young Lawyers at my local used bookstore. It lays out a pretty compelling argument that the field of law is broken and that it’s taking young lawyers with it. All of this combined was enough for me to start thinking about other options.

While I taught high school, I always kept a rather elaborate grade book in Excel. AP exams are scored from a 1 to a 5, and I used my grade book to analyze my students’ test scores and come up with an equivalent AP score. Some of the most fun I had teaching was staying up late, playing with Excel, trying to pry as much information as I could out of my data. During my last year teaching physics, I wrote a Google Sheets add-on that generated an individualized report for each student outlining his or her areas of weakness and strength, both in terms of content and in terms of question difficulty. Looking back, I guess I could have taken the hint that I was in the wrong field. I will miss parts of teaching: those few students who really wanted to learn, seeing kids grow and push through the hard content, among others. But in the end, teaching isn’t about solving and wrangling content, it’s about solving and wrangling teenagers, and that’s not how I want to spend my energy.

datascientistsalary
2016 Data Scientist Salary Distribution (source)

So here we are: data science (doesn’t the salary distribution look a lot friendlier?). I took a year of Java as an undergraduate, and I’ve always had a penchant for computers and their inner workings, complete with hosting LAN parties all through high school and college (if you don’t know what a LAN party is, don’t Google it; just assume it’s very cool and hip). This past November, when I really decided to abandon my LSAT studies, I first started looking to become a software developer in general. But I quickly zeroed in on data science. Here are the five main reasons why I’m fairly confident that data science is where I belong:

1. I love to solve problems (or, perhaps I’m slightly obsessive about solving problems).

I originally wrote that Google Sheets add-on to save time. I didn’t want to copy and paste my students’ data 170 times in 4 different places. It took me about a week to get the script working correctly (this includes the time needed to brush up on my JavaScript via Codecademy). In other words, I spent way more time writing the script than if I had just put on a movie and trudged through the copy and paste in the first place. But with the script I was so focused and so excited. I had plans to generalize the add-on and distribute it to my fellow teachers. I thought about my script as I was falling asleep, driving to work, and pretty much every other time in between, much to my wife’s dismay.

I love tinkering with it all. If I think something is possible, I can’t stop until I’ve got it working. That sounds like an asset when it comes to data science.

2. I like to find new ways to ask old questions, or find new questions to ask altogether.

I learned the value of thinking clearly about method and theory in graduate school. I have an MA in the History and Philosophy of Science, and I finished all my coursework for a Ph.D. in the History of Religion. The historical goal of religious studies has been to tie all the world’s religions together, searching for underlying themes, asking about what lies beneath them all, and trying to get at religion’s core. I belonged to a more critical camp, asking instead about how the category of “religion” itself gets deployed in various settings to gain and restrict access to social, political, and economic capital (the summary of each of these camps is rather crude, but they will do for now). Using similar methods in the history and philosophy of science, I took fresh looks at the is/ought distinction, at the dispute between Schrödinger and Heisenberg over their interpretations of quantum mechanics, and at the role of social construction in our scientific concepts.

I want to do the same thing but with data. What questions could we be asking that we haven’t thought of yet? How could we reframe old questions to get at new answers? What discoveries await us in the ridiculous plethora that is our modern data?

3. I love to write and to teach.

I have always loved the careful communication of a concept and the meeting of minds that occurs when that communication is successful. An underlying current in all my career meandering has been the sharing of ideas: from astrophysicist, to philosopher, to historian, to teacher, at their core they are all roles concerned with the dissemination of knowledge. Distilling difficult concepts and helping students understand them was the most rewarding part of teaching physics. This is a crucial aspect of data science. A data scientist must communicate data clearly through visualizations and explain predictive models to clients or other departments. I think this pedagogical role of the data scientist will fit me very well.

4. I want to contribute to my community and the global one.

The primary impetus behind my brief flirtation with law was the chance to use my mind to help the world. I abandoned my Ph.D. primarily because I saw the academy, and the humanities in particular, as a bit of a black hole: I did not want to spend my life writing books that maybe one hundred other people on earth might read. I did not want to hyper-specialize to play the publish or perish game. It all felt a bit useless (not an education in the humanities in general, mind you, just the post-graduate side of it). Law felt like a place I could research, read, and write for the greater good, but, for the reasons above, it was also not a great fit.

I feel that with data science, I could finally contribute in a way that fits who I am. I’m not one to get out there and volunteer at this or that event on a regular basis (I participate in the occasional creek clean up and the like). But I’m very drawn to the idea of sitting down at my computer and figuring out answers that could help guide the city or a non-profit. What resources are needed where and when? Who is falling through the cracks and how can they be better served? The potential for using data science for the greater good is immense, and I can’t wait to participate.

5. I love science.

All of my childhood I wanted to be an astronomer. The universe has captivated me since I was in grade school, and it continues to hold me in its grasp. I began college as an English major, having fallen in love with literature during my AP courses in high school, but after freshman year I switched to astronomy. I spent my summers trying to catch up on coursework, and did not participate in any undergraduate research until late senior year. This put me a little behind the curve, and I didn’t get into a Ph.D. program. None of that has changed my love of science. It’s the best way of knowing. I may not be discovering the true nature of dark matter, but I’ll be making discoveries nonetheless, and who knows, perhaps some of them will be a little more down to earth.

-Chad