Blog

One Year Later…

I received an email the other day from Google. My domain name will be renewed in 30 days.

What?!

I checked the date: October 15th. That’s when I realized it had been a year. I was registered to take the November 2017 LSAT and had to request my refund by around this time last year. It’s been one year since I decided to try and become a data scientist.

I've fallen away from this blog over the last few months, but I have good reason: I got a job! I am currently working as a data analyst on the data warehouse team at AdTheorent (a quick note to say that this blog remains my own work, and is unaffiliated with my role there). I started applying to jobs mostly to test the waters of the local job market here in Jacksonville, FL. I wanted to see what employers were looking for and get some experience interviewing for tech jobs. I got a couple of interviews, but the AdTheorent one stuck out to me. AdTheorent is entirely based in the cloud, uses data science and predictive analytics to place bids on ad exchanges, and it seemed the team leader was looking less for someone with a certain skill set and more for someone who had the ability to think analytically and critically. I was offered the job and jumped at the chance.

I'm not doing any data science for my job (we do have a data science team, but they're based in New York). However, I am using quite a bit of what I learned at DataCamp. The majority of my life at AdTheorent is lived in SQL, with some C# for our data loading and processing jobs. I still use Python and Seaborn for visualization when needed, and I'm starting to delve into PySpark as well. Oh and Git. Lots of Git. Which brings me to the point of this post:

Five Things I Wish I Had Known

1. Git
I dipped into DataCamp's Git course, but I didn't pay very close attention. Now it's my bread and butter. I always have Git Bash open, but too often I've had to ask our resident Git guru for help. I love using it now (seriously–it's a brilliant piece of software), but I wish I had learned it better beforehand. Maybe I would have a few fewer git reverts under my belt (or worse: at one point I had to check out an old commit, branch it off as the new master, and delete the old master…this is probably not standard practice…).

2. Dev – UAT – Production
I'm not sure how I would have gained much experience here before actually being employed, but the actual day-to-day life of the software developer has been entirely new to me. We use an Agile workflow, which I was basically Googling on day one (lots of small, frequent, and therefore flexible deployments instead of fewer, larger ones). Also, I had never written any code that had to pass a test other than, "Hey, it worked this time!" Going from little home projects to working on production code was a huge jump, and I have definitely skinned my knees a time or two, which brings me to…

3. The difference between big data and BIG DATA
We process and house a lot of data. We aren't a Google or a Facebook, but we deal with tables that are terabytes in size. The Kaggle contest that I participated in dealt with enough data to make my MacBook run for cover, but that was just a couple of gigabytes. For anyone else starting out in all of this, just in case this isn't obvious to you (it wasn't to me): methods, programs, and procedures that work with a gigabyte of data will not necessarily work with petabytes. I have had to go back to almost the drawing board twice now for a single project because, when it came time to test on full, production-sized tables, my jobs crashed and burned (or at least were taking too long to be workable solutions). Often, when you get to this scale, the name of the game becomes less about how sexy you can make your solution and more about how reliable it is and how quickly it can be debugged.

4. Environments
I wish I had been familiar with the ways to set up and manage development environments. I didn't really know what they were or why they were important (if you want to learn this, use Anaconda and this documentation). I mean, why didn't everyone just use the most recent version of all their packages? Well, I also figured out some of this the hard way (sensing a pattern?). I finally managed to set up a clone of our Spark cluster's environment on my laptop–Python 2.7, Hadoop, and all.

5. Network & Server Speak
My house was the main gathering place for LAN parties in high school. We played Counter-Strike, the Command and Conquer series, StarCraft, Quake 3, and other classics. We would sometimes spend hours troubleshooting, trying to get all the computers on the network (this was Windows 2000 and XP, people…it was rough). I think our record was 24 computers up and running. We had to run extension cords all over the house to keep from flipping the breakers (writing this now, I am reminded of how gracious my parents were…). Nonetheless, these experiences did not prepare me for network, VPN, server, production, UAT, cluster, box, prod talk (can you believe it?). I'm not sure how I would have learned all of this outside of a technology workplace, but I'm googling away trying to catch up.

Despite my trial by fire, I feel like none of these things has been an impossible hurdle. I'm picking things up and enjoying every minute. Now that I've settled in, I hope to start back up on my data science education. Stay tuned!

My First Kaggle Contest, Part 2

So it's finally over! I can come out of the Kaggle hole in which I've been hiding for the last month or so. It's been quite a little ride. When we left off, I was struggling to engineer some new features and was looking at ways to deal with the size of the data set, such as using an Easy Ensemble (which failed miserably). Since then, I had several little breakthroughs.

First, I started using a much better validation method. This let me really gauge whether the features I was adding and the model parameters I was choosing were actually improving my model or just causing it to overfit the training data.
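In scikit-learn terms, the shape of that validation loop was something like this (a minimal sketch, not my exact setup; the data and model here are stand-ins):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Stand-in data; in the competition, X held my engineered features
# and y was the is_attributed (download) flag.
X, y = make_classification(n_samples=10_000, weights=[0.95], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, valid_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)  # placeholder model
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(roc_auc_score(y[valid_idx], preds))

# Judge every new feature or parameter by this out-of-fold number,
# not by how well the model fits the data it trained on.
print(np.mean(scores))
```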

Second, I spent a lot of time reading the forums and realized I was missing a key piece of information. TalkingData had accidentally released a larger test set early on, and were now providing it to everyone so nobody had an unfair advantage. I needed to incorporate this larger set when calculating my features on the test set since many of my features involved counting clicks over certain groupings and measuring the time between clicks. Without the full picture, my calculations had been off.

Third, due to the way TalkingData set up the encoded features, it made sense to create my features on a daily basis rather than for the entire set altogether (i.e. the group counts, etc. would reset each day. See the chart below.)

[Figure: One of the charts I made to demonstrate how the day affected model validation.]
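In pandas terms, the daily reset amounted to adding the day to every grouping (a toy sketch, not my competition code; only the column names come from the data set):

```python
import pandas as pd

# A tiny stand-in for the TalkingData click log
df = pd.DataFrame({
    "ip": [1, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-06 10:00", "2017-11-06 11:00",
        "2017-11-07 09:00", "2017-11-07 09:30",
    ]),
})
df["day"] = df["click_time"].dt.day

# Counts over the entire set: ip 1 appears 3 times
df["ip_count"] = df.groupby("ip")["ip"].transform("count")

# Counts that reset daily: ip 1 counts 2 on day 6 and 1 on day 7
df["ip_day_count"] = df.groupby(["ip", "day"])["ip"].transform("count")
print(df)
```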

So not too long after my last post, armed with the above changes, I jumped up to 17th out of about 3000 teams! What!? Huge!

Partly, I think I got a little lucky. I feel like I was thinking through things, but using only my 14 features and getting an AUC score of .9805 seems pretty magical, especially when others were using fancier computers and more features only to be stuck in the high .97s. But I'll take a little magic I guess!

My relatively simple and high performing model caught the attention of another team, and they reached out to me. We had fairly different approaches, and their score was pretty close to mine, so we joined forces, hoping that combining our models would prove effective. The story from here is rather long. In a nutshell, we spent a lot of time writing a separate report that we sent directly to TalkingData explaining why we thought the competition setup was a bit flawed. Submissions were evaluated based on the AUC score of predictions about whether the app was downloaded, but the competition was billed as fraud detection. We discovered machines that were obviously fraudulent clickers but which still downloaded the app.

In the meantime, other teams continued to improve their models and our position on the leaderboard dropped. We adjusted our model to its final form, and we were in a good place to finish in the top 4% or so, a fairly satisfactory place for a first-time contestant. But, alas, 8 hours before the end of a three-month-long competition, a user posted a kernel which provided a score of .9811. So all of the copy-and-paste aficionados leapfrogged us, and we fell down to the top 12%. I ended up pretty disappointed in how the competition ended.

But, I learned a lot along the way. Here are my takeaways:

The daily work involved in doing well in a Kaggle contest is not the same work a data scientist or analyst does. The models created by data scientists are deployed at scale, so efficiency must be carefully weighed against accuracy, as accuracy often comes from complexity, and complexity comes at the cost of speed. For Kaggle, the only factor that matters is accuracy. In a Kaggle contest, spending hours and hours getting your metric to increase by 0.0005 is time very well spent. This is not the case in the real world. Kaggle can be great fun, and a great learning experience, but the competitions should be approached the same way an RPG should be. Do you want to spend 250 hours leveling up from level 75 to 76? Yeah? Great! Go ahead. But if you're more interested in just learning the ropes, maybe try to get into the top 25% and move on.

While the Kaggle community is great about sharing and teaching, this ends up being a double-edged sword. I definitely got a lot of help at the beginning through the discussions and the kernels. I would comb through the forums looking for new ideas if I got stuck on my end. But I'm not sure that the ability to run a full kernel and get the exact submission from someone else is beneficial to the community, and I definitely feel it's a detriment to the competition. The forums should be a place for exploratory data analysis, code snippets, etc., but not for sharing complete solutions to the problem. So much of my and my teammates' hard work was undone at the last second by one person posting a kernel (and all the competitors who took advantage of it). If somebody's Kaggle rank is supposed to mean anything in the real world, copy-and-pasting your way to the top shouldn't be possible.

I recently listened to an episode of DataFramed where Hugo Bowne-Anderson interviews Anthony Goldbloom, the CEO of Kaggle. Goldbloom says that the competitions on Kaggle are only about a fourth of the activity on the site. Kaggle also offers dataset storage and sharing, and the kernels can be used to write, share, and run any code you want. Kaggle was recently acquired by Google, and will soon grant access to GPU (graphics processing unit) and TPU (tensor processing unit) machines. I know of nowhere else the public can get access to this kind of computing power for free. Deep learning with artificial neural networks is one of the areas on which I'd eventually like to focus, so I'm excited to see this feature launched.

Overall, despite how it ended, I’m fairly satisfied with my first Kaggle competition experience. I got to try a nice handful of different algorithms and techniques on a real-world data set and have my results measured against others in the field. If you’re an aspiring analyst or data scientist, but haven’t dived into the world of Kaggle yet, then, to quote the immortal Ms. Frizzle: “Take chances, make mistakes, and get messy!”

PS – DataCamp.com has several tutorials on how to get started with Kaggle competitions. Check them out!

My First Kaggle Contest, part 1

At long last, I decided to enter my first Kaggle contest. For the uninitiated, Kaggle hosts predictive data science competitions. For example, Zillow recently ran a contest on Kaggle to improve its pricing algorithm. Prizes for the competitions can be pretty substantial (the Zillow prize pool was $1.2 million!).

As you can read about in my analysis of their survey, Kaggle is seen as a great resource for learning the tools of data science. I have something of a more mixed review.

About two weeks ago I entered the TalkingData AdTracking Fraud Detection Challenge. According to the competition overview, TalkingData is China’s “largest independent big data service platform,” and “covers over 70% of active mobile devices nationwide.” The contest is to predict whether a person will download an app after clicking on an ad. They provide 200 million clicks over four days with which to train your model.

That last bit is what makes this competition challenging. The CSV file for the training data is over 7.5 gigs (that is A LOT of commas)! Now, this is no Big Data, but it's enough to make my little MacBook with a paltry 16 gigs of RAM cough and sputter a bit. What's more, out of all those 200 million clicks, only 0.2% result in a download. This makes for a very unbalanced data set, which brings its own challenges.
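If you ever face a similar file-to-RAM ratio, two pandas tricks go a long way: downcast the column types (the encoded columns are small integers, so the default 64-bit types are overkill) and read the file in chunks. A sketch, assuming the competition's train.csv:

```python
import pandas as pd

# Small integer types instead of pandas' default int64
dtypes = {
    "ip": "uint32",
    "app": "uint16",
    "device": "uint16",
    "os": "uint16",
    "channel": "uint16",
    "is_attributed": "uint8",
}

chunks = pd.read_csv(
    "train.csv",
    dtype=dtypes,
    parse_dates=["click_time"],
    chunksize=10_000_000,  # ten million rows at a time
)

# Example: tally the downloads without ever holding the full file in memory
downloads = sum(chunk["is_attributed"].sum() for chunk in chunks)
print(f"total downloads: {downloads}")
```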

The competition has been something of a trial by fire. I had to learn the Kaggle platform, deal with the rawest data I've seen yet, and become familiar with packages like XGBoost and LightGBM. Submissions are scored using the area under the curve (AUC) of an ROC plot (more on this in an upcoming post). Roughly, it's a measurement of how accurate your predictions are, taking into account false negative and false positive predictions. A score of 1 is perfect. My current best is sitting at 0.9633. Pretty good, right? Ha, wrong! My rank on the leaderboard is 1,244th out of 2,196. The competition is pretty fierce.
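Computing the metric itself is the easy part; scikit-learn does it in one call (a toy example with made-up numbers):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1]              # did the click lead to a download?
y_score = [0.1, 0.4, 0.35, 0.2, 0.8]  # model's predicted probabilities

# 1.0 is a perfect ranking of positives above negatives; 0.5 is guessing
print(roc_auc_score(y_true, y_score))
```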

I'm not sure what everyone else is doing that I'm not. I've built several models, both from scratch and based on what others have posted, but I still can't get any higher. I have a few ideas left, and I'll let you know how they go. But here are some lessons I've learned so far:

1. More and fancier features are not better.

I spent a lot of time looking at the data, trying to figure out what might be important. Here's a screenshot of what it looks like:

[Figure: a sample of the training data]

One of the reasons I chose this competition is the limited number of features (columns). The first five are encoded, meaning that TalkingData knows what they mean by them, but to us they are just categories (i.e. maybe device 1 is an iPhone, maybe app 3 is Chrome). Overall there are just 6 features, with the “is_attributed” column telling us if they downloaded the app or not. There are really only a few new features you can create here. I looked at the total counts for ip, app, os, and channel, and I looked at the mean difference in click_time, thinking that if the clicks are fraudulent, they will happen faster than if a person is doing the clicking. I also included the hour of the day.
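Here's roughly what those last two features look like in pandas (a sketch on toy data; only the column names come from the competition):

```python
import pandas as pd

df = pd.DataFrame({
    "ip": [1, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-08 10:00:00", "2017-11-08 10:00:02",
        "2017-11-08 10:00:03", "2017-11-08 14:00:00",
    ]),
}).sort_values(["ip", "click_time"])

# Hour of the day
df["hour"] = df["click_time"].dt.hour

# Seconds between consecutive clicks from the same ip:
# bots click much faster than people do
df["click_gap"] = df.groupby("ip")["click_time"].diff().dt.total_seconds()

# The mean gap per ip, broadcast back onto every row
df["mean_click_gap"] = df.groupby("ip")["click_gap"].transform("mean")
print(df)
```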

But I think this is overboard. As you can learn from reading my post on overfitting, more features can lead to a low-bias, high-variance model, i.e. I think I’ve overfit the training data, so my model does not generalize well to the test data. I’m considering dialing back some of these features to err on the side of simplicity.

2. Here are some good tips for dealing with unbalanced data. I'm currently working on implementing some of these ideas in my latest model.
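The simplest of those ideas is to weight the rare class more heavily rather than resample. Here's a sketch using LightGBM's scale_pos_weight parameter (the ratio follows from the 0.2% download rate; whether it actually helps my score remains to be seen):

```python
from lightgbm import LGBMClassifier

# With 0.2% positives, there are roughly 499 negatives per positive.
# scale_pos_weight makes each positive example count that much more
# in the training loss, so the model can't just predict "no download".
model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.1,
    scale_pos_weight=499,
)
# model.fit(X_train, y_train)  # then proceed as usual
```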

3. Kaggle can be a time suck.

I definitely have spent too much time on this. I've learned a lot, yes, but I also spent a week trying to turn my old desktop into a CUDA-powered GPU machine. I got Ubuntu and all my packages installed just fine, but I wanted to go one step further and use the GPU on my old NVIDIA GeForce 760. I tried for hours and hours. No luck. I keep running into problems with the display driver. And this was all in an effort to be able to model faster and push my score higher. That last bit's the rub. Maybe it's more of a personal character flaw, but I wasn't satisfied with just learning cool new tools…I wanted to be in the top 10% at least. I definitely feel that I've neglected my other studies (and probably my kids a bit too). I need to learn to budget my Kaggle time better. Maybe I should walk away with my knowledge and not worry as much about my rank…

4. The Kaggle community is pretty great.

I'm super impressed with the help that people offer each other. The discussions are interesting and useful, and so many people post kernels from which you can build your own. Kaggle's servers put 16 gigs of RAM at your disposal (i.e. you don't have to have any languages or packages installed on your computer; you can do it all through your browser!). I look forward to doing more competitions, and to using other platforms like DrivenData, where the competitions benefit non-profits.

That’s all for now. More coming in part 2!


Medium Steps

“Baby steps” are how we get from A to B. We do the hard work of learning the details, spending hours on hiccups and chasing rabbits down holes. But I find it difficult to really post about baby steps. You all don’t need to know about every little aspect of my data science education, and I don’t have time to write about them. Reporting falls prey to the law of diminishing returns.

But the pressure to avoid reporting baby steps can overshoot the mark, leading to a desire to only post polished, new material. But if I only posted finished products, I wouldn't be able to write anything for weeks on end, and you, dear reader, would get no sense of the process. Also, I need to get over my fear of appearing too raw or unpolished. Thus, I'm going to try to be better about posting medium steps. Here are the medium steps I've taken lately:

    1. I discovered that GitHub hosts websites, which, of course, display Jupyter Notebooks beautifully and with ease. I spent an entire day trying to figure out a way to display my analysis of the 2017 Kaggle Survey here on WordPress, only to have to resort to an awful scrolling embedded window. So, in retrospect, I should probably have built my entire site on GitHub Pages. Oh well. I've decided to marry the two for now. I'll continue to use this as my blog and primary writing outlet, but I'll host my portfolio projects on my GitHub site, which I hope to build soon.
    2. I've started working on a start-to-finish project concerning Florida high school performance! Start with what you know, right? I've downloaded a lot of raw data from the Florida DOE. I'm currently working on cleaning it up, getting it ready to visualize and to run through some predictive modeling. I haven't decided on my fundamental questions yet, outside of "What factors contribute to failing schools and how much so?" Hopefully I'll get some time in the coming week to really jump into this.
    3. I'm three weeks into Andrew Ng's well-known Machine Learning course on Coursera. I was privileged enough to get to chat with Hugo Bowne-Anderson from DataCamp the other week, and he suggested it as a resource. I'm really enjoying it so far. I'm glad that Ng gets into the weeds a bit with the math. Having taken four semesters of calculus in undergrad, I feel confident that I can do the math, but I haven't had a great opportunity yet to dust off those skills.
    4. I finally finished Nate Silver's book The Signal and the Noise. I really enjoyed it and learned a lot, though it was a bit of a slog at times. He could have applied his general thesis/warning to fewer fields and still have written a great book. But I'm glad to have read it. I am going to try to dive deeper into some more data-science-specific material next (see both my progress page and my current reading page for more!).
    5. I’ve got several posts coming up explaining some data science basics: one on conditional probability, a followup on Bayesian analysis, and a third on gradient descent (I’m looking forward to building the visuals and mechanics behind this one!). So stay tuned!

Kaggle Data Scientist Survey 2017

To the reader,

The best way to read this report is on my GitHub page here. You can also play around with the code yourself by forking (copying) the kernel on Kaggle itself here. I tried various ways to get the report to render correctly here on WordPress, but to no avail. If anyone has some CSS magic that could make the window below bigger (i.e. get rid of the scrollbar), let me know!

This report got longer than I anticipated. There’s quite a bit of rich data from the survey. I hope to do another report sometime soon that looks into more conditional insights.

Enjoy!

-Chad

Fitting the Noise

I'm not yet in a place to produce original, quality analysis of my own, so I thought I'd teach you all about what is probably the most common pitfall in data science: overfitting.

In very broad strokes, machine learning consists of splitting your data set into two chunks: a training set and a test set. Then you take whatever model you are attempting to use, whether it's linear regression, k-nearest-neighbors, or a random forest, etc., and "train" it on the training set. This involves tuning the model's parameters to minimize whichever error function you're using. In other words, when you train your model, you are taking your function and finding a way to best fit it to the training data. Finally, you take that trained model and see how well it performs on the test data. This type of machine learning is called supervised machine learning, because we know the correct answers (the labels) for our data, and can therefore measure how well our model performs directly.
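In scikit-learn, that whole workflow is only a few lines (a generic sketch on one of its built-in data sets; the example below uses my own NumPy code instead):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Split the data into a training chunk and a held-out test chunk
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# "Train" the model: fit it to the training set only
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Then see how well it does on data it has never seen
print(model.score(X_test, y_test))
```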

Several interesting themes emerge when attempting to fit models to data like this. Allow me to illustrate with a somewhat lengthy example (credit goes to this talk from Dr. Tal Yarkoni at the Machine Learning Meetup in Austin for the overall structure of this example; however, the code which underlies this analysis is my own).

Here’s some data:

[Figure: the generated data]

I created a simple quadratic function and added some normally distributed noise. Our goal is to find a polynomial that best fits this data. First, we randomly pick out 60% of our data on which to train, leaving the other 40% to test our model. I mainly chose these proportions for illustrative purposes:

[Figure: the 60/40 train/test split]

With the naked eye we can detect an overall upward trend, perhaps with a little bump at the front (your brain is pretty good at finding patterns). Let's try a linear fit (i.e. a first degree polynomial):

[Figure: first-degree (linear) fit on the training data (left) and test data (right)]

On the left in blue, we see the training points along with the best fit line of our model. Our model doesn't really capture the overall trend of the data. We call a model like this underfit. The program is using least squares regression, trying to minimize the sum of the squares of the residuals (a residual is the difference between what the model predicts and the actual value of the data—we use the square so a positive residual is not canceled out by an equal but negative one). On the right, we see the test data, the same linear model, and, plotted in red, the actual function I used to create the data. Notice the MSE value on each graph. MSE stands for the mean squared error and is a calculation of the average of the squares of the residuals. The gist is this: the closer to zero we can get the MSE, the better our model. There are many and better ways to measure the success of models, but for our purposes here the MSE will suffice. Let's try a quadratic function (2nd degree polynomial) and see if our MSE decreases.

[Figure: second-degree (quadratic) fit]

Pretty good! Notice how close to our target function we get, despite the noise I added (I imagine that normally distributed noise with a small standard deviation averages out in this situation). Our MSEs for both the training data and the test data are a lot lower than with the linear fit. We should expect this, since the actual function I used is a 2nd degree function. This model actually captures the underlying structure of the data. Often we don't have the luxury of knowing this, but we're learning here, so it's okay.
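For the curious, here is a stripped-down sketch of the kind of code behind these plots (the complete code is on my GitHub page, linked at the end; the coefficients and noise level here are illustrative, not the exact values I used):

```python
import numpy as np

rng = np.random.default_rng(0)

# The "truth" is a quadratic; the noise is normally distributed
x = np.linspace(-3, 3, 100)
y = 0.5 * x**2 + x + 2 + rng.normal(0, 1.5, size=x.size)

# Random 60/40 train/test split
idx = rng.permutation(x.size)
train, test = idx[:60], idx[60:]

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

for degree in (1, 2):  # the linear and quadratic fits above
    coeffs = np.polyfit(x[train], y[train], degree)  # least squares
    predict = np.poly1d(coeffs)
    print(degree, mse(y[train], predict(x[train])),
          mse(y[test], predict(x[test])))
```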

Now, let's make our model more complicated, really trying to fit the training data with a 10th degree polynomial:

[Figure: 10th-degree polynomial fit]

Aha! Something interesting has happened. Notice how the MSE on our training data is lower than it was for our 2nd degree fit. The 10 degrees of freedom allow the model to squiggle up and down, getting close to all the little bumps and dips in our training data. But look what happens when we test our model: the MSE is greater than our 2nd degree fit's! Here, at last, is the impetus behind the title of this post: we have no longer fit the underlying structure of the data—we have fit the noise instead.

The terms signal and noise come from electrical engineering: the signal is the goal, the underlying “truth” of the matter, while the noise is all the extra bits of randomness from various sources. For an accessible introduction to all of this, I highly recommend Nate Silver’s The Signal and the Noise: Why So Many Predictions Fail—but Some Don’t. When we test our model against data it has never seen, it fails because the model was built to satisfy the idiosyncrasies of our training set. Thus, when it encounters the idiosyncrasies of our testing set, it misses its target. The model is overfit.

In the real world, we don’t really know the underlying structure of the data. So how can we guard against overfitting? How do we know if we are fitting the noise or if we are beginning to capture the underlying signal? One way to check ourselves is to run multiple “experiments.” Keeping our overall dataset the same, we can randomly choose different training and test sets, creating models and tests for each one. Let’s do this and plot all of the fits as well as the average fit:
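Concretely, the "experiments" are just the same fits repeated over many random splits (continuing the sketch above):

```python
# Repeat each fit over 100 random train/test splits, keeping
# every model's test error along the way. (np.polyfit may warn
# about conditioning at degree 10; that's part of the story.)
results = {1: [], 2: [], 10: []}
for trial in range(100):
    idx = rng.permutation(x.size)
    train, test = idx[:60], idx[60:]
    for degree in results:
        coeffs = np.polyfit(x[train], y[train], degree)
        results[degree].append(mse(y[test], np.poly1d(coeffs)(x[test])))

# A high-variance model shows up as a wide spread of test errors
for degree, errors in results.items():
    print(degree, np.mean(errors), np.std(errors))
```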

[Figure: 100 resampled fits (faint lines) and the mean fit (dark line) for each model]

Each of the 100 fits I ran is plotted as a faint, blue line, while the mean fit is plotted with the dark line. We can learn a lot from these graphs. First, notice that the first two graphs don’t change much. Each model, no matter which 60% of the points we sample, turns out about the same. But take a look at the 10 degree fit: it’s much more wild, sometimes up, others down, bumps here one time, not there the next. This is a nice illustration of variance. The first two models have a low variance: the same point doesn’t change a whole lot from model to model. The 10 degree fit has a high variance, with a single point having a much wider range of movement.

I gathered all of the error data for each of these fits:

[Figures: histograms and boxplots of the MSEs for each model]

Here we can see the accuracy and the variance. Notice the tall peaks in the histograms for the first- and second-degree models. The errors are all clustered pretty close together (i.e. each model isn't changing too much from iteration to iteration). This is especially noticeable in the boxplot of the second-degree fit. The colored areas represent the central 50% of the measurements, also known as the inter-quartile range or IQR. Notice how narrow it is compared to the other two models.

Directly related to the problem of under/overfitting your data is the bias-variance trade-off. In a nutshell, bias can be defined as error introduced to your prediction due to assumptions made about the structure of the underlying data. In our case, models with a lower degree have a higher bias. The 10th degree polynomial has more degrees of freedom, so it is less constrained with respect to the shapes of trends it can model, whereas the linear and quadratic cases can only model straight lines and parabolas, respectively. However, combining this with what we've already said about the variance of each of our models, we see there is a trade-off. Models can generally have high bias and low variance, or they can have low bias and high variance. Often, the job of a data scientist is to find the sweet spot in the middle. Like this:

[Figure: training and test MSE versus polynomial degree]

I have to say I'm proud of this little graph (it was the culmination of four days of coding in between being a dad). I modeled the same 100 training and test sets shown above with every degree of polynomial from 0 to 10, keeping track of the errors along the way. The result is this graph. The average fit is shown with the dark lines. You can see that the blue training line steadily gets closer to zero as we increase the degree of the polynomial. In other words, the more degrees of freedom we give our model (low bias), the better it can fit the training data (and actually, a model with 60 degrees of freedom, i.e. a 59th-degree polynomial, could produce a function that passes exactly through each of the 60 training points, giving a training MSE of exactly zero).

However, as we run those models on the testing data, they get substantially worse as our degree increases (the red line). This increased error is due to having fit the noise of the training data. Since this noise changes with each set, each model varies wildly. We have decreased the bias of our underlying assumption at the expense of greater variance and unpredictability. What we are after is the lowest error along that red line: surprise, surprise, a degree of two! Data scientists use a similar technique to tune their models and discover the underlying trends in their data.

As hinted at above, real world cases are not this cut and dried. The models are much more complicated, with many more variables and features involved. Often, the scientist has very little information about the underlying structure, and sometimes the model's accuracy won't be known until more data is captured. But overfitting—fitting the noise over the signal—is a problem with which every data scientist must contend on a daily basis. And now you know why!

***For all you fellow Python lovers and data heads (or just the curious), check out the complete code for this post on my GitHub page! I'd love your feedback.***


Afterword

While thinking about and writing this post, I was struck by its use as an analogy for many of our society's problems. We have evolved to find the patterns around us: this seemingly causing that, event A always following event B, etc. Our brain's pattern-finding behavior was a distinct evolutionary advantage and probably primarily responsible for our rise to become the dominant species on the planet. But it can also lead us astray: we stereotype and let those stereotypes invade our social systems to the deepest levels; we tend to think tomorrow will be like today and have difficulty imagining long time spans, leading to doubts about climate change and the like; we let the few speak for the many, with so many of our squabbles revolving around which few get to speak for which many. I can see how these are all like the problem of overfitting: we use too few data points to generalize to the world around us. Keep an eye on your models of the world, friends! Don't be afraid to let a little more data into your brain!

:o)

Why Data Science?

I realized last night that I have yet to really articulate why I am making this switch from teaching to data science. There’s more to it than just not being fully satisfied with teaching and needing a job to earn a buck. This post is more for me than it is you, but I thought I’d share.

Those of you who know me well know that I spent last fall studying for the LSAT. I was fairly convinced for about a year and a half that I wanted to become a lawyer. I was taken with the idea of using my mind to answer tough questions and to convince others of my argument's merit, all while fighting for the most vulnerable. I have since abandoned that course, mostly. The salary distribution of lawyers is very bimodal: i.e. there are plenty of lawyers who make a gabazillion dollars, plenty of lawyers who make less than I did as a teacher, and not very many in between. Here is some data gathered by the National Association for Law Placement on the starting salaries of the class of 2014:

[Figure: Class of 2014 lawyer salary distribution (source)]

I interpret this data to mean that there are generally two types of lawyers: do-gooders and corporate. Yes, those are not mutually exclusive, and yes, there are lawyers in between. There is a good chance I could have started somewhere near $50k a year and moved up from there, but I'm not sure that little boost is worth the hassle of law school (I was probably going to have to commute to the University of Florida from Jacksonville, about a 90 minute drive, all while my wife was commuting to Daytona, another 90 minute drive, with both of us somehow being parents during that time—sounds like a nightmare). Also, the preponderance of evidence suggests that lawyers are largely unhappy with their work (a quick Google search returns this, and this). I came across a book called The Destruction of Young Lawyers at my local used bookstore. It lays out a pretty compelling argument that the field of law is broken and that it's taking young lawyers with it. All of this combined was enough for me to start thinking about other options.

While I taught high school, I always kept a rather elaborate grade book in Excel. AP exams are scored from a 1 to a 5, and I used my grade book to analyze my students' test scores and come up with an equivalent AP score. Some of the most fun I had teaching was staying up late, playing with Excel, trying to pry as much information as I could out of my data. During my last year teaching physics, I wrote a Google Sheets add-on that generated an individualized report for each student outlining his or her areas of weakness and strength, both in terms of content and in terms of question difficulty. Looking back, I guess I could have taken the hint that I was in the wrong field. I will miss parts of teaching: those few students who really wanted to learn, seeing kids grow and push through the hard content, among others. But in the end, teaching isn't about solving and wrangling content, it's about solving and wrangling teenagers, and that's not how I want to spend my energy.

[Figure: 2016 data scientist salary distribution (source)]


So here we are: data science (doesn't the salary distribution look a lot friendlier?). I took a year of Java as an undergraduate, and I've always had a penchant for computers and their inner workings, complete with hosting LAN parties all through high school and college (if you don't know what a LAN party is, don't Google it; just assume it's very cool and hip). This past November, when I really decided to abandon my LSAT studies, I first started looking to become a software developer in general. But I quickly zeroed in on data science. Here are the five main reasons why I'm fairly confident that data science is where I belong:

1. I love to solve problems (or, perhaps I’m slightly obsessive about solving problems).

I originally wrote that Google Sheets add-on to save time. I didn't want to copy and paste my students' data 170 times in 4 different places. It took me about a week to get the script working correctly (this includes the time needed to brush up on my JavaScript via Codecademy). In other words, I spent way more time writing the script than if I had just put on a movie and trudged through the copying and pasting in the first place. But with the script I was so focused and so excited. I had plans to generalize the add-on and distribute it to my fellow teachers. I thought about my script as I was falling asleep, driving to work, and pretty much every other time in between, much to my wife's dismay.

I love tinkering with it all. If I think something is possible, I can’t stop until I’ve got it working. That sounds like an asset when it comes to data science.

2. I like to find new ways to ask old questions, or find new questions to ask altogether.

I learned the value of thinking clearly about method and theory in graduate school. I have an MA in the History and Philosophy of Science, and I finished all my coursework for a Ph.D. in the History of Religion. The historical goal of religious studies has been to tie all the world's religions together, searching for underlying themes, asking about what lies beneath them all, and trying to get at religion's core. I belonged to a more critical camp, asking instead about how the category of "religion" itself gets deployed in various settings to gain and restrict access to social, political, and economic capital (these summaries of both camps are rather crude, but they will do for now). Using similar methods in the history and philosophy of science, I took fresh looks at the is/ought distinction, at the dispute between Schrödinger, Heisenberg and their interpretations of quantum mechanics, and at the role of social construction in our scientific concepts.

I want to do the same thing but with data. What questions could we be asking that we haven’t thought of yet? How could we reframe old questions to get at new answers? What discoveries await us in the ridiculous plethora that is our modern data?

3. I love to write and to teach.

I have always loved the careful communication of a concept and the meeting of minds that occurs when that communication is successful. An underlying current in all my career meandering has been the sharing of ideas: from astrophysicist, to philosopher, to historian, to teacher, at their core they are all roles concerned with the dissemination of knowledge. Distilling difficult concepts and helping students understand them was the most rewarding part of teaching physics. This is a crucial aspect of data science. A data scientist must communicate data clearly through visualizations and explain predictive models to clients or other departments. I think this pedagogical role of the data scientist will fit me very well.

4. I want to contribute to my community and the global one.

The primary impetus behind my brief flirtation with law was the chance to use my mind to help the world. I abandoned my Ph.D. primarily because I saw the academy, and the humanities in particular, as a bit of a black hole—I did not want to spend my life writing books that maybe one hundred other people on earth might read. I did not want to hyper-specialize to play the publish-or-perish game. It all felt a bit useless (not an education in the humanities in general, mind you, just the post-graduate side of it). Law felt like a place I could research, read, and write for the greater good—but, for the reasons above, also not a great fit.

I feel that with data science, I could finally contribute in a way that fits who I am. I'm not one to get out there and volunteer at this or that event on a regular basis (I participate in the occasional creek cleanup and the like). But I'm very drawn to the idea of sitting down at my computer and figuring out answers that could help guide the city or a non-profit. What resources are needed where and when? Who is falling through the cracks, and how can they be better served? The potential for using data science for the greater good is immense, and I can't wait to participate.

5. I love science.

All of my childhood I wanted to be an astronomer. The universe has captivated me since grade school, and it continues to hold me in its grasp. I began college as an English major, having fallen in love with literature during my AP courses in high school, but after freshman year I switched to astronomy. I spent my summers trying to catch up on coursework, and did not participate in any undergraduate research until late senior year. This put me a little behind the curve, and I didn't get into a Ph.D. program. None of that has changed my love of science. It's the best way of knowing. I may not be discovering the true nature of dark matter, but I'll be making discoveries nonetheless, and who knows, perhaps some of them will be a little more down to earth.

-Chad

My Random Walk

I have had a difficult time figuring out when to stop learning and to start trying to work on something of my own. I feel a pretty strong urge to make something original. I downloaded years of data from the Florida Department of Education, but that whole thing seemed so daunting. Also, I don’t feel like I know enough yet to really discover anything. I can make some graphs and do some exploratory data analysis, but I haven’t dived very deep into the statistical thinking or machine learning required to really find insights.

So here is a little tidbit to go along with my own path which has brought me here: a random walk.

[Figure: 1 walk, 1000 steps]

From a starting point, in this case the origin, randomly take a step either forward, backward, to the right, or to the left. The probability of each choice is the same. And repeat. The image above shows a random walk with one thousand steps. I generated it with some Python code using a visualization package called Bokeh. You can find the code on my GitHub page.
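If you don't want to dig through the repo, the core of the walk itself is only a few lines of NumPy (a sketch; the Bokeh plotting is separate):

```python
import numpy as np

rng = np.random.default_rng()

def random_walk(n_steps):
    """Return the (x, y) path of a 2-D random walk starting at the origin."""
    # The four possible steps, each with probability 1/4
    steps = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])
    choices = steps[rng.integers(0, 4, size=n_steps)]
    return np.vstack([[0, 0], choices]).cumsum(axis=0)

path = random_walk(1000)
print(path[-1])  # final position after 1000 steps
```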

Random walks are a common feature in nature. A photon released by a fusion reaction in the center of the sun can take over a million years to reach the sun's surface as it is absorbed and re-emitted in a random direction by billions of hydrogen atoms. Once freed from its walk, the photon reaches earth only eight and a half minutes later. Brownian motion, described by Einstein in one of his "miracle year" papers of 1905, can also be modeled as a random walk.

I decided to explore random walks with Bokeh because the package can create interactive visualizations. I got it all working on my home machine, but I haven’t figured out how to host them on this site. So for now, instead of creating your own walks, you must find satisfaction in this gallery of randomly colored random walks. I made the lines semi-transparent so you can see the overlap. Once I get the interactive version working online, I’ll update this post (it lets you slide through the walk and watch the whole thing unfold. Very cool!).

[Figure: 100 walks, 100 steps each]
[Figure: 100 walks, 1000 steps each]
[Figure: 100 walks, 2000 steps each]
[Figure: detail of 100 walks, 2000 steps]
[Figure: 100 walks, 10,000 steps each]
[Figure: 1,000 walks, 2,000 steps each]
[Figure: 1,000 walks, 10,000 steps each]

I did some simple analysis of the distribution of the final distances traveled by the walks. Since each walk begins at the origin, the straight-line distance it has traveled is just found with your favorite equation from algebra: a² + b² = c². I ran 10,000 walks with 100 steps each. Here are the results:

[Figure: normalized histogram of the distances to the origin]
[Figure: cumulative distribution function]

In blue are the empirical results, and in green is a normal distribution with identical mean and standard deviation. At first, I thought the heavy counts to the left of the mean might be noise, but after running the simulations several times, the distribution continued to be skewed the same way. The distances are not normally distributed! You have a somewhat better chance of ending up closer to the origin than the average distance traveled; in other words, your chances of being on the low side of the mean are better than being on the high side.

Perhaps that makes sense: to get far from the origin, your steps overall need to be in the same direction (or at least concentrated in two adjacent directions, e.g. if you keep going up and to the right). This is less likely than having a roughly equal number of steps in each direction. Maybe I can do some further analysis, count the steps each walk takes, and see where that leads me.
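That further analysis is easy enough to sketch (this builds on the walk function above). For what it's worth, the right-skewed shape in the histogram looks like a Rayleigh distribution, which, if I'm not mistaken, is what the final distance of a long 2-D random walk approaches:

```python
# Final straight-line distances for 10,000 walks of 100 steps each
finals = np.array([random_walk(100)[-1] for _ in range(10_000)])
dists = np.hypot(finals[:, 0], finals[:, 1])  # sqrt(a^2 + b^2)

# If the distribution were symmetric, the mean and median would agree.
# The median sits below the mean, so more than half of the walks end
# closer to the origin than the average distance.
print("mean:  ", dists.mean())
print("median:", np.median(dists))
print("fraction below the mean:", (dists < dists.mean()).mean())
```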

I’m sure that this is all well documented in the analytic literature, but it has been fun running through it on my own. Currently, I’m working through some of the stats courses on DataCamp, after which comes the machine learning stuff. Hopefully I’ll have more tools at my disposal soon to actually dive into some real data.

Until then!

Where to start? How to proceed?

I’ve been toying with all of this for about a month now. I have so many bookmarks for blogs, podcasts, free courses, paid courses, and on and on. I’ve checked out books from the library, bought others on Amazon, and downloaded open source texts. I have to admit that I’m a bit daunted. There are a million places to begin, and I have plenty of work to do before even becoming mildly employable. We’ll see…

Here’s my game plan:

  • Books to read
    • Data Science from Scratch by Joel Grus
    • Python for Data Analysis by Wes McKinney
    • Hands-On Machine Learning by Aurélien Géron
    • The Art of Data Science by Elizabeth Matsui and Roger Peng
    • OpenIntro to Statistics by David Diez et al.
  • Paid Online Courses
    • The entire Data Scientist Career Track from DataCamp.
    • Udemy
      • Python Megacourse
      • Python for Data Analysis & Visualization
      • Python for Machine Learning
      • Deep Learning x 4
  • Free Online Courses
    • Udacity
      • Intro to Computer Science
      • Intro to Data Science
    • Stanford
      • Statistical Inference
      • Prob – Stats
      • Statistical Learning (certified)
      • Mining Massive Datasets (certified)
      • Algorithms 1 & 2 (certified)

This should all take me the better part of a year. I hope to get enough dirt under my nails to start some simple projects soon, which will be posted here for all your viewing pleasure.

Introduction

Hello world! My name is Chad Gardner. I am a former AP Physics teacher, with an educational background in astronomy, philosophy, and religion(?!). I am currently a stay-at-home dad, spending what spare time I can muster learning Python for data science. I hope to use this site as a place to dump my brain, share my progress, keep myself accountable, and all those other reasons people start blogs. Eventually, this will morph into a portfolio filled with beautiful insights, graphs, and stories from the world of data. I have another, somewhat neglected blog, where I post the odd poem, philosophical insight, or political rant. Here, I hope to stick to data science, Python, and the questions that can be answered with them. I hope to be employed in this new field within a year or so, ideally without having to go back to school. Enjoy the ride!

-Chad