One Year Later…

I received an email the other day from Google. My domain name will be renewed in 30 days.

What?!

I checked the date: October 15th. That’s when I realized it had been a year. I was registered to take the November 2017 LSAT and had to request my refund by around this time last year. It’s been one year since I decided to try and become a data scientist.

I’ve fallen away from this blog over the last few months, but I have good reason: I got a job! I am currently working as a data analyst on the data warehouse team at AdTheorent (a quick note to say that this blog remains my own work, and is unaffiliated with my role there). I started applying to jobs more to test the waters of the local job market here in Jacksonville, FL. I wanted to see what employers were looking for and get some experience interviewing for tech jobs. I got a couple of interviews, but the AdTheorent one stuck out to me. AdTheorent is entirely based in the cloud, uses data science and predictive analytics to place bids on ad exchanges, and it seemed the team leader was looking less for someone with a certain skillset and more for someone who had the ability to think analytically and critically. I was offered the job and jumped at the chance.

I’m not doing any data science for my job (we do have a data science team, but they’re based in New York). However, I am using quite a bit of what I learned at DataCamp. The majority of my life at AdTheorent is lived in SQL and some C# for our data loading and processing jobs. I still use Python and Seaborn for visualization when needed, and I’m starting to delve into PySpark as well. Oh, and Git. Lots of Git. Which brings me to the point of this post:

Five Things I wish I had known

1. Git
I dipped into DataCamp’s Git course, but I didn’t pay very close attention. Now it’s my bread and butter. I always have Git Bash open, but too often I’ve had to ask our resident Git guru for help. I love using it now (seriously–it’s a brilliant piece of software), but I wish I had learned it better beforehand. Maybe I would have a few fewer git reverts under my belt (or worse, I once had to check out a commit, rename it to master, and delete the old master…this is probably not standard practice…).

2. Dev – UAT – Production
I’m not sure how I would have gained much experience here before actually being employed, but the actual day-to-day life of the software developer has been entirely new to me. We use an Agile workflow which I was basically Googling on day one (lots of small, frequent, and therefore flexible deployments instead of fewer, larger ones). Also, I had never written any code which had to pass a test other than, “Hey it worked this time!” Going from little home projects to working on production code was a huge jump, and I have definitely skinned my knees a time or two, which brings me to…

3. The difference between big data, and BIG DATA
We process and house a lot of data. We aren’t a Google or a Facebook, but we deal with tables that are terabytes in size. The Kaggle contest that I participated in dealt with enough data to make my MacBook run for cover, but that was just a couple of gigabytes. For anyone else starting out in all of this, just in case this is as non-obvious to you as it was to me: methods, programs, and procedures that work with a gigabyte of data will not necessarily work with petabytes. I have had to go almost back to the drawing board twice now for a single project because, when it came time to test on full, production-sized tables, my jobs crashed and burned (or at least were taking too long to be workable solutions). Often, when you get to this scale, the name of the game becomes less about how sexy you can make your solution and more about how reliable and quickly debuggable it is.
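One way to picture the difference: a job that loads a whole table into memory stops working once the table outgrows the machine, while a job that reads one row at a time keeps going. Here is a minimal sketch of the second style using only Python’s standard library. The column names `campaign` and `spend` and the tiny sample are made up for illustration; they are not real data or a real schema, and this is not how any production job is written.

```python
import csv
import io
from collections import defaultdict


def total_by_key(lines, key_field, value_field):
    """Sum value_field per key_field, reading one row at a time.

    `lines` can be any iterable of text lines (an open file, a stream),
    so the full table never has to fit in memory at once.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key_field]] += float(row[value_field])
    return dict(totals)


# Tiny in-memory stand-in for a file far too large to load at once.
# The columns "campaign" and "spend" are hypothetical.
sample = io.StringIO("campaign,spend\na,1.5\nb,2.0\na,0.5\n")
print(total_by_key(sample, "campaign", "spend"))
```

Memory use here grows with the number of distinct keys, not the number of rows, which is the property that matters when the row count jumps by a few orders of magnitude.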

4. Environments
I wish I had been familiar with the ways to set up and manage development environments. I didn’t really know what they were or why they were important (if you want to learn this, use Anaconda and this documentation). I mean, why didn’t everyone just use the most recent version of all their packages? Well, I also figured out some of this the hard way (sensing a pattern?). I finally managed to set up a cloned environment on my laptop of our Spark cluster–Python 2.7, Hadoop, and all.
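To make the “why not just use the latest version of everything?” question concrete, here is a small sketch that compares a set of pinned package versions against what is actually installed in the current environment. It assumes Python 3.8 or newer (for `importlib.metadata`), so it would not run as-is on a Python 2.7 setup like the one above. The function name `check_pins` and the package name in the example are made up for illustration.

```python
import sys
from importlib import metadata


def check_pins(pins):
    """Compare pinned versions with what is installed in this environment.

    Returns {package: (pinned, installed)} for mismatches only;
    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Hypothetical pin; a real project would read its pins from its own file.
pins = {"a-package-that-is-not-installed": "1.0.0"}
print("Python", sys.version_info[:2])
print(check_pins(pins))
```

An empty result means the environment matches the pins; anything else is a hint that code which works on one machine may behave differently on another.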

5. Network & Server Speak
My house was the main gathering place for LAN parties in high school. We played Counter-Strike, the Command and Conquer series, StarCraft, Quake 3, and other classics. We would sometimes spend hours troubleshooting, trying to get all the computers on the network (this was Windows 2000 and XP, people…it was rough). I think our record was 24 computers up and running. We had to run extension cords all over the house to keep from flipping the breakers (writing this now, I am reminded of how gracious my parents were…). Nonetheless, these experiences did not prepare me for network, VPN, server, production, UAT, cluster, box, prod talk (can you believe it?). I’m not sure how I would have learned all of this outside of a technology workplace, but I’m googling away trying to catch up.

Now, despite my trial by fire, I feel like none of these things have been impossible hurdles. I’m picking things up and enjoying every minute. Now that I’ve settled into things, I hope to start back up on my data science education. Stay tuned!

Medium Steps

“Baby steps” are how we get from A to B. We do the hard work of learning the details, spending hours on hiccups and chasing rabbits down holes. But I find it difficult to really post about baby steps. You all don’t need to know about every little aspect of my data science education, and I don’t have time to write about them. Reporting falls prey to the law of diminishing returns.

But the pressure to avoid reporting baby steps can overshoot the mark, leading to a desire to only post polished, new material. But if I only posted finished products, I wouldn’t be able to write anything for weeks on end, and you, dear reader, would get no sense of the process. Also, I need to get over my fear of appearing too raw or unpolished. Thus, I’m going to try and be better about posting medium steps. Here are the medium steps I’ve taken lately:

    1. I discovered that GitHub hosts websites, which, of course, display Jupyter Notebooks beautifully and with ease. I spent an entire day trying to figure out a way to display my analysis of the 2017 Kaggle Survey here on WordPress, only to have to resort to an awful scrolling embedded window. So, in retrospect, I should probably have built my entire site on GitHub Pages. Oh well. I’ve decided to marry the two for now. I’ll continue to use this as my blog and primary writing outlet, but I’ll host my portfolio projects on my GitHub site, which I hope to build soon.
    2. I’ve started working on a start-to-finish project concerning Florida high school performance! Start with what you know, right? I’ve downloaded a lot of raw data from the Florida DOE. I’m currently working on cleaning it up, getting it ready to visualize and to run through some predictive modeling. I haven’t decided on my fundamental questions yet, outside of “What factors contribute to failing schools and how much so?” Hopefully I’ll get some time in the coming week to really jump into this.
    3. I’m three weeks into Andrew Ng’s well-known Machine Learning course on Coursera. I was privileged enough to get to chat with Hugo Bowne-Anderson from DataCamp the other week, and he suggested it as a resource. I’m really enjoying it so far. I’m glad that Ng gets into the weeds a bit with the math. Having taken four semesters of calculus in undergrad, I feel confident that I can do the math, but I haven’t had a great opportunity yet to dust off those skills.
    4. I finally finished Nate Silver’s book The Signal and the Noise. I really enjoyed it and learned a lot, but I feel that it was a bit of a slog. He could have applied his general thesis/warning to fewer fields and still have written a great book. But I’m glad to have read it. I am going to try to dive deeper into some more data science specific material next (see both my progress page and my current reading page for more!).
    5. I’ve got several posts coming up explaining some data science basics: one on conditional probability, a followup on Bayesian analysis, and a third on gradient descent (I’m looking forward to building the visuals and mechanics behind this one!). So stay tuned!

Why Data Science?

I realized last night that I have yet to really articulate why I am making this switch from teaching to data science. There’s more to it than just not being fully satisfied with teaching and needing a job to earn a buck. This post is more for me than it is you, but I thought I’d share.

Those of you who know me well know that I spent last fall studying for the LSAT. I was fairly convinced for about a year and a half that I wanted to become a lawyer. I was taken with the idea of using my mind to answer tough questions and to convince others of my argument’s merit, all while fighting for the most vulnerable. I have since abandoned that course, mostly. The salary distribution of lawyers is very bimodal, i.e., there are plenty of lawyers who make a gabazillion dollars, plenty of lawyers who make less than I did as a teacher, and not very many in between. Here is some data gathered by the National Association for Law Placement on the starting salaries of the class of 2014:

[Figure: Class of 2014 Lawyer Salary Distribution (source)]

I interpret this data to mean that there are generally two types of lawyers: do-gooders and corporate. Yes, those are not mutually exclusive, and yes, there are lawyers in between. There is a good chance I could have started somewhere near $50k a year and moved up from there, but I’m not sure that little boost is worth the hassle of law school (I was probably going to have to commute to the University of Florida from Jacksonville, about a 90 minute drive, all while my wife was commuting to Daytona, another 90 minute drive, with both of us somehow being parents during that time; sounds like a nightmare). Also, the preponderance of evidence suggests that lawyers are largely unhappy with their work (a quick Google search returns this, and this). I came across a book called The Destruction of Young Lawyers at my local used bookstore. It lays out a pretty compelling argument that the field of law is broken and that it’s taking young lawyers with it. All of this combined was enough for me to start thinking about other options.

While I taught high school, I always kept a rather elaborate grade book in Excel. AP exams are scored from a 1 to a 5, and I used my grade book to analyze my students’ test scores and come up with an equivalent AP score. Some of the most fun I had teaching was staying up late, playing with Excel, trying to pry as much information as I could out of my data. During my last year teaching physics, I wrote a Google Sheets add-on that generated an individualized report for each student outlining his or her areas of weakness and strength, both in terms of content and in terms of question difficulty. Looking back, I guess I could have taken the hint that I was in the wrong field. I will miss parts of teaching: those few students who really wanted to learn, seeing kids grow and push through the hard content, among others. But in the end, teaching isn’t about solving and wrangling content, it’s about solving and wrangling teenagers, and that’s not how I want to spend my energy.

[Figure: 2016 Data Scientist Salary Distribution (source)]

So here we are: data science (doesn’t the salary distribution look a lot friendlier?). I took a year of Java as an undergraduate, and I’ve always had a penchant for computers and their inner workings, complete with hosting LAN parties all through high school and college (if you don’t know what a LAN party is, don’t Google it; just assume it’s very cool and hip). This past November, when I really decided to abandon my LSAT studies, I first started looking to become a software developer in general. But I quickly zeroed in on data science. Here are the five main reasons why I’m fairly confident that data science is where I belong:

1. I love to solve problems (or, perhaps I’m slightly obsessive about solving problems).

I originally wrote that Google Sheets add-on to save time. I didn’t want to copy and paste my students’ data 170 times in 4 different places. It took me about a week to get the script working correctly (this includes the time needed to brush up on my JavaScript via Codecademy). In other words, I spent way more time writing the script than if I had just put on a movie and trudged through the copy and paste in the first place. But with the script I was so focused and so excited. I had plans to generalize the add-on and distribute it to my fellow teachers. I thought about my script as I was falling asleep, driving to work, and pretty much every other time in between, much to my wife’s dismay.

I love tinkering with it all. If I think something is possible, I can’t stop until I’ve got it working. That sounds like an asset when it comes to data science.

2. I like to find new ways to ask old questions, or find new questions to ask altogether.

I learned the value of thinking clearly about method and theory in graduate school. I have an MA in the History and Philosophy of Science, and I finished all my coursework for a Ph.D. in the History of Religion. The historical goal of religious studies has been to tie all the world’s religions together, searching for underlying themes, asking about what lies beneath them all, and trying to get at religion’s core. I belonged to a more critical camp, asking instead about how the category of “religion” itself gets deployed in various settings to gain and restrict access to social, political, and economic capital (the summaries of these camps are rather crude, but they will do for now). Using similar methods in the history and philosophy of science, I took fresh looks at the is/ought distinction, at the dispute between Schrödinger and Heisenberg over their interpretations of quantum mechanics, and at the role of social construction in our scientific concepts.

I want to do the same thing but with data. What questions could we be asking that we haven’t thought of yet? How could we reframe old questions to get at new answers? What discoveries await us in the ridiculous plethora that is our modern data?

3. I love to write and to teach.

I have always loved the careful communication of a concept and the meeting of minds that occurs when that communication is successful. An underlying current in all my career meandering has been the sharing of ideas: from astrophysicist, to philosopher, to historian, to teacher, at their core they are all roles concerned with the dissemination of knowledge. Distilling difficult concepts and helping students understand them was the most rewarding part of teaching physics. This is a crucial aspect of data science. A data scientist must communicate data clearly through visualizations and explain predictive models to clients or other departments. I think this pedagogical role of the data scientist will fit me very well.

4. I want to contribute to my community and the global one.

The primary impetus behind my brief flirtation with law was the chance to use my mind to help the world. I abandoned my Ph.D. primarily because I saw the academy, and the humanities in particular, as a bit of a black hole: I did not want to spend my life writing books that maybe one hundred other people on earth might read. I did not want to hyper-specialize to play the publish or perish game. It all felt a bit useless (not an education in the humanities in general, mind you, just the post-graduate side of it). Law felt like a place I could research, read, and write for the greater good, but, for the reasons above, it was also not a great fit.

I feel that with data science, I could finally contribute in a way that fits who I am. I’m not one to get out there and volunteer at this or that event on a regular basis (I participate in the occasional creek clean up and the like). But I’m very drawn to the idea of sitting down at my computer and figuring out answers that could help guide the city or a non-profit. What resources are needed where and when? Who is falling through the cracks and how can they be better served? The potential for using data science for the greater good is immense, and I can’t wait to participate.

5. I love science.

All of my childhood I wanted to be an astronomer. The universe has captivated me since I was in grade school, and it continues to hold me in its grasp. I began college as an English major, having fallen in love with literature during my AP courses in high school, but after freshman year I switched to astronomy. I spent my summers trying to catch up on coursework, and did not participate in any undergraduate research until late senior year. This put me a little behind the curve, and I didn’t get into a Ph.D. program. None of that has changed my love of science. It’s the best way of knowing. I may not be discovering the true nature of dark matter, but I’ll be making discoveries nonetheless, and who knows, perhaps some of them will be a little more down to earth.

-Chad

Where to start? How to proceed?

I’ve been toying with all of this for about a month now. I have so many bookmarks for blogs, podcasts, free courses, paid courses, and on and on. I’ve checked out books from the library, bought others on Amazon, and downloaded open source texts. I have to admit that I’m a bit daunted. There are a million places to begin, and I have plenty of work to do before even becoming mildly employable. We’ll see…

Here’s my game plan:

  • Books to read
    • Data Science from Scratch by Joel Grus
    • Python for Data Analysis by Wes McKinney
    • Hands-On Machine Learning by Aurélien Géron
    • The Art of Data Science by Elizabeth Matsui and Roger Peng
    • OpenIntro Statistics by David Diez et al.
  • Paid Online Courses
    • The entire Data Scientist Career Track from DataCamp.
    • Udemy
      • Python Megacourse
      • Python for Data Analysis & Visualization
      • Python for Machine Learning
      • Deep Learning x 4
  • Free Online Courses
    • Udacity
      • Intro to Computer Science
      • Intro to Data Science
    • Stanford
      • Statistical Inference
      • Prob – Stats
      • Statistical Learning (certified)
      • Mining Massive Datasets (certified)
      • Algorithms 1 & 2 (certified)

This should all take me the better part of a year. I hope to get enough dirt under my nails to start some simple projects soon, which will be posted here for all your viewing pleasure.

Introduction

Hello world! My name is Chad Gardner. I am a former AP Physics teacher, with an educational background in astronomy, philosophy, and religion(?!). I am currently a stay-at-home dad, spending what spare time I can muster learning Python for data science. I hope to use this site as a place to dump my brain, share my progress, keep myself accountable, and all those other reasons people start blogs. Eventually, this will morph into a portfolio filled with beautiful insights, graphs, and stories from the world of data. I have another, somewhat neglected blog, where I post the odd poem, philosophical insight, or political rant. Here, I hope to stick to data science, Python, and the questions that can be answered with them. I hope to be employed in this new field within a year or so, ideally without having to go back to school. Enjoy the ride!

-Chad