What is data science?
A data scientist is the adult version of the kid who can’t stop asking “Why?”. They’re the kind of person who goes into an ice cream shop and gets five different scoops on their cone because they really need to know what each one tastes like. Similarly, even the term data scientist is a catchall title that encompasses many different flavors of work. I think that’s the major differentiator between a data scientist and a statistician or an analyst or an engineer; the data scientist is doing a little of each of those tasks. Of course, what someone whose job title is data scientist will do at a given company depends on the company and the person, and may look more like one of those other titles, rather than a mixture of all three. To me, a data scientist is someone who does the following tasks:
1. Data analysis
The order of these tasks is intentional, and it roughly reflects the life cycle of a data science project. To be fair, we should add “0. Data cleaning” to that list, as it can be one of the most time-consuming tasks of a data scientist. (It’s also an incredible litmus test for data scientists. Someone who can’t parse a messy CSV isn’t going to cut it as a data scientist.) Let’s look at these tasks in more detail.
There’s lots of data out there, but much of it is not in an easy-to-use format. This part of a data scientist’s job involves making sure that data is nicely formatted and conforms to some set of rules.
As an example, consider a CSV where each row describes the finances of a fast-food franchise. There might be columns for a city, state, and the number of burgers sold in the last year. But, rather than having all this data in one document (that would be too easy, right?), it probably comes spread across many different files, which need to be joined together. Doing this is in some sense the easy part. The hard part is making sure the resulting combination makes sense. Typically there will be some formatting inconsistencies, and floating somewhere in the data set is a row where the number of burgers sold is ‘Idaho’ and the state is 25,000. Data cleaning is all about finding these hiccups, fixing them, and making sure they’ll be fixed automatically in the future. As an added bonus, all the downstream work from this point can only be as good as the data you’ve assembled.
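To make the burger example concrete, here is a minimal sketch in Python using only the standard library. Everything in it is made up for illustration: the two inline "files", the column names (`franchise_id`, `city`, `state`, `burgers_sold`), and the helper functions are assumptions, not anything from a real data set. It joins the two tables and sets aside any row where the number of burgers sold isn't actually a number.

```python
import csv
import io

# Hypothetical inputs: two small CSV "files" describing the same franchises.
LOCATIONS_CSV = """franchise_id,city,state
1,Boise,Idaho
2,Austin,Texas
3,Reno,Nevada
"""

SALES_CSV = """franchise_id,burgers_sold
1,"25,000"
2,41000
3,Nevada
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_count(value):
    """Return an int for values like '25,000', or None if it isn't a count."""
    cleaned = value.replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def join_and_check(locations, sales):
    """Join on franchise_id and split the result into clean and suspect rows."""
    sales_by_id = {row["franchise_id"]: row for row in sales}
    clean, suspect = [], []
    for loc in locations:
        sale = sales_by_id.get(loc["franchise_id"])
        count = parse_count(sale["burgers_sold"]) if sale else None
        merged = {**loc, "burgers_sold": count}
        # Rows with a missing or non-numeric count go to the suspect pile.
        (clean if count is not None else suspect).append(merged)
    return clean, suspect


clean, suspect = join_and_check(read_rows(LOCATIONS_CSV), read_rows(SALES_CSV))
print(len(clean), "clean rows;", len(suspect), "suspect rows")
```

Keeping the suspect rows instead of silently dropping them is a deliberate choice here: someone still has to look at them and decide how the fix should be automated.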
This is the sort of work most people think of using Excel for, but dramatically juiced up. A data scientist will typically work with data sets that are too large to open in a typical spreadsheet program, and may even be too large to work with on a single computer.
Data analysis is the realm of visualization (tables are for robots). This is where you make lots of plots of the data in an attempt to understand it (plotting is also another place where spreadsheets start lagging behind). Through this process, a data scientist is trying to craft a story, explaining the data in a way that will be easy to communicate and easy to act on. Sometimes this can be something simple, like figuring out what property or event signals when new users convert into long-term users, or something more complex, like figuring out when someone is slowly scamming you for lots of money à la Office Space. For example, data scientists at Facebook figured out that having at least ten friends helps guarantee that a user will stay active on the site, which is why there is so much machinery on the site devoted to finding new friends.
Whether a data scientist thinks they’re doing modeling or statistics depends on their background. People who studied statistics consider themselves to be statisticians; everyone else is probably going to claim to be more of a modeler (or an expert in machine learning if they’re feeling fancy).
My own background is in the purest of pure mathematics, so I think of statistics as a funny way of talking about probability and regression as a bunch of linear algebra. This makes me a modeler. In either case, this is where deep theoretical knowledge creeps into data science. Once you’ve got clean data and an understanding of that data, you generally want to make predictions either from that data or similar-looking data that you’ll get in the future.
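To show what I mean by regression being a bunch of linear algebra, here is a small sketch in plain Python. It fits a straight line by solving the 2x2 normal equations directly. The function name `fit_line` and the toy numbers are made up for illustration, and a real project would more likely lean on a library than on hand-rolled arithmetic.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x via the normal equations."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations: [[n, sx], [sx, sxx]] @ [intercept, slope] = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # Solve the 2x2 system with Cramer's rule.
    intercept = (sy * sxx - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


# Toy data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)

# Predict for an x value the model has not seen.
prediction = intercept + slope * 5
print(intercept, slope, prediction)
```

With data this tidy the fit recovers the line exactly. Real data is noisier, which is where the statistics (or the modeling, depending on who you ask) comes in.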
One of the problems we’re tackling at Alexa is predicting how many visitors a website gets. To do this we’ve built a model based on what we know about traffic to individual websites as well as how people interact with the web. There’s a lot going on there, and it’s really the subject of a separate blog post. However, I’ll just add that this step is often very complicated. We live in a golden age of machine learning, where very powerful algorithms are available as black boxes that produce good results. However, it’s easy to find yourself sitting on a problem that no model is going to work well on right out of the box. So a data scientist spends a lot of time evaluating and tweaking models, as well as going back to the data to bring out new features that can help make better models.