How to work with raw data for data science
I frequently joke that data science is 20% analysis, 30% data cleaning/engineering, and 50% waiting for the transformations on your dataset to load. There is more to data science than just implementing an algorithm: you first need to understand what you’re dealing with and what it could potentially be. For instance, say you are given a dataset by a client. You spend an hour or two looking at the variables it contains, and then you decide to run a regression model on the variables of your choosing. Yes, your output produced an insight, but with so little time spent thinking critically about what is in front of you, is your regression model as meaningful as it could be? You’re given a lot of things at once, and sometimes you need to take the time to take it all in.
I find that I spend a large amount of time just staring at the raw dataset, not doing any modeling, just looking at it. I look at every column and what it contains, and I sit and think about what it is, what it represents, what it could be, and whether it could be connected to other variables. It is important here to think outside the box. Maybe the data that you need is elsewhere, whether in another file you can acquire from the client or available on the internet. You are not limited to using only what is given to you, and narrow thinking leads to narrow insights. Other times you have too much data, and half of it is irrelevant and needs to be cleaned up. As with any practiced skill, you eventually explore and play with so many datasets that you develop the ability to determine more quickly what is and is not needed.
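To make the "look at every column" step concrete, here is a minimal sketch of a first pass I might run before any modeling. It uses only the Python standard library, and everything in it is an assumption for illustration: the function name profile_columns, the sample column names, and the choice to treat empty strings as missing. It is a starting point for staring at the data, not a substitute for it.

```python
import csv
import io
from collections import Counter


def profile_columns(rows):
    """Summarize each column of a list of dict rows.

    For every column, report how many values are missing, how many
    distinct values are present, and the three most common values.
    """
    profile = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [row[col] for row in rows]
        # Assumption: an empty string or None means "missing".
        present = [v for v in values if v not in ("", None)]
        profile[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top": Counter(present).most_common(3),
        }
    return profile


# Hypothetical sample standing in for a client file.
SAMPLE_CSV = """customer_id,region,monthly_spend
1,north,120.50
2,south,
3,north,87.00
4,,45.25
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column, summary in profile_columns(rows).items():
    print(column, summary)
```

A column that is mostly missing, or one with a single distinct value, is an early candidate for the "irrelevant" pile, while a column with many distinct values may be an identifier that links to data held elsewhere.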
When I find something that I think would be meaningful, I write down the idea, do some transformations on the data, and look at the outputs. Often these small ideas end up not being insightful at all, especially the first few transformations. These failed ideas are still important because you learn what works and what does not, which leads to an understanding of what the data can and cannot do. Over time these ideas snowball into a plan, which becomes the fundamental approach to solving the data science problem you’ve been presented.
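The write-it-down, transform, and look loop can be sketched roughly as follows. This is one possible shape, assuming a simple group-and-average transformation and a plain list as the idea log. The names mean_by_group and idea_log, and the sample rows, are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each group_key.

    Rows where either field is missing (empty string or None) are skipped.
    """
    groups = defaultdict(list)
    for row in rows:
        group, value = row.get(group_key), row.get(value_key)
        if group in ("", None) or value in ("", None):
            continue
        groups[group].append(float(value))
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical rows; none of these names come from a real dataset.
rows = [
    {"region": "north", "monthly_spend": "120.50"},
    {"region": "south", "monthly_spend": ""},
    {"region": "north", "monthly_spend": "87.00"},
    {"region": "south", "monthly_spend": "45.25"},
]

# Record the idea next to its result, including the ones that go nowhere.
idea_log = []
result = mean_by_group(rows, "region", "monthly_spend")
idea_log.append({
    "idea": "Does average spend differ by region?",
    "transformation": "mean of monthly_spend grouped by region",
    "result": result,
    "worth_pursuing": None,  # fill in after looking at the output
})
print(idea_log[-1])
```

Keeping the dead ends in the log is the point: it is a record of what the data cannot do, which is as useful as what it can.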
More often than not, data scientists are misconstrued as spending their time tinkering away, making customizations to fancy algorithms, as if we are all crass enough to think ourselves capable of knowing better than decades of research. No, K-fold cross-validation and hyperparameter tuning may sound complicated, but that’s the easy and fun part. The challenge is not simply finding the right questions to answer, but determining the best way to answer them: one that is both meaningful and interpretable, so that everyone can understand what you did and how to leverage it.
This all begins with diving into the raw data you’ve received and using your imagination. Try to envision the possible scenarios each column might be used for, and spend ample time with the customer to ensure you’re thinking about it in the right way. Data that is not at all relevant for model building could be critical context for telling the story in a supporting dashboard. When in doubt, keep it around and come back to it later, especially when you’re stuck!