How to work with raw data for data science
There is more to data science than just implementing algorithms. In fact, we frequently joke that data science is 20% analysis, 30% data cleaning/engineering, and 50% waiting for the transformations on your dataset to load.
When leveraging data science, you first need to understand the data you’re dealing with, what it has the potential to be, and what insights you hope to glean. For instance, if a client gives you a dataset, you could spend an hour or two looking at its variables, then decide to run a regression model on the ones you chose. That output may produce an insight, but with so little time spent assessing the data, is your regression model’s output as meaningful as it could be?
We find that high-performing data scientists frequently spend a significant amount of time just staring at the raw dataset, not doing any modeling. They look at every column and what it contains, and they sit and think about what it represents, what it could be, and whether it can be connected to other variables.
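That first unhurried pass over the columns can be as simple as a few lines of pandas. The sketch below uses a small hypothetical DataFrame standing in for whatever file a client might hand you; the column names are invented for illustration.

```python
import pandas as pd

# Hypothetical client data; any raw CSV loaded with pd.read_csv would do.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "region": ["east", "west", "east", None],
    "monthly_spend": [120.5, 87.0, None, 240.0],
})

# First pass: what does each column actually contain?
print(df.dtypes)                    # column types
print(df.isna().sum())              # missing values per column
print(df.describe(include="all"))   # summary statistics for every column

# Value counts on text columns help spot typos and unexpected categories.
for col in df.select_dtypes(include="object"):
    print(df[col].value_counts(dropna=False))
```

None of this is modeling; it is the sitting-and-thinking step made concrete, and it usually surfaces the questions worth writing down.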
It is crucial to think outside the box: maybe the data you need is elsewhere; it might be in another file you can acquire from the client or available on the internet. You are not limited to only using the data that is given to you; narrow thinking can lead to narrow insights. Other times you may have too much data, half of which is irrelevant, and you must clean it up. As with any practiced skill, you eventually explore and play with datasets so many times that you develop the ability to determine what is and is not needed more quickly.
When Neal’s data scientists find something they think would be meaningful, they write down the idea, do some transformations on the data, and look at the outputs. While these small ideas often end up not being anything insightful, especially during the first few transformations, these failed ideas are still important. They’re important because you learn what works and what does not, leading to an understanding of what the data can and cannot do. Over time, these ideas snowball into a plan, which will become the fundamental approach to solving your data science problem.
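A single write-it-down, transform, look-at-the-output loop can be tiny. The sketch below is one hypothetical idea: maybe support tickets per month of tenure separates churners better than either raw column. All column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data for one idea worth writing down.
df = pd.DataFrame({
    "support_tickets": [0, 5, 2, 8, 1, 7],
    "months_active":   [12, 3, 24, 2, 18, 4],
    "churned":         [0, 1, 0, 1, 0, 1],
})

# The transformation: a ratio feature derived from two raw columns.
df["tickets_per_month"] = df["support_tickets"] / df["months_active"]

# The look: do churners and non-churners differ on this feature?
print(df.groupby("churned")["tickets_per_month"].mean())
```

If the group means barely differ, the idea failed; you still learned something about what the data can and cannot do, and you move on to the next idea.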
More often than not, people misconstrue data science as time spent tinkering with fancy, customized algorithms. In reality, K-fold cross-validation and hyperparameter tuning may sound complicated, but that’s the easy and fun part. The challenge is not finding the right questions but determining the best way to answer them. Adding to the challenge, an ideal answer is meaningful and interpretable so everyone can understand what you did and how to leverage it.
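To see why K-fold cross-validation and hyperparameter tuning count as the easy part, here is a minimal sketch using scikit-learn. The synthetic dataset stands in for data that has already been cleaned and understood; the model and parameter grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a dataset you have already cleaned.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold cross-validation over a small hyperparameter grid.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Once the data is in shape, the tuning itself is a dozen lines; the weeks of work live upstream, in deciding what the columns mean and what question is worth answering.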
Data science begins with diving into the raw data you’ve received and using your imagination. Try to envision the possible scenarios you can use each column for, and spend ample time with the customer to ensure you’re thinking about it correctly. Not all of the data may be relevant for model building, but it could be critical context needed to tell the story in a supporting dashboard. When in doubt, keep data around and return to it later, especially when you’re stuck!
This blog was originally published on 4/20/2018 and since has been updated.