Working as a data scientist at a consulting firm means I have to understand new data quickly. Tight timelines and novel problems call for efficient, effective tools. I provide clients with solutions using a variety of machine learning techniques, so my main toolkit is Python with a mix of pandas, NumPy, and scikit-learn. These tools are popular with data scientists because they offer diverse capabilities and make it possible to develop and implement solutions quickly on a wide range of platforms.
As great as Python is, one pain point I consistently experience is getting an overview and visualization of large amounts of data during the initial stages of a project. The most intuitive way to understand your data is through visualization, which makes Python’s Matplotlib module a powerful choice. However, it requires a lot of manual specification to get exactly what is needed, and it can easily become a time sink. This is where Microsoft’s new Azure ML Workbench can do a lot of the heavy lifting when it comes to modeling, data transformation, and exploratory data analysis. AML Workbench provides premade code for importing data from local or cloud storage, whether it’s stored as a flat file (e.g., a CSV file) or already structured in a SQL database. Once the data is in the tool, basic transformations are a couple of clicks away, and it instantly provides histograms and data summaries for each column.
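To give a sense of the manual work involved, here is a minimal sketch of a hand-written per-column overview using pandas and NumPy. It is not AML Workbench code; the `column_overview` helper, the `bins` parameter, and the sample data are hypothetical names and values chosen purely for illustration.

```python
import numpy as np
import pandas as pd


def column_overview(df: pd.DataFrame, bins: int = 5) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, distinct count,
    and (for numeric columns only) histogram bin counts."""
    rows = []
    for name in df.columns:
        col = df[name]
        row = {
            "column": name,
            "dtype": str(col.dtype),
            "missing": int(col.isna().sum()),
            "distinct": int(col.nunique(dropna=True)),
            "histogram": None,
        }
        if pd.api.types.is_numeric_dtype(col):
            # Bin counts only; drawing the chart would still need Matplotlib.
            counts, _edges = np.histogram(col.dropna().to_numpy(), bins=bins)
            row["histogram"] = counts.tolist()
        rows.append(row)
    return pd.DataFrame(rows).set_index("column")


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame(
    {
        "age": [23, 35, 35, None, 61],
        "city": ["Oslo", "Lima", None, "Lima", "Pune"],
    }
)
overview = column_overview(sample, bins=2)
print(overview)
```

Even this small helper only produces the numbers behind a histogram. Turning them into readable charts for every column is where the manual Matplotlib specification starts to add up.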
Once the data is in your project, you can immediately visualize it and filter out anything that’ll choke your initial pipeline. Getting a quick, broad view of the data I’m working with is priceless, and AML Workbench makes it very simple, whereas a custom solution in Python could take hours. Once your data is prepared, the tool saves the transformations and creates a visual pipeline you can reference at any time. A Python script is then generated with the data imports and transformations already done, so you can start modeling the data with any Python package you like.
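For comparison, the kind of filtering described above might look like this when written by hand in pandas. This is only a sketch under stated assumptions, not the script AML Workbench generates: the column names, the inline CSV, and the `prepare` function are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical raw extract with the usual problems: a non-numeric amount,
# missing values, and a duplicated row.
RAW_CSV = """order_id,amount,region
1001,25.50,north
1002,not available,south
1003,,east
1004,13.00,
1004,13.00,
1005,40.25,west
"""


def prepare(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply basic cleaning so later modeling steps do not fail on bad rows."""
    prepared = raw.copy()
    # Values that cannot be parsed as numbers become NaN instead of raising.
    prepared["amount"] = pd.to_numeric(prepared["amount"], errors="coerce")
    # Remove exact duplicate rows, then rows missing a required field.
    prepared = prepared.drop_duplicates()
    prepared = prepared.dropna(subset=["amount", "region"])
    return prepared.reset_index(drop=True)


raw = pd.read_csv(io.StringIO(RAW_CSV))
clean = prepare(raw)
print(f"kept {len(clean)} of {len(raw)} rows")
```

Keeping the cleaning steps in one function makes them easy to rerun on new extracts, which is roughly the role the saved transformation pipeline plays in the tool.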
Another great feature of AML Workbench is the ability to import data from the cloud and develop code locally in a desktop app, then run that code and output the results either in the cloud or locally. This end-to-end pipeline management lets data scientists define the transformations and model data specification, and then simply export that code to a tool like Azure Data Factory for automation and production implementation. The result is less time from raw data to an initial model and a more affordable transition to a production environment. It also takes a lot of DevOps complexity out of the work I have to do as a data scientist.
Lastly, your data and code can be accessed from a Jupyter Notebook running in your project, which makes it easy to test and develop new ideas. Iterating on models is straightforward, and you can try many different models, transformations, and parallelization approaches quickly while keeping everything organized.
Overall, for a brand-new tool, AML Workbench is an exciting and refreshing update. Being able to run machine learning workloads of any size, with an efficient and performant way to execute data transformations and model code in Python, is a welcome change from some of the older, slower environments. It’s really exciting to see Microsoft making new tools specifically for data scientists, and I’m looking forward to the new features and improvements they continue to make to these tools.