Data. It’s all about the data. We want to make more data driven decisions. We want to keep more data so we can make better decisions. We want that data stored cheaply, easily accessible, and quickly ingested. Hadoop promises to help with all those things. However, when you deal with Hadoop on-premises you have a multi-step process to load the data. Azure and WASB to the rescue!
With a typical Hadoop installation you load your data to a staging location then you import it into the Hadoop Distributed File System (HDFS) within a single Hadoop cluster. That data is manipulated, massaged, and transformed. Then you may export some or all of the data back to a non-HDFS system (a SAN, a file share, a website).
What’s different in the cloud? With Azure you have Azure Blob Storage Accounts. Data can be stored there as blobs in any format. That data can be accessed by various applications – including Hadoop without first doing a separate load into HDFS! This is made possible because Microsoft used the public extensions available with HDFS to create the Windows Azure Storage Blobs (WASB) interface between Hadoop and the Azure blob storage. This WASB code is available for any distributor of Hadoop in the Apache source code and it is the default storage system in HDInsight – Microsoft’s Hadoop on Azure PaaS offering. It is also available for Hortonworks HDP on Azure VMs or Cloudera EDH/CDH on Azure VMs with some manual configuration steps.
With WASB you load your data to Azure blobs at any time – whether Hadoop clusters currently exist or not. That way you aren’t paying for Hadoop compute time simply to load data. You spin up one or more clusters, point them at the data sets (yes, multiple clusters pointing to same data!), and run your Hadoop jobs. When you don’t need the system for a while you take down your Hadoop cluster(s) and the data is still there. At any point, whether one or more Hadoop clusters are accessing the data or not, other applications can still access and manipulate the data. For example, you could have data sitting on an Azure storage account that is being added to by a SQL Server Integration Services (SSIS) job. At the same time someone is using Power Query to load that data into PowerPivot while a website inserts new data to the same location. Meanwhile your R&D department can be running highly intensive jobs that require a large cluster up for many days or weeks at a time, and your sales team can have a separate, smaller cluster that’s up for a few hours a day – all pointing at the same data!
With this separation of storage and compute you have simplified your data accessibility, reduced data movement and copies, and reduced the time it takes to have your data available! That all adds up to lower costs and a faster, more data-driven time to insight.