Tackling the cloud data workloads with Azure Data Factory v2

With the new preview of Azure Data Factory (ADF) Version 2 arriving on the scene in September 2017, we at Neal Analytics saw a great opportunity to use its new features in a promotion optimization project that was wrapping up and needed solid production automation. Data Factory lets you create “data-driven workflows”, called pipelines, for orchestrating and automating data movement and data transformation. Using the activities that make up pipelines, you can ingest, move, and transform data before using business intelligence and machine learning applications to visualize it and tell compelling stories. For Neal, ADF’s ability to connect to a wide variety of data sources is critical to creating production-grade solutions without the typical complex (and expensive) data engineering effort.

The two features I would like to highlight from ADF Version 2 are chaining activities and master pipelines. Both are part of the new pipeline control flow, which lets you manage the conditional execution of your pipelines and build more flexible and efficient workflows.

For the automation of our promotion optimization solution, we took advantage of chaining activities to simplify our activity executions. New to ADF V2 is the dependsOn property, which makes it easy to set up dependencies between activities within a pipeline. In V1, each activity was required to have a designated input and output, and you chained activities by arranging the output of one as the input of another. Compared to that, dependsOn is a pleasant addition that streamlines chaining. It also makes parallel execution easy once you have determined the order of the dependencies: activities that do not depend on one another are free to run at the same time. Simply name the activity that the next activity must wait for, using the following snippet within each dependent activity’s JSON, and you’re done! You can also choose among the dependency conditions Succeeded, Completed, Failed and Skipped to best accomplish your activity chain.


"typeProperties": {
},
"dependsOn": [
    {
        "activity": "",
        "dependencyConditions": [
            "Succeeded"
        ]
    }
]
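
To make the chaining concrete, here is a minimal sketch of a complete pipeline definition. Every name in it (the pipeline, the activities, the stored procedures, the linked service, and the alert URL) is a hypothetical placeholder rather than anything from our project, and the activity types were picked only to keep the example short. The two load activities declare no dependencies, so they are free to run in parallel; the feature step waits for both to succeed, and the notification runs only if the feature step fails.

```json
{
  "name": "ExamplePromotionPipeline",
  "properties": {
    "activities": [
      {
        "name": "LoadSales",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": {
          "referenceName": "ExampleSqlLinkedService",
          "type": "LinkedServiceReference"
        },
        "typeProperties": { "storedProcedureName": "usp_LoadSales" }
      },
      {
        "name": "LoadPromotions",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": {
          "referenceName": "ExampleSqlLinkedService",
          "type": "LinkedServiceReference"
        },
        "typeProperties": { "storedProcedureName": "usp_LoadPromotions" }
      },
      {
        "name": "BuildFeatures",
        "type": "SqlServerStoredProcedure",
        "linkedServiceName": {
          "referenceName": "ExampleSqlLinkedService",
          "type": "LinkedServiceReference"
        },
        "typeProperties": { "storedProcedureName": "usp_BuildFeatures" },
        "dependsOn": [
          { "activity": "LoadSales", "dependencyConditions": [ "Succeeded" ] },
          { "activity": "LoadPromotions", "dependencyConditions": [ "Succeeded" ] }
        ]
      },
      {
        "name": "NotifyOnFailure",
        "type": "WebActivity",
        "typeProperties": {
          "url": "https://example.com/alerts",
          "method": "POST",
          "body": { "message": "BuildFeatures failed" }
        },
        "dependsOn": [
          { "activity": "BuildFeatures", "dependencyConditions": [ "Failed" ] }
        ]
      }
    ]
  }
}
```

Listing two entries under dependsOn means the activity waits for both of them, which is how a fan-in after parallel steps can be expressed.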

This brings Data Factory a lot closer to realizing its foundational idea: “We need SQL Integration Services for the cloud.”

Second, the ability to invoke one pipeline from another has made it much easier for my team to control the execution of our automation pipelines. Using the Execute Pipeline activity, you can run a top-level master pipeline that manages the execution of multiple pipelines through the dependsOn property discussed above. To make use of a master pipeline, you need two key things:

  1. Create a pipeline for each data flow you wish to orchestrate. These pipelines will be invoked by the master pipeline instead of, or in addition to, running on their normal schedule.
  2. Create a master pipeline with one Execute Pipeline activity for each invoked pipeline. Within the master pipeline, you can then use the dependsOn property to chain your invoked pipelines just as you would chain dependent activities, and the waitOnCompletion property decides whether the master pipeline must wait until the invoked pipeline has finished.
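
As a rough sketch of the second step, a master pipeline might look like the following. The pipeline and activity names are hypothetical placeholders; only the Execute Pipeline activity, dependsOn, and waitOnCompletion come from the features described above.

```json
{
  "name": "ExampleMasterPipeline",
  "properties": {
    "activities": [
      {
        "name": "RunIngest",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": {
            "referenceName": "ExampleIngestPipeline",
            "type": "PipelineReference"
          },
          "waitOnCompletion": true
        }
      },
      {
        "name": "RunScoring",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": {
            "referenceName": "ExampleScoringPipeline",
            "type": "PipelineReference"
          },
          "waitOnCompletion": true
        },
        "dependsOn": [
          { "activity": "RunIngest", "dependencyConditions": [ "Succeeded" ] }
        ]
      }
    ]
  }
}
```

Setting waitOnCompletion to true on RunIngest matters here: the activity is not considered finished until the invoked pipeline itself has finished, so RunScoring starts only after the ingest run has actually succeeded rather than merely been started.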

There are many more great additions in ADF V2 that my team is looking forward to exploring in the near future, including parameters and triggers. For more information on the differences between Azure Data Factory Version 1 and Version 2 please visit https://docs.microsoft.com/en-us/azure/data-factory/compare-versions.

References

Lo, Sharon, and Pelluru, Sreedhar. “Introduction to Azure Data Factory.” Microsoft (2017): https://docs.microsoft.com/en-us/azure/data-factory/introduction.

Kromer, Mark, Lo, Sharon, and Pelluru, Sreedhar. “Execute Pipeline activity in Azure Data Factory.” Microsoft (2017): https://docs.microsoft.com/en-us/azure/data-factory/control-flow-execute-pipeline-activity.