Azure SQL Data Warehouse for Structured Big Data

Posted By:

Many companies today are trying to transform themselves into data-driven organizations, that is, to become the type of organization where data and analytics is part of the business strategy, systems, processes, and culture. And while being data-driven is in part a mindset, it is also about having the right underlying tools and infrastructure to enable this strategic shift.

Big data analytics platforms load, store, and analyze volumes of data at high speed, providing timely insights to businesses. Data-driven organizations leverage this data, for example, for advanced analysis to market new promotions, operational analytics to drive efficiency, or for predictive analytics to evaluate credit risk and detect fraud. Customers are leveraging a mix of relational analytical databases and data warehouses to gain business insights.

Gigaom recently published a set of benchmarks comparing the price/performance of the four leading cloud-based relational databases: Amazon Redshift, Azure SQL Data Warehouse, Google Big Query, and Snowflake Data Warehouse. The whitepaper found that using representative set of queries from the TPC-H standard, Azure SQL Data Warehouse consistently outperformed its counterparts in both price and performance. This whitepaper, which can be found at https://azure.microsoft.com/en-us/services/sql-data-warehouse/compare/, is well worth reading.

Rather than rehash the findings of this whitepaper – it can speak for itself – I thought this might be a good opportunity to highlight some of the top features which makes Azure SQL Data Warehouse such a great fit for enterprise database solutions.

 

What is Azure SQL Data Warehouse?

Azure SQL Data Warehouse (SQL DW) is Microsoft’s cloud-based Platform-as-a-Service (PaaS) for massive, structured relational databases. SQL DW offers petabyte-scale due to its distributed architecture and use of parallel processing. A fully managed cloud service offering independently-scaled compute and storage, SQL DW is an excellent choice when looking to store big relational data on cloud with highly performant queries. SQL DW, which became generally available in 2015, has proven repeatedly to perform better performing competitor offerings including Amazon Redshift, Google BigQuery and Snowflake DW.

Features in Azure SQL Data Warehouse

Let us highlight some of the top features which makes SQL DW such a great fit for enterprise database solutions:

Secondary Index

Azure SQL DW, Amazon Redshift, Google BigQuery and Snowflake all provide clustered column store index (CCI). However, SQL DW offers the unique capability of a secondary index on CCI tables; this is also known as secondary B-Tree indexes or non-clustered index (NCI). Aggregation and analytical queries mostly perform well with CCI tables, but when querying relatively smaller dataset aggregation or lookup, the secondary index can improve performance greatly
Best practice:

  • User a secondary index for the columns which are frequently used for filter, group by, or order by
  • Do not overuse secondary indexes, as the additional overhead required to maintain them can negatively impacts query performance. Use for 1 or 2 columns per table.

Syntax
CREATE [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON [ database_name . [ schema ] . | schema . ] table_name
( { column [ ASC | DESC ] } [ ,...n ] )
WITH ( DROP_EXISTING = { ON | OFF } )
[;]

PolyBase

Data migration is a key step in any organization’s moving to the cloud. A poorly executed migration can lead to increased latency, higher costs, and risk business continuity goals.

Data migration is where SQL DW shines above all. SQL DW’s PolyBase feature enables lightning-fast data transfer from multiple sources and allows data queries directly from the source without moving beforehand, creating extremely rapid execution by comparison. PolyBase can query structured data stored in Hadoop Distributed File System (HDFS), Blob Storage, Azure Data Lake Storage, SQL Server, Oracle, Teradata and MongoDB. PolyBase can also query non-relational data, like delimited files.

PolyBase offers the fastest throughput compared to BCP, SQLBulkCopy/ADF or SSIS. Every data engineer has faced two very common hurdles when joining SQL data to external data sources: either transfer all the data to same location with same format,or query them individually and bring the query result to a centralized location to perform independent joins. PolyBase removes this hassle and allows data engineers to join using T-SQL.

Compute Gen2 Tiers

One of the important features of SQL Data Warehouse is that it can scale storage and compute independently. To achieve this, Microsoft re-architected the product to utilize Azure Blob Storage; because blob storage is persistent, this allows massively parallel processing databases (MPP) to be “paused” and “resumed”. Of course, SSD-based storage is even faster than Blob Storage, which has led to new, compute optimized tiers.

This new tier allows the data to be cached in NVMe SSDs, an incredible innovation as it still uses Azure Blob Storage. Consequently, SQL DW gains local SSD performance along with independently-scaled compute and storage. This provides up to 5x improvement in query performance compared to Gen1.

Business Continuity

Perhaps the single most important part of any data service’s security and availability is reliable backup and configuration restoration. SQL DW provides automatic restore points every 30 mins. Automatic restore points cannot be deleted by any users and are used to maintain recovery SLAs. Users can restore their data warehouse using any of restore points within the previous seven days.

In addition to Automatic Restore Points, SQL DW also allows user-defined, point-in-time restore points. Up to 42 can be created per SQL DW instance, with custom retention period. SQL DW also provides Geo-Backups. As the name implies, Geo-Backups allow users to backup their SQL DW to a paired datacenter, enabling restoration when users cannot access their primary region. The geo-backups can be restored to any data centers which support SQL DW.

Summary

A robust, data estate infrastructure is a foundational enabler of any move towards becoming a data-driven organization. Deploying and leveraging cloud-based solutions such as Azure SQL Data Warehousing are an important step in that journey. That said, preparing for this new data-driven world requires more than just a partner that can upgrade your database platform. It requires the unique blend of expertise in business strategy, people and processes, governance, and technology that only a company like Neal Analytics can provide. Contact us and let us help you begin this journey with a professional assessment.