
Azure Synapse Analytics for structured big data
The value of leveraging data to drive decision making is widely known today, but many companies are still trying to transform themselves into data-driven organizations, and more still are trying to implement AI and machine learning in their workloads. In order to become more data driven and enable innovative new use cases, organizations need to make sure they choose the best data platforms for their workloads.
Big data analytics platforms load, store, and analyze volumes of data at high speed, providing timely insights to businesses. For example, data-driven organizations leverage this data for advanced analysis to market new promotions, operational analytics to drive efficiency, or predictive analytics to evaluate credit risk and detect fraud. Organizations are leveraging a mix of relational analytical databases and data warehouses to gain business insights.
Gigaom published a set of benchmarks comparing the price/performance of the four leading cloud-based relational databases: Amazon Redshift, Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Google Big Query, and Snowflake Data Warehouse. The whitepaper found that Azure Synapse Analytics consistently outperformed its counterparts in both price and performance when using a representative set of queries from the TPC-DS standard. In fact, Gigaom found that implementing an end-to-end analytics solution on Azure cost up to 13% less than AWS, 49% less than Google, and a whopping 59% less than Snowflake.
Rather than rehash the findings of this whitepaper (which can be found here), however, we thought it would be best to highlight some of the top features which make Azure Synapse Analytics an ideal great fit for enterprise database solutions.
What is Azure Synapse Analytics?
Azure Synapse Analytics (formerly Azure SQL Data Warehouse, or SQL DW) is Microsoft’s cloud-based Platform-as-a-Service (PaaS) for massive, structured relational databases. Synapse Analytics offers petabyte scale data storage and processing capabilities due to its distributed architecture and use of parallel processing. A fully managed cloud service offering compute and storage that scale independently, Azure Synapse is an excellent choice when looking to store big relational data on the cloud with highly performant queries. Azure Synapse, which became generally available under the SQL Data Warehouse moniker in 2015, has been repeatedly proven to outperform competitor offerings, including Amazon Redshift, Google BigQuery, and Snowflake Data Warehouse.
Features in Azure Synapse Analytics
Let us highlight some of the top features which make Synapse Analytics such an excellent fit for enterprise database solutions:
Secondary Index
Azure Synapse Analytics, Amazon Redshift, Google BigQuery, and Snowflake all provide clustered column store index (CCI). However, Synapse Analytics offers the unique capability of a secondary index on CCI tables, also known as secondary B-Tree indexes or non-clustered index (NCI). Aggregation and analytical queries mostly perform well with CCI tables, but when querying relatively small dataset aggregation or lookup, the secondary index can improve performance greatly
Best practice:
User a secondary index for the columns which are frequently used for filter, group by, or order by
Do not overuse secondary indexes, as the additional overhead required to maintain them can negatively impact query performance. Use for 1 or 2 columns per table.
Syntax
CREATE [ CLUSTERED | NONCLUSTERED ] INDEX index_name
ON [ database_name . [ schema ] . | schema . ] table_name
( { column [ ASC | DESC ] } [ ,...n ] )
WITH ( DROP_EXISTING = { ON | OFF } )
[;]
PolyBase
Data migration is a critical step in any organization’s moving to the cloud. A poorly executed migration can lead to increased latency, higher costs, and risk to business continuity goals.
Data migration is where Azure Synapse Analytics shines above all. The PolyBase feature in Azure Synapse enables lightning-fast data transfer from multiple sources. It allows data queries directly from the source without moving beforehand, creating rapid execution by comparison. PolyBase can query structured data stored in Hadoop Distributed File System (HDFS), Blob Storage, Azure Data Lake Storage, SQL Server, Oracle, Teradata, and MongoDB. PolyBase can also query non-relational data, like delimited files.
PolyBase offers the fastest throughput compared to BCP, SQLBulkCopy/ADF or SSIS. Every data engineer has faced two very common hurdles when joining SQL data to external data sources:
-
- Either transfer all the data to the same location with the same format, or
- Query them individually and bring the query result to a centralized location to perform independent joins.
PolyBase removes this hassle and allows data engineers to join using T-SQL.
Compute Gen2 Tiers
One of the crucial features of Azure Synapse Analytics is that it can scale storage and compute independently. To achieve this, Microsoft re-architected the product to use Azure Blob Storage; because blob storage is persistent, this allows massively parallel processing databases (MPP) to be “paused” and “resumed.” Of course, SSD-based storage is even faster than Blob Storage, which has led to new, compute-optimized tiers.
This new tier allows users to cache the data in NVMe SSDs, an incredible innovation as it still uses Azure Blob Storage. Consequently, Synapse Analytics gains local SSD performance and independently scaled compute and storage. The combination of SSD performance and right-sized compute and storage provides up to 5x improvement in query performance compared to Gen1.
Business Continuity
Perhaps the most important part of any data service’s security and availability is reliable backup and configuration restoration. Azure Synapse Analytics provides automatic restore points every 30 mins. Users cannot delete automatic restore points, which are used to maintain recovery SLAs. Users can, however, restore their data warehouse using any restore point from the previous seven days.
In addition to automatic restore points, Azure Synapse allow user-defined, point-in-time restore points. Organizations can create up to 42 per Synapse Analytics instance, with custom retention period.
Azure Synapse also provides geo-backups. As the name implies, geo-backups allows users to back up their Azure Synapse data to a paired data center, enabling restoration when users cannot access their primary region. Organizations can restore geo-backups to any data centers which support Azure Synapse Analytics.
Summary
A robust data estate infrastructure is a foundational enabler of any move toward becoming a data-driven organization and leveraging innovative AI solutions. Deploying and leveraging cloud-based solutions such as Azure Synapse Analytics is an important step in that journey. With that said, preparing for this new data-driven world requires more than just a partner that can upgrade your database platform. It requires the unique blend of expertise in business strategy, people and processes, governance, and technology that only a company like Neal Analytics can provide. Contact us, and let us help you begin this journey with a professional assessment.
This blog was originally published on 4/9/2019 and has since been updated.