What is Azure Synapse Analytics?
Azure Synapse Analytics is a collection and enhancement of current Azure resources that give the client a single work space to manage the data from ingestion to presentation. To understand this concept and why it is a hot topic for data people you must understand the challenges that currently exist.
For a quick look at Azure Synapse you can watch this: https://azure.microsoft.com/en-in/resources/videos/tour-synapse-analytics-01/
This will be different for everyone depending on when you started your journey. Today you might be using SSIS packages to take source data and ETL into a structured data warehouse. If that is the case, then you know that if something changes at the source it breaks and time is spent to get that data in which can be complicated and time-consuming.
Data Lake (Hadoop), allows you to do ELT, basically, your transformations come after the load so you can fix it up at that point. We can load the lake using different tools but let’s say we use Azure Data Factory, which is essentially a read from and write to, typically no transformations, just a raw dump or copy of files.
We used SSIS to transform previously now this time we have different tools for transforming the data, in this case, we will use Spark which is the most popular. Spark works as a query against the Data Lake and, using that we can also write cleaned up data and views of data from the raw loads. To fix that would be fixing a query, so less complex than SSIS pipelines.
Now we may need to load into our Data Warehouse where we use Dimensional Modeling. We can use Spark again to read from the Data Lake and write to the database, most likely writing to staging tables then using stored procedures to load into final tables for consumption by Analysis Services that holds the data models to server data to Power BI. Eventually, you end up with architecture like this…
There are a few elements in this diagram that need some additional explanation for those unfamiliar with them.
I mentioned Spark earlier, but you need infrastructure to run Spark. In this case, a popular choice is Azure Databricks which you can think of as a compute cluster Spark engine that processes notebooks to do that querying and writing with the Data Lake.
Next is the Azure Synapse Analytics, that diagram used to say Azure SQL Server DW. This is where the first confusion of Synapse is. For now, just think of that iteration of Synapse as a rebranding of Azure SQL Server DW, once fully baked and released it will be more and that diagram will change.
Finally, there is Polybase. This is an SQL Server technology that allows mapping of tables to external data like the Data Lake. A lot of people refer to this as Data Virtualization. The key thing here is that the objects in the Data Lake appear like a regular table in SQL Server DW.
One of the big issues here is the segregation of the architectural pieces needed to get data from source to Power BI. We have a Databricks workspace, Azure Storage Explorer, Data Factory Workspace, SQL Server Management Studio or Azure Data Studio, Power BI, etc. With this segmentation, it can be difficult to maintain, integrate, and monitor.
Let’s talk about the collection of services… if we take the previous diagram and we look at it differently. Note the following diagrams are from Microsoft presentations given during the announcement and may change.
You can see it transition more to something like this:
Now, this shows that the Data Lake and the Power BI portion are separate but, in the workspace, we will be able to see everything. So where is the Database, the Spark engine, and the Data Factory? All of that is integrated together now.
In this diagram you can see what is happening inside of that Synapse container from the previous diagram. The cool thing here is that Spark is fully integrated, so you can work on your notebooks using different languages of your choice. The best part is because of this tight integration all the plumbing is done for you, so even things like Polybase is much simpler. There is no managing of connection strings between the different pieces. This is all managed under one workspace:
You can see that the workspace is broken into different area that would be the focus of what you are trying to do. So, as you can see the architecture isn’t all new but evolved in to a better experience for developing, managing, and monitoring.
If you are starting to think about a new data analytics platform either to replace what you have or something you didn’t have before, then Azure Synapse Analytics should be on your roadmap. It brings together everything you need to get from the source to provisioning data in one workspace. The integration lets you focus on the data and not the plumbing. The collective view allows you to monitor and troubleshoot to the root cause quickly. All this equals to a reduction of time to delivery and low operational costs because of the tight built-in integration. The last challenge now is to see it come to public preview in Azure.