Data Warehousing and Data Science with SQL Data Warehouse and Spark

April 13, 2017

Most enterprises require a centralized data warehouse for advanced analytics and reporting. Traditionally, this required IT organizations to manually set up a number of systems and services and to build pipelines on top of them that ingest, process and store data. This solution sets up an end-to-end data ingestion, wrangling and warehousing pipeline using Apache Spark, Azure SQL Data Warehouse and Azure Data Factory within a few minutes. It demonstrates how to use Jupyter notebooks and Power BI to explore and visualize the raw data and processed results residing in Azure HDInsight and Azure SQL Data Warehouse.
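To make the wrangling stage more concrete, the following is a minimal sketch in plain Python of the kind of sanitize-then-aggregate step such a pipeline might perform. It is not the Spark job this solution deploys. The field names (`artist`, `duration`), the cleaning rules, and the helper names `sanitize` and `aggregate` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def sanitize(records):
    """Drop rows with a missing artist or a missing/non-positive duration.

    The field names and rules here are hypothetical, for illustration only.
    """
    clean = []
    for row in records:
        artist = (row.get("artist") or "").strip()
        duration = row.get("duration")
        if not artist or duration is None or duration <= 0:
            continue  # skip rows that fail the (assumed) quality checks
        clean.append({"artist": artist, "duration": float(duration)})
    return clean


def aggregate(records):
    """Return the track count and mean duration for each artist."""
    totals = defaultdict(lambda: [0, 0.0])  # artist -> [count, summed duration]
    for row in records:
        entry = totals[row["artist"]]
        entry[0] += 1
        entry[1] += row["duration"]
    return {
        artist: {"tracks": count, "avg_duration": total / count}
        for artist, (count, total) in totals.items()
    }


# Hypothetical raw input, including rows that should be filtered out.
raw = [
    {"artist": " Artist A ", "duration": 200.0},
    {"artist": "Artist A", "duration": 100.0},
    {"artist": "", "duration": 180.0},
    {"artist": "Artist B", "duration": None},
    {"artist": "Artist B", "duration": 240.0},
]

summary = aggregate(sanitize(raw))
print(summary)
```

In a deployment like the one described here, the equivalent logic would run as a distributed Spark job over files in storage, and the aggregated output would be what gets loaded into the warehouse.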
> **Note:** If you have already deployed this solution, click here to view your deployment.

### Estimated Provisioning Time: 25 Minutes

## Overview

This solution uses the Million Song dataset as a sample to create a data ingestion and processing pipeline.

[Figure: Solution Diagram]

The steps involved in this solution are as follows:

* Creation of the aforementioned Azure resources in the user’s Azure subscription.
* Copy of the Million Song dataset from a public storage location to a newly created Azure Storage container.
* Execution of data sanitization and aggregation using Apache Spark powered by Azure HDInsight.
* Transfer of the processed results from the Azure HDInsight storage location into Azure SQL Data Warehouse using PolyBase load queries.
* Exploration of the raw data and processed results using Jupyter notebooks and Power BI.

All of the above steps (except data exploration) are orchestrated by Azure Data Factory.

Dataset courtesy of: Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

## Disclaimer

©2017 Microsoft Corporation. All rights reserved. This information is provided "as-is" and may change without notice. Microsoft makes no warranties, express or implied, with respect to the information provided here. Third-party data was used to generate the solution. You are responsible for respecting the rights of others, including procuring and complying with relevant licenses in order to create similar datasets.