Create an End-to-End (E2E) Deployment-ready Data Pipeline for Consuming Azure Services: A Step-by-Step Guide

September 13, 2016

Nearly every industry and use case shares a need for a pipeline that provides actionable visualizations, real-time metrics, long-term storage, and batch analytics. The overall architecture often stays much the same across use cases, with individual components changed to fit business needs. This tutorial walks you through the steps to create an end-to-end (E2E) deployment-ready data pipeline for consuming Microsoft Azure services.
The tutorial demonstrates how different components of the Cortana Intelligence Suite, such as Azure Stream Analytics (ASA), Azure Event Hubs, Azure Data Factory (ADF), Azure HDInsight (HDI), Azure Machine Learning (AML), Azure SQL Database, and Power BI (PBI), can be assembled into an E2E solution for data ETL, data analytics, and data visualization. It provides a basic, general architecture that can be adapted to various application scenarios, such as website traffic monitoring or sensor log data analysis.

**Note that** the architecture of this tutorial is very similar to that of another tutorial published in the Cortana Intelligence Gallery, with one major difference: this tutorial uses HDInsight for data preprocessing, while the other uses Azure Data Lake Analytics. Both HDInsight and Azure Data Lake Analytics can be used for big data batch querying, but there are some distinguishing factors.

Azure HDInsight is an Apache Hadoop distribution powered by the cloud. It can handle any amount of data, scaling from terabytes to petabytes on demand, and can spin up any number of nodes at any time. In addition, HDInsight includes Apache Spark for large-scale in-memory data analytics, Apache HBase for NoSQL transactional capabilities, and Apache Storm for real-time stream processing. HDInsight can process unstructured or semi-structured data and has powerful programming extensions for languages including Hive, C#, Java, and .NET. Users can also use Excel or other BI tools to visualize the data. Last but not least, HDInsight incorporates R Server for Hadoop, a scale-out cloud implementation of 100 percent open-source R integrated with Hadoop and Spark clusters. It combines the familiarity of R with the scalability and performance of Hadoop.

Azure Data Lake Analytics, on the other hand, provides immediate and elastic scale without requiring you to create or manage a cluster. It also includes U-SQL, a query language that combines C# and SQL features, which enables much faster development and more customization. For users with a strong background in .NET, C#, and SQL, Azure Data Lake Analytics with Visual Studio tooling and U-SQL is a good choice.

The following figure shows the overall architecture of the E2E deployment that this tutorial describes. There are two main paths in this architecture: the hot path processes and visualizes real-time streaming data, and the cold path builds and stores more complex analytics and machine learning solutions.

[Figure: Overall Architecture]

The streaming data flows first into Event Hubs and then into Azure Stream Analytics, where the data flow splits into the hot path and the cold path. The hot path data flows into Power BI, where it can be visualized to monitor the live data stream. The cold path data flows first into Blob Storage, and HDInsight is used for feature engineering. Azure Machine Learning then uses a pre-trained scoring model to make predictions based on the new features. The prediction results are written into Azure SQL Database and finally into the on-premises server. Azure Data Factory orchestrates the data flow for the cold path.

This tutorial provides deployable components and will walk you through the steps necessary to install everything.
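To make the hot/cold split concrete, here is a minimal in-memory Python sketch of the data flow described above. It is not the tutorial's deployment code and calls no Azure SDK: each function only stands in for the role of one service, and every name, the event fields (`device`, `value`, `ts`), and the threshold rule used in place of a pre-trained model are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def route_events(events):
    """Stand-in for the stream-processing step: fan each event out to both paths."""
    hot_path = []   # stand-in for the live stream sent to a dashboard
    cold_path = []  # stand-in for raw events landed in blob storage
    for event in events:
        # Assumption: the hot path only needs the fields shown on a live chart.
        hot_path.append({"device": event["device"], "value": event["value"]})
        cold_path.append(dict(event))
    return hot_path, cold_path


def engineer_features(cold_events):
    """Stand-in for batch feature engineering: aggregate raw values per device."""
    by_device = defaultdict(list)
    for event in cold_events:
        by_device[event["device"]].append(event["value"])
    return {
        device: {"mean": mean(values), "max": max(values), "count": len(values)}
        for device, values in by_device.items()
    }


def score(features, threshold=50.0):
    """Stand-in for scoring: a hypothetical threshold rule, not a trained model."""
    return {device: f["mean"] > threshold for device, f in features.items()}


def run_cold_path(cold_events):
    """Stand-in for orchestration: run the batch steps in order, return rows for a SQL sink."""
    features = engineer_features(cold_events)
    predictions = score(features)
    return [
        {"device": device, **features[device], "alert": predictions[device]}
        for device in sorted(features)
    ]


events = [
    {"device": "a", "value": 40.0, "ts": 1},
    {"device": "b", "value": 70.0, "ts": 1},
    {"device": "a", "value": 50.0, "ts": 2},
    {"device": "b", "value": 90.0, "ts": 2},
]
hot, cold = route_events(events)
rows = run_cold_path(cold)
```

In this sketch both paths receive every event; in a real deployment the stream-processing query decides what each output receives, so treat the fan-out rule here as a simplification.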
Clicking **View Code** on the right will take you to a GitHub repository containing the following:

- **** - The deployment steps
- **E2E Pipeline Setup Tutorial.docx** - The detailed step-by-step tutorial, which is needed for the step-by-step deployment
- **Data Generator** - Folder containing the simulator that generates synthetic data
- **script** - Folder containing all scripts needed to deploy the pipeline
- **media** - Folder containing images

It will take about 2-3 hours to finish the entire deployment of the sample pipeline. When everything is successfully deployed, you will have an E2E data pipeline running, with data flowing from the data generator into SQL and Power BI. Click **View Code** on the right to begin.
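The repository's simulator is not reproduced here. As a rough idea of what a synthetic data generator for such a pipeline might do, the following Python sketch emits reproducible JSON events. The function names, the device names, and the event fields (`device`, `value`, `seq`) are all hypothetical and are not taken from the repository.

```python
import json
import random


def generate_events(n, devices=("sensor-1", "sensor-2"), seed=0):
    """Yield n synthetic readings; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    for i in range(n):
        yield {
            "device": rng.choice(devices),                 # hypothetical device id
            "value": round(rng.uniform(0.0, 100.0), 2),    # hypothetical reading
            "seq": i,                                      # sequence number of the event
        }


def to_json_lines(events):
    """Serialize events as one JSON object per line, a common ingestion format."""
    return "\n".join(json.dumps(event) for event in events)


payload = to_json_lines(generate_events(5))
```

A real generator would send each serialized event to the ingestion endpoint instead of collecting the events in a string.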