Getting Started with using Azure Data Lake Analytics with R - A Tutorial

September 1, 2017

A tutorial to get started with using Azure Data Lake Analytics with R for Data Science work.
Welcome to the Getting Started with using ADLA with R tutorial! This tutorial is meant to help data scientists get started with using Azure Data Lake Analytics (ADLA) with R. We will work through a series of short exercises to learn how to deploy R code within ADLA and to understand how R interacts with U-SQL. By the end of the tutorial you should be ready to integrate your R projects with Azure Data Lake Analytics and exploit its many capabilities for your big data needs.

**Azure Data Lake Analytics**

Azure Data Lake Analytics is an on-demand cloud analytics job service designed to make big data analytics easy. It can perform data processing, advanced analytics, and machine learning modeling with high scalability in a cost-effective way. Using U-SQL, R, Python, and .NET, it allows users to run massively parallel data transformation and processing over petabytes of data. A job in ADLA is submitted as a ***U-SQL*** script.

**U-SQL**

U-SQL is the big data query language of the Azure Data Lake Analytics service from Microsoft. It combines a familiar SQL-like declarative language with the extensibility and programmability of C# types and the C# expression language. It includes big data processing concepts such as "schema on read" and custom processors and reducers, on top of a scale-out runtime that can handle data of any size. It also provides the ability to query and combine data from a variety of sources, including Azure Data Lake Storage, Azure Blob Storage, Azure SQL DB, Azure SQL Data Warehouse, and SQL Server instances running in Azure VMs.
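To make the shape of a U-SQL job concrete, here is a minimal sketch: extract rows from a delimited file in the Data Lake Store, filter them with a SQL-like expression, and write the result back out. The file paths and the schema below are hypothetical placeholders, not part of this tutorial's data.

```sql
// Hypothetical paths and schema, for illustration only.
@searchlog =
    EXTRACT UserId  int,
            Start   DateTime,
            Region  string,
            Query   string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@filtered =
    SELECT UserId, Region, Query
    FROM @searchlog
    WHERE Region == "en-us";   // C# expression syntax, hence ==

OUTPUT @filtered
    TO "/output/SearchLog-filtered.csv"
    USING Outputters.Csv();
```

Note the mix of declarative SQL-style statements with C# expressions (`==` rather than SQL's `=`) that the paragraph above describes.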
**U-SQL with R**

R Extensions for U-SQL enable developers to perform massively parallel execution of R code for end-to-end data science scenarios, covering merging various data files, massively parallel feature engineering, partitioned model building, scoring, and post-deployment. To deploy R code, we need to install the U-SQL extensions (usqlext) in our Azure Data Lake Analytics account and, within the U-SQL script, use the REFERENCE ASSEMBLY statement to enable R extensions for the U-SQL script. More sample code for using R can be found in the following folder in your Data Lake Store: <your_account_address>/usqlext/samples/R.

**Prerequisites for Tutorial**

Before you begin this [tutorial][1], you must have the following:

- An Azure subscription. See [Get Azure free trial][2].
- Azure CLI 2.0. See [Install and configure Azure CLI][3].
- An Azure Data Lake Analytics account and a Data Lake Store in your subscription.
- Working knowledge of [R][4].

In this tutorial we will be using the Azure CLI inside a [Jupyter notebook][5] for tasks such as managing and submitting jobs. You can download the Azure CLI from [here][6]. To install the CLI on Windows and use it in the Windows command line, download and run the [MSI][7]. To get started with using the Azure CLI for this tutorial, follow the instructions [here][8].

[Additional Resources][9]

**Acknowledgments**

Many thanks to Haibo Lin, Esin Saka, Shravan Matthur Narayanamurthy, Ivan Popivanov, Michael Rys and Hiren Patel for their help and support.

[1]:
[2]:
[3]:
[4]:
[5]:
[6]:
[7]:
[8]:
[9]:
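As a sketch of the pattern described above, the script below references the R extension assembly and runs an inline R script over each data partition with a reducer. The input path, column names, and output schema are hypothetical; the tutorial's own exercises cover the details. Inside the R script, U-SQL passes the partition in as the data frame `inputFromUSQL` and reads results back from `outputToUSQL`.

```sql
REFERENCE ASSEMBLY [ExtR];   // enable R extensions for this U-SQL script

// Hypothetical paths, schema, and R logic, for illustration only.
DECLARE @myRScript = @"
# inputFromUSQL / outputToUSQL are the data frames U-SQL passes in and out
outputToUSQL <- data.frame(
    Par = inputFromUSQL$Par[1],
    MeanSepalLength = mean(inputFromUSQL$SepalLength))
";

@iris =
    EXTRACT SepalLength double,
            SepalWidth  double,
            Species     string,
            Par         int        // partition key for parallel R execution
    FROM "/usqlext/samples/R/iris.csv"
    USING Extractors.Csv();

@result =
    REDUCE @iris ON Par            // one R process per partition value
    PRODUCE Par int, MeanSepalLength double
    READONLY Par
    USING new Extension.R.Reducer(command: @myRScript,
                                  rReturnType: "dataframe");

OUTPUT @result
    TO "/output/iris-summary.csv"
    USING Outputters.Csv();
```

Because the reduce runs independently per `Par` value, the R code scales out across partitions; the columns the R data frame returns must line up with the PRODUCE clause.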