Getting Started with Azure Data Lake Analytics and R: A Tutorial
A tutorial to help data scientists get started with using Azure Data Lake Analytics with R for data science work.
Welcome to the Getting Started with ADLA and R tutorial! This tutorial is meant to help data scientists get started with using Azure Data Lake Analytics (ADLA) with R. Through a series of short exercises, you will learn how to deploy R code within ADLA and understand how R interacts with U-SQL. By the end of the tutorial, you should be ready to integrate your R projects with Azure Data Lake Analytics and take advantage of its many capabilities for your big data needs.
**Azure Data Lake Analytics**
Azure Data Lake Analytics is the first cloud on-demand analytics job service designed to make big data analytics easy. It can perform data processing, advanced analytics, and machine learning modeling with high scalability in a cost-effective way. Using U-SQL, R, Python, and .NET, it allows users to run massively parallel data transformation and processing over petabytes of data. A job in ADLA is submitted using a ***U-SQL*** script.
**U-SQL**
U-SQL is the big data query language of the Azure Data Lake Analytics service from Microsoft. It combines a familiar SQL-like declarative language with the extensibility and programmability provided by C# types and the C# expression language. It includes big data processing concepts such as "schema on read" and custom processors and reducers, on top of a scale-out runtime that can handle data of any size. It also provides the ability to query and combine data from a variety of data sources, including Azure Data Lake Storage, Azure Blob Storage, Azure SQL Database, Azure SQL Data Warehouse, and SQL Server instances running in Azure VMs.
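To give a feel for the language, here is a minimal U-SQL sketch that reads a tab-separated file, filters it, and writes the result; the column names and file paths are illustrative placeholders, not files that exist in your store:

```
// Hypothetical input path; replace with a file in your Data Lake Store.
@searchlog =
    EXTRACT UserId int,
            Query  string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Declarative, SQL-like transformation with C# expressions.
@result =
    SELECT UserId, Query
    FROM @searchlog
    WHERE Query != "";

// Hypothetical output path.
OUTPUT @result
    TO "/output/result.csv"
    USING Outputters.Csv();
```

Note the "schema on read" style: the schema is declared in the EXTRACT statement at query time rather than being fixed in the store.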
**USQL with R**
R Extensions for U-SQL enable developers to perform massively parallel execution of R code for end-to-end data science scenarios, covering merging of data files, massively parallel feature engineering, partitioned model building, scoring, and post-deployment. To deploy R code, we need to install the U-SQL extensions (usqlext) in our Azure Data Lake Analytics account and, within the U-SQL script, use the REFERENCE ASSEMBLY statement to enable the R extensions for that script. More sample code for using R can be found in the following folder in your Data Lake Store: <your_account_address>/usqlext/samples/R.
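To make this concrete, the sketch below shows the general shape of a U-SQL script that runs an embedded R snippet per partition via the R extensions' reducer. The paths, column names, and R logic are illustrative assumptions; the `inputFromUSQL`/`outputToUSQL` data frames are the conduits the R extension uses to exchange rowsets with U-SQL:

```
REFERENCE ASSEMBLY [ExtR];  // enables the R extensions installed with usqlext

// The R code is passed to the reducer as a string. Here it computes
// the mean of Value for each partition (hypothetical columns).
DECLARE @myRScript string = @"
outputToUSQL <- data.frame(
    Par = inputFromUSQL$Par[1],
    MeanValue = mean(inputFromUSQL$Value))
";

@input =
    EXTRACT Par   string,
            Value double
    FROM "/samples/input.csv"    // hypothetical path
    USING Extractors.Csv();

// One R process is launched per distinct value of Par,
// giving partitioned, massively parallel execution of the R code.
@result =
    REDUCE @input ON Par
    PRODUCE Par string, MeanValue double
    READONLY Par
    USING new Extension.R.Reducer(command : @myRScript, rReturnType : "dataframe");

OUTPUT @result
    TO "/output/means.csv"       // hypothetical path
    USING Outputters.Csv();
```

The tutorial exercises and the samples under usqlext/samples/R walk through the real, runnable versions of this pattern.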
**Prerequisites for Tutorial**
Before you begin this [tutorial][1], you must have the following items:
- An Azure subscription. See [Get Azure free trial][2].
- Azure CLI 2.0. See [Install and configure Azure CLI][3].
- Azure Data Lake Analytics Account and Data Lake Store in your subscription.
- Working knowledge of [R][4].
In this tutorial we will be using Azure CLI inside a [Jupyter notebook][5] to do various tasks such as managing and submitting jobs. You can download Azure CLI from [here][6]. To install the CLI on Windows and use it in the Windows command-line, download and run the [MSI][7]. To get started with using Azure CLI for this tutorial follow the instructions [here][8].
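As a rough sketch, submitting a U-SQL script as an ADLA job from the Azure CLI looks like the following; the subscription, account, and script names are placeholders you would replace with your own:

```shell
# Log in and select the subscription that holds your ADLA account
# (placeholder names throughout).
az login
az account set --subscription "MySubscription"

# Submit a U-SQL script as a job to an ADLA account.
az dla job submit \
    --account myadlaaccount \
    --job-name "MyRJob" \
    --script @myscript.usql

# List recent jobs to check on status.
az dla job list --account myadlaaccount
```

The tutorial notebook linked above runs these same kinds of commands from Jupyter cells.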
[Additional Resources][9]
**Acknowledgments**
Many thanks to Haibo Lin, Esin Saka, Shravan Matthur Narayanamurthy, Ivan Popivanov, Michael Rys and Hiren Patel for their help and support.
[1]: https://github.com/Azure/ADLAwithR-GettingStarted/tree/master/Tutorial
[2]: https://azure.microsoft.com/en-us/free/
[3]: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest
[4]: https://cran.r-project.org/doc/manuals/R-intro.html
[5]: https://github.com/Azure/ADLAwithR-GettingStarted/blob/master/Azure%20CLI/Tutorial_with_Jupyter_Notebook.ipynb
[6]: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli
[7]: https://aka.ms/InstallAzureCliWindow
[8]: https://github.com/Azure/ADLAwithR-GettingStarted/tree/master/Azure%20CLI
[9]: https://github.com/Azure/ADLAwithR-GettingStarted/blob/master/Azure%20CLI/Additional%20Resources.md