Analyzing Big Data with Microsoft R Server

April 5, 2017

Data science and Machine Learning
# About the Course

The open-source programming language R has long been popular, particularly in academia, for data processing and statistical analysis. Among R's strengths as a programming language are its succinctness and its extensive repository of third-party libraries for performing all kinds of analyses. Together, these two features make it possible for a data scientist to go very quickly from raw data to summaries, charts, and even full-blown reports. However, one deficiency of R is that it is memory-bound. In other words, R needs to load the data in its entirety into memory (like any other object). This is one of the reasons R has been more reluctantly received in industry, where data sizes are usually considerably larger than in academia.

The main component of Microsoft R Server (MRS) is the RevoScaleR package. RevoScaleR is an R library that offers a set of functions for processing large datasets without having to load all the data into memory at once. In addition, RevoScaleR offers a rich set of distributed statistical and machine learning algorithms, a set that continues to grow over time. Finally, RevoScaleR also offers a mechanism by which we can take code that we developed locally (such as on a laptop) and deploy it remotely (such as on SQL Server or a Spark cluster, where the underlying infrastructure is very different) with minimal effort.

In this course, we will show you how to use MRS to run an analysis on a large dataset and provide some examples of how to deploy it on a Spark cluster or in-database inside SQL Server. Upon completion, you will know how to use R to solve big-data problems.
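To make the memory-bound point concrete, here is a minimal sketch of chunk-wise processing written in base R. It is not RevoScaleR's implementation and calls no RevoScaleR functions; it only illustrates the general idea of summarizing a file without loading it all at once. The function name `chunked_mean`, the `fare_amount` column, and the synthetic data are hypothetical, chosen purely for illustration.

```r
# Illustrative only: base R, no RevoScaleR. All names here are hypothetical.

# Compute the mean of one numeric column of a CSV file by reading it
# `chunk_size` rows at a time, so only one chunk is in memory at once.
chunked_mean <- function(path, column, chunk_size = 100) {
  con <- file(path, open = "r")
  on.exit(close(con))

  header_line <- readLines(con, n = 1)  # keep the header to parse each chunk
  total <- 0
  n <- 0

  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break       # end of file

    # Parse just this chunk, re-attaching the header line.
    chunk <- read.csv(text = c(header_line, lines))

    # Update the running aggregates; the chunk itself can then be discarded.
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }

  total / n
}

# Synthetic stand-in for a large file (hypothetical fare amounts).
set.seed(1)
fares <- round(runif(1000, min = 2, max = 60), 2)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(fare_amount = fares), path, row.names = FALSE)

chunked_result <- chunked_mean(path, "fare_amount", chunk_size = 100)
in_memory_result <- mean(fares)
```

Comparing `chunked_result` with `in_memory_result` shows that the two approaches agree up to floating-point rounding. Any statistic that can be updated from per-chunk summaries (sums, counts, minima, maxima) fits this pattern, whereas statistics such as an exact median need a different approach.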
Additionally, throughout this course students will learn to think like a data scientist by learning about the steps involved in the data science cycle (https://docs.microsoft.com/en-us/azure/machine-learning/data-science-process-overview): getting raw data, examining it and preparing it for analysis and modeling, running various analyses and examining the results, and finally deploying a solution.

# Prerequisites

There are a few things you will need in order to properly follow the course materials:

* A solid understanding of R data structures (vectors, matrices, lists, data frames, environments) is required. For example, students should be able to confidently tell the difference between a list and a data.frame, what each object is suited for, and how to subset it.
* A basic understanding of programming concepts such as control flows, loops, functions and scope is required.
* A good understanding of data manipulation and data processing in R (e.g. functions such as merge, transform, subset, cbind, rbind, or lapply).
* Familiarity with third-party packages such as dplyr and ggplot2 is very helpful, as we use them in the course but don't cover them in great depth.
* Familiarity with how to write and debug R functions is very helpful.
* Although not required, a basic understanding of modeling and statistics can make some of the course easier to follow.
* Courses:
  * DAT204x: Introduction to R for Data Science: https://www.edx.org/course/introduction-r-data-science-microsoft-dat204x-2
  * DAT209x: Programming in R for Data Science: https://www.edx.org/course/programming-r-data-science-microsoft-dat209x-1

# Agenda

* Getting started: We give an overview of RevoScaleR and show you how to access it by downloading and installing the Microsoft R Client. We then get the NYC Taxi data used during the course. Finally, we install the R packages we will be using throughout the course.
* Reading the data: We talk about two different ways that RevoScaleR can handle the data and the trade-offs involved.
* Preparing the data: We examine the data and ask how we can clean it and then make it richer and more useful to the analysis. In the process, we learn how to use RevoScaleR to perform data transformations and how third-party packages can be leveraged.
* Examining the data: We now examine the data visually and through various summaries to see what does and does not mesh with our understanding of it. We look at sampling as a way to examine outliers.
* Visualizing the data: We examine ways of visualizing our results and getting a feel for the data. In the process, we learn how RevoScaleR interacts with other visualization tools.
* Clustering example: We look at k-means clustering, our first RevoScaleR analytics function, and at how we can improve its performance when the data is large.
* Modeling example: We build a few predictive models and show how we can examine the predictions and compare the models. We see how our choice of model can have performance implications.
* Deploying and scaling: We talk about RevoScaleR's write-once-deploy-anywhere philosophy and what we mean by a compute context. We then put this into practice by deploying our code to SQL Server and Spark, and talk about the architectural differences.

# Technologies Covered

* R Language
* Microsoft R Server

# Materials

* The course outline and course content can be found here:
  * https://smott.gitbooks.io/introduction-to-microsoft-r-server/content/