SQL Server Optimization Tips and Tricks for Analytics Services

April 11, 2017

In this tutorial, we demonstrate some optimization tips and tricks on SQL Server to help you get the most out of SQL Server with R Services. Most importantly, these optimization techniques work not only for R Services but for any analytics service integrated with SQL Server.
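The integration works by calling R code from T-SQL through the `sp_execute_external_script` system procedure, which is what the scoring stored procedures in this tutorial are built on. The following is a minimal sketch of that pattern; the table, column, and model names (`dbo.ResumeProjectPairs`, `dbo.Models`, `PairId`, `Feature1`, `Feature2`) are hypothetical placeholders, not the actual schema from this tutorial:

```sql
-- Assumes a serialized R model was previously stored in a hypothetical dbo.Models table.
DECLARE @model VARBINARY(MAX) = (SELECT TOP 1 Model FROM dbo.Models);

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        # InputDataSet is populated from @input_data_1;
        # model_raw arrives via the @params declaration below.
        model <- unserialize(model_raw)
        scores <- predict(model, InputDataSet, type = "response")
        OutputDataSet <- data.frame(PairId = InputDataSet$PairId, Score = scores)
    ',
    @input_data_1 = N'SELECT PairId, Feature1, Feature2 FROM dbo.ResumeProjectPairs',
    @params = N'@model_raw VARBINARY(MAX)',
    @model_raw = @model
WITH RESULT SETS ((PairId INT, Score FLOAT));
```

Because the input query runs inside SQL Server, the optimizations covered later (indexing, compression, table partitioning, and so on) directly affect how fast data is fed to the R runtime.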
This tutorial covers several useful optimization tips and tricks that you can apply on SQL Server. A resume-matching data science scenario (finding all good-match candidates for a given job opening) is used to mimic a large-volume prediction workflow and to showcase how these optimization techniques speed up data analytics.

We formulate the resume-matching use case as a **binary classification** problem: good matches are labeled 1, and bad matches are labeled 0. It could easily be extended to a multi-class classification problem if you want to grade match quality on multiple levels.

The solution is demonstrated using a few synthetic datasets that stand in for the real resume-matching data:

* Processed resume feature dataset (numerical data)
* Processed project (job) feature dataset (numerical data)
* Training dataset of resume-project pairs with labels

You do not need to use as many features as we show in this tutorial (for details on how the features were engineered, please refer to the step-by-step walkthrough file); the feature set merely reflects the number of features in our original use case.

For this tutorial, we will use a DS-series Azure VM running SQL Server 2016 on Windows Server 2016. While creating the VM, ensure that SQL connectivity is set to Public and that R Services is enabled.

This tutorial provides deployable components, and we will walk you through all the optimization steps to set up the database and show how the resume-matching problem is handled on SQL Server with R Services. All the documentation, datasets, and SQL scripts can be found in a GitHub repository containing the following:

* **README.md** - The step-by-step walkthrough to deploy everything.
* **Data** - The synthetic datasets used in this tutorial.
* **SQLR** - SQL scripts and stored procedures for creating the database, loading the data from CSV files, optimizing SQL Server, training a prediction model, and scoring. An extra PowerShell script is also provided to showcase how to initiate the scoring tasks in parallel.

Click **Get Started** on the right to begin.
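The parallel-scoring idea from the repository's PowerShell script can be sketched as a simple fan-out: launch one background job per data batch, with each job invoking a scoring stored procedure. This is an illustrative sketch, not the repository's actual script; the database name (`ResumeMatching`), procedure name (`dbo.ScoreResumeBatch`), and batch count are hypothetical, and `Invoke-Sqlcmd` requires the SqlServer PowerShell module:

```powershell
# Fan out scoring across 8 hypothetical batches, one background job each.
$jobs = foreach ($batchId in 1..8) {
    Start-Job -ArgumentList $batchId -ScriptBlock {
        param($id)
        # Each job runs its own scoring call against the server.
        Invoke-Sqlcmd -ServerInstance "localhost" `
                      -Database "ResumeMatching" `
                      -Query "EXEC dbo.ScoreResumeBatch @BatchId = $id"
    }
}

# Block until all batches finish, then collect their output.
$jobs | Wait-Job | Receive-Job
```

Partitioning the scoring workload this way lets SQL Server run several `sp_execute_external_script` sessions concurrently, which is one of the main levers for large-volume prediction throughput.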