In this tutorial, we demonstrate some optimization tips and tricks on SQL Server to help you get the most out of SQL Server with R Services. Most importantly, those optimization techniques not only work for R Services, but any analytics services integrated with SQL server.
This tutorial will cover several useful optimization tips and tricks that you can use on SQL Server. In this tutorial, a resume-matching data science scenario (find out all good-match candidates for a given job opening) will be used to mimic the workflow of large volume prediction and to showcase how those optimization techniques boosts the data analytics. We formulate the resume-matching use case as a **binary classification** problem. Those good-matches are labeled with 1 while bad-matches are labeled with 0. However, it could be easily extended to be a multi-class classification problem if you want to interpret good-match into multiple levels.
The solution is demonstrated using a few synthetic datasets that used to proxy the data in the resume-matching use case, with the following files:
* Processed resume feature dataset (numerical data)
* Processed project (job) feature dataset (numerical data)
* Training dataset of resume-project pairs with labels
There is no legitimate need to use as many features (or how we engineered those features, details please refer to the steps walk-through file) as we are showing in this tutorial. It is done merely to reflect the number of features we used.
For this tutorial, we will use a DS series SQL Server 2016 on Windows Server 2016 on Azure. While creating the VM ensure that the SQL connectivity is set to public, and the R Services is enabled as well. This tutorial provides deployable components, and we will walk you through all optimization steps to setup the database and showcase how resume-matching problem is handled on SQL Server with R Services.
All the documentation, datasets, and SQL scripts can be found in a GitHub repository which containing the following files:
* **README.md** - The step-by-step walk through to deploy everything.
* **Data** - The synthetic datasets used in this tutorial.
* **SQLR** - SQL scripts and stored procedures for creating the database, loading the data from CSV files, optimizations on SQL Server, training a prediction model, and scoring. An extra PowerShell script is also provided to showcase how to initiate the scoring tasks in parallel.
Click **Get Started** on the right to begin.