Explore data from various Azure data services on the Data Science Virtual Machine using Apache Drill

December 10, 2016

Learn how to set up Apache Drill on the Data Science Virtual Machine and query data stored in various Azure data services through the familiar SQL interface.
# What is Apache Drill

[Drill](http://drill.apache.org/) is an Apache open-source SQL query engine for Big Data exploration. Drill is designed from the ground up to support high-performance analysis of the semi-structured and rapidly evolving data coming from modern Big Data applications, while still providing the familiarity and ecosystem of ANSI SQL, the industry-standard query language. Drill can query many different data sources. Here we demonstrate how you can query Azure blobs, Azure SQL Database/Data Warehouse, Azure Cosmos DB (a managed NoSQL database on Azure), Azure HDInsight (Hive), and Azure Data Lake Store.

# What is the Data Science Virtual Machine (DSVM)

The [Data Science Virtual Machine](http://aka.ms/dsvm) is a custom virtual machine image on Azure, pre-installed and configured with a host of popular tools commonly used for data science, machine learning, and AI. The DSVM can be the analytics desktop in the cloud for both beginner and advanced data scientists and engineers. Drill is included on the Data Science Virtual Machine. This tutorial covers the Windows edition of the VM; the instructions are very similar if you use the Linux edition.

# Scope of the tutorial

In this tutorial we show how to use Drill to explore data stored in various Azure data services. Drill can access many data formats: flat files (CSV, TSV), full directories, Parquet files, JSON files, relational databases, and NoSQL databases. The data can live on a variety of storage technologies in the cloud or on-premises. This tutorial focuses on data stored in the following five Azure data services:

- Azure Blob storage
- Azure SQL Data Warehouse (also applies to Azure SQL Database or a SQL Server on an Azure VM)
- Azure Cosmos DB (aka DocumentDB)
- Azure HDInsight (Hive)
- Azure Data Lake Store

On the Windows DSVM, the drivers that Apache Drill needs to connect to the different Azure sources are now installed by default. We still include the installation steps in this tutorial, since they help explain the end-to-end process of installing and configuring data sources in Apache Drill. Drill can run in embedded mode (single server) or in distributed mode. In this tutorial we run Drill in embedded mode on the Data Science Virtual Machine. An excellent [blog post series](https://blogs.msdn.microsoft.com/data_otaku/2016/05/27/deploying-apache-drill-on-azure/) covers how to run Drill in a distributed cluster topology and connect to different Azure data services; this tutorial borrows heavily from that series. The step-by-step instructions follow.

# Pre-requisites: Create a DSVM instance

If you don't have an instance of the DSVM, you can create one by visiting this [link](http://aka.ms/dsvm). If you already have a DSVM instance, you can use it for the rest of the steps. Log in to the DSVM as an administrator.

# Step 1: Install the necessary drivers

**NOTE: This step is no longer needed on the latest DSVM since the Azure drivers come preinstalled. The instructions below are useful if you want to update the drivers to the latest version or understand where the drivers live in the Drill installation.**

Drill needs Java drivers (JAR files) for the various Azure services: Azure Storage blob (aka Windows Azure Storage Blob (WASB)), Azure SQL Database / Data Warehouse, and Azure HDInsight (Hadoop). To install the drivers, run the following commands from a Windows command prompt as an administrator.
<pre>
REM Install the Azure Blob and Azure Data Lake drivers
REM (use the version of Drill installed on your DSVM in the path below)
cd \dsvm\tools\apache-drill-VERSION\jars\3rdparty
wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.3/hadoop-azure-2.7.3.jar
wget http://central.maven.org/maven2/com/microsoft/azure/azure-storage/4.4.0/azure-storage-4.4.0.jar
wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-azure-datalake/3.0.0-alpha3/hadoop-azure-datalake-3.0.0-alpha3.jar
wget http://central.maven.org/maven2/com/microsoft/azure/azure-data-lake-store-sdk/2.1.5/azure-data-lake-store-sdk-2.1.5.jar

REM Now install the SQL Server JDBC driver
curl --output sqljdbc.exe https://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/enu/sqljdbc_6.0.7728.100_enu.exe
sqljdbc /auto
copy "C:\Program Files\Microsoft JDBC Driver 6.0 for SQL Server\sqljdbc_6.0\enu\sqljdbc42.jar" .
del sqljdbc.exe
del /Q "Microsoft JDBC Driver 6.0 for SQL Server"

REM Now install the components needed for Azure HDInsight (HDI) Hive
REM NOTE: YOU NEED A RUNNING HDINSIGHT CLUSTER TO FETCH THE JAR FILE
scp YOURHDIUSER@YOURHDICLUSTER.azurehdinsight.net:/usr/hdp/current/hive-client/lib/regexp-*.jar .
</pre>

# Step 2: Configure the data sources in Drill

Here you register the data sources (also known as storage plugins) and their access methods, including credentials, with Apache Drill. Each subsection below describes the steps to register a data source stored in a specific type of Azure storage service.

First you need to start Drill in embedded mode. Change your directory to **C:\dsvm\tools\apache-drill-VERSION\bin** and run the following command:

    sqlline.bat -u "jdbc:drill:zk=local"
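Once the shell starts, a quick way to verify that Drill itself is working (before wiring up any Azure sources) is to query the `employee.json` sample file that ships with the Drill distribution through the built-in `cp` (classpath) storage plugin:

<pre>
-- Smoke test against the sample data bundled with Drill
SELECT * FROM cp.`employee.json` LIMIT 3;
</pre>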
### Azure Storage Blob

Azure Storage provides a fast, low-cost way to store objects (called blobs). You can upload flat files such as CSV and TSV files to an Azure blob. Blobs can be public or private: public blobs don't need any credentials, whereas a private blob is only accessible if you have the storage account key. The steps to register a data source stored in an Azure blob are as follows:

1. If you are using a private blob, you need to enter the credentials in an Apache Drill configuration file located at **c:\dsvm\tools\apache-drill-{VERSION}\conf\core-site.xml**. Edit this file in any text editor. Each blob data source that needs credentials must be entered in this XML file. Replace *YourAzureStorageAccountname* and *YourStorageAccountKey*; you can find these in the Azure portal when you view the properties of your Azure storage account.

<pre>
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
  <property>
    <name>fs.azure.account.key.YourAzureStorageAccountname.blob.core.windows.net</name>
    <value>YourStorageAccountKey</value>
  </property>
</configuration>
</pre>

You can skip the above for public blobs since you don't need credentials to access them.

2. Connect to the Drill console by navigating to **http://localhost:8047** in a browser on the DSVM. Click the **Storage** tab; at the bottom of the screen you will see a **New Storage Plugin** section where you can register a new data source. Enter a friendly name for the blob data source, since this name will be used in your SQL statements to reference the data in the blob. Click the **Create** button.

3. A large text box will open where you need to paste the following JSON configuration. Replace *YourContainer* and *YourAzureStorageAccountname* with the values for your data source in Azure Blob storage.

<pre>
{
  "type": "file",
  "enabled": true,
  "connection": "wasb://YourContainer@YourAzureStorageAccountname.blob.core.windows.net/",
  "config": null,
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null },
    "tmp": { "location": "/tmp", "writable": true, "defaultInputFormat": null }
  },
  "formats": {
    "psv": { "type": "text", "extensions": [ "tbl" ], "delimiter": "|" },
    "csv": { "type": "text", "extensions": [ "csv" ], "extractHeader": true, "delimiter": "," },
    "tsv": { "type": "text", "extensions": [ "tsv" ], "delimiter": "\t" },
    "parquet": { "type": "parquet" },
    "json": { "type": "json", "extensions": [ "json" ] },
    "avro": { "type": "avro" },
    "sequencefile": { "type": "sequencefile", "extensions": [ "seq" ] },
    "csvh": { "type": "text", "extensions": [ "csvh" ], "extractHeader": true, "delimiter": "," }
  }
}
</pre>

4. Click the **Create** button to finish registering the Azure storage blob. You can repeat these steps for all the blob containers you will use in your analysis.

### Azure SQL Data Warehouse

To register Azure SQL Data Warehouse (this also applies to Azure SQL Database and to a SQL Server on-premises or on an Azure VM), create a new Apache Drill data source and enter the following configuration in the Drill console (see step 2 in the Azure Storage Blob section above for how to reach this option). Replace *YOURAZURESQLSERVERNAME*, *YOURDATABASENAME*, *YOURSQLUSER*, and *YOURSQLPASSWORD* with the correct values for your Azure SQL Data Warehouse.

<pre>
{
  "type": "jdbc",
  "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "url": "jdbc:sqlserver://YOURAZURESQLSERVERNAME.database.windows.net:1433;databaseName=YOURDATABASENAME",
  "username": "YOURSQLUSER",
  "password": "YOURSQLPASSWORD",
  "enabled": true
}
</pre>

### Azure Cosmos DB / DocumentDB

Azure Cosmos DB is a NoSQL database as a service that lets you store and query JSON documents. Cosmos DB supports the MongoDB protocol (see details [here](https://docs.microsoft.com/azure/documentdb/documentdb-protocol-mongodb)). Currently you can access Cosmos DB from Apache Drill only through the MongoDB protocol. When you create the Cosmos DB account, you specify whether the instance should support MongoDB; you must enable MongoDB support. You can then get the MongoDB connection string from the Azure portal when you view the Cosmos DB account properties; details are [here](https://docs.microsoft.com/azure/documentdb/documentdb-connect-mongodb-account#a-idquickstartconnectionaget-the-mongodb-connection-string-using-the-quick-start) in the Cosmos DB documentation.

Now create a storage plugin (data source) in Drill using the Drill console, following the same procedure as for Azure Storage Blob or Azure SQL Data Warehouse. The JSON configuration is as follows; paste in your Cosmos DB (MongoDB protocol) connection string. Notice that the *type* of the plugin is *mongo*. (Note: the procedure is the same if you are querying MongoDB from Drill.)

<pre>
{
  "type": "mongo",
  "connection": "PASTE COSMOSDB MONGODB PROTOCOL CONNECT STRING",
  "enabled": true
}
</pre>

In the Drill web console or Drill shell, run the following command once to avoid getting an "UNSUPPORTED_OPERATION_ERROR":

    ALTER SYSTEM SET `store.mongo.read_numbers_as_double` = true
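After saving the mongo plugin, a quick way to confirm that Drill can reach the Cosmos DB account is to list the schemas and collections it exposes. A minimal sketch, where the plugin name *docdbmongo* and database name *dvolcano* match the query examples later in this post and are placeholders for your own names:

<pre>
-- Databases in the Cosmos DB account appear as pluginname.databasename
SHOW SCHEMAS;

-- Switch to one database and list its collections, which Drill exposes as tables
USE docdbmongo.dvolcano;
SHOW TABLES;
</pre>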
### Azure HDInsight (Hive)

Azure HDInsight is a managed Hadoop service based on the Hortonworks Hadoop distribution. It supports all the major Hadoop components such as Hive, HBase, Spark, and Storm. Here we demonstrate how to query Hive tables in HDInsight.

**NOTE: The DSVM where Drill runs must be in the same Virtual Network (VNET) as the HDInsight cluster, or you need to set up a VNET-to-VNET VPN tunnel as documented [here](https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-vnet-vnet-rm-ps).**

The HDInsight cluster is associated with a primary storage account (usually a blob). Note: HDInsight also supports Azure Data Lake Store (ADLS) as its storage (we have not yet explored connecting to that configuration from Drill); these instructions are only for HDInsight clusters that use a blob as the primary storage. From the portal you can find the blob storage account and container used by HDInsight. You need to set up the credentials to access this blob in core-site.xml as described in the Azure Storage Blob section above.

You also need a few parameters for your HDInsight cluster from the Ambari admin console before you can set up the storage plugin in Drill. You can get them with the following steps:

1. Open a browser to your HDInsight cluster at https://myclustername.azurehdinsight.net, substituting your own cluster name for *myclustername*.
2. At the prompt, log in using the HTTP user name and password established when you configured the cluster.
3. Within the Ambari interface, go to Services | Hive | Configs | Advanced.
4. Under the General subheading, locate and record the values of hive.metastore.uris and hive.metastore.warehouse.dir.
5. Navigate to Services | HDFS | Configs | Advanced.
6. Under the Advanced core-site subheading, locate and record the value of fs.defaultFS.

Next, enter these values in the Drill configuration for the storage plugin. The JSON configuration is as follows; replace the values with those from your HDInsight cluster.

<pre>
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "YOUR VALUE FROM hive.metastore.uris",
    "hive.metastore.warehouse.dir": "YOUR VALUE FROM hive.metastore.warehouse.dir (usually /hive/warehouse)",
    "fs.default.name": "wasb://YOURHDISTORAGECONTAINER@YOURHDISTORAGEACCOUNT.blob.core.windows.net",
    "hive.metastore.sasl.enabled": "false"
  }
}
</pre>
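Once the plugin is saved, a quick sanity check is to list the tables in the default Hive database. The plugin name *myhive* below matches the query examples later in this post; use whatever friendly name you chose:

<pre>
-- Switch to the default Hive database exposed by the plugin and list its tables
USE myhive.`default`;
SHOW TABLES;
</pre>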
### Azure Data Lake Store (ADLS)

Azure Data Lake Store is an enterprise-wide, hyper-scale repository for big data analytic workloads. Azure Data Lake lets you capture data of any size, type, and ingestion speed in one place for operational and exploratory analytics. The steps to register Azure Data Lake Store as a data source in Drill are as follows:

1. Set up service-to-service authentication on ADLS to allow Drill to access it. Details can be found [here](https://docs.microsoft.com/azure/data-lake-store/data-lake-store-authenticate-using-active-directory); grant the service account access to the objects in ADLS that you want to query. The output of this step is an **application ID**, an **authentication key**, and a **tenant ID**. These parameters are entered into the Drill configuration file in the next step.

2. Edit or create core-site.xml in the **conf** directory under the Drill home. Add the following lines within the <configuration> element, replacing the values for {APPLICATION ID}, {AUTHENTICATION KEY}, and {TENANT ID}.

<pre>
<property>
  <name>dfs.adls.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/{TENANT ID}/oauth2/token</value>
</property>
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>{APPLICATION ID}</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>{AUTHENTICATION KEY}</value>
</property>
<property>
  <name>fs.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.adl.impl</name>
  <value>org.apache.hadoop.fs.adl.Adl</value>
</property>
</pre>

3. Connect to the Drill console by navigating to **http://localhost:8047** in a browser on the DSVM. Click the **Storage** tab; at the bottom of the screen you will see a **New Storage Plugin** section where you can register a new data source. Enter a friendly name for the ADLS data source, since this name will be used in your SQL statements to reference the data in the Data Lake. Click the **Create** button.

4. A large text box will open where you need to paste the following JSON configuration. Replace **YOURDATALAKEACCOUNT** with the name of your Azure Data Lake Store account; you can find it in the Azure portal when you browse your resources.

<pre>
{
  "type": "file",
  "enabled": true,
  "connection": "adl://YOURDATALAKEACCOUNT.azuredatalakestore.net/",
  "config": null,
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null },
    "tmp": { "location": "/tmp", "writable": true, "defaultInputFormat": null }
  },
  "formats": {
    "psv": { "type": "text", "extensions": [ "tbl" ], "delimiter": "|" },
    "csv": { "type": "text", "extensions": [ "csv" ], "extractHeader": true, "delimiter": "," },
    "tsv": { "type": "text", "extensions": [ "tsv" ], "delimiter": "\t" },
    "parquet": { "type": "parquet" },
    "json": { "type": "json", "extensions": [ "json" ] },
    "avro": { "type": "avro" },
    "sequencefile": { "type": "sequencefile", "extensions": [ "seq" ] },
    "csvh": { "type": "text", "extensions": [ "csvh" ], "extractHeader": true, "delimiter": "," }
  }
}
</pre>

5. Click **Create** to register the Azure Data Lake Store data source for use in Drill queries.

# Step 3: Running queries in Drill

Drill supports ANSI-compliant SQL statements against any of the registered data sources and formats, whether they are tabular data, text files, or JSON files. You can join data from different data sources in a single SQL query. Here we look at some simple queries that reference the Azure data services configured in the previous section. You can run the queries in the Drill console by navigating to the **Query** tab. You can also access Drill from clients like Excel, Power BI, and Tableau, since Drill supports ODBC and JDBC interfaces; that can be a separate post at a later time.

**Azure Storage Blob or Azure Data Lake**: If we configured a Drill storage plugin called *myblob* and the Azure storage container / Azure Data Lake Store has a CSV file called *myfile.csv*, you can fetch the first 10 rows with the following query:

    select * from myblob.`myfile.csv` limit 10
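Because the *csv* format in the storage plugin configuration above sets *extractHeader* to true, the header row of the file becomes the column names, so you can filter and aggregate on them directly. A small hedged example (the *country* column is hypothetical; use a column that exists in your file):

<pre>
-- Count rows per value of a (hypothetical) "country" column in the CSV
SELECT country, COUNT(*) AS cnt
FROM myblob.`myfile.csv`
GROUP BY country
ORDER BY cnt DESC
LIMIT 10
</pre>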
**Azure SQL Data Warehouse**: Your query will reference tables with the structure *storagepluginname.databasename.schema.table*, so it will look like:

    select * from mysqldw.adventureworks.marketing.campaigns LIMIT 10

**Azure Cosmos DB**: You can run queries on nested JSON structures and reference them as follows in the SQL statement:

    select Country, Region, Status,
           cvolcano.Location.coordinates[0] as lat,
           cvolcano.Location.coordinates[1] as longi
    from docdbmongo.dvolcano.cvolcano

Here we are querying a *volcano* dataset that was loaded into Cosmos DB as JSON. *docdbmongo* is the storage plugin name we configured in Drill, *dvolcano* is the database name in Cosmos DB, and *cvolcano* is the collection where we loaded the volcano dataset (using a tool like the DocumentDB Data Migration Tool available on the DSVM; just type **dtui** at a command prompt).

**Azure HDInsight (Hive)**: Once the setup is done, running queries on Hive from Drill is easy. Queries look as follows (most HDInsight clusters come with the *hivesampletable* that you can use to quickly test the setup):

    SELECT * FROM myhive.`hivesampletable` LIMIT 10;

So there you have it: easy access to various Azure data services from Drill, where you can explore data right at the source and combine sources with SQL joins. This is very useful for getting a quick understanding of your data even before you build an ETL pipeline to bring the data stored in the various sources together.

**Next Steps**: Learn more about how to query data with Apache Drill in the Drill [documentation](https://drill.apache.org/docs/). You can scale your queries to run faster by using Drill in distributed mode on a cluster, following the instructions in this [series of blog posts](https://blogs.msdn.microsoft.com/data_otaku/2016/05/27/deploying-apache-drill-on-azure/) (thanks, Bryan Smith!). Many of the instructions above are adapted from that blog series.
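As a closing illustration of the cross-source joins mentioned above, here is a hedged sketch that joins the CSV in the blob plugin with a table in the SQL Data Warehouse plugin. All table and column names are hypothetical (they reuse the example plugin names from this post), and the CAST is needed because Drill reads CSV columns as text:

<pre>
-- Hypothetical join: campaign reference data in SQL DW, event rows in a CSV in blob storage
SELECT c.campaignname, COUNT(*) AS events
FROM myblob.`myfile.csv` e
JOIN mysqldw.adventureworks.marketing.campaigns c
  ON c.campaignid = CAST(e.campaignid AS INTEGER)
GROUP BY c.campaignname
</pre>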