Sample 1: Download dataset from UCI: Adult 2 class dataset
This sample demonstrates how to download a dataset from an HTTP location, add column names to it, examine the data, and compute some basic statistics.
##Download Dataset##
This experiment demonstrates how to use the **Reader** module to read data into Azure ML over HTTP, and then add a header to the data by using the **Enter Data** and **Execute R Script** modules.
##Data##
The dataset we want to use in our experiment contains income and demographic information extracted from public census data. We obtained the dataset from the [UCI repository](http://archive.ics.uci.edu/ml/machine-learning-databases/adult) by using the **Reader** module to specify the location of the source data. From the data dictionary, we know that the data is in CSV format, without a header row, so we will specify those options in the **Reader** module and use the following modules to improve the data:
- Using the **Enter Data** module, we will manually create a header row.
- Using the **Execute R Script** module, we will insert the header row into the dataset.
Finally, we will output some basic statistics for the dataset using the **Descriptive Statistics** module.
##Creating the Experiment##
First we need to configure the **Reader** module:
1. For **Data source** we select _Web URL via HTTP_.
2. In the **URL** text box, we provide the URL for the source data, including the name and file extension of the CSV data file.
3. For the **Data format** option, we select _CSV_.
4. The data file does not have a header row, so we leave the **CSV or TSV has header row** option unchecked.
![][image1] ![][image2]
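Outside of Studio, the same read step can be approximated in a few lines of Python. This is only a minimal sketch of what the **Reader** settings above amount to, not the module's implementation. The helper names `fetch_text` and `parse_headerless_csv` are hypothetical, and the sample rows are made-up stand-ins in the same comma-and-space layout rather than real records.

```python
import csv
import io
import urllib.request


def fetch_text(url, timeout=30):
    """Download a text file over HTTP and return its contents as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_headerless_csv(text):
    """Parse CSV text that has no header row into a list of rows."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # csv.reader yields an empty list for blank lines, so filter those out.
    return [row for row in reader if row]


# Illustrative stand-in data (assumption): three columns, no header row.
sample_text = "39, State-gov, 77516\n50, Self-emp, 83311\n\n"
rows = parse_headerless_csv(sample_text)
print(rows)
```

To try it against the real file, pass the full URL of the CSV data file to `fetch_text` and feed the result to `parse_headerless_csv`.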
One way to change column names would be to use the **Metadata Editor** module and provide a comma-separated list of names for the columns. However, a long list of column names is hard to read in the text box, so we will show you an alternative way to rename the columns.
1. First, use the **Enter Data** module to type a list of column names to be used as the header row. The illustration above shows the column names we typed in. (You can get a full list of the columns in the census data from the UCI repository.)
2. Next, use the **Execute R Script** module to insert the header row into the dataset. The following diagram shows the example code. (To get a copy of this sample R code, you can create a copy of this experiment, and then edit the **Execute R Script** module.)
![][image3]
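The idea behind this step is simply to take a one-row table of names and use it to label the columns of the headerless data. The sketch below illustrates that idea in Python. It is not a transcription of the R code in the experiment. The function names `read_header_names` and `apply_header` are hypothetical, and the three column names and two data rows are shortened, illustrative stand-ins.

```python
import csv
import io


def read_header_names(header_text):
    """Return the column names from a single comma-separated line."""
    reader = csv.reader(io.StringIO(header_text), skipinitialspace=True)
    return next(reader)


def apply_header(rows, names):
    """Label each row's values with the given column names."""
    for index, row in enumerate(rows):
        if len(row) != len(names):
            raise ValueError(
                f"row {index} has {len(row)} values, expected {len(names)}"
            )
    return [dict(zip(names, row)) for row in rows]


# Illustrative inputs (assumption): a typed header line and headerless rows.
names = read_header_names("age, workclass, fnlwgt")
data_rows = [["39", "State-gov", "77516"], ["50", "Private", "83311"]]
labeled = apply_header(data_rows, names)
print(labeled[0])
```

Checking the row width before labeling is a design choice in this sketch: a header with the wrong number of names fails loudly instead of silently dropping values.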
##Data Visualization##
You can review the format of the original data on the UCI Machine Learning Repository, at [http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data](http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data). Note that the original data has no column names.
However, when you import the data into Azure ML Studio using the **Reader** module, default column names are assigned to all the columns. To view the auto-generated column names, right-click the output port of the **Reader** module and select **Visualize**.
![][image4]
![][image5]
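Placeholder naming of this kind is easy to reproduce when working with headerless data elsewhere. The following is a minimal sketch only. The `Col1`, `Col2`, ... pattern used here is an assumption for illustration (the screenshot above shows the names Studio actually generates), and `default_column_names` is a hypothetical helper.

```python
def default_column_names(column_count, prefix="Col"):
    """Generate placeholder names, numbered from 1, for unnamed columns."""
    if column_count < 0:
        raise ValueError("column_count must not be negative")
    return [f"{prefix}{number}" for number in range(1, column_count + 1)]


# Illustrative row (assumption): three values with no column names.
row = ["39", "State-gov", "77516"]
placeholder_names = default_column_names(len(row))
print(dict(zip(placeholder_names, row)))
```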
After you use the **Execute R Script** module to insert the header row created with the **Enter Data** module, the modified dataset appears as shown in the diagram below. This small change makes the data much easier to read and work with.
![][image6]
Finally, we use the **Descriptive Statistics** module to compute some basic statistics on the dataset, and use the **Visualize** option from the output port to view the results.
![][image7]
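To give a feel for what such a summary contains, here is a minimal Python sketch that computes a few per-column statistics. It is not the **Descriptive Statistics** module's algorithm, and the set of statistics is chosen for illustration. The helper name `describe_column`, the rule that an empty string counts as missing, and the sample values are all assumptions.

```python
import statistics


def describe_column(values):
    """Summarize one column supplied as a list of strings."""
    # Assumption: an empty string marks a missing value.
    present = [value for value in values if value != ""]
    summary = {
        "count": len(present),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    try:
        numbers = [float(value) for value in present]
    except ValueError:
        # At least one value is not numeric, so report counts only.
        return summary
    if numbers:
        summary["min"] = min(numbers)
        summary["max"] = max(numbers)
        summary["mean"] = statistics.mean(numbers)
        summary["median"] = statistics.median(numbers)
    if len(numbers) > 1:
        # The sample standard deviation needs at least two values.
        summary["stdev"] = statistics.stdev(numbers)
    return summary


# Illustrative columns (assumption): one numeric, one categorical.
age_summary = describe_column(["39", "50", "38", ""])
workclass_summary = describe_column(["Private", "Private", "State-gov"])
print(age_summary)
print(workclass_summary)
```

Numeric columns get the full set of statistics, while text columns report only the count, the number of missing values, and the number of distinct values.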
<!-- Images -->
[image1]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/reader_parameters.PNG
[image2]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/enter_data.PNG
[image3]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/r_code.PNG
[image4]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/reader_visualize.PNG
[image5]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/reader_output.PNG
[image6]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/header_data.PNG
[image7]:http://az712634.vo.msecnd.net/samplesimg/v1/S1/desc_stats.PNG