Sample 9: Split, partition and sample system

By for February 19, 2015

# Split, Partition and Sample System

This sample demonstrates how to sample and split a data set using the **Partition and Sample** and **Split** modules.

## Data

In this experiment, we use the Adult data set from the [UCI repository](https://archive.ics.uci.edu/ml/datasets/Adult). This data set has already been cleaned and uploaded to Azure ML, so we can select it directly from "Saved Datasets" in the left panel. It contains 32,561 samples with 14 features and 1 label (the last column).

![][image1]

## Partition and Sample

The **Partition and Sample** module is a versatile module for partitioning and sampling data sets. It supports the following operations:

1. Data sampling
2. Data partition (i.e., assign to folds), including partition with replacement, randomized splitting, stratified splitting, etc.
3. Pick fold, which is commonly used in cross-validation
4. Check the top k rows of the data set

In this experiment, we demonstrate operations 1-3.

The first **Partition and Sample** module (from top to bottom) samples 10% of the original data. In the **Properties** of this module, we select *Sampling* in the drop-down list of *Partition or sample mode* and set *Rate of sampling* to 0.1. The figure below shows the detailed settings of this module.

![][image2]

To generate k folds of a data set for cross-validation, we typically use a pair of **Partition and Sample** modules: one assigns the rows to folds and the other picks a fold. The procedure is described below.

The second **Partition and Sample** module (from top to bottom) assigns the sampled data set to 3 folds. In the **Properties** of the second module, we select *Assign to Folds* in the drop-down list of *Partition or sample mode*. In cross-validation, we usually partition the whole data set into several folds of equal size, so in this example we select *Partition evenly* in the drop-down list of *Specify the partitioner method*.
In this partition, we set the number of folds to 3 and enable stratified splitting by selecting *True* in the drop-down list of *Stratified split*. To perform a stratified split, we need to specify the column(s) to stratify on. In this example, we select the label column, whose name is *income*. The figure below shows how we split the data into 3 folds.

![][image3]

To pick each fold of the partition, we need a separate **Partition and Sample** module that receives the output of the **Partition and Sample** module which assigns the data to folds. In this experiment, we created 3 more **Partition and Sample** modules (in the 4th row). In all of them, we select *Pick Fold* in the drop-down list of *Partition or sample mode* and set the fold index (1, 2, and 3 in our case) in *Specify which fold to be sampled from*. The output of each of these 3 **Partition and Sample** modules then corresponds to one fold of the partition. The figure below shows the settings of the **Partition and Sample** module that picks the second fold.

![][image4]

## Split

The **Split** module divides its input dataset into 2 parts, which correspond to its 2 output ports. It supports the following 4 splitting modes:

1. *Split Rows*, which also supports stratified splitting
2. *Recommender Split*
3. *Regular Expression*
4. *Relative Expression*

The *Recommender Split* mode applies only to training or scoring a recommendation model. The *Regular Expression* and *Relative Expression* modes filter the data with a regular expression or a relational expression, respectively: the first output contains the rows that satisfy the condition, and the second output contains the rows that do not.

In this experiment, we demonstrate the use of **Split** in modes 1, 3, and 4.
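As an aside before the individual **Split** settings: the sample, assign-to-folds, and pick-fold steps from the **Partition and Sample** section can be approximated outside Azure ML with a few lines of plain Python. This is only a rough sketch of the idea, not how the module is implemented. The function names, the fixed seed, and the synthetic rows are all assumptions made up for illustration.

```python
import random
from collections import Counter, defaultdict


def sample_rows(rows, rate, seed=0):
    """Keep a random subset of the rows (without replacement), sized rate * len(rows)."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * rate))


def assign_folds_stratified(rows, label_key, n_folds, seed=0):
    """Return one 1-based fold index per row, spreading each label value evenly over the folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, row in enumerate(rows):
        by_label[row[label_key]].append(i)

    folds = [0] * len(rows)
    offset = 0  # continue the round-robin across labels so fold sizes stay even
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[i] = (offset + j) % n_folds + 1
        offset += len(indices)
    return folds


def pick_fold(rows, folds, fold_index):
    """Return only the rows that were assigned to the given fold."""
    return [row for row, fold in zip(rows, folds) if fold == fold_index]


# Synthetic stand-in for the Adult data (hypothetical values, not the real data set).
rows = [
    {"age": 20 + i % 50, "income": ">50K" if i % 4 == 0 else "<=50K"}
    for i in range(300)
]

sampled = sample_rows(rows, rate=0.1)                                     # "Sampling", rate 0.1
folds = assign_folds_stratified(sampled, label_key="income", n_folds=3)  # "Assign to Folds"
fold_2 = pick_fold(sampled, folds, fold_index=2)                          # "Pick Fold", index 2

print(len(sampled), len(fold_2), Counter(row["income"] for row in fold_2))
```

Dealing the rows of each label value round-robin is one simple way to keep both the fold sizes and the label proportions balanced. The module may use a different scheme internally.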
In the first **Split** module (from left to right), we select *Split Rows* in the drop-down list of *Splitting mode*, and randomly split the input data into two sets of equal size by setting *Fraction of rows in the first output dataset* to 0.5. The figure below shows the settings of this module.

![][image5]

In the second **Split** module (from left to right), we select *Regular Expression* in the drop-down list of *Splitting mode*. In this experiment, we set the regular expression to **\"education" (\d{1}th.+)|Doctorate**, which matches rows whose education value has a digit immediately followed by *th* and then at least one more character (for example, *7th-8th*), or is *Doctorate*. The first output port contains all rows that satisfy this condition, and the second output port contains all rows that do not. The figure below shows the settings of this module.

![][image6]

In the third **Split** module (from left to right), we select *Relative Expression* in the drop-down list of *Splitting mode*. In this experiment, we set the relative expression to **\"age" > 65**, which means that the value of the age column must be greater than 65. As a result, the first output port contains all rows in which the age is greater than 65, and the second output port contains all rows in which the age is not greater than 65. The figure below shows the settings of this module.

![][image7]

<!-- Images -->
[image1]:https://az712634.vo.msecnd.net/samplesimg/v1/S9/whole_exp.png
[image2]:https://az712634.vo.msecnd.net/samplesimg/v1/S9/partition_sample.png
[image3]:https://az712634.vo.msecnd.net/samplesimg/v1/S9/partition_cv.png
[image4]:https://az712634.vo.msecnd.net/samplesimg/v1/S9/partition_pickf2.png
[image5]:https://az712634.vo.msecnd.net/samplesimg/v1/S9/split_random.png
[image6]:https://az712634.vo.msecnd.net/samplesimg/v1/S9/split_regular_expression.png
[image7]:https://az712634.vo.msecnd.net/samplesimg/v1/S9/split_relative.png
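
The three **Split** configurations above can be mimicked in plain Python to make the two-output behavior concrete. This is a minimal sketch under stated assumptions, not the module's implementation: the helper names and the six example rows are made up, and it assumes the regular expression may match anywhere in the value (`re.search`), which is a guess about how the module applies it.

```python
import random
import re


def split_rows(rows, fraction, seed=0):
    """Shuffle the rows, then cut them so the first output holds `fraction` of them."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def split_by_condition(rows, condition):
    """First output: rows satisfying the condition. Second output: all other rows."""
    first = [row for row in rows if condition(row)]
    second = [row for row in rows if not condition(row)]
    return first, second


# Hypothetical rows using column names from the Adult data set.
rows = [
    {"age": 39, "education": "Bachelors"},
    {"age": 70, "education": "Doctorate"},
    {"age": 52, "education": "9th"},
    {"age": 66, "education": "HS-grad"},
    {"age": 23, "education": "7th-8th"},
    {"age": 65, "education": "11th"},
]

# Mode 1, Split Rows: random split with half of the rows in each output.
half_a, half_b = split_rows(rows, fraction=0.5)

# Mode 3, Regular Expression: the pattern part of the expression, applied to "education".
education_pattern = re.compile(r"(\d{1}th.+)|Doctorate")
edu_match, edu_rest = split_by_condition(
    rows, lambda row: education_pattern.search(row["education"]) is not None
)

# Mode 4, Relative Expression: "age" > 65.
older, not_older = split_by_condition(rows, lambda row: row["age"] > 65)

print([row["education"] for row in edu_match])
print([row["age"] for row in older])
```

In this sketch, *9th* and *11th* land in the second output of the regular-expression split because nothing follows the *th*, and the row with age 65 lands in the second output of the relational split because the comparison is strict.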