Tutorial: Clustering YouTube Views

April 4, 2019
This experiment is a tutorial on creating a clustering model in Azure ML using a YouTube views dataset.
The steps involved in building the Azure Machine Learning experiment are detailed below.

**• Data**

The dataset is publicly available on the Kaggle website. It contains 5000 channels; the features used in this experiment are the number of video views, subscribers, and video uploads. The dataset is available as a CSV file and has two additional columns (rank and grade) that were not used for this experiment. The goal of the experiment is to cluster the channels on these features using the K-Means Clustering module.

**• Retrieve data**

The dataset was imported into the Azure Machine Learning workspace using “Upload a new dataset from a local file”, without any changes to the raw data.

**• Prepare Data**

An initial run with the “Summarize Data” module identified the following:

1. Video Views: no missing values
2. Subscribers: 388 rows had “--” as the value
3. Video Uploads: 6 rows had “--” as the value

This summary clarified the dataset and guided the choice of modules for preparing the data.

**• Preprocess Data**

Because the features are stored as character data, an “Apply SQL Transformation” module was used to cast them to int and to remove the channels whose subscriber count is the “--” placeholder.

    select [Grade],
           [Channel name],
           cast([Video Uploads] as int) as [Video Uploads],
           cast([Subscribers] as int) as [Subscribers],
           cast([Video views] as int) as [Video views]
    from t1
    where [Subscribers] <> '--';

The data was also normalized using the “Normalize Data” module, with ZScore as the transformation method for Subscribers and Video views.

**• Algorithm**

Prior to training the model, the “Sweep Clustering” module was used to identify the optimal configuration. The results for each “Metric for measuring clustering result” were as follows:

| Metric for measuring clustering result | Metric value | Number of Centroids |
|----------------------------------------|--------------|---------------------|
| Simplified Silhouette                  | 0.788704     | 4                   |
| Davies-Bouldin                         | 0.370735     | 6                   |
| Dunn                                   | 0.926708     | 4                   |
| Average Deviation                      | 2101.017826  | 6                   |

Various modules (for example, Convert to CSV, Assign Data to Clusters, and Evaluate Model) were used to assess how different parameter choices affected the result and to identify the recommended number of centroids.

**• ML Training**

The final parameters used in the experiment were the following:

- Trainer mode: Single Parameter
- Number of Centroids: 4
- Initialization: K-Means++
- Metric: Euclidean
- Iterations: 150
- Assign Label Mode: Fill missing values

**• Results**

To visualize the clusters graphically in PowerBI, the output of the trained model was exported using “Convert to CSV”. Another “Apply SQL Transformation” module was included to combine the whole dataset. The cluster graph was created with the PowerBI “Bubble chart by Akvelon” visual.

**• PowerShell to Train model**

Given the size of the dataset, a PowerShell script was created to run the experiment as a batch process. The config file for the AML workspace is located at C:\users\xyz\config.json. For instructions on setting up the PowerShell module, see https://docs.microsoft.com/en-us/azure/machine-learning/studio/powershell-module

The PowerShell script below runs this experiment in batch:

    # After starting PowerShell as admin, run the following commands separately
    Unblock-File C:\Users\xyz\AzureMLPS.dll
    Import-Module C:\Users\xyz\AzureMLPS.dll

    # Look up the experiment by its description, "Top5k_Youtube_Views"
    $expID = Get-AmlExperiment | Where-Object Description -eq 'Top5k_Youtube_Views'

    # Run the experiment
    Start-AmlExperiment -ExperimentId $expID.ExperimentId

**Acknowledgement**

Thanks to Dhrumil Mehta and Socialblade for providing the dataset and making it available at [Kaggle Dataset][1]

[1]: https://www.kaggle.com/mdhrumil/top-5000-youtube-channels-data-from-socialblade/discussion
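
For readers who want to see what the preprocessing and training steps above do outside of Azure ML, here is a minimal sketch in plain Python. It is not the code behind the Azure ML modules. The function names and the sample rows are hypothetical, the z-score uses the population standard deviation (an assumption about the normalization), and the tiny sample is clustered with 2 centroids rather than the 4 used in the experiment.

```python
import math
import random


def clean_rows(rows):
    """Drop rows whose subscriber count is the '--' placeholder and cast the
    remaining fields to int, mirroring the WHERE clause and casts in the SQL."""
    return [(name, int(subs), int(views))
            for name, subs, views in rows if subs != "--"]


def zscore(values):
    """Standardize a column: (x - mean) / population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std or 1.0) for v in values]  # guard against std == 0


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """K-Means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance from the nearest chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, iterations=150, seed=0):
    """Lloyd's algorithm with K-Means++ seeding; stops early once the
    cluster assignments no longer change."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical sample rows: (channel name, subscribers, video views)
rows = [
    ("A", "1000", "50000"),
    ("B", "1200", "60000"),
    ("C", "--", "70000"),        # placeholder subscriber count, removed below
    ("D", "900000", "80000000"),
    ("E", "950000", "90000000"),
    ("F", "1100", "55000"),
]

cleaned = clean_rows(rows)
z_subs = zscore([r[1] for r in cleaned])
z_views = zscore([r[2] for r in cleaned])
points = list(zip(z_subs, z_views))

# 2 centroids for this tiny sample; the experiment itself used 4
labels, centroids = kmeans(points, k=2, iterations=150)
for (name, _, _), label in zip(cleaned, labels):
    print(name, label)
```

Standardizing both features first keeps the much larger view counts from dominating the Euclidean distance, which is the reason for applying a z-score transformation before K-Means.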