Tutorial: Predicting rain in Australia

March 28, 2019
This experiment is a tutorial on building a regression analysis and choosing an appropriate regression model to predict rainfall.
The steps involved in building the Azure Machine Learning experiment are detailed below.

**• Data**

The Rainfall in Australia dataset is publicly available on the Kaggle website and contains daily weather information for various locations in Australia. It is provided as a CSV file with 24 columns and 142,193 rows, with zero missing values for all features. The dataset is available at https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

For this experiment, we use the feature RainTomorrow to predict whether it will rain in a given location tomorrow.

**• Retrieve data**

The dataset was imported into the Azure Machine Learning workspace using “Upload a new dataset from a local file”. The following steps were then carried out in the new workspace:

1. Drag the saved dataset “WeatherAUS.CSV” onto the canvas.
2. Drag the “Summarize Data” module onto the canvas to better understand the dataset.
3. Run the experiment.
4. Right-click “Summarize Data”, select “Results Dataset”, then select “Visualize”.
5. Review the summary for each feature to understand its distribution and any missing counts. This also helps in planning which other modules to use for preparing the data.

**• Prepare Data**

To avoid noise and improve predictability, the feature RISK_MM was left out of the model in this experiment.

**• Preprocess Data**

Because every feature in the dataset was already labeled and in an appropriate format, no changes were made to categorical values and the “Edit Metadata” module was not used.

**• Algorithm**

Several regression models were compared against each other; for example, Neural Network regression predicted better than Linear regression. The results for the two are compared below. For this run, the features WindGustDir, RainToday and RISK_MM were excluded.

*Neural Network: Root Mean Squared Error: 0.220; Coefficient of Determination: 0.721*

*Linear Regression: Root Mean Squared Error: 0.311; Coefficient of Determination: 0.442*

Given the size of the dataset, the experiment also uses the Partition and Sample module with the following parameters:

*Partition: Sampling; Rate of Sampling: 0.25; Stratified split for Sampling: True; Stratification key column: Location*

**• ML Training**

While training the model, the Tune Model Hyperparameters module did not contribute much to identifying the Neural Network parameters for this experiment.

**• Results**

To compare the results of the experiment against RISK_MM, the Convert to CSV module was added to the experiment.

**• PowerShell to Train model**

Given the size of the dataset, a PowerShell script was created to run the experiment as a batch process. For instructions on setting up the PowerShell module, see https://docs.microsoft.com/en-us/azure/machine-learning/studio/powershell-module

The script below runs this experiment in batch. Change the file paths to reflect where the PowerShell assemblies are installed on your machine.

    # The config file for the AML workspace is available here: C:\users\xyz\config.json

    # After starting PowerShell as admin, execute the following commands separately
    Unblock-File C:\Users\xyz\AzureMLPS.dll
    Import-Module C:\Users\xyz\AzureMLPS.dll

    # Look up the experiment whose description is "RainInAustralia"
    $expID = Get-AmlExperiment | Where-Object Description -eq 'RainInAustralia'

    # Run the experiment
    Start-AmlExperiment -ExperimentId $expID.ExperimentId
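
For readers who want to see the feature exclusion and the Partition and Sample settings outside the Studio canvas, here is a minimal Python sketch of the same idea: drop WindGustDir, RainToday and RISK_MM, then keep about 25% of the rows within each Location. It is an illustration, not the module's actual implementation. The helper names `drop_columns` and `stratified_sample`, the fixed random seed, the per-group rounding rule, and the tiny made-up rows standing in for WeatherAUS.CSV are all assumptions.

```python
import csv
import io
import random
from collections import defaultdict

# Features the post excludes before training.
EXCLUDED = {"WindGustDir", "RainToday", "RISK_MM"}

def drop_columns(rows, excluded):
    """Return copies of the rows without the excluded feature columns."""
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]

def stratified_sample(rows, key, rate, seed=0):
    """Sample about `rate` of the rows within each distinct value of `key`."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for value in sorted(groups):
        members = groups[value]
        # Rounding rule is an assumption: keep at least one row per group.
        count = max(1, round(len(members) * rate))
        sample.extend(rng.sample(members, count))
    return sample

# Tiny made-up stand-in for WeatherAUS.CSV: 40 Sydney rows and 20 Perth rows.
lines = ["Location,Rainfall,WindGustDir,RainToday,RISK_MM,RainTomorrow"]
lines += [f"Sydney,{i},W,No,0.0,No" for i in range(40)]
lines += [f"Perth,{i},E,Yes,1.2,Yes" for i in range(20)]
rows = list(csv.DictReader(io.StringIO("\n".join(lines))))

prepared = drop_columns(rows, EXCLUDED)
sample = stratified_sample(prepared, key="Location", rate=0.25)
print(len(sample), sorted(sample[0]))
```

Stratifying on Location keeps every weather station represented in the smaller sample in proportion to its share of the full dataset, which a plain random 25% draw does not guarantee.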
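
The two metrics reported in the Algorithm section, Root Mean Squared Error and Coefficient of Determination, can be computed by hand as a sanity check on exported scores. The sketch below uses the standard textbook definitions. The function names, the made-up numbers, and the assumption that RainTomorrow is encoded as 0/1 are illustrative and do not come from the experiment itself.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up labels (assumed 0 = no rain, 1 = rain) and made-up model scores.
actual = [0.0, 0.0, 1.0, 1.0]
predicted = [0.1, 0.2, 0.8, 0.9]

error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
print(f"RMSE={error:.3f}  R^2={fit:.3f}")
```

A lower RMSE and a Coefficient of Determination closer to 1 both indicate a better fit, which is the sense in which the Neural Network figures above beat the Linear Regression figures.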