Train a decision forest model to predict surface reflectivity of roofs

November 6, 2019
Train a decision forest regression model to predict roof albedo, using band values (red, green, blue, and NIR) from NAIP imagery as input.
**Input Data:**

Training/validation:

- Measured or known albedo values of specific materials/sites at specific times. We collected roof and pavement albedo measurements from roof manufacturers, installers, and researchers. We currently have known albedo values, each associated with a specific measurement or installation date, for over 30,000 unique roof or pavement locations in approximately 45 US states. However, most of the roof samples had high albedo values, and very few had low albedo values. We therefore used two techniques to compensate for the lack of training data: pixel sampling and SMOTE. Rather than using all the pixels within a roof or street, we selected 20 random pixels as input, which let us draw multiple 20-pixel samples from underrepresented cases. SMOTE (Synthetic Minority Oversampling Technique), available as a module in Azure Machine Learning Studio (classic), increases the number of minority cases in a dataset used for machine learning. This statistical technique helped us create new samples across a range of albedo values for which we previously had no data. Overall, 283 roof samples were used as input for the roofs model. The samples were selected to make the training data more diverse and balanced.
- Geometries of the training sites/materials. Microsoft footprint data was used as the main source of geometries for the roofs training data.
- High-resolution four-band imagery of the training sites from within 6 months of the site measurement. National Agriculture Imagery Program (NAIP) imagery, which has 1 m resolution, was used as the source.

In summary, the input data consist of 283 rows. Each row contains 9 columns. "norm_red_mean", "norm_green_mean", "norm_blue_mean" and "norm_nir_mean" are the main inputs and represent preprocessed, normalized mean band values calculated from 20 random pixels within a roof.
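
The pixel-sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the experiment: the function names are hypothetical, and it assumes each pixel is already a normalized 4-tuple ordered red, green, blue, NIR, and that the 20 pixels are drawn without replacement.

```python
import random
from statistics import fmean

# Assumed band order for each pixel tuple (not stated in the source).
BANDS = ("red", "green", "blue", "nir")


def sample_band_means(roof_pixels, n_pixels=20, rng=None):
    """Average each band over n_pixels randomly chosen pixels of one roof.

    roof_pixels: list of (red, green, blue, nir) tuples, assumed to be
    already preprocessed and normalized.
    """
    rng = rng or random.Random()
    if len(roof_pixels) < n_pixels:
        raise ValueError("roof has fewer pixels than the requested sample size")
    chosen = rng.sample(roof_pixels, n_pixels)  # sampling without replacement
    return {
        f"norm_{band}_mean": fmean(pixel[i] for pixel in chosen)
        for i, band in enumerate(BANDS)
    }


def draw_samples(roof_pixels, n_samples, n_pixels=20, seed=0):
    """Draw several independent 20-pixel samples from the same roof,
    e.g. to generate extra rows for an underrepresented albedo range."""
    rng = random.Random(seed)
    return [sample_band_means(roof_pixels, n_pixels, rng) for _ in range(n_samples)]
```

Calling `draw_samples` with a larger `n_samples` for low-albedo roofs is one way to obtain several training rows from a single underrepresented roof, which is the effect described above.
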
Each of the 283 roof entries is associated with an expected albedo value, which is used to train the model.

**Data processing:** The data is split with a 70/30 ratio: 70% of the data is used for training and the remaining 30% for validation.

**Model description:** A decision forest regression algorithm was used to train the model. The model's parameters were tuned to find the best-performing settings.

**Results and evaluation of model performance:** The model is scored and evaluated on the validation data. The "Scored Label Mean" column from the "Score Model" module represents the estimated mean reflectivity (unitless albedo, 0-1) for each roof. The output of the "Evaluate Model" module shows that the model performs well, with a coefficient of determination of 0.997422.
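
For readers who want to reproduce the split and the evaluation metric outside Azure Machine Learning Studio, the following is a rough sketch in plain Python. The function names, the shuffled split with a fixed seed, and the rounding of the 70% cut point are assumptions made for illustration; the source does not describe how the split was randomized. The forest training itself is not shown.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into (train, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot
```

Under these assumptions, 283 rows split into 198 training rows and 85 validation rows. `r_squared` would then be applied to the measured albedo values and the model's predictions for the validation rows; a value close to 1 means the predictions explain nearly all of the variance in the measured albedo.
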