Building a Regression Model to Predict Real Estate Sales Price
Predict the real estate sales price of a house based upon various quantitative features about the house and sale.

# Follow Along

A fully illustrated version of this tutorial can be found [here](http://datasciencedojo.com/predicting-value-house/).

[![Foo](http://cdn.datasciencedojo.com/channel9_real_estate_housing_price.PNG)](https://channel9.msdn.com/Blogs/Seth-Juarez/Predict-the-value-of-your-house-using-Azure-Machine-Learning)

# Objective

Predict the real estate sales price of a house based upon various features about the house and the sales transaction.

# Data

The [Ames housing dataset](http://www.amstat.org/publications/jse/v19n3/decock.pdf) includes 81 features and 1460 observations. Each observation represents the sale of a home, and each feature is an attribute describing the house or the circumstances of the sale. [Click here for the full list of feature descriptions](https://kaggle2.blob.core.windows.net/competitions-data/kaggle/5407/data_description.txt?sv=2012-02-12&se=2016-09-26T01%3A34%3A00Z&sr=b&sp=r&sig=4%2BHyTSyqQ7xfvAEk80tgmOW9p6%2BXk6LekhzV9peFaro%3D).

# Initial Feature Selection

Some low-quality features were removed to improve the model's performance. A feature was considered low quality if it lacked representative categories, had too many missing values, or was noisy.

# Categorical Casting

[Nominal categorical features](http://www.ats.ucla.edu/stat/mult_pkg/whatstat/nominal_ordinal_interval.htm) were identified and cast to categorical data types using the metadata editor to ensure proper mathematical treatment by the machine learning algorithm.

# Cleaning Missing Data

For simplicity, missing values in categorical features were replaced with the mode, and missing values in numeric features were replaced with the median. To further improve the model's performance, custom cleaning functions should be tried on each individual feature rather than applying a blanket transformation to all columns.

# Statistical Feature Selection

Not every feature in its current form is expected to contain predictive value, and some may mislead the model or add noise.
To filter these out, we compute the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) between every feature and the response variable (sales price) as a quick measure of predictive strength. Only the top X most strongly correlated features are kept; the remaining features are dropped. The value of X can be tuned for further gains in model performance.

# Algorithm

A regularized gradient descent variant of the linear regression model will be used to reduce over-fitting.

* To ensure stable convergence of the weights and biases, all features except the response variable must be normalized so that they fall in the same range.

# Evaluation

Cross validation will be used to evaluate the predictive performance of the model as well as the stability of that performance on new data. Cross validation builds ten different models with the same algorithm but on different, non-repeating subsets of the same dataset. The evaluation metrics of the ten models are averaged, and their standard deviation indicates the stability of the average performance.

# Parameter Tuning

This experiment will build a model that minimizes the mean RMSE of the cross validation results with the lowest variance possible (while also considering bias-variance trade-offs).

1. The first regression model was built using default parameters. It was very inaccurate ($124,942 mean RMSE) and very unstable ($11,699 standard deviation).
2. The high bias and high variance of the previous model suggest that it is over-fitting to the outliers and under-fitting the general population. The L2 regularization weight was decreased to lower the penalty on large coefficients. After lowering the L2 regularization weight, the model is more accurate, with a mean cross validation RMSE of $42,366.
3. The previous model is still quite unstable, with a standard deviation of $8,121.
Since this dataset has a small number of observations (1460), it may be better to increase the number of training epochs so that the algorithm has more passes to reach convergence. This increases training time but also increases stability. The third linear model had the number of training epochs increased and achieved a better mean cross validation RMSE of $36,684 and a much more stable standard deviation of $3,849.
4. The final model had a slight increase in the learning rate, which improved both the mean cross validation RMSE and the standard deviation.

# Deployment

The algorithm parameters that yielded the best results will be the ones shipped. The best model (the last one) will be retrained using 100% of the data, since cross validation leaves 10% out each time for validation.

# Further Improve this Model

Feature engineering was left out of this experiment entirely. Try engineering more features from the existing dataset to see if the model improves. Some columns that were originally dropped may become useful when combined with other features. For example, try bucketing the years in which the houses were built by decade. Clustering the data may also yield some hidden insights.

# Related

1. [Detailed Tutorial: Building and deploying a classification model in Azure Machine Learning Studio](http://datasciencedojo.com/dojo/building-and-deploying-a-classification-model-in-azure-ml/)
2. [Demo: Interact with the user interface of a model deployed as service](http://demos.datasciencedojo.com/demo/titanic/)
3. [Tutorial: Creating a random forest regression model in R and using it for scoring](https://gallery.azureml.net/Details/b729c21014a34955b20fa94dc13390e5)
4. [Tutorial: Obtaining feature importance using variable importance plots](https://gallery.azureml.net/Details/964dfc4151e24511aa5f78159cab0485)
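
For readers who want to try the workflow from this tutorial outside of Azure Machine Learning Studio, the following is a minimal sketch in plain Python of the steps from statistical feature selection through evaluation: Pearson-based top-X selection, normalization, an L2-regularized linear regression trained by gradient descent, and ten-fold cross validation reporting the mean and standard deviation of RMSE. It is not the implementation used in the experiment. The data is synthetic, and every name in it (`top_k_features`, `fit_ridge_gd`, `cross_validate`, the `area`, `quality`, and `noise` columns, and all parameter values) is a hypothetical choice made for illustration.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def top_k_features(columns, target, k):
    """Names of the k columns with the largest |r| against the target."""
    ranked = sorted(columns, key=lambda c: abs(pearson(columns[c], target)),
                    reverse=True)
    return ranked[:k]


def min_max(values):
    """Rescale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def predict(row, w, b):
    return sum(wj * xj for wj, xj in zip(w, row)) + b


def fit_ridge_gd(X, y, l2=0.0001, lr=0.5, epochs=300):
    """Linear regression via batch gradient descent with an L2 weight penalty."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [predict(row, w, b) - t for row, t in zip(X, y)]
        grad_w = [sum(e * row[j] for e, row in zip(errors, X)) / n + l2 * w[j]
                  for j in range(d)]
        grad_b = sum(errors) / n  # the bias term is not penalized
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rmse(X, y, w, b):
    return math.sqrt(statistics.fmean(
        (predict(row, w, b) - t) ** 2 for row, t in zip(X, y)))


def cross_validate(X, y, folds=10, **params):
    """Mean and standard deviation of RMSE over non-overlapping held-out folds."""
    idx = list(range(len(X)))
    random.Random(0).shuffle(idx)
    scores = []
    for f in range(folds):
        held_out = set(idx[f::folds])
        train = [i for i in idx if i not in held_out]
        w, b = fit_ridge_gd([X[i] for i in train], [y[i] for i in train], **params)
        test = sorted(held_out)
        scores.append(rmse([X[i] for i in test], [y[i] for i in test], w, b))
    return statistics.fmean(scores), statistics.stdev(scores)


# Synthetic stand-in for the housing data (NOT the Ames dataset).
rng = random.Random(42)
n = 200
columns = {
    "area": [rng.uniform(50, 300) for _ in range(n)],
    "quality": [rng.randint(1, 10) for _ in range(n)],
    "noise": [rng.random() for _ in range(n)],  # unrelated to the price
}
price = [1000 * a + 15000 * q + rng.gauss(0, 5000)
         for a, q in zip(columns["area"], columns["quality"])]

selected = top_k_features(columns, price, k=2)  # "X" from the text is k here
X = list(zip(*(min_max(columns[name]) for name in selected)))
mean_rmse, std_rmse = cross_validate(X, price, folds=10,
                                     l2=0.0001, lr=0.5, epochs=300)
print(selected, round(mean_rmse), round(std_rmse))
```

The `l2`, `epochs`, and `lr` arguments correspond to the three knobs adjusted in the Parameter Tuning section, so rerunning `cross_validate` with different values is one way to observe how the mean RMSE and its standard deviation respond on this toy data.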