TX Elementary School Rank Predictive Analytics using B.L.R. - Prateeksha Hemraj

November 2, 2019
Predicting ranks from the available features and identifying what influences the ranking of elementary schools in Texas.
This model is built to identify the major influencers of rank and to predict the ranks of elementary schools in Texas. We looked for correlations among features such as ethnicity, Magnet status, Title 1 status, Charter status, location, number of students and teachers, Student: Teacher ratio, availability of free/discounted lunch, and the range of grades offered. The dataset consists of 28 variables and around 4308 individual entries.

We used Tableau to visualize the dataset and understand it better. Some visualizations suggested that, at the elementary level, being a Magnet school did not influence ranking. The visualizations also surfaced some interesting facts about the schools. E.g., when we compared four popular cities (Austin, Dallas, Houston, San Antonio), Houston had the highest average enrollment per school, at 729 students, and also the highest Student: Teacher ratio. The overall ranks of schools in Austin were the highest among the four cities.

*Data Selection and Cleaning:* The dataset was imported into Azure ML Studio for further experiments. To minimize computation, we carefully identified the target variable to be predicted (Rank), noise variables (School, School URL, District URL, City URL, Phone), feeder variables (Average Standard Score), independent variables (Title1, %African American, %American Indian, %Asian, %Hispanic, %White, Low grade, High grade, %Free/Disc Lunch, County, District, City, Student: Teacher, #Students, #Teachers, is Magnet, Is Charter) and redundant variables (Is virtual). We selected the target variable and the independent variables for further calculation. The dataset was then cleaned and prepared for calculation; the cleaned data had no missing values.

*Chi-Square and Correlation:* We introduced the Filter Based Feature Selection module to filter out redundant features. Since we had a mix of categorical and continuous variables, we opted for the Chi-Square scoring method.
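To illustrate what Chi-Square scoring measures, here is a minimal Python sketch. It is not Azure ML Studio's Filter Based Feature Selection module: the function name `chi_square_score`, the bucketing of Rank into "top"/"bottom", and the eight toy rows are all made up for illustration and are not taken from the dataset. A higher score means a feature's categories line up more strongly with the rank bucket.

```python
from collections import Counter


def chi_square_score(feature, target):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # joint counts per (feature, target) cell
    feature_totals = Counter(feature)          # row totals
    target_totals = Counter(target)            # column totals
    score = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            # Count expected in this cell if feature and target were independent.
            expected = f_count * t_count / n
            score += (observed[(f, t)] - expected) ** 2 / expected
    return score


# Hypothetical toy rows (NOT from the school dataset).
# Assumption: Rank has been bucketed into two classes so it can be treated as categorical.
rank_bucket = ["top", "top", "top", "bottom", "bottom", "bottom", "top", "bottom"]
is_title1 = ["no", "no", "no", "yes", "yes", "yes", "no", "yes"]
is_magnet = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]

scores = {
    "Is Title 1": chi_square_score(is_title1, rank_bucket),
    "Is Magnet": chi_square_score(is_magnet, rank_bucket),
}

# Features sorted from most to least associated with the rank bucket.
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.2f}")
```

Continuous variables would need to be binned in a similar way before a score like this could be computed for them; how the module handles that internally is not covered here.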
The top 5 variables by Chi-Square score were *District, City, Percent free/Disc Lunch, County and Is Title 1.* We also computed linear correlations and found a strong direct relation in *Number of fulltime teachers v/s Number of students*, and strong inverse relations in *%white v/s %Free/Disc Lunch* and *%white v/s %Hispanic.* We also observed moderate direct relations in *%free lunch v/s Rank* and *%Hispanic v/s %Free/Disc Lunch*, and a moderate inverse relation in *%2 or more Races v/s %Hispanic.*

*Bayesian Linear Regression:* We used Bayesian Linear Regression for predictive data modeling. We obtained a Mean Absolute Error of 595.43 and a Coef. of Error of 0.614736.

*Conclusion:* The Bayesian Linear Regression model we developed helps predict a school's rank given the other variables. The M.A.E. value and Coef. of Error state that we are 95% sure of getting the rank with an error of ±595.43.
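The Bayesian Linear Regression step and its Mean Absolute Error can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the Azure ML Studio module used above: it uses a single predictor, a zero-mean Gaussian prior on the weights with precision `alpha`, and a known noise precision `beta`. The function names, the values of `alpha` and `beta`, and the five toy data points are hypothetical and not taken from the dataset.

```python
def fit_bayesian_linear(xs, ys, alpha=1e-6, beta=1.0):
    """Posterior mean of (intercept, slope) for y = w0 + w1 * x + noise.

    Assumes a zero-mean Gaussian prior with precision `alpha` on both weights
    and Gaussian noise with known precision `beta` (illustrative values).
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision matrix A = alpha * I + beta * X^T X (2x2, symmetric).
    a00 = alpha + beta * n
    a01 = beta * sx
    a11 = alpha + beta * sxx
    det = a00 * a11 - a01 * a01
    # Right-hand side b = beta * X^T y; the posterior mean solves A m = b.
    b0 = beta * sy
    b1 = beta * sxy
    w0 = (a11 * b0 - a01 * b1) / det
    w1 = (a00 * b1 - a01 * b0) / det
    return w0, w1


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy data (NOT from the school dataset):
# percentage of students on free/discounted lunch vs. school rank.
pct_free_lunch = [10, 30, 50, 70, 90]
rank = [450, 1300, 2050, 2950, 3700]

w0, w1 = fit_bayesian_linear(pct_free_lunch, rank)
predicted_rank = [w0 + w1 * x for x in pct_free_lunch]
mae = mean_absolute_error(rank, predicted_rank)
print(f"intercept={w0:.2f}, slope={w1:.2f}, MAE={mae:.2f}")
```

With a very small `alpha` the prior is nearly flat, so the posterior mean stays close to an ordinary least-squares fit; a larger `alpha` would pull the weights toward zero. The MAE is expressed in rank positions, the same unit as the 595.43 reported above.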