San Francisco Crime Classification

August 16, 2016
This experiment takes San Francisco crime data and evaluates which crimes are likeliest to occur at which places and times.
## Introduction ##

Today, San Francisco is known for bustling technology and innovation rather than for the criminal activity that led to the creation of Alcatraz. With all of this growth, however, there comes a price: crime. This project uses a combination of Microsoft’s Azure Machine Learning Studio and Python’s Pandas data analysis library to predict the category of crime that could occur in San Francisco, given multiple parameters including location and time. This would let residents or visitors of a given area prepare for the crimes most likely to occur there, and help police officers strengthen their protection of that area.

## Data ##

- The data was taken from San Francisco’s Open Data platform, on which the city publicly releases information from its various departments.
- The police department has kept track of crime since January 2003; as of August 15, 2016, there are 1,933,808 rows.
- The information listed includes:
    - Dates: the precise date and time at which the crime took place
    - Category: the type of crime that took place
    - Descript: the description the police gave the crime
    - DayOfWeek: the day of the week on which the crime occurred
    - PdDistrict: the police district in which the crime took place
    - Resolution: the action taken after the crime
    - Address: a block or intersection near which the crime took place
    - X & Y: the coordinates at which the crime was committed

[Link to data][1]

## Data Manipulation ##

San Francisco recognizes 39 different crime categories, many of which differ only on a technical basis. In this experiment, similar categories were combined, reducing the 39 to 6. In addition, there were 24,931 unique values in the “Address” column. A counting transform was applied to this column, with the “Category” column as the label, so that the high-cardinality address values are replaced by count-based features.
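
The two manipulation steps described so far can be sketched in Pandas as follows. This is only an illustration under stated assumptions, not the experiment's actual pipeline: the rows are made-up toy data, the `CATEGORY_GROUPS` mapping is hypothetical (the six combined categories are not listed here), and `add_address_counts` is a helper name invented for this sketch. It approximates a counting transform by counting, for each address, how many rows fall into each combined category.

```python
import pandas as pd

# Toy rows standing in for the real data set; values are made up for illustration.
crimes = pd.DataFrame({
    "Address": [
        "800 Block of BRYANT ST",
        "800 Block of BRYANT ST",
        "800 Block of BRYANT ST",
        "OAK ST / LAGUNA ST",
        "OAK ST / LAGUNA ST",
    ],
    "Category": ["LARCENY/THEFT", "VEHICLE THEFT", "ASSAULT", "ASSAULT", "VANDALISM"],
})

# Hypothetical grouping: the experiment's real six groups are not given in the write-up.
CATEGORY_GROUPS = {
    "LARCENY/THEFT": "THEFT",
    "VEHICLE THEFT": "THEFT",
    "ASSAULT": "VIOLENT",
}


def add_address_counts(df, group_map):
    """Combine categories, then add per-address counts of each combined category."""
    out = df.copy()
    # Categories missing from the mapping fall into a catch-all group.
    out["CategoryGroup"] = out["Category"].map(group_map).fillna("OTHER")
    # One row per address, one column per combined category, cells are row counts.
    counts = pd.crosstab(out["Address"], out["CategoryGroup"]).add_prefix("Count_")
    # Attach the counts to every row that shares the address.
    return out.join(counts, on="Address")


features = add_address_counts(crimes, CATEGORY_GROUPS)
print(features)
```

Each row now carries `Count_*` columns in place of relying on the raw address text. Note that counting over the same rows the model is later trained on leaks label information; a real pipeline would typically build the counts from a separate portion of the data.
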
Finally, the “Resolution” and “Descript” columns were removed because residents would not have access to them until after a crime takes place. Since these values cannot be supplied to the model at prediction time, it must be trained without them.

## Results ##

The model reached an overall accuracy of 38.99%. With more data manipulation and experimentation, this could perhaps be increased enough to provide significant aid to police and residents in thwarting crime.

[1]: