Binary Classification: Flight delay prediction

September 2, 2014
In this experiment, we predict whether a scheduled passenger flight will be delayed, using a binary classifier.
In this experiment, we use historical on-time performance and weather data to predict whether the arrival of a scheduled passenger flight will be delayed by more than 15 minutes.

We approach this problem as a classification task, predicting two classes: whether the flight will be delayed, or whether it will be on time. Broadly speaking, in machine learning and statistics, classification is the task of identifying the class or category to which a new observation belongs, on the basis of a training set of data containing observations with known categories. Classification is generally a supervised learning problem. Since this is a binary classification task, there are only two classes.

To solve this problem, we build an experiment in Azure ML Studio. We train a model using a large number of examples from historic flight data, along with an outcome measure that indicates the appropriate class for each example. An example is labeled 1 if the flight was delayed, and 0 if it was on time.

There are five basic steps in building an experiment in Azure ML Studio:

Create a Model

- [Step 1: Get Data](#anchor-1)
- [Step 2: Pre-process Data](#anchor-2)
- [Step 3: Define Features](#anchor-3)

Train the Model

- [Step 4: Choose and apply a learning algorithm](#anchor-4)

Score and Test the Model

- [Step 5: Predict over new data](#anchor-5)

------------------------------------------

## <a name="anchor-1"></a> Data

**Passenger flight on-time performance data, taken from the TranStats data collection of the U.S. Department of Transportation**

The dataset contains flight delay data for the period April-October 2013. Before uploading the data to Azure ML Studio, we pre-processed it as follows:

- Filtered to include only the 70 busiest airports in the continental United States.
- Relabeled canceled flights as delayed by more than 15 minutes.
- Filtered out diverted flights.

From the dataset, we selected the following 14 columns: _Year_, _Month_, _DayofMonth_, _DayOfWeek_, _Carrier_, _OriginAirportID_, _DestAirportID_, _CRSDepTime_, _DepDelay_, _DepDel15_, _CRSArrTime_, _ArrDelay_, _ArrDel15_, and _Cancelled_. These columns contain the following information:

- _Carrier_ - Code assigned by IATA and commonly used to identify a carrier.
- _OriginAirportID_ - An identification number assigned by US DOT to identify a unique airport (the flight's origin).
- _DestAirportID_ - An identification number assigned by US DOT to identify a unique airport (the flight's destination).
- _CRSDepTime_ - The CRS departure time in local time (hhmm).
- _DepDelay_ - Difference in minutes between the scheduled and actual departure times. Early departures show negative numbers.
- _DepDel15_ - A Boolean value indicating whether the departure was delayed by 15 minutes or more (1 = departure was delayed).
- _CRSArrTime_ - The CRS arrival time in local time (hhmm).
- _ArrDelay_ - Difference in minutes between the scheduled and actual arrival times. Early arrivals show negative numbers.
- _ArrDel15_ - A Boolean value indicating whether the arrival was delayed by 15 minutes or more (1 = arrival was delayed).
- _Cancelled_ - A Boolean value indicating whether the flight was cancelled (1 = flight was cancelled).

We also used a set of weather observations:

**Hourly land-based weather observations from NOAA**

The weather data represents observations from airport weather stations, covering the same period of April-October 2013. Before uploading to Azure ML Studio, we processed the data as follows:

- Weather station IDs were mapped to corresponding airport IDs.
- Weather stations not associated with the 70 busiest airports were filtered out.
- The _Date_ column was split into separate columns: _Year_, _Month_, and _Day_.
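As an illustrative sketch of these weather pre-processing steps outside Azure ML Studio (using pandas; the station IDs, airport IDs, and station-to-airport mapping below are made up for the example):

```python
import pandas as pd

# Hypothetical raw weather frame: one row per hourly observation.
weather = pd.DataFrame({
    "WBAN": ["14739", "14739", "99999"],   # weather station IDs (made up)
    "Date": ["2013-04-01", "2013-04-01", "2013-04-01"],
    "DryBulbFarenheit": [41, 43, 50],
})

# Map weather station IDs to airport IDs (mapping values are hypothetical).
station_to_airport = {"14739": 10721}
weather["AirportID"] = weather["WBAN"].map(station_to_airport)

# Filter out stations not associated with the airports of interest.
weather = weather.dropna(subset=["AirportID"])

# Split the Date column into separate Year, Month and Day columns.
dates = pd.to_datetime(weather["Date"])
weather["Year"], weather["Month"], weather["Day"] = (
    dates.dt.year, dates.dt.month, dates.dt.day)
```

The unmapped station (`"99999"`) drops out in the `dropna` step, leaving only observations that can be tied to an airport.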
From the weather data, we selected the following 26 columns: _AirportID_, _Year_, _Month_, _Day_, _Time_, _TimeZone_, _SkyCondition_, _Visibility_, _WeatherType_, _DryBulbFarenheit_, _DryBulbCelsius_, _WetBulbFarenheit_, _WetBulbCelsius_, _DewPointFarenheit_, _DewPointCelsius_, _RelativeHumidity_, _WindSpeed_, _WindDirection_, _ValueForWindCharacter_, _StationPressure_, _PressureTendency_, _PressureChange_, _SeaLevelPressure_, _RecordType_, _HourlyPrecip_, and _Altimeter_.

**Airport Codes Dataset**

The final dataset used in the experiment contains one row for each U.S. airport, including the airport ID number, airport name, city, and state (columns: _airport_id_, _city_, _state_, _name_).

## <a name="anchor-2"></a> Pre-process data

A dataset usually requires some pre-processing before it can be analyzed.

**Flight Data Preprocessing**

First, we used the **Project Columns** module to exclude from the dataset columns that are possible target leakers: _DepDelay_, _DepDel15_, _ArrDelay_, _Cancelled_, and _Year_.

The columns _Carrier_, _OriginAirportID_, and _DestAirportID_ represent categorical attributes. However, because they are integers, they are initially parsed as continuous numbers; we therefore used the **Metadata Editor** module to convert them to categorical.

We need to join the flight records with the hourly weather records, using the scheduled departure time as one of the join keys. To do this, the _CRSDepTime_ column is rounded down to the nearest hour using two successive instances of the **Apply Math Operation** module.

**Weather Data Preprocessing**

Columns that have a large proportion of missing values are excluded using the **Project Columns** module.
These include all string-valued columns: _ValueForWindCharacter_, _WetBulbFarenheit_, _WetBulbCelsius_, _PressureTendency_, _PressureChange_, _SeaLevelPressure_, and _StationPressure_.

The **Clean Missing Data** module is then applied to the remaining columns to remove rows with missing data.

The time of the weather observation is rounded up to the nearest full hour, so that the column can be equi-joined with the scheduled flight departure time. Note that the scheduled flight time and the weather observation time are rounded in opposite directions. This ensures that the model uses only weather observations that happened in the past, relative to flight time.

Also note that the weather data is reported in local time, but the origin and destination may be in different time zones. Therefore, an adjustment for time zone differences must be made by subtracting the time zone offset from the scheduled departure time (_CRSDepTime_) and the weather observation time (_Time_). These operations are done using the **Execute R Script** module.

The resulting columns are _Year_, _AdjustedMonth_, _AdjustedDay_, _AirportID_, _AdjustedHour_, _Timezone_, _Visibility_, _DryBulbFarenheit_, _DryBulbCelsius_, _DewPointFarenheit_, _DewPointCelsius_, _RelativeHumidity_, _WindSpeed_, and _Altimeter_.

**Joining Datasets**

Flight records are joined with the weather data at the origin of the flight (_OriginAirportID_) using the **Join** module. Flight records are then joined with the weather data at the destination of the flight (_DestAirportID_).

**Preparing Training and Validation Samples**

The training and validation samples are created using the **Split** module, dividing the data into April-September records for training and October records for validation.
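The rounding, time zone adjustment, and equi-join logic can be sketched in pandas (the tiny example frames and time zone offsets below are made up; day rollover at midnight is ignored in this sketch):

```python
import pandas as pd

# Toy flight record: CRSDepTime is local time in hhmm format.
flights = pd.DataFrame({
    "OriginAirportID": [10721], "Year": [2013], "Month": [4], "Day": [2],
    "CRSDepTime": [1745], "TimeZone": [-4],  # assumed UTC offset in hours
})

# Toy weather observation: Time is local time in hhmm format.
weather = pd.DataFrame({
    "AirportID": [10721], "Year": [2013], "Month": [4], "Day": [2],
    "Time": [1651], "TimeZone": [-4], "WindSpeed": [12],
})

# Round scheduled departure DOWN to the hour, weather observation UP,
# so each flight only sees weather reported before its departure.
flights["DepHour"] = flights["CRSDepTime"] // 100
weather["ObsHour"] = -(-weather["Time"] // 100)   # ceiling division

# Adjust both times to a common clock by removing the local offset.
flights["AdjustedHour"] = flights["DepHour"] - flights["TimeZone"]
weather["AdjustedHour"] = weather["ObsHour"] - weather["TimeZone"]

# Equi-join flight records with origin-airport weather on airport and hour.
joined = flights.merge(
    weather[["AirportID", "Year", "Month", "Day", "AdjustedHour", "WindSpeed"]],
    left_on=["OriginAirportID", "Year", "Month", "Day", "AdjustedHour"],
    right_on=["AirportID", "Year", "Month", "Day", "AdjustedHour"],
)
```

Here the 16:51 observation rounds up to hour 17 and the 17:45 departure rounds down to hour 17, so the join pairs the flight with weather that was reported 54 minutes before departure.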
The _Year_ and _Month_ columns are removed from the training dataset using the **Project Columns** module. The training data is then separated into equal-height bins using the **Quantize Data** module, and the same binning is applied to the validation data.

The training data is split once more, into a training dataset and an optional validation dataset.

## <a name="anchor-3"></a> Define Features

In machine learning, *features* are individual measurable properties of something you're interested in. Finding a good set of features for a predictive model requires experimentation and knowledge about the problem at hand. Some features are better for predicting the target than others. Also, some features are strongly correlated with other features, so they add little new information to the model and can be removed.

To build a model, we can use all the features available, or we can select a subset of them. Typically you try different feature selections, then run the experiment again to see whether the results improve. In this experiment, the features are the weather conditions at the origin and destination airports, the departure and arrival times, the airline carrier, the day of the month, and the day of the week.

## <a name="anchor-4"></a> Choose and apply a learning algorithm

**Model Training and Validation**

We created a model using the **Two-Class Boosted Decision Tree** module and trained it using the training dataset. To determine the optimal parameters, we connected the output port of **Two-Class Boosted Decision Tree** to the **Sweep Parameters** module. The model is optimized for the best AUC using a 10-fold random parameter sweep.
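Azure ML Studio's **Two-Class Boosted Decision Tree** and **Sweep Parameters** modules are proprietary, but the same idea -- a random sweep over boosted-tree parameters, scored by cross-validated AUC -- can be sketched with scikit-learn (the data here is a random toy stand-in, not the flight data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)

# Toy stand-in for the training set; in the experiment these would be the
# joined flight/weather features and the ArrDel15 label.
X = rng.rand(200, 5)
y = (X[:, 0] + 0.2 * rng.rand(200) > 0.6).astype(int)

# Random sweep over a few boosted-tree parameters, optimizing AUC,
# analogous to Sweep Parameters with a 10-fold random sweep.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}
sweep = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,           # number of random parameter combinations tried
    scoring="roc_auc",  # optimize for best AUC
    cv=10,              # 10-fold cross-validation
    random_state=0,
)
sweep.fit(X, y)
best_model = sweep.best_estimator_
```

`sweep.best_params_` then holds the winning parameter combination, and `best_model` is refit on the full training set with those parameters.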
For comparison, we created a second model using the **Two-Class Logistic Regression** module and optimized it in the same manner. The result of the experiment is a trained classification model that can be used to score new samples. We used the validation set to generate scores from the trained models, and then used the **Evaluate Model** module to analyze and compare the quality of the models.

## <a name="anchor-5"></a> Predict Using New Data

Now that we've trained the model, we can use it to score the other part of our data -- the October records that were set aside for validation -- and see how well the model classifies new data.

Add the **Score Model** module to the experiment canvas, and connect its left input port to the output of the **Train Model** module. Connect its right input port to the validation output (right port) of the **Split** module. After you run the experiment, you can view the output of the **Score Model** module by clicking the output port and selecting **Visualize**. The output includes the scored labels and the probabilities for the labels.

Finally, to test the quality of the results, add the **Evaluate Model** module to the experiment canvas, and connect its left input port to the output of the **Score Model** module. Note that **Evaluate Model** has two input ports, because it can be used to compare two models. In this experiment, we compare the performance of the two algorithms: the model created using **Two-Class Boosted Decision Tree** and the model created using **Two-Class Logistic Regression**. Run the experiment and view the output of the **Evaluate Model** module by clicking the output port and selecting **Visualize**.

## Results

The boosted decision tree model achieves an AUC of 0.697 on the validation set, slightly better than the logistic regression model, which achieves an AUC of 0.675.
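The score-and-compare step can be sketched in scikit-learn terms: train both model families, score the held-out month (as **Score Model** does), and compare their AUCs (as **Evaluate Model** does). The data below is a random toy stand-in, so the AUC values it produces are not the 0.697 / 0.675 reported above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)

# Toy stand-ins for training (April-September) and validation (October) data.
X_train, X_valid = rng.rand(300, 5), rng.rand(100, 5)
y_train = (X_train[:, 0] + 0.2 * rng.rand(300) > 0.6).astype(int)
y_valid = (X_valid[:, 0] + 0.2 * rng.rand(100) > 0.6).astype(int)

# Fit both classifiers on the training months.
tree = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
logit = LogisticRegression().fit(X_train, y_train)

# Score the validation month and compare AUCs on the predicted probabilities.
auc_tree = roc_auc_score(y_valid, tree.predict_proba(X_valid)[:, 1])
auc_logit = roc_auc_score(y_valid, logit.predict_proba(X_valid)[:, 1])
```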
**Post-Processing**

To make the results easier to analyze, we used the airport ID field to join in the dataset that contains the airport names and locations.
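This post-processing join is a simple lookup; a pandas sketch (the scored rows, airport IDs, and names below are made up for illustration):

```python
import pandas as pd

# Hypothetical scored results keyed by airport ID.
scored = pd.DataFrame({
    "airport_id": [10721, 12478],
    "ScoredProbability": [0.31, 0.64],
})

# Airport codes dataset: one row per U.S. airport, as described above.
airports = pd.DataFrame({
    "airport_id": [10721, 12478],
    "name": ["Logan International", "JFK International"],
    "city": ["Boston", "New York"],
    "state": ["MA", "NY"],
})

# Join on the airport ID to attach readable names and locations.
results = scored.merge(airports, on="airport_id", how="left")
```

A left join keeps every scored row even if an airport ID has no match in the lookup table.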