Predictive Maintenance: Step 2C of 3, train and evaluate multi-class classification models

April 27, 2015
This experiment demonstrates the steps in building a predictive maintenance solution.
# Predictive Maintenance Template #

Predictive maintenance encompasses a variety of topics, including but not limited to: failure prediction, failure diagnosis (root cause analysis), failure detection, failure type classification, and recommendation of mitigation or maintenance actions after failure. As part of the Azure Machine Learning offering, Microsoft provides a template that helps data scientists easily build and deploy a predictive maintenance solution. **This predictive maintenance template focuses on the techniques used to predict when an in-service machine will fail, so that maintenance can be planned in advance.**

The template includes a collection of pre-configured machine learning modules, as well as custom R scripts in the *Execute R Script* module, to enable an end-to-end solution from data processing to deployment of the machine learning model. Three modeling solutions are provided in this template to accomplish the following tasks.

- **Regression:** Predict the Remaining Useful Life (RUL), or Time to Failure (TTF).
- **Binary classification:** Predict whether an asset will fail within a certain time frame (e.g., days).
- **Multi-class classification:** Predict whether an asset will fail in different time windows: e.g., fails in the window [1, *w0*] days; fails in the window [*w0*+1, *w1*] days; does not fail within *w1* days.

The time units mentioned above can be replaced by working hours, cycles, mileage, transactions, etc., based on the actual scenario.
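Since the template's custom scripts run in R (via the *Execute R Script* module), the illustrations in this document use R as well. As a first example, the minimal sketch below maps a known remaining life, measured in cycles, to the three targets just described; the function name, the window values *w0*=15 and *w1*=30, and the 0/1/2 coding of the multi-class label are assumptions chosen for this illustration, not values prescribed by the template.

```r
# Minimal sketch (not part of the template): map remaining useful life (in cycles)
# to the three modeling targets described above.
# w0 and w1 are use-case-specific window boundaries; the values here are examples.
make_targets <- function(rul, w0 = 15, w1 = 30) {
  data.frame(
    RUL    = rul,                         # regression target
    label1 = as.integer(rul <= w1),       # binary: fails within w1 cycles?
    label2 = ifelse(rul <= w0, 2L,        # multi-class: 2 = fails in [1, w0]
             ifelse(rul <= w1, 1L, 0L))   #              1 = fails in [w0+1, w1]
  )                                       #              0 = does not fail within w1
}

make_targets(c(5, 20, 100))   # one row per example remaining-life value
```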
This template uses the example of simulated aircraft engine run-to-failure events to demonstrate the predictive maintenance modeling process. The implicit assumption of the modeling below is that the asset of interest has a progressing degradation pattern, which is reflected in the asset's sensor measurements. By examining the asset's sensor values over time, the machine learning algorithm can learn the relationship between the sensor values (and changes in those values) and the historical failures, in order to predict failures in the future. We suggest examining the data format and going through all three steps of the template before replacing the data with your own.

The template is divided into 3 separate steps with 7 experiments in total, where the first step has 1 experiment and the other two steps contain 3 experiments each, one for each of the modeling solutions.

![work flow](https://az712634.vo.msecnd.net/samplesimg/v1/T4/workflow.png)

- [Step 1: Data preparation and feature engineering]
- [Step 2: Train and evaluate model]
- [Step 3: Deploy as web service]

[Step 1: Data preparation and feature engineering]: #step1
[Step 2: Train and evaluate model]: #step2
[Step 3: Deploy as web service]: #step3

##<a id="step1"></a> Step 1: Data preparation and feature engineering ##

The name of the experiment in this step is: "Predictive Maintenance: Step 1 of 3, data preparation and feature engineering". The following figure shows the experiment for this step.

![step 1](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step1_1.png)

### Source data ###

The template takes three datasets as inputs.

- Training data: the aircraft engine run-to-failure data.
- Testing data: the aircraft engine operating data without failure events recorded.
- Ground truth data: the true number of remaining cycles for each engine in the testing data.

The data schema for the training and testing data is shown in the following table.

![training, testing data schema](https://az712634.vo.msecnd.net/samplesimg/v1/T4/dataSchema1.png)

The input data consists of "train\_FD001.txt", "test\_FD001.txt", and "RUL\_FD001.txt" from the original data source [1]. The training data ("train\_FD001.txt") consists of multiple multivariate time series with "cycle" as the time unit, together with 21 sensor readings for each cycle. Each time series can be assumed to be generated from a different engine of the same type. Each engine is assumed to start with a different degree of initial wear and manufacturing variation, and this information is unknown to the user. In this simulated data, the engine is assumed to be operating normally at the start of each time series. It starts to degrade at some point during the series of operating cycles, and the degradation progresses and grows in magnitude. When a predefined threshold is reached, the engine is considered unsafe for further operation. **In other words, the last cycle in each time series can be considered as the failure point of the corresponding engine.** Taking the sample training data shown in the following table as an example, the engine with id=1 fails at cycle 192, and the engine with id=2 fails at cycle 287.
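For readers who want to inspect the raw files outside Azure ML, the sketch below shows one way to load the training data in R and recover each engine's failure cycle. The files are space-delimited without a header; the column names and file path used here are assumptions that mirror the schema above.

```r
# Minimal sketch (outside the template): load "train_FD001.txt" and find the
# failure cycle of each engine. The file is space-delimited with no header row;
# the column names below mirror the schema shown above.
cols  <- c("id", "cycle", paste0("setting", 1:3), paste0("s", 1:21))
train <- read.table("train_FD001.txt", header = FALSE, col.names = cols)

# The last recorded cycle of each engine is its failure point.
failure_cycle <- aggregate(cycle ~ id, data = train, FUN = max)
head(failure_cycle)   # e.g., id=1 -> 192, id=2 -> 287 for this dataset
```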
The testing data ("test\_FD001.txt") has the same data schema as the training data. The only difference is that the data does not indicate when the failure occurs (in other words, the last time period does NOT represent the failure point). Taking the sample testing data shown in the following table as an example, the engine with id=1 runs from cycle 1 through cycle 31; it is not shown how many more cycles this engine can last before it fails.

The ground truth data ("RUL\_FD001.txt") provides the number of remaining working cycles for the engines in the testing data. Taking the sample ground truth data shown in the following table as an example, the engine with id=1 in the testing data can run another 112 cycles before it fails.

![sample training/testing data](https://az712634.vo.msecnd.net/samplesimg/v1/T4/sampleData.png)

### Data Labeling ###

Based on the input data description we walked through in the previous section, an intuitive predictive maintenance question to ask is: "Given this history of aircraft engine operation and failure events, can we predict when an in-service engine will fail?" We re-formulate this question into three closely related questions and answer them using three different types of machine learning models. While we go through all three questions, in practice only one would likely be chosen, depending on the business requirements.

- Regression models: How many more cycles will an in-service engine last before it fails?
- Binary classification: Is this engine going to fail within *w1* cycles?
- Multi-class classification: Is this engine going to fail within the window [1, *w0*] cycles, fail within the window [*w0*+1, *w1*] cycles, or not fail within *w1* cycles?

Taking the engine with id=1 as an example, the following figure shows how the training data is labeled, where "RUL", "label1", and "label2" are the labels for the regression, binary classification, and multi-class classification models respectively. Here *w0* and *w1* are predefined, use-case-related parameters used to label the training data. The customer needs to decide how far ahead of the actual failure event the failure alert should trigger.

![sample training/testing data](https://az712634.vo.msecnd.net/samplesimg/v1/T4/labeling_1.png)

The labels for the testing data are generated based on the ground truth data, i.e. the right-hand branch of the Step 1 experiment. This follows the same procedure as the training data labeling.
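Inside the experiment this labeling is done with Azure ML modules and R scripts; as a stand-alone illustration, the sketch below derives all three labels for the run-to-failure training data loaded earlier, reusing the hypothetical `make_targets()` helper defined above with the same example windows.

```r
# Minimal sketch (not the template's own script): label run-to-failure training
# data loaded as in the earlier snippet. Assumes `train` has columns id and cycle.
failure_cycle <- ave(train$cycle, train$id, FUN = max)   # last (failure) cycle per engine
rul           <- failure_cycle - train$cycle             # remaining cycles for each row

# Adds the RUL, label1, and label2 columns using the example windows w0=15, w1=30.
train <- cbind(train, make_targets(rul, w0 = 15, w1 = 30))
```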
### Feature engineering ###

Another important task in Step 1 is to generate training and testing features. In the Step 1 experiment figure shown above, the modules in the left-hand branch and the middle branch of the experiment canvas show the feature engineering process on the training and testing datasets respectively. The features that are included or created in the training data can be grouped into two categories; more types of features can be created based on different use cases and data.

- Selected raw features
- Aggregate features

**Selected raw features**

The raw features are those that are included in the original input data. To decide which raw features should be included in the training data, both a detailed description of the data fields and domain knowledge are helpful. In this template, all the sensor measurements (s1-s21) are included in the training data. The other raw features used are: cycle and setting1-setting3.

**Aggregate features**

These features summarize the historical activity of each asset. In the template, two types of aggregate features are created for each of the 21 sensors; they are described below.

- a1-a21: the moving average of the sensor values over the most recent *w* cycles
- sd1-sd21: the standard deviation of the sensor values over the most recent *w* cycles

The time window parameter is set to *w*=5 in the template, but it can be adjusted for different use cases. These aggregate features serve as examples of a large set of potential features that could be created. Other features, such as the change in sensor values within a window, the change from the initial value, the velocity of change, or the number of readings over a defined threshold, could be included as well.

Also, depending on the size of the data, it may prove most useful to aggregate the data itself rather than including any raw rows; in other words, include only aggregated features for the time-varying measurements. For example, if sensor readings arrive every second but failures occur only once every couple of months, it may be best to aggregate the readings to day or week values before modeling, in order to obtain a more balanced dataset between rows that represent failure and rows that represent non-failure. It is best to try different levels of aggregation and evaluate model performance in order to determine the optimal aggregation level.
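The template computes these rolling aggregates inside the experiment; the base-R sketch below reproduces the idea on the training frame from the earlier snippets, using a trailing window of the most recent *w* cycles. The helper name and the choice to leave the first *w*-1 cycles of each engine as NA are decisions made for this illustration and may differ from the template's own handling.

```r
# Minimal sketch (not the template's own script): trailing moving average and
# standard deviation over the most recent w cycles, computed per engine.
w <- 5

roll_stat <- function(x, w, stat) {
  # For cycle t, summarize x[(t - w + 1):t]; earlier cycles are left as NA.
  sapply(seq_along(x), function(t) {
    if (t < w) NA_real_ else stat(x[(t - w + 1):t])
  })
}

by_engine <- split(train, train$id)               # one data frame per engine id
train_agg <- do.call(rbind, lapply(by_engine, function(d) {
  d <- d[order(d$cycle), ]                        # ensure chronological order
  for (i in 1:21) {
    x <- d[[paste0("s", i)]]
    d[[paste0("a",  i)]] <- roll_stat(x, w, mean) # a1..a21: moving average
    d[[paste0("sd", i)]] <- roll_stat(x, w, sd)   # sd1..sd21: moving std. deviation
  }
  d
}))
```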
### Prepare the testing data ###

In this section we summarize the process of preparing the testing data and explain relevant issues that were not previously discussed. The testing data is prepared mainly from two datasets: the data from the second **Reader** module ("test\_FD001.txt") is used to generate the aggregate features, and the data from the third **Reader** module ("RUL\_FD001.txt") is used to generate the labels (RUL, label1, and label2) for the three learning models. Similar to the training data, the time series in the testing data are used to generate aggregate features during the feature engineering process. However, instead of keeping all time series records in the testing data, we only keep the record with the maximum cycle for each engine id. In other words, one record is retained for each engine. The resulting testing data contains 100 records, which matches the number of RUL values in the ground truth data.

### Prepare for Step 2 and Step 3 Experiments ###

Because the output datasets of Step 1 serve as the input datasets of Step 2 and Step 3, we need to save them during Step 1. There are two options to save the output datasets if the **Convert to CSV** module is used, as shown in the following figure. First, by clicking the "Save as Dataset" option, the corresponding dataset is saved in the workspace and appears in the "Saved Datasets" tab on the left-hand side panel of Azure ML Studio; it can then be used in another experiment by dragging it onto the experiment canvas. Second, by clicking the "Download" option, the output dataset can be downloaded to a local computer and then uploaded to Azure blob storage, after which a **Reader** module can be used to import it. Alternatively, if one only wants to use the "Save as Dataset" option, the dataset can also be saved by right-clicking the output port of the previous module (removing the need for the **Convert to CSV** module).

![step 1 save datasets](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step1_saveDataSets.png)

In order to apply in the scoring experiment any transformation that was previously applied in the training experiment, the transformation has to be saved in the training experiment. In this case, we normalized the training data using the *Normalize Data* module, and we apply the same data normalization to the testing data with the *Apply Transformation* module. If the testing data needs to be prepared in a separate experiment, we have to right-click the second output port of the *Normalize Data* module and save the transformation by selecting the "Save as Transformation" option in the menu. The following figure shows how a transformation is saved during Step 1; the saved transformation appears in the "Transforms" tab on the left-hand side panel of Azure ML Studio and can then be applied in the scoring experiments.

![step 1 save Transform](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step1_saveTransform.png)

##<a id="step2"></a> Step 2: Train and evaluate model ##

This step consists of three parallel steps, each of which has a separate experiment. The names of these experiments are shown below.

- Step 2A: Predictive Maintenance: Step 2A of 3, train and evaluate regression models
- Step 2B: Predictive Maintenance: Step 2B of 3, train and evaluate binary classification models
- Step 2C: Predictive Maintenance: Step 2C of 3, train and evaluate multi-class classification models

There are several common steps when training the three different types of models. First of all, the experiments share the same process and data source when importing the training data and testing data through the *Reader* modules (the output data from the Step 1 experiment). Secondly, the *Project Columns* module excludes the irrelevant label columns for each model. For example, in Step 2A the *Project Columns* module excludes the columns "label1" and "label2" and keeps only the label "RUL" in the training data, in order to train the regression models on the "RUL" column. Thirdly, the *Metadata Editor* module is used to mark the corresponding label column as "Labels". Finally, the *Filter Based Feature Selection* module is applied to select the top 35 features in the training data that are most correlated with the label, based on the "Pearson Correlation" measure (a stand-alone sketch of this criterion is shown just below, after the note on parameter tuning).

When training machine learning models such as "Two-Class Boosted Decision Tree" or "Decision Forest Regression", the initial model parameters are set to the default model parameters. In this template, we use the *Train Model* module to train the models with the default parameters. In practice, these parameters can be tuned towards a certain performance metric. In Azure ML, the *Sweep Parameters* module is provided to tune model parameters; [here](http://gallery.azureml.net/Details/59d98181062e47a3a14bcf9183c91acf) is a sample experiment that demonstrates how to use it.
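The *Filter Based Feature Selection* module performs this ranking inside the experiment; as a stand-alone illustration, the sketch below ranks candidate columns by their absolute Pearson correlation with the chosen label and keeps the top 35. The variable names are carried over from the earlier snippets, and the module's exact handling of ties and non-numeric columns may differ.

```r
# Minimal sketch (not the template's module): rank features by absolute Pearson
# correlation with the label and keep the top k, mirroring Filter Based Feature
# Selection with the "Pearson Correlation" measure. Assumes `train_agg` from the
# earlier snippets, with RUL/label1/label2 and the rolling features attached.
k          <- 35
label      <- train_agg$RUL   # e.g., the regression label used in Step 2A
candidates <- setdiff(names(train_agg), c("id", "RUL", "label1", "label2"))

corrs <- sapply(candidates, function(f) {
  if (is.numeric(train_agg[[f]]))
    abs(cor(train_agg[[f]], label, use = "complete.obs"))   # ignore NA window rows
  else NA_real_
})                                                           # constant columns give NA

top_features   <- names(sort(corrs, decreasing = TRUE))[seq_len(min(k, sum(!is.na(corrs))))]
train_selected <- train_agg[, c(top_features, "RUL")]        # selected features plus the label
```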
### Step 2A ###

In this step, we train and evaluate four regression models: Decision Forest Regression, Boosted Decision Tree Regression, Poisson Regression, and Neural Network Regression.

![step 2A](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step2A.png)

The following figure shows the comparison results of the four models, of which "Decision Forest Regression" and "Boosted Decision Tree Regression" perform best in terms of two major metrics: "Mean Absolute Error" and "Root Mean Squared Error".

![step 2A results](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step2A_results.png)

### Step 2B ###

In this step, we illustrate the binary classification modeling from two aspects. First, as shown in box 1 of the following figure, we train and evaluate four binary classification models: Two-Class Logistic Regression, Two-Class Boosted Decision Tree, Two-Class Decision Forest, and Two-Class Neural Network. Second, we show in box 2 how to balance the class distribution by down sampling the records of the majority class.

![step 2B](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step2B.png)

The following figure compares the results of the four models to determine the best one. The "Two-Class Neural Network" algorithm performs best in terms of four metrics: "Accuracy", "Precision", "Recall", and "F-Score".

![step 2B results - comparison of four models](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step2B_results.png)

An imbalanced class distribution is a common issue in many classification tasks. For example, when classifying whether a patient has cancer or not, it is not unusual for the training data to contain highly imbalanced positive/negative examples, which nevertheless reflects the true class distribution in the general population. The same issue exists in predictive maintenance applications: failure events are heavily outnumbered by normal operation events. There are two major reasons for this. First, failure events usually occur rarely compared to the normal operation state of an in-service asset. Second, there are simply too few failure events, because the business cannot afford to let assets run to failure, as this comes at the cost of equipment damage and equipment downtime.

There are two general sampling methods to help balance the class distribution: down sampling the majority class, or up sampling the minority class. The first method is implemented in the Step 2B experiment (box 2); a stand-alone sketch of the same idea appears at the end of this subsection. While not implemented here, the *SMOTE* module in Azure ML is one implementation of the latter method. The *SMOTE* module is based on the algorithm "SMOTE: synthetic minority over-sampling technique" [2]. It is used to increase the number of minority examples in a dataset by synthesizing new examples of the minority class. The *SMOTE* module has two parameters: "SMOTE percentage" and "Number of nearest neighbors". The parameter "SMOTE percentage" should be a multiple of one hundred (100, 200, 300, 400, ...); it specifies the proportion of new minority examples to add. For example, setting the value to 100 doubles the minority class, setting it to 200 triples the minority class, and so on. The parameter "Number of nearest neighbors" is used when generating new examples: each new example is synthesized from an original minority example and its nearest neighbors from the same class.

The figure below shows the results of the "Two-Class Boosted Decision Tree" model when the negative-class examples are down sampled to the same size as the positive-class examples (box 2 of the Step 2B experiment shown above). The results are competitive with those obtained using the complete set of training data, and the metric Recall=0.84 outperforms the other four models.

![step 2B results - down sampling](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step2B_results_downSampling.png)
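In the experiment, the down sampling in box 2 is carried out with Azure ML modules; the sketch below shows the same idea in plain R on the labeled training frame from the earlier snippets. Sampling without replacement and the use of `label1` as the class column are assumptions made for this illustration.

```r
# Minimal sketch (not the template's modules): down sample the majority (negative)
# class so that it matches the size of the minority (positive) class.
set.seed(1)                                    # for a reproducible sample
pos <- train_agg[train_agg$label1 == 1, ]      # minority class: fails within w1 cycles
neg <- train_agg[train_agg$label1 == 0, ]      # majority class: does not fail within w1

neg_down       <- neg[sample(nrow(neg), nrow(pos)), ]             # keep as many negatives as positives
train_balanced <- rbind(pos, neg_down)
train_balanced <- train_balanced[sample(nrow(train_balanced)), ]  # shuffle the rows
```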
### Step 2C ###

In this step, we train and evaluate two multi-class classification models, Multiclass Logistic Regression and Multiclass Neural Network, and two ordinal regression models built on Two-Class Logistic Regression and Two-Class Neural Network. Ordinal regression is a type of regression analysis used to predict an ordinal variable. An ordinal variable is a variable whose values can be ranked or ordered, but for which the real distance between the values is unknown. In the multi-class classification problem formulated here, the class attribute "label2" is an ordinal variable, as its value reflects the severity of the failure progression. Therefore, we consider it appropriate to use ordinal regression, which takes this relative ordering information into account, instead of treating the class attribute as a purely categorical variable. The experiment snapshot is shown in the following figure.

![step 2C](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step2C.png)

The following figure compares the results of "Multiclass Logistic Regression" and "Multiclass Neural Network", where the latter performs better in terms of six metrics: "Overall accuracy", "Average accuracy", "Micro-averaged precision", "Macro-averaged precision", "Micro-averaged recall", and "Macro-averaged recall".

![step 2C results1](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step2C_results1.png)

When evaluating the ordinal regression models, the resulting metrics are similar to those produced for regression models. The following figure shows the comparison between the two ordinal regression models trained above.

![multi-class classification using Ordinal Regression module](https://az712634.vo.msecnd.net/samplesimg/v1/T4/other_ordinalRegressionResult.png)

We have chosen the best-performing algorithm here by trying a variety of options and evaluating their performance. For more details on choosing between different algorithms, see http://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-choice/.

### Prepare for Step 3 Experiments ###

In order to apply any trained model in the scoring experiment, the trained model has to be saved in the training experiment. The following figure shows how trained models are saved during Step 2; the saved models appear in the "Trained Models" tab on the left-hand panel of Azure ML Studio and can then be used in the scoring experiments.

![step 2 save trained model](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step3saveTrainedModel.png)

##<a id="step3"></a> Step 3: Deploy as a web service ##

The trained model is now ready for deployment as a web service. To deploy a model, we need to connect all the data processing, feature engineering, and scoring modules, together with the saved transformations and saved trained models, to form one scoring experiment. Once a machine learning model is deployed as a web service, it can be consumed in a range of ways, such as from a mobile application, a web site, a Power BI dashboard, or even an Excel document. For details on consumption, see http://azure.microsoft.com/en-us/documentation/articles/machine-learning-consume-web-services/.

Because three different types of models were trained, we need to deploy three web services. Each of these parallel steps has a separate experiment. The names of these experiments are shown below.

- Step 3A: Predictive Maintenance: Step 3A of 3, deploy web service with a regression model
- Step 3B: Predictive Maintenance: Step 3B of 3, deploy web service with a binary classification model
- Step 3C: Predictive Maintenance: Step 3C of 3, deploy web service with a multi-class classification model

These three scoring experiments share many common steps. The *Reader* module reads in the testing data ("test\_FD001.txt") as in Step 1, the same feature engineering process is applied to it, and the saved data normalization is applied before scoring the model. The difference among the experiments is that a different trained model is applied in each: the trained regression model (PM\_trainedModel\_PoissonRegreession), the trained binary classification model (PM\_trainedModel\_LogisticRegression), and the trained multi-class classification model (PM\_trainedModel\_MulticlassLogisticRegression) are applied in Step 3A, Step 3B, and Step 3C respectively.

The web service input is set after the pre-processing steps, where the data has been aggregated to a single row per unit to be scored. When the scoring experiment is published as a web service, it therefore expects the user to input the aggregated data rather than the raw features. This makes it easier to use the request-response service (RRS) to score a given unit, as a single row is all that is needed for a prediction. Alternatively, one could set the web service input before the pre-processing steps and use batch scoring, or multiple rows, to predict for a given unit.
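Once a scoring experiment such as Step 3A is published, the web service can be called over HTTPS with a single aggregated row per unit, as described above. The R sketch below only illustrates the general shape of such a request-response call using the `httr` and `jsonlite` packages: the endpoint URL, API key, input name, column list, and values are all placeholders, and the exact request schema should be copied from the API help page that Azure ML generates for the published service.

```r
# Illustrative sketch only: calling a published Azure ML request-response service.
# Replace the URL, API key, input name, and columns with the values shown on the
# service's API help page; the body below is a typical shape, not the exact schema.
library(httr)
library(jsonlite)

api_url <- "https://<region>.services.azureml.net/workspaces/<ws>/services/<id>/execute?api-version=2.0&details=true"
api_key <- "<api key from the web service dashboard>"

# One aggregated row per engine, matching the web service input described above
# (abbreviated, placeholder column names and values).
request_body <- list(
  Inputs = list(
    input1 = list(
      ColumnNames = list("cycle", "setting1", "s1", "a1", "sd1"),
      Values      = list(list(31, 0.0023, 518.67, 519.1, 0.42))
    )
  ),
  GlobalParameters = list()
)

response <- POST(
  api_url,
  add_headers(Authorization = paste("Bearer", api_key),
              `Content-Type` = "application/json"),
  body = toJSON(request_body, auto_unbox = TRUE)
)
content(response, as = "parsed")   # scored labels/probabilities returned by the service
```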
### Step 3A ###

![step 3A](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step3A.png)

### Step 3B ###

![step 3B](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step3B.png)

### Step 3C ###

![step 3C](https://az712634.vo.msecnd.net/samplesimg/v1/T4/step3C.png)

## Summary ##

Microsoft Azure Machine Learning provides a cloud-based platform for data scientists to easily build and deploy end-to-end machine learning solutions, from raw data input to a consumable web service endpoint. This predictive maintenance template used the scenario of aircraft engine operation with failure conditions to illustrate the process of predicting future failure events. The template can be adapted to other predictive maintenance scenarios where data representative of the asset is available for both operating conditions and failure conditions, and the failure probability shows an age-related pattern.

[1] A. Saxena and K. Goebel (2008). "Turbofan Engine Degradation Simulation Data Set", NASA Ames Prognostics Data Repository (http://ti.arc.nasa.gov/tech/dash/pcoe/prognostic-data-repository/), NASA Ames Research Center, Moffett Field, CA.

[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002). "SMOTE: Synthetic Minority Over-sampling Technique". Journal of Artificial Intelligence Research, 16, 321-357.