Binary Classification: Prediction of student performance

By for September 2, 2014
Predict if a student will solve a given problem from the first attempt
#Binary classification: Prediction of student performance In this experiment we show how to do feature engineering over the logs of user events in online system. ## Dataset and problem description In this experiment our dataset is "Algebra 2008-2009" training set from KDD Cup 2010. This dataset can be downloaded from KDD Cup 2010 [website]( The dataset contains log files of student tutoring system. The supplied features include problem id and its brief description, student id, timestamp and how many attempts student did before solving the problem in the right way. The learning problem is to predict if a student will solve a given problem from the first attempt. Following KDD Cup rules, we measure the accuracy of learning algorithms using Root Mean Square Error (RMSE). The original dataset has 8.9M records. To speed up the experiment we downsampled the dataset to the first 100K rows. The dataset has 23 columns of various types: numeric, categorical and timestamp. The columns are tab-separated. ## Initial data preprocessing and upload to Azure The dataset has non-ASCII characters that cannot be handled by **Execute R Script** module. Before using dataset in AzureML we removed non-ASCII characters using the following Powershell commands: gc algebra_2008_2009_train.txt –head 100000 | sc algebra_train_small.txt sc tmp.txt -Encoding ASCII -Value (gc algebra_train_small.txt) cat tmp.txt | %{$_ -replace "\?\?sqrt","+sqrt"} | sc student_performance.txt We upload the resulting file student_perfomance.txt into [Azure blob storage]( using the following Powershell commands: Add-AzureAccount $key = Get-AzureStorageKey -StorageAccountName <your storage account name> $ctxt = New-AzureStorageContext -StorageAccountName $key.StorageAccountName -StorageAccountKey $key.Primary Set-AzureStorageBlobContent –Container <container name in your storage account> -File "student_performance.txt" –Context $ctxt For the purpose of this sample experiment we uploaded student performance.txt file to 'datasets' public container of 'azuremlsampleexperiments' storage account. ##Importing dataset to studio and further preprocessing We read the dataset from the blob storage using **Reader** module: ![Reader][reader_screen] Notice that since our dataset is placed in the public container 'datasets' , we used PublicOrSAS authentication. To read datasets from private containers you need to choose Account option in the authentication field of **Reader** module. ![Reader][data_transformation] In the next steps, shown above, we do a number of transformations to get dataset into format that will fit our learning algorithms. Initially, using **Project Columns** module, we remove a number of columns that will not be used in experiment. We use **Clean Missing Data** to replace all missing values with 0. At the next step we split Unit Name, Section Name column into two columns, Unit Name and Section Name. This is done using the following R code in **Execute R Script** module: dataset <- maml.mapInputPort(1) b <- data.frame(,strsplit(as.vector(dataset[,3]),","))) names(b) <- c("Unit Name","Section Name") data.set <- cbind(dataset[,1:2],b,dataset[,4:15]) maml.mapOutputPort("data.set") ##Feature engineering and modeling After preprocessing the data we split experiment into 4 branches. In each branch we test a different set of features. These feature sets were previously used by [1]. ###Branch 1 - Features based on a single record In the first branch, depicted below, our features are the values of Student ID, Unit Name, Section Name, Problem Name, Problem View and Opportunity columns, as well as textual description of the problem in KC(SubSkills), KC(KTracedSkills) and KC(Rules) columns. Non-integer features are represented as categorical features. Note that in this branch of experiment each example is a single row in the dataset. ![Reader][branch1] In this branch we start with removing Hints column using **Project Columns**, because Hints column is not available when predicting the success of new student. The next step is to split the data into training and test set. In our dataset all examples are time-stamped, hence all splits will be time-based. Moreover, because of the temporal nature of our examples, we will use **Sweep Parameters** module with training and validation sets. The timestamps are located in First Transaction Time column. We use **Execute R Script** module to split the dataset into training, validation and test sets. these set will be subsequently used with **Sweep Parameters** and **Score Model** modules. Examples with timestamps before 1/1/2009 are in the training set, examples with timestamps between 1/1/2009 and 2/28/2009 are in the validation set, examples with timestamps starting from 3/1/2009 are in the test set. Sample R code for splitting the dataset according to a given time threshold: dataset <- maml.mapInputPort(1) test_threshold <- strptime("03/01/09 12:00:00AM", "%m/%d/%y %I:%M:%S%p") data.set <- dataset[dataset[['First Transaction Time']] < test_threshold,] data.set[['First Transaction Time']] <- as.character(as.POSIXct(data.set[['First Transaction Time']],origin="1970-01-01")) maml.mapOutputPort("data.set") Recall that our goal is to generate a model with the best possible root mean squared error (RMSE). RMSE can be optimized by **Sweep Parameters** module when it is used with regression algorithm. Hence, despite having a binary label, we choose to use regression algorithm. In this experiment we use **Boosted Decision Tree Regression** algorithm. The parameters of Sweep Parameters module are shown below: ![Reader][sweeper] ### Branch 2 - features from a single record and two previous records of the user In the second branch in addition to the features from the first branch we use names of the last two steps of the problem that is currently solved by the user along with their description. In the supplied dataset all user's activities are stored in the ascending order of timestamps. Also in the supplied dataset users' activities are not interleaved, namely initially there are all rows of first user, then there are all rows of second user and so on. Thus to find the last steps of the user we leverage Row ID column. For a fixed user this column serves as a time axis. By adding 1 to this column and using **Apply Math Operation**, *we shift each row one time unit forwards*. Then we use **Join** module to join the original dataset with the shifted one using Row ID, Student ID and Problem Name as keys. As a result we get an expanded dataset where each row has a record from times t and t-1 for the same Student ID and Problem Name. We use Left Outer Join to keep rows that don't have previous steps with the same Student ID and Problem ID. We apply the combination of **Apply Math Operation** and **Join** one more time to get features from two steps ago. This sequence of Apply Math Operation and Join modules is shown below: ![Reader][time_shift] The exact parameters of **Apply Math Operation** and **Join** are shown below: ![Reader][apply_math]![Reader][join] After performing the above sequence of **Apply Math Operation** and **Join** modules we get a number of columns that are identical. For example, due to our usage of **Join** module, ProblemName column is replicated 3 times, for steps t, t-1, and t-2. We use **Project Columns** module to remove redundant columns. Finally, since we used Left Outer Join, some of the rows that are generated by **Join** module can have missing values. We use **Clean Missing Data** to replace all missing valued with “0” string. After computing the new set of features the workflow of the second branch is identical to the one of the first branch. ### Branch 3 - features from a single record and two previous records of the user, quadratic and cubic features In the third branch, in addition to the features used in the second branch, we also use quadratic and cubic features that are concatenations of the original features from the first branch. Quadratic and cubic features are computed using **Execute R Script** module with the R code shown below. dataset <- maml.mapInputPort(1) f1 <- paste(dataset[,2],dataset[,3],sep="+") f2 <- paste(dataset[,3],dataset[,4],sep="+") f3 <- paste(dataset[,4],dataset[,5],sep="+") f4 <- paste(dataset[,2],dataset[,3],dataset[,7],sep="+") f5 <- paste(dataset[,3],dataset[,4],dataset[,5],sep="+") f6 <- paste(dataset[,2],dataset[,3],dataset[,4],dataset[,5],sep="+") f7 <- paste(dataset[,2],dataset[,4],sep="+") f8 <- paste(dataset[,2],dataset[,5],sep="+") f9 <- paste(dataset[,2],dataset[,7],sep="+") f10 <- paste(dataset[,2],dataset[,11],sep="+") f11 <- paste(dataset[,2],dataset[,13],sep="+") f12 <- paste(dataset[,2],dataset[,15],sep="+") data.set <- cbind(dataset,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12) maml.mapOutputPort("data.set") After computing the new feature set we proceed with training, scoring and evaluation in the same way as in the first two branches. ### Branch 4 - dense time-dependent features In the fourth branch we use features that are completely different from the ones in the first three branches. For each Student ID we compute the average value of Correct First Attempt up to (but not including) time t. We denote this value as CFA(Student ID,t). Similarly, we denote by Hints(Student ID,t) the average value of Hints for a fixed Student ID up to (but not including) time t. To speed up the computation of these averages, instead of considering the entire dataset we only look at 10 most recent records before time t. The definitions of CFA(Problem Name,t), CFA(Step Name,t), CFA(Description,t), Hints(Problem Name,t), Hints(Step Name,t) and Hints(Description,t) are similar. In summary, given an example x with First Transaction Time=t(x) and column values Student ID(x), Problem Name(x), Step Name(x) and Description(x), we generate the following 8 CFA and Hints features: CFA(StudentID(x),t(x)), CFA(Problem Name(x),t(x)), CFA(Step Name(x),t(x)), CFA(Description(x),t(x)), Hints(StudentID(x),t(x)), Hints(Problem Name(x),t(x)), Hints(Step Name(x),t(x)), Hints(Description(x),t(x)) Similarly we also compute 8 additional CFA and Hints features using concatenations of Student ID and Problem Name, Problem Name and Step Name, Student Id and Unit Name, Student ID and Problem Description. Overall we obtain 16 time-dependent dense features. Following [1], we also use Problem View and Opportunity(Rules) features. Overall, we use 18 features to predict the value of Correct First Attempt column. The computation of time-dependent dense features is done using R code within **Execute R Script** module. This code is highly-optimized and is well-documented inside the module. After computing the new feature set we proceed with training, scoring and evaluation in the same way as in the first three branches. ##Results We use **Evaluate Model** module to compute RMSE values in all four branches. Then we collect the results using **Add Row** module. Also we generate annotations using **Execute R Script** module. The workflow of this final part of experiment is depicted below: ![Reader][final] The final output of the experiment is the left output of the last **Execute R Script** module: ![Reader][results] In this table the first column is the name of the feature set and the second column is a root mean squared error (RMSE) as measured over the test examples. We conclude that dense time-dependent features results in the best performance in terms of RMSE. ##References [1] H.-F. Yu et al. Feature Engineering and Classifier Ensemble for KDD Cup 2010. KDD Cup 2010 Workshop, 2010. <!-- Images --> [reader_screen]: [data_transformation]: [branch1]: [sweeper]: [time_shift]: [apply_math]: [join]: [final]: [results]: