Compare Datasets

September 28, 2016

This module compares two datasets and summarizes the comparison results.
This experiment shows how to quickly compare two datasets and find the differences using a bitmap, a summary report, and heatmap graphs. It uses a new custom module, "Compare Datasets" (source code located [here](https://gist.github.com/nk773/9eea2a0dd6e67a16fca37a917fcc9730)).

The overall experiment is simple. Two R modules each generate a random binary matrix of 100 elements, and the new "Compare Datasets" module compares these matrices. The experiment is shown below.

![](http://neerajkh.blob.core.windows.net/images/CompareGraphCapture.PNG)

The new module **Compare Datasets** takes the two datasets to be compared as its only inputs and generates three outputs.

The first output is a boolean matrix indicating which positions have a data mismatch.

![](http://neerajkh.blob.core.windows.net/images/CompareOutputCapture3.PNG)

The second output summarizes the results on a per-column basis, giving the number of matches in each column.

![](http://neerajkh.blob.core.windows.net/images/CompareResultsCapture2.PNG)

The third output is a visual representation of the results: a heatmap, and a bar graph of the number of values per column that are not different.

![](http://neerajkh.blob.core.windows.net/images/CompareG2Capture.PNG)

![](http://neerajkh.blob.core.windows.net/images/Compareg1Capture.PNG)

The module expects both datasets to have the same dimensions. If the dimensions differ, it finds the minimum set of rows and columns and compares only those.
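
The comparison logic described above can be sketched in a few lines. This is an illustrative Python sketch, not the module's actual R source (which is in the linked gist). The function names `random_binary_matrix` and `compare_datasets`, the 10 x 10 shape used for the 100 elements, and the convention that `True` marks a mismatch are all assumptions made for this example. The plotting output is omitted.

```python
import random


def random_binary_matrix(rows, cols, seed):
    """Build a rows x cols matrix of random 0/1 values (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(cols)] for _ in range(rows)]


def compare_datasets(a, b):
    """Compare two matrices given as lists of rows (hypothetical helper).

    Returns (mismatch, matches_per_column):
      mismatch[i][j] is True where a[i][j] != b[i][j] (assumed convention);
      matches_per_column[j] counts the rows in which column j agrees.
    """
    # If the shapes differ, compare only the rows and columns both inputs share.
    n_rows = min(len(a), len(b))
    n_cols = min(len(a[0]), len(b[0])) if n_rows else 0

    # First output: boolean matrix flagging mismatching positions.
    mismatch = [
        [a[i][j] != b[i][j] for j in range(n_cols)]
        for i in range(n_rows)
    ]

    # Second output: number of matching values in each column.
    matches_per_column = [
        sum(1 for i in range(n_rows) if not mismatch[i][j])
        for j in range(n_cols)
    ]
    return mismatch, matches_per_column


if __name__ == "__main__":
    # Two random binary matrices of 100 elements each (10 x 10 is an assumption).
    left = random_binary_matrix(10, 10, seed=1)
    right = random_binary_matrix(10, 10, seed=2)
    mismatch, matches = compare_datasets(left, right)
    print("matches per column:", matches)
```

The per-column counts are what a bar graph like the one above would plot, and the boolean matrix is the natural input for a heatmap.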