Model selection, evaluation and validation

Besides the oversampler implementations, we have prepared some model selection tools compatible with the sklearn classifier interface.

Given a dataset and a set of candidate oversamplers and classifiers, the tools below enable customizable model selection.

Caching

Evaluating and comparing oversampling techniques on many datasets can take an enormous amount of time. To make the evaluation process more reliable, to allow it to be stopped and restarted, and to let the oversampling techniques reuse results that have already been computed, we have implemented model selection and evaluation scripts that store partial and final results in a hard-disk cache directory. These functions cannot be used without specifying a cache directory.
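For illustration, a cache directory can be prepared as follows and passed to the evaluation and model selection functions; the cache_path keyword used in the later examples is our reading of the interface and should be treated as an assumption:

    import os

    # any writable directory can serve as the cache; partial results are
    # stored here so that an interrupted run can be restarted
    cache_path = os.path.join(os.path.expanduser('~'), 'smote_cache')
    os.makedirs(cache_path, exist_ok=True)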

Parallelization

The evaluation and model selection scripts execute oversampling and classification jobs in parallel. If the number of jobs specified is 1, the jobs run sequentially and the sklearn algorithms are allowed to parallelize internally; otherwise, the sklearn implementations run sequentially, and the oversampling and classification jobs are executed in parallel, using n_jobs processes.
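As a sketch of how this might look, using the evaluate_oversamplers call illustrated in the Evaluation and validation section below (the exact keyword names and the dataset dict format are assumptions):

    import sklearn.datasets as datasets
    from sklearn.neighbors import KNeighborsClassifier

    import smote_variants as sv

    dataset = datasets.load_breast_cancer()
    # dict format with a 'name' entry, assumed to be used by the cache
    dataset = {'data': dataset['data'], 'target': dataset['target'],
               'name': 'breast_cancer'}

    # n_jobs=1: jobs run one after another, sklearn parallelizes internally;
    # n_jobs=4: four oversampling/classification jobs run in parallel,
    # while the sklearn implementations run sequentially
    results = sv.evaluate_oversamplers(datasets=[dataset],
                                       samplers=[sv.SMOTE()],
                                       classifiers=[KNeighborsClassifier()],
                                       cache_path='smote_cache',
                                       n_jobs=4)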

Querying and filtering oversamplers
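The package provides query functions for listing the implemented oversamplers and for filtering them, for example by expected runtime. A minimal sketch, assuming the query functions get_all_oversamplers and get_n_quickest_oversamplers:

    import smote_variants as sv

    # all implemented oversampling techniques
    oversamplers = sv.get_all_oversamplers()

    # restrict the candidates to the five techniques with the
    # lowest expected runtimes
    oversamplers = sv.get_n_quickest_oversamplers(5)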

Evaluation and validation
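The evaluation script cross-validates all combinations of the candidate oversamplers and classifiers on the given datasets, caching partial results along the way. A hedged sketch, assuming the evaluate_oversamplers interface and the dataset dict format used above; the exact keywords and the structure of the returned results are assumptions:

    import sklearn.datasets as datasets
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    import smote_variants as sv

    dataset = datasets.load_breast_cancer()
    dataset = {'data': dataset['data'], 'target': dataset['target'],
               'name': 'breast_cancer'}

    # evaluate every oversampler/classifier combination; the scores of
    # the combinations are collected in the returned results object
    results = sv.evaluate_oversamplers(datasets=[dataset],
                                       samplers=[sv.SMOTE(),
                                                 sv.distance_SMOTE()],
                                       classifiers=[KNeighborsClassifier(),
                                                    DecisionTreeClassifier()],
                                       cache_path='smote_cache',
                                       n_jobs=2)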

Model selection
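Model selection chooses, from the candidate oversamplers and classifiers, the combination performing best on the given dataset according to a chosen score. A sketch under the same assumptions; the score keyword and the returned pair of oversampler and classifier objects reflect our reading of the interface:

    import sklearn.datasets as datasets
    from sklearn.neighbors import KNeighborsClassifier

    import smote_variants as sv

    dataset = datasets.load_breast_cancer()
    dataset = {'data': dataset['data'], 'target': dataset['target'],
               'name': 'breast_cancer'}

    # select the oversampler and classifier combination maximizing AUC
    samp_obj, cl_obj = sv.model_selection(dataset=dataset,
                                          samplers=[sv.SMOTE(),
                                                    sv.distance_SMOTE()],
                                          classifiers=[KNeighborsClassifier()],
                                          cache_path='smote_cache',
                                          score='auc',
                                          n_jobs=2)

    # the selected oversampler and classifier can then be used as usual
    X_samp, y_samp = samp_obj.sample(dataset['data'], dataset['target'])
    cl_obj.fit(X_samp, y_samp)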