Model selection, evaluation and validation

Besides the oversampler implementations, we provide model selection tools compatible with the sklearn classifier interface.

Given a dataset and a set of candidate oversamplers and classifiers, the tools below enable customizable model selection.
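Throughout these tools, a dataset is represented as a dict with ‘data’, ‘target’ and ‘name’ keys (see the parameter descriptions below). A minimal hand-built example, with made-up values:

```python
# A dataset, as expected by the tools below: a dict with
# 'data' (feature matrix), 'target' (labels) and 'name' keys.
dataset = {
    'data': [[1.0, 2.0],
             [1.5, 1.8],
             [5.0, 8.0],
             [6.0, 9.0]],
    'target': [0, 0, 0, 1],  # imbalanced: a single positive sample
    'name': 'toy_imbalanced'
}
```

In practice the ‘data’ and ‘target’ entries are numpy arrays, as returned by dataset loaders such as those in the imbalanced_datasets package.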

Caching

Evaluating and comparing oversampling techniques on many datasets can take an enormous amount of time. To make the evaluation process reliable, stoppable and restartable, and to reuse results that have already been computed, the model selection and evaluation scripts store partial and final results in a hard-disk cache directory. These functions cannot be used without specifying a cache directory.

Parallelization

The evaluation and model selection scripts execute oversampling and classification jobs in parallel. If n_jobs is 1, the sklearn algorithms are allowed to run in parallel internally; otherwise the sklearn implementations run sequentially, and the oversampling and classification jobs themselves are executed in parallel, using n_jobs processes.
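The dispatch rule above can be sketched as follows. This is a simplified illustration only: a thread pool stands in for the package's worker processes, and `execute_jobs` and the `inner_n_jobs` argument are hypothetical stand-ins for the actual oversampling and classification jobs:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_jobs(jobs, n_jobs):
    """Run the jobs either sequentially, letting each job parallelize
    internally, or in an n_jobs-wide pool with internal parallelism
    restricted to a single core."""
    if n_jobs == 1:
        # sklearn estimators inside each job may use all cores
        return [job(inner_n_jobs=-1) for job in jobs]
    # the jobs themselves run in parallel; each one single-threaded
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(lambda job: job(inner_n_jobs=1), jobs))
```

The design avoids oversubscription: either the outer job loop or the inner sklearn estimators exploit the available cores, never both at once.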

Querying and filtering oversamplers

smote_variants.get_all_oversamplers()[source]

Returns all oversampling classes

Returns: list of all oversampling classes
Return type: list(OverSampling)

Example:

import smote_variants as sv

oversamplers = sv.get_all_oversamplers()

smote_variants.get_n_quickest_oversamplers(n=10)[source]

Returns the n quickest oversamplers based on testing on the datasets of the imbalanced_databases package.

Parameters: n (int) – number of oversamplers to return
Returns: list of the n quickest oversampling classes
Return type: list(OverSampling)

Example:

import smote_variants as sv

oversamplers = sv.get_n_quickest_oversamplers(10)

Cross validation

smote_variants.cross_validate(dataset, sampler, classifier, validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), scaler=StandardScaler(), random_state=None)[source]

Evaluates an oversampling technique with a classifier on a dataset using cross-validation and returns the scores

Parameters:
  • dataset (dict) – a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
  • sampler (obj) – an oversampling class/object
  • classifier (obj) – a classifier object
  • validator (obj) – validator object
  • scaler (obj) – scaler object
  • random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: the cross-validation scores
Return type: pd.DataFrame

Example:

import smote_variants as sv
import imbalanced_datasets as imbd

from sklearn.neighbors import KNeighborsClassifier

dataset = imbd.load_glass2()
sampler = sv.SMOTE_ENN
classifier = KNeighborsClassifier(n_neighbors=3)

results = sv.cross_validate(dataset,
                            sampler,
                            classifier)

Evaluation and validation

smote_variants.evaluate_oversamplers(datasets, samplers, classifiers, cache_path, validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), scaler=None, all_results=False, remove_cache=False, max_samp_par_comb=35, n_jobs=1, random_state=None)[source]

Evaluates oversampling techniques using various classifiers on various datasets

Parameters:
  • datasets (list) – list of datasets and/or dataset loaders - a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
  • samplers (list) – list of oversampling classes/objects
  • classifiers (list) – list of classifier objects
  • cache_path (str) – path to a cache directory
  • validator (obj) – validator object
  • scaler (obj) – scaler object
  • all_results (bool) – True to return all results, False to return an aggregation
  • remove_cache (bool) – True to remove sampling objects after evaluation
  • max_samp_par_comb (int) – maximum number of sampler parameter combinations to be tested
  • n_jobs (int) – number of parallel jobs
  • random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: all results, or the aggregated results if all_results is False
Return type: pd.DataFrame

Example:

import smote_variants as sv
import imbalanced_datasets as imbd

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

datasets = [imbd.load_glass2, imbd.load_ecoli4]
oversamplers = [sv.SMOTE_ENN, sv.NEATER, sv.Lee]
classifiers = [KNeighborsClassifier(n_neighbors=3),
               KNeighborsClassifier(n_neighbors=5),
               DecisionTreeClassifier()]

cache_path = '/home/<user>/smote_validation/'

results = sv.evaluate_oversamplers(datasets,
                                   oversamplers,
                                   classifiers,
                                   cache_path)

Model selection

smote_variants.model_selection(dataset, samplers, classifiers, cache_path, score='auc', validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), remove_cache=False, max_samp_par_comb=35, n_jobs=1, random_state=None)[source]

Evaluates oversampling techniques on various classifiers and a dataset and returns the oversampling and classifier objects giving the best performance

Parameters:
  • dataset (dict) – a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
  • samplers (list) – list of oversampling classes/objects
  • classifiers (list) – list of classifier objects
  • cache_path (str) – path to a cache directory
  • score (str) – ‘auc’/’acc’/’gacc’/’f1’/’brier’/’p_top20’
  • validator (obj) – validator object
  • remove_cache (bool) – True to remove sampling objects after evaluation
  • max_samp_par_comb (int) – maximum number of sampler parameter combinations to be tested
  • n_jobs (int) – number of parallel jobs
  • random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: the best performing sampler object and the best performing classifier object
Return type: obj, obj

Example:

import smote_variants as sv
import imbalanced_datasets as imbd

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

dataset = imbd.load_glass2()
oversamplers = [sv.SMOTE_ENN, sv.NEATER, sv.Lee]
classifiers = [KNeighborsClassifier(n_neighbors=3),
               KNeighborsClassifier(n_neighbors=5),
               DecisionTreeClassifier()]

cache_path = '/home/<user>/smote_validation/'

sampler, classifier = sv.model_selection(dataset,
                                         oversamplers,
                                         classifiers,
                                         cache_path,
                                         'auc')