Model selection, evaluation and validation

Besides the oversampler implementations, we provide model selection tools compatible with the sklearn classifier interface.

Given a dataset and a set of candidate oversamplers and classifiers, the tools below enable customizable model selection.
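Throughout these tools, a dataset is represented as a dict with ‘data’, ‘target’ and ‘name’ keys (see the parameter descriptions below). A minimal hand-built example, with made-up values:

```python
# A dataset, as expected by the tools below: a dict with
# 'data' (feature matrix), 'target' (labels) and 'name' keys.
dataset = {
    'data': [[1.0, 2.0],
             [1.5, 1.8],
             [5.0, 8.0],
             [6.0, 9.0]],
    'target': [0, 0, 0, 1],  # imbalanced: a single positive sample
    'name': 'toy_imbalanced'
}
```

In practice the ‘data’ and ‘target’ entries are numpy arrays, as returned by dataset loaders such as those in the imbalanced_datasets package.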

Caching

Evaluating and comparing oversampling techniques on many datasets can take an enormous amount of time. To make the evaluation process reliable, stoppable and restartable, and to reuse results that have already been computed, the model selection and evaluation scripts store partial and final results in a hard-disk cache directory. These functions cannot be used without specifying a cache directory.

Parallelization

The evaluation and model selection scripts execute oversampling and classification jobs in parallel. If n_jobs is 1, the sklearn algorithms are allowed to run in parallel internally; otherwise the sklearn implementations run sequentially, and the oversampling and classification jobs themselves are executed in parallel, using n_jobs processes.
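The dispatch rule above can be sketched as follows. This is a simplified illustration only: a thread pool stands in for the package's worker processes, and `execute_jobs` and the `inner_n_jobs` argument are hypothetical stand-ins for the actual oversampling and classification jobs:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_jobs(jobs, n_jobs):
    """Run the jobs either sequentially, letting each job parallelize
    internally, or in an n_jobs-wide pool with internal parallelism
    restricted to a single core."""
    if n_jobs == 1:
        # sklearn estimators inside each job may use all cores
        return [job(inner_n_jobs=-1) for job in jobs]
    # the jobs themselves run in parallel; each one single-threaded
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(lambda job: job(inner_n_jobs=1), jobs))
```

The design avoids oversubscription: either the outer job loop or the inner sklearn estimators exploit the available cores, never both at once.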

Querying and filtering oversamplers

smote_variants.get_all_oversamplers()[source]

Returns all oversampling classes

Returns: list of all oversampling classes
Return type: list(OverSampling)

Example:

import smote_variants as sv

oversamplers = sv.get_all_oversamplers()

smote_variants.get_n_quickest_oversamplers(n=10)[source]

Returns the n quickest oversamplers based on testing on the datasets of the imbalanced_databases package.

Parameters: n (int) – number of oversamplers to return
Returns: list of the n quickest oversampling classes
Return type: list(OverSampling)

Example:

import smote_variants as sv

oversamplers = sv.get_n_quickest_oversamplers(10)

Cross validation

smote_variants.cross_validate(dataset, sampler, classifier, validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), scaler=StandardScaler(), random_state=None)[source]

Evaluates an oversampling technique with a classifier on a dataset using cross-validation and returns the scores

Parameters:
  • dataset (dict) – a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
  • sampler (obj) – an oversampling class/object
  • classifier (obj) – a classifier object
  • validator (obj) – validator object
  • scaler (obj) – scaler object
  • random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: the cross-validation scores
Return type: pd.DataFrame

Example:

import smote_variants as sv
import imbalanced_datasets as imbd

from sklearn.neighbors import KNeighborsClassifier

dataset = imbd.load_glass2()
sampler = sv.SMOTE_ENN
classifier = KNeighborsClassifier(n_neighbors=3)

results = sv.cross_validate(dataset,
                            sampler,
                            classifier)

Evaluation and validation

smote_variants.evaluate_oversamplers(datasets, samplers, classifiers, cache_path, validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), scaler=None, all_results=False, remove_cache=False, max_samp_par_comb=35, n_jobs=1, random_state=None)[source]

Evaluates oversampling techniques using various classifiers on various datasets

Parameters:
  • datasets (list) – list of datasets and/or dataset loaders - a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
  • samplers (list) – list of oversampling classes/objects
  • classifiers (list) – list of classifier objects
  • cache_path (str) – path to a cache directory
  • validator (obj) – validator object
  • scaler (obj) – scaler object
  • all_results (bool) – True to return all results, False to return an aggregation
  • remove_cache (bool) – True to remove sampling objects after evaluation
  • max_samp_par_comb (int) – maximum number of sampler parameter combinations to be tested
  • n_jobs (int) – number of parallel jobs
  • random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: all results, or the aggregated results if all_results is False
Return type: pd.DataFrame

Example:

import smote_variants as sv
import imbalanced_datasets as imbd

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

datasets = [imbd.load_glass2, imbd.load_ecoli4]
oversamplers = [sv.SMOTE_ENN, sv.NEATER, sv.Lee]
classifiers = [KNeighborsClassifier(n_neighbors=3),
               KNeighborsClassifier(n_neighbors=5),
               DecisionTreeClassifier()]

cache_path = '/home/<user>/smote_validation/'

results = sv.evaluate_oversamplers(datasets,
                                   oversamplers,
                                   classifiers,
                                   cache_path)

Model selection

smote_variants.model_selection(dataset, samplers, classifiers, cache_path, score='auc', validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), remove_cache=False, max_samp_par_comb=35, n_jobs=1, random_state=None)[source]

Evaluates oversampling techniques on various classifiers and a dataset and returns the oversampling and classifier objects giving the best performance

Parameters:
  • dataset (dict) – a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
  • samplers (list) – list of oversampling classes/objects
  • classifiers (list) – list of classifier objects
  • cache_path (str) – path to a cache directory
  • score (str) – ‘auc’/’acc’/’gacc’/’f1’/’brier’/’p_top20’
  • validator (obj) – validator object
  • remove_cache (bool) – True to remove sampling objects after evaluation
  • max_samp_par_comb (int) – maximum number of sampler parameter combinations to be tested
  • n_jobs (int) – number of parallel jobs
  • random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: the best performing sampler object and the best performing classifier object
Return type: obj, obj

Example:

import smote_variants as sv
import imbalanced_datasets as imbd

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

dataset = imbd.load_glass2()
oversamplers = [sv.SMOTE_ENN, sv.NEATER, sv.Lee]
classifiers = [KNeighborsClassifier(n_neighbors=3),
               KNeighborsClassifier(n_neighbors=5),
               DecisionTreeClassifier()]

cache_path = '/home/<user>/smote_validation/'

sampler, classifier = sv.model_selection(dataset,
                                         oversamplers,
                                         classifiers,
                                         cache_path,
                                         'auc')