Model selection, evaluation and validation¶
Besides the oversampler implementations, we have prepared some tools for model selection, compatible with the sklearn classifier interface.
Given a dataset and a set of candidate oversamplers and classifiers, the tools below enable customizable model selection.
Caching¶
The evaluation and comparison of oversampling techniques on many datasets can take an enormous amount of time. To make the evaluation process reliable, stoppable and restartable, and to let the oversampling techniques reuse results that have already been computed, the model selection and evaluation scripts use a hard-disk cache directory to store partial and final results. These functions cannot be used without specifying a cache directory.
Parallelization¶
The evaluation and model selection scripts execute the oversampling and classification jobs in parallel. If the number of jobs specified is 1, the sklearn algorithms themselves are allowed to run in parallel; otherwise, the sklearn implementations run sequentially, and the oversampling and classification jobs are executed in parallel using n_jobs processes.
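The dispatch logic can be sketched as follows; `execute_jobs` is an illustrative name under these assumptions, not the actual smote_variants internal:

```python
from concurrent.futures import ProcessPoolExecutor

def execute_jobs(jobs, n_jobs=1):
    """Dispatch evaluation jobs according to the n_jobs setting.

    With n_jobs == 1 the jobs run sequentially, leaving any inner sklearn
    parallelism free to use all cores; with n_jobs > 1 the jobs themselves
    are distributed over n_jobs worker processes.
    """
    if n_jobs == 1:
        return [job() for job in jobs]
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(job) for job in jobs]
        return [f.result() for f in futures]
```

Parallelizing at the job level usually pays off when many sampler/classifier combinations are evaluated, since each individual job is comparatively small.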
Querying and filtering oversamplers¶
- smote_variants.get_all_oversamplers()[source]¶
Returns all oversampling classes.
Returns: list of all oversampling classes
Return type: list(OverSampling)
Example:

    import smote_variants as sv

    oversamplers = sv.get_all_oversamplers()
- smote_variants.get_n_quickest_oversamplers(n=10)[source]¶
Returns the n quickest oversamplers based on testing on the datasets of the imbalanced_databases package.
Parameters: n (int) – number of oversamplers to return
Returns: list of the n quickest oversampling classes
Return type: list(OverSampling)
Example:

    import smote_variants as sv

    oversamplers = sv.get_n_quickest_oversamplers(10)
Cross validation¶
- smote_variants.cross_validate(dataset, sampler, classifier, validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), scaler=StandardScaler(), random_state=None)[source]¶
Evaluates an oversampling technique and a classifier on a dataset by cross-validation and returns the cross-validation scores.
Parameters: - dataset (dict) – a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
- sampler (obj) – oversampling object
- classifier (obj) – classifier object
- validator (obj) – validator object
- scaler (obj) – scaler object
- random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: the cross-validation scores
Return type: pd.DataFrame
Example:

    import smote_variants as sv
    import imbalanced_datasets as imbd

    from sklearn.neighbors import KNeighborsClassifier

    dataset = imbd.load_glass2()
    sampler = sv.SMOTE_ENN()
    classifier = KNeighborsClassifier(n_neighbors=3)

    results = sv.cross_validate(dataset, sampler, classifier)
Evaluation and validation¶
- smote_variants.evaluate_oversamplers(datasets, samplers, classifiers, cache_path, validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), scaler=None, all_results=False, remove_cache=False, max_samp_par_comb=35, n_jobs=1, random_state=None)[source]¶
Evaluates oversampling techniques using various classifiers on various datasets.
Parameters: - datasets (list) – list of datasets and/or dataset loaders - a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
- samplers (list) – list of oversampling classes/objects
- classifiers (list) – list of classifier objects
- cache_path (str) – path to a cache directory
- validator (obj) – validator object
- scaler (obj) – scaler object
- all_results (bool) – True to return all results, False to return an aggregation
- remove_cache (bool) – True to remove sampling objects after evaluation
- max_samp_par_comb (int) – maximum number of sampler parameter combinations to be tested
- n_jobs (int) – number of parallel jobs
- random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: all results, or the aggregated results if all_results is False
Return type: pd.DataFrame
Example:
    import smote_variants as sv
    import imbalanced_datasets as imbd

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier

    datasets = [imbd.load_glass2, imbd.load_ecoli4]
    oversamplers = [sv.SMOTE_ENN, sv.NEATER, sv.Lee]
    classifiers = [KNeighborsClassifier(n_neighbors=3),
                   KNeighborsClassifier(n_neighbors=5),
                   DecisionTreeClassifier()]
    cache_path = '/home/<user>/smote_validation/'

    results = sv.evaluate_oversamplers(datasets, oversamplers, classifiers, cache_path)
Model selection¶
- smote_variants.model_selection(dataset, samplers, classifiers, cache_path, score='auc', validator=RepeatedStratifiedKFold(n_repeats=3, n_splits=5, random_state=None), remove_cache=False, max_samp_par_comb=35, n_jobs=1, random_state=None)[source]¶
Evaluates oversampling techniques on various classifiers and a dataset, and returns the oversampling and classifier objects giving the best performance.
Parameters: - dataset (dict) – a dataset is a dict with ‘data’, ‘target’ and ‘name’ keys
- samplers (list) – list of oversampling classes/objects
- classifiers (list) – list of classifier objects
- cache_path (str) – path to a cache directory
- score (str) – ‘auc’/’acc’/’gacc’/’f1’/’brier’/’p_top20’
- validator (obj) – validator object
- remove_cache (bool) – True to remove sampling objects after evaluation
- max_samp_par_comb (int) – maximum number of sampler parameter combinations to be tested
- n_jobs (int) – number of parallel jobs
- random_state (int/np.random.RandomState/None) – initializer of the random state
Returns: the best performing sampler object and the best performing classifier object
Return type: obj, obj
Example:
    import smote_variants as sv
    import imbalanced_datasets as imbd

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier

    dataset = imbd.load_glass2()
    oversamplers = [sv.SMOTE_ENN, sv.NEATER, sv.Lee]
    classifiers = [KNeighborsClassifier(n_neighbors=3),
                   KNeighborsClassifier(n_neighbors=5),
                   DecisionTreeClassifier()]
    cache_path = '/home/<user>/smote_validation/'

    sampler, classifier = sv.model_selection(dataset, oversamplers, classifiers,
                                             cache_path, 'auc')