Oversamplers

NoSMOTE

API

class smote_variants.NoSMOTE(random_state=None)

__init__(random_state=None)
    Constructor of the NoSMOTE object.
    Parameters: random_state (int/np.random.RandomState/None) – dummy parameter for the compatibility of interfaces

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.NoSMOTE()
>>> X_samp, y_samp = oversampler.sample(X, y)

The goal of this class is to allow data to be sent through any model selection/evaluation pipeline with no oversampling carried out. It can be used to obtain baseline estimates of performance.
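The pass-through behavior can be illustrated with a minimal stand-in honoring the same interface (the class name `NoOpOversampler` is hypothetical; this is a sketch, not the library's code): `sample(X, y)` returns the data unchanged, so a pipeline can be run identically with and without oversampling.

```python
class NoOpOversampler:
    """Hypothetical pass-through sampler mirroring the NoSMOTE interface."""

    def __init__(self, random_state=None):
        # dummy parameter, kept only for interface compatibility
        self.random_state = random_state

    def sample(self, X, y):
        # no oversampling carried out: return the data unchanged
        return list(X), list(y)

    def get_params(self, deep=False):
        return {'random_state': self.random_state}
```

Plugging this into an evaluation loop in place of a real oversampler yields the baseline scores the text refers to.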
SMOTE

API

class smote_variants.SMOTE(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the SMOTE object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor technique
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.SMOTE()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@article{smote, author={Chawla, N. V. and Bowyer, K. W. and Hall, L. O. and Kegelmeyer, W. P.}, title={{SMOTE}: synthetic minority over-sampling technique}, journal={Journal of Artificial Intelligence Research}, volume={16}, year={2002}, pages={321--357} }
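The class implements the classic SMOTE idea: each synthetic sample is a random point on the segment between a minority sample and one of its n_neighbors nearest minority neighbors. A pure-Python sketch of that generation step (an illustration assuming Euclidean distance and list-of-lists inputs; not the library's implementation):

```python
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def smote_generate(X_min, n_to_sample, n_neighbors=5, seed=None):
    """Interpolate between random minority points and their nearest
    minority neighbors to create synthetic samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_to_sample):
        base = rng.choice(X_min)
        # nearest minority neighbors of the base point, excluding itself
        neighbors = sorted((x for x in X_min if x is not base),
                           key=lambda x: euclidean(base, x))[:n_neighbors]
        neighbor = rng.choice(neighbors)
        lam = rng.random()  # random position along the segment
        samples.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return samples
```

Because each new point lies on a segment between two existing minority points, synthetic samples never leave the convex hull of the minority class.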
SMOTE_TomekLinks

API

class smote_variants.SMOTE_TomekLinks(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the SMOTE_TomekLinks object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor technique
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.SMOTE_TomekLinks()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@article{smote_tomeklinks_enn, author = {Batista, Gustavo E. A. P. A. and Prati, Ronaldo C. and Monard, Maria Carolina}, title = {A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data}, journal = {SIGKDD Explor. Newsl.}, issue_date = {June 2004}, volume = {6}, number = {1}, month = jun, year = {2004}, issn = {1931-0145}, pages = {20--29}, numpages = {10}, url = {http://doi.acm.org/10.1145/1007730.1007735}, doi = {10.1145/1007730.1007735}, acmid = {1007735}, publisher = {ACM}, address = {New York, NY, USA}, }
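After SMOTE sampling, the technique removes Tomek links: pairs of samples from different classes that are each other's nearest neighbor. A pure-Python sketch of link detection (illustrative, Euclidean distance assumed; not the library's implementation):

```python
def d2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_index(i, X):
    """Index of the point nearest to X[i], excluding X[i] itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: d2(X[i], X[j]))

def tomek_links(X, y):
    """Index pairs (i, j), i < j, with different labels that are
    mutual nearest neighbors; dropping such pairs (or their majority
    members) cleans the class boundary."""
    links = []
    for i in range(len(X)):
        j = nearest_index(i, X)
        if i < j and y[i] != y[j] and nearest_index(j, X) == i:
            links.append((i, j))
    return links
```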
SMOTE_ENN

API

class smote_variants.SMOTE_ENN(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the SMOTE_ENN object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor technique
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.SMOTE_ENN()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@article{smote_tomeklinks_enn, author = {Batista, Gustavo E. A. P. A. and Prati, Ronaldo C. and Monard, Maria Carolina}, title = {A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data}, journal = {SIGKDD Explor. Newsl.}, issue_date = {June 2004}, volume = {6}, number = {1}, month = jun, year = {2004}, issn = {1931-0145}, pages = {20--29}, numpages = {10}, url = {http://doi.acm.org/10.1145/1007730.1007735}, doi = {10.1145/1007730.1007735}, acmid = {1007735}, publisher = {ACM}, address = {New York, NY, USA}, }
Notes: can remove too many of the minority samples.
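The ENN (Edited Nearest Neighbors) cleaning step removes every sample whose label disagrees with the majority label of its k nearest neighbors, which is why it can also strip minority samples. A pure-Python sketch of the filter (illustrative only; Euclidean distance assumed):

```python
from collections import Counter

def enn_filter(X, y, k=3):
    """Keep only samples whose label agrees with the majority label of
    their k nearest neighbors (Edited Nearest Neighbors cleaning)."""
    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    keep = []
    for i in range(len(X)):
        # indices of the k nearest neighbors of X[i], excluding itself
        idx = sorted((j for j in range(len(X)) if j != i),
                     key=lambda j: d2(X[i], X[j]))[:k]
        majority = Counter(y[j] for j in idx).most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]
```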
Borderline_SMOTE1

API

class smote_variants.Borderline_SMOTE1(proportion=1.0, n_neighbors=5, k_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, k_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor technique for determining the borderline
    - k_neighbors (int) – control parameter of the nearest neighbor technique for sampling
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.Borderline_SMOTE1()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{borderlineSMOTE, author="Han, Hui and Wang, Wen-Yuan and Mao, Bing-Huan", editor="Huang, De-Shuang and Zhang, Xiao-Ping and Huang, Guang-Bin", title="Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning", booktitle="Advances in Intelligent Computing", year="2005", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="878--887", isbn="978-3-540-31902-3" }
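Borderline-SMOTE oversamples only the "danger" minority points: those with at least half, but not all, majority samples among their n_neighbors nearest neighbors. A sketch of this classification step (illustrative, Euclidean distance assumed; not the library's code):

```python
def classify_minority(X, y, minority_label, m=5):
    """Tag each minority point 'safe', 'danger' or 'noise' according to
    the number of majority samples among its m nearest neighbors."""
    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    tags = {}
    for i in range(len(X)):
        if y[i] != minority_label:
            continue
        idx = sorted((j for j in range(len(X)) if j != i),
                     key=lambda j: d2(X[i], X[j]))[:m]
        n_maj = sum(1 for j in idx if y[j] != minority_label)
        if n_maj == m:
            tags[i] = 'noise'   # entirely surrounded by the majority
        elif n_maj >= m / 2:
            tags[i] = 'danger'  # borderline: used as a sampling base
        else:
            tags[i] = 'safe'
    return tags
```

Only the 'danger' points serve as interpolation bases; variant 1 interpolates toward minority neighbors only.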
Borderline_SMOTE2

API

class smote_variants.Borderline_SMOTE2(proportion=1.0, n_neighbors=5, k_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, k_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor technique for determining the borderline
    - k_neighbors (int) – control parameter of the nearest neighbor technique for sampling
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.Borderline_SMOTE2()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{borderlineSMOTE, author="Han, Hui and Wang, Wen-Yuan and Mao, Bing-Huan", editor="Huang, De-Shuang and Zhang, Xiao-Ping and Huang, Guang-Bin", title="Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning", booktitle="Advances in Intelligent Computing", year="2005", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="878--887", isbn="978-3-540-31902-3" }
ADASYN

API

class smote_variants.ADASYN(n_neighbors=5, d_th=0.9, beta=1.0, n_jobs=1, random_state=None)

__init__(n_neighbors=5, d_th=0.9, beta=1.0, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - n_neighbors (int) – control parameter of the nearest neighbor component
    - d_th (float) – tolerated deviation level from balancedness
    - beta (float) – target level of balancedness, same as proportion in other techniques
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.ADASYN()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{adasyn, author={He, H. and Bai, Y. and Garcia, E. A. and Li, S.}, title={{ADASYN}: adaptive synthetic sampling approach for imbalanced learning}, booktitle={Proceedings of IJCNN}, year={2008}, pages={1322--1328} }
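What distinguishes ADASYN from plain SMOTE is its adaptive allocation: the number of synthetic samples generated around each minority point is proportional to the fraction of majority samples among its nearest neighbors, so harder-to-learn points get more attention. A sketch of that allocation (illustrative only, Euclidean distance assumed):

```python
def adasyn_allocation(X, y, minority_label, n_to_sample, k=5):
    """Distribute n_to_sample synthetic samples over the minority points,
    proportionally to the majority fraction of each point's k nearest
    neighbors (harder points receive more synthetic samples)."""
    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    minority = [i for i in range(len(X)) if y[i] == minority_label]
    ratios = []
    for i in minority:
        idx = sorted((j for j in range(len(X)) if j != i),
                     key=lambda j: d2(X[i], X[j]))[:k]
        ratios.append(sum(1 for j in idx if y[j] != minority_label) / k)
    total = sum(ratios)
    if total == 0:  # all minority points are "safe": spread uniformly
        return {i: n_to_sample // len(minority) for i in minority}
    # note: rounding may make the counts sum to slightly more or less
    # than n_to_sample; a full implementation would correct for this
    return {i: round(n_to_sample * r / total)
            for i, r in zip(minority, ratios)}
```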
AHC

API

class smote_variants.AHC(strategy='min', n_jobs=1, random_state=None)

__init__(strategy='min', n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

classmethod parameter_combinations(raw=False)
    Generates reasonable parameter combinations.
    Returns: a list of meaningful parameter combinations
    Return type: list(dict)

sample(X, y)
    Does the sample generation according to the class parameters.
    Parameters:
    - X (np.ndarray) – training set
    - y (np.array) – target labels
    Returns: the extended training set and target labels
    Return type: (np.ndarray, np.array)

Example

>>> oversampler = smote_variants.AHC()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@article{AHC, title = "Learning from imbalanced data in surveillance of nosocomial infection", journal = "Artificial Intelligence in Medicine", volume = "37", number = "1", pages = "7 - 18", year = "2006", note = "Intelligent Data Analysis in Medicine", issn = "0933-3657", doi = "https://doi.org/10.1016/j.artmed.2005.03.002", url = {http://www.sciencedirect.com/science/article/ pii/S0933365705000850}, author = "Gilles Cohen and Mélanie Hilario and Hugo Sax and Stéphane Hugonnet and Antoine Geissbuhler", keywords = "Nosocomial infection, Machine learning, Support vector machines, Data imbalance" }
LLE_SMOTE

API

class smote_variants.LLE_SMOTE(proportion=1.0, n_neighbors=5, n_components=2, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_components=2, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor component
    - n_components (int) – dimensionality of the embedding space
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.LLE_SMOTE()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{lle_smote, author={Wang, J. and Xu, M. and Wang, H. and Zhang, J.}, booktitle={2006 8th international Conference on Signal Processing}, title={Classification of Imbalanced Data by Using the SMOTE Algorithm and Locally Linear Embedding}, year={2006}, volume={3}, number={}, pages={}, keywords={artificial intelligence; biomedical imaging;medical computing; imbalanced data classification; SMOTE algorithm; locally linear embedding; medical imaging intelligence; synthetic minority oversampling technique; high-dimensional data; low-dimensional space; Biomedical imaging; Back;Training data; Data mining;Biomedical engineering; Research and development; Electronic mail;Pattern recognition; Performance analysis; Classification algorithms}, doi={10.1109/ICOSP.2006.345752}, ISSN={2164-5221}, month={Nov}}
Notes: there might be numerical issues if the nearest neighbors contain some element multiple times.
distance_SMOTE

API

class smote_variants.distance_SMOTE(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor component
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.distance_SMOTE()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{distance_smote, author={de la Calleja, J. and Fuentes, O.}, booktitle={Proceedings of the Twentieth International Florida Artificial Intelligence}, title={A distance-based over-sampling method for learning from imbalanced data sets}, year={2007}, volume={3}, pages={634--635} }
Notes: it is not clear what the authors mean by "weighted distance".
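Given the ambiguity noted above, one plausible reading (an assumption, not the library's implementation) is that each synthetic point is placed at the mean of a minority sample and its nearest minority neighbors, pulling new samples toward dense minority regions:

```python
import random

def distance_smote_generate(X_min, n_to_sample, n_neighbors=3, seed=None):
    """Each synthetic point is the mean of a random minority sample and
    its nearest minority neighbors (one possible interpretation)."""
    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    rng = random.Random(seed)
    out = []
    for _ in range(n_to_sample):
        base = rng.choice(X_min)
        neigh = sorted((x for x in X_min if x is not base),
                       key=lambda x: d2(base, x))[:n_neighbors]
        group = [base] + neigh
        # coordinate-wise mean of the base point and its neighbors
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out
```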
SMMO

API

class smote_variants.SMMO(proportion=1.0, n_neighbors=5, ensemble=[QuadraticDiscriminantAnalysis(), DecisionTreeClassifier(random_state=2), GaussianNB()], n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, ensemble=[QuadraticDiscriminantAnalysis(), DecisionTreeClassifier(random_state=2), GaussianNB()], n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor component
    - ensemble (list) – list of classifiers, if None, default list of classifiers is used
    - n_jobs (int) – number of parallel jobs

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.SMMO()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{smmo, author = {de la Calleja, Jorge and Fuentes, Olac and González, Jesús}, booktitle= {Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference}, year = {2008}, month = {01}, pages = {276-281}, title = {Selecting Minority Examples from Misclassified Data for Over-Sampling.} }
Notes: in this paper the ensemble is not specified; I have selected some very fast, basic classifiers. It is also not clear what the authors mean by "weighted distance". The original technique is not prepared for the case when no minority samples are classified correctly by the ensemble.
polynom_fit_SMOTE

API

class smote_variants.polynom_fit_SMOTE(proportion=1.0, topology='star', random_state=None)

__init__(proportion=1.0, topology='star', random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - topology (str) – 'star'/'bus'/'mesh'
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.polynom_fit_SMOTE()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{polynomial_fit_smote, author={Gazzah, S. and Amara, N. E. B.}, booktitle={2008 The Eighth IAPR International Workshop on Document Analysis Systems}, title={New Oversampling Approaches Based on Polynomial Fitting for Imbalanced Data Sets}, year={2008}, volume={}, number={}, pages={677-684}, keywords={curve fitting;learning (artificial intelligence);mesh generation;pattern classification;polynomials;sampling methods;support vector machines; oversampling approach;polynomial fitting function;imbalanced data set;pattern classification task; class-modular strategy;support vector machine;true negative rate; true positive rate;star topology; bus topology;polynomial curve topology;mesh topology;Polynomials; Topology;Support vector machines; Support vector machine classification; Pattern classification;Performance evaluation;Training data;Text analysis;Data engineering;Convergence; writer identification system;majority class;minority class;imbalanced data sets;polynomial fitting functions; class-modular strategy}, doi={10.1109/DAS.2008.74}, ISSN={}, month={Sept},}
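The topologies name how minority points are connected before fitting samples along the connections: in the 'star' case, synthetic samples lie on the segments joining each minority point to the minority class mean ('bus' chains consecutive points, 'mesh' connects all pairs). A hedged sketch of the star case (an interpretation for illustration, not the library's code):

```python
def star_topology_samples(X_min, points_per_edge=3):
    """Star topology: place synthetic samples on the segments connecting
    each minority point to the minority class mean."""
    dim = len(X_min[0])
    mean = [sum(x[d] for x in X_min) / len(X_min) for d in range(dim)]
    samples = []
    for x in X_min:
        for k in range(1, points_per_edge + 1):
            t = k / (points_per_edge + 1)  # interior points of the segment
            samples.append([m + t * (xi - m) for m, xi in zip(mean, x)])
    return samples
```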
Stefanowski

API

class smote_variants.Stefanowski(strategy='weak_amp', n_jobs=1, random_state=None)

__init__(strategy='weak_amp', n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.Stefanowski()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{stefanowski, author = {Stefanowski, Jerzy and Wilk, Szymon}, title = {Selective Pre-processing of Imbalanced Data for Improving Classification Performance}, booktitle = {Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery}, series = {DaWaK '08}, year = {2008}, isbn = {978-3-540-85835-5}, location = {Turin, Italy}, pages = {283--292}, numpages = {10}, url = {http://dx.doi.org/10.1007/978-3-540-85836-2_27}, doi = {10.1007/978-3-540-85836-2_27}, acmid = {1430591}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, }
ADOMS

API

class smote_variants.ADOMS(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – parameter of the nearest neighbor component
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.ADOMS()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{adoms, author={Tang, S. and Chen, S.}, booktitle={2008 International Conference on Information Technology and Applications in Biomedicine}, title={The generation mechanism of synthetic minority class examples}, year={2008}, volume={}, number={}, pages={444-447}, keywords={medical image processing; generation mechanism;synthetic minority class examples;class imbalance problem;medical image analysis;oversampling algorithm; Principal component analysis; Biomedical imaging;Medical diagnostic imaging;Information technology;Biomedical engineering; Noise generators;Concrete;Nearest neighbor searches;Data analysis; Image analysis}, doi={10.1109/ITAB.2008.4570642}, ISSN={2168-2194}, month={May}}
Safe_Level_SMOTE

API

class smote_variants.Safe_Level_SMOTE(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor component
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.Safe_Level_SMOTE()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{safe_level_smote, author = { Bunkhumpornpat, Chumphol and Sinapiromsaran, Krung and Lursinsap, Chidchanok}, title = {Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem}, booktitle = {Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining}, series = {PAKDD '09}, year = {2009}, isbn = {978-3-642-01306-5}, location = {Bangkok, Thailand}, pages = {475--482}, numpages = {8}, url = {http://dx.doi.org/10.1007/978-3-642-01307-2_43}, doi = {10.1007/978-3-642-01307-2_43}, acmid = {1533904}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, keywords = {Class Imbalanced Problem, Over-sampling, SMOTE, Safe Level}, }
Notes: the original method was not prepared for the case when no minority sample has minority neighbors.
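The quantity driving this technique is the safe level of a point: the number of minority samples among its k nearest neighbors. Interpolation gaps are then biased toward the endpoint with the higher safe level, keeping synthetic samples in safer regions. A sketch of the safe-level computation (illustrative, Euclidean distance assumed):

```python
def safe_level(i, X, y, minority_label, k=5):
    """Number of minority samples among the k nearest neighbors of X[i].
    A higher value means X[i] lies in a safer (more minority) region."""
    def d2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    idx = sorted((j for j in range(len(X)) if j != i),
                 key=lambda j: d2(X[i], X[j]))[:k]
    return sum(1 for j in idx if y[j] == minority_label)
```

The note above concerns the degenerate case where this value is zero for every minority point, so no safe interpolation direction exists.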
MSMOTE

API

class smote_variants.MSMOTE(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor component
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.MSMOTE()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{msmote, author = {Hu, Shengguo and Liang, Yanfeng and Ma, Lintao and He, Ying}, title = {MSMOTE: Improving Classification Performance When Training Data is Imbalanced}, booktitle = {Proceedings of the 2009 Second International Workshop on Computer Science and Engineering - Volume 02}, series = {IWCSE '09}, year = {2009}, isbn = {978-0-7695-3881-5}, pages = {13--17}, numpages = {5}, url = {https://doi.org/10.1109/WCSE.2009.756}, doi = {10.1109/WCSE.2009.756}, acmid = {1682710}, publisher = {IEEE Computer Society}, address = {Washington, DC, USA}, keywords = {imbalanced data, over-sampling, SMOTE, AdaBoost, samples groups, SMOTEBoost}, }
Notes: the original method was not prepared for the case when all minority samples are noise.
DE_oversampling

API

class smote_variants.DE_oversampling(proportion=1.0, n_neighbors=5, crossover_rate=0.5, similarity_threshold=0.5, n_clusters=30, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, crossover_rate=0.5, similarity_threshold=0.5, n_clusters=30, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – control parameter of the nearest neighbor component
    - crossover_rate (float) – crossover rate of the evolution
    - similarity_threshold (float) – similarity threshold parameter
    - n_clusters (int) – number of clusters for cleansing
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.DE_oversampling()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{de_oversampling, author={Chen, L. and Cai, Z. and Chen, L. and Gu, Q.}, booktitle={2010 Third International Conference on Knowledge Discovery and Data Mining}, title={A Novel Differential Evolution-Clustering Hybrid Resampling Algorithm on Imbalanced Datasets}, year={2010}, volume={}, number={}, pages={81-85}, keywords={pattern clustering;sampling methods; support vector machines;differential evolution;clustering algorithm;hybrid resampling algorithm;imbalanced datasets;support vector machine; minority class;mutation operators; crossover operators;data cleaning method;F-measure criterion;ROC area criterion;Support vector machines; Intrusion detection;Support vector machine classification;Cleaning; Electronic mail;Clustering algorithms; Signal to noise ratio;Learning systems;Data mining;Geology;imbalanced datasets;hybrid resampling;clustering; differential evolution;support vector machine}, doi={10.1109/WKDD.2010.48}, ISSN={}, month={Jan},}
SMOBD

API

class smote_variants.SMOBD(proportion=1.0, eta1=0.5, t=1.8, min_samples=5, max_eps=1.0, n_jobs=1, random_state=None)

__init__(proportion=1.0, eta1=0.5, t=1.8, min_samples=5, max_eps=1.0, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - eta1 (float) – control parameter of density estimation
    - t (float) – control parameter of noise filtering
    - min_samples (int) – minimum samples parameter for OPTICS
    - max_eps (float) – maximum environment radius parameter for OPTICS
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.SMOBD()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{smobd, author={Cao, Q. and Wang, S.}, booktitle={2011 International Conference on Information Management, Innovation Management and Industrial Engineering}, title={Applying Over-sampling Technique Based on Data Density and Cost-sensitive SVM to Imbalanced Learning}, year={2011}, volume={2}, number={}, pages={543-548}, keywords={data handling;learning (artificial intelligence);support vector machines; oversampling technique application; data density;cost sensitive SVM; imbalanced learning;SMOTE algorithm; data distribution;density information; Support vector machines;Classification algorithms;Noise measurement;Arrays; Noise;Algorithm design and analysis; Training;imbalanced learning; cost-sensitive SVM;SMOTE;data density; SMOBD}, doi={10.1109/ICIII.2011.276}, ISSN={2155-1456}, month={Nov},}
SUNDO

API

class smote_variants.SUNDO(n_jobs=1, random_state=None)

__init__(n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.SUNDO()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{sundo, author={Cateni, S. and Colla, V. and Vannucci, M.}, booktitle={2011 11th International Conference on Intelligent Systems Design and Applications}, title={Novel resampling method for the classification of imbalanced datasets for industrial and other real-world problems}, year={2011}, volume={}, number={}, pages={402-407}, keywords={decision trees;pattern classification; sampling methods;support vector machines;resampling method;imbalanced dataset classification;industrial problem;real world problem; oversampling technique;undersampling technique;support vector machine; decision tree;binary classification; synthetic dataset;public dataset; industrial dataset;Support vector machines;Training;Accuracy;Databases; Intelligent systems;Breast cancer; Decision trees;oversampling; undersampling;imbalanced dataset}, doi={10.1109/ISDA.2011.6121689}, ISSN={2164-7151}, month={Nov}}
MSYN

API

class smote_variants.MSYN(pressure=1.5, n_neighbors=5, n_jobs=1, random_state=None)

__init__(pressure=1.5, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - pressure (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – number of neighbors in the SMOTE sampling
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.MSYN()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{msyn, author="Fan, Xiannian and Tang, Ke and Weise, Thomas", editor="Huang, Joshua Zhexue and Cao, Longbing and Srivastava, Jaideep", title="Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets", booktitle="Advances in Knowledge Discovery and Data Mining", year="2011", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="309--320", abstract="Learning from imbalanced datasets has drawn more and more attentions from both theoretical and practical aspects. Over- sampling is a popular and simple method for imbalanced learning. In this paper, we show that there is an inherently potential risk associated with the over-sampling algorithms in terms of the large margin principle. Then we propose a new synthetic over sampling method, named Margin-guided Synthetic Over-sampling (MSYN), to reduce this risk. The MSYN improves learning with respect to the data distributions guided by the margin-based rule. Empirical study verities the efficacy of MSYN.", isbn="978-3-642-20847-8" }
SVM_balance

API

class smote_variants.SVM_balance(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)

__init__(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)
    Constructor of the sampling object.
    Parameters:
    - proportion (float) – proportion of the difference of n_maj and n_min to sample, e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
    - n_neighbors (int) – number of neighbors in the SMOTE sampling
    - n_jobs (int) – number of parallel jobs
    - random_state (int/RandomState/None) – initializer of random_state, like in sklearn

get_params(deep=False)
    Returns: the parameters of the current sampling object
    Return type: dict

Example

>>> oversampler = smote_variants.SVM_balance()
>>> X_samp, y_samp = oversampler.sample(X, y)
- References:
BibTex:
@article{svm_balance, author = {Farquad, M.A.H. and Bose, Indranil}, title = {Preprocessing Unbalanced Data Using Support Vector Machine}, journal = {Decis. Support Syst.}, issue_date = {April, 2012}, volume = {53}, number = {1}, month = apr, year = {2012}, issn = {0167-9236}, pages = {226--233}, numpages = {8}, url = {http://dx.doi.org/10.1016/j.dss.2012.01.016}, doi = {10.1016/j.dss.2012.01.016}, acmid = {2181554}, publisher = {Elsevier Science Publishers B. V.}, address = {Amsterdam, The Netherlands, The Netherlands}, keywords = {COIL data, Hybrid method, Preprocessor, SVM, Unbalanced data}, }
TRIM_SMOTE¶
API¶
-
class
smote_variants.
TRIM_SMOTE
(proportion=1.0, n_neighbors=5, min_precision=0.3, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, min_precision=0.3, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in the nearest neighbors component
- min_precision (float) – minimum precision threshold
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
determine_splitting_point
(X, y, split_on_border=False)[source]¶ Determines the splitting point.
Parameters: - X (np.matrix) – a subset of the training data
- y (np.array) – an array of target labels
- split_on_border (bool) – whether splitting on class borders is considered
Returns: (splitting feature, splitting value) – the point to make the split at
Return type: tuple
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
classmethod
parameter_combinations
(raw=False)[source]¶ Generates reasonable parameter combinations.
Returns: a list of meaningful parameter combinations Return type: list(dict)
-
precision
(y)[source]¶ Determines the precision value.
Parameters: y (np.array) – array of target labels Returns: the precision value Return type: float
-
Example¶
>>> oversampler= smote_variants.TRIM_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{trim_smote, author="Puntumapon, Kamthorn and Waiyamai, Kitsana", editor="Tan, Pang-Ning and Chawla, Sanjay and Ho, Chin Kuan and Bailey, James", title="A Pruning-Based Approach for Searching Precise and Generalized Region for Synthetic Minority Over-Sampling", booktitle="Advances in Knowledge Discovery and Data Mining", year="2012", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="371--382", isbn="978-3-642-30220-6" }
- Notes:
- It is not described precisely how the filtered data is used for sample generation. The method is proposed to be a preprocessing step, and it states that it applies sample generation to each group extracted.
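The precision measure used by TRIM_SMOTE to assess candidate regions can be sketched as the fraction of minority labels within a group (assuming the minority class is labeled 1; the implementation's label handling may differ):

```python
def precision(y, minority_label=1):
    """Fraction of minority labels in a group of target labels."""
    return sum(1 for label in y if label == minority_label) / len(y)

# A group containing two minority and two majority samples has
# precision 0.5; a pure minority group has precision 1.0.
```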
SMOTE_RSB¶
API¶
-
class
smote_variants.
SMOTE_RSB
(proportion=2.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=2.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in the SMOTE sampling
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SMOTE_RSB()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{smote_rsb, author="Ramentol, Enislay and Caballero, Yail{'e} and Bello, Rafael and Herrera, Francisco", title="SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory", journal="Knowledge and Information Systems", year="2012", month="Nov", day="01", volume="33", number="2", pages="245--265", issn="0219-3116", doi="10.1007/s10115-011-0465-6", url="https://doi.org/10.1007/s10115-011-0465-6" }
- Notes:
- I think the description of the algorithm in Fig 5 of the paper is not correct: the set “resultSet” is initialized with the original instances, the While loop runs until resultSet is empty, which never holds, and resultSet is only extended in the loop. Our implementation is changed in the following way: we generate twice as many instances as required to balance the dataset, and repeat the loop until the number of new samples added to the training set is enough to balance the dataset.
ProWSyn¶
API¶
-
class
smote_variants.
ProWSyn
(proportion=1.0, n_neighbors=5, L=5, theta=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, L=5, theta=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbors component
- L (int) – number of levels
- theta (float) – smoothing factor in weight formula
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.ProWSyn()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{prowsyn, author="Barua, Sukarna and Islam, Md. Monirul and Murase, Kazuyuki", editor="Pei, Jian and Tseng, Vincent S. and Cao, Longbing and Motoda, Hiroshi and Xu, Guandong", title="ProWSyn: Proximity Weighted Synthetic Oversampling Technique for Imbalanced Data Set Learning", booktitle="Advances in Knowledge Discovery and Data Mining", year="2013", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="317--328", isbn="978-3-642-37456-2" }
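The theta parameter smooths the proximity-level weights. A sketch assuming the common exponential form w_l ∝ exp(-theta·(l - 1)) for partition level l (a plausible reading of the weighting scheme, not necessarily the exact implementation):

```python
import math

def level_weights(L=5, theta=1.0):
    """Normalized proximity weights for partition levels 1..L:
    levels closer to the class boundary receive larger weight."""
    raw = [math.exp(-theta * (level - 1)) for level in range(1, L + 1)]
    total = sum(raw)
    return [w / total for w in raw]

# Larger theta concentrates the sampling budget on level 1 (samples
# nearest the boundary); theta close to 0 spreads it almost uniformly.
```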
SL_graph_SMOTE¶
API¶
-
class
smote_variants.
SL_graph_SMOTE
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbors component
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SL_graph_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{sl_graph_smote, author = {Bunkhumpornpat, Chumpol and Subpaiboonkit, Sitthichoke}, booktitle= {13th International Symposium on Communications and Information Technologies}, year = {2013}, month = {09}, pages = {570-575}, title = {Safe level graph for synthetic minority over-sampling techniques}, isbn = {978-1-4673-5578-0} }
NRSBoundary_SMOTE¶
API¶
-
class
smote_variants.
NRSBoundary_SMOTE
(proportion=1.0, n_neighbors=5, w=0.005, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, w=0.005, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbors component
- w (float) – used to set neighborhood radius
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.NRSBoundary_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{nrsboundary_smote, author= {Feng, Hu and Hang, Li}, title= {A Novel Boundary Oversampling Algorithm Based on Neighborhood Rough Set Model: NRSBoundary-SMOTE}, journal= {Mathematical Problems in Engineering}, year= {2013}, pages= {10}, doi= {10.1155/2013/694809}, url= {http://dx.doi.org/10.1155/2013/694809} }
LVQ_SMOTE¶
API¶
-
class
smote_variants.
LVQ_SMOTE
(proportion=1.0, n_neighbors=5, n_clusters=10, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_clusters=10, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbors component
- n_clusters (int) – number of clusters in vector quantization
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.LVQ_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{lvq_smote, title={LVQ-SMOTE – Learning Vector Quantization based Synthetic Minority Over–sampling Technique for biomedical data}, author={Munehiro Nakamura and Yusuke Kajiwara and Atsushi Otsuka and Haruhiko Kimura}, booktitle={BioData Mining}, year={2013} }
- Notes:
- This implementation is only a rough approximation of the method described in the paper. The main problem is that the paper uses many datasets to find similar patterns in the codebooks and replicates patterns appearing in other datasets to the imbalanced dataset, based on their relative position compared to the codebook elements. What we do instead is cluster the minority class to extract a codebook as k-means cluster means, then find the pair of codebook elements whose relative position is most similar to that of a randomly selected pair of codebook elements, and translate nearby minority samples from the neighborhood of one pair of codebook elements to the neighborhood of the other pair.
SOI_CJ¶
API¶
-
class
smote_variants.
SOI_CJ
(proportion=1.0, n_neighbors=5, method='interpolation', n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, method='interpolation', n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of nearest neighbors in the SMOTE sampling
- method (str) – ‘interpolation’/’jittering’
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
clustering
(X, y)[source]¶ Implementation of the clustering technique described in the paper.
Parameters: - X (np.matrix) – array of training instances
- y (np.array) – target labels
Returns: list of minority clusters
Return type: list
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SOI_CJ()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{soi_cj, author = {Sánchez, Atlántida I. and Morales, Eduardo and Gonzalez, Jesus}, year = {2013}, month = {01}, pages = {}, title = {Synthetic Oversampling of Instances Using Clustering}, volume = {22}, booktitle = {International Journal of Artificial Intelligence Tools} }
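The method parameter selects between two generation modes. A hedged sketch of the distinction (the jittering noise scale below is an assumption for illustration, not the paper's choice):

```python
import random

def generate(a, b, method='interpolation', rng=None):
    """Generate one synthetic point from cluster members a and b:
    'interpolation' places it on the segment between them,
    'jittering' perturbs a with small noise relative to their spread."""
    rng = rng or random.Random(1)
    if method == 'interpolation':
        t = rng.random()
        return [ai + t * (bi - ai) for ai, bi in zip(a, b)]
    return [ai + rng.gauss(0, 0.01) * abs(bi - ai) for ai, bi in zip(a, b)]
```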
ROSE¶
API¶
-
class
smote_variants.
ROSE
(proportion=1.0, random_state=None)[source]¶ -
__init__
(proportion=1.0, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.ROSE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{rose, author="Menardi, Giovanna and Torelli, Nicola", title="Training and assessing classification rules with imbalanced data", journal="Data Mining and Knowledge Discovery", year="2014", month="Jan", day="01", volume="28", number="1", pages="92--122", issn="1573-756X", doi="10.1007/s10618-012-0295-5", url="https://doi.org/10.1007/s10618-012-0295-5" }
- Notes:
- It is not entirely clear whether the authors propose kernel density estimation or the fitting of simple multivariate Gaussians on the minority samples. The latter seems to be more likely; I implement that approach.
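Under the Gaussian interpretation adopted above, generation reduces to drawing from normals centered on minority points. A minimal one-dimensional sketch (the scalar bandwidth h stands in for ROSE's smoothing matrix):

```python
import random

def rose_sample_1d(x_min, h, rng=None):
    """Draw one synthetic value from a Gaussian centered on a
    randomly chosen minority sample, with bandwidth h."""
    rng = rng or random.Random(0)
    return rng.gauss(rng.choice(x_min), h)

# With h == 0 this degenerates to plain random oversampling
# (exact copies of existing minority values).
```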
SMOTE_OUT¶
API¶
-
class
smote_variants.
SMOTE_OUT
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – parameter of the NearestNeighbors component
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SMOTE_OUT()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_out_smote_cosine_selected_smote, title={SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level}, author={Fajri Koto}, journal={2014 International Conference on Advanced Computer Science and Information System}, year={2014}, pages={280-284} }
SMOTE_Cosine¶
API¶
-
class
smote_variants.
SMOTE_Cosine
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – parameter of the NearestNeighbors component
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SMOTE_Cosine()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_out_smote_cosine_selected_smote, title={SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level}, author={Fajri Koto}, journal={2014 International Conference on Advanced Computer Science and Information System}, year={2014}, pages={280-284} }
Selected_SMOTE¶
API¶
-
class
smote_variants.
Selected_SMOTE
(proportion=1.0, n_neighbors=5, perc_sign_attr=0.5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, perc_sign_attr=0.5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – parameter of the NearestNeighbors component
- perc_sign_attr (float) – [0,1] - percentage of significant attributes
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.Selected_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_out_smote_cosine_selected_smote, title={SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level}, author={Fajri Koto}, journal={2014 International Conference on Advanced Computer Science and Information System}, year={2014}, pages={280-284} }
- Notes:
- Significant attribute selection was not described in the paper, therefore we have implemented something meaningful.
LN_SMOTE¶
API¶
-
class
smote_variants.
LN_SMOTE
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – parameter of the NearestNeighbors component
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.LN_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{ln_smote, author={Maciejewski, T. and Stefanowski, J.}, booktitle={2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)}, title={Local neighbourhood extension of SMOTE for mining imbalanced data}, year={2011}, volume={}, number={}, pages={104-111}, keywords={Bayes methods;data mining;pattern classification;local neighbourhood extension;imbalanced data mining; focused resampling technique;SMOTE over-sampling method;naive Bayes classifiers;Noise measurement;Noise; Decision trees;Breast cancer; Sensitivity;Data mining;Training}, doi={10.1109/CIDM.2011.5949434}, ISSN={}, month={April}}
MWMOTE¶
API¶
-
class
smote_variants.
MWMOTE
(proportion=1.0, k1=5, k2=5, k3=5, M=10, cf_th=5.0, cmax=10.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, k1=5, k2=5, k3=5, M=10, cf_th=5.0, cmax=10.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- k1 (int) – parameter of the NearestNeighbors component
- k2 (int) – parameter of the NearestNeighbors component
- k3 (int) – parameter of the NearestNeighbors component
- M (int) – number of clusters
- cf_th (float) – cutoff threshold
- cmax (float) – maximum closeness value
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.MWMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@ARTICLE{mwmote, author={Barua, S. and Islam, M. M. and Yao, X. and Murase, K.}, journal={IEEE Transactions on Knowledge and Data Engineering}, title={MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning}, year={2014}, volume={26}, number={2}, pages={405-425}, keywords={learning (artificial intelligence);pattern clustering;sampling methods;AUC;area under curve;ROC;receiver operating curve;G-mean; geometric mean;minority class cluster; clustering approach;weighted informative minority class samples;Euclidean distance; hard-to-learn informative minority class samples;majority class;synthetic minority class samples;synthetic oversampling methods;imbalanced learning problems; imbalanced data set learning; MWMOTE-majority weighted minority oversampling technique;Sampling methods; Noise measurement;Boosting;Simulation; Complexity theory;Interpolation;Abstracts; Imbalanced learning;undersampling; oversampling;synthetic sample generation; clustering}, doi={10.1109/TKDE.2012.232}, ISSN={1041-4347}, month={Feb}}
- Notes:
- The original method was not prepared for the case of clusters containing a single element.
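A sketch of how the cf_th and cmax parameters might interact, assuming the cut-off form of the closeness factor described in the MWMOTE paper (the inverse distance is capped at cf_th, then rescaled to [0, cmax]; the package's exact normalization may differ):

```python
def closeness_factor(dist, cf_th=5.0, cmax=10.0):
    """Closeness of a minority sample to a borderline majority sample:
    larger for nearer samples, saturating at cmax for very small
    distances (distance 0 is treated as maximal closeness)."""
    inv = 1.0 / dist if dist > 0 else cf_th
    return min(inv, cf_th) / cf_th * cmax
```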
PDFOS¶
API¶
-
class
smote_variants.
PDFOS
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.PDFOS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{pdfos, title = "PDFOS: PDF estimation based over-sampling for imbalanced two-class problems", journal = "Neurocomputing", volume = "138", pages = "248 - 259", year = "2014", issn = "0925-2312", doi = "https://doi.org/10.1016/j.neucom.2014.02.006", author = "Ming Gao and Xia Hong and Sheng Chen and Chris J. Harris and Emad Khalaf", keywords = "Imbalanced classification, Probability density function based over-sampling, Radial basis function classifier, Orthogonal forward selection, Particle swarm optimisation" }
- Notes:
- Not prepared for low-rank data.
IPADE_ID¶
API¶
-
class
smote_variants.
IPADE_ID
(F=0.1, G=0.1, OT=20, max_it=40, dt_classifier=DecisionTreeClassifier(random_state=2), base_classifier=DecisionTreeClassifier(random_state=2), n_jobs=1, random_state=None)[source]¶ -
__init__
(F=0.1, G=0.1, OT=20, max_it=40, dt_classifier=DecisionTreeClassifier(random_state=2), base_classifier=DecisionTreeClassifier(random_state=2), n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - F (float) – control parameter of differential evolution
- G (float) – control parameter of the evolution
- OT (int) – number of optimizations
- max_it (int) – maximum number of iterations for DE_optimization
- dt_classifier (obj) – decision tree classifier object
- base_classifier (obj) – classifier object
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.IPADE_ID()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{ipade_id, title = "Addressing imbalanced classification with instance generation techniques: IPADE-ID", journal = "Neurocomputing", volume = "126", pages = "15 - 28", year = "2014", note = "Recent trends in Intelligent Data Analysis Online Data Processing", issn = "0925-2312", doi = "https://doi.org/10.1016/j.neucom.2013.01.050", author = "Victoria López and Isaac Triguero and Cristóbal J. Carmona and Salvador García and Francisco Herrera", keywords = "Differential evolution, Instance generation, Nearest neighbor, Decision tree, Imbalanced datasets" }
- Notes:
- According to the algorithm, if the addition of a majority sample doesn’t improve the AUC during the DE optimization process, no further majority points are tried.
- In the differential evolution, the multiplication by a random number seems to have a deteriorating effect; a new scaling parameter was added to fix this.
- It is not specified how to do the evaluation.
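The differential evolution underlying IPADE-ID relies on the classic DE/rand/1 mutation; a minimal sketch of that step (the extra scaling parameter mentioned in the notes is not shown):

```python
def de_mutant(a, b, c, F=0.1):
    """Classic DE/rand/1 mutation: a + F * (b - c), element-wise,
    where a, b, c are three distinct population members."""
    return [ai + F * (bi - ci) for ai, bi, ci in zip(a, b, c)]

# The F parameter of IPADE_ID plays this role: it scales the
# difference vector that perturbs the base member a.
```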
RWO_sampling¶
API¶
-
class
smote_variants.
RWO_sampling
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.RWO_sampling()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{rwo_sampling, author = {Zhang, Huaxiang and Li, Mingfang}, year = {2014}, month = {11}, pages = {}, title = {RWO-Sampling: A Random Walk Over-Sampling Approach to Imbalanced Data Classification}, volume = {20}, booktitle = {Information Fusion} }
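The random-walk generation idea can be sketched in a few lines: each synthetic sample perturbs a minority point by attribute-wise Gaussian noise scaled by the sample standard deviation over sqrt(n), approximately preserving the class mean and variance. A pure-Python illustration, not the package's implementation:

```python
import random
import statistics

def rwo_generate(X_min, n_samples, rng=None):
    """Generate n_samples synthetic minority points by random walk
    around randomly chosen members of X_min (list of feature lists)."""
    rng = rng or random.Random(42)
    n = len(X_min)
    # attribute-wise standard deviation, shrunk by sqrt(n)
    scale = [statistics.pstdev(col) / n ** 0.5 for col in zip(*X_min)]
    samples = []
    for _ in range(n_samples):
        base = rng.choice(X_min)
        samples.append([b - s * rng.gauss(0.0, 1.0)
                        for b, s in zip(base, scale)])
    return samples
```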
NEATER¶
API¶
-
class
smote_variants.
NEATER
(proportion=1.0, smote_n_neighbors=5, b=5, alpha=0.1, h=20, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, smote_n_neighbors=5, b=5, alpha=0.1, h=20, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- smote_n_neighbors (int) – number of neighbors in SMOTE sampling
- b (int) – number of neighbors
- alpha (float) – smoothing term
- h (int) – number of iterations in evolution
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.NEATER()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{neater, author={Almogahed, B. A. and Kakadiaris, I. A.}, booktitle={2014 22nd International Conference on Pattern Recognition}, title={NEATER: Filtering of Over-sampled Data Using Non-cooperative Game Theory}, year={2014}, volume={}, number={}, pages={1371-1376}, keywords={data handling;game theory;information filtering;NEATER;imbalanced data problem;synthetic data;filtering of over-sampled data using non-cooperative game theory;Games;Game theory;Vectors; Sociology;Statistics;Silicon; Mathematical model}, doi={10.1109/ICPR.2014.245}, ISSN={1051-4651}, month={Aug}}
- Notes:
- Both the majority and minority probabilities are evolved; since nothing ensures that the probabilities remain in the range [0,1], they need to be normalized.
- The inversely weighted function needs to be cut at some value (like the alpha level), otherwise it will overemphasize the utility of having differing neighbors next to each other.
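The normalization mentioned in the notes can be sketched as clipping the evolved (minority, majority) probability pair to [0, 1] and renormalizing it to sum to 1 (an illustration of the note, not the exact implementation):

```python
def normalize_probs(p_min, p_maj):
    """Clip both evolved probabilities to [0, 1] and renormalize
    so they form a valid two-strategy distribution."""
    p_min = min(max(p_min, 0.0), 1.0)
    p_maj = min(max(p_maj, 0.0), 1.0)
    total = p_min + p_maj
    if total == 0.0:
        return 0.5, 0.5  # degenerate pair: fall back to uniform
    return p_min / total, p_maj / total
```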
DEAGO¶
API¶
-
class
smote_variants.
DEAGO
(proportion=1.0, n_neighbors=5, e=100, h=0.3, sigma=0.1, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, e=100, h=0.3, sigma=0.1, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- e (int) – number of epochs
- h (float) – fraction of number of hidden units
- sigma (float) – training noise
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.DEAGO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{deago, author={Bellinger, C. and Japkowicz, N. and Drummond, C.}, booktitle={2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)}, title={Synthetic Oversampling for Advanced Radioactive Threat Detection}, year={2015}, volume={}, number={}, pages={948-953}, keywords={radioactive waste;advanced radioactive threat detection;gamma-ray spectral classification;industrial nuclear facilities;Health Canadas national monitoring networks;Vancouver 2010; Isotopes;Training;Monitoring; Gamma-rays;Machine learning algorithms; Security;Neural networks;machine learning;classification;class imbalance;synthetic oversampling; artificial neural networks; autoencoders;gamma-ray spectra}, doi={10.1109/ICMLA.2015.58}, ISSN={}, month={Dec}}
- Notes:
- There is no hint on the activation functions and amounts of noise.
Gazzah¶
API¶
-
class
smote_variants.
Gazzah
(proportion=1.0, n_components=2, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_components=2, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_components (int) – number of components in PCA analysis
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.Gazzah()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{gazzah, author={Gazzah, S. and Hechkel, A. and Essoukri Ben Amara, N. }, booktitle={2015 IEEE 12th International Multi-Conference on Systems, Signals Devices (SSD15)}, title={A hybrid sampling method for imbalanced data}, year={2015}, volume={}, number={}, pages={1-6}, keywords={computer vision;image classification; learning (artificial intelligence); sampling methods;hybrid sampling method;imbalanced data; diversification;computer vision domain;classical machine learning systems;intraclass variations; system performances;classification accuracy;imbalanced training data; training data set;over-sampling; minority class;SMOTE star topology; feature vector deletion;intra-class variations;distribution criterion; biometric data;true positive rate; Training data;Principal component analysis;Databases;Support vector machines;Training;Feature extraction; Correlation;Imbalanced data sets; Intra-class variations;Data analysis; Principal component analysis; One-against-all SVM}, doi={10.1109/SSD.2015.7348093}, ISSN={}, month={March}}
MCT¶
API¶
-
class
smote_variants.
MCT
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.MCT()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{mct, author = {Jiang, Liangxiao and Qiu, Chen and Li, Chaoqun}, year = {2015}, month = {03}, pages = {1551004}, title = {A Novel Minority Cloning Technique for Cost-Sensitive Learning}, volume = {29}, booktitle = {International Journal of Pattern Recognition and Artificial Intelligence} }
- Notes:
- The mode is changed to the median, and the distance is changed to (normalized) Euclidean, to support continuous features.
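With the mode replaced by the attribute-wise median as described above, the cloning prototype can be sketched as:

```python
import statistics

def minority_prototype(X_min):
    """Attribute-wise median of the minority samples (the prototype
    against which cloning weights are computed in this variant)."""
    return [statistics.median(col) for col in zip(*X_min)]
```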
ADG¶
API¶
-
class
smote_variants.
ADG
(proportion=1.0, kernel='inner', lam=1.0, mu=1.0, k=12, gamma=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, kernel='inner', lam=1.0, mu=1.0, k=12, gamma=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- kernel (str) – ‘inner’/’rbf_x’, where x is a float, the bandwidth
- lam (float) – lambda parameter of the method
- mu (float) – mu parameter of the method
- k (int) – number of samples to generate in each iteration
- gamma (float) – gamma parameter of the method
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.ADG()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{adg, author = {Pourhabib, A. and Mallick, Bani K. and Ding, Yu}, year = {2015}, pages = {2695--2724}, title = {Absent Data Generating Classifier for Imbalanced Class Sizes}, volume = {16}, journal = {Journal of Machine Learning Research} }
- Notes:
- This method has many parameters, making it fairly hard to cross-validate thoroughly.
- Fails if the matrix is singular when computing alpha_star; fixed by PCA.
- Singularity might be caused by repeating samples.
- Maintaining the kernel matrix becomes infeasible above a couple of thousand vectors.
SMOTE_IPF¶
API¶
-
class
smote_variants.
SMOTE_IPF
(proportion=1.0, n_neighbors=5, n_folds=9, k=3, p=0.01, voting='majority', classifier=DecisionTreeClassifier(random_state=2), n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_folds=9, k=3, p=0.01, voting='majority', classifier=DecisionTreeClassifier(random_state=2), n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in SMOTE sampling
- n_folds (int) – the number of partitions
- k (int) – used in stopping condition
- p (float) – percentage value ([0,1]) used in stopping condition
- voting (str) – ‘majority’/’consensus’
- classifier (obj) – classifier object
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SMOTE_IPF()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_ipf, title = "SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering", journal = "Information Sciences", volume = "291", pages = "184 - 203", year = "2015", issn = "0020-0255", doi = "https://doi.org/10.1016/j.ins.2014.08.051", author = "José A. Sáez and Julián Luengo and Jerzy Stefanowski and Francisco Herrera", keywords = "Imbalanced classification, Borderline examples, Noisy data, Noise filters, SMOTE" }
KernelADASYN¶
API¶
-
class
smote_variants.
KernelADASYN
(proportion=1.0, k=5, h=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, k=5, h=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- k (int) – number of neighbors in the nearest neighbors component
- h (float) – kernel bandwidth
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.KernelADASYN()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{kernel_adasyn, author={Tang, B. and He, H.}, booktitle={2015 IEEE Congress on Evolutionary Computation (CEC)}, title={KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning}, year={2015}, volume={}, number={}, pages={664-671}, keywords={learning (artificial intelligence); pattern classification; sampling methods;KernelADASYN; kernel based adaptive synthetic data generation;imbalanced learning;standard classification algorithms;data distribution; minority class decision rule; expensive minority class data misclassification;kernel based adaptive synthetic over-sampling approach;imbalanced data classification problems;kernel density estimation methods;Kernel; Estimation;Accuracy;Measurement; Standards;Training data;Sampling methods;Imbalanced learning; adaptive over-sampling;kernel density estimation;pattern recognition;medical and healthcare data learning}, doi={10.1109/CEC.2015.7256954}, ISSN={1089-778X}, month={May}}
- Notes:
- The method of sampling was not specified; Markov Chain Monte Carlo has been implemented.
- Not prepared for an improperly conditioned covariance matrix.
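The density-estimation core can be sketched with SciPy's Gaussian KDE: fit a kernel density on the minority class with bandwidth h and draw synthetic samples from it. This is a simplification; the actual method additionally weights the density by classification difficulty and, per the note above, samples via MCMC rather than direct resampling:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.RandomState(0)
X_min = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))  # minority class
n_maj = 60  # assumed majority class size

# fit a KDE on the minority samples; bw_method plays the role of h
kde = gaussian_kde(X_min.T, bw_method=1.0)

np.random.seed(1)  # resample draws from the global RNG
n_to_sample = n_maj - len(X_min)
X_new = kde.resample(n_to_sample).T  # shape (n_to_sample, 2)
```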
MOT2LD¶
API¶
-
class
smote_variants.
MOT2LD
(proportion=1.0, n_components=2, k=5, d_cut='auto', n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_components=2, k=5, d_cut='auto', n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_components (int) – number of components for stochastic neighborhood embedding
- k (int) – number of neighbors in the nearest neighbor component
- d_cut (float/str) – distance cut value/’auto’ for automated selection
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.MOT2LD()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{mot2ld, author="Xie, Zhipeng and Jiang, Liyang and Ye, Tengju and Li, Xiaoli", editor="Renz, Matthias and Shahabi, Cyrus and Zhou, Xiaofang and Cheema, Muhammad Aamir", title="A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning", booktitle="Database Systems for Advanced Applications", year="2015", publisher="Springer International Publishing", address="Cham", pages="3--18", isbn="978-3-319-18123-3" }
- Notes:
- Clusters might contain a single element, and all points can be filtered as noise.
- Clusters might also contain zero elements if all points are filtered as noise.
- The entire clustering can become empty.
- t-SNE is very slow when the number of instances exceeds a couple of thousand.
V_SYNTH¶
API¶
-
class
smote_variants.
V_SYNTH
(proportion=1.0, n_components=3, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_components=3, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_components (int) – number of components for PCA
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.V_SYNTH()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{v_synth, author = {Young,Ii, William A. and Nykl, Scott L. and Weckman, Gary R. and Chelberg, David M.}, title = {Using Voronoi Diagrams to Improve Classification Performances when Modeling Imbalanced Datasets}, journal = {Neural Comput. Appl.}, issue_date = {July 2015}, volume = {26}, number = {5}, month = jul, year = {2015}, issn = {0941-0643}, pages = {1041--1054}, numpages = {14}, url = {http://dx.doi.org/10.1007/s00521-014-1780-0}, doi = {10.1007/s00521-014-1780-0}, acmid = {2790665}, publisher = {Springer-Verlag}, address = {London, UK, UK}, keywords = {Data engineering, Data mining, Imbalanced datasets, Knowledge extraction, Numerical algorithms, Synthetic over-sampling}, }
- Notes:
- The proposed encompassing bounding box generation is incorrect.
- Voronoi diagram generation in high-dimensional spaces is unstable.
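To illustrate why the method operates in a PCA-reduced space, here is a 2-D sketch, not the library's algorithm: build a Voronoi diagram over all samples and place synthetic points at the midpoints of ridges connecting two minority samples (the midpoint rule is an assumption for illustration):

```python
import numpy as np
from scipy.spatial import Voronoi

rng = np.random.RandomState(0)
X_maj = rng.rand(30, 2)                 # majority, spread over the unit square
X_min = 0.1 + 0.1 * rng.rand(10, 2)    # minority, a tight cluster
X = np.vstack([X_maj, X_min])
y = np.array([0] * 30 + [1] * 10)

vor = Voronoi(X)  # Voronoi diagram over all samples
# ridge_points lists pairs of input points sharing a Voronoi ridge
pairs = vor.ridge_points
both_minority = (y[pairs[:, 0]] == 1) & (y[pairs[:, 1]] == 1)
# synthetic points at midpoints of minority-minority neighbor pairs
X_new = (X[pairs[both_minority, 0]] + X[pairs[both_minority, 1]]) / 2.0
```

In more than a few dimensions the Voronoi construction becomes numerically fragile, which is the instability the note refers to.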
OUPS¶
API¶
-
class
smote_variants.
OUPS
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.OUPS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{oups, title = "A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets", journal = "Expert Systems with Applications", volume = "66", pages = "124 - 135", year = "2016", issn = "0957-4174", doi = "https://doi.org/10.1016/j.eswa.2016.09.010", author = "William A. Rivera and Petros Xanthopoulos", keywords = "SMOTE, OUPS, Class imbalance, Classification" }
- Notes:
- In the description of the algorithm, a fractional number p(j) is used to index a vector.
SMOTE_D¶
API¶
-
class
smote_variants.
SMOTE_D
(proportion=1.0, k=3, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, k=3, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- k (int) – number of neighbors in nearest neighbors component
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SMOTE_D()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{smote_d, author="Torres, Fredy Rodr{'i}guez and Carrasco-Ochoa, Jes{'u}s A. and Mart{'i}nez-Trinidad, Jos{'e} Fco.", editor="Mart{'i}nez-Trinidad, Jos{'e} Francisco and Carrasco-Ochoa, Jes{'u}s Ariel and Ayala Ramirez, Victor and Olvera-L{'o}pez, Jos{'e} Arturo and Jiang, Xiaoyi", title="SMOTE-D a Deterministic Version of SMOTE", booktitle="Pattern Recognition", year="2016", publisher="Springer International Publishing", address="Cham", pages="177--188", isbn="978-3-319-39393-3" }
- Notes:
- Copying happens if two points are the neighbors of each other.
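The deterministic flavor can be sketched as follows, under the simplifying assumption that synthetic samples are placed at evenly spaced positions along minority nearest-neighbor segments rather than at random interpolation points:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_d_sketch(X_min, k=3, per_pair=2):
    """Place synthetic samples deterministically at evenly spaced
    points along segments to the k nearest minority neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, ind = nn.kneighbors(X_min)
    new = []
    for i, neighbors in enumerate(ind):
        for j in neighbors[1:]:  # skip the point itself
            for step in range(1, per_pair + 1):
                t = step / (per_pair + 1)  # deterministic interpolation weight
                new.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_d_sketch(X_min, k=2, per_pair=1)
```

With `per_pair=1` every generated sample is the midpoint of a point and one of its neighbors, which also shows how copying can occur when two points are each other's neighbors: both generate the same midpoint.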
SMOTE_PSO¶
API¶
-
class
smote_variants.
SMOTE_PSO
(k=3, eps=0.05, n_pop=10, w=1.0, c1=2.0, c2=2.0, num_it=10, n_jobs=1, random_state=None)[source]¶ -
__init__
(k=3, eps=0.05, n_pop=10, w=1.0, c1=2.0, c2=2.0, num_it=10, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - k (int) – number of neighbors in nearest neighbors component, this is also the multiplication factor of minority support vectors
- eps (float) – used to specify the initially generated support vectors along minority-majority lines
- n_pop (int) – size of population
- w (float) – inertia constant
- c1 (float) – acceleration constant of local optimum
- c2 (float) – acceleration constant of population optimum
- num_it (int) – number of iterations
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SMOTE_PSO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_pso, title = "PSO-based method for SVM classification on skewed data sets", journal = "Neurocomputing", volume = "228", pages = "187 - 197", year = "2017", note = "Advanced Intelligent Computing: Theory and Applications", issn = "0925-2312", doi = "https://doi.org/10.1016/j.neucom.2016.10.041", author = "Jair Cervantes and Farid Garcia-Lamont and Lisbeth Rodriguez and Asdrúbal López and José Ruiz Castilla and Adrian Trueba", keywords = "Skew data sets, SVM, Hybrid algorithms" }
- Notes:
- I find the description of the technique a bit confusing, especially on the bounds of the search space of velocities and positions: Equations 15 and 16 specify the lower and upper bounds, but the lower bound is in fact a vector while the upper bound is a distance. I tried to implement something meaningful.
- I also find the setting of the acceleration constants to 2.0 strange; most of the time the velocity will be bounded due to this choice.
- Also, training and predicting probabilities with a non-linear SVM as the evaluation function becomes fairly expensive when the number of training vectors reaches a couple of thousand. To reduce the computational burden, minority and majority vectors far from the other class are removed so that both classes are reduced to a maximum of 500 samples. Generally, this shouldn't really affect the results, as the technique focuses on the samples near the class boundaries.
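The boundary-focused filtering described in the last note can be sketched with a simple rule, under the assumption that "far from the other class" is measured as the distance to the nearest opposite-class sample:

```python
import numpy as np
from scipy.spatial.distance import cdist

def cap_to_boundary(X_a, X_b, cap=500):
    """Keep at most `cap` samples of class A, preferring those
    closest to class B (i.e. near the decision boundary)."""
    if len(X_a) <= cap:
        return X_a
    d = cdist(X_a, X_b).min(axis=1)  # distance to nearest opposite-class sample
    keep = np.argsort(d)[:cap]      # indices of the `cap` closest samples
    return X_a[keep]

rng = np.random.RandomState(0)
X_maj = rng.rand(1000, 3)
X_min = rng.rand(100, 3) + 2.0
X_maj_small = cap_to_boundary(X_maj, X_min, cap=500)
```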
CURE_SMOTE¶
API¶
-
class
smote_variants.
CURE_SMOTE
(proportion=1.0, n_clusters=5, noise_th=2, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_clusters=5, noise_th=2, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_clusters (int) – number of clusters to generate
- noise_th (int) – below this number of elements the cluster is considered as noise
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.CURE_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{cure_smote, author="Ma, Li and Fan, Suohai", title="CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests", journal="BMC Bioinformatics", year="2017", month="Mar", day="14", volume="18", number="1", pages="169", issn="1471-2105", doi="10.1186/s12859-017-1578-z", url="https://doi.org/10.1186/s12859-017-1578-z" }
- Notes:
- It is not specified how to determine the cluster with the “slowest growth rate”.
- All clusters can be removed as noise.
SOMO¶
API¶
-
class
smote_variants.
SOMO
(proportion=1.0, n_grid=10, sigma=0.2, learning_rate=0.5, n_iter=100, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_grid=10, sigma=0.2, learning_rate=0.5, n_iter=100, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_grid (int) – size of grid
- sigma (float) – sigma of SOM
- learning_rate (float) – learning rate of the SOM
- n_iter (int) – number of iterations
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SOMO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{somo, title = "Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning", journal = "Expert Systems with Applications", volume = "82", pages = "40 - 52", year = "2017", issn = "0957-4174", doi = "https://doi.org/10.1016/j.eswa.2017.03.073", author = "Georgios Douzas and Fernando Bacao" }
- Notes:
- It is not specified how to handle those cases when a cluster contains only one minority sample; the mean of within-cluster distances is set to 100 in these cases.
ISOMAP_Hybrid¶
API¶
-
class
smote_variants.
ISOMAP_Hybrid
(proportion=1.0, n_neighbors=5, n_components=3, smote_n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_components=3, smote_n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- n_components (int) – number of components
- smote_n_neighbors (int) – number of neighbors in SMOTE sampling
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
classmethod
parameter_combinations
(raw=False)[source]¶ Generates reasonable parameter combinations.
Returns: a list of meaningful parameter combinations Return type: list(dict)
-
Example¶
>>> oversampler= smote_variants.ISOMAP_Hybrid()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{isomap_hybrid, author = {Gu, Qiong and Cai, Zhihua and Zhu, Li}, title = {Classification of Imbalanced Data Sets by Using the Hybrid Re-sampling Algorithm Based on Isomap}, booktitle = {Proceedings of the 4th International Symposium on Advances in Computation and Intelligence}, series = {ISICA '09}, year = {2009}, isbn = {978-3-642-04842-5}, location = {Huangshi, China}, pages = {287--296}, numpages = {10}, doi = {10.1007/978-3-642-04843-2_31}, acmid = {1691478}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, keywords = {Imbalanced data set, Isomap, NCR, Smote, re-sampling}, }
CE_SMOTE¶
API¶
-
class
smote_variants.
CE_SMOTE
(proportion=1.0, h=10, k=5, alpha=0.5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, h=10, k=5, alpha=0.5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- h (int) – size of ensemble
- k (int) – number of clusters/neighbors
- alpha (float) – [0,1] threshold to select boundary samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.CE_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{ce_smote, author={Chen, S. and Guo, G. and Chen, L.}, booktitle={2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops}, title={A New Over-Sampling Method Based on Cluster Ensembles}, year={2010}, volume={}, number={}, pages={599-604}, keywords={data mining;Internet;pattern classification;pattern clustering; over sampling method;cluster ensembles;classification method; imbalanced data handling;CE-SMOTE; clustering consistency index; cluster boundary minority samples; imbalanced public data set; Mathematics;Computer science; Electronic mail;Accuracy;Nearest neighbor searches;Application software;Data mining;Conferences; Web sites;Information retrieval; classification;imbalanced data sets;cluster ensembles; over-sampling}, doi={10.1109/WAINA.2010.40}, ISSN={}, month={April}}
Edge_Det_SMOTE¶
API¶
-
class
smote_variants.
Edge_Det_SMOTE
(proportion=1.0, k=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, k=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- k (int) – number of neighbors
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.Edge_Det_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{Edge_Det_SMOTE, author={Kang, Y. and Won, S.}, booktitle={ICCAS 2010}, title={Weight decision algorithm for oversampling technique on class-imbalanced learning}, year={2010}, volume={}, number={}, pages={182-186}, keywords={edge detection;learning (artificial intelligence);weight decision algorithm;oversampling technique; class-imbalanced learning;class imbalanced data problem;edge detection algorithm;spatial space representation;Classification algorithms;Image edge detection; Training;Noise measurement;Glass; Training data;Machine learning; Imbalanced learning;Classification; Weight decision;Oversampling; Edge detection}, doi={10.1109/ICCAS.2010.5669889}, ISSN={}, month={Oct}}
- Notes:
- This technique is very loosely specified.
CBSO¶
API¶
-
class
smote_variants.
CBSO
(proportion=1.0, n_neighbors=5, C_p=1.3, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, C_p=1.3, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- C_p (float) – used to set the threshold of clustering
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.CBSO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{cbso, author="Barua, Sukarna and Islam, Md. Monirul and Murase, Kazuyuki", editor="Lu, Bao-Liang and Zhang, Liqing and Kwok, James", title="A Novel Synthetic Minority Oversampling Technique for Imbalanced Data Set Learning", booktitle="Neural Information Processing", year="2011", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="735--744", isbn="978-3-642-24958-7" }
- Notes:
- Clusters containing a single element induce cloning of samples.
E_SMOTE¶
API¶
-
class
smote_variants.
E_SMOTE
(proportion=1.0, n_neighbors=5, min_features=2, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, min_features=2, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in the nearest neighbors component
- min_features (int) – minimum number of features
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
classmethod
parameter_combinations
(raw=False)[source]¶ Generates reasonable parameter combinations.
Returns: a list of meaningful parameter combinations Return type: list(dict)
-
Example¶
>>> oversampler= smote_variants.E_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{e_smote, author={Deepa, T. and Punithavalli, M.}, booktitle={2011 3rd International Conference on Electronics Computer Technology}, title={An E-SMOTE technique for feature selection in High-Dimensional Imbalanced Dataset}, year={2011}, volume={2}, number={}, pages={322-324}, keywords={bioinformatics;data mining;pattern classification;support vector machines; E-SMOTE technique;feature selection; high-dimensional imbalanced dataset; data mining;bio-informatics;dataset balancing;SVM classification;micro array dataset;Feature extraction; Genetic algorithms;Support vector machines;Data mining;Machine learning; Bioinformatics;Cancer;Imbalanced dataset;Featue Selection;E-SMOTE; Support Vector Machine[SVM]}, doi={10.1109/ICECTECH.2011.5941710}, ISSN={}, month={April}}
- Notes:
- This technique is basically unreproducible; I tried to implement something following the idea of applying a simple genetic algorithm for optimization.
- In my best understanding, the technique uses evolutionary algorithms for feature selection and then applies vanilla SMOTE on the selected features only.
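Following the interpretation in the notes (genetic-algorithm feature selection, then SMOTE on the chosen features), the selection step might look like this rough sketch; the fitness function, population size, and mutation rate are all assumptions, not values from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def ga_feature_selection(X, y, n_generations=5, pop_size=8, min_features=2, seed=0):
    """Tiny genetic algorithm over feature bitmasks; fitness is the
    cross-validated accuracy of a simple classifier."""
    rng = np.random.RandomState(seed)
    n_features = X.shape[1]

    def fitness(mask):
        if mask.sum() < min_features:
            return 0.0  # enforce the min_features constraint
        clf = LogisticRegression(max_iter=200)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()

    pop = rng.rand(pop_size, n_features) > 0.5  # random initial bitmasks
    for _ in range(n_generations):
        scores = np.array([fitness(m) for m in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]    # selection: keep the best half
        children = parents.copy()
        flips = rng.rand(*children.shape) < 0.1  # mutation: flip ~10% of bits
        children ^= flips
        pop = np.vstack([parents, children])
    scores = np.array([fitness(m) for m in pop])
    return pop[np.argmax(scores)]

rng = np.random.RandomState(1)
X = rng.rand(60, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
best_mask = ga_feature_selection(X, y)
```

Vanilla SMOTE would then be applied to `X[:, best_mask]` only.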
DBSMOTE¶
API¶
-
class
smote_variants.
DBSMOTE
(proportion=1.0, eps=0.8, min_samples=3, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, eps=0.8, min_samples=3, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- eps (float) – eps parameter of DBSCAN
- min_samples (int) – min_samples parameter of DBSCAN
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.DBSMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{dbsmote, author="Bunkhumpornpat, Chumphol and Sinapiromsaran, Krung and Lursinsap, Chidchanok", title="DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique", journal="Applied Intelligence", year="2012", month="Apr", day="01", volume="36", number="3", pages="664--684", issn="1573-7497", doi="10.1007/s10489-011-0287-y", url="https://doi.org/10.1007/s10489-011-0287-y" }
- Notes:
- Standardization is needed to use absolute eps values.
- The clustering is likely to identify all instances as noise; this is fixed by recursive calls with increasing eps.
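The growing-eps fix mentioned in the notes can be sketched with sklearn's DBSCAN; the growth factor of 1.5 and the standardization step are illustrative choices, not the library's exact values:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def cluster_with_growing_eps(X, eps=0.8, min_samples=3, factor=1.5):
    """Run DBSCAN, increasing eps until at least one non-noise
    cluster is found (DBSCAN labels noise as -1)."""
    X = StandardScaler().fit_transform(X)  # absolute eps needs standardized data
    while True:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        if (labels != -1).any():
            return labels, eps
        eps *= factor  # all points were noise: relax the radius and retry

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 2))
labels, used_eps = cluster_with_growing_eps(X, eps=0.01)
```

Starting from a deliberately tiny eps, the loop keeps enlarging the neighborhood radius until some cluster survives, mirroring the recursive-call fix.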
ASMOBD¶
API¶
-
class
smote_variants.
ASMOBD
(proportion=1.0, min_samples=3, eps=0.8, eta=0.5, T_1=1.0, T_2=1.0, t_1=4.0, t_2=4.0, a=0.05, smoothing='linear', n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, min_samples=3, eps=0.8, eta=0.5, T_1=1.0, T_2=1.0, t_1=4.0, t_2=4.0, a=0.05, smoothing='linear', n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- min_samples (int) – parameter of OPTICS
- eps (float) – parameter of OPTICS
- eta (float) – trade-off parameter
- T_1 (float) – noise threshold (see paper)
- T_2 (float) – noise threshold (see paper)
- t_1 (float) – noise threshold (see paper)
- t_2 (float) – noise threshold (see paper)
- a (float) – smoothing factor (see paper)
- smoothing (str) – ‘sigmoid’/’linear’
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.ASMOBD()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{asmobd, author={Senzhang Wang and Zhoujun Li and Wenhan Chao and Qinghua Cao}, booktitle={The 2012 International Joint Conference on Neural Networks (IJCNN)}, title={Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning}, year={2012}, volume={}, number={}, pages={1-8}, doi={10.1109/IJCNN.2012.6252696}, ISSN={2161-4407}, month={June}}
- Notes:
- In order to use absolute thresholds, the data is standardized.
- The technique has many parameters; it is not easy to find the right combination.
Assembled_SMOTE¶
API¶
-
class
smote_variants.
Assembled_SMOTE
(proportion=1.0, n_neighbors=5, pop=2, thres=0.3, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, pop=2, thres=0.3, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbors component
- pop (int) – lower threshold on cluster sizes
- thres (float) – threshold on angles
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.Assembled_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{assembled_smote, author={Zhou, B. and Yang, C. and Guo, H. and Hu, J.}, booktitle={The 2013 International Joint Conference on Neural Networks (IJCNN)}, title={A quasi-linear SVM combined with assembled SMOTE for imbalanced data classification}, year={2013}, volume={}, number={}, pages={1-7}, keywords={approximation theory;interpolation; pattern classification;sampling methods;support vector machines;trees (mathematics);quasilinear SVM; assembled SMOTE;imbalanced dataset classification problem;oversampling method;quasilinear kernel function; approximate nonlinear separation boundary;mulitlocal linear boundaries; interpolation;data distribution information;minimal spanning tree; local linear partitioning method; linear separation boundary;synthetic minority class samples;oversampled dataset classification;standard SVM; composite quasilinear kernel function; artificial data datasets;benchmark datasets;classification performance improvement;synthetic minority over-sampling technique;Support vector machines;Kernel;Merging;Standards; Sociology;Statistics;Interpolation}, doi={10.1109/IJCNN.2013.6707035}, ISSN={2161-4407}, month={Aug}}
- Notes:
- The absolute value of the extracted angles should be taken (implemented this way).
- It is not specified how many samples are generated in the various clusters.
SDSMOTE¶
API¶
-
class
smote_variants.
SDSMOTE
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbors component
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SDSMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{sdsmote, author={Li, K. and Zhang, W. and Lu, Q. and Fang, X.}, booktitle={2014 International Conference on Identification, Information and Knowledge in the Internet of Things}, title={An Improved SMOTE Imbalanced Data Classification Method Based on Support Degree}, year={2014}, volume={}, number={}, pages={34-38}, keywords={data mining;pattern classification; sampling methods;improved SMOTE imbalanced data classification method;support degree;data mining; class distribution;imbalanced data-set classification;over sampling method;minority class sample generation;minority class sample selection;minority class boundary sample identification;Classification algorithms;Training;Bagging;Computers; Testing;Algorithm design and analysis; Data mining;Imbalanced data-sets; Classification;Boundary sample;Support degree;SMOTE}, doi={10.1109/IIKI.2014.14}, ISSN={}, month={Oct}}
DSMOTE¶
API¶
-
class
smote_variants.
DSMOTE
(proportion=1.0, n_neighbors=5, rate=0.1, n_step=50, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, rate=0.1, n_step=50, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbors component
- rate (float) – [0,1] rate of minority samples to turn into majority
- n_step (int) – number of random configurations to check for new samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.DSMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{dsmote, author={Mahmoudi, S. and Moradi, P. and Akhlaghian, F. and Moradi, R.}, booktitle={2014 4th International Conference on Computer and Knowledge Engineering (ICCKE)}, title={Diversity and separable metrics in over-sampling technique for imbalanced data classification}, year={2014}, volume={}, number={}, pages={152-158}, keywords={learning (artificial intelligence); pattern classification;sampling methods;diversity metric;separable metric;over-sampling technique; imbalanced data classification; class distribution techniques; under-sampling technique;DSMOTE method; imbalanced learning problem;diversity measure;separable measure;Iran University of Medical Science;UCI dataset;Accuracy;Classification algorithms;Vectors;Educational institutions;Euclidean distance; Data mining;Diversity measure; Separable Measure;Over-Sampling; Imbalanced Data;Classification problems}, doi={10.1109/ICCKE.2014.6993409}, ISSN={}, month={Oct}}
- Notes:
- The method is highly inefficient when the number of minority samples is high: the time complexity is O(n^3), so with 1000 minority samples it takes about 1e9 objective function evaluations to find 1 new sample point. Adding 1000 samples would take about 1e12 evaluations of the objective function, which is infeasible. We introduce a new parameter, n_step: during the search for a new sample, at most n_step combinations of minority samples are tried.
- Abnormality of minority points is defined in the paper as D_maj/D_min; high abnormality means that the minority point is close to other minority points and very far from majority points. This is definitely not abnormality, so I have implemented the opposite.
- Nothing ensures that the Fisher statistics and the variance from the geometric mean remain comparable, which might skew the optimization towards one of the sub-objectives.
- MinMax normalization doesn't work: each attribute will have a 0 value, which makes the geometric mean of all attributes 0.
G_SMOTE¶
API¶
-
class
smote_variants.
G_SMOTE
(proportion=1.0, n_neighbors=5, method='linear', n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, method='linear', n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbors component
- method (str) – ‘linear’/’non-linear_2.0’ - the float can be any number: standard deviation in the Gaussian-kernel
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.G_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{g_smote, author={Sandhan, T. and Choi, J. Y.}, booktitle={2014 22nd International Conference on Pattern Recognition}, title={Handling Imbalanced Datasets by Partially Guided Hybrid Sampling for Pattern Recognition}, year={2014}, volume={}, number={}, pages={1449-1453}, keywords={Gaussian processes;learning (artificial intelligence);pattern classification; regression analysis;sampling methods; support vector machines;imbalanced datasets;partially guided hybrid sampling;pattern recognition;real-world domains;skewed datasets;dataset rebalancing;learning algorithm; extremely low minority class samples; classification tasks;extracted hidden patterns;support vector machine; logistic regression;nearest neighbor; Gaussian process classifier;Support vector machines;Proteins;Pattern recognition;Kernel;Databases;Gaussian processes;Vectors;Imbalanced dataset; protein classification;ensemble classifier;bootstrapping;Sat-image classification;medical diagnoses}, doi={10.1109/ICPR.2014.258}, ISSN={1051-4651}, month={Aug}}
- Notes:
- The non-linear approach is inefficient.
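The method string encodes both the mode and, for the non-linear case, the standard deviation of the Gaussian kernel as a float suffix. A minimal sketch of how such a value can be interpreted; the parsing helper is illustrative, not the library's internal code:

```python
def parse_method(method):
    # 'linear' carries no kernel parameter; 'non-linear_<sigma>'
    # carries the standard deviation of the Gaussian kernel
    if method == 'linear':
        return 'linear', None
    mode, sigma = method.rsplit('_', 1)
    return mode, float(sigma)

print(parse_method('linear'))          # ('linear', None)
print(parse_method('non-linear_2.0'))  # ('non-linear', 2.0)
```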
NT_SMOTE¶
API¶
-
class
smote_variants.
NT_SMOTE
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.NT_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{nt_smote, author={Xu, Y. H. and Li, H. and Le, L. P. and Tian, X. Y.}, booktitle={2014 Seventh International Joint Conference on Computational Sciences and Optimization}, title={Neighborhood Triangular Synthetic Minority Over-sampling Technique for Imbalanced Prediction on Small Samples of Chinese Tourism and Hospitality Firms}, year={2014}, volume={}, number={}, pages={534-538}, keywords={financial management;pattern classification;risk management;sampling methods;travel industry;Chinese tourism; hospitality firms;imbalanced risk prediction;minority class samples; up-sampling approach;neighborhood triangular synthetic minority over-sampling technique;NT-SMOTE; nearest neighbor idea;triangular area sampling idea;single classifiers;data excavation principles;hospitality industry;missing financial indicators; financial data filtering;financial risk prediction;MDA;DT;LSVM;logit;probit; firm risk prediction;Joints; Optimization;imbalanced datasets; NT-SMOTE;neighborhood triangular; random sampling}, doi={10.1109/CSO.2014.104}, ISSN={}, month={July}}
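The "triangular area sampling idea" mentioned in the reference can be illustrated by drawing a synthetic point inside the triangle spanned by three minority samples. A minimal sketch of that idea; the helper is illustrative and not the exact procedure of the paper:

```python
import numpy as np

def sample_in_triangle(a, b, c, rng):
    # draw barycentric weights; the square-root trick makes the
    # point uniformly distributed over the triangle
    r1, r2 = rng.random(), rng.random()
    s = np.sqrt(r1)
    w = np.array([1.0 - s, s * (1.0 - r2), s * r2])
    return w[0] * a + w[1] * b + w[2] * c

rng = np.random.default_rng(42)
a, b, c = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
p = sample_in_triangle(a, b, c, rng)
# for these vertices, p >= 0 componentwise and p.sum() <= 1
```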
Lee¶
API¶
-
class
smote_variants.
Lee
(proportion=1.0, n_neighbors=5, rejection_level=0.5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, rejection_level=0.5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in nearest neighbor component
- rejection_level (float) – the rejection level of generated samples, if the fraction of majority labels in the local environment is higher than this number, the generated point is rejected
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.Lee()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{lee, author = {Lee, Jaedong and Kim, Noo-ri and Lee, Jee-Hyong}, title = {An Over-sampling Technique with Rejection for Imbalanced Class Learning}, booktitle = {Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication}, series = {IMCOM '15}, year = {2015}, isbn = {978-1-4503-3377-1}, location = {Bali, Indonesia}, pages = {102:1--102:6}, articleno = {102}, numpages = {6}, doi = {10.1145/2701126.2701181}, acmid = {2701181}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {data distribution, data preprocessing, imbalanced problem, rejection rule, synthetic minority oversampling technique} }
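The rejection rule behind rejection_level can be sketched directly from the parameter description above: a generated point is rejected when the fraction of majority labels in its local environment exceeds the level. The helper below is illustrative, not the library's internal code:

```python
def reject(neighbor_labels, majority_label, rejection_level=0.5):
    # a candidate is rejected when the fraction of majority labels
    # among its nearest neighbors exceeds the rejection level
    frac_maj = sum(l == majority_label for l in neighbor_labels) / len(neighbor_labels)
    return frac_maj > rejection_level

print(reject([0, 0, 0, 1, 1], majority_label=0))  # True  (3/5 > 0.5)
print(reject([0, 1, 1, 1, 1], majority_label=0))  # False (1/5 <= 0.5)
```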
SPY¶
API¶
-
class
smote_variants.
SPY
(n_neighbors=5, threshold=0.5, n_jobs=1, random_state=None)[source]¶ -
__init__
(n_neighbors=5, threshold=0.5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - n_neighbors (int) – number of neighbors in the nearest neighbors component
- threshold (float) – threshold parameter of the method
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SPY()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{spy, author={Dang, X. T. and Tran, D. H. and Hirose, O. and Satou, K.}, booktitle={2015 Seventh International Conference on Knowledge and Systems Engineering (KSE)}, title={SPY: A Novel Resampling Method for Improving Classification Performance in Imbalanced Data}, year={2015}, volume={}, number={}, pages={280-285}, keywords={decision making;learning (artificial intelligence);pattern classification; sampling methods;SPY;resampling method;decision-making process; biomedical data classification; class imbalance learning method; SMOTE;oversampling method;UCI machine learning repository;G-mean value;borderline-SMOTE; safe-level-SMOTE;Support vector machines;Training;Bioinformatics; Proteins;Protein engineering;Radio frequency;Sensitivity;Imbalanced dataset;Over-sampling; Under-sampling;SMOTE; borderline-SMOTE}, doi={10.1109/KSE.2015.24}, ISSN={}, month={Oct}}
SMOTE_PSOBAT¶
API¶
-
class
smote_variants.
SMOTE_PSOBAT
(maxit=50, c1=0.3, c2=0.1, c3=0.1, alpha=0.9, gamma=0.9, method='bat', n_jobs=1, random_state=None)[source]¶ -
__init__
(maxit=50, c1=0.3, c2=0.1, c3=0.1, alpha=0.9, gamma=0.9, method='bat', n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - maxit (int) – maximum number of iterations
- c1 (float) – inertia weight of PSO
- c2 (float) – attraction of local maximums in PSO
- c3 (float) – attraction of global maximum in PSO
- alpha (float) – alpha parameter of the method
- gamma (float) – gamma parameter of the method
- method (str) – optimization technique to be used
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SMOTE_PSOBAT()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{smote_psobat, author={Li, J. and Fong, S. and Zhuang, Y.}, booktitle={2015 3rd International Symposium on Computational and Business Intelligence (ISCBI)}, title={Optimizing SMOTE by Metaheuristics with Neural Network and Decision Tree}, year={2015}, volume={}, number={}, pages={26-32}, keywords={data mining;particle swarm optimisation;pattern classification; data mining;classifier;metaherustics; SMOTE parameters;performance indicators;selection optimization; PSO;particle swarm optimization algorithm;BAT;bat-inspired algorithm; metaheuristic optimization algorithms; nearest neighbors;imbalanced dataset problem;synthetic minority over-sampling technique;decision tree; neural network;Classification algorithms;Neural networks;Decision trees;Training;Optimization;Particle swarm optimization;Data mining;SMOTE; Swarm Intelligence;parameter selection optimization}, doi={10.1109/ISCBI.2015.12}, ISSN={}, month={Dec}}
- Notes:
- The parameters of the memetic algorithms are not specified.
- I have checked multiple papers describing the BAT algorithm, but the meaning of "Generate a new solution by flying randomly" is still unclear.
- It is also unclear whether best solutions are recorded for each bat or for the entire population.
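For method='pso', the c1, c2 and c3 parameters map onto the standard particle swarm velocity update: inertia plus attraction towards the particle's own best and the global best positions. A generic sketch of that update, with illustrative names rather than the library's internal encoding:

```python
import numpy as np

def pso_velocity(v, x, p_best, g_best, c1, c2, c3, rng):
    # c1: inertia weight, c2: attraction of the particle's best position,
    # c3: attraction of the global best position
    r2, r3 = rng.random(), rng.random()
    return c1 * v + c2 * r2 * (p_best - x) + c3 * r3 * (g_best - x)

rng = np.random.default_rng(0)
v = np.zeros(2)
x = np.array([1.0, 1.0])
p_best = g_best = np.array([0.0, 0.0])
v_new = pso_velocity(v, x, p_best, g_best, c1=0.3, c2=0.1, c3=0.1, rng=rng)
# both attraction terms point from x towards the shared best position
```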
MDO¶
API¶
-
class
smote_variants.
MDO
(proportion=1.0, K2=5, K1_frac=0.5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, K2=5, K1_frac=0.5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- K2 (int) – number of neighbors
- K1_frac (float) – the fraction of K2 to set K1
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.MDO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@ARTICLE{mdo, author={Abdi, L. and Hashemi, S.}, journal={IEEE Transactions on Knowledge and Data Engineering}, title={To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques}, year={2016}, volume={28}, number={1}, pages={238-251}, keywords={covariance analysis;learning (artificial intelligence);modelling;pattern classification;sampling methods; statistical distributions;minority class instance modelling;probability contour;covariance structure;MDO; Mahalanobis distance-based oversampling technique;data-oriented technique; model-oriented solution;machine learning algorithm;data skewness;multiclass imbalanced problem;Mathematical model; Training;Accuracy;Eigenvalues and eigenfunctions;Machine learning algorithms;Algorithm design and analysis; Benchmark testing;Multi-class imbalance problems;over-sampling techniques; Mahalanobis distance;Multi-class imbalance problems;over-sampling techniques; Mahalanobis distance}, doi={10.1109/TKDE.2015.2458858}, ISSN={1041-4347}, month={Jan}}
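MDO is organized around the Mahalanobis distance (see the keywords above), which measures distance relative to the covariance structure of the class. A minimal numpy sketch of the distance itself, not the full oversampling procedure:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    # distance of x from the class mean, scaled by the inverse covariance
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

X_min = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
mean = X_min.mean(axis=0)          # [1., 1.]
cov = np.cov(X_min, rowvar=False)  # diagonal for this symmetric cloud
print(mahalanobis(np.array([1.0, 1.0]), mean, cov))  # 0.0 at the mean
```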
Random_SMOTE¶
API¶
-
class
smote_variants.
Random_SMOTE
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.Random_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{random_smote, author="Dong, Yanjie and Wang, Xuehua", editor="Xiong, Hui and Lee, W. B.", title="A New Over-Sampling Approach: Random-SMOTE for Learning from Imbalanced Data Sets", booktitle="Knowledge Science, Engineering and Management", year="2011", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="343--352", isbn="978-3-642-25975-3" }
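Under one reading of the paper, Random-SMOTE generates a point in two interpolation steps: first a temporary point between two selected minority neighbors, then a point between the original sample and that temporary point. A hedged sketch of this reading; the helper name is illustrative:

```python
import numpy as np

def random_smote_point(x, y1, y2, rng):
    # step 1: temporary point on the segment between the two neighbors
    t = y1 + rng.random() * (y2 - y1)
    # step 2: synthetic point on the segment between x and the temporary point
    return x + rng.random() * (t - x)

rng = np.random.default_rng(0)
x = np.array([0.0, 0.0])
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
p = random_smote_point(x, y1, y2, rng)
# p is a convex combination of x, y1 and y2, so it stays in their convex hull
```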
ISMOTE¶
API¶
-
class
smote_variants.
ISMOTE
(n_neighbors=5, minority_weight=0.5, n_jobs=1, random_state=None)[source]¶ -
__init__
(n_neighbors=5, minority_weight=0.5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - n_neighbors (int) – number of neighbors
- minority_weight (float) – weight parameter of the minority samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.ISMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{ismote, author="Li, Hu and Zou, Peng and Wang, Xiang and Xia, Rongze", editor="Sun, Zengqi and Deng, Zhidong", title="A New Combination Sampling Method for Imbalanced Data", booktitle="Proceedings of 2013 Chinese Intelligent Automation Conference", year="2013", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="547--554", isbn="978-3-642-38466-0" }
VIS_RST¶
API¶
-
class
smote_variants.
VIS_RST
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.VIS_RST()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{vis_rst, author="Borowska, Katarzyna and Stepaniuk, Jaroslaw", editor="Saeed, Khalid and Homenda, Wladyslaw", title="Imbalanced Data Classification: A Novel Re-sampling Approach Combining Versatile Improved SMOTE and Rough Sets", booktitle="Computer Information Systems and Industrial Management", year="2016", publisher="Springer International Publishing", address="Cham", pages="31--42", isbn="978-3-319-45378-1" }
- Notes:
- Replication of DANGER samples will be removed by the last step of noise filtering.
GASMOTE¶
API¶
-
class
smote_variants.
GASMOTE
(n_neighbors=5, maxn=7, n_pop=10, popl3=5, pm=0.3, pr=0.2, Ge=10, n_jobs=1, random_state=None)[source]¶ -
__init__
(n_neighbors=5, maxn=7, n_pop=10, popl3=5, pm=0.3, pr=0.2, Ge=10, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - n_neighbors (int) – number of neighbors
- maxn (int) – maximum number of samples to generate per minority instance
- n_pop (int) – size of population
- popl3 (int) – number of crossovers
- pm (float) – mutation probability
- pr (float) – selection probability
- Ge (int) – number of generations
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.GASMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{gasmote, author="Jiang, Kun and Lu, Jing and Xia, Kuiliang", title="A Novel Algorithm for Imbalance Data Classification Based on Genetic Algorithm Improved SMOTE", journal="Arabian Journal for Science and Engineering", year="2016", month="Aug", day="01", volume="41", number="8", pages="3255--3266", issn="2191-4281", doi="10.1007/s13369-016-2179-2", url="https://doi.org/10.1007/s13369-016-2179-2" }
A_SUWO¶
API¶
-
class
smote_variants.
A_SUWO
(proportion=1.0, n_neighbors=5, n_clus_maj=7, c_thres=0.8, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_clus_maj=7, c_thres=0.8, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- n_clus_maj (int) – number of majority clusters
- c_thres (float) – threshold on distances
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.A_SUWO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{a_suwo, title = "Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets", journal = "Expert Systems with Applications", volume = "46", pages = "405 - 416", year = "2016", issn = "0957-4174", doi = "https://doi.org/10.1016/j.eswa.2015.10.031", author = "Iman Nekooeimehr and Susana K. Lai-Yuen", keywords = "Imbalanced dataset, Classification, Clustering, Oversampling" }
- Notes:
- Equation (7) misses a division by R_j.
- It is not specified how to sample from clusters containing a single instance.
SMOTE_FRST_2T¶
API¶
-
class
smote_variants.
SMOTE_FRST_2T
(n_neighbors=5, gamma_S=0.7, gamma_M=0.03, n_jobs=1, random_state=None)[source]¶ -
__init__
(n_neighbors=5, gamma_S=0.7, gamma_M=0.03, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - n_neighbors (int) – number of neighbors in the SMOTE sampling
- gamma_S (float) – gamma_S parameter of the method
- gamma_M (float) – gamma_M parameter of the method
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SMOTE_FRST_2T()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_frst_2t, title = "Fuzzy-rough imbalanced learning for the diagnosis of High Voltage Circuit Breaker maintenance: The SMOTE-FRST-2T algorithm", journal = "Engineering Applications of Artificial Intelligence", volume = "48", pages = "134 - 139", year = "2016", issn = "0952-1976", doi = "https://doi.org/10.1016/j.engappai.2015.10.009", author = "Ramentol, E. and Gondres, I. and Lajes, S. and Bello, R. and Caballero,Y. and Cornelis, C. and Herrera, F.", keywords = "High Voltage Circuit Breaker (HVCB), Imbalanced learning, Fuzzy rough set theory, Resampling methods" }
- Notes:
- An unlucky setting of parameters might result in 0 points being added; we have fixed this by increasing the gamma_S threshold if the number of accepted samples is low.
- Similarly, an unlucky setting of parameters might result in all majority samples being turned into minority ones.
- In my opinion, the relations in the algorithm presented in the paper are incorrect: the authors talk about accepting samples with a POS score below a threshold, yet the algorithm uses POS >= gamma in both places.
AND_SMOTE¶
API¶
-
class
smote_variants.
AND_SMOTE
(proportion=1.0, K=15, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, K=15, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- K (int) – maximum number of nearest neighbors
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.AND_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{and_smote, author = {Yun, Jaesub and Ha, Jihyun and Lee, Jong-Seok}, title = {Automatic Determination of Neighborhood Size in SMOTE}, booktitle = {Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication}, series = {IMCOM '16}, year = {2016}, isbn = {978-1-4503-4142-4}, location = {Danang, Viet Nam}, pages = {100:1--100:8}, articleno = {100}, numpages = {8}, doi = {10.1145/2857546.2857648}, acmid = {2857648}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {SMOTE, imbalanced learning, synthetic data generation}, }
NRAS¶
API¶
-
class
smote_variants.
NRAS
(proportion=1.0, n_neighbors=5, t=0.5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, t=0.5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- t (float) – [0,1] fraction of n_neighbors as threshold
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.NRAS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{nras, title = "Noise Reduction A Priori Synthetic Over-Sampling for class imbalanced data sets", journal = "Information Sciences", volume = "408", pages = "146 - 161", year = "2017", issn = "0020-0255", doi = "https://doi.org/10.1016/j.ins.2017.04.046", author = "William A. Rivera", keywords = "NRAS, SMOTE, OUPS, Class imbalance, Classification" }
AMSCO¶
API¶
-
class
smote_variants.
AMSCO
(n_pop=5, n_iter=15, omega=0.1, r1=0.1, r2=0.1, n_jobs=1, classifier=DecisionTreeClassifier(random_state=2), random_state=None)[source]¶ -
__init__
(n_pop=5, n_iter=15, omega=0.1, r1=0.1, r2=0.1, n_jobs=1, classifier=DecisionTreeClassifier(random_state=2), random_state=None)[source]¶ Constructor of the sampling object
Parameters: - n_pop (int) – size of populations
- n_iter (int) – number of iterations
- omega (float) – omega parameter of the method
- r1 (float) – r1 parameter of the method
- r2 (float) – r2 parameter of the method
- n_jobs (int) – number of parallel jobs
- classifier (obj) – classifier used to evaluate the samplings
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.AMSCO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{amsco, title = "Adaptive multi-objective swarm fusion for imbalanced data classification", journal = "Information Fusion", volume = "39", pages = "1 - 24", year = "2018", issn = "1566-2535", doi = "https://doi.org/10.1016/j.inffus.2017.03.007", author = "Jinyan Li and Simon Fong and Raymond K. Wong and Victor W. Chu", keywords = "Swarm fusion, Swarm intelligence algorithm, Multi-objective, Crossover rebalancing, Imbalanced data classification" }
- Notes:
- It is not clear how the kappa threshold is used; I use the RA score to drive the entire evolution. In particular, the paper states:
"In the last phase of each iteration, the average Kappa value in current non-inferior set is compare with the latest threshold value, the threshold is then increase further if the average value increases, and vice versa. By doing so, the non-inferior region will be progressively reduced as the Kappa threshold lifts up."
I don't see why the Kappa threshold would lift up if the threshold is decreased whenever the average Kappa decreases ("vice versa").
- Due to the interpretation of the kappa threshold and the lack of a detailed description of the SIS process, the implementation is not exactly what is described in the paper, but something very similar.
SSO¶
API¶
-
class
smote_variants.
SSO
(proportion=1.0, n_neighbors=5, h=10, n_iter=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, h=10, n_iter=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- h (int) – number of hidden units
- n_iter (int) – optimization steps
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SSO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{sso, author="Rong, Tongwen and Gong, Huachang and Ng, Wing W. Y.", editor="Wang, Xizhao and Pedrycz, Witold and Chan, Patrick and He, Qiang", title="Stochastic Sensitivity Oversampling Technique for Imbalanced Data", booktitle="Machine Learning and Cybernetics", year="2014", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="161--171", isbn="978-3-662-45652-1" }
- Notes:
- In the algorithm, step 2d adds a constant to a vector. I have changed it to a componentwise adjustment, and also used the normalized STSM, as I don't see any reason why the unnormalized value would be a reasonable, bounded quantity.
NDO_sampling¶
API¶
-
class
smote_variants.
NDO_sampling
(proportion=1.0, n_neighbors=5, T=0.5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, T=0.5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- T (float) – threshold parameter
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.NDO_sampling()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{ndo_sampling, author={Zhang, L. and Wang, W.}, booktitle={2011 International Conference of Information Technology, Computer Engineering and Management Sciences}, title={A Re-sampling Method for Class Imbalance Learning with Credit Data}, year={2011}, volume={1}, number={}, pages={393-397}, keywords={data handling;sampling methods; resampling method;class imbalance learning;credit rating;imbalance problem;synthetic minority over-sampling technique;sample distribution;synthetic samples; credit data set;Training; Measurement;Support vector machines; Logistics;Testing;Noise;Classification algorithms;class imbalance;credit rating;SMOTE;sample distribution}, doi={10.1109/ICM.2011.34}, ISSN={}, month={Sept}}
DSRBF¶
API¶
-
class
smote_variants.
DSRBF
(proportion=1.0, n_neighbors=5, m_min=4, m_max=10, Ib=2, Ob=2, n_pop=500, n_init_pop=5000, n_iter=40, n_sampling_epoch=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, m_min=4, m_max=10, Ib=2, Ob=2, n_pop=500, n_init_pop=5000, n_iter=40, n_sampling_epoch=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in the SMOTE sampling
- m_min (int) – minimum number of hidden units
- m_max (int) – maximum number of hidden units
- Ib (float) – input weight range
- Ob (float) – output weight range
- n_pop (int) – size of population
- n_init_pop (int) – size of initial population
- n_iter (int) – number of iterations
- n_sampling_epoch (int) – resampling after this many iterations
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.DSRBF()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{dsrbf, title = "A dynamic over-sampling procedure based on sensitivity for multi-class problems", journal = "Pattern Recognition", volume = "44", number = "8", pages = "1821 - 1833", year = "2011", issn = "0031-3203", doi = "https://doi.org/10.1016/j.patcog.2011.02.019", author = "Francisco Fernández-Navarro and César Hervás-Martínez and Pedro Antonio Gutiérrez", keywords = "Classification, Multi-class, Sensitivity, Accuracy, Memetic algorithm, Imbalanced datasets, Over-sampling method, SMOTE" }
- Notes:
- It is not entirely clear why J-1 outputs are supposed, where J is the number of classes.
- The fitness function is changed to a balanced mean loss, as I found that it otherwise just ignores classification of minority samples (class label +1) in the binary case.
- The iRprop+ optimization is not implemented.
- The original paper proposes using SMOTE incrementally. Instead, this implementation applies SMOTE to generate all samples needed in the sampling epochs, and the evolution of RBF networks is used to select the sampling providing the best results.
Gaussian_SMOTE¶
API¶
-
class
smote_variants.
Gaussian_SMOTE
(proportion=1.0, n_neighbors=5, sigma=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, sigma=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- sigma (float) – variance
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.Gaussian_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{gaussian_smote, title={Gaussian-Based SMOTE Algorithm for Solving Skewed Class Distributions}, author={Hansoo Lee and Jonggeun Kim and Sungshin Kim}, journal={Int. J. Fuzzy Logic and Intelligent Systems}, year={2017}, volume={17}, pages={229-234} }
kmeans_SMOTE¶
API¶
-
class
smote_variants.
kmeans_SMOTE
(proportion=1.0, n_neighbors=5, n_clusters=10, irt=2.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_clusters=10, irt=2.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors
- n_clusters (int) – number of clusters
- irt (float) – imbalanced ratio threshold
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.kmeans_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{kmeans_smote, title = "Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE", journal = "Information Sciences", volume = "465", pages = "1 - 20", year = "2018", issn = "0020-0255", doi = "https://doi.org/10.1016/j.ins.2018.06.056", author = "Georgios Douzas and Fernando Bacao and Felix Last", keywords = "Class-imbalanced learning, Oversampling, Classification, Clustering, Supervised learning, Within-class imbalance" }
Supervised_SMOTE¶
API¶
-
class
smote_variants.
Supervised_SMOTE
(proportion=1.0, th_lower=0.5, th_upper=1.0, classifier=RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=5), n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, th_lower=0.5, th_upper=1.0, classifier=RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=5), n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- th_lower (float) – lower bound of the confidence interval
- th_upper (float) – upper bound of the confidence interval
- classifier (obj) – classifier used to estimate class memberships
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.Supervised_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{supervised_smote, author = {Hu, Jun AND He, Xue AND Yu, Dong-Jun AND Yang, Xi-Bei AND Yang, Jing-Yu AND Shen, Hong-Bin}, journal = {PLOS ONE}, publisher = {Public Library of Science}, title = {A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction}, year = {2014}, month = {09}, volume = {9}, url = {https://doi.org/10.1371/journal.pone.0107676}, pages = {1-10}, number = {9}, doi = {10.1371/journal.pone.0107676} }
SN_SMOTE¶
API¶
-
class
smote_variants.
SN_SMOTE
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=5, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (float) – number of neighbors
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.SN_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{sn_smote, author="Garc{'i}a, V. and S{'a}nchez, J. S. and Mart{'i}n-F{'e}lez, R. and Mollineda, R. A.", title="Surrounding neighborhood-based SMOTE for learning from imbalanced data sets", journal="Progress in Artificial Intelligence", year="2012", month="Dec", day="01", volume="1", number="4", pages="347--362", issn="2192-6360", doi="10.1007/s13748-012-0027-5", url="https://doi.org/10.1007/s13748-012-0027-5" }
CCR¶
API¶
-
class
smote_variants.
CCR
(proportion=1.0, energy=1.0, scaling=0.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, energy=1.0, scaling=0.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- energy (float) – energy parameter
- scaling (float) – scaling factor
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.CCR()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{ccr, author = {Koziarski, Michał and Wozniak, Michal}, year = {2017}, month = {12}, pages = {727–736}, title = {CCR: A combined cleaning and resampling algorithm for imbalanced data classification}, volume = {27}, journal = {International Journal of Applied Mathematics and Computer Science} }
- Notes:
- Adapted from https://github.com/michalkoziarski/CCR
ANS¶
API¶
-
class
smote_variants.
ANS
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.ANS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{ans, author = {Siriseriwan, W and Sinapiromsaran, Krung}, year = {2017}, month = {09}, pages = {565-576}, title = {Adaptive neighbor synthetic minority oversampling technique under 1NN outcast handling}, volume = {39}, booktitle = {Songklanakarin Journal of Science and Technology} }
- Notes:
- The method is not prepared for the case when there is no c satisfying
- the condition in line 25 of the algorithm, fixed.
- The method is not prepared for empty Pused sets, fixed.
cluster_SMOTE¶
API¶
-
class
smote_variants.
cluster_SMOTE
(proportion=1.0, n_neighbors=3, n_clusters=3, n_jobs=1, random_state=None)[source]¶ -
__init__
(proportion=1.0, n_neighbors=3, n_clusters=3, n_jobs=1, random_state=None)[source]¶ Constructor of the sampling object
Parameters: - proportion (float) – proportion of the difference of n_maj and n_min to sample e.g. 1.0 means that after sampling the number of minority samples will be equal to the number of majority samples
- n_neighbors (int) – number of neighbors in SMOTE
- n_clusters (int) – number of clusters
- n_jobs (int) – number of parallel jobs
- random_state (int/RandomState/None) – initializer of random_state, like in sklearn
-
get_params
(deep=False)[source]¶ Returns: the parameters of the current sampling object Return type: dict
-
Example¶
>>> oversampler= smote_variants.cluster_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{cluster_SMOTE, author={Cieslak, D. A. and Chawla, N. V. and Striegel, A.}, booktitle={2006 IEEE International Conference on Granular Computing}, title={Combating imbalance in network intrusion datasets}, year={2006}, volume={}, number={}, pages={732-737}, keywords={Intelligent networks;Intrusion detection; Telecommunication traffic;Data mining; Computer networks;Data security; Machine learning;Counting circuits; Computer security;Humans}, doi={10.1109/GRC.2006.1635905}, ISSN={}, month={May}}