Oversamplers
NoSMOTE
API
Example
>>> oversampler= smote_variants.NoSMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
This class passes data through any model selection or evaluation pipeline without carrying out oversampling. It can be used to get baseline estimates of performance.
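As a rough illustration of the pass-through idea, the sketch below defines a hypothetical `PassThroughSampler` class (an assumption for illustration, not part of smote_variants) that exposes the same `sample(X, y)` call as the example above and returns the data unchanged.

```python
import copy


class PassThroughSampler:
    """Hypothetical stand-in that illustrates a no-op 'oversampler'."""

    def sample(self, X, y):
        # Return copies so that later pipeline steps cannot mutate the inputs.
        return copy.deepcopy(X), copy.deepcopy(y)


X = [[0.0, 1.0], [1.0, 0.5], [0.2, 0.1]]
y = [0, 0, 1]

X_samp, y_samp = PassThroughSampler().sample(X, y)
```

Because the interface matches that of the real oversamplers, a pipeline written for them can be run once with such a sampler to obtain baseline scores.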
SMOTE
API
Example
>>> oversampler= smote_variants.SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote, author={Chawla, N. V. and Bowyer, K. W. and Hall, L. O. and Kegelmeyer, W. P.}, title={{SMOTE}: synthetic minority over-sampling technique}, journal={Journal of Artificial Intelligence Research}, volume={16}, year={2002}, pages={321--357} }
SMOTE_TomekLinks
API
Example
>>> oversampler= smote_variants.SMOTE_TomekLinks()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_tomeklinks_enn, author = {Batista, Gustavo E. A. P. A. and Prati, Ronaldo C. and Monard, Maria Carolina}, title = {A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data}, journal = {SIGKDD Explor. Newsl.}, issue_date = {June 2004}, volume = {6}, number = {1}, month = jun, year = {2004}, issn = {1931-0145}, pages = {20--29}, numpages = {10}, url = {http://doi.acm.org/10.1145/1007730.1007735}, doi = {10.1145/1007730.1007735}, acmid = {1007735}, publisher = {ACM}, address = {New York, NY, USA}, }
SMOTE_ENN
API
Example
>>> oversampler= smote_variants.SMOTE_ENN()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_tomeklinks_enn, author = {Batista, Gustavo E. A. P. A. and Prati, Ronaldo C. and Monard, Maria Carolina}, title = {A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data}, journal = {SIGKDD Explor. Newsl.}, issue_date = {June 2004}, volume = {6}, number = {1}, month = jun, year = {2004}, issn = {1931-0145}, pages = {20--29}, numpages = {10}, url = {http://doi.acm.org/10.1145/1007730.1007735}, doi = {10.1145/1007730.1007735}, acmid = {1007735}, publisher = {ACM}, address = {New York, NY, USA}, }
- Notes:
Can remove too many minority samples.
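The note can be made concrete with a minimal, pure-Python sketch of the edited nearest neighbours rule, where a sample is dropped when most of its k nearest neighbours carry a different label. The function name, `k=3` and the toy data are assumptions for illustration; this is not the library's implementation.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority vote of their k nearest neighbours."""
    keep = []
    for i, xi in enumerate(X):
        # Sort all other samples by Euclidean distance to sample i.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        votes = Counter(y[j] for j in others[:k])
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Five majority samples (label 0) and two minority samples (label 1) close to them.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.4, 0.6], [2.0, 2.0]]
y = [0, 0, 0, 0, 0, 1, 1]

X_clean, y_clean = edited_nearest_neighbours(X, y)
```

On this toy set both minority samples are voted out by their majority neighbours, so only the five majority samples remain.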
Borderline_SMOTE1
API
Example
>>> oversampler= smote_variants.Borderline_SMOTE1()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{borderlineSMOTE, author="Han, Hui and Wang, Wen-Yuan and Mao, Bing-Huan", editor="Huang, De-Shuang and Zhang, Xiao-Ping and Huang, Guang-Bin", title="Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning", booktitle="Advances in Intelligent Computing", year="2005", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="878--887", isbn="978-3-540-31902-3" }
Borderline_SMOTE2
API
Example
>>> oversampler= smote_variants.Borderline_SMOTE2()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{borderlineSMOTE, author="Han, Hui and Wang, Wen-Yuan and Mao, Bing-Huan", editor="Huang, De-Shuang and Zhang, Xiao-Ping and Huang, Guang-Bin", title="Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning", booktitle="Advances in Intelligent Computing", year="2005", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="878--887", isbn="978-3-540-31902-3" }
ADASYN
API
Example
>>> oversampler= smote_variants.ADASYN()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{adasyn, author={He, H. and Bai, Y. and Garcia, E. A. and Li, S.}, title={{ADASYN}: adaptive synthetic sampling approach for imbalanced learning}, booktitle={Proceedings of IJCNN}, year={2008}, pages={1322--1328} }
AHC
API
Example
>>> oversampler= smote_variants.AHC()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{AHC, title = "Learning from imbalanced data in surveillance of nosocomial infection", journal = "Artificial Intelligence in Medicine", volume = "37", number = "1", pages = "7 - 18", year = "2006", note = "Intelligent Data Analysis in Medicine", issn = "0933-3657", doi = "https://doi.org/10.1016/j.artmed.2005.03.002", url = {http://www.sciencedirect.com/science/article/pii/S0933365705000850}, author = "Gilles Cohen and Mélanie Hilario and Hugo Sax and Stéphane Hugonnet and Antoine Geissbuhler", keywords = "Nosocomial infection, Machine learning, Support vector machines, Data imbalance" }
LLE_SMOTE
API
Example
>>> oversampler= smote_variants.LLE_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{lle_smote, author={Wang, J. and Xu, M. and Wang, H. and Zhang, J.}, booktitle={2006 8th international Conference on Signal Processing}, title={Classification of Imbalanced Data by Using the SMOTE Algorithm and Locally Linear Embedding}, year={2006}, volume={3}, number={}, pages={}, keywords={artificial intelligence; biomedical imaging;medical computing; imbalanced data classification; SMOTE algorithm; locally linear embedding; medical imaging intelligence; synthetic minority oversampling technique; high-dimensional data; low-dimensional space; Biomedical imaging; Back;Training data; Data mining;Biomedical engineering; Research and development; Electronic mail;Pattern recognition; Performance analysis; Classification algorithms}, doi={10.1109/ICOSP.2006.345752}, ISSN={2164-5221}, month={Nov}}
- Notes:
- There might be numerical issues if the nearest neighbors contain
some element multiple times.
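One plausible source of such issues: locally linear embedding computes reconstruction weights by solving a small linear system built from the Gram matrix of neighbour offsets, and a repeated neighbour makes that matrix singular. The sketch below shows this for two neighbours; the helper names are hypothetical, and whether this is the exact failure mode in the library is an assumption.

```python
def local_gram(x, neighbors):
    """Gram matrix of the neighbour offsets, G[j][k] = (x - n_j) . (x - n_k)."""
    offsets = [[xi - ni for xi, ni in zip(x, n)] for n in neighbors]
    return [[sum(a * b for a, b in zip(u, v)) for v in offsets] for u in offsets]


def det2(G):
    """Determinant of a 2x2 matrix."""
    return G[0][0] * G[1][1] - G[0][1] * G[1][0]


x = [0.0, 0.0]
distinct = [[1.0, 0.0], [0.0, 2.0]]
repeated = [[1.0, 0.0], [1.0, 0.0]]  # the same neighbour listed twice

det_distinct = det2(local_gram(x, distinct))  # non-zero: the system is solvable
det_repeated = det2(local_gram(x, repeated))  # zero: the system is singular
```

A common remedy in LLE implementations is to add a small regularization term to the diagonal of the Gram matrix before solving.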
distance_SMOTE
API
Example
>>> oversampler= smote_variants.distance_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{distance_smote, author={de la Calleja, J. and Fuentes, O.}, booktitle={Proceedings of the Twentieth International Florida Artificial Intelligence}, title={A distance-based over-sampling method for learning from imbalanced data sets}, year={2007}, volume={3}, pages={634--635} }
- Notes:
It is not clear what the authors mean by “weighted distance”.
SMMO
API
Example
>>> oversampler= smote_variants.SMMO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{smmo, author = {de la Calleja, Jorge and Fuentes, Olac and González, Jesús}, booktitle= {Proceedings of the Twenty-First International Florida Artificial Intelligence Research Society Conference}, year = {2008}, month = {01}, pages = {276-281}, title = {Selecting Minority Examples from Misclassified Data for Over-Sampling.} }
- Notes:
- In this paper the ensemble is not specified. I have selected
some very fast, basic classifiers.
Also, it is not clear what the authors mean by “weighted distance”.
- The original technique is not prepared for the case when no minority
samples are classified correctly by the ensemble.
polynom_fit_SMOTE_bus
API
Example
>>> oversampler= smote_variants.polynom_fit_SMOTE_bus()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{polynomial_fit_smote, author={Gazzah, S. and Amara, N. E. B.}, booktitle={2008 The Eighth IAPR International Workshop on Document Analysis Systems}, title={New Oversampling Approaches Based on Polynomial Fitting for Imbalanced Data Sets}, year={2008}, volume={}, number={}, pages={677-684}, keywords={curve fitting;learning (artificial intelligence);mesh generation;pattern classification;polynomials;sampling methods;support vector machines; oversampling approach;polynomial fitting function;imbalanced data set;pattern classification task; class-modular strategy;support vector machine;true negative rate; true positive rate;star topology; bus topology;polynomial curve topology;mesh topology;Polynomials; Topology;Support vector machines; Support vector machine classification; Pattern classification;Performance evaluation;Training data;Text analysis;Data engineering;Convergence; writer identification system;majority class;minority class;imbalanced data sets;polynomial fitting functions; class-modular strategy}, doi={10.1109/DAS.2008.74}, ISSN={}, month={Sept},}
polynom_fit_SMOTE_mesh
API
Example
>>> oversampler= smote_variants.polynom_fit_SMOTE_mesh()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{polynomial_fit_smote, author={Gazzah, S. and Amara, N. E. B.}, booktitle={2008 The Eighth IAPR International Workshop on Document Analysis Systems}, title={New Oversampling Approaches Based on Polynomial Fitting for Imbalanced Data Sets}, year={2008}, volume={}, number={}, pages={677-684}, keywords={curve fitting;learning (artificial intelligence);mesh generation;pattern classification;polynomials;sampling methods;support vector machines; oversampling approach;polynomial fitting function;imbalanced data set;pattern classification task; class-modular strategy;support vector machine;true negative rate; true positive rate;star topology; bus topology;polynomial curve topology;mesh topology;Polynomials; Topology;Support vector machines; Support vector machine classification; Pattern classification;Performance evaluation;Training data;Text analysis;Data engineering;Convergence; writer identification system;majority class;minority class;imbalanced data sets;polynomial fitting functions; class-modular strategy}, doi={10.1109/DAS.2008.74}, ISSN={}, month={Sept},}
polynom_fit_SMOTE_star
API
Example
>>> oversampler= smote_variants.polynom_fit_SMOTE_star()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{polynomial_fit_smote, author={Gazzah, S. and Amara, N. E. B.}, booktitle={2008 The Eighth IAPR International Workshop on Document Analysis Systems}, title={New Oversampling Approaches Based on Polynomial Fitting for Imbalanced Data Sets}, year={2008}, volume={}, number={}, pages={677-684}, keywords={curve fitting;learning (artificial intelligence);mesh generation;pattern classification;polynomials;sampling methods;support vector machines; oversampling approach;polynomial fitting function;imbalanced data set;pattern classification task; class-modular strategy;support vector machine;true negative rate; true positive rate;star topology; bus topology;polynomial curve topology;mesh topology;Polynomials; Topology;Support vector machines; Support vector machine classification; Pattern classification;Performance evaluation;Training data;Text analysis;Data engineering;Convergence; writer identification system;majority class;minority class;imbalanced data sets;polynomial fitting functions; class-modular strategy}, doi={10.1109/DAS.2008.74}, ISSN={}, month={Sept},}
polynom_fit_SMOTE_poly
API
Example
>>> oversampler= smote_variants.polynom_fit_SMOTE_poly()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{polynomial_fit_smote, author={Gazzah, S. and Amara, N. E. B.}, booktitle={2008 The Eighth IAPR International Workshop on Document Analysis Systems}, title={New Oversampling Approaches Based on Polynomial Fitting for Imbalanced Data Sets}, year={2008}, volume={}, number={}, pages={677-684}, keywords={curve fitting;learning (artificial intelligence);mesh generation;pattern classification;polynomials;sampling methods;support vector machines; oversampling approach;polynomial fitting function;imbalanced data set;pattern classification task; class-modular strategy;support vector machine;true negative rate; true positive rate;star topology; bus topology;polynomial curve topology;mesh topology;Polynomials; Topology;Support vector machines; Support vector machine classification; Pattern classification;Performance evaluation;Training data;Text analysis;Data engineering;Convergence; writer identification system;majority class;minority class;imbalanced data sets;polynomial fitting functions; class-modular strategy}, doi={10.1109/DAS.2008.74}, ISSN={}, month={Sept},}
Stefanowski
API
Example
>>> oversampler= smote_variants.Stefanowski()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{stefanowski, author = {Stefanowski, Jerzy and Wilk, Szymon}, title = {Selective Pre-processing of Imbalanced Data for Improving Classification Performance}, booktitle = {Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery}, series = {DaWaK '08}, year = {2008}, isbn = {978-3-540-85835-5}, location = {Turin, Italy}, pages = {283--292}, numpages = {10}, url = {http://dx.doi.org/10.1007/978-3-540-85836-2_27}, doi = {10.1007/978-3-540-85836-2_27}, acmid = {1430591}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, }
Safe_Level_SMOTE
API
Example
>>> oversampler= smote_variants.Safe_Level_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{safe_level_smote, author = { Bunkhumpornpat, Chumphol and Sinapiromsaran, Krung and Lursinsap, Chidchanok}, title = {Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem}, booktitle = {Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining}, series = {PAKDD '09}, year = {2009}, isbn = {978-3-642-01306-5}, location = {Bangkok, Thailand}, pages = {475--482}, numpages = {8}, url = {http://dx.doi.org/10.1007/978-3-642-01307-2_43}, doi = {10.1007/978-3-642-01307-2_43}, acmid = {1533904}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, keywords = {Class Imbalanced Problem, Over-sampling, SMOTE, Safe Level}, }
- Notes:
- The original method was not prepared for the case when no minority
sample has minority neighbors.
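To make the degenerate case tangible, the sketch below computes a safe level for every minority sample as the number of minority samples among its k nearest neighbours, and checks whether all of them are zero. The function name, `k=3` and the toy data are assumptions for illustration.

```python
import math


def safe_levels(X, y, minority_label=1, k=3):
    """For each minority sample, count the minority samples among its k nearest neighbours."""
    levels = {}
    for i, xi in enumerate(X):
        if y[i] != minority_label:
            continue
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        levels[i] = sum(1 for j in others[:k] if y[j] == minority_label)
    return levels


# Two minority samples far apart, each surrounded by majority samples.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [4.9, 5.0]]
y = [1, 0, 0, 0, 1, 0, 0, 0]

levels = safe_levels(X, y)
# The case from the note: no minority sample has a minority neighbour.
degenerate = all(level == 0 for level in levels.values())
```

When `degenerate` is true, every safe level is zero, so an implementation needs an explicit fallback for this situation.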
MSMOTE
API
Example
>>> oversampler= smote_variants.MSMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{msmote, author = {Hu, Shengguo and Liang, Yanfeng and Ma, Lintao and He, Ying}, title = {MSMOTE: Improving Classification Performance When Training Data is Imbalanced}, booktitle = {Proceedings of the 2009 Second International Workshop on Computer Science and Engineering - Volume 02}, series = {IWCSE '09}, year = {2009}, isbn = {978-0-7695-3881-5}, pages = {13--17}, numpages = {5}, url = {https://doi.org/10.1109/WCSE.2009.756}, doi = {10.1109/WCSE.2009.756}, acmid = {1682710}, publisher = {IEEE Computer Society}, address = {Washington, DC, USA}, keywords = {imbalanced data, over-sampling, SMOTE, AdaBoost, samples groups, SMOTEBoost}, }
- Notes:
- The original method was not prepared for the case when all
minority samples are noise.
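The sketch below illustrates the kind of grouping involved: each minority sample is labelled safe, border or noise depending on how many of its k nearest neighbours are minority samples. The function name, `k=3` and the data are assumptions for illustration, not the library's code.

```python
import math


def minority_types(X, y, minority_label=1, k=3):
    """Label each minority sample 'safe', 'border' or 'noise' from its k nearest neighbours."""
    types = {}
    for i, xi in enumerate(X):
        if y[i] != minority_label:
            continue
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        n_min = sum(1 for j in others[:k] if y[j] == minority_label)
        if n_min == k:
            types[i] = "safe"      # all neighbours are minority samples
        elif n_min == 0:
            types[i] = "noise"     # no neighbour is a minority sample
        else:
            types[i] = "border"    # mixed neighbourhood
    return types


# A tight minority cluster plus one minority sample inside a majority cluster.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.05, 5.05]]
y = [1, 1, 1, 1, 0, 0, 0, 1]

types = minority_types(X, y)
```

If every entry of `types` were `"noise"`, no sample would be left to generate from, which is the situation described in the note.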
DE_oversampling
API
Example
>>> oversampler= smote_variants.DE_oversampling()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{de_oversampling, author={Chen, L. and Cai, Z. and Chen, L. and Gu, Q.}, booktitle={2010 Third International Conference on Knowledge Discovery and Data Mining}, title={A Novel Differential Evolution-Clustering Hybrid Resampling Algorithm on Imbalanced Datasets}, year={2010}, volume={}, number={}, pages={81-85}, keywords={pattern clustering;sampling methods; support vector machines;differential evolution;clustering algorithm;hybrid resampling algorithm;imbalanced datasets;support vector machine; minority class;mutation operators; crossover operators;data cleaning method;F-measure criterion;ROC area criterion;Support vector machines; Intrusion detection;Support vector machine classification;Cleaning; Electronic mail;Clustering algorithms; Signal to noise ratio;Learning systems;Data mining;Geology;imbalanced datasets;hybrid resampling;clustering; differential evolution;support vector machine}, doi={10.1109/WKDD.2010.48}, ISSN={}, month={Jan},}
SMOBD
API
Example
>>> oversampler= smote_variants.SMOBD()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{smobd, author={Cao, Q. and Wang, S.}, booktitle={2011 International Conference on Information Management, Innovation Management and Industrial Engineering}, title={Applying Over-sampling Technique Based on Data Density and Cost-sensitive SVM to Imbalanced Learning}, year={2011}, volume={2}, number={}, pages={543-548}, keywords={data handling;learning (artificial intelligence);support vector machines; oversampling technique application; data density;cost sensitive SVM; imbalanced learning;SMOTE algorithm; data distribution;density information; Support vector machines;Classification algorithms;Noise measurement;Arrays; Noise;Algorithm design and analysis; Training;imbalanced learning; cost-sensitive SVM;SMOTE;data density; SMOBD}, doi={10.1109/ICIII.2011.276}, ISSN={2155-1456}, month={Nov},}
SUNDO
API
Example
>>> oversampler= smote_variants.SUNDO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{sundo, author={Cateni, S. and Colla, V. and Vannucci, M.}, booktitle={2011 11th International Conference on Intelligent Systems Design and Applications}, title={Novel resampling method for the classification of imbalanced datasets for industrial and other real-world problems}, year={2011}, volume={}, number={}, pages={402-407}, keywords={decision trees;pattern classification; sampling methods;support vector machines;resampling method;imbalanced dataset classification;industrial problem;real world problem; oversampling technique;undersampling technique;support vector machine; decision tree;binary classification; synthetic dataset;public dataset; industrial dataset;Support vector machines;Training;Accuracy;Databases; Intelligent systems;Breast cancer; Decision trees;oversampling; undersampling;imbalanced dataset}, doi={10.1109/ISDA.2011.6121689}, ISSN={2164-7151}, month={Nov}}
MSYN
API
Example
>>> oversampler= smote_variants.MSYN()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{msyn, author="Fan, Xiannian and Tang, Ke and Weise, Thomas", editor="Huang, Joshua Zhexue and Cao, Longbing and Srivastava, Jaideep", title="Margin-Based Over-Sampling Method for Learning from Imbalanced Datasets", booktitle="Advances in Knowledge Discovery and Data Mining", year="2011", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="309--320", abstract="Learning from imbalanced datasets has drawn more and more attentions from both theoretical and practical aspects. Over-sampling is a popular and simple method for imbalanced learning. In this paper, we show that there is an inherently potential risk associated with the over-sampling algorithms in terms of the large margin principle. Then we propose a new synthetic over sampling method, named Margin-guided Synthetic Over-sampling (MSYN), to reduce this risk. The MSYN improves learning with respect to the data distributions guided by the margin-based rule. Empirical study verifies the efficacy of MSYN.", isbn="978-3-642-20847-8" }
SVM_balance
API
Example
>>> oversampler= smote_variants.SVM_balance()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{svm_balance, author = {Farquad, M.A.H. and Bose, Indranil}, title = {Preprocessing Unbalanced Data Using Support Vector Machine}, journal = {Decis. Support Syst.}, issue_date = {April, 2012}, volume = {53}, number = {1}, month = apr, year = {2012}, issn = {0167-9236}, pages = {226--233}, numpages = {8}, url = {http://dx.doi.org/10.1016/j.dss.2012.01.016}, doi = {10.1016/j.dss.2012.01.016}, acmid = {2181554}, publisher = {Elsevier Science Publishers B. V.}, address = {Amsterdam, The Netherlands, The Netherlands}, keywords = {COIL data, Hybrid method, Preprocessor, SVM, Unbalanced data}, }
TRIM_SMOTE
API
Example
>>> oversampler= smote_variants.TRIM_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{trim_smote, author="Puntumapon, Kamthorn and Waiyamai, Kitsana", editor="Tan, Pang-Ning and Chawla, Sanjay and Ho, Chin Kuan and Bailey, James", title="A Pruning-Based Approach for Searching Precise and Generalized Region for Synthetic Minority Over-Sampling", booktitle="Advances in Knowledge Discovery and Data Mining", year="2012", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="371--382", isbn="978-3-642-30220-6" }
- Notes:
- The paper does not describe precisely how the filtered data is used for
sample generation. The method is proposed as a preprocessing step, and the paper states that sample generation is applied to each extracted group.
SMOTE_RSB
API
Example
>>> oversampler= smote_variants.SMOTE_RSB()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{smote_rsb, author="Ramentol, Enislay and Caballero, Yail{\'e} and Bello, Rafael and Herrera, Francisco", title="SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory", journal="Knowledge and Information Systems", year="2012", month="Nov", day="01", volume="33", number="2", pages="245--265", issn="0219-3116", doi="10.1007/s10115-011-0465-6", url="https://doi.org/10.1007/s10115-011-0465-6" }
- Notes:
- I think the description of the algorithm in Fig. 5 of the paper
is not correct. The set “resultSet” is initialized with the original instances, and the while loop in the algorithm runs until resultSet is empty, which never happens because resultSet is only ever extended inside the loop. Our implementation changes this as follows: we generate twice as many instances as are required to balance the dataset, and repeat the loop until enough new samples have been added to the training set to balance the dataset.
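The modified loop control can be sketched roughly as follows. Everything here is a hypothetical stand-in: `generate` and `accept` replace the actual sample generation and rough-set filtering, regenerating candidates on every round is an assumption, and `max_rounds` is an added safeguard that the note does not mention.

```python
from itertools import count


def collect_until_balanced(n_needed, generate, accept, max_rounds=100):
    """Draw 2 * n_needed candidates per round and keep the accepted ones
    until n_needed new samples have been collected."""
    accepted = []
    for _ in range(max_rounds):
        for candidate in generate(2 * n_needed):
            if accept(candidate):
                accepted.append(candidate)
                if len(accepted) == n_needed:
                    return accepted
    return accepted  # may be short if max_rounds is exhausted


_ids = count(1)


def generate(n):
    # Hypothetical generator: the next n integers play the role of candidates.
    return [next(_ids) for _ in range(n)]


def accept(candidate):
    # Hypothetical filter: only multiples of 3 pass.
    return candidate % 3 == 0


new_samples = collect_until_balanced(5, generate, accept)
```

Unlike a loop that waits for a set to become empty, this loop terminates as soon as the required number of samples has been accepted.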
ProWSyn
API
Example
>>> oversampler= smote_variants.ProWSyn()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{prowsyn, author="Barua, Sukarna and Islam, Md. Monirul and Murase, Kazuyuki", editor="Pei, Jian and Tseng, Vincent S. and Cao, Longbing and Motoda, Hiroshi and Xu, Guandong", title="ProWSyn: Proximity Weighted Synthetic Oversampling Technique for Imbalanced Data Set Learning", booktitle="Advances in Knowledge Discovery and Data Mining", year="2013", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="317--328", isbn="978-3-642-37456-2" }
SL_graph_SMOTE
API
Example
>>> oversampler= smote_variants.SL_graph_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{sl_graph_smote, author = {Bunkhumpornpat, Chumpol and Subpaiboonkit, Sitthichoke}, booktitle= {13th International Symposium on Communications and Information Technologies}, year = {2013}, month = {09}, pages = {570-575}, title = {Safe level graph for synthetic minority over-sampling techniques}, isbn = {978-1-4673-5578-0} }
NRSBoundary_SMOTE
API
Example
>>> oversampler= smote_variants.NRSBoundary_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{nrsboundary_smote, author= {Feng, Hu and Hang, Li}, title= {A Novel Boundary Oversampling Algorithm Based on Neighborhood Rough Set Model: NRSBoundary-SMOTE}, journal= {Mathematical Problems in Engineering}, year= {2013}, pages= {10}, doi= {10.1155/2013/694809}, url= {http://dx.doi.org/10.1155/2013/694809} }
LVQ_SMOTE
API
Example
>>> oversampler= smote_variants.LVQ_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{lvq_smote, title={LVQ-SMOTE – Learning Vector Quantization based Synthetic Minority Over–sampling Technique for biomedical data}, author={Munehiro Nakamura and Yusuke Kajiwara and Atsushi Otsuka and Haruhiko Kimura}, booktitle={BioData Mining}, year={2013} }
- Notes:
- This implementation is only a rough approximation of the method
described in the paper. The main problem is that the paper uses many datasets to find similar patterns in the codebooks and replicates patterns appearing in other datasets to the imbalanced datasets based on their relative position compared to the codebook elements. What we do instead is cluster the minority class to extract a codebook as k-means cluster means, then find pairs of codebook elements that have the most similar relative position to a randomly selected pair of codebook elements, and translate nearby minority samples from the neighborhood of one pair of codebook elements to the neighborhood of another pair of codebook elements.
SOI_CJ
API
Example
>>> oversampler= smote_variants.SOI_CJ()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{soi_cj, author = {Sánchez, Atlántida I. and Morales, Eduardo and Gonzalez, Jesus}, year = {2013}, month = {01}, pages = {}, title = {Synthetic Oversampling of Instances Using Clustering}, volume = {22}, booktitle = {International Journal of Artificial Intelligence Tools} }
ROSE
API
Example
>>> oversampler= smote_variants.ROSE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{rose, author="Menardi, Giovanna and Torelli, Nicola", title="Training and assessing classification rules with imbalanced data", journal="Data Mining and Knowledge Discovery", year="2014", month="Jan", day="01", volume="28", number="1", pages="92--122", issn="1573-756X", doi="10.1007/s10618-012-0295-5", url="https://doi.org/10.1007/s10618-012-0295-5" }
- Notes:
- It is not entirely clear if the authors propose kernel density
estimation or the fitting of simple multivariate Gaussians on the minority samples. The latter seems more likely, so I implement that approach.
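A minimal sketch of the second reading mentioned in the note, fitting a Gaussian to the minority samples and drawing new points from it. For brevity it assumes independent features (a diagonal covariance), which is a simplification; the function name and data are hypothetical and this is not the library's code.

```python
import random
import statistics


def sample_from_fitted_gaussian(X_min, n_samples, seed=0):
    """Fit an independent Gaussian to each feature of the minority samples
    and draw n_samples synthetic points from it."""
    rng = random.Random(seed)
    columns = list(zip(*X_min))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        [rng.gauss(mu, sigma) for mu, sigma in zip(means, stdevs)]
        for _ in range(n_samples)
    ]


X_min = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0], [2.0, 13.0]]
X_new = sample_from_fitted_gaussian(X_min, n_samples=3)
```

A full multivariate fit would estimate the covariance matrix as well, so that correlations between features carry over to the synthetic samples.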
SMOTE_OUT
API
Example
>>> oversampler= smote_variants.SMOTE_OUT()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_out_smote_cosine_selected_smote, title={SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level}, author={Fajri Koto}, journal={2014 International Conference on Advanced Computer Science and Information System}, year={2014}, pages={280-284} }
SMOTE_Cosine
API
Example
>>> oversampler= smote_variants.SMOTE_Cosine()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_out_smote_cosine_selected_smote, title={SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level}, author={Fajri Koto}, journal={2014 International Conference on Advanced Computer Science and Information System}, year={2014}, pages={280-284} }
Selected_SMOTE
API
Example
>>> oversampler= smote_variants.Selected_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_out_smote_cosine_selected_smote, title={SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level}, author={Fajri Koto}, journal={2014 International Conference on Advanced Computer Science and Information System}, year={2014}, pages={280-284} }
- Notes:
- Significant attribute selection was not described in the paper,
therefore we have implemented something meaningful.
LN_SMOTE
API
Example
>>> oversampler= smote_variants.LN_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{ln_smote, author={Maciejewski, T. and Stefanowski, J.}, booktitle={2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)}, title={Local neighbourhood extension of SMOTE for mining imbalanced data}, year={2011}, volume={}, number={}, pages={104-111}, keywords={Bayes methods;data mining;pattern classification;local neighbourhood extension;imbalanced data mining; focused resampling technique;SMOTE over-sampling method;naive Bayes classifiers;Noise measurement;Noise; Decision trees;Breast cancer; Sensitivity;Data mining;Training}, doi={10.1109/CIDM.2011.5949434}, ISSN={}, month={April}}
MWMOTE
API
Example
>>> oversampler= smote_variants.MWMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@ARTICLE{mwmote, author={Barua, S. and Islam, M. M. and Yao, X. and Murase, K.}, journal={IEEE Transactions on Knowledge and Data Engineering}, title={MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning}, year={2014}, volume={26}, number={2}, pages={405-425}, keywords={learning (artificial intelligence);pattern clustering;sampling methods;AUC;area under curve;ROC;receiver operating curve;G-mean; geometric mean;minority class cluster; clustering approach;weighted informative minority class samples;Euclidean distance; hard-to-learn informative minority class samples;majority class;synthetic minority class samples;synthetic oversampling methods;imbalanced learning problems; imbalanced data set learning; MWMOTE-majority weighted minority oversampling technique;Sampling methods; Noise measurement;Boosting;Simulation; Complexity theory;Interpolation;Abstracts; Imbalanced learning;undersampling; oversampling;synthetic sample generation; clustering}, doi={10.1109/TKDE.2012.232}, ISSN={1041-4347}, month={Feb}}
- Notes:
- The original method was not prepared for the case of clusters
containing a single element.
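The difficulty is that interpolation inside a cluster needs two distinct members. The sketch below shows one possible way to handle a singleton cluster, by returning a copy of its only member; this fallback and the function name are assumptions for illustration, not necessarily what the library does.

```python
import random


def sample_within_cluster(cluster, rng):
    """Interpolate between two distinct members of a cluster; a singleton
    cluster has no partner, so its only member is returned as a copy."""
    if len(cluster) == 1:
        return list(cluster[0])
    a, b = rng.sample(cluster, 2)
    t = rng.random()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]


rng = random.Random(0)
pair_sample = sample_within_cluster([[0.0, 0.0], [1.0, 2.0]], rng)
single_sample = sample_within_cluster([[3.0, 4.0]], rng)
```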
PDFOS
API
Example
>>> oversampler= smote_variants.PDFOS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{pdfos, title = "PDFOS: PDF estimation based over-sampling for imbalanced two-class problems", journal = "Neurocomputing", volume = "138", pages = "248 - 259", year = "2014", issn = "0925-2312", doi = "https://doi.org/10.1016/j.neucom.2014.02.006", author = "Ming Gao and Xia Hong and Sheng Chen and Chris J. Harris and Emad Khalaf", keywords = "Imbalanced classification, Probability density function based over-sampling, Radial basis function classifier, Orthogonal forward selection, Particle swarm optimisation" }
- Notes:
Not prepared for low-rank data.
IPADE_ID
API
Example
>>> oversampler= smote_variants.IPADE_ID()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{ipade_id, title = "Addressing imbalanced classification with instance generation techniques: IPADE-ID", journal = "Neurocomputing", volume = "126", pages = "15 - 28", year = "2014", note = "Recent trends in Intelligent Data Analysis Online Data Processing", issn = "0925-2312", doi = "https://doi.org/10.1016/j.neucom.2013.01.050", author = "Victoria López and Isaac Triguero and Cristóbal J. Carmona and Salvador García and Francisco Herrera", keywords = "Differential evolution, Instance generation, Nearest neighbor, Decision tree, Imbalanced datasets" }
- Notes:
- According to the algorithm, if adding a majority sample
doesn’t improve the AUC during the DE optimization process, no further majority points are tried.
- In the differential evolution, the multiplication by a random number
seems to have a deteriorating effect; a new scaling parameter was added to fix this.
It is not specified how to do the evaluation.
RWO_sampling
API
Example
>>> oversampler= smote_variants.RWO_sampling()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{rwo_sampling, author = {Zhang, Huaxiang and Li, Mingfang}, year = {2014}, month = {11}, pages = {}, title = {RWO-Sampling: A Random Walk Over-Sampling Approach to Imbalanced Data Classification}, volume = {20}, booktitle = {Information Fusion} }
NEATER
API
Example
>>> oversampler= smote_variants.NEATER()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{neater, author={Almogahed, B. A. and Kakadiaris, I. A.}, booktitle={2014 22nd International Conference on Pattern Recognition}, title={NEATER: Filtering of Over-sampled Data Using Non-cooperative Game Theory}, year={2014}, volume={}, number={}, pages={1371-1376}, keywords={data handling;game theory;information filtering;NEATER;imbalanced data problem;synthetic data;filtering of over-sampled data using non-cooperative game theory;Games;Game theory;Vectors; Sociology;Statistics;Silicon; Mathematical model}, doi={10.1109/ICPR.2014.245}, ISSN={1051-4651}, month={Aug}}
- Notes:
- Both majority and minority probabilities are evolved, as nothing ensures
that the probabilities remain in the range [0,1], and they need to be normalized.
- The inversely weighted function needs to be cut at some value (like
the alpha level), otherwise it will overemphasize the utility of having differing neighbors next to each other.
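The two adjustments in these notes can be sketched as small helper functions. The names, the clipping step and the `floor` value are assumptions for illustration; the library's actual formulas may differ.

```python
def normalize_pair(p_min, p_maj):
    """Clip two evolved class probabilities to [0, 1] and rescale them to sum to 1."""
    p_min = min(max(p_min, 0.0), 1.0)
    p_maj = min(max(p_maj, 0.0), 1.0)
    total = p_min + p_maj
    if total == 0.0:
        return 0.5, 0.5  # no information left: fall back to equal probabilities
    return p_min / total, p_maj / total


def capped_inverse_weight(distance, floor=0.1):
    """Inverse-distance weight that stops growing once distance drops below floor."""
    return 1.0 / max(distance, floor)


probs = normalize_pair(1.5, 0.5)    # an update has pushed one value above 1
near = capped_inverse_weight(0.001)  # capped at 1 / floor
far = capped_inverse_weight(2.0)     # ordinary inverse distance
```

Without the cap, the weight of a neighbour at distance 0.001 would be 1000 and would dominate the utility of every other neighbour.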
DEAGO
API
Example
>>> oversampler= smote_variants.DEAGO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{deago, author={Bellinger, C. and Japkowicz, N. and Drummond, C.}, booktitle={2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)}, title={Synthetic Oversampling for Advanced Radioactive Threat Detection}, year={2015}, volume={}, number={}, pages={948-953}, keywords={radioactive waste;advanced radioactive threat detection;gamma-ray spectral classification;industrial nuclear facilities;Health Canadas national monitoring networks;Vancouver 2010; Isotopes;Training;Monitoring; Gamma-rays;Machine learning algorithms; Security;Neural networks;machine learning;classification;class imbalance;synthetic oversampling; artificial neural networks; autoencoders;gamma-ray spectra}, doi={10.1109/ICMLA.2015.58}, ISSN={}, month={Dec}}
- Notes:
The paper gives no hint on the activation functions or the amount of noise.
Gazzah
API
Example
>>> oversampler= smote_variants.Gazzah()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{gazzah, author={Gazzah, S. and Hechkel, A. and Essoukri Ben Amara, N. }, booktitle={2015 IEEE 12th International Multi-Conference on Systems, Signals Devices (SSD15)}, title={A hybrid sampling method for imbalanced data}, year={2015}, volume={}, number={}, pages={1-6}, keywords={computer vision;image classification; learning (artificial intelligence); sampling methods;hybrid sampling method;imbalanced data; diversification;computer vision domain;classical machine learning systems;intraclass variations; system performances;classification accuracy;imbalanced training data; training data set;over-sampling; minority class;SMOTE star topology; feature vector deletion;intra-class variations;distribution criterion; biometric data;true positive rate; Training data;Principal component analysis;Databases;Support vector machines;Training;Feature extraction; Correlation;Imbalanced data sets; Intra-class variations;Data analysis; Principal component analysis; One-against-all SVM}, doi={10.1109/SSD.2015.7348093}, ISSN={}, month={March}}
MCT
API
Example
>>> oversampler= smote_variants.MCT()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{mct, author = {Jiang, Liangxiao and Qiu, Chen and Li, Chaoqun}, year = {2015}, month = {03}, pages = {1551004}, title = {A Novel Minority Cloning Technique for Cost-Sensitive Learning}, volume = {29}, journal = {International Journal of Pattern Recognition and Artificial Intelligence} }
- Notes:
- To support continuous features, the mode is changed to the median, and the distance is changed to Euclidean and normalized.
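A minimal sketch of the adaptation described in the note, assuming that "normalized" means dividing the distances by their maximum. Both function names are hypothetical and the normalization choice is an assumption, not something the note specifies.

```python
import numpy as np

def median_reference(X_min):
    """Component-wise median of the minority samples, used in place of
    the mode, which is ill-defined for continuous features."""
    return np.median(X_min, axis=0)

def normalized_distances(X_min):
    """Euclidean distance of every minority sample from the median,
    scaled into [0, 1] by the largest distance (assumed normalization)."""
    d = np.linalg.norm(X_min - median_reference(X_min), axis=1)
    d_max = d.max()
    return d / d_max if d_max > 0 else d
```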
ADG
API
Example
>>> oversampler= smote_variants.ADG()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{adg, author = {Pourhabib, A. and Mallick, Bani K. and Ding, Yu}, year = {2015}, pages = {2695--2724}, title = {Absent Data Generating Classifier for Imbalanced Class Sizes}, volume = {16}, journal = {Journal of Machine Learning Research} }
- Notes:
- This method has many parameters, so it becomes fairly hard to cross-validate thoroughly.
- Fails if the matrix is singular when computing alpha_star; fixed by PCA.
- Singularity might be caused by repeating samples.
- Maintaining the kernel matrix becomes infeasible above a couple of thousand vectors.
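The link between repeating samples and singularity can be checked on a toy example. The sketch below uses an RBF kernel as a stand-in (the kernel actually used by the method is not stated in these notes) and shows that a duplicated sample produces two identical rows in the kernel matrix, which makes it rank deficient.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Dense RBF kernel matrix exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_dup = np.vstack([X, X[:1]])  # repeat the first sample

rank_unique = np.linalg.matrix_rank(rbf_kernel_matrix(X))
rank_dup = np.linalg.matrix_rank(rbf_kernel_matrix(X_dup))
print(rank_unique, rank_dup)  # the 4x4 matrix has rank below 4
```

The matrix also grows quadratically with the number of vectors, which is why keeping it in memory and inverting it stops being practical for larger datasets.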
SMOTE_IPF
API
Example
>>> oversampler= smote_variants.SMOTE_IPF()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_ipf, title = "SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering", journal = "Information Sciences", volume = "291", pages = "184 - 203", year = "2015", issn = "0020-0255", doi = "https://doi.org/10.1016/j.ins.2014.08.051", author = "José A. Sáez and Julián Luengo and Jerzy Stefanowski and Francisco Herrera", keywords = "Imbalanced classification, Borderline examples, Noisy data, Noise filters, SMOTE" }
KernelADASYN
API
Example
>>> oversampler= smote_variants.KernelADASYN()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{kernel_adasyn, author={Tang, B. and He, H.}, booktitle={2015 IEEE Congress on Evolutionary Computation (CEC)}, title={KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning}, year={2015}, volume={}, number={}, pages={664-671}, keywords={learning (artificial intelligence); pattern classification; sampling methods;KernelADASYN; kernel based adaptive synthetic data generation;imbalanced learning;standard classification algorithms;data distribution; minority class decision rule; expensive minority class data misclassification;kernel based adaptive synthetic over-sampling approach;imbalanced data classification problems;kernel density estimation methods;Kernel; Estimation;Accuracy;Measurement; Standards;Training data;Sampling methods;Imbalanced learning; adaptive over-sampling;kernel density estimation;pattern recognition;medical and healthcare data learning}, doi={10.1109/CEC.2015.7256954}, ISSN={1089-778X}, month={May}}
- Notes:
- The method of sampling was not specified; Markov Chain Monte Carlo has been implemented.
- Not prepared for an improperly conditioned covariance matrix.
MOT2LD
API
Example
>>> oversampler= smote_variants.MOT2LD()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{mot2ld, author="Xie, Zhipeng and Jiang, Liyang and Ye, Tengju and Li, Xiaoli", editor="Renz, Matthias and Shahabi, Cyrus and Zhou, Xiaofang and Cheema, Muhammad Aamir", title="A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning", booktitle="Database Systems for Advanced Applications", year="2015", publisher="Springer International Publishing", address="Cham", pages="3--18", isbn="978-3-319-18123-3" }
- Notes:
- Clusters might contain 1 element, and all points can be filtered as noise.
- Clusters might contain 0 elements as well, if all points are filtered as noise.
- The entire clustering can become empty.
- TSNE is very slow when the number of instances is over a couple of thousand.
V_SYNTH
API
Example
>>> oversampler= smote_variants.V_SYNTH()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{v_synth, author = {Young,Ii, William A. and Nykl, Scott L. and Weckman, Gary R. and Chelberg, David M.}, title = {Using Voronoi Diagrams to Improve Classification Performances when Modeling Imbalanced Datasets}, journal = {Neural Comput. Appl.}, issue_date = {July 2015}, volume = {26}, number = {5}, month = jul, year = {2015}, issn = {0941-0643}, pages = {1041--1054}, numpages = {14}, url = {http://dx.doi.org/10.1007/s00521-014-1780-0}, doi = {10.1007/s00521-014-1780-0}, acmid = {2790665}, publisher = {Springer-Verlag}, address = {London, UK, UK}, keywords = {Data engineering, Data mining, Imbalanced datasets, Knowledge extraction, Numerical algorithms, Synthetic over-sampling}, }
- Notes:
- The proposed encompassing bounding box generation is incorrect.
- Voronoi diagram generation in high dimensional spaces is unstable.
OUPS
API
Example
>>> oversampler= smote_variants.OUPS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{oups, title = "A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets", journal = "Expert Systems with Applications", volume = "66", pages = "124 - 135", year = "2016", issn = "0957-4174", doi = "https://doi.org/10.1016/j.eswa.2016.09.010", author = "William A. Rivera and Petros Xanthopoulos", keywords = "SMOTE, OUPS, Class imbalance, Classification" }
- Notes:
- In the description of the algorithm a fractional number p(j) is used to index a vector.
SMOTE_D
API
Example
>>> oversampler= smote_variants.SMOTE_D()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{smote_d, author="Torres, Fredy Rodr{\'i}guez and Carrasco-Ochoa, Jes{\'u}s A. and Mart{\'i}nez-Trinidad, Jos{\'e} Fco.", editor="Mart{\'i}nez-Trinidad, Jos{\'e} Francisco and Carrasco-Ochoa, Jes{\'u}s Ariel and Ayala Ramirez, Victor and Olvera-L{\'o}pez, Jos{\'e} Arturo and Jiang, Xiaoyi", title="SMOTE-D a Deterministic Version of SMOTE", booktitle="Pattern Recognition", year="2016", publisher="Springer International Publishing", address="Cham", pages="177--188", isbn="978-3-319-39393-3" }
- Notes:
Copying happens if two points are the neighbors of each other.
SMOTE_PSO
API
Example
>>> oversampler= smote_variants.SMOTE_PSO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_pso, title = "PSO-based method for SVM classification on skewed data sets", journal = "Neurocomputing", volume = "228", pages = "187 - 197", year = "2017", note = "Advanced Intelligent Computing: Theory and Applications", issn = "0925-2312", doi = "https://doi.org/10.1016/j.neucom.2016.10.041", author = "Jair Cervantes and Farid Garcia-Lamont and Lisbeth Rodriguez and Asdrúbal López and José Ruiz Castilla and Adrian Trueba", keywords = "Skew data sets, SVM, Hybrid algorithms" }
- Notes:
- I find the description of the technique a bit confusing, especially on the bounds of the search space of velocities and positions. Equations 15 and 16 specify the lower and upper bounds, but the lower bound is in fact a vector while the upper bound is a distance. I tried to implement something meaningful.
- I also find the setting of the accelerating constant to 2.0 strange; most of the time the velocity will be bounded due to this choice.
- Also, training and predicting probabilities with a non-linear SVM as the evaluation function becomes fairly expensive when the number of training vectors reaches a couple of thousand. To reduce the computational burden, minority and majority vectors far from the other class are removed to reduce the size of both classes to a maximum of 500 samples. Generally, this shouldn’t really affect the results, as the technique focuses on the samples near the class boundaries.
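The size reduction mentioned in the last note could look roughly like the sketch below. The function name `keep_nearest_to_other_class` and the use of the nearest-neighbor distance as the "far from the other class" criterion are assumptions; the notes do not say how distance to the other class is measured.

```python
import numpy as np

def keep_nearest_to_other_class(X_a, X_b, max_samples=500):
    """Keep at most `max_samples` rows of X_a: those whose nearest
    neighbor in X_b is closest (hypothetical helper)."""
    if len(X_a) <= max_samples:
        return X_a
    # pairwise Euclidean distances, shape (len(X_a), len(X_b));
    # a dense matrix is fine for a sketch, not for very large inputs
    dists = np.linalg.norm(X_a[:, None, :] - X_b[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    keep = np.argsort(nearest)[:max_samples]
    return X_a[keep]
```

Applying the function once as `keep_nearest_to_other_class(X_min, X_maj)` and once with the arguments swapped would cap both classes at 500 samples while retaining the points closest to the class boundary.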
CURE_SMOTE
API
Example
>>> oversampler= smote_variants.CURE_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{cure_smote, author="Ma, Li and Fan, Suohai", title="CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests", journal="BMC Bioinformatics", year="2017", month="Mar", day="14", volume="18", number="1", pages="169", issn="1471-2105", doi="10.1186/s12859-017-1578-z", url="https://doi.org/10.1186/s12859-017-1578-z" }
- Notes:
- It is not specified how to determine the cluster with the “slowest growth rate”.
- All clusters can be removed as noise.
SOMO
API
Example
>>> oversampler= smote_variants.SOMO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{somo, title = "Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning", journal = "Expert Systems with Applications", volume = "82", pages = "40 - 52", year = "2017", issn = "0957-4174", doi = "https://doi.org/10.1016/j.eswa.2017.03.073", author = "Georgios Douzas and Fernando Bacao" }
- Notes:
- It is not specified how to handle the cases when a cluster contains 1 minority sample; the mean of within-cluster distances is set to 100 in these cases.
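The fallback described in the note can be sketched as follows. The function name is hypothetical; only the value 100 comes from the note, and the averaging over distinct pairs is an assumption about how the mean within-cluster distance is computed.

```python
import numpy as np

def mean_within_cluster_distance(X_cluster, fallback=100.0):
    """Mean pairwise Euclidean distance of the minority samples in one
    cluster; `fallback` is returned when fewer than 2 samples exist."""
    n = len(X_cluster)
    if n < 2:
        return fallback
    d = np.linalg.norm(X_cluster[:, None, :] - X_cluster[None, :, :], axis=2)
    # average over the n * (n - 1) / 2 distinct pairs
    return float(d[np.triu_indices(n, k=1)].mean())
```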
CE_SMOTE
API
Example
>>> oversampler= smote_variants.CE_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{ce_smote, author={Chen, S. and Guo, G. and Chen, L.}, booktitle={2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops}, title={A New Over-Sampling Method Based on Cluster Ensembles}, year={2010}, volume={}, number={}, pages={599-604}, keywords={data mining;Internet;pattern classification;pattern clustering; over sampling method;cluster ensembles;classification method; imbalanced data handling;CE-SMOTE; clustering consistency index; cluster boundary minority samples; imbalanced public data set; Mathematics;Computer science; Electronic mail;Accuracy;Nearest neighbor searches;Application software;Data mining;Conferences; Web sites;Information retrieval; classification;imbalanced data sets;cluster ensembles; over-sampling}, doi={10.1109/WAINA.2010.40}, ISSN={}, month={April}}
ISOMAP_Hybrid
API
Example
>>> oversampler= smote_variants.ISOMAP_Hybrid()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{isomap_hybrid, author = {Gu, Qiong and Cai, Zhihua and Zhu, Li}, title = {Classification of Imbalanced Data Sets by Using the Hybrid Re-sampling Algorithm Based on Isomap}, booktitle = {Proceedings of the 4th International Symposium on Advances in Computation and Intelligence}, series = {ISICA '09}, year = {2009}, isbn = {978-3-642-04842-5}, location = {Huangshi, China}, pages = {287--296}, numpages = {10}, doi = {10.1007/978-3-642-04843-2_31}, acmid = {1691478}, publisher = {Springer-Verlag}, address = {Berlin, Heidelberg}, keywords = {Imbalanced data set, Isomap, NCR, Smote, re-sampling}, }
Edge_Det_SMOTE
API
Example
>>> oversampler= smote_variants.Edge_Det_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{Edge_Det_SMOTE, author={Kang, Y. and Won, S.}, booktitle={ICCAS 2010}, title={Weight decision algorithm for oversampling technique on class-imbalanced learning}, year={2010}, volume={}, number={}, pages={182-186}, keywords={edge detection;learning (artificial intelligence);weight decision algorithm;oversampling technique; class-imbalanced learning;class imbalanced data problem;edge detection algorithm;spatial space representation;Classification algorithms;Image edge detection; Training;Noise measurement;Glass; Training data;Machine learning; Imbalanced learning;Classification; Weight decision;Oversampling; Edge detection}, doi={10.1109/ICCAS.2010.5669889}, ISSN={}, month={Oct}}
- Notes:
This technique is very loosely specified.
CBSO
API
Example
>>> oversampler= smote_variants.CBSO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{cbso, author="Barua, Sukarna and Islam, Md. Monirul and Murase, Kazuyuki", editor="Lu, Bao-Liang and Zhang, Liqing and Kwok, James", title="A Novel Synthetic Minority Oversampling Technique for Imbalanced Data Set Learning", booktitle="Neural Information Processing", year="2011", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="735--744", isbn="978-3-642-24958-7" }
- Notes:
Clusters containing 1 element induce cloning of samples.
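Why a single-element cluster leads to cloning can be seen from the usual SMOTE-style interpolation: when the only available neighbor is the sample itself, the interpolated point coincides with the original. The sketch below is a generic illustration with a hypothetical `interpolate` helper, not the CBSO implementation.

```python
import numpy as np

def interpolate(x, neighbor, rng):
    """SMOTE-style synthetic sample on the segment between x and neighbor."""
    return x + rng.random() * (neighbor - x)

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])

# in a cluster with 1 element the only candidate neighbor is x itself
clone = interpolate(x, x, rng)
print(np.array_equal(clone, x))  # the "new" sample is a copy of x
```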
DBSMOTE
API
Example
>>> oversampler= smote_variants.DBSMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{dbsmote, author="Bunkhumpornpat, Chumphol and Sinapiromsaran, Krung and Lursinsap, Chidchanok", title="DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique", journal="Applied Intelligence", year="2012", month="Apr", day="01", volume="36", number="3", pages="664--684", issn="1573-7497", doi="10.1007/s10489-011-0287-y", url="https://doi.org/10.1007/s10489-011-0287-y" }
- Notes:
- Standardization is needed to use absolute eps values.
- The clustering is likely to identify all instances as noise; fixed by a recursive call with increasing eps.
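A rough sketch of both notes, under stated assumptions: `core_point_mask` is a simplified stand-in for the density-based clustering (it only checks the core-point condition, it is not DBSCAN), the retry is written as a loop rather than a recursive call, and the starting eps, growth factor and retry limit are arbitrary illustrative values.

```python
import numpy as np

def standardize(X):
    """Zero mean and unit variance per feature, so that an absolute eps
    has a comparable meaning across datasets."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return (X - X.mean(axis=0)) / std

def core_point_mask(X, eps, min_samples):
    """Simplified stand-in for the clustering step: True where a point
    has at least `min_samples` points (itself included) within eps."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (d <= eps).sum(axis=1) >= min_samples

def find_eps(X, eps=0.5, min_samples=3, factor=1.5, max_tries=20):
    """Grow eps until not every point is flagged as noise."""
    for _ in range(max_tries):
        if core_point_mask(X, eps, min_samples).any():
            return eps
        eps *= factor
    raise ValueError("all points remained noise for every eps tried")
```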
ASMOBD
API
Example
>>> oversampler= smote_variants.ASMOBD()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{asmobd, author={Senzhang Wang and Zhoujun Li and Wenhan Chao and Qinghua Cao}, booktitle={The 2012 International Joint Conference on Neural Networks (IJCNN)}, title={Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning}, year={2012}, volume={}, number={}, pages={1-8}, doi={10.1109/IJCNN.2012.6252696}, ISSN={2161-4407}, month={June}}
- Notes:
- In order to use absolute thresholds, the data is standardized.
- The technique has many parameters; it is not easy to find the right combination.
Assembled_SMOTE
API
Example
>>> oversampler= smote_variants.Assembled_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{assembled_smote, author={Zhou, B. and Yang, C. and Guo, H. and Hu, J.}, booktitle={The 2013 International Joint Conference on Neural Networks (IJCNN)}, title={A quasi-linear SVM combined with assembled SMOTE for imbalanced data classification}, year={2013}, volume={}, number={}, pages={1-7}, keywords={approximation theory;interpolation; pattern classification;sampling methods;support vector machines;trees (mathematics);quasilinear SVM; assembled SMOTE;imbalanced dataset classification problem;oversampling method;quasilinear kernel function; approximate nonlinear separation boundary;mulitlocal linear boundaries; interpolation;data distribution information;minimal spanning tree; local linear partitioning method; linear separation boundary;synthetic minority class samples;oversampled dataset classification;standard SVM; composite quasilinear kernel function; artificial data datasets;benchmark datasets;classification performance improvement;synthetic minority over-sampling technique;Support vector machines;Kernel;Merging;Standards; Sociology;Statistics;Interpolation}, doi={10.1109/IJCNN.2013.6707035}, ISSN={2161-4407}, month={Aug}}
- Notes:
- The absolute value of the extracted angles should be taken (implemented this way).
- It is not specified how many samples are generated in the various clusters.
SDSMOTE
API
Example
>>> oversampler= smote_variants.SDSMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{sdsmote, author={Li, K. and Zhang, W. and Lu, Q. and Fang, X.}, booktitle={2014 International Conference on Identification, Information and Knowledge in the Internet of Things}, title={An Improved SMOTE Imbalanced Data Classification Method Based on Support Degree}, year={2014}, volume={}, number={}, pages={34-38}, keywords={data mining;pattern classification; sampling methods;improved SMOTE imbalanced data classification method;support degree;data mining; class distribution;imbalanced data-set classification;over sampling method;minority class sample generation;minority class sample selection;minority class boundary sample identification;Classification algorithms;Training;Bagging;Computers; Testing;Algorithm design and analysis; Data mining;Imbalanced data-sets; Classification;Boundary sample;Support degree;SMOTE}, doi={10.1109/IIKI.2014.14}, ISSN={}, month={Oct}}
DSMOTE
API
Example
>>> oversampler= smote_variants.DSMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{dsmote, author={Mahmoudi, S. and Moradi, P. and Akhlaghian, F. and Moradi, R.}, booktitle={2014 4th International Conference on Computer and Knowledge Engineering (ICCKE)}, title={Diversity and separable metrics in over-sampling technique for imbalanced data classification}, year={2014}, volume={}, number={}, pages={152-158}, keywords={learning (artificial intelligence); pattern classification;sampling methods;diversity metric;separable metric;over-sampling technique; imbalanced data classification; class distribution techniques; under-sampling technique;DSMOTE method; imbalanced learning problem;diversity measure;separable measure;Iran University of Medical Science;UCI dataset;Accuracy;Classification algorithms;Vectors;Educational institutions;Euclidean distance; Data mining;Diversity measure; Separable Measure;Over-Sampling; Imbalanced Data;Classification problems}, doi={10.1109/ICCKE.2014.6993409}, ISSN={}, month={Oct}}
- Notes:
- The method is highly inefficient when the number of minority samples is high: the time complexity is O(n^3), so with 1000 minority samples it takes about 1e9 objective function evaluations to find 1 new sample point. Adding 1000 samples would take about 1e12 evaluations of the objective function, which is infeasible. We introduce a new parameter, n_step, and during the search for a new sample at most n_step combinations of minority samples are tried.
- Abnormality of minority points is defined in the paper as D_maj/D_min; high abnormality means that the minority point is close to other minority points and very far from majority points. This is definitely not abnormality, so I have implemented the opposite.
- Nothing ensures that the Fisher statistics and the variance from the geometric mean remain comparable, which might skew the optimization towards one of the sub-objectives.
- MinMax normalization doesn’t work: each attribute will have a 0 value, which will make the geometric mean of all attributes 0.
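Two of the notes above are easy to verify numerically. The sketch below uses hypothetical helper names and assumes that one combination of minority samples costs one objective evaluation; it shows that min-max scaling forces every per-attribute geometric mean to zero, and how a cap such as n_step bounds the cubic search cost.

```python
import numpy as np

def minmax_scale(X):
    """Scale every attribute (column) to [0, 1]; assumes no constant column."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def attribute_geometric_means(X):
    """Geometric mean of each attribute over all samples."""
    return np.prod(X, axis=0) ** (1.0 / X.shape[0])

def evaluations_per_new_sample(n_min, n_step=None):
    """Objective evaluations for one new sample: n_min**3 for the
    exhaustive search, capped at n_step when a cap is given."""
    exhaustive = n_min ** 3
    return exhaustive if n_step is None else min(exhaustive, n_step)

X = np.array([[1.0, 5.0], [2.0, 7.0], [4.0, 6.0]])
print(attribute_geometric_means(minmax_scale(X)))     # every entry is 0
print(evaluations_per_new_sample(1000))               # exhaustive search
print(evaluations_per_new_sample(1000, n_step=100))   # capped search
```

Each scaled column contains its own minimum mapped to exactly 0, so the product over the samples vanishes regardless of the other values.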
G_SMOTE
API
Example
>>> oversampler= smote_variants.G_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{g_smote, author={Sandhan, T. and Choi, J. Y.}, booktitle={2014 22nd International Conference on Pattern Recognition}, title={Handling Imbalanced Datasets by Partially Guided Hybrid Sampling for Pattern Recognition}, year={2014}, volume={}, number={}, pages={1449-1453}, keywords={Gaussian processes;learning (artificial intelligence);pattern classification; regression analysis;sampling methods; support vector machines;imbalanced datasets;partially guided hybrid sampling;pattern recognition;real-world domains;skewed datasets;dataset rebalancing;learning algorithm; extremely low minority class samples; classification tasks;extracted hidden patterns;support vector machine; logistic regression;nearest neighbor; Gaussian process classifier;Support vector machines;Proteins;Pattern recognition;Kernel;Databases;Gaussian processes;Vectors;Imbalanced dataset; protein classification;ensemble classifier;bootstrapping;Sat-image classification;medical diagnoses}, doi={10.1109/ICPR.2014.258}, ISSN={1051-4651}, month={Aug}}
- Notes:
The non-linear approach is inefficient.
NT_SMOTE
API
Example
>>> oversampler= smote_variants.NT_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{nt_smote, author={Xu, Y. H. and Li, H. and Le, L. P. and Tian, X. Y.}, booktitle={2014 Seventh International Joint Conference on Computational Sciences and Optimization}, title={Neighborhood Triangular Synthetic Minority Over-sampling Technique for Imbalanced Prediction on Small Samples of Chinese Tourism and Hospitality Firms}, year={2014}, volume={}, number={}, pages={534-538}, keywords={financial management;pattern classification;risk management;sampling methods;travel industry;Chinese tourism; hospitality firms;imbalanced risk prediction;minority class samples; up-sampling approach;neighborhood triangular synthetic minority over-sampling technique;NT-SMOTE; nearest neighbor idea;triangular area sampling idea;single classifiers;data excavation principles;hospitality industry;missing financial indicators; financial data filtering;financial risk prediction;MDA;DT;LSVM;logit;probit; firm risk prediction;Joints; Optimization;imbalanced datasets; NT-SMOTE;neighborhood triangular; random sampling}, doi={10.1109/CSO.2014.104}, ISSN={}, month={July}}
Lee
API
Example
>>> oversampler= smote_variants.Lee()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{lee, author = {Lee, Jaedong and Kim, Noo-ri and Lee, Jee-Hyong}, title = {An Over-sampling Technique with Rejection for Imbalanced Class Learning}, booktitle = {Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication}, series = {IMCOM '15}, year = {2015}, isbn = {978-1-4503-3377-1}, location = {Bali, Indonesia}, pages = {102:1--102:6}, articleno = {102}, numpages = {6}, doi = {10.1145/2701126.2701181}, acmid = {2701181}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {data distribution, data preprocessing, imbalanced problem, rejection rule, synthetic minority oversampling technique} }
SPY
API
Example
>>> oversampler= smote_variants.SPY()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{spy, author={Dang, X. T. and Tran, D. H. and Hirose, O. and Satou, K.}, booktitle={2015 Seventh International Conference on Knowledge and Systems Engineering (KSE)}, title={SPY: A Novel Resampling Method for Improving Classification Performance in Imbalanced Data}, year={2015}, volume={}, number={}, pages={280-285}, keywords={decision making;learning (artificial intelligence);pattern classification; sampling methods;SPY;resampling method;decision-making process; biomedical data classification; class imbalance learning method; SMOTE;oversampling method;UCI machine learning repository;G-mean value;borderline-SMOTE; safe-level-SMOTE;Support vector machines;Training;Bioinformatics; Proteins;Protein engineering;Radio frequency;Sensitivity;Imbalanced dataset;Over-sampling; Under-sampling;SMOTE; borderline-SMOTE}, doi={10.1109/KSE.2015.24}, ISSN={}, month={Oct}}
SMOTE_PSOBAT
API
Example
>>> oversampler= smote_variants.SMOTE_PSOBAT()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{smote_psobat, author={Li, J. and Fong, S. and Zhuang, Y.}, booktitle={2015 3rd International Symposium on Computational and Business Intelligence (ISCBI)}, title={Optimizing SMOTE by Metaheuristics with Neural Network and Decision Tree}, year={2015}, volume={}, number={}, pages={26-32}, keywords={data mining;particle swarm optimisation;pattern classification; data mining;classifier;metaherustics; SMOTE parameters;performance indicators;selection optimization; PSO;particle swarm optimization algorithm;BAT;bat-inspired algorithm; metaheuristic optimization algorithms; nearest neighbors;imbalanced dataset problem;synthetic minority over-sampling technique;decision tree; neural network;Classification algorithms;Neural networks;Decision trees;Training;Optimization;Particle swarm optimization;Data mining;SMOTE; Swarm Intelligence;parameter selection optimization}, doi={10.1109/ISCBI.2015.12}, ISSN={}, month={Dec}}
- Notes:
- The parameters of the memetic algorithms are not specified.
- I have checked multiple papers describing the BAT algorithm, but the meaning of “Generate a new solution by flying randomly” is still unclear.
- It is also unclear whether the best solutions are recorded for each bat or for the entire population.
MDO
API
Example
>>> oversampler= smote_variants.MDO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@ARTICLE{mdo, author={Abdi, L. and Hashemi, S.}, journal={IEEE Transactions on Knowledge and Data Engineering}, title={To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques}, year={2016}, volume={28}, number={1}, pages={238-251}, keywords={covariance analysis;learning (artificial intelligence);modelling;pattern classification;sampling methods; statistical distributions;minority class instance modelling;probability contour;covariance structure;MDO; Mahalanobis distance-based oversampling technique;data-oriented technique; model-oriented solution;machine learning algorithm;data skewness;multiclass imbalanced problem;Mathematical model; Training;Accuracy;Eigenvalues and eigenfunctions;Machine learning algorithms;Algorithm design and analysis; Benchmark testing;Multi-class imbalance problems;over-sampling techniques; Mahalanobis distance;Multi-class imbalance problems;over-sampling techniques; Mahalanobis distance}, doi={10.1109/TKDE.2015.2458858}, ISSN={1041-4347}, month={Jan}}
Random_SMOTE
API
Example
>>> oversampler= smote_variants.Random_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{random_smote, author="Dong, Yanjie and Wang, Xuehua", editor="Xiong, Hui and Lee, W. B.", title="A New Over-Sampling Approach: Random-SMOTE for Learning from Imbalanced Data Sets", booktitle="Knowledge Science, Engineering and Management", year="2011", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="343--352", isbn="978-3-642-25975-3" }
ISMOTE
API
Example
>>> oversampler= smote_variants.ISMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{ismote, author="Li, Hu and Zou, Peng and Wang, Xiang and Xia, Rongze", editor="Sun, Zengqi and Deng, Zhidong", title="A New Combination Sampling Method for Imbalanced Data", booktitle="Proceedings of 2013 Chinese Intelligent Automation Conference", year="2013", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="547--554", isbn="978-3-642-38466-0" }
VIS_RST
API
Example
>>> oversampler= smote_variants.VIS_RST()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{vis_rst, author="Borowska, Katarzyna and Stepaniuk, Jaroslaw", editor="Saeed, Khalid and Homenda, Wladyslaw", title="Imbalanced Data Classification: A Novel Re-sampling Approach Combining Versatile Improved SMOTE and Rough Sets", booktitle="Computer Information Systems and Industrial Management", year="2016", publisher="Springer International Publishing", address="Cham", pages="31--42", isbn="978-3-319-45378-1" }
- Notes:
- Replication of DANGER samples will be removed by the last step of
noise filtering.
GASMOTE
API
Example
>>> oversampler= smote_variants.GASMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{gasmote, author="Jiang, Kun and Lu, Jing and Xia, Kuiliang", title="A Novel Algorithm for Imbalance Data Classification Based on Genetic Algorithm Improved SMOTE", journal="Arabian Journal for Science and Engineering", year="2016", month="Aug", day="01", volume="41", number="8", pages="3255--3266", issn="2191-4281", doi="10.1007/s13369-016-2179-2", url="https://doi.org/10.1007/s13369-016-2179-2" }
A_SUWO
API
Example
>>> oversampler= smote_variants.A_SUWO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{a_suwo, title = "Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets", journal = "Expert Systems with Applications", volume = "46", pages = "405 - 416", year = "2016", issn = "0957-4174", doi = "https://doi.org/10.1016/j.eswa.2015.10.031", author = "Iman Nekooeimehr and Susana K. Lai-Yuen", keywords = "Imbalanced dataset, Classification, Clustering, Oversampling" }
- Notes:
- Equation (7) misses a division by R_j.
- It is not specified how to sample from clusters with 1 instance.
SMOTE_FRST_2T
API
Example
>>> oversampler= smote_variants.SMOTE_FRST_2T()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{smote_frst_2t, title = "Fuzzy-rough imbalanced learning for the diagnosis of High Voltage Circuit Breaker maintenance: The SMOTE-FRST-2T algorithm", journal = "Engineering Applications of Artificial Intelligence", volume = "48", pages = "134 - 139", year = "2016", issn = "0952-1976", doi = "https://doi.org/10.1016/j.engappai.2015.10.009", author = "Ramentol, E. and Gondres, I. and Lajes, S. and Bello, R. and Caballero,Y. and Cornelis, C. and Herrera, F.", keywords = "High Voltage Circuit Breaker (HVCB), Imbalanced learning, Fuzzy rough set theory, Resampling methods" }
- Notes:
- An unlucky setting of parameters might result in 0 points being added; we have fixed this by increasing the gamma_S threshold if the number of samples accepted is low.
- Similarly, an unlucky setting of parameters might result in all majority samples being turned into minority.
- In my opinion, the relations in the algorithm presented in the paper are incorrect. The authors talk about accepting samples having a POS score below a threshold, while in the algorithm POS >= gamma is used in both places.
AND_SMOTE
API
Example
>>> oversampler= smote_variants.AND_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@inproceedings{and_smote, author = {Yun, Jaesub and Ha, Jihyun and Lee, Jong-Seok}, title = {Automatic Determination of Neighborhood Size in SMOTE}, booktitle = {Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication}, series = {IMCOM '16}, year = {2016}, isbn = {978-1-4503-4142-4}, location = {Danang, Viet Nam}, pages = {100:1--100:8}, articleno = {100}, numpages = {8}, doi = {10.1145/2857546.2857648}, acmid = {2857648}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {SMOTE, imbalanced learning, synthetic data generation}, }
NRAS
API
Example
>>> oversampler= smote_variants.NRAS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{nras, title = "Noise Reduction A Priori Synthetic Over-Sampling for class imbalanced data sets", journal = "Information Sciences", volume = "408", pages = "146 - 161", year = "2017", issn = "0020-0255", doi = "https://doi.org/10.1016/j.ins.2017.04.046", author = "William A. Rivera", keywords = "NRAS, SMOTE, OUPS, Class imbalance, Classification" }
AMSCO
API
Example
>>> oversampler= smote_variants.AMSCO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{amsco, title = "Adaptive multi-objective swarm fusion for imbalanced data classification", journal = "Information Fusion", volume = "39", pages = "1 - 24", year = "2018", issn = "1566-2535", doi = "https://doi.org/10.1016/j.inffus.2017.03.007", author = "Jinyan Li and Simon Fong and Raymond K. Wong and Victor W. Chu", keywords = "Swarm fusion, Swarm intelligence algorithm, Multi-objective, Crossover rebalancing, Imbalanced data classification" }
- Notes:
- It is not clear how the kappa threshold is used; I use the RA
score to drive the whole evolution. In particular, the paper states:
“In the last phase of each iteration, the average Kappa value in current non-inferior set is compare with the latest threshold value, the threshold is then increase further if the average value increases, and vice versa. By doing so, the non-inferior region will be progressively reduced as the Kappa threshold lifts up.”
I don’t see why the Kappa threshold would lift up when, by the “vice versa” clause, the threshold is decreased whenever the average Kappa decreases.
- Because of this ambiguity in the kappa threshold and the lack of a detailed
description of the SIS process, the implementation is not exactly what the paper describes, but something very similar.
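The quoted threshold rule can be read literally as the sketch below. This only illustrates the passage under discussion, not the logic of the implementation (which is driven by the RA score). The function name `update_kappa_threshold`, the fixed `step` and the sample Kappa values are all hypothetical.

```python
def update_kappa_threshold(threshold, prev_avg_kappa, avg_kappa, step=0.01):
    """Move the threshold in the same direction as the average Kappa.

    `step` is a hypothetical fixed increment; the quoted passage does not
    say how large the adjustment is.
    """
    if avg_kappa > prev_avg_kappa:
        # Kappa lies in [-1, 1], so cap the threshold at 1.
        return min(1.0, threshold + step)
    if avg_kappa < prev_avg_kappa:
        return max(-1.0, threshold - step)
    return threshold


# Average Kappa of the non-inferior set over four iterations (made-up values).
avg_kappas = [0.40, 0.45, 0.43, 0.50]
threshold = 0.30
history = [threshold]
for prev, curr in zip(avg_kappas, avg_kappas[1:]):
    threshold = update_kappa_threshold(threshold, prev, curr)
    history.append(threshold)

# The threshold rises, falls, then rises again.
print(history)
```

Under this reading the threshold only keeps lifting up while the average Kappa keeps improving, which is the inconsistency the note points out.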
SSO
API
Example
>>> oversampler= smote_variants.SSO()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@InProceedings{sso, author="Rong, Tongwen and Gong, Huachang and Ng, Wing W. Y.", editor="Wang, Xizhao and Pedrycz, Witold and Chan, Patrick and He, Qiang", title="Stochastic Sensitivity Oversampling Technique for Imbalanced Data", booktitle="Machine Learning and Cybernetics", year="2014", publisher="Springer Berlin Heidelberg", address="Berlin, Heidelberg", pages="161--171", isbn="978-3-662-45652-1" }
- Notes:
- In the algorithm, step 2d adds a constant to a vector. I have
changed it to a componentwise adjustment and also used the normalized STSM, as I don’t see any reason why the unnormalized value would be reasonable and bounded.
DSRBF
API
Example
>>> oversampler= smote_variants.DSRBF()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{dsrbf, title = "A dynamic over-sampling procedure based on sensitivity for multi-class problems", journal = "Pattern Recognition", volume = "44", number = "8", pages = "1821 - 1833", year = "2011", issn = "0031-3203", doi = "https://doi.org/10.1016/j.patcog.2011.02.019", author = "Francisco Fernández-Navarro and César Hervás-Martínez and Pedro Antonio Gutiérrez", keywords = "Classification, Multi-class, Sensitivity, Accuracy, Memetic algorithm, Imbalanced datasets, Over-sampling method, SMOTE" }
- Notes:
- It is not entirely clear why J-1 outputs are assumed, where J is the
number of classes.
- The fitness function is changed to a balanced mean loss, as I found
that the original ignores the classification of minority samples (class label +1) in the binary case.
- The iRprop+ optimization is not implemented.
- The original paper proposes using SMOTE incrementally. Instead,
this implementation applies SMOTE to generate all the samples needed in the sampling epochs, and the evolution of RBF networks is used to select the sampling that gives the best results.
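To illustrate why a balanced mean loss helps in the binary case, here is a minimal sketch. It assumes a squared-error per-sample loss and labels -1/+1; the loss actually used by the implementation is not stated here, and the function names are hypothetical.

```python
def mean_loss(y_true, y_score):
    """Plain mean squared error over all samples."""
    errors = [(t - s) ** 2 for t, s in zip(y_true, y_score)]
    return sum(errors) / len(errors)


def balanced_mean_loss(y_true, y_score):
    """Mean of the per-class mean squared errors (each class weighs equally)."""
    per_class = {}
    for t, s in zip(y_true, y_score):
        per_class.setdefault(t, []).append((t - s) ** 2)
    class_means = [sum(errs) / len(errs) for errs in per_class.values()]
    return sum(class_means) / len(class_means)


# Nine majority samples (-1), one minority sample (+1), and a degenerate
# model that outputs -1 for everything.
y_true = [-1] * 9 + [1]
y_score = [-1.0] * 10

plain = mean_loss(y_true, y_score)              # minority error is diluted
balanced = balanced_mean_loss(y_true, y_score)  # minority error counts for half
print(plain, balanced)
```

With the plain mean, the single misclassified minority sample contributes only a tenth of the total, so a model that ignores the minority class still scores well. The balanced variant gives that class half of the weight.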
NDO_sampling
API
Example
>>> oversampler= smote_variants.NDO_sampling()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{ndo_sampling, author={Zhang, L. and Wang, W.}, booktitle={2011 International Conference of Information Technology, Computer Engineering and Management Sciences}, title={A Re-sampling Method for Class Imbalance Learning with Credit Data}, year={2011}, volume={1}, number={}, pages={393-397}, keywords={data handling;sampling methods; resampling method;class imbalance learning;credit rating;imbalance problem;synthetic minority over-sampling technique;sample distribution;synthetic samples; credit data set;Training; Measurement;Support vector machines; Logistics;Testing;Noise;Classification algorithms;class imbalance;credit rating;SMOTE;sample distribution}, doi={10.1109/ICM.2011.34}, ISSN={}, month={Sept}}
Gaussian_SMOTE
API
Example
>>> oversampler= smote_variants.Gaussian_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{gaussian_smote, title={Gaussian-Based SMOTE Algorithm for Solving Skewed Class Distributions}, author={Hansoo Lee and Jonggeun Kim and Sungshin Kim}, journal={Int. J. Fuzzy Logic and Intelligent Systems}, year={2017}, volume={17}, pages={229-234} }
kmeans_SMOTE
API
Example
>>> oversampler= smote_variants.kmeans_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{kmeans_smote, title = "Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE", journal = "Information Sciences", volume = "465", pages = "1 - 20", year = "2018", issn = "0020-0255", doi = "https://doi.org/10.1016/j.ins.2018.06.056", author = "Georgios Douzas and Fernando Bacao and Felix Last", keywords = "Class-imbalanced learning, Oversampling, Classification, Clustering, Supervised learning, Within-class imbalance" }
Supervised_SMOTE
API
Example
>>> oversampler= smote_variants.Supervised_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{supervised_smote, author = {Hu, Jun AND He, Xue AND Yu, Dong-Jun AND Yang, Xi-Bei AND Yang, Jing-Yu AND Shen, Hong-Bin}, journal = {PLOS ONE}, publisher = {Public Library of Science}, title = {A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction}, year = {2014}, month = {09}, volume = {9}, url = {https://doi.org/10.1371/journal.pone.0107676}, pages = {1-10}, number = {9}, doi = {10.1371/journal.pone.0107676} }
SN_SMOTE
API
Example
>>> oversampler= smote_variants.SN_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@Article{sn_smote, author="Garc{\'i}a, V. and S{\'a}nchez, J. S. and Mart{\'i}n-F{\'e}lez, R. and Mollineda, R. A.", title="Surrounding neighborhood-based SMOTE for learning from imbalanced data sets", journal="Progress in Artificial Intelligence", year="2012", month="Dec", day="01", volume="1", number="4", pages="347--362", issn="2192-6360", doi="10.1007/s13748-012-0027-5", url="https://doi.org/10.1007/s13748-012-0027-5" }
CCR
API
Example
>>> oversampler= smote_variants.CCR()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{ccr, author = {Koziarski, Michał and Wozniak, Michal}, year = {2017}, month = {12}, pages = {727–736}, title = {CCR: A combined cleaning and resampling algorithm for imbalanced data classification}, volume = {27}, journal = {International Journal of Applied Mathematics and Computer Science} }
- Notes:
- Adapted from https://github.com/michalkoziarski/CCR
ANS
API
Example
>>> oversampler= smote_variants.ANS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{ans, author = {Siriseriwan, W and Sinapiromsaran, Krung}, year = {2017}, month = {09}, pages = {565-576}, title = {Adaptive neighbor synthetic minority oversampling technique under 1NN outcast handling}, volume = {39}, journal = {Songklanakarin Journal of Science and Technology} }
- Notes:
- The method as published is not prepared for the case when no c satisfies
the condition in line 25 of the algorithm; this is fixed here.
- The method as published is not prepared for empty Pused sets; this is fixed here.
cluster_SMOTE
API
Example
>>> oversampler= smote_variants.cluster_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{cluster_SMOTE, author={Cieslak, D. A. and Chawla, N. V. and Striegel, A.}, booktitle={2006 IEEE International Conference on Granular Computing}, title={Combating imbalance in network intrusion datasets}, year={2006}, volume={}, number={}, pages={732-737}, keywords={Intelligent networks;Intrusion detection; Telecommunication traffic;Data mining; Computer networks;Data security; Machine learning;Counting circuits; Computer security;Humans}, doi={10.1109/GRC.2006.1635905}, ISSN={}, month={May}}
E_SMOTE
API
Example
>>> oversampler= smote_variants.E_SMOTE()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{e_smote, author={Deepa, T. and Punithavalli, M.}, booktitle={2011 3rd International Conference on Electronics Computer Technology}, title={An E-SMOTE technique for feature selection in High-Dimensional Imbalanced Dataset}, year={2011}, volume={2}, number={}, pages={322-324}, keywords={bioinformatics;data mining;pattern classification;support vector machines; E-SMOTE technique;feature selection; high-dimensional imbalanced dataset; data mining;bio-informatics;dataset balancing;SVM classification;micro array dataset;Feature extraction; Genetic algorithms;Support vector machines;Data mining;Machine learning; Bioinformatics;Cancer;Imbalanced dataset;Feature Selection;E-SMOTE; Support Vector Machine[SVM]}, doi={10.1109/ICECTECH.2011.5941710}, ISSN={}, month={April}}
- Notes:
- This technique is basically unreproducible. I have tried to implement
something that follows the idea of applying a simple genetic algorithm for optimization.
- To the best of my understanding, the technique uses evolutionary algorithms
for feature selection and then applies vanilla SMOTE on the selected features only.
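The second note can be sketched roughly as follows. This is a simplified, pure-Python illustration of "select features, then interpolate between minority neighbours"; it is not the library's implementation. The feature mask is supplied by hand, where the technique would obtain it from a genetic algorithm. All function names and parameters are hypothetical.

```python
import random


def select_features(X, mask):
    """Keep only the columns whose mask entry is True."""
    return [[v for v, keep in zip(row, mask) if keep] for row in X]


def smote_on_selected(X, y, mask, n_new, k=2, minority=1, seed=0):
    """Generate n_new SMOTE-style samples in the selected feature space."""
    rng = random.Random(seed)
    X_sel = select_features(X, mask)
    X_min = [row for row, label in zip(X_sel, y) if label == minority]
    if len(X_min) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(X_min)
        # k nearest minority neighbours of `base`, excluding `base` itself
        others = [row for row in X_min if row is not base]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return X_sel + synthetic, list(y) + [minority] * n_new


# Three features; the middle one is constant and is masked out.
X = [[0.0, 9.0, 0.0], [1.0, 9.0, 1.0], [0.0, 9.0, 1.0],
     [5.0, 9.0, 5.0], [6.0, 9.0, 5.0]]
y = [1, 1, 1, 0, 0]
mask = [True, False, True]

X_samp, y_samp = smote_on_selected(X, y, mask, n_new=2)
print(len(X_samp), len(X_samp[0]))
```

Note that the returned samples live in the reduced feature space.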
ADOMS
API
Example
>>> oversampler= smote_variants.ADOMS()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@INPROCEEDINGS{adoms, author={Tang, S. and Chen, S.}, booktitle={2008 International Conference on Information Technology and Applications in Biomedicine}, title={The generation mechanism of synthetic minority class examples}, year={2008}, volume={}, number={}, pages={444-447}, keywords={medical image processing; generation mechanism;synthetic minority class examples;class imbalance problem;medical image analysis;oversampling algorithm; Principal component analysis; Biomedical imaging;Medical diagnostic imaging;Information technology;Biomedical engineering; Noise generators;Concrete;Nearest neighbor searches;Data analysis; Image analysis}, doi={10.1109/ITAB.2008.4570642}, ISSN={2168-2194}, month={May}}
SYMPROD
API
Example
>>> oversampler= smote_variants.SYMPROD()
>>> X_samp, y_samp= oversampler.sample(X, y)
- References:
BibTex:
@article{kunakorntum2020synthetic, title={A Synthetic Minority Based on Probabilistic Distribution (SyMProD) Oversampling for Imbalanced Datasets}, author={Kunakorntum, Intouch and Hinthong, Woranich and Phunchongharn, Phond}, journal={IEEE Access}, volume={8}, pages={114692--114704}, year={2020}, publisher={IEEE} }