Detection of Rare Events: Cluster Based Preprocessing of the Training Set: The Case on Complaints for Invoice Time Series

Authors

  • Huseyin Carpanali, Kadir Has University, Istanbul 34083, Turkey
  • Ayse Humeyra Bilge, Kadir Has University, Istanbul 34083, Turkey
  • Arif Selcuk Ogrenci, Kadir Has University, Istanbul 34083, Turkey
  • Tarkan Ozmen, Kadir Has University, Istanbul 34083, Turkey
  • Ayse Tosun, Kadir Has University, Istanbul 34083, Turkey
  • Kubra Cakar, Kadir Has University, Istanbul 34083, Turkey

Keywords:

Unbalanced data, majority class, hierarchical clustering, heuristics

Abstract

Detection of rare events is a major problem when dealing with unbalanced data. In applications of machine learning tools, the data are split into training and test samples, and preprocessing is applied to the training set with the aim of obtaining a more balanced sample. In this paper we discuss preprocessing methods applied to heterogeneous data clustered with respect to expected anomaly types. We propose a method for deciding on oversampling and undersampling from each cluster, based on the variability of the items in each cluster, using Principal Component Analysis. The method is applied to the problem of detecting anomalies in a time series of invoices, with an average complaint rate of the order of 10⁻⁴.
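
The abstract does not spell out the decision rule, so the following is only a minimal sketch of how a variability-driven sampling plan could look, not the authors' procedure. It assumes that the variability of a cluster is scored as the sum of the eigenvalues of its covariance matrix (the variances along its principal components) and that a fixed training budget is split across clusters in proportion to that score. The function names, the cluster names, the budget and the synthetic data are all hypothetical.

```python
import numpy as np


def pca_variability(X):
    """Hypothetical variability score: total variance along the principal components.

    The covariance eigenvalues are obtained from the singular values of the
    centered data, which is how PCA is commonly computed.
    """
    X = np.asarray(X, dtype=float)
    if X.shape[0] < 2:
        return 0.0
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values ** 2 / (X.shape[0] - 1)
    return float(eigenvalues.sum())


def allocate_samples(clusters, budget):
    """Assumed rule: split the budget across clusters in proportion to variability."""
    scores = np.array([pca_variability(X) for X in clusters.values()])
    if scores.sum() == 0:
        weights = np.full(len(scores), 1.0 / len(scores))
    else:
        weights = scores / scores.sum()
    return {name: int(round(budget * w)) for name, w in zip(clusters, weights)}


def resample_cluster(X, n_target, rng):
    """Undersample without replacement, or oversample with replacement."""
    X = np.asarray(X)
    replace = n_target > X.shape[0]
    idx = rng.choice(X.shape[0], size=n_target, replace=replace)
    return X[idx]


# Synthetic illustration: a large, homogeneous cluster and a small, diverse one.
rng = np.random.default_rng(0)
clusters = {
    "tight": rng.normal(0.0, 0.5, size=(500, 3)),
    "spread": rng.normal(0.0, 2.0, size=(50, 3)),
}
plan = allocate_samples(clusters, budget=200)
balanced = {name: resample_cluster(X, plan[name], rng) for name, X in clusters.items()}
```

Under these assumptions the homogeneous cluster is undersampled and the diverse cluster is oversampled, since a few items already represent a low-variability cluster well. Other scores, such as the number of components needed to explain a fixed share of the variance, would fit the same structure.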

Published

2024-03-29

How to Cite

Huseyin Carpanali, Ayse Humeyra Bilge, Arif Selcuk Ogrenci, Tarkan Ozmen, Ayse Tosun, & Kubra Cakar. (2024). Detection of Rare Events: Cluster Based Preprocessing of the Training Set: The Case on Complaints for Invoice Time Series. American Scientific Research Journal for Engineering, Technology, and Sciences, 97(1), 188–202. Retrieved from https://www.asrjetsjournal.org/index.php/American_Scientific_Journal/article/view/9700
