Text Categorization Model Based on Linear Support Vector Machine
Keywords: Support vector machine, spam, email, ham, model, feature extraction
Spam mails are a nuisance in electronic mailboxes: they occupy storage space that could otherwise hold relevant data, and they slow down network connections and communication over a network. Attackers often use spam mails as a vehicle for phishing, in order to perpetrate data breaches and other forms of cybercrime. Researchers have developed models using machine learning algorithms and other techniques to filter spam from relevant mail; however, some algorithms and classifiers are weak, lack robustness, and lack visualization models that would make the results interpretable by non-technical users. In this work, a Linear Support Vector Machine (LSVM) was used to develop a text categorization model for email texts with two categories: Ham and Spam. The processes involved were dataset import; preprocessing (removal of stop words, vectorization); feature selection (weighting and selection); development of the classification model (splitting the data into a training set (80%) and a test set (20%), importing the classifier, training the classifier); evaluation of the model; and deployment of the model and a spam filtering application on a server (Heroku) using the Flask framework. The Agile methodology was adopted for the system design, and the Python programming language was used for model development. HTML and CSS were used to develop the web application. System testing showed an overall accuracy of 98.56%, a recall of 96.5%, an F1-score of 97% and an F-beta score of 96.23%. This study could therefore benefit e-mail users, data analysts, and researchers in the field of NLP.
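The pipeline summarized above (stop-word removal, vectorization, an 80/20 train/test split, training a linear SVM, and evaluation by accuracy, recall, F1 and F-beta) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy e-mails, the stop-word list, the TF-IDF weighting, the Pegasos-style subgradient trainer, the hyperparameters `lam` and `epochs`, and the value `beta=0.5` are all hypothetical choices that the abstract does not specify. The F-beta score is computed as (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall. The sketch uses only the standard library so that it runs as written.

```python
import math
import random
from collections import Counter

# Illustrative stop-word list (assumption; the abstract does not list the words removed).
STOP_WORDS = {"a", "an", "the", "to", "is", "at", "you", "your", "for", "and", "on", "we"}
SPAM, HAM = 1, -1


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words and non-alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]


def fit_idf(train_texts):
    """Learn smoothed inverse document frequencies from the training texts only."""
    n = len(train_texts)
    df = Counter(w for text in train_texts for w in set(tokenize(text)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def vectorize(text, idf):
    """Map a text to an L2-normalised sparse TF-IDF vector plus a constant bias feature."""
    counts = Counter(w for w in tokenize(text) if w in idf)
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    vec = {w: v / norm for w, v in vec.items()}
    vec["__bias__"] = 1.0  # cannot collide with a token: tokens are purely alphabetic
    return vec


def score(w, x):
    """Dot product between the sparse weight vector and a sparse feature vector."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style stochastic subgradient descent on the L2-regularised hinge loss."""
    rng = random.Random(seed)
    w, t = {}, 0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)                    # decreasing step size
            violated = y[i] * score(w, X[i]) < 1.0   # margin check before the update
            for f in w:                              # shrink step from the L2 penalty
                w[f] *= 1.0 - eta * lam
            if violated:                             # hinge-loss subgradient step
                for f, v in X[i].items():
                    w[f] = w.get(f, 0.0) + eta * y[i] * v
    return w


def predict(w, x):
    return SPAM if score(w, x) > 0 else HAM


def evaluate(y_true, y_pred, beta=0.5):
    """Accuracy, precision, recall, F1 and F-beta, with spam as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == SPAM and p == SPAM)
    fp = sum(1 for t, p in pairs if t == HAM and p == SPAM)
    fn = sum(1 for t, p in pairs if t == SPAM and p == HAM)
    tn = sum(1 for t, p in pairs if t == HAM and p == HAM)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_score(b):
        denom = b * b * precision + recall
        return (1 + b * b) * precision * recall / denom if denom else 0.0

    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f_score(1.0), "f_beta": f_score(beta)}


# Hypothetical toy corpus standing in for the real e-mail dataset.
emails = [
    ("win a free prize now", SPAM), ("claim your free cash reward", SPAM),
    ("cheap loans click here now", SPAM), ("free offer click to win cash", SPAM),
    ("urgent prize claim click now", SPAM), ("meeting agenda for monday", HAM),
    ("please review the project report", HAM), ("lunch at noon on friday", HAM),
    ("project meeting moved to friday", HAM), ("report attached please review", HAM),
]

data = emails[:]
random.Random(42).shuffle(data)
cut = int(0.8 * len(data))                 # 80% train, 20% test
train, test = data[:cut], data[cut:]

idf = fit_idf([text for text, _ in train])
weights = train_linear_svm([vectorize(text, idf) for text, _ in train],
                           [label for _, label in train])
y_true = [label for _, label in test]
y_pred = [predict(weights, vectorize(text, idf)) for text, _ in test]
print(evaluate(y_true, y_pred))
```

With only ten toy messages the printed metrics say nothing about real-world performance; the point is the order of the steps, in particular that the IDF weights are learned from the training split alone so that no information from the test set leaks into the model.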