SEMSCORE-TFIDF: A Lightweight Semantic-Statistical Retrieval Framework for Multilingual FAQ Systems in Higher Education

Mohammad Ali; Jin Xie; Wu Wenhuan

Keywords:

FAQ Retrieval, Semantic TF-IDF, Word2Vec Embeddings, Multilingual Query Processing, Information Retrieval, Academic QA Systems, BM25 Baseline

Abstract

The proliferation of international student enrolments at universities worldwide has created an acute demand for information retrieval systems capable of interpreting linguistically diverse, grammatically variable queries. Conventional FAQ retrieval engines, primarily grounded in term-frequency heuristics such as TF-IDF and cosine similarity, systematically fail when confronted with paraphrased, code-switched, or non-native-speaker formulations. This paper presents SEMSCORE-TFIDF (Semantic Scoring with Contextual TF-IDF Weighting), a novel hybrid retrieval algorithm that augments statistical term weighting with Word2Vec-based semantic similarity scoring and a contextual proximity weighting mechanism. Implemented in MATLAB for deployment on standard CPU-based infrastructure, the framework requires no GPU acceleration and no task-specific neural pretraining, making it immediately deployable in resource-constrained institutional environments. Experiments on a 500-query visa-domain corpus demonstrate statistically significant improvements in Precision@5, Recall, F1-Score, and Mean Reciprocal Rank over TF-IDF, BM25, TF-IDF+Word2Vec, and a simulated sentence encoder baseline. An additional error analysis on 80 paraphrased non-native queries identifies residual failure categories and maps a concrete path toward further refinement. SEMSCORE-TFIDF offers a transparent, scalable, and practically viable solution for multilingual FAQ retrieval in higher education contexts.
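The general idea of combining a statistical score with an embedding-based score can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the linear mixing weight `alpha`, the mean-of-word-vectors sentence representation, the smoothed IDF formula, and the toy two-dimensional embeddings are all assumptions made for illustration, and the contextual proximity weighting is omitted because the abstract does not define it.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a sparse {term: weight} vector per tokenised document.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1 (an assumed variant).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine_sparse(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(tokens, emb):
    """Average the embeddings of the tokens that have one; None if none do."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [sum(component) / len(vecs) for component in zip(*vecs)]


def cosine_dense(u, v):
    """Cosine similarity between two dense vectors; 0.0 if either is missing."""
    if u is None or v is None:
        return 0.0
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_rank(query, faqs, emb, alpha=0.5):
    """Rank FAQ entries by alpha * tfidf_cosine + (1 - alpha) * embedding_cosine.

    Returns a list of (faq_index, score) pairs, best first.
    """
    docs = [faq.lower().split() for faq in faqs]
    q_tokens = query.lower().split()
    doc_vecs, idf = tfidf_vectors(docs)
    # Query terms unseen in the FAQ collection get zero statistical weight.
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(q_tokens).items()}
    q_emb = mean_vector(q_tokens, emb)
    scored = []
    for i, doc in enumerate(docs):
        statistical = cosine_sparse(q_vec, doc_vecs[i])
        semantic = cosine_dense(q_emb, mean_vector(doc, emb))
        scored.append((i, alpha * statistical + (1 - alpha) * semantic))
    return sorted(scored, key=lambda pair: -pair[1])


# Hypothetical FAQ entries and hand-made 2-D embeddings (illustration only).
faqs = [
    "how to extend student visa",
    "library opening hours",
    "tuition fee payment deadline",
]
emb = {
    "visa": [1.0, 0.0], "permit": [0.9, 0.1],
    "extend": [0.8, 0.2], "renew": [0.85, 0.15],
    "library": [0.0, 1.0], "hours": [0.1, 0.9],
    "tuition": [0.4, 0.6], "fee": [0.5, 0.5],
}

ranking = hybrid_rank("renew residence permit", faqs, emb, alpha=0.5)
```

In this toy setup the paraphrased query shares no word with any FAQ entry, so the statistical component alone (`alpha=1.0`) scores every entry at zero, while the embedding component still places the visa entry first. That is the failure mode of pure term matching that the abstract describes.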

Author Biographies

Mohammad Ali

School of Intelligent Connected Vehicle, Hubei University of Automotive Technology, Shiyan, 442000, China

Jin Xie

School of Intelligent Connected Vehicle, Hubei University of Automotive Technology, Shiyan, 442000, China

Wu Wenhuan

School of Intelligent Connected Vehicle, Hubei University of Automotive Technology, Shiyan, 442000, China
