Bridging Zero-Shot and Fine-Tuned Performance in Text Classification through Retrieval-Augmented Prompting

Authors

  • Olesia Khrapunova, Independent Researcher

Keywords:

Chain-of-Thought Prompting, In-Context Learning, Large Language Models, Prompt Engineering, Retrieval-Augmented Learning, Text Classification, Transformer Encoders, Zero-Shot Learning

Abstract

Large Language Models (LLMs) have shown promise in zero-shot and few-shot classification. Yet, their performance often falls short of classic fine-tuned encoders, especially in fine-grained or domain-specific settings. This study compares fine-tuned BERT-family models with zero-shot and few-shot prompting of LLMs (GPT-4o, Llama 3.3 70B, and Mistral Small 3) on two benchmarks: AG News (coarse-grained topic classification) and BANKING77 (fine-grained intent classification). Baseline results confirm that fine-tuned models outperform zero-shot LLMs by ~10–25 accuracy points, with a larger gap on the fine-grained task. We then test training-free methods to improve LLM performance, focusing on retrieval-augmented few-shot prompting, example ordering, and Chain-of-Thought (CoT) reasoning. Our results show that retrieval-augmented prompting consistently boosts accuracy, especially on the BANKING77 dataset with many semantically similar examples, where GPT-4o even slightly surpasses the best fine-tuned encoder. Ordering demonstrations from least to most similar further improves accuracy, reflecting the impact of recency bias in in-context learning. By contrast, CoT prompting decreases accuracy, suggesting that explanation-based prompting is not universally helpful for classification. These findings demonstrate that careful example selection and ordering can substantially narrow the gap between zero-shot LLMs and fine-tuned encoders, offering a practical, training-free alternative in data-scarce scenarios.
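As a rough illustration of the retrieval-and-ordering recipe the abstract describes, the sketch below picks the k labeled examples most similar to an input query and arranges them from least to most similar before assembling the few-shot prompt, so the closest demonstration sits nearest the query. This is a minimal stand-in, not the paper's pipeline: the helper names (`retrieve_demos`, `build_prompt`) and the bag-of-words cosine similarity are illustrative assumptions, whereas the study itself uses stronger retrievers (e.g., embedding-based similarity).

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_demos(query: str, pool: list[dict], k: int = 3) -> list[dict]:
    # Score every labeled example against the query, keep the top k,
    # and return them in ascending similarity (least similar first),
    # which places the most similar demo right before the query.
    qv = Counter(query.lower().split())
    scored = sorted(pool, key=lambda ex: cosine(qv, Counter(ex["text"].lower().split())))
    return scored[-k:]

def build_prompt(query: str, demos: list[dict], labels: list[str]) -> str:
    # Few-shot classification prompt: instruction, demos, then the query.
    lines = [f"Classify the text into one of: {', '.join(labels)}.\n"]
    for ex in demos:
        lines.append(f"Text: {ex['text']}\nLabel: {ex['label']}\n")
    lines.append(f"Text: {query}\nLabel:")
    return "\n".join(lines)
```

The resulting string would then be sent to the LLM, which is expected to complete the final `Label:` line; the least-to-most ordering mirrors the recency-bias finding reported above.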

Author Biography

  • Olesia Khrapunova, Independent Researcher

    Senior AI/ML Engineer, Paris, France

References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal et al. “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 1877–1901.

[2] M. J. J. Bucher and M. Martini. “Fine-tuned ’small’ LLMs (still) significantly outperform zero-shot generative AI models in text classification,” arXiv:2406.08660v2 [cs.CL], Aug. 2024.

[3] M. Bosley, M. Jacobs-Harukawa, H. Licht, and A. Hoyle. “Do we still need BERT in the age of GPT? Comparing the benefits of domain-adaptation and in-context-learning approaches to using LLMs for Political Science Research,” presented at the Annual Meeting of the Midwest Political Science Association, Chicago, IL, USA, Apr. 2023.

[4] Y. Wang, W. Qu, and X. Ye. “Selecting between BERT and GPT for text classification in political science research,” arXiv:2411.05050v1 [cs.CL], Nov. 2024.

[5] A. Edwards and J. Camacho-Collados. “Language models for text classification: Is in-context learning enough?” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024, pp. 10058–10072.

[6] J. Zhang, Y. Huang, S. Liu, Y. Gao, and X. Hu. “Do BERT-like bidirectional models still perform better on text classification in the era of LLMs?” arXiv:2505.18215v1 [cs.CL], May 2025.

[7] M. Luo, X. Xu, Y. Liu, P. Pasupat, and M. Kazemi. “In-context learning with retrieved demonstrations for language models: A survey,” in Transactions on Machine Learning Research, Oct. 2024.

[8] O. Rubin, J. Herzig, and J. Berant. “Learning to retrieve prompts for in-context learning,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jul. 2022, pp. 2655–2671.

[9] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen. “What makes good in-context examples for GPT-3?” in Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, May 2022, pp. 100–114.

[10] I. Levy, B. Bogin, and J. Berant. “Diverse demonstrations improve in-context compositional generalization,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2023, pp. 1401–1422.

[11] C. Qin, A. Zhang, C. Chen, A. Dagar, and W. Ye. “In-context learning with iterative demonstration selection,” in Findings of the Association for Computational Linguistics: EMNLP 2024, Nov. 2024, pp. 7441–7455.

[12] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp. “Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022, pp. 8086–8098.

[13] Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh. “Calibrate before use: Improving few-shot performance of language models,” in Proceedings of the 38th International Conference on Machine Learning, vol. 139, PMLR, Jul. 2021, pp. 12697–12706.

[14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia et al. “Chain-of-thought prompting elicits reasoning in large language models,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, Nov. 2022, pp. 24824–24837.

[15] B. Wang, S. Min, X. Deng, J. Shen, Y. Wu, L. Zettlemoyer, and H. Sun. “Towards understanding chain-of-thought prompting: An empirical study of what matters,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2023, pp. 2717–2739.

[16] X. Zhang, J. Zhao, and Y. LeCun. “Character-level convolutional networks for text classification,” in Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 1, Dec. 2015, pp. 649–657.

[17] I. Casanueva, T. Temčinas, D. Gerz, M. Henderson, and I. Vulić. “Efficient intent detection with dual sentence encoders,” in Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Jul. 2020, pp. 38–45.

[18] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jun. 2019, pp. 4171–4186.

[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen et al. “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv:1907.11692v1 [cs.CL], Jul. 2019.

[20] P. He, J. Gao, and W. Chen. “DeBERTaV3: Improving DeBERTa using electra-style pre-training with gradient-disentangled embedding sharing,” arXiv:2111.09543v4 [cs.CL], Mar. 2023.

[21] OpenAI. “GPT-4o system card.” Internet: https://openai.com/index/gpt-4o-system-card/, Aug. 2024 [Aug. 7, 2025].

[22] HuggingFace. “Meta-Llama-3.3-70b-Instruct model card.” Internet: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct, Dec. 2024 [Aug. 7, 2025].

[23] Mistral AI. “Mistral Small 3.” Internet: https://mistral.ai/news/mistral-small-3, Jan. 2025 [Aug. 7, 2025].

[24] S. Robertson and H. Zaragoza. “The probabilistic relevance framework: BM25 and beyond,” in Foundations and Trends® in Information Retrieval, 2009, pp. 333–389.

[25] N. Reimers and I. Gurevych. “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Nov. 2019, pp. 3982–3992.

[26] Anthropic. “Introducing Claude 3.5 Sonnet.” Internet: https://www.anthropic.com/news/claude-3-5-sonnet, Jun. 2024 [Aug. 7, 2025].

Published

2025-10-11

Section

Articles

How to Cite

Olesia Khrapunova. (2025). Bridging Zero-Shot and Fine-Tuned Performance in Text Classification through Retrieval-Augmented Prompting. American Scientific Research Journal for Engineering, Technology, and Sciences, 103(1), 178–194. https://www.asrjetsjournal.org/American_Scientific_Journal/article/view/12048