XVII International Conference on Systems, Automatic Control and Measurements, SAUM 2024 (pp. 71–75)
АУТОР(И) / AUTHOR(S): Jelena Kocić, Miloš Bogdanović, Milena Frtunić Gligorijević, Leonid Stoimenov
DOI: 10.46793/SAUM24.071K
САЖЕТАК / ABSTRACT:
The advancement of large-scale language models in recent years has significantly enhanced various natural language processing (NLP) domains. This research addresses the specific challenge of developing BERT-based models tailored for domain-specific language modeling. In this paper, we present the latest findings and techniques used in the development of the second version of the SrBERTa model, where our primary objective was to improve the model's performance through novel approaches applied during the tokenizer training phase.
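As a purely illustrative aside (not drawn from the paper itself), the sketch below shows how a RoBERTa-style byte-level BPE tokenizer might be trained on a domain-specific corpus with the Hugging Face tokenizers library; the corpus file name, vocabulary size, and output directory are assumptions for the example, not the settings actually used for SrBERTa.

```python
# Minimal sketch of domain-specific tokenizer training, assuming a
# RoBERTa-style byte-level BPE tokenizer. The corpus file, vocabulary
# size, and output directory are illustrative assumptions only.
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

# Train on a hypothetical plain-text corpus (one sentence or document per line).
tokenizer.train(
    files=["serbian_domain_corpus.txt"],  # assumed corpus file
    vocab_size=50_265,                    # assumed; mirrors RoBERTa's vocabulary size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Write vocab.json and merges.txt for later use with a RoBERTa-style model.
os.makedirs("domain_tokenizer", exist_ok=True)
tokenizer.save_model("domain_tokenizer")
```

The resulting vocab.json and merges.txt can typically be loaded via transformers' RobertaTokenizerFast when pretraining a RoBERTa-style model on the same corpus.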
КЉУЧНЕ РЕЧИ / KEYWORDS:
tokenization, NLP, BERT, SrBERTa
ПРОЈЕКАТ/ ACKNOWLEDGEMENT:
This work was funded by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia [grant number 451-03-66/2024-03/200102].
ЛИТЕРАТУРА / REFERENCES
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Cedarville, OH, USA, 2019; Volume 1, pp. 4171–4186.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692.
- Bogdanović, M.; Kocić, J.; Stoimenov, L. SRBerta—A Transformer Language Model for Serbian Cyrillic Legal Texts. Information 2024, 15, 74. https://doi.org/10.3390/info15020074.
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, LA, USA, 1–6 June 2018; Walker, M.A., Ji, H., Stent, A., Eds.; Association for Computational Linguistics: Cedarville, OH, USA, 2018; Volume 1, pp. 2227–2237.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9.
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; pp. 7871–7880.
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. arXiv 2022, arXiv:2210.11416.
- Nijkamp, E.; Pang, B.; Hayashi, H.; Tu, L.; Wang, H.; Zhou, Y.; Savarese, S.; Xiong, C. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv 2022, arXiv:2203.13474.
- Muennighoff, N.; Wang, T.; Sutawika, L.; Roberts, A.; Biderman, S.; Scao, T.L.; Bari, M.S.; Shen, S.; Yong, Z.X.; Schoelkopf, H.; et al. Crosslingual generalization through multitask finetuning. arXiv 2022, arXiv:2211.01786.
- Zeng, W.; Ren, X.; Su, T.; Wang, H.; Liao, Y.; Wang, Z.; Jiang, X.; Yang, Z.; Wang, K.; Zhang, X.; et al. Pangu-α: Large-scale autoregressive pretrained Chinese language models with auto parallel computation. arXiv 2021, arXiv:2104.12369.
- Huawei Technologies Co., Ltd. Huawei MindSpore AI development framework. In Artificial Intelligence Technology; Springer: Berlin/Heidelberg, Germany, 2022; pp. 137–162.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Roziere, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
- Almazrouei, E.; Alobeidli, H.; Alshamsi, A.; Cappelli, A.; Cojocaru, R.; Debbah, M.; Goffinet, E.; Heslow, D.; Launay, J.; Malartic, Q.; et al. Falcon-40B: An Open Large Language Model with State-of-the-Art Performance. 2023. Available online: https://huggingface.co/tiiuae/falcon-40b (accessed on 17 September 2023).
- Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; pp. 8342–8360.
- Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. ACM Trans. Comput. Healthc. 2022, 3, 1, Article 2; 23 pages. https://doi.org/10.1145/3458754.
- Sachidananda, V.; Kessler, J.; Lai, Y.A. Efficient Domain Adaptation of Language Models via Adaptive Tokenization. In Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing, Virtual, 2021; pp. 155–165.
- Bogdanović, M.; Kocić, J. SRBerta. Available online: https://huggingface.co/JelenaTosic/SRBerta (accessed on 27 September 2024).
- Hugging Face. RoBERTa Model Documentation. Available online: https://huggingface.co/docs/transformers/model_doc/roberta (accessed on 27 September 2024).