CHARACTER-BASED BiLSTM FOR DIACRITIC RESTORATION IN SERBIAN

1st International Scientific Conference Education and Artificial Intelligence (EDAI 2024), [pp. 111-120]

AUTHOR(S) / АУТОР(И): Miljana Mladenović, Aleksandar Spasić, Lazar Stošić

DOI: 10.46793/EDAI24.111M

ABSTRACT / САЖЕТАК:

Omitted diacritics are among the most common irregularities in texts written in languages whose alphabets contain letters with diacritics. Their absence can cause misunderstanding and change the meaning of a text, so automatic diacritic restoration is essential. This paper shows how contemporary deep-learning techniques can solve the diacritic restoration (diacritization) problem without powerful hardware equipped with graphics processing units (GPUs). Training a neural network on a CPU matters for educational institutions that lack up-to-date equipment, and this paper can encourage learning neural network programming under such conditions. We propose a two-layer Bidirectional Long Short-Term Memory (BiLSTM) sequential neural network for multiclass classification that learns to predict which of seven possible letters should replace a letter written without a diacritic. The sequential nature of text makes sequence networks efficient in natural language processing and understanding. The model was trained on Serbian text data. It uses a character-based approach, a language-independent technique that can efficiently generate models for other languages, especially Slavic ones. The evaluation shows that the macro, micro, and weighted averages of precision, recall, and F1 all reached 98%. The main advantages of the proposed model are the quick and easy creation of a labeled dataset, a relatively shallow network, a small vocabulary, a small context window, a small number of training epochs, and easy adjustment of preprocessing and learning parameters to obtain an efficient and accurate model. The model is publicly available; it can be downloaded or tested on the corresponding website.
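The quick creation of a labeled dataset mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the seven target letters are c, č, ć, s, š, z, and ž (the abstract does not list them), and the function names, the padding symbol, and the window radius are hypothetical choices made for illustration. Labels come for free: correctly diacritized text is stripped of diacritics, and the original letter at each ambiguous position becomes the class to predict from a small character window.

```python
# Assumption: the seven classes are c, č, ć, s, š, z, ž (not stated in the abstract).
STRIP = {"č": "c", "ć": "c", "š": "s", "ž": "z"}
CLASSES = ["c", "č", "ć", "s", "š", "z", "ž"]
CLASS_INDEX = {ch: i for i, ch in enumerate(CLASSES)}
AMBIGUOUS = {"c", "s", "z"}  # stripped letters that may need a diacritic
PAD = "_"                    # hypothetical padding symbol for text edges


def strip_diacritics(text):
    """Replace each diacritic letter with its base letter."""
    return "".join(STRIP.get(ch, ch) for ch in text)


def make_examples(text, radius=3):
    """Build (window, class_index) pairs from correctly diacritized text.

    The window is the stripped text around an ambiguous letter,
    `radius` characters on each side; the label is the original letter.
    """
    text = text.lower()
    stripped = strip_diacritics(text)
    padded = PAD * radius + stripped + PAD * radius
    examples = []
    for i, ch in enumerate(stripped):
        if ch in AMBIGUOUS:
            window = padded[i : i + 2 * radius + 1]
            examples.append((window, CLASS_INDEX[text[i]]))
    return examples


if __name__ == "__main__":
    for window, label in make_examples("Čaša šećera", radius=2):
        print(window, CLASSES[label])
```

In a sequence model such as the BiLSTM described above, each window would then be encoded character by character (for example, as integer indices into the small character vocabulary) before being fed to the network.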

KEYWORDS / КЉУЧНЕ РЕЧИ:

neural networks, deep learning, computational linguistics, diacritic restoration, BiLSTM
