AMorph: An End-to-End Morpheme Level Natural Language Processing Pipeline for Amharic

doi:10.4236/oalib.1114897

OALib Journal
ISSN: 2333-9721

Open Access Library Journal 13 2026

DOI: 10.4236/oalib.1114897, PP. 1-18

Natnael Tamirat Molla, Shi Ming

Subject Areas: Computer Engineering

Keywords: Amharic, Morphological Segmentation, Finite-State Morphology, Weak Supervision, Connectionist Temporal Classification (CTC), Multi-Task Learning, Dependency Parsing, Universal Dependencies

Abstract

Morphological segmentation is foundational for natural language processing (NLP) in morphologically rich languages such as Amharic, yet progress is constrained by limited gold annotations and fragmented toolchains. We present an end-to-end framework that jointly addresses data creation and modeling for Amharic segmentation, part-of-speech (POS) tagging, and dependency parsing. Our approach begins with a silver-data generation pipeline that bootstraps segmentation labels from a rule-based analyzer and refines them through automated review, large language model-assisted verification, and human-in-the-loop correction. Using the resulting supervision, we train an XLM-RoBERTa segmenter with a connectionist temporal classification (CTC)-based character transduction objective, enabling reliable morpheme boundary prediction without explicit character-morpheme alignment. We then introduce a unified multi-task toolkit model that replaces the common practice of training separate systems per task: a shared pretrained encoder is jointly optimized for POS tagging and dependency parsing to better capture cross-task linguistic regularities while remaining parameter-efficient. The observed morpheme segmentation, POS tagging, and dependency parsing results support the conclusion that analyzer-bootstrapped supervision combined with multilingual pretrained encoders is effective for Amharic morphosyntactic modeling in low-resource settings. We release an open-source toolkit with simple APIs and an interactive visualization interface, enabling users to run the pipeline and inspect intermediate and final outputs for practical Amharic NLP development. The API, documentation, and pre-trained models are available at https://github.com/Netela-lab/AMorph.
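
To illustrate why a CTC objective removes the need for explicit character-morpheme alignment, the following is a minimal sketch of standard greedy CTC decoding applied to a toy character transducer. It is not taken from the AMorph codebase: the function names, the blank symbol `_`, the boundary marker `-`, and the toy vocabulary are all assumptions made for illustration, and the actual toolkit may decode differently.

```python
from typing import List, Sequence

BLANK = "_"      # assumed CTC blank symbol (hypothetical choice)
BOUNDARY = "-"   # assumed morpheme boundary marker (hypothetical choice)


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    vocab: Sequence[str],
    blank: str = BLANK,
) -> str:
    """Greedy CTC decoding: pick the best symbol per frame,
    collapse consecutive repeats, then drop blanks."""
    # Best-scoring symbol for each encoder frame.
    path = [
        vocab[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    output = []
    previous = None
    for symbol in path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank; a blank between two identical symbols
        # therefore lets the same character be emitted twice.
        if symbol != previous and symbol != blank:
            output.append(symbol)
        previous = symbol
    return "".join(output)


def split_morphemes(segmented: str, boundary: str = BOUNDARY) -> List[str]:
    """Split a decoded string on the boundary marker, dropping empty pieces."""
    return [piece for piece in segmented.split(boundary) if piece]


def one_hot_frames(path: str, vocab: Sequence[str]) -> List[List[float]]:
    """Helper for the toy example: build per-frame scores from a symbol path."""
    return [[1.0 if v == symbol else 0.0 for v in vocab] for symbol in path]


if __name__ == "__main__":
    toy_vocab = [BLANK, "a", "b", BOUNDARY]
    # Seven frames decode to a shorter string with one boundary.
    frames = one_hot_frames("aa_a--b", toy_vocab)
    decoded = ctc_greedy_decode(frames, toy_vocab)
    print(decoded, split_morphemes(decoded))
```

Because the output string is recovered by collapsing repeats and removing blanks, training only needs the target segmented string, not a frame-by-frame mapping between input characters and output morphemes.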

Cite this paper

Molla, N. T. and Ming, S. (2026). AMorph: An End-to-End Morpheme Level Natural Language Processing Pipeline for Amharic. Open Access Library Journal, 13, e14897. doi: http://dx.doi.org/10.4236/oalib.1114897.
