Morphological segmentation is foundational for natural language processing (NLP) in morphologically rich languages such as Amharic, yet progress is constrained by limited gold annotations and fragmented toolchains. We present an end-to-end framework that jointly addresses data creation and modeling for Amharic segmentation, part-of-speech (POS) tagging, and dependency parsing. Our approach begins with a silver-data generation pipeline that bootstraps segmentation labels from a rule-based analyzer and refines them through automated review, large language model (LLM)-assisted verification, and human-in-the-loop correction. Using the resulting supervision, we train an XLM-RoBERTa segmenter with a connectionist temporal classification (CTC)-based character transduction objective, which enables reliable morpheme boundary prediction without explicit character-morpheme alignment. We then introduce a unified multi-task model that replaces the common practice of training a separate system per task: a shared pretrained encoder is jointly optimized for POS tagging and dependency parsing to better capture cross-task linguistic regularities while remaining parameter-efficient. The results on morpheme segmentation, POS tagging, and dependency parsing support the conclusion that analyzer-bootstrapped supervision combined with multilingual pretrained encoders is effective for Amharic morphosyntactic modeling in low-resource settings. We release an open-source toolkit with simple APIs and an interactive visualization interface, so that users can run the pipeline and inspect intermediate and final outputs for practical Amharic NLP development. The API, documentation, and pretrained models are available at https://github.com/Netela-lab/AMorph.
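To illustrate why a CTC objective needs no explicit character-morpheme alignment, the following minimal sketch shows the standard CTC collapse step applied to per-frame labels: consecutive repeats are merged and blank symbols are dropped, so many different frame-level paths map to the same segmented string. This is a toy example and not the AMorph implementation. The blank symbol `_`, the boundary marker `+`, the greedy per-frame labels, and the function names are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"     # hypothetical CTC blank symbol (assumption)
BOUNDARY = "+"  # hypothetical morpheme boundary marker (assumption)


def ctc_collapse(frame_labels, blank=BLANK):
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def split_morphemes(segmented, boundary=BOUNDARY):
    """Split a collapsed output string into morphemes at boundary markers."""
    return [m for m in segmented.split(boundary) if m]


# Toy per-frame labels, as if taken from the argmax over each encoder frame.
# The blank between the last two "c" frames keeps them as two characters.
frames = ["a", "a", "_", "b", "+", "+", "_", "c", "c", "_", "c"]

collapsed = ctc_collapse(frames)
morphemes = split_morphemes(collapsed)
print(collapsed)   # ab+cc
print(morphemes)   # ['ab', 'cc']
```

Because the collapse is many-to-one, training can sum over all frame-level paths that yield the target segmentation, which is what removes the need for a hand-specified alignment between input characters and output morphemes.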
Cite this paper
Molla, N. T. and Ming, S. (2026). AMorph: An End-to-End Morpheme Level Natural Language Processing Pipeline for Amharic. Open Access Library Journal, 13, e14897. doi: http://dx.doi.org/10.4236/oalib.1114897.