Towards Linguistic Nuance: Corpus Development for the Javanese Honorific System

Primary supervisor

Derry Wijaya

Javanese, spoken by over 98 million people, remains poorly served by digital and language technology compared with globally dominant languages. Several studies attribute this to data scarcity, which has kept Javanese from benefiting from advances in deep learning research. Others point to the lack of accessible data resources and benchmarks for Javanese, in contrast to languages such as English and Mandarin Chinese, and to its under-representation as a low-resource language in Natural Language Processing (NLP) research. Together, these studies underscore the urgent need to improve the digital and technological infrastructure for Javanese.

One feature that makes language technology for Javanese particularly difficult to build is its honorific system. Embedded in the social and cultural fabric of Javanese society, the system uses distinct speech levels to express respect and social hierarchy. This poses a challenge for NLP research, but also an opportunity to better understand and model the interplay of language, culture, and social structure.

The primary objective of this research is to develop a comprehensive Javanese corpus with a special focus on the honorific system. The corpus aims to provide a rich, detailed dataset for improving Javanese NLP tasks such as sentiment analysis, topic classification, and machine translation. By capturing the nuances and variations of the honorific system, the corpus will support more accurate and culturally sensitive computational models. This work will also contribute to the wider field of NLP by providing insights and methodologies that can be applied to other low-resource languages with similar linguistic features.

Student cohort

Double Semester