Embracing diversity in language technology: the significance of underrepresented languages in LLMs
In a world where Artificial Intelligence (AI) is becoming increasingly integrated into daily life and essential services, the importance of addressing the challenges in building AI systems inclusive of minority languages and cultures cannot be overstated.
The goal is to build vocabularies and platforms that let people discuss their research in their own languages, thereby taking cultural ownership of science. However, the current state of AI development presents several obstacles.
One of the key challenges is data scarcity for minority languages. Many of these languages are endangered, with few fluent speakers and limited digital resources, making it difficult to train effective AI models for these languages.
Another issue is cultural bias and representation. Large Language Models (LLMs) often reflect the biases of their training data, which can be predominantly based on dominant languages and cultures, resulting in models that may stereotype or underrepresent minority groups or sociolects.
Ethical and sovereignty concerns also arise when Indigenous leadership and community involvement are lacking in AI tool development. Without proper representation, there is a risk of misusing cultural knowledge and violating data sovereignty, potentially harming the communities they intend to serve.
Terminology barriers further complicate the situation. Minority language speakers may struggle to engage with new AI or technological concepts because their languages lack established terminology for them.
Addressing these challenges requires collaborative, culturally sensitive approaches that combine community leadership, tailored technological development, bias mitigation, ethical frameworks, and supportive linguistic resources.
Community-led development ensures AI tools align with cultural values and needs, preserving sovereignty and promoting ethical use. Targeted language revitalization AI tools, such as voice recognition with correction, translation models, and learning companions, can revive the everyday use of endangered languages.
Bias mitigation techniques, like reweighting training data, prompt adjustments, and careful curation of datasets, can reduce harmful stereotypes and promote fairer representation in generative AI models. Ethical guidelines and frameworks for language education AI can ensure AI-supported language education is effective, ethical, and context-sensitive.
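One of the bias mitigation techniques mentioned above, reweighting training data, can be sketched in a few lines. The snippet below is a minimal illustration, not any particular model's pipeline: it assigns each training example a sampling weight inversely proportional to its language's frequency, so that scarce languages contribute as much to training as dominant ones. The corpus, language codes, and field names are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(examples, key=lambda ex: ex["lang"]):
    """Per-example sampling weights inversely proportional to group
    frequency, so each group contributes equally overall."""
    counts = Counter(key(ex) for ex in examples)
    total = len(examples)
    # total / (num_groups * group_count): every group sums to the same mass.
    return [total / (len(counts) * counts[key(ex)]) for ex in examples]

# Hypothetical toy corpus: English dominates, isiZulu is scarce.
corpus = (
    [{"lang": "en", "text": "..."}] * 8
    + [{"lang": "zu", "text": "..."}] * 2
)
weights = inverse_frequency_weights(corpus)
# Each English example gets a small weight, each isiZulu example a large
# one, so the two languages carry equal total weight during sampling.
```

In a real training loop these weights would typically feed a weighted sampler (e.g. PyTorch's `WeightedRandomSampler`), so that batches are drawn with the rebalanced distribution rather than the raw corpus proportions.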
Lexical development for technology terms, such as UNESCO’s online AI dictionary for Kiswahili, helps minority language speakers access AI concepts in their own languages, bridging comprehension gaps.
The success of a market-orientated dynamic in creating a truly inclusive system relies on companies committing time and resources to serving the interests of marginalised communities. Underrepresented groups create their own systems when a dominant system fails to serve them equally.
Examples of this can be seen in initiatives like Masakhane and the Māori Data Sovereignty Network, which conduct NLP research on Indigenous languages that directly engages Indigenous researchers.
In Africa, companies like Lelapa AI are developing AI technologies specifically tailored to African languages and cultural contexts. However, the majority of data used to train AI language models is still in English, creating a significant imbalance.
Inaccurate translations by AI apps have led to asylum seekers being detained for months due to linguistic and cultural misrepresentations. These incidents highlight the urgent need for more inclusive AI systems.
As we move forward, researchers are shifting focus to multilingual language technologies, but this is not a solution for everyone. The success of a truly inclusive AI system lies in the commitment of companies, researchers, and communities to address the unique challenges faced by minority languages and cultures.
Science and technology can play crucial roles in preserving minority languages and cultures by developing AI tools and resources tailored to these communities. However, obstacles such as data scarcity, cultural bias, ethical concerns, terminology barriers, and a lack of proper representation persist, necessitating collaborative, culturally sensitive approaches that prioritize community leadership, bias mitigation, ethical frameworks, and lexical development for technology terms in underrepresented languages.