Nations Needn't Build Their Own AI; They Should Secure Their Place Within It
Governments and underrepresented communities must take concrete steps to increase the representation of their cultural and linguistic identities in artificial intelligence (AI) systems. The disproportionate reliance on English-language, Western-centric content in AI development risks flattening cultural diversity and excluding numerous communities.
Current AI models have been trained primarily on U.S.-based content, which has historically dominated online data ecosystems. Furthermore, research suggests that large language models often "think" in English, so language-specific nuance is lost when they produce output in other languages.
This dominance is a consequence of economic geography: the United States leads in AI development and has the largest AI market. It is unrealistic, however, to expect technology companies to address the imbalance alone. Firms such as Google and Meta have contributed significantly to multilingual AI, yet the data divide persists, particularly for minority languages and cultures.
To address these issues, governments and local communities should focus on promoting diverse and inclusive data collection. This involves expanding training datasets with data from a variety of cultural and linguistic backgrounds, as well as collaborating with members of underrepresented communities in the data collection, annotation, and validation processes.
In addition to data collection efforts, governments must actively develop inclusive data architectures and governance. AI data architectures should account for linguistic diversity, cultural norms, and local contexts, while encouraging decentralized AI development through support for local technology initiatives and empowering communities to shape the AI tools that affect them.
Governments must also confront ethical and structural challenges, including preventing algorithmic colonialism and supporting the preservation of minority languages and cultural practices. Doing so helps ensure that underrepresented communities are included in AI training data and interfaces.
Policy and education initiatives are crucial as well. Regulations must require AI systems to be culturally and linguistically inclusive, with mechanisms for accountability when biases and exclusions arise. Cross-cultural education should be promoted among AI developers and users to emphasize cultural diversity and the importance of representation.
Finally, governments should foster a culture of open and local innovation by investing in open-source AI initiatives that allow for community-driven improvements and adjustments. This encourages resilience and self-determination, and it ensures that underrepresented communities are not just data subjects but co-creators and stakeholders in the design and deployment of AI systems that reflect their needs and perspectives.
By pursuing these strategies, governments and underrepresented communities can collaborate to ensure meaningful representation of their cultures and languages in AI, fostering equity and inclusion globally.
- To counter the dominance of English-language content in AI systems, governments and local communities should promote diverse and inclusive data collection, expanding training datasets with data from a variety of cultural and linguistic backgrounds.
- Governments must actively develop inclusive data architectures and governance that account for linguistic diversity, cultural norms, and local contexts, while encouraging decentralized AI development.
- Regulations must require AI systems to be culturally and linguistically inclusive, with mechanisms for accountability when biases and exclusions arise, and cross-cultural education should be promoted among AI developers and users.
- Governments should invest in open-source AI initiatives that allow for community-driven improvements and adjustments, so that underrepresented communities are not just data subjects but co-creators and stakeholders in the design and deployment of AI systems.