Vietnamese language education and technology have both advanced recently, as educational authorities work to strengthen language skills among preschool-age ethnic minority children and leading tech firms develop high-quality AI resources for Vietnamese.
According to local educational authorities, significant strides have been made under the Ministry of Education and Training's initiatives aimed at enhancing Vietnamese language education for children from ethnic minority groups. This involves the implementation of various guidelines and the development of criteria to evaluate environments conducive to Vietnamese language learning. Numerous workshops have also been organized to promote language development strategies and to create educational settings centered around the child's learning experience.
Currently, 265 schools and 894 branch sites serve 84,469 preschool children, about 50.3% of whom are from ethnic minorities. The expansion of the boarding model at schools has been particularly beneficial for children from disadvantaged localities, and materials such as outdoor play equipment have been significantly upgraded to meet educational demands.
Adding to this progress, Viettel Solutions and Nvidia have announced an extensive training dataset for Vietnamese AI assistants. According to Dân trí, the dataset is high-quality and large-scale, and it is expected to support further advances in large language models (LLMs) for the Vietnamese language.
This collaborative effort marks the first such partnership between Viettel Solutions and Nvidia and involves direct data collection and processing from multiple sources, followed by normalization and classification. The NeMo Framework and Nvidia’s powerful GPU computational infrastructure played significant roles throughout this process.
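The normalization step described above can be illustrated with a minimal sketch. It does not use the NeMo Framework and is not Viettel's actual pipeline; it relies only on the Python standard library, and the function names `normalize` and `deduplicate` are hypothetical. The sketch assumes that normalization means Unicode NFC normalization (relevant for Vietnamese, where the same accented letter can be encoded in composed or decomposed form) plus whitespace cleanup, and it pairs that with the exact-duplicate removal the Viettel representative mentions later in the article.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Return text in Unicode NFC form with runs of whitespace collapsed."""
    # Assumption: NFC is used so composed and decomposed Vietnamese
    # diacritics compare as equal.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def deduplicate(docs):
    """Yield each normalized document once, skipping empty and repeated ones."""
    seen = set()
    for doc in docs:
        clean = normalize(doc)
        if not clean:
            continue  # drop documents that are empty after cleanup
        # Hash a case-folded copy so the comparison ignores letter case.
        digest = hashlib.sha256(clean.casefold().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        yield clean


if __name__ == "__main__":
    composed = unicodedata.normalize("NFC", "Tiếng Việt")
    decomposed = unicodedata.normalize("NFD", "Tiếng Việt")
    sample = [composed, decomposed, "  tiếng   việt ", "", "Xin chào"]
    for doc in deduplicate(sample):
        print(doc)
```

Hashing keeps memory use proportional to the number of unique documents rather than their total size. It only catches exact matches after normalization, so near-duplicate detection would need a different technique.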
Traditionally, AI models have been trained primarily on English-language data, which poses challenges for building Vietnamese applications. The new dataset is intended to change this, enabling more relevant AI experiences for Vietnamese users and giving the domestic AI community room to grow.
A representative of Viettel expressed optimism, stating, "By utilizing hardware resources and the NeMo library, we managed to process over 500GB of text data simultaneously, equivalent to around 120 million documents and 135 billion tokens." The figures convey the scale of the dataset, and the removal of duplicate and irrelevant text reflects a commitment to clean data and efficient training.
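To put the quoted figures in perspective, the short calculation below derives rough per-document and per-token averages. It assumes 1 GB equals 10^9 bytes and treats "over 500GB" as exactly 500 GB, so the results are approximate lower bounds rather than reported statistics. The function name `dataset_averages` is hypothetical.

```python
def dataset_averages(total_bytes: int, documents: int, tokens: int) -> dict:
    """Compute rough averages from the dataset's headline figures."""
    return {
        "bytes_per_document": total_bytes / documents,
        "tokens_per_document": tokens / documents,
        "bytes_per_token": total_bytes / tokens,
    }


# Figures quoted by the Viettel representative (1 GB = 10**9 bytes assumed).
stats = dataset_averages(
    total_bytes=500 * 10**9,
    documents=120_000_000,
    tokens=135_000_000_000,
)

if __name__ == "__main__":
    for name, value in stats.items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the averages come to roughly 4.2 KB and 1,125 tokens per document, or about 3.7 bytes of text per token.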
This new dataset has already been released via Nvidia’s technology-sharing platform and is available for free to researchers and developers within Vietnam. The intention is to continually enrich the dataset with diverse subject matter and detailed content, making it increasingly relevant and insightful.
Looking forward, both Viettel Solutions and Nvidia are dedicated to creating specialized datasets aimed at developing AI assistants for key areas such as healthcare, education, commerce, and public administration. Their strategic cooperation since 2022, including the latest agreement signed on December 5 to establish AI research and development centers, showcases the commitment of both firms to advanced technology solutions.
The Vietnamese government's AI ambitions, together with the Nvidia partnership, mark meaningful progress toward establishing Vietnam as a hub for advanced AI research and application. The recent dataset launch points to the potential for effective collaboration between major tech firms and educational authorities, and to the broader possibilities for Vietnamese language technology.