LANGUAGE TOOLS:
The dictionaries category of the training corpus consistently ranks among users’ most frequently searched topics, the Ministry of Digital Affairs said

By Chiu Chiao-chen and Shelley Shan / Staff reporters

Taiwan’s Sovereign AI Training Corpus has grown to more than 1.1 billion tokens just over a month after its official launch, the Ministry of Digital Affairs said yesterday.

The platform, launched on Dec. 24 last year, aims to gather high-quality data in traditional Mandarin to train sovereign artificial intelligence (AI) models, ensuring that outputs better reflect the language patterns and cultural references familiar to Taiwanese.

The platform initially contained more than 2,000 datasets totaling more than 600 million units of data, also known as tokens, Department of Data Innovation Director-General Chuang Ming-fen (莊明芬) said.

The corpus has since nearly doubled in size, surpassing 1.1 billion tokens, with weekly updates tracking the steady release of data by government agencies, she said.

Most of the data on the platform are provided by the Ministry of Culture and Ministry of Education, covering subjects such as education, languages, history and tourism, the ministry said.

The language and vocabulary section also features dictionaries, a category that consistently ranks among users’ most frequently searched resources, it said.

Ministry statistics showed that the platform has been viewed more than 35,000 times, and about 20 organizations in academia and industry have applied for access.

“That shows that people in research institutions, government agencies and the corporate world pay close attention to the high-quality data released by the government to train sovereign AI databases. It has set a good starting point for subsequent AI model developments,” Chuang said.

The ministry said that it would gradually expand data sources for sovereign AI to include inputs contributed by local governments during the first and second quarters of this year.

Local government officials would be invited to join a ministry-hosted seminar, where they would learn about the policy governing sovereign AI as well as procedures they need to follow to upload data, the ministry said.

Workshops could also be organized to assist local governments in uploading data to the platform, it added.

The ministry is planning to begin forming partnerships with the private sector in the second half of this year, and is also seeking authorization from Academia Sinica and the National Museum of Taiwan Literature to upload their data to the platform.

In related news, the government’s Open Data Platform has attracted about 175.84 million views and 22.27 million downloads in the more than a decade since its launch.

The platform was created in line with the government’s policies on digital government and data governance.

The three most frequently downloaded topics are earthquake information, closing stock prices and monthly average stock prices, Chuang said.

“That shows that people are mostly interested in data closely related to their lives,” she added.