
Kim Byoung-pil
The author is a professor of technology management at KAIST.
Artificial intelligence requires vast amounts of data to learn. But how much is enough to achieve true international competitiveness? The Chinese AI startup DeepSeek is said to have trained its models on roughly 50TB of text. If 20 percent of that is in Chinese, that alone would equal about 10TB — the equivalent of 30 million books. Korea’s National Library holds about 10 million books. Even if every Korean book ever published were digitized, it would still fall far short.
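The comparison above can be checked with a rough back-of-the-envelope calculation. The sketch below is illustrative only. It assumes decimal units (1TB = 10^12 bytes) and assumes that a Korean book would average the same amount of text as the 10TB-to-30-million-books ratio implies. Neither assumption comes from DeepSeek or the National Library, and the variable names are hypothetical.

```python
# Back-of-the-envelope check of the figures cited in the column.
# Assumptions (not stated in the column): decimal units (1 TB = 10**12 bytes)
# and Korean books averaging the same text size as the implied ratio.

TB = 10**12  # bytes per terabyte, decimal convention (assumed)

corpus_tb = 50                 # reported size of DeepSeek's text corpus
chinese_share_pct = 20         # the column's hypothetical Chinese share, in percent
books_equivalent = 30_000_000  # books said to equal the Chinese-language portion
library_books = 10_000_000     # approximate National Library of Korea holdings

# Size of the Chinese-language portion of the corpus, in TB
chinese_tb = corpus_tb * chinese_share_pct / 100

# Average text size per book implied by "10TB equals 30 million books"
bytes_per_book = chinese_tb * TB / books_equivalent

# Text the library's holdings would yield at that per-book size, in TB
library_tb = library_books * bytes_per_book / TB

# How many times larger the Chinese-language portion is than the library's text
shortfall_ratio = chinese_tb / library_tb

print(f"Chinese-language text: {chinese_tb:.0f} TB")
print(f"Implied size per book: {bytes_per_book / 1000:.0f} KB")
print(f"National Library text: {library_tb:.1f} TB")
print(f"Chinese portion is {shortfall_ratio:.1f} times larger")
```

Under these assumptions, each book works out to roughly 333KB of text, so 10 million volumes would yield about 3.3TB, around one-third of the Chinese-language portion alone.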
And text is no longer enough. AI is now entering the age of large world models: systems that learn not only from words but also from images and records of human actions. Such models are essential for self-driving cars, robotics and medicine. At this stage, tens of terabytes of multimodal data are required. Without such data, Korea risks being pushed to the margins of the global technology map.
![[JOONGANG ILBO]](https://www.europesays.com/wp-content/uploads/2025/08/4328f9b4-7c52-406b-88c7-3784ab9a72b3.jpg)
The greatest obstacle to building massive training datasets is the law. Copyright, database rights, privacy and portrait rights are all entangled in data. Eliminating even minor risks of infringement is nearly impossible. Lawsuits are already piling up in Korea and abroad. The legal uncertainty has been debated for years, but the accelerating pace of AI demands an urgent solution.
One answer may be a special law for training data. Simply allowing companies to use all works without constraint would unfairly sacrifice rights holders. Instead, Korea could exempt companies from liability for using carefully chosen categories of data: those with low infringement risk but high value for AI training. Because technology changes so quickly, such a law could be introduced for a limited period, subject to regular review and renewal.
The exemptions should be limited. They should apply only to general-purpose AI, meaning models that boost productivity and enrich society broadly, and only to large-scale systems requiring multiple terabytes of data. For instance, the threshold could be set at models with hundreds of billions of parameters.
Ironically, large AI systems may help safeguard rights holders. General-purpose models are increasingly able to recognize legal and ethical boundaries, including judging possible copyright violations. Recent advances show that large vision-language models can already be used to assess infringement risks. Yet such capabilities are possible only after training on vast datasets. In other words, to prevent copyright violations, mass training on copyrighted works may need to be allowed — another reason to consider legal exemptions.
![A giant screen shows Chinese President Xi Jinping shaking hands with DeepSeek founder Liang Wenfeng during a symposium on private enterprises at a shopping complex in Beijing on Feb. 17. [REUTERS/YONHAP]](https://www.europesays.com/wp-content/uploads/2025/08/b3456015-aaf2-4b9c-bbc8-dae6cf8f1425.jpg)
But immunity must not be unconditional. The benefits of AI training must feed back into Korea’s industrial ecosystem. Several safeguards could be put in place. Training data, for instance, could be required to be stored and processed domestically, preventing uncontrolled overseas transfers and ensuring economic value remains in Korea. Transparency and accountability measures — such as a government-run registration system for training datasets — could also strengthen trust and safety.
A special law would not apply only to private firms. It could empower the government to build and provide large-scale training databases through public institutions. The most symbolic resource would be the National Library of Korea’s 10 million-plus volumes, which could be digitized and refined into a high-quality text database. Rather than leaving companies to purchase books individually, the state could supply standardized, legally sound data.
Such a public data hub could incorporate other Korean-language resources: online archives, academic papers, court rulings and textbooks — and eventually expand to speech and video. Applying the special law to this effort would reduce costs and eliminate legal uncertainty around vast resources.
![Figurines with computers and smartphones are pictured in front of the words “Artificial Intelligence.” [REUTERS/YONHAP]](https://www.europesays.com/wp-content/uploads/2025/08/55f9445e-80e1-416d-b158-84baf42b3fe3.jpg)
How exactly to design such a law will require broad debate. But delay is no longer an option. Under the current legal framework, building training datasets of tens of terabytes is nearly impossible. If nothing changes, the Korean language and Korean culture risk being sidelined in the age of AI.
Korea must move quickly to adapt its laws to the new technological environment. Passing a special law for AI training data would be the first step.
This article was originally written in Korean and translated by a bilingual reporter with the help of generative AI tools. It was then edited by a native English-speaking editor. All AI-assisted translations are reviewed and refined by our newsroom.