EU orders AI companies to clean up their act, stop using pirated data • FRANCE 24 English

tech editor Peter O’Brien who’s been looking into that for us. Hi Peter.
Hi Jenny.
So like any new rules released by the EU, it was bound to be controversial.
Controversial EU rules? Never. No, this code of practice, I’ll just say what it is briefly: it’s guidelines around AI safety, copyright and transparency. And it’s particularly focused on the companies creating cutting-edge generalist AI chatbots, so OpenAI for its ChatGPT, Google for its Gemini or Anthropic for Claude, for instance. And as you say, it’s caused a stir. Big tech lobbies say it goes too far. Civil society groups say it’s been watered down by those big tech lobbies. Case in point: CCIA Europe, which represents the technology industry, says it imposes a disproportionate burden on AI providers, while The Future Society, a civil society think tank, says it perpetuates providers’ “deploy first, question later” mentality and means that potentially dangerous models will get to European Union users without receiving any meaningful scrutiny. They’re essentially arguing that the EU wants to look innovation-friendly and also doesn’t want to piss off Donald Trump, and has therefore given big tech lobbyists exclusive access to the last draft of this document because they want to make sure they get some of these tech companies signing on the dotted line. Obviously, the EU Commission will say that there’s not much point in a code of practice if no one’s signed up for it. And we’ll have to see who will sign up to it. We’re not sure yet.
And some of the key parts of the code of practice surround data. Tell us more.
Yeah, so data is the lifeblood of all of these AI models. It’s crucial to how they work, what kind of data you put in and why. Now, up to now, most of the models don’t make it very clear or transparent what kind of data they’re using and how. And the code of practice is going to change this.
So signatories will have to report on their training method, their training data, how they got it, what kind of data it is, and also provide evidence that they actually obtained the rights to third-party data. They’re also going to have to open the door to independent external evaluators to look at their AI models and even look at their relevant training data. So that could be a big step for these companies to have to take.
Now copyrighted data is a particularly thorny issue. Does it deal with that as well?
Yeah. So one of the three parts of this code of practice is dedicated to copyright. And if we call data the lifeblood of these models, well, artists and authors have been saying for a long time now that this is lifeblood that’s been sucked from them without their permission for profit. Web crawlers have been crawling the internet, scooping up everything on it, including copyrighted content, and using it to feed into the machine and train these models. Anthropic, as was revealed in court documents recently, has even made copies of every single book they could possibly find, millions of them, by buying secondhand books and scanning them page by page and then destroying them. Not only that, vast swathes of deliberately pirated material have been used to train these models, or at least the early versions of them. Now, the code of practice asks companies, for the first time, to commit to not using pirated databases and also asks them to allow rights holders to opt out of their work being used to train these models.
It comes hot on the heels of three court cases in the US where judges have for the first time pronounced on this issue, and they’re tending to say that using copyrighted material for training AI is fair use, although that will differ from case to case. But crucially, they haven’t given a free pass to these companies on pirated content.
And I think it comes as no surprise that AI companies are paying top dollar for top data.
That’s right. So now the market for high-quality proprietary data is exploding, and it’s not just because of the push to respect copyrights and the push to crack down on pirated content, but also simply because the better your data is, the more competitive your model is. So, you might have heard of last month’s $14 billion investment by Meta into Scale AI. This startup provided training data to a bunch of these different AI companies. But such is the competition that this deal has now spooked some of them off and sent them running to alternatives. The CEO of rival AI data company Turing, Jonathan Siddharth, told me that business has been booming in the last couple of weeks. Now their business model, like some of the other data providers, is based on millions of freelance contractors, software engineers and experts in poor countries around the world. One of them in India told me that it was decent money, flexible hours, but zero job security whatsoever. So it’s kind of like the rebirth of the gig economy for the AI era. Now, Siddharth, I asked him earlier this week about this data and piracy issue. Take a listen.
I think it’s really important to respect the rights of creators, which is why platforms like Turing are one of the ways to responsibly scale model performance, because the data that we generate for each of our clients is proprietary. The talent gets paid, and the client is paying for data that they basically own, which is different from scraping content from the internet.
And new startups that come along, they almost in a way have to use pirated content if they are to have any chance of catching up, because they simply can’t afford the kind of service you provide. Is there any solution to this or do we just have to accept it?
I think there are some good academic data sets, like Common Crawl, C4, GitHub, arXiv. There are a few data sets that academia usually trains on for pre-training. At Turing, we are also considering how we can help in this regard. We’ll have more to share about that soon.
Okay. I did try to push Turing on what exactly their solution is going to be for making those high-quality proprietary data sets available to the wider public and to startups that would usually be stealing data. They didn’t bite. But in any case, I think it’s very interesting how the EU, in its push to regulate this, could be one of the contributing factors to this entirely new way of working around the world. Turing itself has about four million contractors around the world.

The European Commission wants AI companies to stop using pirated data and allow creators to withhold their copyrighted material. This comes amid the rise of a massive global workforce of remote workers from poor countries, who provide bespoke data via third party brokers. We take a closer look in this edition of Tech 24.

Read more about this story in our article: https://f24.my/BJAh.y


43 comments
  1. I have an idea that would actually work: let's publish a list of pirate companies that will then fall out of intellectual property and hacking protection and whose bills you don't have to pay by law. Problem solved.

  2. This isn't going to work.

    Suppose a Chinese company in China uses 1 million songs in its data source to make an AI model, obeying Chinese copyright law (which is almost non-existent), then in China you use the model to create 1 million new songs.

    You take those 1 million new songs to the EU. The Chinese company has a copyright for those 1 million songs because they were generated in China by a Chinese AI system that wasn't breaking Chinese law.

    You can use the 1 million new songs that the Chinese company has to train a new AI model in Europe and you have not broken any law in the EU or anywhere.

  3. Hope this opens an easier legal recourse for companies & creators to sue these AI companies and make them pay or be forced to delete the knowledge base of their AI models

  4. This is so stupid. Now the EU is definitely going to be left behind in the AI race. I guess it's going to be the US and China driving the technology forward for the foreseeable future. Good luck getting either of them to conform to these rules.

  5. EU “clean your data before you make your data models for a.i.”

    Meanwhile American congressmen “how you press button? What’s the inter-webs?”

  6. Tell this reporter to speak more naturally, ouch! He swallows half the words with his heavy English telly voice, both annoying and hard to understand.

  7. This is just preaching to the bubble(s). To the AItrepreneurs… Don't be a moonshiner, be a distiller. Do not be a Pirate, be a Privateer. To the Investors… The dataWash has begun. The Internet is not only dead… the corpse is being cleaned. Better get a brush… or your generational knowledge will now be subscription based. "What's in your robot?"

  8. Digital Artists were screwed over from one day to the next. Time to go back to paper and pencil for authentic art.

  9. EU regulation? Outdated 😂😂😂😂 that's why you are not innovating. The EU has money because of tourists. Europe without tourists: a 3rd world country already.😂

  10. Copyrights?😂😂 They still live in the old world. How can AI improve without data? Copyright makes AI slower and not so smart. Think about the benefits of AI, not always criticism. That's why you are behind.

  11. AI has one tool: LOGIC. It cannot function in another way. If someone wants to make a LAI out of it, they need to be highly professional in LIES.

  12. The EU’s call for AI companies to “clean up their act” seems less about ethics and more about sabotage. Instead of innovating or competing seriously in the field of artificial intelligence, Europe—lagging far behind global AI leaders—now wants to rewrite the rules to slow others down. It’s ironic to try regulating an industry in which you’re barely a player. If Europe wants to be relevant in AI, it should invest in talent, infrastructure, and innovation—not create roadblocks for those who already lead the game.

  13. Hindering your own development of technology won't make competitors stop theirs. Nobody could stop China from copying other products, "fake it till you make it", and they made it. What is going to happen with Europe's own AI technology? Will it be left behind while hindered by stone-age bureaucracy?

  14. Never gonna work with something like A.I. All this will do is force A.I. companies to abandon the E.U.
    Since the alternative would result in dumb A.I.
    Enjoy your Bingbot, Siri and Alexa, E.U. regulators.

  15. Wow, they are able to take and use copyrighted data commercially without paying! Talk about avoiding costs (and undermining the whole point of copyright, from a commercial viewpoint).

  16. There should be a proper way of balancing both sides, because content creators need AI too. If AI pays content creators, the price of AI content will increase. 😮
