The flood of new AI reports continues apace – not always with good news for users or the AI sector, as we have seen.
A new survey from customer experience specialist TELUS Digital comes with the headline that user trust in AI depends on how the training data is sourced.
That’s a bold and heartening claim, especially when most leading AI tools – ChatGPT among them (800 million weekly active users) – have been trained on data scraped from the pre-2023 Web, often without permission, and sometimes from known pirate sources. Fifty-plus lawsuits are ongoing worldwide against AI vendors for breaches of copyright.
Meanwhile, a March report from Denmark’s Rights Alliance (Rettighedsalliancen) presents data suggesting that Apple, Anthropic, DeepSeek, Meta, Microsoft, NVIDIA, OpenAI, Runway AI, and music platform Suno scraped known pirated content, such as the free LibGen library. (Suno has admitted to scraping nearly every high-res audio file off the internet, while Meta’s policy of using pirated texts was cited by Judge Chhabria in his copyright judgment last week.)
So, on what basis does TELUS Digital make the claim that trust and data transparency are critical to AI customers, given that the world’s usage data would seem to say otherwise? OpenAI’s subscription revenues have doubled in the past 12 months. What price transparency, then?
The evidence is apparently this: TELUS Digital’s survey of 1,000 US adults finds that 87% believe companies should be transparent about how they source data for Generative AI models. That is up from 75% in a similar survey last year, which – if nothing else – does suggest that news of vendors’ unethical behavior on copyright has an impact.
What’s more, nearly two-thirds of respondents (65%) say that the exclusion of high-quality, verified content – TELUS Digital cites the New York Times, Reuters, and Bloomberg – can lead to inaccurate and/or biased responses from Large Language Models (LLMs).
Interesting stuff, especially given the US Government’s “fake news” war on traditional media, backed by Big Tech and the likes of Elon Musk, all of whom have a vested interest in dismantling the edifice of 20th Century media. “You can trust us”, they say, while sucking up the proprietary content of that century at industrial scale.
Yet while the TELUS Digital survey does suggest that transparency is a growing issue for users in the US – despite the overwhelming force applied by AI vendors, the attempted banning of US state regulation (just overturned by the Senate), and the force-feeding of ChatGPT, Copilot, Gemini, Claude, and other tools on every cloud platform – the figures tell us that customers use the tools regardless. Perhaps while holding their noses.
So, the question is: why do they deploy ChatGPT et al despite their makers’ apparent contempt for creators’ copyright – policies that are being tested in US courts? The answer is found in other reports this year (see diginomica, passim): users primarily adopt AI to save money and time, not to make smarter decisions. And because hype and competitive peer pressure compel them to.
Even so, the growing awareness of vendors’ disregard for creators’ rights has an effect, it seems. This suggests that, if vendors really want their subscription revenues to overtake their vast capex on data centers and chips, then adopting an ethical stance is one way to do it. But that will cost them money: paying for the data they should have licensed in the first place.
Expert data the way forward
So, what does TELUS Digital make of it all?
Amith Nair is Global VP and General Manager, Data and AI Solutions, at the Vancouver, Canada-headquartered provider. Nair says:
As AI systems become more specialized and embedded in high-stakes use cases, the quality of the datasets used to optimize outputs is emerging as a key differentiator for enterprises between average performance and having the potential to drive real-world impacts.
We’re well past the era where general crowdsourced or internet data can meet today’s enterprises’ more complex and specialized use cases. This is reflected in the shift in our clients’ requests from ‘wisdom of the crowd’ datasets to ‘wisdom of the experts’.
Experts and industry professionals help curate such datasets to ensure they are technically sound, contextually relevant and responsibly built.
Nair adds:
In high-stakes domains like healthcare or finance, even a single mislabelled data point can distort model behavior in ways that are difficult to detect and costly to correct.
Fair enough. And as my earlier report revealed, academic studies of LLM behavior find deep problems for the technology whenever real-world complexity challenges simple prompted answers. In many cases, the deeper we dig into LLMs’ responses, the less accurate and the more prone to hallucination they become – having been trained on both fact and fiction, of course.
My take
So, verified, expert, high-quality data is clearly the way ahead, plus the availability of human experts to verify AIs’ workings. But as I suggested above, LLMs’ and Gen-AI’s problems are not as easily solved as that.
First, user behavior is strongly biased towards expediency, and towards cost and time savings. It is not targeted at making smarter decisions: in this sense, AI is little more than the new automation for many enterprise users.
Second, these tools do not hold data in a traditional database. Instead, training data is encoded as statistical weights and token probabilities distributed across the model. As a result, flawed or inaccurate data persists; it can’t simply be deleted.
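To make that concrete, here is a minimal sketch (plain Python, toy values, not any vendor’s actual stack) of why deletion is trivial in a database but has no real equivalent inside a trained model:

```python
import math

# In a database, a bad record is a discrete object: one targeted delete removes it.
records = {"doc_42": "pirated text"}
del records["doc_42"]  # gone for good

# In an LLM, the "record" exists only as probabilities produced by millions of
# weights acting together. A toy next-token step: logits in, softmax out.
logits = {"accurate": 2.0, "hallucinated": 1.5, "other": 0.5}  # hypothetical values
total = sum(math.exp(v) for v in logits.values())
probs = {token: math.exp(v) / total for token, v in logits.items()}
print(probs)
# Flawed training data shifts these probabilities diffusely; there is no single
# row to delete, only costly retraining or fine-tuning to nudge them back.
```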
Therefore, one can only hope that hallucinations are challenged and corrected, despite ample evidence from professional markets, such as legal services, that even seasoned experts are prone to trust chatbots’ output without question.
So, why have lawyers presented hallucinated case law in courts across the US? Because they are time-poor and overwhelmed with paperwork, and because AI CEOs have allegedly lied about their products’ proximity to superintelligence. Marketing BS, in other words: currently the most destructive force on Earth.
And third, as synthetic data booms and the internet is overrun with AI slop generated by millions of shadow-IT users – AIs’ largest customer base – verified, human-authored data will become harder to find, not easier.
The irony of all this is obvious: the least transparent and most exploitative vendors – the ones that dominate the market – have grown fat on selling effort-free text, images, and video to users, rather than solving real-world problems.
What they should have done is sold trust to professionals first.