AI devs close to scraping bottom of data barrel • The Register

Those spiffy AI systems that tech companies keep promising require mountains of training data, but high-quality sources may have already run out, unless enterprises can unlock the information trapped behind their firewalls, according to Goldman Sachs.

Training data is the Achilles heel of massive new AI models, as detailed by George Lee, co-head of the Goldman Sachs Global Institute, in a recent webcast on data’s role in AI.

“The quality of the outputs from these models, particularly in enterprise settings, is highly dependent on the quality of the data that you’re sourcing and referencing,” Lee said.

The problem is finding enough quality data, according to Neema Raphael, Goldman Sachs’ chief data officer and head of data engineering. Some developers may be resorting to synthetic data or training models on the output of existing AI systems.

“We’ve already run out of data,” Raphael said. “When you read about the new models, the undertone of what people say, like with models like DeepSeek, is how did they do that with less money? One of the big hypotheses is they trained against another model.”

If models are being trained on the output of other models and less on real-world data, he added, the interesting thing is going to be how previous models shape what the next iteration of the world looks like.

One danger is model collapse, in which an AI system's performance degrades after it is trained on its own previously generated output. The model loses nuances it had previously learned, while errors accumulate and are amplified with each new generation.
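To make the mechanism concrete, here is a toy sketch in Python. It is not how any production model is trained: the "model" is just a bell curve, and the function name `simulate_collapse` and its parameters are illustrative assumptions. Each generation is fitted only to a small sample drawn from the previous generation, and the spread of the fitted curve tends to shrink as estimation errors compound.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the standard deviation of the model at every generation,
    starting with the original "real data" distribution (std = 1.0).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stands in for real-world data
    history = [std]
    for _ in range(generations):
        # The next model sees only output sampled from the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Training" here is just estimating mean and spread from samples.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    stds = simulate_collapse()
    for gen in (0, 50, 100, 200):
        print(f"generation {gen:3d}: std = {stds[gen]:.6f}")
```

The narrowing spread is the toy counterpart of a model losing the rarer, more nuanced parts of what it once knew. Mixing fresh real data into each generation's sample would slow the effect in this sketch.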

But when asked if this might hold back or even torpedo the unrealized potential of upcoming AI developments like autonomous agents, Raphael said he didn’t think it would be a roadblock to future advances.

“There will be a curse of AI slop against more insightful data, but I don’t think it’s really going to be a massive constraint because there is a lot of trapped enterprise data that still has not been harnessed,” he said.

The amount of information that lives behind corporate firewalls and trapped inside data repositories is “highly salient to garnering business value,” according to Goldman Sachs.

“From an AI perspective, it’s obvious that it’s real and it’s here to stay. There is absolutely a hype to it, but also when you go on your phone and you take a picture and ask what is this and you get great answers – it’s definitely real in consumer apps,” Raphael said.

“I think the potential in the enterprise is still to be seen – where can people harness their enterprise data and their proprietary data to make some differentiation. That’s the ‘to be seen’ part,” he explained.

“Cleaning your data, normalizing it, having the semantics of the data understood, all of this stuff is what’s going to allow enterprises to level up,” he added.

However, this optimism should be set against recent findings that US companies have invested up to $40 billion in generative AI initiatives already, with almost nothing to show for it; that autonomous AI agents get office tasks wrong most of the time; and that AI systems need humans to monitor them and correct their mistakes. ®
