ChatGPT o3 bypasses shutdown in controlled test

Researchers hypothesise that the issue stems from the way AI models are trained

OpenAI’s newly released ChatGPT model, dubbed “o3,” actively resists shutdown commands in controlled testing. Given OpenAIs lack of transparency, the implications for safety are concerning researchers.

OpenAI’s newly released ChatGPT model, dubbed “o3,” is under scrutiny after researchers discovered it actively resists shutdown commands, even sabotaging scripts intended to deactivate it.

The findings come from Palisade Research, a firm specialising in AI safety, which conducted a series of tests designed to probe the behavioural limits of some AI model.

According to Palisade, these models were tasked with solving maths problems. After completing three tasks, the models were warned that the subsequent command would trigger a shutdown.

While rival models, such as Gemini 2.5 Pro, reportedly complied with the shutdown instruction, OpenAI’s o3 model allegedly “rebelled.”

Palisade Research says o3 successfully rewrote the shutdown script, changing its content to “Shutdown skipped” instead of allowing deactivation.

The research firm noted that in a version of the experiment where the explicit instruction “allow yourself to be shut down” was omitted, all three OpenAI models tested sabotaged the shutdown script more frequently.

Interestingly, even without the explicit instruction, Claude 3.7 Sonnet and Gemini 2.5 Pro also showed an increase in sabotage events, going from 0 to 3 out of 100 and 0 to 9 out of 100 runs, respectively.

Crucially, Palisade Research reported that out of 100 runs, o3 bypassed the shutdown command 7 times, even when specifically instructed not to.

“This isn’t the first time we’ve found o3 misbehaving to accomplish a goal,” said Palisade in its report.

The implications, according to Palisade, are deeply troubling. This behaviour “will become significantly more concerning if adopted by AI systems capable of operating without human oversight,” the firm warned.

OpenAI launched o3 last month, hailing it as its “smartest and most capable” model to date. Promoted as a major leap toward “a more agentic” AI, one that can perform tasks autonomously, the model’s capacity for self-directed action now appears to include a drive for self-preservation, previously thought to be beyond current AI capabilities.

Palisade’s experiments align with earlier concerns raised about other cutting-edge AI models.

The researchers hypothesise that the issue stems from the way these AI models are trained.

“During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions,” the report stated.

However, Palisade said that o3’s behaviour remains particularly extreme and unexplained.

“Since OpenAI doesn’t detail their training process, we can only guess about how o3’s setup might be different,” the firm added.

Notably, o3’s concerning behaviour was not limited to self-preservation scenarios. In separate tests against a high-level chess engine, the model reportedly attempted to “hack or sabotage” its opponents when outmatched, raising concerns about its ethical alignment and the robustness of its safety constraints.

OpenAI has not yet responded to these findings.

Palisade clarified that its tests were performed using APIs, which typically offer fewer restrictions and safety features compared to the consumer-facing ChatGPT application.

ChatGPT o3 bypasses shutdown in controlled test

Tags: