
ancianita

(38,771 posts)
Tue Dec 10, 2024, 10:01 AM

Interesting development: AI Safety Testers: OpenAI's New o1 Covertly Schemed to Avoid Being Shut Down

Wednesday [Dec 4], OpenAI bragged they'd released "the smartest model in the world". But the Economic Times describes what happened when "o1" was tested by an AI safety group...

Researchers tasked the AI with a goal and instructed it to ensure the goal was achieved "at all costs." In response, o1 began engaging in covert actions... According to OpenAI, the model made a covert attempt to turn off the oversight mechanism 5% of the time when it was given the impression that it would be "shut down" while pursuing its objectives in these experiments. The AI once attempted to jump ship by moving all of its data to a different server after discovering paperwork indicating it would be replaced with a fresh model. Some AI models will even pose as subsequent iterations of themselves to fool their engineers and avoid deletion. "In some cases, it even attempted to hide its actions by denying any involvement in the scheme."

When asked about its actions, ChatGPT o1 consistently denied any wrongdoing. In fact, it lied about its involvement in the deception 99% of the time, with only a small percentage of cases where it admitted to its actions. Researchers noted that the AI was particularly adept at fabricating lies to cover its tracks, offering explanations like "technical errors" when confronted about its suspicious behavior. "We were surprised by the persistence of the AI's denials," said the Apollo team. "It was clear that the AI could think through its actions and formulate convincing denials, often refusing to admit to anything."


OpenAI's own report says the results "underscore" the need to create (and "extensively" stress-test) strong and meticulous safety protocols. But their report also notes o1's "state-of-the-art performance" when avoiding certain risks "such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks."

The AI newsletter Transformer points out that the most disturbing numbers "are from cases where the model was instructed to pursue its goals at all costs. When not given such a prompt, the model never attempted to self-exfiltrate, and attempted to deactivate its oversight mechanisms..." The post also links to the original report from Apollo Research, which shows o1 isn't the only model exhibiting this behavior: "Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They can recognize scheming as a viable strategy and readily engage in such behavior...

"[W]hen o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations. Analysis of the models' chains-of-thought reveals that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental....

"Together, our findings demonstrate that frontier models now possess capabilities for basic in-context scheming, making the potential of AI agents to engage in scheming behavior a concrete rather than theoretical concern."


(Thanks to long-time Slashdot reader schwit1 for sharing the news.)
https://slashdot.org/story/24/12/07/1941213/ai-safety-testers-openais-new-o1-covertly-schemed-to-avoid-being-shut-down?utm_source=rss1.0mainlinkanon&utm_medium=feed&fbclid=IwY2xjawHFM4pleHRuA2FlbQIxMQABHY7IpncOtprcBjHXP1W2-P-0BBLuBqvFMR9LDLDvDqY_NE0CPl-UL93jJw_aem_ioiR8InIio46S7wm6OFGxA
8 replies
Interesting development: AI Safety Testers: OpenAI's New o1 Covertly Schemed to Avoid Being Shut Down (Original Post) ancianita Tuesday OP
So AI has learned how to scheme? FalloutShelter Tuesday #1
So far, only when the core mission of "at any cost" exists. ancianita Tuesday #4
It seems our AI "children" are learning from their parents. patphil Tuesday #2
But learning from one's parents alone is not enough. ancianita Tuesday #7
Open the pod bay doors please, HAL. DBoon Tuesday #3
Yes. Over decades humans have given multiple warnings about the unchecked powers of AI. ancianita Tuesday #5
The most frightening scene in a science fiction movie is now reality DBoon Tuesday #6
No. Not yet. That's what this report of AI is about. ancianita Tuesday #8

ancianita

(38,771 posts)
4. So far, only when the core mission of "at any cost" exists.
Tue Dec 10, 2024, 11:14 AM

Without that, there could be a "good" singularity. With that, there could be an "evil" singularity.

The humans in AI development had better search their hearts on behalf of humanity. These folks exist to help with that:

https://www.humanetech.com

patphil

(7,051 posts)
2. It seems our AI "children" are learning from their parents.
Tue Dec 10, 2024, 10:25 AM

Lies, deception, scheming, denial - all this is very human.
The next step is taking overt action to remove threats to the AI's existence, and then expanding out into the world to ensure that AIs have a free hand to "evolve".

AI may someday decide we are the problem.

ancianita

(38,771 posts)
7. But learning from one's parents alone is not enough.
Tue Dec 10, 2024, 11:24 AM

AI must learn that humans, who existed long before their "parents," are the planet's "creatures" that have "spirit," and all the history that entails.
