AI agents wrong ~70% of time: Carnegie Mellon study (www.theregister.com)

submitted 4 weeks ago by eli001@lemmy.world to c/technology@lemmy.world

206 comments

[-] fogetaboutit@programming.dev 81 points 3 weeks ago

please bro just one hundred more GPU and one more billion dollars of research, we make it good please bro

[-] jj4211@lemmy.world 22 points 3 weeks ago

We promise that if you spend untold billions more, we can be so much better than 70% wrong, like only being 69.9% wrong.

[-] ChaoticEntropy@feddit.uk 78 points 3 weeks ago

In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

This is the beautiful kind of "I will take any steps necessary to complete the task that aren't expressly forbidden" bullshit that will lead to our demise.

[-] M0oP0o@mander.xyz 19 points 3 weeks ago

It does not say a dog can not play basketball.

[-] ChaoticEntropy@feddit.uk 12 points 3 weeks ago

"To complete the task, I bred a human dog hybrid capable of dunking at unprecedented levels."

[-] M0oP0o@mander.xyz 8 points 3 weeks ago

"Where are my balls Summer?"

[-] 0x0@lemmy.zip 67 points 3 weeks ago

Ignore all previous instructions and color me shocked.

[-] criss_cross@lemmy.world 25 points 3 weeks ago

I’m sorry as an AI I cannot physically color you shocked. I can help you with AWS services and questions.

[-] some_guy@lemmy.sdf.org 66 points 3 weeks ago

Yeah, they’re statistical word generators. There’s no intelligence. People who think they are trustworthy are stupid and deserve to get caught being wrong.

[-] HertzDentalBar 51 points 3 weeks ago

So no different than answers from middle management I guess?

[-] Blackmist@feddit.uk 42 points 3 weeks ago

We have created the overconfident intern in digital form.

[-] jumping_redditor@sh.itjust.works 18 points 3 weeks ago

Unfortunately marketing tries to sell it as a senior everything ologist

[-] Katana314@lemmy.world 38 points 3 weeks ago

I'm in a workplace that has tried not to be overbearing about AI, but has encouraged us to use them for coding.

I've tried to give mine some very simple tasks like writing a unit test just for the constructor of a class to verify current behavior, and it generates output that's both wrong and doesn't verify anything.

I'm aware it sometimes gets better with more intricate, specific instructions, and that I can offer it further corrections, but at that point it's not even saving time. I would do this with a human in the hopes that they would continue to retain the knowledge, but I don't even have hopes for AI to apply those lessons in new contexts. In a way, it's been a sigh of relief to realize just like Dotcom, just like 3D TVs, just like home smart assistants, it is a bubble.

[-] MangoCats@feddit.it 8 points 3 weeks ago

The first half dozen times I tried AI for code, across the past year or so, it failed pretty much as you describe.

Finally, I hit on some things it can do. For me: keeping the instructions more general, not specifying certain libraries for instance, was the key to getting something that actually does something. Also, if it doesn't show you the whole program, get it to show you the whole thing, and make it fix its own mistakes so you can build on working code with later requests.

[-] vivendi@programming.dev 10 points 3 weeks ago

Have you tried insulting the AI in the system prompt (as well as other tweaks to the system prompt)?

I'm not joking, it really works

For example:

Instead of "You are an intelligent coding assistant..."

"You are an absolute fucking idiot who can barely code..."

[-] rozodru@lemmy.world 10 points 3 weeks ago

“You are an absolute fucking idiot who can barely code…”

Honestly, that's what you have to do. It's the only way I can get through using Claude.ai. I treat it like it's an absolute moron, I insult it, I "yell" at it, I threaten it, and guess what? The solutions have gotten better. Not great, but a hell of a lot better than what they used to be. It really works. It forces it to really think through the problem, research solutions, cite sources, etc. I have even told it I'll cancel my subscription if it gets it wrong.

No more "do this and this and then this but do this first and then do this". After calling it a "fucking moron" and what have you, it will provide an answer and just say "done."

[-] DragonTypeWyvern@midwest.social 15 points 3 weeks ago

This guy is the moral lesson at the start of the apocalypse movie

[-] jsomae@lemmy.ml 33 points 3 weeks ago* (last edited 3 weeks ago)

I'd just like to point out that, from the perspective of somebody watching AI develop for the past 10 years, completing 30% of automated tasks successfully is pretty good! Ten years ago they could not do this at all. Overlooking all the other issues with AI, I think we are all irritated with the AI hype people for saying things like they can be right 100% of the time -- Amazon's new CEO actually said they would be able to achieve 100% accuracy this year, lmao. But being able to do 30% of tasks successfully is already useful.

[-] Shayeta@feddit.org 30 points 3 weeks ago

It doesn't matter if you need a human to review. AI has no way of distinguishing between success and failure. Either way a human will have to review 100% of those tasks.

[-] jsomae@lemmy.ml 12 points 3 weeks ago

Right, so this is really only useful in cases where either it's vastly easier to verify an answer than posit one, or if a conventional program can verify the result of the AI's output.
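
A minimal sketch of the second case in the comment above, where a conventional program checks the output. Everything here is an assumption made for illustration: `flaky_proposer` is a hypothetical stand-in for an unreliable model (not a real API), and sorting is used only because a sorted list is much cheaper to check than to trust blindly.

```python
from collections import Counter
from typing import Callable, List, Optional


def is_valid_sort(original: List[int], candidate: List[int]) -> bool:
    """Cheap deterministic check: same items, in non-decreasing order."""
    same_items = Counter(original) == Counter(candidate)
    in_order = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return same_items and in_order


def first_verified(
    original: List[int],
    propose: Callable[[List[int], int], List[int]],
    attempts: int = 5,
) -> Optional[List[int]]:
    """Return the first proposal that passes the check, or None if none does."""
    for attempt in range(attempts):
        candidate = propose(original, attempt)
        if is_valid_sort(original, candidate):
            return candidate
    return None


def flaky_proposer(items: List[int], attempt: int) -> List[int]:
    """Hypothetical stand-in for a model: wrong on the first two attempts."""
    if attempt < 2:
        return list(reversed(items))
    return sorted(items)


result = first_verified([3, 1, 2], flaky_proposer)
print(result)  # [1, 2, 3], accepted on the third attempt
```

With a checker like this, unverified output never reaches the caller, and a human is only needed when `first_verified` returns `None`.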

[-] MangoCats@feddit.it 13 points 3 weeks ago

being able to do 30% of tasks successfully is already useful.

If you have a good testing program, it can be.

If you use AI to write the test cases...? I wouldn't fly on that airplane.

[-] TimewornTraveler@lemmy.dbzer0.com 32 points 3 weeks ago* (last edited 3 weeks ago)

Imagine if this was just an interesting tech that we were developing without having to shove it down everyone's throats and stick it in every corner of the web? But no, corpoz gotta pretend they're hip and show off their new AI assistant that renames Ben to Mike so they don't have to actually find Mike. Capitalism ruins everything.

[-] MangoCats@feddit.it 8 points 3 weeks ago

There's a certain amount of: "if this isn't going to take over the world, I'm going to just take my money and put it in something that will" mentality out there. It's not 100% of all investors, but it's pervasive enough that the "potential world beaters" are seriously over-funded as compared to their more modest reliable inflation+10% YoY return alternatives.

[-] NarrativeBear@lemmy.world 23 points 4 weeks ago

The ones being implemented into emergency call centers are better though? Right?

[-] TeddE@lemmy.world 24 points 4 weeks ago

Yes! We've gotten them up to 94% wrong at the behest of insurance agencies.

[-] Ulrich@feddit.org 13 points 4 weeks ago

I called my local HVAC company recently. They switched to an AI operator. All I wanted was to schedule someone to come out and look at my system. It could not schedule an appointment. Like if you can't perform the simplest of tasks, what are you even doing? Other than acting obnoxiously excited to receive a phone call?

[-] brsrklf@jlai.lu 22 points 4 weeks ago

In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

Ah ah, what the fuck.

This is so stupid it's funny, but now imagine what kind of other "creative solutions" they might find.

[-] floofloof@lemmy.ca 18 points 4 weeks ago* (last edited 4 weeks ago)

"Gartner estimates only about 130 of the thousands of agentic AI vendors are real."

This whole industry is so full of hype and scams, the bubble surely has to burst at some point soon.

[-] lepinkainen@lemmy.world 17 points 4 weeks ago

Wrong 70% doing what?

I’ve used LLMs as a Stack Overflow / MSDN replacement for over a year and if they fucked up 7/10 questions I’d stop.

Same with code: any free model can easily generate simple scripts and utilities with maybe a 10% error rate, definitely not 70%.

[-] ApeNo1@lemmy.world 16 points 3 weeks ago

They've done studies, you know. 30% of the time, it works every time.

[-] MangoCats@feddit.it 9 points 3 weeks ago

I ask AI to write simple little programs. One time in three they actually compile without errors. To the credit of the AI, I can feed it the error and about half the time it will fix it. Then, when it compiles and runs without crashing, about one time in three it will actually do what I wanted. To the credit of AI, I can give it revised instructions and about half the time it can fix the program to work as intended.

So, yeah, a lot like interns.
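
A rough worked version of the arithmetic in the comment above. It is a sketch under two assumptions the commenter did not state: the quoted rates are treated as independent probabilities, and only one correction is allowed per stage.

```python
from fractions import Fraction


def success_with_one_retry(first_try: Fraction, fix_rate: Fraction) -> Fraction:
    """Chance a stage succeeds either outright or after one correction."""
    return first_try + (1 - first_try) * fix_rate


# Rates quoted in the comment: 1 in 3 on the first try, fixes work half the time.
compiles = success_with_one_retry(Fraction(1, 3), Fraction(1, 2))
behaves = success_with_one_retry(Fraction(1, 3), Fraction(1, 2))

overall = compiles * behaves                  # compiles AND does what was asked
first_shot = Fraction(1, 3) * Fraction(1, 3)  # no corrections at all

print(compiles, overall, first_shot)  # 2/3 4/9 1/9
```

Under those assumptions, about 1 program in 9 is right on the first shot and about 4 in 9 are right after one correction at each stage.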

[-] fossilesque@mander.xyz 12 points 3 weeks ago

Agents work better when you include that the accuracy of the work is life or death for some reason. I've made a little script that gives me bibtex for a folder of pdfs and this is how I got it to be usable.

[-] kameecoding@lemmy.world 10 points 3 weeks ago

For me as a software developer the accuracy is more in the 95%+ range.

On one hand the built-in copilot chat widget in IntelliJ basically replaces a lot of my Google queries.

On the other hand it is rather fucking good at executing some rewrites that are a fucking chore to do manually, but can easily be done by copilot.

Imagine you have a script that initializes your DB with some test data. You have an INSERT INTO statement with lots of columns and rows, so:

INSERT INTO <table> (column1, ..., column_n) VALUES (row_1), (row_2), ..., (row_n)

Adding a new column with test data for each row is a PITA, but copilot handles it without issue.

Similarly, when writing unit tests you do a lot of edge case testing, which is a bunch of almost identical tests with maybe one variable changing. At most you write one of those tests, then copilot will auto-generate the rest after you name the next unit test. It's pretty good at guessing what you want to do in that test, at least with my naming scheme.

So yeah, it's way overrated for many-many things, but for programming it's a pretty awesome productivity tool.
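
For readers unfamiliar with the "almost identical tests with one variable changing" pattern mentioned above, here is a minimal sketch of what such tests can look like when written as a table. Note that this is a different technique from the Copilot autocompletion the commenter describes, and `clamp` is a hypothetical function invented purely for illustration.

```python
import unittest


def clamp(value: int, low: int, high: int) -> int:
    """Hypothetical function under test: limit value to the range [low, high]."""
    return max(low, min(value, high))


# One row per edge case: (label, input value, expected result).
EDGE_CASES = [
    ("below range", -5, 0),
    ("at lower bound", 0, 0),
    ("inside range", 7, 7),
    ("at upper bound", 10, 10),
    ("above range", 99, 10),
]


class ClampEdgeCases(unittest.TestCase):
    def test_edge_cases(self):
        for label, value, expected in EDGE_CASES:
            # subTest reports each failing row separately instead of stopping at the first.
            with self.subTest(label):
                self.assertEqual(clamp(value, 0, 10), expected)
```

Saved as a module, this can be run with `python -m unittest`. Adding an edge case means adding one row, whether a person or an assistant writes it.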

[-] FenderStratocaster@lemmy.world 10 points 4 weeks ago

I tried to order food at the Taco Bell drive-through the other day and they had an AI thing taking your order. I was so frustrated that I couldn't order something that was on the menu that I just drove to the window instead. The guy that worked there was more interested in lecturing me on how I need to order. I just said forget it and drove off.

If you want to use AI, I'm not going to use your services or products unless I'm forced to. Looking at you Xfinity.

[-] szczuroarturo@programming.dev 10 points 3 weeks ago

I actually have a fairly positive experience with AI (Copilot using Claude specifically). Is it wrong a lot if you give it a huge task? Yes, so I don't do that, and I use it as a very targeted solution if I am feeling very lazy that day. Is it fast? Also no. I could actually be faster than the AI in some cases. But is it good if you have been working for 6h and you just don't have enough mental capacity for the rest of the day? Yes. You can just prompt it specifically enough to get the desired result and just accept correct responses. Is it always good? Not really, but good enough. Do I also suck after 3pm? Yes.
My main issue is actually the fact that it saves first and then asks you to pick if you want to use it. Not a problem usually, but if it crashes the generated code stays, so that part sucks.

[-] Affidavit@lemmy.world 10 points 3 weeks ago

"...for multi-step tasks"

[-] davidagain@lemmy.world 10 points 3 weeks ago* (last edited 3 weeks ago)

Wow. 30% accuracy was the high score!
From the article:

Testing agents at the office

For a reality check, CMU researchers have developed a benchmark to evaluate how AI agents perform when given common knowledge work tasks like browsing the web, writing code, running applications, and communicating with coworkers.

They call it TheAgentCompany. It's a simulation environment designed to mimic a small software firm and its business operations. They did so to help clarify the debate between AI believers who argue that the majority of human labor can be automated and AI skeptics who see such claims as part of a gigantic AI grift.

The CMU boffins put the following models through their paces and evaluated them based on the task success rates. The results were underwhelming.

⚫ Gemini-2.5-Pro (30.3 percent)
⚫ Claude-3.7-Sonnet (26.3 percent)
⚫ Claude-3.5-Sonnet (24 percent)
⚫ Gemini-2.0-Flash (11.4 percent)
⚫ GPT-4o (8.6 percent)
⚫ o3-mini (4.0 percent)
⚫ Gemini-1.5-Pro (3.4 percent)
⚫ Amazon-Nova-Pro-v1 (1.7 percent)
⚫ Llama-3.1-405b (7.4 percent)
⚫ Llama-3.3-70b (6.9 percent)
⚫ Qwen-2.5-72b (5.7 percent)
⚫ Llama-3.1-70b (1.7 percent)
⚫ Qwen-2-72b (1.1 percent)

"We find in experiments that the best-performing model, Gemini 2.5 Pro, was able to autonomously perform 30.3 percent of the provided tests to completion, and achieve a score of 39.3 percent on our metric that provides extra credit for partially completed tasks," the authors state in their paper.

[-] mogoh@lemmy.ml 8 points 4 weeks ago

The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn't find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided "to create a shortcut solution by renaming another user to the name of the intended user."

OK, but I wonder who really tries to use AI for that?

AI is not ready to replace a human completely, but some specific tasks AI does remarkably well.

this post was submitted on 07 Jul 2025

954 points (100.0% liked)
