[-] Endmaker@ani.social 57 points 3 weeks ago* (last edited 3 weeks ago)

In the ‘Medium’ difficulty category, OpenAI’s o4-mini-high model scored the highest at 53.5%.

This fits my observation of such models. o4-mini-high is able to help me with 80-90% of the problems at work. For the remaining problems, it would come up with a nonsensical solution and, no matter how much I prompted it, it would tunnel-vision on that specific approach. It could never second-guess itself, realise that its initial solution was completely off the mark, and try an entirely different approach. That's where I usually step in and do the work myself.

It still saves me time with the trivial stuff though.

I can't say the same for the rest of the LLMs. They are simply no good at coding and just waste my time.

[-] yogsototh@programming.dev 15 points 3 weeks ago

I didn’t see Claude 4 Sonnet in the tests, and that's the one I use. From my experience it seems to be in about the same category as o4-mini.

It is a nice tool to have in my belt, but these LLM-based agents are still very far from being able to do advanced, hard tasks. To me it is probably more important to communicate and learn about the limitations of these tools, so as not to lose time instead of gaining it.

In fact, I am not even sure they are good enough to be used to really generate production-ready code. But they are nice for pre-reviewing, building simple scripts that don’t need to be highly reliable, analysing a project, asking specific questions, etc. The game changer for me was Clojure-MCP. Having a REPL at its disposal really enhances the quality of most answers.

[-] Ugurcan@lemmy.world 2 points 3 weeks ago

For me, Claude Code is where everything finally clicked. For advanced stuff, sure, they’re shit when left alone. But as long as I treat it like a junior developer (breaking tasks down into easy bites, having a clear plan at all times, steering it away from pitfalls), I find myself enjoying other stuff while it does the monkey work. Just be sure to provide it with tools, MCP, RAG and some patience.

[-] technocrit@lemmy.dbzer0.com 5 points 3 weeks ago

Search engines are able to help me with 100% of my work.

[-] rikudou@lemmings.world 9 points 3 weeks ago

I remember those times, too (well, more like 99.9%; there are still a few issues I never found a solution to).

But these times are long past, search engines suck nowadays.

[-] nieceandtows@programming.dev 3 points 3 weeks ago

Not anymore. They've all made deals with each other, and search engines SUCK these days

[-] cupcakezealot@piefed.blahaj.zone 49 points 3 weeks ago

ai is basically just the worst answer on stackexchange

[-] gens@programming.dev 23 points 3 weeks ago

It's a rubber ducky that talks back. If you don't take it seriously, it can reach the level of usefulness just above a wheezing piece of yellow rubber.

[-] Saledovil@sh.itjust.works 3 points 3 weeks ago

They aren't as cute as actual rubber ducks, though.

[-] nieceandtows@programming.dev 2 points 3 weeks ago

Actual rubber ducks don't randomly spew bullshit either

[-] daniskarma@lemmy.dbzer0.com 3 points 3 weeks ago

The bullshit is good: it triggers Cunningham's Law in my brain.

Sometimes it's easier to come up with a solution correcting something blatantly wrong than doing it from scratch.

[-] merc@sh.itjust.works 4 points 3 weeks ago

It's literally the most common answer on stackexchange.

[-] tunetardis@piefed.ca 46 points 3 weeks ago

For instance, if an AI model could complete a one-hour task with 50% success, it only had a 25% chance of successfully completing a two-hour task. This indicates that for 99% reliability, task duration must be reduced by a factor of 70.

This is interesting. I have noticed this myself. Generally, when an LLM boosts productivity, it shoots back a solution very quickly, and after a quick sanity check, I can accept it and move on. When it has trouble, that's something of a red flag. You might get there eventually by probing it more and more, but there is good reason for pessimism if it's taking too long.
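The quoted decay is just what you get if you assume a constant, independent per-hour failure rate; a few lines of Python (with the 50%-per-hour figure taken straight from the quote) reproduce the numbers:

```python
import math

# Per-hour success probability, taken from the quoted 1-hour figure
p_hour = 0.5

def success_prob(hours, p=p_hour):
    """If failure is independent hour over hour, success decays exponentially."""
    return p ** hours

print(success_prob(1))  # 0.5  -> the quoted 50% on a one-hour task
print(success_prob(2))  # 0.25 -> the quoted 25% on a two-hour task

# Task length (in hours) at which reliability is still 99%
t_99 = math.log(0.99) / math.log(p_hour)
print(1 / t_99)  # reduction factor: roughly 69x, which the quote rounds to 70
```

Under that (admittedly simplistic) independence assumption, the "factor of 70" falls straight out of the math.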

In the worst case scenario where you ask it a coding problem for which there is no solution—it's just not possible to do what you're asking—it may nevertheless engage you indefinitely until you eventually realize it's running you around in circles. I've wasted a whole afternoon with that nonsense.

Anyway, I worry that companies are no longer hiring junior devs. Today's juniors are tomorrow's elites and there is going to be a talent gap in a decade that LLMs—in their current state at least—seem unlikely to fill.

[-] Modern_medicine_isnt@lemmy.world 11 points 3 weeks ago

Sadly, the lack of junior devs means my job is probably safe until I am ready to retire. I have mixed feelings about that. On the one hand, yay for me. On the other, sad for the new grads. And sad for software as a whole. But software truly sucks, and has only been enshittifying worse and worse. Could a shake-up like this somehow help? I don't see how, but who knows.

[-] beejjorgensen@lemmy.sdf.org 6 points 3 weeks ago

Sucks for today's juniors, but that gap will bring them back into the fold with higher salaries eventually.

[-] Schal330@lemmy.world 5 points 3 weeks ago

In the worst case scenario where you ask it a coding problem for which there is no solution—it's just not possible to do what you're asking—it may nevertheless engage you indefinitely until you eventually realize it's running you around in circles.

Exactly this, and it's frustrating as a Jr dev to be fed this bs when you're learning. I've had multiple scenarios where it blatantly told me wrong things. Like using string interpolation in a terraform file to try and set a dynamic source - what it was giving me looked totally viable. It wasn't until I dug around some more that I found out that terraform init can't use variables in the source field.

On the positive side, it helps give me some direction when I don't know where to start. I use it with a highly pessimistic and cautious approach. I understand that today is the worst it's going to be, and that I will be required to use it as a tool in my job going forward, so I'm making an effort to get to grips with it.

[-] FizzyOrange@programming.dev 42 points 3 weeks ago

I don't think that's a surprise to anyone that has actually used them for more than a few seconds.

[-] atzanteol@sh.itjust.works 31 points 3 weeks ago

The claims that AI will be surpassing humans in programming are pretty ridiculous. But let's be honest - most programming is rather mundane.

[-] wetbeardhairs@lemmy.dbzer0.com 6 points 3 weeks ago

Well, this kind of AI won't ever be useful as a programmer. It doesn't think. It doesn't reason. It cannot make decisions besides using a ton of computational power and enormous deep neural networks to shit out a series of words that seem like they should follow your prompt. An LLM is just a really, really good next-word guesser.

So when you ask it to solve the Tower of Hanoi problem, great, it can do that. Because it saw someone else's answer. But if you ask it to solve it for a tower that is 20 disks high, it will fail, because no one ever talks about going that far, and it flounders. It's not actually reasoning to solve the problem - it's regurgitating answers it has ingested from stolen internet conversations. It's not even attempting to solve the general case, because it's not trying to solve the problem, it's responding to your prompt.
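For contrast, the general solution the commenter is talking about is only a few lines of recursion in any language; a sketch in Python:

```python
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Solve Tower of Hanoi for n disks, returning the list of (from, to) moves."""
    if moves is None:
        moves = []
    if n > 0:
        hanoi(n - 1, src, aux, dst, moves)  # park the top n-1 disks on the spare peg
        moves.append((src, dst))            # move the largest disk to the target
        hanoi(n - 1, aux, dst, src, moves)  # stack the n-1 disks back on top of it
    return moves

print(len(hanoi(20)))  # 2**20 - 1 = 1048575 moves
```

A 20-disk tower is no harder for this code than a 3-disk one, which is exactly the kind of generalization the comment argues an LLM isn't doing.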

That said - an LLM is also great as an interface to allow natural language and code as prompts for other tools. This is where the actually productive advancements will be made. Those tools are garbage today but they'll certainly improve.

[-] atzanteol@sh.itjust.works 3 points 3 weeks ago

Well, this kind of AI won’t ever be useful as a programmer

It already is.

[-] childOfMagenta@jlai.lu 7 points 3 weeks ago

You mean useful to a programmer, or as useful as a programmer?

[-] atzanteol@sh.itjust.works 4 points 3 weeks ago

Ah - yeah I read that wrong. It's useful to a programmer.

[-] wetbeardhairs@lemmy.dbzer0.com 2 points 3 weeks ago

I explicitly meant "as". It's great as autocomplete. Not as an agent to complete programming tasks.

[-] atzanteol@sh.itjust.works 2 points 3 weeks ago

It’s great as autocomplete.

I love the weird need to downplay just how good AIs are by calling them "autocomplete".

[-] wetbeardhairs@lemmy.dbzer0.com 1 points 3 weeks ago

Did you even read my earlier comment?

[-] childOfMagenta@jlai.lu 1 points 3 weeks ago

Thanks for clarifying.

[-] Ledivin@lemmy.world 5 points 3 weeks ago

My productivity has at least tripled since I started using Cursor. People are actually underestimating the effects that AI will have in the industry

[-] PushButton@lemmy.world 16 points 3 weeks ago

It means the AI is very helpful to you. This also means you are as good as 1/3 of an AI in coding skills...

Which is not great news for you, mate.

[-] atzanteol@sh.itjust.works 10 points 3 weeks ago

Ah knock it off. Jesus you sound like people in the '90s mocking "intellisense" in the IDE as somehow making programmers "less real programmers".

It's all needless gatekeeping and purity test BS. Use tools that are useful. Don't worry if it makes you less of a man.

[-] Feyd@programming.dev 3 points 3 weeks ago

It's not gatekeeping; it's true. I know devs who say AI tools are useful, but all the ones who say it makes them multiple times more productive are actually doing negative work, because I have to deal with their terrible code that they don't even understand.

[-] atzanteol@sh.itjust.works 3 points 3 weeks ago

The devs I know use it as a tool and check their work and fully understand the code they've produced.

So your experience vs. mine. I suspect you just work with shitty developers who would be producing shitty work whether they were using AI or not.

[-] Ledivin@lemmy.world 1 points 3 weeks ago* (last edited 3 weeks ago)

I literally don't write code anymore, I write detailed specs, invest a lot of time into my guardrails and integrations, and review changes from my agents. My code quality has not fallen, in fact we've been able to be much more strict about our style guidelines.

My job has changed completely, but the results are the same - simply much, much faster. And to be clear, this is in code bases that are hundreds of thousands of lines deep, across multiple massive monorepos, and using context from several different documentation sites - both internal and external.

If anything, people are understating the effects this will have over the next year, let alone further. The entry-level IC dev is dead. If you aren't producing at least twice as fast as you used to, you're going to be left behind. I cannot possibly suggest strongly enough that you start learning how to use it.

[-] Feyd@programming.dev 1 points 3 weeks ago
[-] technocrit@lemmy.dbzer0.com 8 points 3 weeks ago

People are actually underestimating the effects that ~~AI~~ autocomplete will have in the industry

[-] rikudou@lemmings.world 2 points 3 weeks ago

True, I use some local model by Jetbrains that only completes a single line and that's my sweet spot, it usually guesses the line well and saves me some time without forcing me to read multiple lines of code I didn't write.

[-] AlecSadler 3 points 3 weeks ago

Tripled is an understatement for me. Cursor and Claude Code are a godsend for OE for me.

[-] daniskarma@lemmy.dbzer0.com 23 points 3 weeks ago

They have their uses. For instance, the other day I needed to read some assembly and decompiled C (you know how fun that can be). The LLM proved quite good at translating it to English, and it really sped up the process.

Writing it back wasn't as good, though; just good enough to point in a direction, but I still ended up writing the patcher mostly myself.

[-] technocrit@lemmy.dbzer0.com 4 points 3 weeks ago

Ok, but there's no "AI" involved in this process.

[-] Modern_medicine_isnt@lemmy.world 11 points 3 weeks ago

Fortunately, 90% of coding is not hard problems. We write the same crap over and over. How many different create-an-account and sign-in flows do we really need? Yet there seems to be an infinite number, each with its own bugs.

[-] xthexder@l.sw0.com 15 points 3 weeks ago* (last edited 3 weeks ago)

The hard problems are the only reason I like programming. If 90% of my job was repetitive boilerplate, I'd probably be looking elsewhere.

I really dislike how LLMs are flooding the internet with a seemingly infinite amount of half-broken TODO-app style programs with no care at all for improving things or doing something actually unique.

[-] Modern_medicine_isnt@lemmy.world 3 points 3 weeks ago

A lot of people don't realize how many times the problem they are solving has already been solved. After three decades in the industry, I can say that very few things people are working on haven't been done before. They just get put together in different combinations.

As for AI, I have found it decent at writing one-off scripts to gather information I need to make design decisions. And it's a little quicker when I need to look up syntax for a language, or something like a resource name for Terraform. But even for one-off scripts I sometimes have to ask it whether a while loop wouldn't be better, and such.

[-] danzabia@infosec.pub 9 points 3 weeks ago

Funny how I never see articles on Lemmy about improvements in LLM capabilities.

[-] nullagon@ani.social 15 points 3 weeks ago

i would guess a lot of the pro-ai stuff is from corpos, given that good press is money to them.

[-] rayquetzalcoatl@lemmy.world 8 points 3 weeks ago

Probably because nobody really wants to read absolute nonsense.

[-] funkless_eck@sh.itjust.works 8 points 3 weeks ago

there aren't that many, if you're talking specifically about LLMs, but ML+AI is more than LLMs.

Not a defence or indictment of either side, just people tend to confuse the terms "LLM" and "AI"

I think there could be worth in AI for identification (what insect is this, find the photo I took of the receipt for my train ticket last month, order these chemicals from lowest to highest pH...) - but LLMs are only part of that stack - the input and output - which isn't going to make many massive breakthroughs week to week.

[-] Glitchvid@lemmy.world 2 points 3 weeks ago

The recent boom in neural net research will have real applicable results that are genuine progress: signal processing (e.g. noise removal), optical character recognition, transcription, and more.

However, the biggest hype area, with what I see as the smallest real return, is the huge-model LLM space, which basically tries to portray AGI as just around the corner. LLMs will have real applications in summarization, but largely otherwise they just generate asymptotically plausible babble, very good for filling the Internet with slop, not actually useful for replacing all the positions OAI et al. need it to (for their funding to be justified).

[-] Stubb@lemmy.sdf.org 9 points 3 weeks ago

I've found that AI is only good at solving programming problems that are relatively "small picture", or that have to do with the basics of a language; anything else it provides a solution for, you will have to rewrite completely once you consult the language's standards and best practices.

[-] rikudou@lemmings.world 4 points 3 weeks ago

Well, I recently did a kind of experiment: writing a kids' game in Kotlin without ever having used it. And it was surprisingly easy to do. I guess it helps that I'm fluent in ~5 other programming languages, because I could tell what looked obviously wrong.

My conclusion kinda is that it's a really great help if you know programming in general.

[-] Shanmugha@lemmy.world 8 points 3 weeks ago

Come on, guys, any second now. Aany second...

this post was submitted on 22 Jun 2025
477 points (100.0% liked)
