ChatGPT's o3 Model Found Remote Zeroday in Linux Kernel Code (linuxiac.com)

submitted 2 weeks ago by KarnaSubarna@lemmy.ml to c/linux@lemmy.ml

[-] WalnutLum@lemmy.ml 57 points 2 weeks ago

The Blog Post from the researcher is a more interesting read.

Important points here about benchmarking:

o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs.

o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about. This vulnerability is also due to a free of sess->user, but this time in the session logoff handler.

I'm not sure if a signal to noise ratio of 1:100 is uh... Great...
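
To put the quoted figures in perspective, here is a minimal sketch of the arithmetic, assuming each run is an independent trial with the same chance of success (an assumption the blog post does not state). The function names are made up for illustration. Note that the three counts quoted for the first benchmark (8, 66 and 28) add up to 102 rather than 100, so the sketch uses only the hit counts and the stated 100 runs.

```python
def expected_runs_until_hit(hits: int, runs: int) -> float:
    """Average number of runs needed per correct report, given an observed hit count."""
    if hits == 0:
        return float("inf")  # never observed a hit, so no finite estimate
    return runs / hits


def prob_at_least_one_hit(hits: int, runs: int, attempts: int) -> float:
    """Chance of seeing at least one correct report in `attempts` independent runs."""
    p = hits / runs  # observed per-run success rate
    return 1.0 - (1.0 - p) ** attempts


# Hit counts as quoted above (out of 100 runs each).
quoted_hits = {
    "o3, benchmark": 8,
    "o3, larger input": 1,
    "Claude Sonnet 3.7, benchmark": 3,
    "Claude Sonnet 3.5, benchmark": 0,
}

for label, hits in quoted_hits.items():
    print(label, expected_runs_until_hit(hits, 100), prob_at_least_one_hit(hits, 100, 100))
```

Under that independence assumption, 8 hits in 100 runs works out to one correct report per 12.5 runs on average, and 1 hit in 100 runs to one per 100 runs. Every other report still has to be read and rejected by a person.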

[-] drspod@lemmy.ml 24 points 2 weeks ago

If the researcher had spent as much time auditing the code as he did having to evaluate the merit of 100s of incorrect LLM reports then he would have found the second vulnerability himself, no doubt.

[-] bunitor@lemmy.eco.br 8 points 1 week ago

this confirms what i just said in reply to a different comment: most cases of ai "success" are actually curated by real people from a sea of bullshit

[-] DarkDarkHouse@lemmy.sdf.org 2 points 2 weeks ago

And if Gutenberg had just written faster, he would've produced more books in the first week?

[-] WalnutLum@lemmy.ml 5 points 1 week ago

I'm not sure if the Gutenberg Press had only produced one readable copy for every 100 printed it would have been the literary revolution that it was.

[-] DarkDarkHouse@lemmy.sdf.org 1 points 1 week ago

I agree it's not brilliant, but it's early days. If one is looking to mechanise a process like finding bugs, you have to start somewhere: determine how to measure success, set performance baselines and all that.

[-] irotsoma 1 points 1 week ago

Problem is motivation. As someone with ADHD I definitely understand that having an interesting project makes tedious stuff much more likely to get done. LOL

[-] sem 7 points 2 weeks ago

The models seem to be getting worse at this one task?

[-] PushButton@lemmy.world 3 points 2 weeks ago

It's only good for clickbait titles.

It brings clicks and it's spreading the falsehood that "AI" is good at something/getting better for the majority of people who stop at the title.

[-] some_guy@lemmy.sdf.org 41 points 2 weeks ago

I'm skeptical of this. The primary maintainer of curl said that all of their AI bug submissions have been bunk and wasted their time. This seems like a lucky one-off rather than anything substantial.

[-] Evotech@lemmy.world 13 points 1 week ago

Of course, if you read the article you'll see that the model found the bug in 8 out of 100 attempts.

It was told in the prompt what type of issue to look for.

[-] some_guy@lemmy.sdf.org 2 points 1 week ago

I meant one-off that it worked on this code base rather than how many times it found the issue. I don't expect it to work eight out of a hundred times on any and all projects.

[-] bunitor@lemmy.eco.br 13 points 1 week ago

this summarizes most cases of ai "success". people see generative ai generating good results once and then extrapolate that they're able to consistently generate good results, but the reality is that most of what it generates is bullshit and the cases of success are a minority of the "content" ai is generating, curated by actual people

[-] GnuLinuxDude@lemmy.ml 3 points 1 week ago

Curated by experts, specifically. Seeing a lot of people use this stuff and flop, even if they're not doing it with any intention to spam.

I think the curl project gets a lot of spam because 1) it has a bug bounty with a payout and 2) it kinda fits with the CVE bloat phenomenon, where people want the prestige of "discovering" bugs so that they can put it on their resumes to get jobs, or whatever. As usual, the monetary incentive is the root of the evil.

[-] ctrl_alt_esc@lemmy.ml 29 points 2 weeks ago

This means absolutely nothing. It scanned a large amount of text and found something. Great, that's exactly what it's supposed to do. Doesn't mean it's smart or getting smarter.

[-] CsXGF8uzUAOh6fqV@lemmy.world 10 points 2 weeks ago

People often dismiss AI capabilities because "it's not really smart". Does that really matter? If it automates everything in the future and most people lose their jobs (just an example), who cares if it is "smart" or not? If it steals art and GPL code and turns a profit on it, who cares if it is not actually intelligent? It's about the impact AI has on the world, not semantics on what can be considered intelligence.

[-] nyan@sh.itjust.works 5 points 2 weeks ago

It matters, because it's a tool. That means it can be used correctly or incorrectly . . . and most people who don't understand a given tool end up using it incorrectly, and in doing so, damage themselves, the tool, and/or innocent bystanders.

True AI ("general artificial intelligence", if you prefer) would qualify as a person in its own right, rather than a tool, and therefore be able to take responsibility for its own actions. LLMs can't do that, so the responsibility for anything done by these types of model lies with either the person using it (or requiring its use) or whoever advertised the LLM as fit for some purpose. And that's VERY important, from a legal, cultural, and societal point of view.

[-] CsXGF8uzUAOh6fqV@lemmy.world 2 points 2 weeks ago

Ok, good point. It also matters if AI is true intelligence or not. What I meant was the comment I replied to said

"This means absolutely nothing."

Like if it is not true AI nothing it does matters? The effects of the tool, even if not true AI, matters a lot.

[-] bunitor@lemmy.eco.br 3 points 1 week ago

i feel like people are misunderstanding your point. yes, generative ai is bullshit, but it doesn't need to be good in order to replace workers

[-] ctrl_alt_esc@lemmy.ml 2 points 1 week ago

I don't know if you read the article, but in there it says AI is becoming smarter. My comment was a response to that.

Irrespective of that, you raise an interesting point: "it's about the impact AI has on the world". I'd argue its real impact is quite limited (mind you, I'm referring to generative AI and specifically LLMs rather than AI generally). It has a few useful applications, but the emphasis here is on few. However, it's being pushed by all the big tech companies and those lobbying for them as the next big thing. That's what's really leading to the "impact" you're perceiving.

[-] atzanteol@sh.itjust.works 6 points 2 weeks ago

"It scanned a large amount of text and found something."

How hilariously reductionist.

AI did what it's supposed to do. And it found a difficult to spot security bug.

"No big deal" though.

[-] Luffy879@lemmy.ml 24 points 1 week ago

TL;DR: The pentester had already found it himself, and wanted to test how often GPT finds it if he pastes that code into it.

[-] 8uurg@lemmy.world 7 points 1 week ago

Not quite, though. In the blog post the pentester notes that it found a similar issue (one he had overlooked) elsewhere, in the logoff handler, which he noticed and verified while sifting through a number of the reports it generated. Additionally, the pentester noted that the fix it supplied accounted for (and documented) an issue that his own suggested fix was (still) susceptible to. This shows that it could be(come) a new tool that allows us to identify issues that are not found with techniques like fuzzing and can even be overlooked by a pentester actively searching for them, never mind a kernel programmer.

Now, these models generate a ton of false positives, which makes the signal-to-noise ratio still much lower than what would be preferred. But the fact that a language model can locate and identify these issues at all, even if sporadically, is already orders of magnitude more than what I would have expected initially. I would have expected it to only hallucinate issues, not find anything that is remotely like an actual security issue. Much like the spam the curl project is experiencing.

[-] Luffy879@lemmy.ml 8 points 1 week ago

Yes, but:

To get to this point, OpenAI had to suck up almost all data ever generated in the world. So in order for it to become better, let's say it has to have 3 times as much data. That would take more than 3 lifetimes just to get the data, IF we don't consider the AI slop and assume that all data is still human-made, which is just not true.

In other words: What you describe will just about never happen anymore, at least as long as 2025 will still be remembered

[-] 8uurg@lemmy.world 5 points 1 week ago

Yes, true, but that is assuming:

Any potential future improvement solely comes from ingesting more useful data.
That the amount of data produced is not ever increasing (even excluding AI slop).
No (new) techniques that make it more efficient in terms of the data required for training are published or engineered.
No (new) techniques that improve reliability are used, e.g. by specializing it for code auditing specifically.

What the author of the blogpost has shown is that it can find useful issues even now. If you apply this to a codebase, have a human categorize issues by real / fake, and train the thing to make it more likely to generate real issues and less likely to generate false positives, it could still be improved specifically for this application. That does not require nearly as much data as general improvements.

While I agree that improvements are not a given, I wouldn't assume that it could never happen anymore. Despite these companies effectively exhausting all of the text on the internet, currently improvements are still being made left-right-and-center. If the many billions they are spending improve these models such that we have a fancy new tool for ensuring our software is more safe and secure: great! If it ends up being an endless money pit, and nothing ever comes from it, oh well. I'll just wait-and-see which of the two will be the case.
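
The triage loop described above (a human marks each generated report as real or fake, and those labels are reused later) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `TriagedReport` type and both helper functions are hypothetical, the blog post does not describe any such pipeline, and the sketch says nothing about how a model would actually be trained on the labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriagedReport:
    """One model-generated report plus the label a human reviewer gave it (hypothetical type)."""
    report_text: str
    is_real: bool  # True if the reviewer confirmed an actual vulnerability


def precision(reports: list[TriagedReport]) -> float:
    """Fraction of reviewed reports that turned out to be real issues."""
    if not reports:
        return 0.0
    return sum(r.is_real for r in reports) / len(reports)


def to_training_pairs(reports: list[TriagedReport]) -> list[tuple[str, int]]:
    """Turn reviewed reports into (text, label) pairs, with 1 = real and 0 = false positive."""
    return [(r.report_text, 1 if r.is_real else 0) for r in reports]


# Made-up example data, purely for illustration.
reviewed = [
    TriagedReport("free of a shared pointer in a logoff path", True),
    TriagedReport("claimed overflow that is bounds-checked earlier", False),
    TriagedReport("claimed race on a lock-protected field", False),
    TriagedReport("claimed NULL dereference on a checked pointer", False),
]

print(precision(reviewed))
print(to_training_pairs(reviewed)[0])
```

Tracking precision per codebase over time would also give a baseline to judge whether any specialization is actually helping.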

[-] balsoft@lemmy.ml 18 points 1 week ago

I'm surprised it took this long. The world is crazy over AI, meaning everyone and their grandma is likely trying to do something like this right now. The fact it took like 3 years for an actual vulnerability "discovered by AI" (actually it seems it was discovered by the researcher filtering out hundreds of false positives?) tells me it sucks ass at this particular task (it also seems to be getting worse, judging by the benchmarks?)

[-] DonutsRMeh@lemmy.world 1 points 1 week ago

All ai is is a super fast web search with algorithms for some reasoning. It's not black magic.

[-] balsoft@lemmy.ml 10 points 1 week ago* (last edited 1 week ago)

No, it's not. It's a word predictor trained on most of the web. On its own it's a pretty bad search engine because it can't reliably produce the training data (that would be overfitting). What it's kind of good at is predicting what the result would look like if someone asked a somewhat novel question. But then it's not that good at producing the actual answer to that question, only imitating what the answer would look like.

[-] HowdWeGetHereAnyways@lemmy.world 6 points 1 week ago

That's why we really shouldn't call them "AI" imo

[-] DonutsRMeh@lemmy.world 1 points 1 week ago

100%. It's a super fast web crawler. These are buzzwords capitalists throw around to make some more money. I don't know if you've heard of the bullshit that Anthropic was throwing around about Claude threatening to "blackmail" employees if they took it offline. Lmao.

[-] melmi 7 points 1 week ago

Calling it a web crawler is just inaccurate. You can give it access to a web search engine, which is how the "AI search engines" work, but LLMs can't access the internet on their own. They're completely self-contained unless you give them tools that let them do other things.

[-] data1701d@startrek.website 1 points 1 week ago* (last edited 1 week ago)

I would agree calling it a web crawler is inaccurate, but disagree with the reasoning; I think it's more in the sense that calling an LLM a web crawler is akin to calling a search index a web crawler; in other words, an LLM could be considered a weird version of a search index.

[-] melmi 2 points 1 week ago

Yeah, I can see that. It's definitely more like a search index than a web crawler. It's not great at being a search index though, since it can synthesize ideas but can't reliably tell you where it got them from in the first place.

[-] data1701d@startrek.website 1 points 1 week ago

Firmly agree with you on that.

[-] PixelatedSaturn@lemmy.world 10 points 2 weeks ago

I don't get it, I use o3 a lot and I couldn't get it to even make a simple developed plan.

I haven't used it for coding, but other stuff I often get better results with o4.

I don't get what they call reasoning with it.

[-] WalnutLum@lemmy.ml 8 points 2 weeks ago

This would feel a lot less gross if this had been with an open model like deepseek-r1.

[-] wuphysics87@lemmy.ml 2 points 2 weeks ago

Why?

[-] Aradia@lemmy.ml 8 points 1 week ago

It literally says "o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it." in the original author's post...

[-] biofaust@lemmy.world 4 points 1 week ago

I have read the threads up to now and, despite being ignorant about security research, I would call myself convinced of the usefulness of such a tool in the near-future to shave off time in the tasks required for this kind of work.

My problem with this is that transformer-based LLMs still don't sound to me like the right tool for the job when it comes to such formal languages. It is surely a very expensive way to do this job.

Other architectures are getting much less attention because of investors' focus on this shiny toy. From my understanding, neurosymbolic AI would do a much better and potentially faster job at a task involving stable concepts.

[-] utopiah@lemmy.ml 4 points 1 week ago* (last edited 1 week ago)

Looks like another of those "Asked AI to find X. AI does find X as requested. Claims that the AI autonomously found X."

I mean... the program literally does what has been asked and its dataset includes examples related to the request.

Shocked Pikachu face? Really?

[-] Revan343@lemmy.ca 5 points 1 week ago* (last edited 1 week ago)

The shock is that it was successful in finding a vulnerability not already known to the researcher, at a time when LLMs aren't exactly known for reliability.

[-] utopiah@lemmy.ml 1 points 1 week ago* (last edited 1 week ago)

Maybe I misunderstood, but while the vulnerability was unknown to them, the class of vulnerability, let's say "bugs like that", is well known and published by the security community, isn't it?

My point being that if it's previously unknown and reproducible (not just "luck"), it's major; if it's well known in other projects, even though unknown to this specific user, then it's unsurprising.

Edit: I'm not a security researcher but I believe there are already a lot of tools doing static and dynamic analysis. IMHO it'd be helpful to know how those already perform versus the LLMs used here, namely across which dimensions (reliability, speed, coverage e.g. exotic programming languages, accuracy of reporting e.g. hallucinations, computational complexity and thus energy costs, openness, etc.) each solution is better or worse than the other. I'm always wary of "ex nihilo" demonstrations. Apologies if there is a benchmark against existing tools and I missed it.

this post was submitted on 31 May 2025
