submitted 2 weeks ago by KarnaSubarna@lemmy.ml to c/linux@lemmy.ml
[-] WalnutLum@lemmy.ml 57 points 1 week ago

The Blog Post from the researcher is a more interesting read.

Important points here about benchmarking:

o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs.

o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about. This vulnerability is also due to a free of sess->user, but this time in the session logoff handler.

I'm not sure if a signal to noise ratio of 1:100 is uh... Great...
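For what it's worth, here is a rough sketch of the arithmetic behind those quoted figures. The numbers come from the excerpts above; the variable names and layout are my own assumptions, not anything from the blog post. Note that the quoted benchmark counts (8, 66, 28) add up to 102 rather than 100, so treat the ratios as approximate. The excerpt gives no false-positive count for the larger-input run, so only its hit rate can be computed.

```python
# Figures quoted above from the blog post. The dictionary layout and names
# are illustrative assumptions, not anything the researcher published.
RUNS = 100
hits = {
    "o3 (benchmark)": 8,
    "Claude Sonnet 3.7": 3,
    "Claude Sonnet 3.5": 0,
    "o3 (larger input)": 1,
}


def hit_rate(found, total=RUNS):
    """Fraction of runs that reported the known vulnerability."""
    return found / total


for name, found in hits.items():
    print(f"{name}: {hit_rate(found):.0%} of runs")

# o3 on the benchmark: 8 correct reports against 28 false positives.
true_reports, false_reports = 8, 28
precision = true_reports / (true_reports + false_reports)
print(f"o3 benchmark, share of bug reports that were correct: {precision:.1%}")
```

So even in the better case, roughly one in five of the reports that claim a bug is the real one, and in the larger-input case the known bug shows up in a single run.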

[-] drspod@lemmy.ml 24 points 1 week ago

If the researcher had spent as much time auditing the code as he did having to evaluate the merit of 100s of incorrect LLM reports then he would have found the second vulnerability himself, no doubt.

[-] bunitor@lemmy.eco.br 8 points 1 week ago

this confirms what i just said in reply to a different comment: most cases of ai "success" are actually curated by real people from a sea of bullshit

[-] DarkDarkHouse@lemmy.sdf.org 2 points 1 week ago

And if Gutenberg had just written faster, he would've produced more books in the first week?

[-] WalnutLum@lemmy.ml 5 points 1 week ago

I'm not sure if the Gutenberg Press had only produced one readable copy for every 100 printed it would have been the literary revolution that it was.

[-] DarkDarkHouse@lemmy.sdf.org 1 points 1 week ago

I agree it's not brilliant, but it's early days. If one is looking to mechanise a process like finding bugs, you have to start somewhere. Determine how to measure success, set performance baselines and all that.

[-] irotsoma 1 points 1 week ago

Problem is motivation. As someone with ADHD I definitely understand that having an interesting project makes tedious stuff much more likely to get done. LOL

[-] sem 7 points 1 week ago

The models seem to be getting worse at this one task?

[-] PushButton@lemmy.world 3 points 1 week ago

It's only good for clickbait titles.

It brings clicks and it's spreading the falsehood that "AI" is good at something/getting better for the majority of people who stop at the title.

this post was submitted on 31 May 2025
112 points (100.0% liked)
