The Whisper model has always been pretty crappy at this kind of thing. I use a speech-to-text system as an assistive input method when my RSI gets bad, and it has supported Whisper (because Whisper covers more languages than the developer could train on their own infrastructure and time) since maybe 2022 or so. Every time someone tries to use it, they run into hallucinated inputs during pauses, even with very good silence detection and noise filtering.
This is just not a use case of interest to the people making Whisper, imagine that.
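For what it's worth, the usual downstream workaround is to gate Whisper's own output: each segment it returns carries a no_speech_prob and an avg_logprob, and you can drop segments the model itself half-admits are non-speech. A minimal sketch with the openai-whisper package; the file name and thresholds are placeholders, and this reduces (does not eliminate) the pause hallucinations.

```python
# Sketch: drop likely-hallucinated segments from openai-whisper output.
# Assumes the `openai-whisper` package; "dictation.wav" and the thresholds
# are illustrative placeholders, not from the original post.
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "dictation.wav",
    condition_on_previous_text=False,  # earlier text can seed repeated hallucinations
)

kept = []
for seg in result["segments"]:
    # Whisper reports, per segment, how likely it was decoding silence
    # (no_speech_prob) and the average log-probability of the tokens.
    if seg["no_speech_prob"] > 0.6 and seg["avg_logprob"] < -1.0:
        continue  # probable hallucination over a pause; discard it
    kept.append(seg["text"].strip())

print(" ".join(kept))
```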
Forget counting the Rs in "strawberry": the biggest challenge for LLMs is getting them to stop making up bullshit about recent events that aren't in their training data.