What is wrong with LLM benchmarks, and why are we still using them?

submitted 2 years ago by micheal65536@lemmy.micheal65536.duckdns.org to c/fosai@lemmy.world

[-] j4k3@lemmy.world 2 points 2 years ago* (last edited 2 years ago)

First of all, have you replicated the actual white paper tests with identical methodology?

If you use a different setup, such as an easy setup with a popular web UI, different checkpoint models, or models with a different quantization method, you are going to see different performance, likely drastically different. Most of these models are meant to run at ~~float 64~~ where even a small model like a 7B will require enterprise-class hardware to run.
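To make the quantization point concrete, here is a minimal sketch of why a quantized checkpoint can score differently from the original weights. It uses plain symmetric round-to-nearest quantization as an assumption for illustration; it is not the scheme GGML or any particular web UI uses, and the function names and sample weights are made up.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers using one shared scale (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


def max_abs_error(weights, bits):
    """Largest difference between an original weight and its round-tripped value."""
    quantized, scale = quantize_symmetric(weights, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


if __name__ == "__main__":
    sample = [0.12, -0.5, 0.33, 0.9, -0.77]   # hypothetical weights
    for bits in (8, 4):
        print(f"{bits}-bit max error: {max_abs_error(sample, bits):.4f}")
```

Fewer bits means a coarser grid and a larger round-trip error on every weight, which is one plausible reason a heavily quantized model will not reproduce the numbers reported for the full-precision one.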

The number of parameters is like a total vocabulary. Indeed, a larger model has a bigger chance of already having whatever specialization you are looking to implement. The smaller models will require fine-tuning on any niche subject of interest.

Honestly, go play with Stable Diffusion and images for a while as an exercise in how models work. Look at how textual inversion, LoRAs, LyCORIS, and prompts work in detail. Try prompting with and without specialized fine-tuning. Try some fine-tuning of your own. Go get several checkpoints with various sizes and styles. This will teach you a ton about what is possible with fine-tuning and small checkpoints. Stable Diffusion is far more accessible in this area, and the results are much clearer to see.

As far as models with technical accuracy go: out of the box, the WizardLM 30B GGML with the largest quantization size you can work with in system memory is likely your best option.

[-] Spott@lemmy.world 3 points 2 years ago

Just an FYI, LLaMA is float16 under the hood, I believe. Stable Diffusion is float32. Basically no machine learning model I’ve ever heard of is float64-based. The only people using float64 on GPUs are physicists and applied math people on DOE supercomputers.

(Weirdly enough, commodity hardware is now on the opposite end: it used to struggle with float64 when people were trying to put physics models on commodity hardware, and now it struggles with float16/float8 support when people are trying to put language models on commodity hardware.)
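For a rough sense of what the precisions mentioned in this thread mean for hardware, here is a back-of-the-envelope sketch. It counts weights only and assumes a flat number of bytes per parameter; real quantized formats carry extra per-block data, and inference also needs memory for activations and context, so treat the numbers as lower bounds. The function name and the table are illustrative, not from any library.

```python
# Bytes per parameter at each precision (weights only, no format overhead).
BYTES_PER_PARAM = {
    "float64": 8.0,
    "float32": 4.0,
    "float16": 2.0,
    "float8": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gib(n_params: float, precision: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    for name, n_params in [("7B", 7e9), ("30B", 30e9)]:
        for precision in BYTES_PER_PARAM:
            gib = weight_memory_gib(n_params, precision)
            print(f"{name} @ {precision:>7}: {gib:6.1f} GiB")
```

Under these assumptions a 7B model needs about 13 GiB of weights at float16 and four times that at float64, which is why nobody ships language models in float64 and why quantized 30B models are the practical ceiling for many desktop machines.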