
cross-posted from: https://lemmy.world/post/19242887

I can run the full 131K context with a 3.75bpw quantization, and still a very long one at 4bpw. It should just barely be fine-tunable in Unsloth as well.

It's pretty much perfect! Unlike the last iteration, they're using very aggressive GQA, which keeps the context's memory footprint small, and it feels really smart at long-context work like storytelling, RAG, and document analysis (whereas Gemma 27B and Mistral Code 22B are probably better suited to short chats/code).
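For a rough sense of why the aggressive GQA matters, here's a back-of-the-envelope KV-cache estimate. The layer/head counts below are illustrative guesses for a 14B-class model, not the actual config, so plug in the real config.json values if you want exact numbers:

```python
# Rough KV-cache size estimate: why aggressive GQA makes 131K context feasible.
# Architecture numbers are illustrative assumptions, not from the model card.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for the K and V tensors, one pair per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

ctx = 131_072          # full advertised context
fp16 = 2               # bytes per element at FP16
q4 = 0.5               # ~bytes per element with a 4-bit cache quant

# Hypothetical 14B-class config: 48 layers, 128-dim heads.
no_gqa = kv_cache_bytes(48, 40, 128, ctx, fp16)  # every query head has its own K/V
gqa    = kv_cache_bytes(48, 8, 128, ctx, fp16)   # 8 KV heads shared by 40 query heads
gqa_q4 = kv_cache_bytes(48, 8, 128, ctx, q4)     # same, with a Q4 cache

for label, b in [("no GQA, FP16", no_gqa), ("GQA, FP16", gqa), ("GQA, Q4 cache", gqa_q4)]:
    print(f"{label}: {b / 2**30:.1f} GiB")
```

With those assumed numbers the shared KV heads cut the cache by roughly 5x, and a Q4 cache cuts it by another 4x, which is what lets the full window sit next to the quantized weights.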

[-] brucethemoose@lemmy.world 2 points 1 day ago

Arcee 14B is probably the best "small" model around now. You can squeeze it onto a small GPU with the right settings.

[-] Smorty 1 points 1 day ago* (last edited 1 day ago)

could you define "right settings"?

I assume Q4 weights and maybe a Q8-quantized context cache as well. Anything else to tweak?
I just have a smol GTX 1060 with 6GB VRAM, so i probably can't fit it on mine and imma have to run part of it on the CPU. but maybe other readers here can!

(I'm just a silly ollama user, not knowing anything more complex than the tokenizer... so yea, maybe put a lil infodump in here to make us all smarter please <3 )

EDIT: brucethemoose was probably referring to this model named "Medius". there is no 14B in the name.

[-] brucethemoose@lemmy.world 2 points 1 day ago* (last edited 1 day ago)

Ah, yeah. Normally I would tell people to use an exl2 instead of a GGUF and squeeze everything in using its Q6/Q4 cache, but 6GB really is too tight for a 14B model, and I don't think exl2 works on 1000-series cards. You can use an IQ4 8B GGUF for better performance though.
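For the GGUF route, here's a minimal sketch with llama-cpp-python of what "the right settings" looks like on a small card. The filename, context size, layer split, and cache-quant knobs are placeholders/assumptions to tune, not exact settings (the cache-quant parameters need a reasonably recent build):

```python
# Minimal sketch: partially offloading a quantized GGUF to a small GPU with
# llama-cpp-python. Filename, context size, and layer split are placeholders;
# raise or lower n_gpu_layers until you stop running out of VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="model-IQ4_XS.gguf",  # hypothetical IQ4 quant file
    n_ctx=8192,                      # keep the context modest on 6GB
    n_gpu_layers=20,                 # offload what fits; the rest runs on CPU
    flash_attn=True,                 # needed before quantizing the V cache
    type_k=8,                        # 8 == GGML_TYPE_Q8_0 -> Q8 K cache
    type_v=8,                        # Q8 V cache
)

out = llm("Q: What does GQA do?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```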

That is indeed the model I was referring to, sorry for not being more specific.

I am a huge local, open model advocate… but as a former 2060 6GB user, 6GB is really the point where you should use free APIs like Gemini, Groq, Kobold Horde, Samba, Qwen/Deepseek web UIs, or whatever meets your needs.

I kinda feel weird talking about LLMs on Lemmy in general, which feels very broadly anti AI.

[-] Smorty 1 points 1 day ago

i totally agree.. with everything. 6GB really is smol and, cuz imma crazy person, i'm currently trying to optimize everything for a llama3.2 3B Q4 model so people with even less VRAM can use it. i really like the idea of people just having some smollm lying around on their pc and devs being able to use it.

i really should probably opt for APIs, you're right. the only API i ever used was Cohere, cuz yea their CR+ model is real nice. but i still wanna use smol models for a smol price, if any. imma have a look at the APIs you listed. never heard of Kobold Horde and Samba, so i'll check those out... or i'll go the lazy route and choose deepseek, cuz it's apparently unreasonably cheap for SOTA perf. so eh..

also yes! Lemmy really does seem anti AI, and i'm fine with that. i just say yeah, companies use it in obviously dum ways, but the tech itself is super interesting, which i think is a reasonable argument.

so yes, local llm go! i wanna get that new top amd gpu once it gets announced, so i'll be able to run those spicy 32B models. for now i'll just stick with 8B and 3B cuz they work quick and kinda do what i want.

[-] brucethemoose@lemmy.world 1 points 19 hours ago

Oh yeah, I was thinking of free APIs. If you are looking for paid APIs, Deepseek and Cohere are of course great. Gemini Pro is really good too, and free for 50 requests a day. Cerebras API is insanely fast, like way above anything else. Check out Openrouter too, they host tons of models.
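Most of those hosts speak the OpenAI-compatible API, so trying one is only a few lines of Python. A sketch against OpenRouter (the key and model slug are placeholders, not a recommendation; Deepseek's own endpoint works the same way with a different base_url):

```python
# Minimal sketch of calling a hosted model through an OpenAI-compatible
# endpoint, here OpenRouter. Swap base_url/api_key/model for whichever
# provider you pick.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",   # placeholder key
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # example slug; pick whatever is cheap/free
    messages=[{"role": "user", "content": "Summarize GQA in two sentences."}],
)
print(resp.choices[0].message.content)
```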

this post was submitted on 31 Aug 2024