New Mistral model is out (twitter.com)
I thought MoEs had to be loaded entirely into (V)RAM, and that the inference speedup comes from only a fraction of the experts being needed to compute each token (but since the router can pick different experts for each token, they all have to be resident; otherwise you keep shuffling weights between disk <-> RAM <-> VRAM and performance tanks).
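To make the question concrete, here's a minimal top-k routing sketch in plain PyTorch (hypothetical names, a toy layer, not Mistral's actual code): all experts sit in memory, but only top_k of them execute per token, and the router's choice can differ from token to token.

```python
# Toy MoE layer: every expert is resident in memory, but each token
# only pays the compute cost of top_k experts chosen by the router.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # All n_experts are allocated up front -- this is the memory cost.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):  # x: (n_tokens, dim)
        logits = self.router(x)                         # (n_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # per-token expert choice
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):         # different tokens can hit different experts,
            for w, e in zip(weights[t], idx[t]):  # so none of them can be paged out cheaply
                out[t] += w * self.experts[e](x[t])
        return out

x = torch.randn(4, 64)   # 4 tokens
moe = TinyMoELayer()
print(moe(x).shape)      # torch.Size([4, 64]); only 2 of 8 experts ran per token
```

The compute saving is the 2-of-8 activation per token; the memory cost is that all 8 experts are instantiated, which is exactly why offloading unused experts to disk or RAM would stall inference.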