13

Conducting deep web searches and gathering sources is one of the main things I've been using LLMs for. How far away are we from being able to self-host something like Claude's web search capabilities? Or even just a service where I'd pay with my money instead of my data?

you are viewing a single comment's thread
view the rest of the comments
[-] vapeloki@lemmy.world 7 points 2 days ago* (last edited 2 days ago)

Openwebui+searxng on a AMD strix board.

Pro: works like a charm, low power consumption, fast, "big" , LLM (running qwen3.6 35B A3B + gemma4 E4B for website summaries and other smaller tasks)

Con: strix boards start at 2k€, more in USA because of tarrifs

[-] avidamoeba@lemmy.ca 1 points 1 day ago* (last edited 1 day ago)

Curious why do you swap between Qwen and E4B. On my hardware they perform with similar tps. Qwen 3.6 35B spits out 80-100tps on AMD 9700 and E4B gives me about the same tps.

[-] vapeloki@lemmy.world 2 points 1 day ago* (last edited 1 day ago)

To avoid context switching on the GPU. OpenWebUi for example uses it for memory and title generation.

Those are not performance critical and background tasks, so instead of slowing down qwen, we just outsource this stuff to the NPU.

Edit: see here for more details

[-] avidamoeba@lemmy.ca 1 points 1 day ago

Oh I see. Okay this makes sense. I just throw Qwen 3.6 35B Q8 on 2 GPUs and use it for everything but coding agent.

[-] vapeloki@lemmy.world 4 points 2 days ago

For those who want to know more, rough setup:

  • llama-cpp rocmfp4 fork
  • currently custom quantized qwen3.6 35B A3B model, working on publishing
  • be3 embedding and reranker, also GPU
  • gemma4-e4b via FastFlowLM on NPU!
  • OpenWebUI and searxng as docker containers on a Pi currently

We get 70-100tok/s generation. Four slots with 256k context length each.

We use a smaller Board with "only" 64GB of shared LPDDR5X. Bottleneck is memory speed, rocmfp4 quants help a lot.

As soon as I get my imatrix calibration right, I will publish the quantized versions.

Most existing quantized models are broken. The authors did some not supported stuff (like using a already quantized model and requantize it) that you may get issues with coherence or sudden Chinese words in the output.

That is not an issue with rocmfp4 but with vibe coders and agent psychosis.

[-] TropicalDingdong@lemmy.world 5 points 2 days ago

Do you have a walk through for setup?

I'm on the strix halo 128 gb variant and while I got ollama working fine, i haven't gotten any of these multi headed setups working

[-] vapeloki@lemmy.world 5 points 2 days ago

I am on Gentoo for it, but everything with a decent rocm should work.

Have a look for llama-swap, that handles multi head endpoints.

Also, as you are on a big board, you can quantize yourself, as the BF16 version of qwen has only 72gb.

I will try and post a full writeup next days. But feel free to dm me, if you need some guidance on quantize or more.

I am using this fork currently: https://github.com/charlie12345/ROCmFPX

Stuff happens fast currently, so may be worth to wait a week or two ig you need something super stable, but if you are up for experimenting, that's the way to go

[-] TropicalDingdong@lemmy.world 3 points 2 days ago

THis is great, thanks. I'm on the z-13 and needed to use it for a work project, which is wrapping up soon. I'm planning on re-building it as a locally hosted agent support machine.

[-] Shimitar@downonthestreet.eu 2 points 2 days ago

Great man! Gentoo lover and long time addicted here.... Keep it the good work!

[-] ejs@piefed.social 2 points 2 days ago

Thank you so so much for pointing out ROCmFP4. I have been tinkering with my RDNA 3 framework on llama. I was struggling with ROCm llama.cpp and have been using vulcan in the meantime. I know there’s some issues on the llama.cpp github to try and fix my issue (UMA stuff), but haven’t come across this specific project. Gonna try it out

[-] catdog@lemmy.ml 2 points 2 days ago

Yup. And if you want to take a small step without major hardware requirements: connect your setup to a paid subscription Mistral or Anthropic API. They allow you to switch off training on your data.

On top of that, the costs are way lower than the normal consumer grade chat subscriptions, and your searches + memory are kept locally (e.g., managed through open webui).

[-] vapeloki@lemmy.world 1 points 2 days ago

Openrouter is also nice for this. You can use real cheap models for embedding and the bigger ones for the actual research.

[-] artyom@piefed.social 1 points 2 days ago

What's a "strix board"? Is that necessary?

[-] vapeloki@lemmy.world 3 points 2 days ago

AMD Strix is an APU, optimized for AI. It is the cheapest option I am aware of to run bigger models at home. 2k for 56GB VRAM, and less den 300W total power Budget.

One could run smaller models. But for the context sizes required for research work, that is nearly impossible.

Also, external services, like openrouter, can be used to use models hosted in the cloud.

But for self hosted, you need something that can run models with at least 15GB of VRAM + Context. For comparison. Our highly quantized model uses 20GB of vram. For our 4 slots we need another 20GB on top of it (around 5GB for 254k tokens), making it 40GB.

this post was submitted on 21 Jun 2026
13 points (100.0% liked)

Self Hosted - Self-hosting your services.

20016 readers
4 users here now

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.

Rules

Important

Cross-posting

If you see a rule-breaker please DM the mods!

founded 5 years ago
MODERATORS