If it is not immediately obvious to you how negligible the cost is going to be, you have no clue how little compute small models like this require. Apply a bit of common sense: this is a model designed to run locally on smartphones. If it used a lot of power, the phone would run out of battery.
It's hard (if not impossible) to find power usage figures for Gemini Nano, because they're going to depend on the efficiency of the device it's running on. If it's on a phone (where most Chrome installs are), that phone likely has an NPU, in which case the power draw will be negligible. If it has to run on the CPU, it'll be more.
So let's instead assume every user will be using a model comparable to ChatGPT, for which we do have reasonable estimates. According to this estimate, 500 output tokens would use about 0.3 Wh of energy. 500 output tokens is about 400 words, which is probably more than the average user will be using Gemini Nano for (it is intended for small tasks), but let's take that as the average daily use per user. 1 billion users times 0.3 Wh is 300 MWh per day. Fuck all on a global scale: about 0.0015% of the world's daily energy production (20 TWh per day).
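You can sanity-check that arithmetic yourself. Here's the whole thing in Python, with the 0.3 Wh, 1 billion users, and 20 TWh/day figures taken as the assumptions above, not measurements:

```python
# Back-of-the-envelope check of the numbers above.
# Assumptions (from the estimate being discussed, not measured values):
#   ~0.3 Wh per 500 output tokens, 1 billion daily users,
#   ~20 TWh of global daily energy production.
wh_per_user_per_day = 0.3          # Wh for ~500 output tokens
users = 1e9                        # assumed daily users
world_wh_per_day = 20e12           # 20 TWh/day, expressed in Wh

total_wh = wh_per_user_per_day * users      # 3e8 Wh = 300 MWh
share = total_wh / world_wh_per_day         # fraction of daily production

print(f"{total_wh / 1e6:.0f} MWh/day")      # -> 300 MWh/day
print(f"{share:.4%} of daily production")   # -> 0.0015%
```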
Keep in mind that figure is for the full ChatGPT, which runs on 1500-watt GPUs. Gemini Nano runs on chips that draw more like 1.5 W, and on devices that physically cannot draw more than 15 W. It's thus reasonable to estimate that it is on the order of 100x more efficient.
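The "order of 100x" is deliberately loose: it's a raw power-ratio argument, and throughput differs between the platforms too. The ratios themselves:

```python
# Rough scaling argument, not a measurement. Raw power draw differs by
# 2-3 orders of magnitude; tok/s differs too, hence the loose "100x".
datacenter_gpu_w = 1500    # a frontier-model inference GPU
phone_npu_w = 1.5          # assumed NPU draw while running Gemini Nano
phone_cap_w = 15           # what a phone can physically dissipate

print(datacenter_gpu_w / phone_npu_w)   # 1000x raw power difference
print(datacenter_gpu_w / phone_cap_w)   # 100x even at the device cap
```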
Their energy estimate was based only on FLOPs, but I'd assume that in real-world usage the KV cache would be a big factor, if not eventually the dominant one. It's probably also a bit unfair of them to ignore the Internet traffic, and likely all the extra network traffic behind the load balancer.
Not a fan of their analysis, but I wonder if it might be close to accurate for this particular deployment? I can't imagine large contexts and ballooning caches on a model meant for a phone.
They talk about this in the appendix, where they go over the (estimated) effects of large numbers of input tokens (up to 100k). This isn't really relevant for Gemini Nano because it has a maximum context window of 32k, and the deployment in Chrome probably caps it at far less than that.
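For a feel of why context length is what makes the KV cache matter, here's a rough sketch of how its size scales. The layer/head dimensions below are hypothetical stand-ins (Google hasn't published Nano's architecture details, as far as I know), so treat the absolute numbers as illustrative only:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Standard transformer KV cache: one K and one V vector per
    layer, per KV head, per token, stored here in fp16 (2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical Nano-class dimensions (assumed, not published specs):
layers, kv_heads, head_dim = 24, 8, 128

for ctx in (1_000, 32_000, 100_000):
    mib = kv_cache_bytes(layers, kv_heads, head_dim, ctx) / 2**20
    print(f"{ctx:>7} tokens -> {mib:,.0f} MiB of KV cache")
# ->   1,000 tokens ->    94 MiB
# ->  32,000 tokens -> 3,000 MiB
# -> 100,000 tokens -> 9,375 MiB
```

Linear in sequence length, which is why the 100k-token scenario in their appendix is a different regime entirely from a capped on-device deployment.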
I'm inclined to believe the main analysis is reasonably accurate. The numbers are similar to what I get on my local machine with local models. Granted, I tested with a smaller model (7B-parameter Mistral, in this case) on weaker hardware (an AMD 6700XT), but on a quick test I get about 50 tok/s locally at 180 W of power draw, which works out to about 0.5 Wh for 500 tokens. AMD GPUs suck for AI, so I think it's plausible that dedicated compute hardware would get basically the same energy efficiency on a frontier model.
Gemini Nano on a phone NPU is obviously going to be far more efficient: by all accounts it gets the same or better tok/s than I'm getting, at something like 1/50th the TDP.
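To make that concrete, here's the watts-over-throughput arithmetic for my local numbers, with the 1/50th-TDP figure applied as an assumption rather than a measurement:

```python
# Energy per response = power / throughput * tokens.
gpu_watts, gpu_tok_s = 180, 50       # measured locally on the 6700XT
tokens = 500

gpu_wh = gpu_watts / gpu_tok_s * tokens / 3600      # J -> Wh
print(f"GPU: {gpu_wh:.2f} Wh per {tokens} tokens")  # -> 0.50 Wh

# If a phone NPU hits similar tok/s at ~1/50th the power (the rough
# claim above, not a measurement), per-response energy scales down
# by the same factor:
npu_wh = gpu_wh / 50
print(f"NPU (assumed): {npu_wh:.3f} Wh per {tokens} tokens")  # -> 0.010 Wh
```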