OnDevice

Running AI on your computer instead of in the cloud

2026-05-02

When a company says their product uses AI, the AI is almost always running somewhere else. Your words travel to a server, the server runs a language model, and the result comes back. This is fast, convenient, and completely normal. It is also the reason your conversation with a chatbot is stored somewhere you do not control.

Local AI changes the direction of that flow. Instead of your words going out, the model comes in. The language model file downloads to your browser once, and from then on every prompt you type is processed inside your own tab, on your own hardware. Nothing leaves your device. The server is out of the loop entirely.
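
As a rough sketch, here is what that flow looks like in code, using WebLLM, one of several runtimes that run language models inside a browser tab. The model ID is illustrative and the details vary by runtime, but the shape is the same: one download, then local inference.

```ts
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// First visit: downloads the model weights into browser storage.
// Later visits: loads them from cache. Either way, inference is local.
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC");

// Same shape as a cloud chat API, but nothing leaves the tab.
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Summarize this paragraph: ..." }],
});

console.log(reply.choices[0].message.content);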

This sounds like a clean win, and in some ways it is. But local AI is not a free upgrade. It comes with real tradeoffs, and understanding them helps you decide when it matters and when it does not.

What actually changes when AI runs locally

The most obvious change is privacy. A cloud AI service is a third party that receives your input text. That text gets processed according to their terms of service, which may include logging requests for quality improvement, using conversations as training data, or sharing information with partners. Local AI removes the third party entirely. The model runs in your browser. Your text is never sent.

The less obvious change is that you give up the scale advantage of a server. A company running AI on their own hardware can use models with hundreds of billions of parameters and update them constantly. Your browser runs a smaller model — typically in the range of one to three billion parameters — because that is what fits in browser memory and runs at a reasonable speed on consumer hardware.

For everyday tasks like summarizing a document, generating a cover letter, or drafting a short creative piece, the smaller model is genuinely useful. For tasks that require deep factual accuracy, complex multi-step reasoning, or specialized domain knowledge, the difference in model size shows.

Speed: not what you might expect

People assume local AI is slower than cloud AI. The reality is more complicated. A cloud AI response requires a round trip over the internet: your request travels to the server, the server processes it, the response travels back. On a slow connection or a busy server, that round trip adds real time. A local model avoids all of that. Every token is generated by hardware that is physically next to you.

On a modern laptop with a dedicated GPU or Apple Silicon, local generation can reach thirty to fifty tokens per second. That is fast enough to feel instant for short outputs. On a five-year-old machine without GPU acceleration, generation might drop to five tokens per second or fewer, which is slow enough to feel like you are waiting.
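
The arithmetic behind those numbers is worth making explicit. A back-of-envelope sketch, using the throughput figures above:

```ts
// Time to generate a reply of a given length at a given throughput.
function secondsFor(tokens: number, tokensPerSecond: number): number {
  return tokens / tokensPerSecond;
}

// A 300-token reply is roughly 225 words of output.
secondsFor(300, 40); // 7.5 s on a modern laptop with GPU acceleration
secondsFor(300, 5);  // 60 s on an older machine without it
```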

The first visit is slower than subsequent ones. The model file — typically several hundred megabytes — needs to download once. After that, the browser caches it. From the second visit onward, the model loads from your local storage, not the network, and on most devices this takes only a second or two.
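
Under the hood, this is ordinary browser caching. A minimal sketch using the standard Cache API (the URL and cache name here are illustrative; real runtimes manage this internally):

```ts
// Sketch: fetch a model file through the Cache API so it downloads once.
async function loadModelFile(url: string): Promise<ArrayBuffer> {
  const cache = await caches.open("local-ai-models");
  let response = await cache.match(url);
  if (!response) {
    response = await fetch(url);             // first visit: network download
    await cache.put(url, response.clone());  // store a copy for next time
  }
  return response.arrayBuffer();             // later visits: served from disk
}
```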

What the model download actually is

The model file that downloads to your browser is a compressed set of numerical weights that encode the language model. It is not software in the traditional sense — you cannot run it as a standalone program, and it does not have access to your files or your operating system. It is a large array of numbers that a runtime built into the browser uses to predict the next word in a sequence.
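
Put another way, generation is a loop that repeatedly asks the weights for the next token. A conceptual sketch only, where `forward` stands in for the runtime's actual inference step and is not a real API:

```ts
// Conceptual only: `forward` stands in for the runtime's inference step,
// which maps the weights plus the tokens so far to the next token.
declare function forward(weights: Float32Array, tokens: number[]): number;

function generate(weights: Float32Array, prompt: number[], maxTokens: number): number[] {
  const tokens = [...prompt];
  for (let i = 0; i < maxTokens; i++) {
    tokens.push(forward(weights, tokens)); // predict one token, append, repeat
  }
  return tokens;
}
```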

Once cached, it behaves like any other browser-cached asset. If you clear your browser cache, it will download again on the next visit. It is stored in the browser's origin storage, not as a file on your desktop, so you will not see it in your file manager.
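
If you are curious how much space the cached model takes up, the standard storage API reports an estimate for the whole origin. You can paste this into the browser console:

```ts
// Estimate of this origin's total storage use, including any cached model.
const { usage, quota } = await navigator.storage.estimate();
console.log(`Using ${((usage ?? 0) / 1e6).toFixed(0)} MB of ${((quota ?? 0) / 1e9).toFixed(1)} GB`);
```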

Accuracy: where smaller models show their limits

A local model running in your browser will make mistakes that a larger cloud model would not. It will occasionally confuse facts, produce plausible-sounding but incorrect details, or generate outputs that are technically coherent but miss the point of a subtle prompt. These are not bugs in the implementation. They are fundamental properties of smaller models.

For the tasks these tools handle — summarizing text you provide, rewriting a paragraph, drafting a cover letter based on information you enter, generating a recipe from ingredients you list — the model is working with context you give it, not facts it has to recall from training. This narrows the accuracy gap considerably. The model does not need to recall a historical date; it needs to take your input and restructure it clearly.

Where the gap opens up is in tasks that require recall of specific information: detailed legal analysis, medical questions, current events, or highly specialized technical domains. If accuracy on those topics matters, a cloud model with more parameters is the honest choice.

Offline operation: what it looks like in practice

Once the model is cached, these tools work without an internet connection. You can open the tool on a plane, in a basement with no signal, or on a metered connection where you want to avoid extra data usage. The tool itself is a web page, so you need to have visited it at least once while connected. After that, the page and the model both come from cache.
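
The "page from cache" half of this is usually handled by a service worker. A minimal sketch, with illustrative file names:

```ts
// Cache the page shell on install, then answer requests from cache
// before trying the network.
declare const self: ServiceWorkerGlobalScope;

self.addEventListener("install", (event) => {
  event.waitUntil(
    caches.open("app-shell").then((cache) => cache.addAll(["/", "/app.js"]))
  );
});

self.addEventListener("fetch", (event) => {
  event.respondWith(
    caches.match(event.request).then((hit) => hit ?? fetch(event.request))
  );
});
```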

This is a genuine use case for a specific kind of person: someone traveling with sensitive documents, working in a location with unreliable connectivity, or wary of transmitting private text over any network. For most casual users, offline operation is a convenience feature rather than a primary reason to choose local AI.

Hardware requirements: what devices work

Local AI in the browser runs on most modern devices. The rough threshold is any laptop or desktop made in the past four or five years with at least four gigabytes of RAM. Phones and tablets can also run local AI, though generation speed on mobile hardware is significantly lower than on a laptop.

GPU acceleration, when available, speeds up generation considerably. On Apple Silicon Macs, the browser routes computation through the integrated GPU automatically. On Windows machines with a discrete Nvidia or AMD GPU, the browser can do the same. On older machines or those without GPU support, the model falls back to the CPU, which is slower but functional.

Unsupported devices will see a message explaining that the tool requires a browser with local AI capabilities. The devices affected are most commonly older phones, low-memory machines, and browsers that have not yet implemented the necessary runtime support.
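
Capability detection along these lines is how a tool decides which path to take. A sketch, assuming a runtime that can use either WebGPU or a WebAssembly fallback:

```ts
// Prefer WebGPU when the browser exposes it, fall back to WebAssembly on
// the CPU, and report "unsupported" so the tool can show its message.
async function pickBackend(): Promise<"webgpu" | "wasm" | "unsupported"> {
  if ("gpu" in navigator) {
    const adapter = await (navigator as any).gpu.requestAdapter();
    if (adapter) return "webgpu"; // hardware-accelerated path
  }
  if (typeof WebAssembly === "object") {
    return "wasm";                // slower, main-processor fallback
  }
  return "unsupported";
}
```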

When to use local AI, and when not to

Local AI is the right choice when what you are processing is private and what you need is useful, not perfect. A first draft of a cover letter, a summary of a contract you cannot upload to a third party, a rewrite of a sensitive internal document — these are cases where local AI earns its tradeoffs. The output may need editing. The privacy is real.

Cloud AI is the right choice when accuracy is paramount and privacy is less of a concern. A public news article you want analyzed, a programming question with a factually correct answer, a translation task where precision matters — these are cases where the larger model’s edge is worth the transmission cost.

Most people will end up using both, in different contexts, for different kinds of tasks. The important thing is knowing which you are using and why.

A note on what “private” actually means here

Saying a tool is private does not mean it is anonymous. If you are signed in to an account on a site that also runs local AI, the site still knows you visited. If the site uses analytics, it still knows the page was loaded. What local AI changes is specifically the handling of your input text and output — those stay on your device. The surrounding infrastructure of a website still behaves like a website.

On this site, PostHog analytics records that pages were visited, but does not receive any of the text you type into a tool. Google AdSense may load on some pages. Clerk handles account authentication. None of these services receive your prompt or your output. The privacy page explains which third-party services are present and what each one receives.

You can verify this yourself. Open your browser’s network panel, navigate to the AI summarizer, type a prompt, and watch the network tab during generation. You will see requests to load the page and model files, then nothing outbound carrying your text. That is what local AI looks like from the outside.

Summary

Local AI trades model size for privacy and offline capability. The models that run in your browser are genuinely useful for everyday writing and analysis tasks. They are slower on older hardware and less accurate than cloud models on tasks requiring deep factual recall. For private documents, sensitive text, and situations where you cannot or do not want to transmit your input over a network, local AI is a practical and meaningful alternative to cloud AI services. For everything else, it is a legitimate option worth knowing about.