Is Semantic Search Possible Without Wi-Fi?

Can you run real vector similarity search, the kind that understands meaning, entirely offline in a browser tab? No server, no API, no network connection at all?

Yes. And I've been doing it for six months.

But the answer has a lot of texture to it, and most of the blog posts I've seen on "offline AI" skip the parts that actually matter to developers. They wave their hands at "on-device inference" and move on. So I want to walk through how this actually works, what the architectural constraints are, and where the math lives when your user is sitting on an airplane with Wi-Fi turned off.

The pieces you need (and the ones you don't)

Semantic search, at its core, requires two things: a way to turn text into dense vectors, and a way to compare those vectors to find the closest matches. That's it. Everything else (your fancy HNSW indices, your sharded vector databases, your billion-parameter models) is optimization. Important optimization! But optimization.

For offline browser-based semantic search, the stack I've been working with (and that TraceMind uses in production) looks like this:

A transformer model small enough to run inference in WASM or WebGPU
An embedding store that lives entirely in IndexedDB
A vector similarity function that doesn't need anything fancier than a for-loop

No Pinecone. No Weaviate. No Elasticsearch kNN plugin. Nothing leaves the browser.

Transformers.js is the linchpin

If you haven't looked at transformers.js recently, it's grown up a lot. It's a JavaScript port of the Hugging Face transformers pipeline that runs ONNX models through either WebAssembly (via onnxruntime-web) or, increasingly, WebGPU.

The model that makes offline semantic search practical is all-MiniLM-L6-v2. Let me give you the numbers that matter:

6 transformer layers (hence L6)
~22 million parameters
Produces 384-dimensional embeddings
ONNX quantized model size: roughly 23MB

Twenty-three megabytes. That's about the size of a few high-resolution photos. You download it once, cache it in the browser, and from that point forward, you can embed text without touching a network.

This is the single most important fact in the whole offline-AI conversation and people gloss over it constantly. The model fits in a browser cache. It's not streaming weights from a CDN on every query. Once it's local, it's local.

How the embedding pipeline works

When you call pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2') in transformers.js, here's what's actually happening under the hood:

The input text gets tokenized using a WordPiece tokenizer (same one BERT uses). Each token maps to an integer ID. The tokenizer's vocabulary is about 30,000 tokens, and the vocab file is bundled with the model.

Those token IDs are mapped to input embeddings, which then go through six stacked transformer encoder layers. Each layer has multi-head self-attention (12 heads, 384 hidden dim) followed by a feed-forward network. The attention mechanism computes:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

where d_k = 32 (384 dimensions / 12 heads). This is standard scaled dot-product attention, nothing exotic.
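To make the formula concrete, here is a minimal single-head sketch in plain JavaScript. It is illustrative only: the real model runs this inside the ONNX runtime with learned projection matrices for Q, K, and V, and the function names below are my own.

```javascript
// Numerically stable softmax over one row of scores.
function softmax(row) {
  const max = Math.max(...row);
  const exps = row.map((x) => Math.exp(x - max));
  const sum = exps.reduce((acc, x) => acc + x, 0);
  return exps.map((x) => x / sum);
}

// Scaled dot-product attention for a single head.
// Q, K, V are arrays of rows shaped [numTokens][d_k].
function attention(Q, K, V) {
  const scale = Math.sqrt(K[0].length); // sqrt(d_k)
  return Q.map((q) => {
    // Score this query against every key, scaled by sqrt(d_k).
    const scores = K.map(
      (k) => k.reduce((acc, kj, j) => acc + q[j] * kj, 0) / scale
    );
    const weights = softmax(scores);
    // The output row is the attention-weighted sum of the value rows.
    return V[0].map((_, j) =>
      weights.reduce((acc, w, t) => acc + w * V[t][j], 0)
    );
  });
}
```

With 12 heads, the model runs this computation 12 times per layer on 32-dimensional slices and concatenates the results back into a 384-dimensional vector per token.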

After all six layers, you have contextual embeddings for every token. But you need one vector for the whole input, not one per token. So the final step is mean pooling: average all token embeddings (ignoring padding tokens via the attention mask) to produce a single 384-dimensional vector.

That vector is your semantic fingerprint. Two pieces of text about similar topics will produce vectors that are close together in 384-dimensional space, even if they share zero words in common.
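The mean pooling step is easy to picture in code. This is a small sketch, assuming the token embeddings arrive as plain arrays and the attention mask uses 1 for real tokens and 0 for padding; `meanPool` is a name I made up for illustration.

```javascript
// Mean-pool per-token embeddings into one sentence vector.
// tokenEmbeddings: array of token vectors, all the same length.
// attentionMask: 1 for a real token, 0 for padding.
function meanPool(tokenEmbeddings, attentionMask) {
  const dim = tokenEmbeddings[0].length;
  const pooled = new Float32Array(dim);
  let count = 0;
  for (let t = 0; t < tokenEmbeddings.length; t++) {
    if (attentionMask[t] === 0) continue; // skip padding tokens
    for (let j = 0; j < dim; j++) pooled[j] += tokenEmbeddings[t][j];
    count++;
  }
  // Divide by the number of real tokens (guarding against an all-padding input).
  for (let j = 0; j < dim; j++) pooled[j] /= Math.max(count, 1);
  return pooled;
}
```

In practice you rarely write this yourself: transformers.js performs the pooling for you when you pass `{ pooling: 'mean', normalize: true }` to the feature-extraction pipeline call.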

Wait, what about Voy?

Voy is the other half of this equation, and honestly, it's the half that gets me more excited from an engineering perspective.

Voy is a WASM-compiled vector similarity search library built specifically for the browser. You feed it vectors, you query it with a vector, it gives you the nearest neighbors. Offline. Fast.

Under the hood, Voy indexes vectors in a k-d tree; HNSW (hierarchical navigable small world graphs) is the structure you'd reach for at much larger scale, and as far as I know Voy doesn't ship it. For most browser-based use cases, where you're indexing thousands or tens of thousands of documents rather than millions, even plain brute-force cosine similarity is fine. More than fine. It's predictable, debuggable, and exact.

Here's something that bugs me about the discourse around vector search: people treat approximate nearest neighbor algorithms as a baseline requirement. They're not. ANN is an optimization for when you have millions or billions of vectors and can't afford to scan them all. If you have 50,000 vectors at 384 dimensions each? A brute-force cosine scan takes milliseconds. Not "optimized milliseconds." Just regular, boring milliseconds.

TraceMind takes this approach even further, with a CSP-safe brute-force cosine index built directly into the extension. No external vector database, no k-d tree partitioning. Just math in a loop. I wrote about the broader architecture in my piece on how vector embeddings work in your browser if you want more context.

The math that makes it work offline

Let me get into the cosine similarity calculation because understanding this removes a lot of the mysticism around "AI search."

Two vectors, A and B, each with 384 dimensions. Cosine similarity is:

cos(θ) = (A · B) / (||A|| * ||B||)

The dot product A · B is the sum of element-wise multiplications. The norms ||A|| and ||B|| are the square root of the sum of squared elements.

In JavaScript, this is embarrassingly simple:

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Nine lines. No TensorFlow. No CUDA. No cloud.

Pre-compute and store the norms alongside each vector, and your query-time cost drops to one dot product and one division per candidate. For 10,000 stored vectors, that's 10,000 dot products of length 384. Around 3.8 million multiply-add operations. Modern JavaScript engines chew through that in under 10ms.
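Here is a rough sketch of that pre-computed-norm scan. The `buildIndex` and `search` names are mine, not any library's API, and the code assumes no stored vector is all zeros (a zero norm would produce NaN scores).

```javascript
// Euclidean norm of a vector.
function norm(v) {
  let sum = 0;
  for (let i = 0; i < v.length; i++) sum += v[i] * v[i];
  return Math.sqrt(sum);
}

// Store each vector's norm once, at insert time.
function buildIndex(vectors) {
  return vectors.map((v, id) => {
    const vector = Float32Array.from(v);
    return { id, vector, norm: norm(vector) };
  });
}

// Brute-force scan: one dot product and one division per candidate.
function search(index, query, k = 5) {
  const q = Float32Array.from(query);
  const qNorm = norm(q); // computed once per query
  const scored = index.map((entry) => {
    let dot = 0;
    for (let i = 0; i < q.length; i++) dot += q[i] * entry.vector[i];
    return { id: entry.id, score: dot / (qNorm * entry.norm) };
  });
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, k);
}
```

If your embeddings are already unit-normalized, every norm is 1 and the division disappears entirely, leaving a bare dot product.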

Quantization trick

Here's a practical detail that matters more than most blog posts admit: storing 384 float32 values per page visit adds up fast. That's 1,536 bytes per vector. Index 10,000 pages and you're at ~15MB just for embeddings.

TraceMind quantizes from float32 down to uint8, shrinking each vector from 1,536 bytes to 384 bytes. That's 75% smaller. The precision loss is measurable but small enough that retrieval quality doesn't noticeably degrade for the kinds of queries real people actually type.

The quantization is straightforward: find the min and max across the 384 dimensions, linearly map the range to 0-255, round to the nearest integer, and keep the min and max so you can reverse the mapping. At query time, you either dequantize back to float32 before computing similarity, or (cleverer) compute the dot product directly on the quantized values, correct for each vector's offset and scale, and accept the minor approximation.
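A minimal sketch of that min/max scheme follows. I don't know how TraceMind implements it internally, so treat `quantize` and `dequantize` as illustrative names and the per-vector `min` and `scale` fields as my assumption about what has to be stored to reverse the mapping.

```javascript
// Quantize a float vector to uint8 using its own min/max range.
function quantize(vector) {
  let min = Infinity;
  let max = -Infinity;
  for (const x of vector) {
    if (x < min) min = x;
    if (x > max) max = x;
  }
  // Width of one quantization step; fall back to 1 for constant vectors.
  const scale = (max - min) / 255 || 1;
  const codes = new Uint8Array(vector.length);
  for (let i = 0; i < vector.length; i++) {
    codes[i] = Math.round((vector[i] - min) / scale);
  }
  return { codes, min, scale };
}

// Reverse the mapping: each value comes back within half a step of the original.
function dequantize({ codes, min, scale }) {
  const out = new Float32Array(codes.length);
  for (let i = 0; i < codes.length; i++) out[i] = codes[i] * scale + min;
  return out;
}
```

Note the small overhead: `min` and `scale` have to be stored next to each vector, which adds a few bytes on top of the 384.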

This is not a controversial tradeoff. It's the same principle behind int8 inference in every major ML framework.

What "offline" actually means in practice

I want to be precise here because "offline" is doing a lot of work in this article's title.

There are two phases:

Phase 1: Model acquisition (requires network, once). The first time the system runs, it needs to download the ONNX model file. For all-MiniLM-L6-v2, that's ~23MB. After that, it's cached. Some implementations use the Cache API, others store it in IndexedDB directly. Either way, it persists across sessions.

Phase 2: Everything else (fully offline). Tokenization, embedding, storage, retrieval, ranking. All local. All the time. No fallback to a server, no "degraded mode" that secretly phones home.

This distinction matters because some tools claim to be "offline capable" but actually stream model weights on every cold start. That's not offline. That's "offline if you never close your browser tab."

I've tested TraceMind on flights with airplane mode enabled. It works. Full semantic search across everything I'd previously browsed. You type a concept, it finds pages about that concept, even if your query words don't appear on those pages. That's the whole point of semantic search, and it holds up at 35,000 feet.

The fallback story

What happens when on-device ML inference genuinely isn't available? Maybe the user's device can't run the WASM backend. Maybe the model file got evicted from cache and there's no network.

TraceMind falls back to a distilled static embedding model (Model2Vec): instead of running full transformer inference, it maps tokens to pre-computed vectors through a lookup table and combines them. Think of it like word2vec's more sophisticated cousin.

The quality drops. Obviously. You lose the contextual understanding that attention heads provide. "Bank of the river" and "bank account" might get similar embeddings instead of different ones. But you still get meaning-based retrieval. You don't fall all the way back to keyword matching.
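To show the shape of the idea, here is a toy static-embedding sketch. It is not Model2Vec's actual API or tokenizer: the three-word lookup table and the whitespace-style tokenization are made up for illustration, and a real table is distilled from a transformer and covers a full subword vocabulary.

```javascript
// Hypothetical toy lookup table: token -> pre-computed static vector.
const staticTable = new Map([
  ["river", [1, 0]],
  ["bank", [0.5, 0.5]],
  ["account", [0, 1]],
]);

// Embed text by averaging the static vectors of the tokens we recognize.
function staticEmbed(text, table, dim) {
  const out = new Float32Array(dim);
  let count = 0;
  for (const token of text.toLowerCase().split(/\W+/)) {
    const vec = table.get(token);
    if (!vec) continue; // unknown or empty tokens are skipped
    for (let j = 0; j < dim; j++) out[j] += vec[j];
    count++;
  }
  if (count > 0) {
    for (let j = 0; j < dim; j++) out[j] /= count;
  }
  return out;
}
```

Notice that "bank" contributes the same vector no matter what surrounds it. Only the neighboring words pull the two phrases apart, which is exactly the loss of context described above.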

This graceful degradation pattern is underappreciated. Most systems treat ML inference as binary: it works or it doesn't. Having a middle ground where search quality degrades smoothly instead of breaking is better engineering than most developers bother with.
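The pattern itself is only a few lines. This is a sketch under my own assumptions: `embedders` is a hypothetical list of `{ name, embed }` objects ordered best-first, and each `embed` function rejects when its backend is unavailable.

```javascript
// Try each embedder in order and return the first result that succeeds.
async function embedWithFallback(text, embedders) {
  const errors = [];
  for (const { name, embed } of embedders) {
    try {
      // Report which embedder produced the vector alongside the vector itself.
      return { name, vector: await embed(text) };
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`No embedder available (${errors.join("; ")})`);
}
```

One design detail matters here: vectors from different models live in different spaces and can't be compared with each other, so store the embedder name next to every vector and only compare like with like.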

Hybrid retrieval: not an afterthought

Pure semantic search has a weakness that nobody likes to talk about.

If you search for "RFC 7231" and you've visited a page containing exactly that string, you want an exact match. Dense vector retrieval might surface it, but it might also rank a general "HTTP specification overview" higher because the embeddings are semantically close. This is infuriating when you know the exact phrase you want.

The solution is hybrid retrieval. Run semantic search and full-text keyword search in parallel, then merge the results. TraceMind does this with Reciprocal Rank Fusion (RRF), a ranking merge algorithm. For each result, you compute:

RRF_score = Σ 1 / (k + rank_i)

where k is a constant (typically 60) and rank_i is the result's rank in each individual search. Documents that rank highly in both searches bubble to the top. Documents that rank highly in only one still appear, but lower.
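In code, RRF is a short loop over ranked lists. This sketch assumes each search returns an array of document IDs ordered best-first; `reciprocalRankFusion` is my own name for it, not a TraceMind function.

```javascript
// Merge several ranked lists of document IDs with Reciprocal Rank Fusion.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId, score]) => ({ docId, score }));
}

// Example: one semantic ranking and one keyword ranking.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // semantic results
  ["b", "d", "a"], // keyword results
]);
```

Here "b" and "a" appear in both lists and land at the top, while "d" and "c" each appear in one list and trail behind. Because only ranks are used, you never have to reconcile cosine scores with keyword scores.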

The full-text side uses FlexSearch, a fast in-memory text index that supports fuzzy matching and runs entirely in the browser. No server required.

What I find clever about TraceMind's implementation is that it detects whether you're navigating (looking for a specific known page) versus exploring (trying to find something you vaguely remember) and adjusts the blend between semantic and keyword results accordingly. I covered the differences between these search modes in semantic search vs keyword search if you're curious about the details.

So what's the catch?

There are real limitations. I'm not going to pretend this is perfect.

Model size constrains capability. all-MiniLM-L6-v2 is good, but it's not a 1.5-billion-parameter embedding model. It struggles with highly domain-specific jargon, rare languages, and very long passages (it truncates at 256 tokens). You can throw a bigger model at the problem, but now your first-load experience involves downloading 100MB+ and inference latency goes from 30ms to 300ms.

Cold start matters. Loading the ONNX model into memory and warming up the WASM runtime takes 2-4 seconds on a typical laptop. Not terrible, but noticeable. WebGPU cuts this significantly when available.

IndexedDB is not a database. It's a key-value store with pretensions. It doesn't support vector operations. You're pulling vectors into memory and scanning them yourself. For a few thousand vectors, this is fine. For 100,000+, you need to think about pagination and batching.
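One way to handle the batching is to stream records through the scan and keep only a running top-k. This is a sketch, not working IndexedDB code: `readBatches` is a hypothetical async generator standing in for a cursor over your object store, yielding arrays of `{ id, vector }` records.

```javascript
// Plain cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scan vectors batch by batch, keeping only the best k matches in memory.
async function searchInBatches(readBatches, query, k = 10) {
  let best = [];
  for await (const batch of readBatches()) {
    for (const { id, vector } of batch) {
      best.push({ id, score: cosine(query, vector) });
    }
    // Trim after each batch so memory stays proportional to k plus one batch.
    best.sort((a, b) => b.score - a.score);
    best = best.slice(0, k);
  }
  return best;
}
```

The result is still exact. Batching changes how much sits in memory at once, not which neighbors you find.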

Browser storage limits. Most browsers give extensions generous storage quotas (Chrome lifts the quota for extensions that request the unlimitedStorage permission), but web apps are more constrained. If you're building this as a PWA rather than an extension, you'll hit storage pressure sooner.

These are engineering constraints, not fundamental barriers. The core claim holds: semantic search works offline, entirely in the browser, with no network dependency after initial model download.

Try it yourself

If you want to see this in action without building it yourself, TraceMind is the cleanest implementation I've found. The free tier includes the full semantic and hybrid search engine with no paywall on search quality. Install it, browse normally, then search your history by meaning instead of keywords. All offline. All local.

For developers who want to build their own, the key libraries are transformers.js for model inference and either Voy or your own brute-force cosine implementation for retrieval. The writeup on how TraceMind chose IndexedDB and WASM is also worth reading if you're making similar architectural decisions.

The transformer wave didn't just give us ChatGPT. It gave us 23MB models that fit in a browser cache and turn any text into a searchable vector. Running that without Wi-Fi isn't a hack or a compromise. It's just what the math looks like when it's small enough to take with you.