Paper Summary #17 - Engram

Paper: Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
Official implementation: DeepSeek-AI/Engram


Attention (context mixing): self-attention links tokens inside the current sequence and carries global context forward.

MoE (conditional computation): experts increase transformation capacity while activating only a few FFNs per token.

Engram (conditional memory): hashed n-grams retrieve static vectors, then the hidden state decides whether to inject them.

Attention is not memory

Self-attention can resolve relationships in a sentence. It does not automatically provide a grounded representation of what the entities actually are.

Token association is not enough: "Harry" alone can point to many different entities.
"Harry Potter" becomes useful when it retrieves a richer factual cluster, grounded by facts such as wizard and Hogwarts.

In a standard Transformer, this grounding is reconstructed through repeated computation. Attention composes nearby tokens. Feed-forward layers transform features. Later layers gradually turn surface strings into semantic representations.

The Engram paper frames this as an architectural mismatch: dynamic reasoning should use computation, while common static phrases should often use lookup.

The FFN already looks like a memory

A Transformer MLP can be read as a bank of pattern detectors and value writers.

$$\operatorname{FFN}(h) = W_{\text{down}} \, \sigma(W_{\text{up}}h + b_{\text{up}}) + b_{\text{down}}.$$

Geva et al. showed that FFNs behave like key-value memories: rows of $W_{\text{up}}$ detect patterns, while columns of $W_{\text{down}}$ write value vectors into the residual stream.

Up-projection features act like soft keys that detect patterns.
Down-projection values write information back into the residual stream.
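
A minimal NumPy sketch of that reading (toy dimensions, ReLU standing in for $\sigma$, biases omitted): each row of $W_{\text{up}}$ scores the hidden state as a soft key, and the matching column of $W_{\text{down}}$ contributes a value vector weighted by that score.

import numpy as np

d, m = 8, 32                        # hidden width and number of memory slots (toy values)
rng = np.random.default_rng(0)
W_up = rng.normal(size=(m, d))      # rows act as soft keys / pattern detectors
W_down = rng.normal(size=(d, m))    # columns act as values written to the residual stream

h = rng.normal(size=d)              # hidden state of one token
scores = np.maximum(W_up @ h, 0.0)  # key activations
ffn_out = W_down @ scores           # FFN output

# The same output, read explicitly as a key-value memory:
memory_read = sum(scores[i] * W_down[:, i] for i in range(m))
assert np.allclose(ffn_out, memory_read)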

MoE scales this by adding many FFNs and routing each token to a few experts:

$$\operatorname{MoE}(h_t)=\sum_{i \in \operatorname{TopK}(r(h_t))} p_i(h_t)E_i(h_t).$$
MoE increases the number of possible transformations, but it still performs runtime computation.

Static facts want tables

For a single token, lookup is simple. For phrases, the combinatorics explode.

A token embedding table maps an ID directly to a vector:

$$e = E[x], \qquad E \in \mathbb{R}^{|V| \times d}.$$

But facts are usually phrase-level. "Harry" is ambiguous; "Harry Potter" is a much more specific key.

A direct bigram table with $|V|=128{,}000$ would need one row per ordered token pair:

$$|V|^2 = 128{,}000^2 = 16{,}384{,}000{,}000 \text{ rows}.$$
Single-token lookup is manageable; a direct bigram table is already enormous.
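
A back-of-envelope check makes the point concrete; the embedding width and fp16 storage below are assumptions for illustration, not numbers from the paper.

V = 128_000
rows = V ** 2                       # 16,384,000,000 bigram rows
d, bytes_per_elem = 1024, 2         # assumed embedding width and fp16 storage
table_bytes = rows * d * bytes_per_elem
print(f"{rows:,} rows, {table_bytes / 1e12:.1f} TB")   # 16,384,000,000 rows, 33.6 TB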

Hash the local phrase

Engram compresses token IDs, hashes suffix n-grams, and retrieves rows from multiple embedding tables.

First, a tokenizer projection maps raw token IDs into canonical IDs:

$$P: V \to V', \qquad x'_t = P(x_t).$$

Then Engram forms suffix n-grams:

$$g_{t,n} = (x'_{t-n+1}, \ldots, x'_t).$$

Each hash head maps the compressed n-gram into a table row:

$$z_{t,n,k} = \phi_{n,k}(g_{t,n}), \qquad e_{t,n,k} = E_{n,k}[z_{t,n,k}].$$ $$e_t=\big\Vert_{n=2}^{N}\big\Vert_{k=1}^{K}e_{t,n,k}.$$
A hash function maps the local phrase to a row in a learned memory table.

Why multiplicative-XOR?

Addition creates structured collisions and loses order. Plain XOR also loses order because it is commutative. Engram uses position-specific multipliers before XOR:

$$\phi_{n,k}(g_{t,n})= \left(\bigoplus_{i=0}^{n-1}m^{(\ell,k)}_i x'_{t-i}\right)\bmod M_{n,k}.$$
An additive hash keeps nearby IDs nearby and ignores order; position-specific multipliers before XOR make the hash order-sensitive.

A small hash lab

This toy demo is not DeepSeek's implementation. It makes the design intuition visible: one phrase produces several independent table addresses.

Multi-head lookup

Hashing the same phrase with eight simulated heads sends it to eight independent slots; a total collision requires every head to collide at once, which multi-head hashing makes much less likely (see the sketch below).

$$\Pr[\forall k,\ \phi_k(a)=\phi_k(b)] \approx \prod_{k=1}^{K}\frac{1}{M_k}.$$
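
The following stand-alone Python sketch mimics that behaviour. It is not the repository's hash; the multipliers, prime moduli, and token IDs are all made up for illustration.

import math
import random

K = 8                                                   # number of hash heads (toy value)
rng = random.Random(0)
multipliers = [[rng.randrange(1, 1 << 31, 2) for _ in range(2)] for _ in range(K)]
moduli = [101, 103, 107, 109, 113, 127, 131, 137]       # tiny distinct primes, one per head

def head_hash(bigram, k):
    # Position-specific odd multipliers, XOR-mixed, then a per-head modulus.
    mix = 0
    for i, token_id in enumerate(bigram):
        mix ^= token_id * multipliers[k][i]
    return mix % moduli[k]

harry_potter = (10_482, 55_713)                         # made-up canonical token IDs
potter_harry = (55_713, 10_482)

print([head_hash(harry_potter, k) for k in range(K)])   # one phrase, eight table addresses
print([head_hash(potter_harry, k) for k in range(K)])   # different slots: order matters

# Probability that two random phrases collide in every head at once.
print(math.prod(1 / m for m in moduli))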

Lookup needs a gate

Static memory is useful only when the current context agrees with it.

The retrieved vector $e_t$ is projected into a key and value:

$$k_t = W_K e_t, \qquad v_t = W_V e_t.$$

The hidden state is the query. The scalar gate is:

$$\alpha_t=\sigma\left( \frac{\operatorname{RMSNorm}(h_t)^\top\operatorname{RMSNorm}(k_t)}{\sqrt{d}} \right).$$ $$\tilde{v}_t=\alpha_t v_t.$$
The retrieved memory becomes a key for relevance and a value for content.
The current hidden state decides how much memory to admit.
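
A compact PyTorch sketch of that gate, following the equations above; the widths, the random projections, and the simplified RMSNorm (no learned scale) are illustrative assumptions.

import torch

d, d_mem = 16, 24                         # hidden width and retrieved-memory width (toy values)
rms = lambda x: x / x.pow(2).mean(-1, keepdim=True).add(1e-6).sqrt()

W_K = torch.randn(d, d_mem) / d_mem ** 0.5
W_V = torch.randn(d, d_mem) / d_mem ** 0.5

h_t = torch.randn(d)                      # current hidden state (the query)
e_t = torch.randn(d_mem)                  # concatenated retrieved memory vector

k_t, v_t = W_K @ e_t, W_V @ e_t
alpha = torch.sigmoid((rms(h_t) * rms(k_t)).sum() / d ** 0.5)   # scalar relevance gate
v_tilde = alpha * v_t                     # gated memory update for the residual stream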

Short convolution

After gating, Engram applies a short depthwise causal convolution and a residual path:

$$Y=\operatorname{SiLU}\left(\operatorname{Conv1D}(\operatorname{RMSNorm}(\tilde{V}))\right)+\tilde{V}.$$ $$H^{(\ell)} \leftarrow H^{(\ell)} + Y.$$
The convolution lets nearby gated values interact before the residual injection.
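
A minimal causal depthwise Conv1D matching the formula above; the kernel size, widths, and the simplified RMSNorm are placeholders rather than the paper's settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

d, kernel = 16, 4
conv = nn.Conv1d(d, d, kernel_size=kernel, groups=d, bias=False)   # depthwise: one filter per channel
rms = lambda x: x / x.pow(2).mean(-1, keepdim=True).add(1e-6).sqrt()

v = torch.randn(2, 10, d)                    # gated values, shape [batch, time, dim]
x = rms(v).transpose(1, 2)                   # Conv1d expects [batch, dim, time]
x = F.pad(x, (kernel - 1, 0))                # left padding keeps the convolution causal
y = F.silu(conv(x)).transpose(1, 2) + v      # SiLU(Conv1D(RMSNorm(v))) + v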

Inside the Transformer, not just at the input

Engram is inserted into selected Transformer blocks. The paper's 27B model uses layers 2 and 15.

Engram augments selected blocks while the ordinary token embedding and LM head remain intact.

Layer 1 is too raw

The hidden state is still close to token embeddings, so context-aware gating has little context to use.

Layer 2 is the sweet spot

One round of attention is enough to make the gate useful while still being early enough to save depth.

Middle layers refine

A later Engram module catches associations that only become clear after partial processing.

For multi-branch hyper-connection (mHC) backbones, Engram shares the memory table and value projection but uses branch-specific key projections:

$$\alpha^{(m)}_t= \sigma\left( \frac{\operatorname{RMSNorm}(h^{(m)}_t)^\top\operatorname{RMSNorm}(W^{(m)}_K e_t)}{\sqrt{d}} \right),$$ $$u^{(m)}_t=\alpha^{(m)}_t(W_Ve_t).$$
Different residual branches can use the same memory vector differently.

How much memory is enough?

Engram's strongest empirical claim is that sparse capacity should be split between MoE and memory.

$$P_{\text{sparse}}=P_{\text{tot}}-P_{\text{act}}.$$ $$P_{\text{MoE}}^{(\text{sparse})}=\rho P_{\text{sparse}}, \qquad P_{\text{Engram}}=(1-\rho)P_{\text{sparse}}.$$

Sparsity allocation

The paper's optimum appears around $\rho \approx 0.75$ to $0.80$: most sparse capacity stays in MoE, but a meaningful chunk becomes Engram memory.

The paper finds a U-shaped validation-loss curve, with the best region near $\rho \approx 0.75$ to $0.80$.
Increasing the number of memory slots keeps improving loss over the tested range.
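
As a quick sanity check, plugging the Engram-27B figures from the model table in the next section (26.7B total, 3.8B activated, 5.7B of Engram memory) into the formulas above recovers the reported sweet spot:

P_tot, P_act, P_engram = 26.7, 3.8, 5.7    # billions, from the Engram-27B row below

P_sparse = P_tot - P_act                   # 22.9B of sparse capacity
rho = 1 - P_engram / P_sparse              # share of sparse capacity kept by MoE
print(round(P_sparse, 1), round(rho, 3))   # 22.9 0.751 -- inside the 0.75-0.80 region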

What changes at scale?

Engram-27B is iso-parameter and iso-FLOPs relative to MoE-27B. The win comes from reallocating sparse capacity, not from spending more activated compute.

| Model | Total params | Activated params | Experts | Engram params |
| --- | --- | --- | --- | --- |
| Dense-4B | 4.1B | 3.8B | none | none |
| MoE-27B | 26.7B | 3.8B | 2 shared + 72 routed, top-6 | none |
| Engram-27B | 26.7B | 3.8B | 2 shared + 55 routed, top-6 | 5.7B |
| Engram-40B | 39.5B | 3.8B | 2 shared + 55 routed, top-6 | 18.5B |
Gains are not limited to factual knowledge; the paper reports strong improvements on reasoning, code, and math benchmarks as well.

Effective depth

Engram helps shallow layers behave like deeper MoE layers because static local reconstruction is handled by lookup.

$$ \text{lookup for static facts} \Rightarrow \text{less early reconstruction} \Rightarrow \text{more effective depth} $$
CKA heatmaps show shallow Engram layers aligning with deeper MoE layers.
Zeroing Engram at inference heavily damages factual-knowledge tasks, while reading comprehension largely survives.

Lookup frees attention

The paper argues that once local stereotyped patterns are handled by memory, attention can spend more of its capacity on global context.

| Model | Multi-Query NIAH | Variable Tracking |
| --- | --- | --- |
| MoE-27B, 50k pretrain steps | 84.2 | 77.0 |
| Engram-27B, 46k steps, matched loss | 97.0 | 87.2 |

This does not mean Engram directly performs long-context retrieval. It means early representations are cleaner and attention has less local reconstruction work to do.

Why CPU offload can work

MoE routing depends on hidden states. Engram indices depend only on token IDs.

$$\text{MoE expert IDs}=r(h_t).$$ $$\text{Engram IDs}=\phi(x_1,\ldots,x_T).$$

Because Engram addresses are known before the layer executes, rows can be prefetched from host memory while earlier GPU layers are still computing.
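
A self-contained toy of that scheduling idea (not the repository's offload path): the table lives in host memory, row IDs are computed up front from token IDs alone, and the gathers run on a background thread while earlier layers would be busy on the GPU.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Toy stand-ins: a host-memory table and precomputed Engram row IDs for two layers.
host_table = np.random.randn(100_003, 64).astype(np.float32)
row_ids = {
    2: np.random.randint(0, 100_003, size=(1, 128, 8)),
    15: np.random.randint(0, 100_003, size=(1, 128, 8)),
}

def gather_rows(layer_id):
    # CPU-side gather; in a real system the host-to-GPU copy of these rows
    # would overlap with the compute of the layers before layer_id.
    return host_table[row_ids[layer_id]]

with ThreadPoolExecutor() as pool:
    futures = {lid: pool.submit(gather_rows, lid) for lid in row_ids}
    # ... earlier layers would execute on the GPU here ...
    rows_layer_2 = futures[2].result()       # shape (1, 128, 8, 64), ready before layer 2 runs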

The active communication volume scales with retrieved rows, not total table size:

$$\text{bytes per token}\approx |\mathcal{N}|K d_{\text{head}}\cdot\text{bytes-per-element}.$$
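
For a sense of scale, with illustrative values that are not taken from the paper (three n-gram orders, eight heads, 64-dimensional rows, bf16), the traffic is tiny compared with the table itself:

n_orders, heads, d_head, bytes_per_elem = 3, 8, 64, 2    # assumed values, not from the paper
per_token = n_orders * heads * d_head * bytes_per_elem
print(per_token, "bytes per token")                       # 3072

tokens_per_second = 10_000                                # hypothetical decode throughput
print(per_token * tokens_per_second / 1e6, "MB/s of host-to-GPU traffic")   # ~30.7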

The paper reports less than 3 percent throughput penalty when offloading a 100B-parameter Engram layer to host DRAM in their nano-vLLM-based setup.

Implementation path

The official repository ships a demo that focuses on data flow rather than production kernels.

The useful way to read the demo is as a call graph. Engram is inserted inside selected Transformer blocks before the ordinary attention and MoE sublayers. The block still receives the full token IDs because the memory address is computed from tokens, not hidden states.

class TransformerBlock(nn.Module):
    def forward(self, input_ids, hidden_states):
        if self.engram is not None:
            hidden_states = (
                self.engram(hidden_states=hidden_states, input_ids=input_ids)
                + hidden_states
            )

        hidden_states = self.attn(hidden_states) + hidden_states
        hidden_states = self.moe(hidden_states) + hidden_states
        return hidden_states

So the lookup path is not a sidecar after decoding. It is a residual branch inside the model's forward pass. For configured layers such as 1 and 15 in the demo, the sequence is:

| Step | Code object | Role |
| --- | --- | --- |
| Compress | CompressedTokenizer | Normalize equivalent token strings and map original token IDs to a smaller canonical ID space. |
| Index | NgramHashMapping.hash | Call the n-gram hash routine for every Engram layer and return layer-specific row IDs. |
| Gather | MultiHeadEmbedding | Use offsets so many head-specific tables can live inside one contiguous embedding table. |
| Fuse | Engram.forward | Project retrieved rows into keys and values, gate with the hidden state, apply a short convolution, and return a residual update. |

Tokenizer compression

The demo builds an array mapping each original token ID to a normalized canonical ID. The normalizer applies Unicode normalization, accent stripping, lowercasing, whitespace cleanup, and a fallback for undecodable tokens. This matters because many surface forms should share lookup rows.

import numpy as np

old2new = {}
key2new = {}

for tid in range(vocab_size):
    text = tokenizer.decode([tid], skip_special_tokens=False)
    # Undecodable tokens use a fallback key; everything else is normalized
    # (Unicode normalization, accent stripping, lowercasing, whitespace cleanup).
    key = token_string_if_undecodable(text) or normalize(text)

    if key not in key2new:
        key2new[key] = len(key2new)

    old2new[tid] = key2new[key]

# Dense array mapping each original token ID to its canonical ID.
lookup = np.empty(vocab_size, dtype=np.int64)
for tid in range(vocab_size):
    lookup[tid] = old2new[tid]

Where the n-gram hash is called

The demo's n-gram hash routine is named _get_ngram_hashes. It is called by NgramHashMapping.hash, which first compresses the input IDs and then computes separate hash IDs for every configured Engram layer.

def hash(self, input_ids):
    input_ids = self.compressed_tokenizer(input_ids)
    hash_ids_for_all_layers = {}

    for layer_id in self.layer_ids:
        hash_ids_for_all_layers[layer_id] = self._get_ngram_hashes(
            input_ids,
            layer_id=layer_id,
        )

    return hash_ids_for_all_layers

Inside _get_ngram_hashes, the implementation forms shifted token views so that each position can see its local suffix. For a trigram-capable layer, the arrays are roughly current token, previous token, and token two steps back.

def shift_k(k):
    if k == 0:
        return x
    shifted = np.pad(
        x,
        ((0, 0), (k, 0)),
        mode="constant",
        constant_values=self.pad_id,
    )[:, :T]
    return shifted

base_shifts = [shift_k(k) for k in range(self.max_ngram_size)]

The actual indexing function is multiplicative-XOR followed by a per-head modulus. Each layer receives its own random odd multipliers, seeded from the layer ID, so identical n-grams can map differently in different layers.

for n in range(2, self.max_ngram_size + 1):
    n_gram_index = n - 2
    tokens = base_shifts[:n]

    mix = tokens[0] * multipliers[0]
    for k in range(1, n):
        mix = np.bitwise_xor(mix, tokens[k] * multipliers[k])

    for j, mod in enumerate(head_vocab_sizes):
        head_hash = mix % int(mod)
        all_hashes.append(head_hash.astype(np.int64, copy=False))

return np.stack(all_hashes, axis=2)

The demo chooses distinct prime table sizes for each head. That is a small but important engineering detail: if all heads used the same modulus, collisions would be correlated; different prime moduli reduce repeated collision structure.

Gathering rows

The row IDs returned by hashing have shape [B, T, H], where H = (N - 1)K. MultiHeadEmbedding stores all head tables in one embedding matrix and adds precomputed offsets so every head indexes its own region.

offsets = [0]
for table_size in list_of_N[:-1]:
    offsets.append(offsets[-1] + table_size)

shifted_input_ids = input_ids + self.offsets
rows = self.embedding(shifted_input_ids)

Then Engram.forward flattens the per-head vectors into a single memory vector per token:

hash_input_ids = torch.from_numpy(
    self.hash_mapping.hash(input_ids)[self.layer_id]
)

embeddings = self.multi_head_embedding(hash_input_ids)
embeddings = embeddings.flatten(start_dim=-2)

Branch-specific gating

The hidden state decides whether the retrieved memory is relevant. For every hyper-connection branch, Engram projects the memory into a key, compares it with the branch hidden state, and uses the score as a scalar gate on the value projection.

gates = []
for hc_idx in range(backbone_config.hc_mult):
    key = self.key_projs[hc_idx](embeddings)
    normed_key = self.norm1[hc_idx](key)

    query = hidden_states[:, :, hc_idx, :]
    normed_query = self.norm2[hc_idx](query)

    gate = (normed_key * normed_query).sum(dim=-1)
    gate = gate / math.sqrt(backbone_config.hidden_size)
    gate = gate.abs().clamp_min(1e-6).sqrt() * gate.sign()
    gate = gate.sigmoid().unsqueeze(-1)
    gates.append(gate)

gates = torch.stack(gates, dim=2)
value = gates * self.value_proj(embeddings).unsqueeze(2)

Short convolution and residual output

After gating, the demo applies grouped depthwise causal convolution over the branch dimension, then returns the memory update. The Transformer block adds that update to the current hidden state.

output = value + self.short_conv(value)
return output

A production implementation still needs distributed sparse table sharding, fused row gather, fused key/value projections, asynchronous host-memory prefetch, cache management, and careful handling of CPU-to-GPU transfer overlap. The demo makes the algorithm readable; it is not meant to be the final serving kernel.

Embedding scaling around Engram

Engram fits a broader shift: treat the representation interface as something to scale, rather than as fixed embedding plumbing.

  • FFN: implicit key-value memory inside ordinary Transformer blocks.
  • PKM: learned nearest-neighbor memory selected by hidden-state queries.
  • SCONE: offloaded frequent n-gram embeddings learned by an auxiliary model.
  • RAG: external document retrieval, editable but slower and less tightly integrated.
  • Engram: trainable parametric memory addressed by deterministic hashes of local token patterns.

The related work falls into three families. Some methods change the tokenizer so each model step carries more text. Some add larger input-side embedding tables while keeping output softmax cost controlled. Others add sparse lookup branches inside the network, closer to Engram.

arXiv 2502.01637

SCONE: Scaling Embedding Layers in Language Models

SCONE means Scalable, Contextualized, Offloaded, N-gram Embedding. It keeps the original token vocabulary but adds frequent n-gram embeddings. During training, a separate f-gram model learns contextualized vectors; during inference, those vectors are cached as a large off-accelerator lookup table. The important contrast with Engram is training: SCONE avoids instantiating a giant train-time table, while Engram directly trains hashed memory rows inside the model.

arXiv 2503.13423

SuperBPE: Space Travel for Language Models

SuperBPE changes BPE training rather than the Transformer. It first learns ordinary subwords, then removes the whitespace boundary so later merges can create superword tokens such as common multi-word expressions. This reduces token counts and can improve downstream performance because a model step can represent a more semantic chunk. It is related to Engram because both notice that phrase-level units often behave like atomic knowledge, but SuperBPE bakes them into the tokenizer.

arXiv 2501.16975

Over-Tokenized Transformer / Over-Encoding

Over-Encoding decouples input and output vocabularies. The input side receives a much larger hierarchical n-gram vocabulary, while the output softmax can remain smaller. The paper reports OE-1.2M and OE-12.8M input vocabularies and argues that input vocabulary scaling gives nearly log-linear loss improvements. It is a direct embedding-scaling result: more input lookup capacity improves the model without paying the full cost of a huge decoder vocabulary.

arXiv 2412.09871

Byte Latent Transformer: Patches Scale Better Than Tokens

BLT removes fixed-vocabulary tokenization altogether. It groups raw bytes into dynamically sized patches, often using entropy from a small byte model to decide where the next patch should start. The expensive global Transformer runs on patches, while local byte modules encode and decode. BLT is not an n-gram table method, but it is deeply relevant: it treats granularity as a scaling axis and reallocates compute away from predictable byte regions.

Google docs

Layer Embeddings / Gemma 3n Per-Layer Embeddings

Gemma 3n documents Per-Layer Embedding parameters that are used during execution to enhance each model layer. The public material frames PLE as an edge-device memory technique: keep only the core model hot on accelerator and cache or load layer-specific embedding parameters as needed. I did not find a standalone PLE paper, so the primary reference here is the official Gemma 3n documentation.

RWKV docs

DeepEmbed in RWKV-V8

RWKV's DeepEmbed preview adds token-indexed learned vectors inside every FFN layer and uses them as channelwise modulation. The stated deployment motivation is similar to embedding scaling: many parameters can live in RAM, SSD, or memory-mapped storage because each token activates only a tiny slice. I did not find a standalone DeepEmbed paper; the primary source is the RWKV architecture documentation and demo code references.

arXiv 2601.21204

LongCat-Flash-Lite: Scaling Embeddings Outperforms Scaling Experts

LongCat-Flash-Lite is the strongest production-scale neighbor: a 68.5B-parameter sparse MoE model with roughly 3B activated parameters and 31.4B parameters in n-gram embeddings. The paper argues that, in high-sparsity regimes, allocating parameters to n-gram lookup can beat adding more MoE experts. It also stresses the systems side: n-gram cache, optimized embedding lookup, kernel fusion, expert parallelism, and speculative decoding are needed to turn theoretical sparsity into real throughput.

Useful, not magic

Engram is not a replacement for reasoning, external retrieval, or careful training.

  • It stores parametric knowledge. Changing facts still needs fine-tuning, table editing, or another update mechanism.
  • Hash collisions are reduced by multiple heads, not eliminated.
  • The optimal MoE/Engram ratio is empirical and may shift with scale, data, tokenizer, and hardware.
  • It is strongest for local stereotyped patterns: names, entities, idioms, common code fragments, and frequent phrase structures.
  • Independent replication will matter because the systems benefits depend heavily on implementation quality.
Conditional memory does not replace computation. It stops computation from pretending to be a lookup table.

Sources

  1. Xin Cheng et al. Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models.
  2. DeepSeek-AI official Engram repository.
  3. Engram video by Jia-Bin Huang.
  4. Da Yu et al. Scaling Embedding Layers in Language Models.
  5. Alisa Liu et al. SuperBPE: Space Travel for Language Models.
  6. Hongzhi Huang et al. Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling.
  7. Artidoro Pagnoni et al. Byte Latent Transformer: Patches Scale Better Than Tokens.
  8. Hong Liu et al. Scaling Embeddings Outperforms Scaling Experts in Language Models.
  9. Google AI for Developers. Gemma 3n model overview.
  10. RWKV Wiki. RWKV Architecture History, DeepEmbed section.
  11. Mor Geva et al. Transformer Feed-Forward Layers Are Key-Value Memories.
  12. Guillaume Lample et al. Large Memory Layers with Product Keys.

 

 

Follow me on Twitter, Github or connect on LinkedIn.