These are just short notes / excerpts from the technical report for quick lookup. Actual implementation details are missing from the report anyway.
Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer-based model (which seems to be the trend these days, after the GPT-4 rumors and Mixtral).
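Since the report gives no architectural specifics, here is only a generic sketch of what "sparse MoE" means: each token is routed to a small subset of expert feed-forward blocks, so compute stays roughly constant while total parameters grow. All dimensions and the top-2 choice below are my own illustrative assumptions, not details from the report.

```python
import numpy as np

def sparse_moe_layer(x, expert_weights, router_weights, top_k=2):
    """Illustrative sparse MoE layer: each token only uses its top-k experts.
    Gate probabilities are not renormalized over the top-k, for brevity."""
    logits = x @ router_weights                              # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                    # softmax router
    topk = np.argsort(probs, axis=-1)[:, -top_k:]            # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in topk[t]:
            out[t] += probs[t, e] * (x[t] @ expert_weights[e])  # weighted expert output
    return out

# Toy usage: 8 experts, model dim 16, 4 tokens (all made-up numbers).
rng = np.random.default_rng(0)
d, n_exp = 16, 8
x = rng.normal(size=(4, d))
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_exp)]
router = rng.normal(size=(d, n_exp)) * 0.1
print(sparse_moe_layer(x, experts, router).shape)  # (4, 16)
```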
It supports a multimodal context length of up to 10M tokens, which is nearly a day of audio recordings (about 22 hours), more than ten times the entirety of the 1440-page book “War and Peace”, the entire Flax codebase (41,070 lines of code), or three hours of video at 1 frame per second.
Near-perfect retrieval across the full context length. Across modalities, i.e., text, video, and audio, it achieves 100% needle-in-a-haystack recall up to 530k tokens, 99.7% recall up to 1M tokens, and 99.2% at 10M tokens.
However, with about 100 needles (where the model is required to retrieve all of them), performance drops to ~70% recall up to 128K tokens and >60% recall up to 1M tokens. GPT-4 sits at ~50% recall at 128K tokens (its maximum context length).
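For intuition, a text needle-in-a-haystack eval is roughly set up as below: hide a "needle" sentence at varying depths inside a long filler context and check whether the model can retrieve it. The `query_model` argument is a hypothetical stand-in for whatever model API is used; none of this code is from the report.

```python
import random

def build_haystack(filler_sentences, needle, depth_fraction, target_sentences):
    """Insert the needle at a given relative depth inside a long filler context."""
    haystack = random.choices(filler_sentences, k=target_sentences)
    haystack.insert(int(depth_fraction * len(haystack)), needle)
    return " ".join(haystack)

def needle_recall(query_model, filler_sentences, depths, target_sentences):
    """Fraction of needle placements where the model retrieves the magic number."""
    needle = "The special magic number mentioned in this document is 874193."
    question = "\n\nWhat is the special magic number mentioned in the document?"
    hits = 0
    for d in depths:
        context = build_haystack(filler_sentences, needle, d, target_sentences)
        answer = query_model(context + question)   # hypothetical model call
        hits += "874193" in answer
    return hits / len(depths)
```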
Gemini 1.5 Pro beats RAG-based systems across modalities, although this may not be cost-efficient right now unless there are multiple queries over the same context and the long context can be prefix-cached.
Excellent use of context for learning new skills. It was able to learn to translate between English and Kalamang, an extremely low-resource language (spoken by fewer than 200 people in the world), given only a single set of linguistic documentation, a dictionary, and ≈400 parallel sentences in context. That’s it!
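Mechanically, this kind of in-context skill acquisition is just stuffing all of the reference material into one long prompt. A rough sketch of that assembly; the file names and prompt wording are made-up placeholders, not the actual setup from the report.

```python
from pathlib import Path

def build_translation_prompt(sentence_to_translate):
    """Concatenate the grammar documentation, dictionary, and parallel sentences
    into one long-context prompt (all paths here are hypothetical placeholders)."""
    grammar = Path("kalamang_grammar.txt").read_text()               # linguistic documentation
    dictionary = Path("kalamang_dictionary.txt").read_text()          # bilingual word list
    parallel = Path("kalamang_parallel_sentences.txt").read_text()    # ~400 sentence pairs
    return (
        "You are given reference material for Kalamang, a very low-resource language.\n\n"
        f"# Grammar documentation\n{grammar}\n\n"
        f"# Dictionary\n{dictionary}\n\n"
        f"# Parallel sentences (Kalamang <-> English)\n{parallel}\n\n"
        f"Translate the following sentence into Kalamang:\n{sentence_to_translate}\n"
    )

# The resulting string is then sent to the model as a single long-context request.
```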
Gemini 1.5 Pro exceeds Gemini 1.0 Ultra on text. The latter is still better on vision and audio tasks, but 1.5 Pro's compute efficiency and the speed with which it shipped suggest that some form of distillation may be in the picture.
Figures 4 and 5 in the technical report are super cool. In Figure 4, Gemini 1.5 is able to identify the exact scene, along with its page number, in a book, given a hand-drawn sketch as the prompt.
In Figure 5 they show that it is able to do the same for video as well: the model returns a detailed description of the scene along with its timestamp.
The cumulative average negative log-likelihood (NLL), as a function of token position, follows a power law quite accurately up to about 1M tokens for long documents and about 2M tokens for code, but deviates at the 10M-token scale.
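As an illustration of what "follows a power law" means here, one can fit NLL(x) ≈ a·x^(−b) + c to the measured cumulative average NLL at each token position x. The data below is synthetic and the exact parametrization is my assumption; the report only shows the fitted curves.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # Cumulative average NLL modeled as a * x^(-b) + c: loss keeps improving
    # with more preceding context, with diminishing returns.
    return a * np.power(x, -b) + c

# Synthetic stand-in for measured (token position, cumulative average NLL) pairs.
positions = np.logspace(2, 6, 50)                      # 100 .. 1M token positions
nll = power_law(positions, a=2.0, b=0.15, c=0.8)
nll += np.random.default_rng(0).normal(scale=0.01, size=nll.shape)

params, _ = curve_fit(power_law, positions, nll, p0=(1.0, 0.1, 1.0))
print("fitted a, b, c:", params)
```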