Google releases Gemini Embedding 2, its first natively multimodal embedding model


On March 10th, Google DeepMind launched Gemini Embedding 2, its first natively multimodal embedding model. It maps text, images, video, audio, and documents into a single embedding space, marking a new stage of full-modal integration in AI embedding technology.

Gemini Embedding 2 supports semantic understanding in over 100 languages and outperforms existing mainstream models in benchmarks for text, image, and video tasks. It also introduces speech processing capabilities previously lacking in embedding models.

The model is now available for public preview through the Gemini API and Vertex AI, allowing developers to access it immediately.

For enterprise users, this release lowers the technical barriers to building multimodal retrieval-augmented generation (RAG), semantic search, and data classification systems, potentially simplifying complex data pipelines that previously required separate processing for each modality.

Unified Multimodal Embeddings: Expanding from Text to Five Media Types

Built on the Gemini architecture, Gemini Embedding 2 extends embedding capabilities from pure text to five input types:

  • Text: up to 8,192 input tokens;
  • Images: up to 6 per request, in PNG or JPEG format;
  • Video: MP4 and MOV files up to 120 seconds long;
  • Audio: ingested directly into embedding vectors, with no intermediate transcription step;
  • Documents: PDF files up to 6 pages.

Unlike traditional methods that handle each modality separately, this model supports interleaved inputs, meaning multiple modalities such as images and text can be submitted simultaneously in a single request. This enables the model to capture complex and subtle semantic relationships across different media types.
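
As a concrete illustration, here is a minimal sketch of what such an interleaved request might look like through the Gemini API's Python SDK (google-genai). The model name gemini-embedding-2 and image support in embed_content are assumptions based on this announcement, not confirmed API details.

```python
# Hypothetical sketch: one interleaved text + image embedding request.
# Assumes the embed_content endpoint accepts image parts for this model;
# the model identifier below is an assumption.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("product_photo.jpg", "rb") as f:
    image_bytes = f.read()

# Text and image are submitted together in a single request, letting the
# model embed their combined semantics rather than each part in isolation.
result = client.models.embed_content(
    model="gemini-embedding-2",  # assumed identifier
    contents=[
        "Red trail-running shoe, size 42, lightly used",
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
    ],
)

print(len(result.embeddings[0].values))  # default output: 3,072 dimensions
```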


Gemini Embedding 2 continues to use Google's previously developed Matryoshka Representation Learning (MRL) technique. MRL trains the embedding so that nested prefixes of the full vector are themselves usable representations, which allows the default 3,072 output dimensions to be truncated flexibly and helps developers balance model performance against storage costs.
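
In practical terms, a Matryoshka-trained embedding can be shortened on the client side simply by keeping its leading dimensions. A minimal sketch of that idea, assuming the usual MRL convention of slicing the prefix and re-normalizing before cosine similarity:

```python
import numpy as np

def truncate_embedding(vec: list[float], dim: int) -> np.ndarray:
    """Keep the leading `dim` dimensions of an MRL embedding and re-normalize.

    With Matryoshka-trained models, the first `dim` components already form
    a coarser but still meaningful embedding; re-normalizing keeps cosine
    similarity well-behaved after truncation.
    """
    truncated = np.asarray(vec[:dim], dtype=np.float32)
    return truncated / np.linalg.norm(truncated)

full = np.random.randn(3072).tolist()    # stand-in for a real 3,072-dim vector
compact = truncate_embedding(full, 768)  # 4x smaller storage footprint
print(compact.shape)                     # (768,)
```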

Benchmark Performance and New Speech Capabilities

Google states that Gemini Embedding 2 outperforms current leading models in benchmarks for text, image, and video tasks, establishing a new performance standard in multimodal embedding.

Google recommends that developers choose among 3,072, 1,536, or 768 output dimensions according to application needs. This flexibility is especially important for enterprises storing embeddings at scale, since lower dimensions cut storage and retrieval costs without significantly sacrificing accuracy.
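
If the endpoint mirrors Google's existing embedding API, the lower dimension can also be requested server-side instead of being truncated locally. The output_dimensionality option below exists in the current google-genai SDK; its availability for this particular model is an assumption:

```python
# Sketch: request a 768-dimensional embedding directly, assuming this model
# honors output_dimensionality the way current Gemini embedding models do.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

result = client.models.embed_content(
    model="gemini-embedding-2",  # assumed identifier
    contents="How do I rotate my API keys?",
    config=types.EmbedContentConfig(output_dimensionality=768),
)

print(len(result.embeddings[0].values))  # 768
```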

In terms of capabilities, the model introduces native speech embedding, a feature often missing in similar models. It can process audio data directly without relying on speech-to-text conversion.

Google notes that embedding technology is already widely used across its products, including context engineering in RAG scenarios, large-scale data management, and traditional search and analytics.

Some early access partners are already building multi-modal applications based on Gemini Embedding 2. Google states these use cases are demonstrating the model’s practical potential in high-value scenarios.
