Introduction

You can find all the C# code samples here: Embeddings GitHub Repository

Before we dive into Image Verbalization via LLMs, I recommend reading these earlier posts to get the most value from this guide:

If you’ve ever tried to build a search experience that spans both text and images, you’ve probably run into the same challenge everyone hits sooner or later: text embeddings and image embeddings live in different vector spaces, unless you use a multimodal model.

But what if you don’t have a multimodal embedding model available? Or what if you want a solution that works with any text-embedding model you already use?

That’s where image verbalization comes in.

In this post, we’ll walk through a practical, production-ready approach to converting images into descriptive text using an LLM, and then embedding that text using a standard text-embedding model. You’ll also see a complete C# example showing how to index verbalized images and compare them with text queries.

Multimodal Embedding vs. Image Verbalization

There are two common techniques for addressing the challenge of multi-modal vector search. In this post, we will focus on the Image Verbalization pattern using the gpt-4.1-mini LLM; the multimodal embeddings pattern was covered in this blog post, but let's briefly recap the ideas behind both.

Multimodal Embeddings

Picture showing the multimodal embeddings model capabilities with the Azure Vision service.

The Multimodal Embeddings approach is quite straightforward. It relies on a single embedding model that can process multiple data modalities, in this case text and images. This is essential because the model maps both modalities into a shared vector space, allowing us to directly compare text and image vectors.

Using Azure Vision, we can vectorize both data types through the vectorizeText and vectorizeImage APIs.

Image Verbalization

Picture showing the concept of image verbalization which is then leveraged during multi-modal similarity search (text and image).

The Image Verbalization technique is a two-step process that translates visual data into descriptive text so it can be queried using standard text embeddings.

Step 1: Translating the Image to Text (The Verbalization Phase)

Instead of embedding an image directly, the visual content (such as the picture of Mars) is first passed through a Large Language Model (LLM). To ensure the LLM generates the most useful description for a vector search index, it is guided by a strict system prompt. The model acts as an “Image Verbalization assistant,” tasked with generating a single, detailed paragraph by following specific rules:

  • Identify the Subject and Context: Clearly state what is in the image and its background setting.
  • Be Literal: Focus entirely on what is in the frame, ignoring stylistic elements like lighting, camera angles, or artistic flair.
  • Optimize for Search: Use standard terminology that a user is likely to type into a search engine (e.g., “image of the planet Mars”).

Step 2: Vectorizing the Text (The Embedding Phase)

Once the LLM generates this highly descriptive, literal text, it is passed into a standard text Embedding Model. Because the visual content is now in text format, a user’s text-based search query (like “photos of the planet Mars”) can be processed by that exact same embedding model.

As illustrated in the diagram, this maps both the verbalized image and the search query into a shared vector space. This shared space produces vector arrays (like [-0.78, 0.12 ... -0.16]) that allow for the mathematical measurement of similarity between the user’s text query and the original image.

Image Verbalization Step by Step

Now, let’s walk through the C# example step by step.

Registering ChatClient and EmbeddingClient

Models deployed in Microsoft Foundry, including the gpt-4.1-mini model and the text-embedding-ada-002 model.

Referring back to the Image Verbalization pattern diagram, I have defined two distinct models:

  • LLM (gpt-4.1-mini): Responsible for the verbalization of the image.
  • Embedding Model (text-embedding-ada-002): Responsible for converting the resulting text into a vector.

I have deployed both models to Microsoft Foundry. To interface with these deployments using C#, I first need to pull two specific NuGet packages:

  • Azure.AI.OpenAI: Provides the AzureOpenAIClient, EmbeddingClient, and ChatClient.
  • Azure.Identity: Provides the DefaultAzureCredential class for secure authentication.
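
If you are working from the .NET CLI, both packages can be added with the commands below (no versions pinned here, so the latest stable releases are pulled):

dotnet add package Azure.AI.OpenAI
dotnet add package Azure.Identity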

To interact with the LLM, I use the ChatClient, while the EmbeddingClient is used to generate the necessary embeddings.

public class ImageVerbalizationViaLLMsExample
{
    private readonly EmbeddingClient _embeddingClient;
    private readonly ChatClient _chatClient;

    public ImageVerbalizationViaLLMsExample()
    {
        var openAiClient = new AzureOpenAIClient(
            new Uri(Environment.GetEnvironmentVariable("AZURE_OPEN_AI_CLIENT_URI")!),
            new DefaultAzureCredential());

        _embeddingClient = openAiClient.GetEmbeddingClient(Environment.GetEnvironmentVariable("AZURE_OPEN_AI_EMBEDDING_CLIENT_DEPLOYMENT_NAME")!);
        _chatClient = openAiClient.GetChatClient(Environment.GetEnvironmentVariable("AZURE_OPEN_AI_EMBEDDING_CHAT_CLIENT_DEPLOYMENT_NAME")!);
    }
}

Authentication and Security

You will notice that I am not using API keys; instead, I rely on the DefaultAzureCredential class. I have authenticated to my Azure subscription directly within Visual Studio via Tools > Options > Azure Service Authentication.

Because of this setup, DefaultAzureCredential automatically retrieves a token by invoking the VisualStudioCredential class behind the scenes. For this to work, I have assigned the Azure AI User RBAC role to my security principal.

Verbalization System Prompt

Crafting an effective verbalization prompt is essential for the translation process to work smoothly. A strong prompt must capture both the main subject and the broader context of the image, condensing that information into a single, compact paragraph. It should focus entirely on literal content rather than artistic style, actively avoiding imaginative storytelling or unsupported assumptions. Furthermore, embedding search-friendly terminology directly into the instructions ensures the resulting text closely matches what a typical user might enter into a search engine.

A strong verbalization prompt should:

  • Focus on literal content, not artistic style
  • Avoid assumptions or storytelling
  • Use search‑friendly terminology
  • Produce a single, compact paragraph
  • Capture the subject and context of the image

The system prompt I have used in the C# example:

private static string GetSystemPromptText()
{
    return """
    You are an Image Verbalization assistant; your goal is to translate visual content into descriptive, searchable text.

    Instructions:
    - Identify the Subject: State clearly what is in the image.
    - Identify Context: Mention the setting or significant background elements.
    - Be Literal, Not Stylistic: Focus on 'what' is in the frame, not 'how' it looks. Ignore artistic style, lighting, or camera angles.
    - Search-Optimized: Use standard terminology that a user would likely type into a search engine.

    Output Format:
    - Provide a single, detailed paragraph (2-3 sentences) that captures the essence of the image for a vector search index.
    """;
}

Verbalizing images

private async Task<string> VerbalizeImageAsync(string imagePath)
{
    byte[] imageBytes = await File.ReadAllBytesAsync(imagePath);

    var messageParts = new List<ChatMessageContentPart>
    {
        ChatMessageContentPart.CreateTextPart("Describe this image."),
        ChatMessageContentPart.CreateImagePart(BinaryData.FromBytes(imageBytes), "image/jpeg")
    };

    var chatMessages = new List<ChatMessage>
    {
        new SystemChatMessage(GetSystemPromptText()),
        new UserChatMessage(messageParts)
    };

    ChatCompletion completion = await _chatClient.CompleteChatAsync(chatMessages);

    var text = completion.Content.FirstOrDefault()?.Text ?? string.Empty;
    Console.WriteLine($"Image verbalized ({Path.GetFileName(imagePath)}): {text}\n");

    return text;
}

To implement the verbalization phase, I use a method that combines the raw image data with the specialized system prompt I defined earlier. This process sends the image bytes directly to the gpt-4.1-mini model, which acts as an intelligent “translator” by converting visual features into a concise, search-optimized paragraph. Once the model returns this descriptive text, I can proceed to use it as the source for my text-based embedding.

public async Task Run()
{
    var astronautImageText = await VerbalizeImageAsync(GetFilePath("astronaut.jpg"));
    var coffeeImageText = await VerbalizeImageAsync(GetFilePath("coffee.jpg"));
    var marsImageText = await VerbalizeImageAsync(GetFilePath("mars.jpg"));
    var marsRoverImageText = await VerbalizeImageAsync(GetFilePath("mars_rover.jpg"));

    var verbalizedImageAstronaut = await GetTextEmbeddingAsync(astronautImageText);
    var verbalizedImageCoffee = await GetTextEmbeddingAsync(coffeeImageText);
    var verbalizedImageMars = await GetTextEmbeddingAsync(marsImageText);
    var verbalizedImageMarsRover = await GetTextEmbeddingAsync(marsRoverImageText);
}
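
The Run method relies on a small GetTextEmbeddingAsync helper, which is not shown above. A minimal sketch using the EmbeddingClient registered earlier could look like this (the exact signature in the repository may differ):

private async Task<ReadOnlyMemory<float>> GetTextEmbeddingAsync(string text)
{
    // Generate a vector for the verbalized description (or, later, for the search query)
    // using the text-embedding-ada-002 deployment behind _embeddingClient.
    OpenAIEmbedding embedding = await _embeddingClient.GenerateEmbeddingAsync(text);
    return embedding.ToFloats();
}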

Below are the images I have selected to demonstrate this pattern.

Once the VerbalizeImageAsync method is executed, the LLM generates highly descriptive, search-optimized strings for each image. These descriptions serve as the textual bridge that allows us to perform standard text-based vector searches against visual content.

Image verbalized (astronaut.jpg): The image shows an astronaut in a full white spacesuit standing on the surface of the moon, with the lunar landscape visible beneath their feet. The spacesuit features the American flag on the shoulder, and the lunar surface displays footprints and rocky texture, with a dark sky overhead. The reflective helmet visor partially shows the reflection of the surroundings.

Image verbalized (coffee.jpg): The image shows a cup of latte with latte art in a patterned cup and saucer set on a textured glass table outdoors. Next to the coffee, there is a small blue plate with a dessert topped with almond slices and a fork resting beside it. The background features green foliage, suggesting a garden or patio setting.

Image verbalized (mars.jpg): The image shows a detailed view of the planet Mars, highlighting its reddish surface with numerous impact craters and dark volcanic regions. The planet is set against a completely black background, allowing a clear observation of Mars' topography, including variations in color and surface texture. This image serves as a reference for studying the Martian surface features and planetary characteristics.

Image verbalized (mars_rover.jpg): The image shows a Mars rover with six wheels and a tall robotic arm mounted on a panel of solar panels, positioned on the rocky surface of Mars. The background features a barren, reddish-brown Martian landscape with hills in the distance under a sky with a gradient of earthy tones. The rover is equipped with various scientific instruments and cameras, designed for exploration and analysis on the Martian terrain.

Final results

To validate the effectiveness of this pattern, I compare these verbalized descriptions against a set of simple, human-written text queries. I then calculate the cosine similarity between the vectors of the verbalized image and the vectors of these sample texts:

var astronautText = "astronaut on the moon";
var coffeeText = "latte and cake outside";
var marsText = "planet mars from space";
var marsRoverText = "mars rover on rocky surface";
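
The similarity calculation itself is also not shown above. Here is a straightforward helper that works with the ReadOnlyMemory<float> vectors produced by the embedding client; it is a plain manual implementation rather than necessarily the exact code from the repository:

private static double CosineSimilarity(ReadOnlyMemory<float> a, ReadOnlyMemory<float> b)
{
    var x = a.Span;
    var y = b.Span;

    double dot = 0, magnitudeX = 0, magnitudeY = 0;
    for (int i = 0; i < x.Length; i++)
    {
        dot += x[i] * y[i];
        magnitudeX += x[i] * x[i];
        magnitudeY += y[i] * y[i];
    }

    // Cosine similarity: dot product divided by the product of the vector magnitudes.
    return dot / (Math.Sqrt(magnitudeX) * Math.Sqrt(magnitudeY));
}

For example, the first score below is obtained by comparing the embedding of astronautText with verbalizedImageAstronaut.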

Below are the final similarity scores. A score closer to 1.00 indicates a near-perfect semantic match.

(Text) Astronaut vs images (image verbalization):
- (Text) Astronaut vs (Verbalized) Astronaut image: 0.91
- (Text) Astronaut vs (Verbalized) Mars Rover image: 0.82
- (Text) Astronaut vs (Verbalized) Mars image: 0.80
- (Text) Astronaut vs (Verbalized) Coffee image: 0.76

(Text) Coffee vs images (image verbalization):
- (Text) Coffee vs (Verbalized) Coffee image: 0.89
- (Text) Coffee vs (Verbalized) Astronaut image: 0.76
- (Text) Coffee vs (Verbalized) Mars Rover image: 0.75
- (Text) Coffee vs (Verbalized) Mars image: 0.72

(Text) Mars vs images (image verbalization):
- (Text) Mars vs (Verbalized) Mars image: 0.86
- (Text) Mars vs (Verbalized) Mars Rover image: 0.85
- (Text) Mars vs (Verbalized) Astronaut image: 0.85
- (Text) Mars vs (Verbalized) Coffee image: 0.76

(Text) Mars Rover vs images (image verbalization):
- (Text) Mars Rover vs (Verbalized) Mars Rover image: 0.91
- (Text) Mars Rover vs (Verbalized) Mars image: 0.85
- (Text) Mars Rover vs (Verbalized) Astronaut image: 0.84
- (Text) Mars Rover vs (Verbalized) Coffee image: 0.75

As these results demonstrate, the ranking order is consistently accurate, with each query successfully identifying its corresponding image as the most relevant match.

Summary

I hope that after reading this post you have a clear sense of how the Image Verbalization pattern can fit into your own projects. By using LLMs to translate visual context into searchable text, we can unlock highly descriptive and natural search experiences even without a native multimodal embedding model. Once you understand how to design a literal, search-optimized prompt and integrate it with your existing text embedding workflow, the entire process becomes surprisingly approachable.

Now you’re ready to experiment, iterate, and bring these verbalization capabilities into your applications!

Thanks for reading and see you in the next post!
