Introduction

You can find all the C# code samples here: Embeddings GitHub Repository

Before we delve into the topic of multimodal embeddings with Azure Vision, I encourage you to read these posts first to get the most out of the information provided here:

Usually, when you start learning about embeddings and vector databases, you operate only in a single modality space. This means you either create embeddings of text and compare them with other text, or you do the same with other data modalities like images, video, or audio. In all of these cases, your embedding space can represent only a single modality.

But what if you wanted to:

  • instead of using filters in an image gallery, simply type a query like “show me all the pictures of Mars”
  • extract meaningful information from documents where key content is presented through images or diagrams

After reading this post, you will have enough knowledge to tackle this kind of challenge in your own project. Let’s get started!

The Two Approaches: Multimodal Embedding vs. Image Verbalization

There are two common techniques for addressing the challenge of multimodal embeddings. In this post, we will focus on multimodal embeddings with Azure Vision, while image verbalization will be covered in a separate one.

ℹ️ In this post, we will use text as the 1st embedding modality and images as the 2nd (leveraging Azure Vision's multimodal capabilities), but these concepts apply equally well to other combinations such as text-video, text-audio, or audio-image.

Multimodal Embeddings

Picture showing the multimodal embedding capabilities of the Azure Vision service.

The first approach is pretty straightforward. It relies entirely on an embedding model capable of vectorizing more than one data modality, text and images in our case, which allows both types of embeddings to live in the same vector space. This is extremely important because it enables us to compare text and image vectors.
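Because both kinds of vectors live in the same space, they can be compared with plain cosine similarity. Below is a minimal helper sketch (the method name is mine; it simply assumes both vectors have the same length, which is the case when they come from the same Azure Vision model version):

private static double CosineSimilarity(float[] a, float[] b)
{
    // Assumes both vectors have the same length (text and image vectors from the same model version do).
    double dot = 0, magnitudeA = 0, magnitudeB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magnitudeA += a[i] * a[i];
        magnitudeB += b[i] * b[i];
    }

    return dot / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
}

With such a helper, a text vector and an image vector can be compared exactly like two text vectors, which is what enables cross-modal search.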

Image Verbalization

Picture showing the concept of image verbalization, which is then leveraged during multimodal similarity search (text and image).

The image verbalization technique consists of two steps. In the 1st step, we ask an LLM to describe the image. We could use a system message such as:

Describe the image using clear, concise, factual language. Focus only on what is visually present. Avoid opinions, emotions, assumptions, or storytelling. Use simple sentences and keep the description short so it can be used as input for a text-embedding model.

Once we have a text description of the image, the 2nd step becomes a standard similarity search, because we compare vectors that live in the same vector space, in this case, pure text vectors.

Using Azure Vision for Multimodal Embeddings

Before creating multimodal embeddings with Azure Vision, you should first verify whether this feature is available in the region where your app is deployed (you can check it here).

Having this resource in the same region as your app is the ideal solution for achieving the best performance. However, if you cannot create it in the same region, consider placing it in the closest available region to where your app runs.

APIs: vectorizeText and vectorizeImage

The first idea that may come to mind when building such a solution is to find an official NuGet package for working with multimodal embeddings in Azure Vision. There is a package, Azure.AI.Vision.ImageAnalysis, but it does not support multimodal embedding generation.

This shouldn’t discourage you, because in the end all Azure-related NuGet packages are simply wrappers around the underlying REST APIs of each service.

Let’s create an HttpClient and call the REST API directly. As you can see, there are two paths defined: the 1st one targets the vectorizeText endpoint, and the 2nd one targets the vectorizeImage endpoint.

private readonly HttpClient httpClient = new()
{
    BaseAddress = new Uri("https://deployed-in-azure-vision.cognitiveservices.azure.com/")
};

private const string ENDPOINT_VECTORIZE_TEXT = "computervision/retrieval:vectorizeText?api-version=2024-02-01&model-version=2023-04-15";
private const string ENDPOINT_VECTORIZE_IMAGE = "computervision/retrieval:vectorizeImage?api-version=2024-02-01&model-version=2023-04-15";

As you can see, there are also two query parameters. I’m using the latest api-version (2024-02-01) and model-version (2023-04-15). This combination supports 102 languages when converting a text query into a vector. The older 2022-04-11 model version supports only English, so if you’re creating a new service, remember to use the latest api-version and model-version for full language coverage.

❗❗❗ Another really important thing is the limits of these two API endpoints (a simple validation sketch follows the list):

  • vectorizeText – the text must be between 1 and 70 words.
  • vectorizeImage – the file size must be less than 20 MB, and the image dimensions must be between 10×10 and 16,000×16,000 pixels.
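If you want to fail fast before calling the service, a simple guard clause along these lines can help. This is just a sketch: the whitespace-based word count and the 20 MB constant mirror the documented limits, but the service-side counting rules may differ slightly.

private static void EnsureWithinLimits(string? text = null, string? imagePath = null)
{
    if (text is not null)
    {
        // vectorizeText: the text must contain between 1 and 70 words.
        var wordCount = text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
        if (wordCount is < 1 or > 70)
        {
            throw new ArgumentException($"Text must contain 1-70 words, but has {wordCount}.");
        }
    }

    if (imagePath is not null)
    {
        // vectorizeImage: the file must be smaller than 20 MB
        // (dimensions should also fall within 10×10 – 16,000×16,000 pixels).
        var fileSize = new FileInfo(imagePath).Length;
        if (fileSize >= 20 * 1024 * 1024)
        {
            throw new ArgumentException($"Image must be smaller than 20 MB, but is {fileSize} bytes.");
        }
    }
}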

Let’s take a look at the functions responsible for vectorizing text and images. We’ll start with the text vectorization function:

private async Task<float[]> VectorizeTextAsync(string text)
{
    var payload = new
    {
        text = text
    };

    var response = await httpClient.PostAsJsonAsync(ENDPOINT_VECTORIZE_TEXT, payload);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadFromJsonAsync<AzureComputerVisionVectorizeResult>();
    return result?.Vector ?? throw new Exception("Something went wrong");
}

Below is the logic responsible for the image vectorization:

private async Task<float[]> VectorizeImageAsync(string imagePath)
{
    var content = new ByteArrayContent(File.ReadAllBytes(imagePath));
    content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

    var response = await httpClient.PostAsync(ENDPOINT_VECTORIZE_IMAGE, content);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadFromJsonAsync<AzureComputerVisionVectorizeResult>();
    return result?.Vector ?? throw new Exception("Something went wrong");
}
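Both methods deserialize the response into AzureComputerVisionVectorizeResult. The vectorizeText and vectorizeImage endpoints return a JSON object containing a modelVersion string and a vector array, so a minimal version of that type could look like this (a sketch; the record name simply matches the one used above):

// Requires: using System.Text.Json.Serialization;
public sealed record AzureComputerVisionVectorizeResult(
    [property: JsonPropertyName("modelVersion")] string ModelVersion,
    [property: JsonPropertyName("vector")] float[] Vector);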

Once we have set up the HttpClient and defined methods for vectorizing text and images, we can discuss two ways to authenticate our requests.

API key vs RBAC

The 1st method is to use an API key. To do this, you need to add Ocp-Apim-Subscription-Key as a default request header. Of course, you should treat the API key as a secret, so ideally you would retrieve it from Key Vault using Key Vault References. But… what’s the point of using an API key when we can leverage RBAC instead (read more about RBAC and managed identities here)?

private readonly HttpClient httpClient = new()
{
    BaseAddress = new Uri("https://deployed-in-azure-vision.cognitiveservices.azure.com/"),
    DefaultRequestHeaders =
    {
        { "Ocp-Apim-Subscription-Key", "apiKey" }
    }
};

If you decide to use RBAC (which I hope you will!), the first question which arises is: what role should I assign?

I am going to use the Cognitive Services Data Contributor role. This role is still in preview, though. Let’s take a look at its JSON definition (you can read more about how to interpret an RBAC role JSON definition here).

{
    "id": "/providers/Microsoft.Authorization/roleDefinitions/19c28022-e58e-450d-a464-0b2a53034789",
    "properties": {
        "roleName": "Cognitive Services Data Contributor (Preview)",
        "description": "Allows to call data plane APIs, but not any control plane APIs for Microsoft Cognitive Services. This role is in preview and subject to change.",
        "assignableScopes": [
            "/"
        ],
        "permissions": [
            {
                "actions": [],
                "notActions": [],
                "dataActions": [
                    "Microsoft.CognitiveServices/*"
                ],
                "notDataActions": []
            }
        ]
    }
}

As you can see, it allows you to invoke various data-plane operations (including multimodal embeddings with Azure Vision) across all Cognitive Services.

Once we have an RBAC role selected and assigned to a security principal (in my case this is my user account in Azure; in a real app it will most likely be a managed identity), the next thing we need to take care of is obtaining a token for a specific security scope.

var token = await new DefaultAzureCredential().GetTokenAsync(new Azure.Core.TokenRequestContext(["https://cognitiveservices.azure.com/.default"]));
httpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token.Token);

As you can see I am using DefaultAzureCredential (read more about this class here if you haven’t heard about it yet). I specify the security scope https://cognitiveservices.azure.com/.default and request a token. Once the token is retrieved, I simply add a new Authorization HTTP header.

ℹ️ One of the biggest advantages of using Azure-related NuGet packages is that when you specify DefaultAzureCredential (or any other TokenCredential), the library automatically retrieves a new token when the previous one expires. In this scenario, however, you would need to implement that logic yourself.
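A minimal sketch of that refresh logic could be a DelegatingHandler which caches the AccessToken and requests a new one shortly before it expires. The handler name and the 5-minute refresh window are my own choices, and for brevity there is no locking, so concurrent requests may occasionally refresh the token more than once.

internal sealed class CognitiveServicesTokenHandler : DelegatingHandler
{
    private static readonly string[] Scopes = ["https://cognitiveservices.azure.com/.default"];
    private readonly TokenCredential credential;
    private AccessToken cachedToken;

    public CognitiveServicesTokenHandler(TokenCredential credential)
        : base(new HttpClientHandler()) => this.credential = credential;

    protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Request a new token when none is cached or the current one expires within the next 5 minutes.
        if (cachedToken.ExpiresOn <= DateTimeOffset.UtcNow.AddMinutes(5))
        {
            cachedToken = await credential.GetTokenAsync(new TokenRequestContext(Scopes), cancellationToken);
        }

        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", cachedToken.Token);
        return await base.SendAsync(request, cancellationToken);
    }
}

The HttpClient can then be created with this handler, and every request gets a valid bearer token attached automatically:

private readonly HttpClient httpClient = new(new CognitiveServicesTokenHandler(new DefaultAzureCredential()))
{
    BaseAddress = new Uri("https://deployed-in-azure-vision.cognitiveservices.azure.com/")
};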

Ok, everything is configured, let’s see how it works!

Demo

The plan is as follows. I prepared four sample photos and four texts, one corresponding to each photo. We’ll run four samples to see how everything works in various combinations. I am going to use a simple local vector DB, DeployedInAzureVectorDb (you can find more info about it here), to focus on the relevant part and not on the various possible integrations (Azure AI Search, for example). A rough sketch of the Sample 1 wiring follows the text descriptions below.

Text descriptions (truncated, you can see the full texts in the C# sample):

  • Astronaut – “An astronaut in a white space suit stands on the dusty surface of the Moon during the Apollo 11 mission…
  • Coffee – “A ceramic cup of latte… placed on a glass-top outdoor table…
  • Mars – “A high-resolution image of the planet Mars, showcasing its reddish surface…
  • Mars Rover – “A robotic Mars rover equipped with scientific instruments, cameras…
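Here is the rough sketch of the Sample 1 wiring mentioned above. It uses the VectorizeTextAsync and VectorizeImageAsync methods together with the CosineSimilarity helper sketched earlier; the image paths and the shortened query text are placeholders, and the actual sample stores the vectors in DeployedInAzureVectorDb instead of comparing them by hand.

// Sample 1: compare the "Mars" text against each of the four images.
var marsTextVector = await VectorizeTextAsync("A high-resolution image of the planet Mars, showcasing its reddish surface...");

var images = new Dictionary<string, string>
{
    ["Mars"] = "images/mars.jpg",
    ["Mars Rover"] = "images/mars-rover.jpg",
    ["Astronaut"] = "images/astronaut.jpg",
    ["Coffee"] = "images/coffee.jpg"
};

foreach (var (name, path) in images)
{
    var imageVector = await VectorizeImageAsync(path);
    var similarity = CosineSimilarity(marsTextVector, imageVector);
    Console.WriteLine($"Mars text vs {name} image: {similarity:F2}");
}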

Sample 1 – Mars text vs each of the images

Mars text vs images:
- Mars text vs Mars image: 0.41
- Mars text vs Mars Rover image: 0.34
- Mars text vs Astronaut image: 0.30
- Mars text vs Coffee image: 0.25

As we can see, the ordering of the similarity results makes sense, but a question may come to mind: “How can I trust these results if the similarity score is relatively low?” Another valid question is: “I sometimes set a similarity threshold that must be exceeded; how does that apply here?” Both questions are reasonable, so let’s address them.

First of all, you should consider relevancy only within the context of the given query. As you’ll see later, some scenarios may return higher relevancy scores, but you shouldn’t jump to conclusions like, “Hmm… I’m seeing values around 0.40-0.50 here, while in another example I see 0.80-0.90, so something must be wrong”.

❗❗❗ The bottom line is: you cannot compare these scores across different queries, so do not treat them as a confidence level.

In this particular example, even though the scores are relatively low, the correct ordering is still preserved. It simply becomes a matter of taking the top N results.

Due to the reasons highlighted above, you cannot define a fixed “similarity threshold” that must be met. It simply wouldn’t work in this context.
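In code, “taking the top N results” is just an ordering over the scores rather than a threshold check, for example (the tuple shape is an illustration; the numbers are the Sample 1 scores from above):

// Scores from Sample 1 (Mars text vs each image).
var scores = new List<(string Label, double Score)>
{
    ("Mars image", 0.41),
    ("Mars Rover image", 0.34),
    ("Astronaut image", 0.30),
    ("Coffee image", 0.25)
};

// Take the two most similar items instead of applying a fixed similarity threshold.
var topMatches = scores
    .OrderByDescending(s => s.Score)
    .Take(2);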

Sample 2 – Mars image vs each of the texts

Mars image vs texts:
- Mars image vs Mars text: 0.41
- Mars image vs Mars Rover text: 0.34
- Mars image vs Astronaut text: 0.29
- Mars image vs Coffee text: 0.21

In this example, we compare the Mars image to each of the texts, and the correct ordering is still preserved (even though the scores are fairly low).

Sample 3 – Mars text vs each of the texts

Mars text vs other texts:
- Mars text vs Mars Rover text: 0.80
- Mars text vs Astronaut text: 0.75
- Mars text vs Coffee text: 0.60

Here we compare text-to-text vectors, and once again the correct order is preserved. You can also see that the similarity scores are much higher than in the previous examples. By now, you already know that you shouldn’t draw hasty conclusions based on that alone.

Sample 4 – Mars image vs each of the images

Mars image vs other images:
- Mars image vs Mars Rover image: 0.86
- Mars image vs Astronaut image: 0.79
- Mars image vs Coffee image: 0.66

In this final sample, we compare the Mars image against each of the other images. Once again, the ordering makes perfect sense: the Mars Rover image is the closest match, followed by the astronaut, and finally the coffee image.

Summary

I hope that after reading this post you have a clear sense of how multimodal embeddings with Azure Vision can fit into your own projects. The combination of text and image vectorization unlocks far more natural search experiences, and once you understand the API endpoints, authentication model, and the nuances of similarity scoring, the workflow becomes surprisingly approachable.

Now you’re ready to experiment, iterate, and bring these capabilities into your applications!

Thanks for reading and see you in the next post!
