In the previous post, we began exploring the Azure AI Search service and covered its fundamentals. Now it’s the right time to dive deeper into two essential building blocks of the service: Data Sources and Indexers. Gaining a solid understanding of these features is especially important at the early stages of working with Azure AI Search. In this blog post, I am going to deliberately set aside the more complicated capabilities and focus on the core concepts first. We will cover the more sophisticated features in the next posts.

Introduction to Azure AI Search Indexing

Diagram showing data flow in Azure AI Search when using indexers and the so-called pull approach

Azure AI Search enables building powerful search experiences by indexing structured and unstructured data. Indexing is the process of making data searchable, and automation ensures that updates are captured without manual intervention. You can think of an indexer as a crawler responsible for extracting textual data from various data sources.

It is worth remembering that an indexer can process only one data source at a time and is limited to writing to a single index. While each indexer handles a single source, the search index itself can aggregate content from multiple data sources, with each job contributing full documents or populating specific fields.

What Is the Pull Approach?

The pull approach uses indexers to automatically connect to supported data sources and ingest content. Instead of applications pushing data using the Push REST API, Azure AI Search “pulls” it in on a schedule, reducing operational overhead.

How Indexers Work

Indexers run on a schedule (daily, hourly, or at custom intervals) to detect changes. It is also possible to invoke an indexer on demand. They support incremental indexing, meaning only new or updated records are processed, which improves efficiency.
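
On-demand invocation goes through the service’s REST API (POST to the indexer’s run endpoint). Below is a minimal Python sketch that only builds the request; the service name and admin key are placeholders, and issuing the actual HTTP call is left to whatever client you prefer.

```python
# Build (but don't send) the "run indexer" REST request for Azure AI Search.
# SERVICE and the admin key are hypothetical placeholders.
SERVICE = "deployed-in-azure-aisearch"
INDEXER = "space-entities-indexer"
API_VERSION = "2024-07-01"

def build_run_request(service: str, indexer: str, api_key: str) -> dict:
    """Return the method, URL, and headers for triggering an indexer run."""
    return {
        "method": "POST",
        "url": f"https://{service}.search.windows.net/indexers/{indexer}/run"
               f"?api-version={API_VERSION}",
        "headers": {"api-key": api_key, "Content-Type": "application/json"},
    }

req = build_run_request(SERVICE, INDEXER, "<admin-key>")
# An HTTP client (e.g. requests.post(req["url"], headers=req["headers"]))
# would then issue the call; a 202 response means the run was accepted.
```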

We can split indexer execution into five stages:

  • Document Cracking – raw files are parsed to extract text and metadata, making the content machine-readable.
  • Field Mappings – extracted data is aligned with the defined index schema.
  • Skillset execution – an optional step that can trigger various AI processing services such as Azure AI Vision (OCR), Azure AI Language (NLP), Azure AI Document Intelligence, and others. It can also invoke a custom skill (#Microsoft.Skills.Custom.WebApiSkill) in which you can implement any transformation you want.
  • Output field mappings – used when skillsets are applied; maps the enriched outputs into the final index fields.
  • Push into index – the processed documents are committed into the search index, making them queryable.
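
The five stages above can be sketched as a toy, in-memory pipeline. Everything here (the field names, the pretend skill output) is purely illustrative; a real indexer performs these steps server-side.

```python
# Toy walk-through of the five indexer stages on a single record.
raw = {"id": "1", "name": "Sirius", "type": "Star", "distance_light_years": 8.6}

# 1. Document cracking: parse the raw record into text + metadata.
cracked = {"content": f"{raw['name']} is a {raw['type']}", "metadata": raw}

# 2. Field mappings: align source fields with the (hypothetical) index schema.
doc = {"id": raw["id"], "name": raw["name"], "objectType": raw["type"]}

# 3. Skillset execution (optional): pretend output of an AI enrichment skill.
enriched = {"language": "en"}

# 4. Output field mappings: copy enriched values into index fields.
doc["language"] = enriched["language"]

# 5. Push into index: commit the document so it becomes queryable.
index = {}
index[doc["id"]] = doc
```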

ℹ️ Because indexers do not run in an isolated environment but share resources with query processing, higher levels of query throttling may occur when the service is under stress. This is especially noticeable when an indexer processes the entire data set, which is why understanding the concept of incremental indexing is important.

Incremental Indexing

Let’s imagine that there are 10 million records in the Azure Cosmos DB container that you referenced in the data source definition. You can easily imagine that if an indexer had to index that data from scratch on every run, it could significantly affect the performance of the service itself, and indirectly of your application as well. Not to mention that, over a short period of time, usually only a subset of records changes (although for very large data sets that subset can still be a big number). Incremental indexing addresses this challenge by indexing just the records that have changed since the last execution. It’s a pretty common technique used in many places in computer systems.

The mechanism relies on a field (pay attention to the @HighWaterMark parameter in the data source definition in the next section) that tells the indexer when the last data modification occurred for a given record. Using this field together with the watermark value persisted after the previous run, it can easily determine which records have changed since then. It’s as simple as that, no magic behind the scenes!
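
A minimal sketch of the mechanism, assuming each record carries a Cosmos DB-style _ts last-modified timestamp:

```python
# Simplified model of high-watermark change detection: only records modified
# after the persisted watermark are re-indexed, and the watermark advances.
def incremental_batch(records, high_water_mark):
    """Return records changed since the last run, plus the new watermark
    to persist for the next run."""
    changed = sorted(
        (r for r in records if r["_ts"] > high_water_mark),
        key=lambda r: r["_ts"],
    )
    new_mark = changed[-1]["_ts"] if changed else high_water_mark
    return changed, new_mark

records = [
    {"id": "1", "name": "Sirius", "_ts": 1765262600},
    {"id": "2", "name": "Andromeda", "_ts": 1765262700},
]
# Only the record modified after the watermark is picked up.
batch, mark = incremental_batch(records, high_water_mark=1765262650)
```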

ℹ️ For data sources such as Azure SQL Database or Azure Cosmos DB, change detection must be enabled, whereas for Azure Storage Blob data it is automatic.

Demo

First, we generate a sample dataset representing stars, galaxies, nebulae, planets, and other celestial objects, and import it into Azure Cosmos DB. Below is an example of a sample document representing a star – 🌟 Sirius 🌟. Note that fields prefixed with an underscore (_) are added automatically, including the _ts field, which is essential for incremental indexing.

{
    "id": "1",
    "name": "Sirius",
    "type": "Star",
    "distance_light_years": 8.6,
    "_rid": "9oRCANrQQLABAAAAAAAAAA==",
    "_self": "dbs/9oRCAA==/colls/9oRCANrQQLA=/docs/9oRCANrQQLABAAAAAAAAAA==/",
    "_etag": "\"000059c4-0000-5600-0000-6937c5080000\"",
    "_attachments": "attachments/",
    "_ts": 1765262600
}

Once we have sample data in Azure Cosmos DB, we can start creating a new data source and a new indexer. To do so, I recommend starting with the visual flow, which simplifies the process. When we open Cosmos DB, there will be an ‘Add Azure AI Search’ option under the ‘Integrations’ tab. Click on it and follow the steps.

Azure Cosmos DB service in Azure Portal showing 'Add Azure AI Search' option for automatic integration

Once you complete the operation you can open the selected Azure AI Search instance and you should see a newly added data source and indexer.

Azure AI Search service in Azure Portal showing data sources section with a sample data source
Azure AI Search service in Azure Portal showing indexers section with a sample indexer

All the settings related to a data source object or an indexer are stored in JSON files. You can view these files by opening either a data source or an indexer and then clicking ‘Edit JSON’. Let’s analyze these files to make everything clear, focusing on the most relevant elements. We’ll begin with a data source definition.

{
  "@odata.context": "https://deployed-in-azure-aisearch.search.windows.net/$metadata#datasources/$entity",
  "@odata.etag": "\"0x8DE36EF7A8A52EA\"",
  "name": "space-entities",
  "description": "This data source contains a curated collection of space-themed records representing stars, galaxies, nebulae, planets, and other celestial objects.",
  "type": "cosmosdb",
  "subtype": null,
  "indexerPermissionOptions": [],
  "credentials": {
    "connectionString": "AccountEndpoint=https://deployed-in-azure-cosmosdb.documents.azure.com;AccountKey=...;Database=MyDatabase"
  },
  "container": {
    "name": "MyContainer",
    "query": "SELECT * FROM c WHERE c._ts > @HighWaterMark ORDER BY c._ts"
  },
  "dataChangeDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
    "highWaterMarkColumnName": "_ts"
  },
  "dataDeletionDetectionPolicy": null,
  "encryptionKey": null,
  "identity": null
}
  • name – Name of the data source, which can be treated as an identifier.
  • type – There are various types of data sources; in this case we selected Azure Cosmos DB, so the value is cosmosdb.
  • credentials – All the settings required to establish a connection to the data source. We used an account key, but in a real production app you should always use a managed identity.
  • container/name – Name of the container in the selected Cosmos DB instance.
  • container/query – Query invoked when data is being pulled. You can easily apply simple data transformations here; it does not have to be a one-to-one mapping of the Cosmos DB document structure. Note the @HighWaterMark query parameter.
  • dataChangeDetectionPolicy/highWaterMarkColumnName – A high watermark is a checkpoint that records the last successfully indexed item, ensuring subsequent runs only pick up new or changed data. It must be set to the system _ts property when Azure Cosmos DB is selected as a data source.
  • dataDeletionDetectionPolicy – Imagine you use the soft-deletion concept in your app and there is an IsDeleted property in every document. You can specify that property here to mark the corresponding documents in the index as deleted:

    "dataDeletionDetectionPolicy": {
      "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
      "softDeleteColumnName": "isDeleted",
      "softDeleteMarkerValue": "true"
    }
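
To illustrate what such a policy does, here is an in-memory sketch (the column name and marker value mirror the soft-delete snippet above; the index itself is just a dict): documents whose marker column equals the marker value are removed from the index instead of being upserted.

```python
# Toy model of a soft-delete deletion detection policy.
def apply_soft_delete(index: dict, docs: list,
                      column: str = "isDeleted", marker: str = "true") -> dict:
    """Upsert live documents; drop index entries flagged as soft-deleted."""
    for doc in docs:
        if str(doc.get(column, "")).lower() == marker:
            index.pop(doc["id"], None)   # deleted at the source -> remove
        else:
            index[doc["id"]] = doc       # new or updated -> upsert
    return index

index = {"1": {"id": "1", "name": "Sirius"}}
changed = [
    {"id": "1", "name": "Sirius", "isDeleted": "true"},     # soft-deleted
    {"id": "2", "name": "Andromeda", "isDeleted": "false"}, # still live
]
apply_soft_delete(index, changed)
# Sirius is now gone from the index; Andromeda was upserted.
```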

Let’s analyze the indexer JSON definition now.

{
  "@odata.context": "https://deployed-in-azure-aisearch.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DE3761EEBA1FDE\"",
  "name": "space-entities-indexer",
  "description": "",
  "dataSourceName": "space-entities",
  "skillsetName": null,
  "targetIndexName": "space-entities-index",
  "disabled": null,
  "schedule": {
    "interval": "PT10M",
    "startTime": "2025-12-09T20:31:28.564Z"
  },
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "configuration": {
      "assumeOrderByHighWaterMarkColumn": true
    }
  },
  "fieldMappings": [],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}
  • name – The name of the indexer.
  • dataSourceName – The name of the data source from which data is pulled when the indexer is triggered. It matches the name element in the data source definition.
  • targetIndexName – The name of the index where the data is saved.
  • schedule – Defines how often the indexer runs. The current setting is every 10 minutes.
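
The interval value uses ISO 8601 duration syntax (PT10M means every 10 minutes). As a quick illustration, here is a small Python sketch that parses a subset of that syntax and projects the next runs from the configured startTime; the parser covers only the day/hour/minute forms seen in indexer schedules.

```python
import re
from datetime import datetime, timedelta, timezone

def parse_interval(value: str) -> timedelta:
    """Parse a subset of ISO 8601 durations, e.g. PT10M, PT1H, P1D."""
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?)?", value)
    if m is None:
        raise ValueError(f"unsupported duration: {value}")
    days, hours, minutes = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)

# Values taken from the indexer definition above.
interval = parse_interval("PT10M")
start = datetime(2025, 12, 9, 20, 31, 28, tzinfo=timezone.utc)
next_runs = [start + i * interval for i in range(1, 4)]  # next three runs
```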

At the end, let’s check if we can find in the index the document describing the 🌟 Sirius 🌟 star that was shown at the beginning of the post.

Azure AI Search service in Azure Portal showing an index with a sample result.

It is obviously there, which confirms that we have configured everything correctly. As you can also see, the name of the index space-entities-index matches what is visible in our indexer definition in the targetIndexName element.

Summary

Data sources and indexers are essential components of the pull approach to data indexing in Azure AI Search. We discussed the role of each object, the indexing stages, and the concept of incremental indexing. Then we created a new data source and indexer and analyzed their JSON definitions. With this foundational knowledge, we are now ready to explore the opposite method, the push approach. Understanding both will enable us to make an informed choice about which approach is best suited for a given situation. See you in the next post!
