Integrating YugabyteDB with Azure AI Search

Amol Bhoite

Introduction

Modern businesses generate vast amounts of structured and unstructured data. Efficiently searching, analyzing, and deriving insights from this data is crucial for decision-making, process optimization, and customer experience enhancement. Traditional database systems often struggle to handle the scale, complexity, and variety of modern data workloads.

By combining modern distributed SQL databases and cloud-based search services, businesses can unlock new possibilities for scalable document management, transactional analytics, and search-optimized workflows. YugabyteDB and Azure AI Search together provide a robust solution for managing both structured and unstructured data with operational efficiency.

This blog demonstrates how insurers can enhance claims processing workflows by using YugabyteDB’s distributed database and Azure AI Search.

About YugabyteDB

YugabyteDB is a PostgreSQL-compatible, high-performance, distributed SQL database designed for cloud-native applications. It combines the benefits of SQL’s relational model with the scalability and resilience of a distributed database. This makes it ideal for modern applications that require global data distribution, high availability, and strong consistency. Built on top of a highly scalable architecture, YugabyteDB can handle large amounts of data and high transaction volumes across multiple regions, all while maintaining low-latency access to data.

About Azure AI Search

Azure AI Search is a cloud-based service that provides enterprise-ready information retrieval, enabling advanced search over diverse content using AI-powered indexing and querying. Key features include full-text search, which helps surface suspicious activity; semantic search, which interprets the context of a query; and AI enrichment through skillsets, which extracts text and metadata from documents to give a fuller view of transactions and make unusual patterns easier to spot.

Use Cases for Search-Optimized Document and Data Integration

The combination of YugabyteDB and Azure AI Search can be used in various industries and applications:

  • Fraud Detection: Identifying suspicious activities through search-optimized workflows and structured fraud analytics.
  • Customer Support Optimization: Enhancing search capabilities in customer support systems for faster issue resolution.
  • Healthcare and Medical Record Search: Improving searchability and analysis of patient records, medical documents, and imaging reports.
  • E-commerce and Recommendation Engines: Enabling smart product recommendations and personalized search experiences.
  • Legal Document and Compliance Analysis: Automating contract review, legal document search, and regulatory compliance tracking.

Operational Fraud Detection: Using Transactional Queries

Fraud detection in various industries has traditionally relied on rule-based systems, which often miss complex or evolving fraud patterns. Operational analytics through structured SQL queries enables businesses to detect potential anomalies such as duplicate claims, high-value payouts, or frequent filings. These transactional queries complement the document search and indexing capabilities of Azure AI Search, providing a comprehensive approach to claims investigation and reporting. 

In industries like insurance, solutions must comply with data governance and regulatory requirements (such as Solvency II, GDPR, or HIPAA). Fraud detection systems must enforce strong data privacy controls, auditability, and transparency to meet legal and compliance standards.

The architecture integrates YugabyteDB with Azure AI Search, using its indexes, indexers, and skillsets together with Azure Blob Storage for claims processing and fraud detection.

How to Integrate YugabyteDB with Azure AI Search

Here’s a simplified breakdown:

  1. Customer/Agent Application: A web or mobile app where customers or agents upload claim documents (PDFs, emails, or scanned images).
  2. Azure Blob Storage: This stores the uploaded claim documents securely for processing.
  3. Azure AI Search Indexer and Skillset: The Indexer fetches documents from Blob Storage, and Skillset (OCR and NLP) extracts claim details (e.g., Claim ID, Policy ID, Claim Amount, Description, etc.). Extracted text and metadata are stored in the AI Search Index for fast searches.
  4. Azure AI Search Index: This serves as the structured repository where your searchable content resides. It enables efficient full-text search, filtering, and retrieval operations across your data.
  5. Sync Service (Python Script): Fetches indexed claim data from Azure AI Search, parses it, and inserts it into YugabyteDB.
  6. YugabyteDB: Stores structured claim details in the claims table (Claim ID, Policy ID, Amount, Date, Status, Description, etc.), runs fraud detection queries for duplicate claims, high-value claims, and suspicious patterns, and provides distributed, scalable database performance.
  7. Fraud Analysis Dashboard: A dashboard or API that provides insights into flagged claims for insurance companies to review.

Step 1: Setting Up Azure AI Search

1.1 Create an Azure AI Search Service

  1. Log in to the Azure Portal: Go to portal.azure.com.
  2. Search for Azure AI Search: In the search bar, type “Azure AI Search” and select it.
  3. Create a New Service:
    • Click Create.
    • Select your Subscription and Resource Group.
    • Choose a Region and Pricing Tier (Standard is recommended for most use cases).
    • Set the Service Name to fraud-detection-search.
    • Click Review + Create, then Create.

Figure: Azure Portal, Create a New Service

1.2 Configure an Index

An index in Azure AI Search is like a search-optimized database table. It stores the extracted data in a structured way for fast searches.

  1. Go to Indexes: In your AI Search service, click Indexes → Create Index.
  2. Set the Index Name to insurance-claims-index-v2.
  3. Click Create.
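
The portal steps leave the field list open. As a rough sketch, the same index could be described as a JSON definition built in Python. Only the index name comes from the steps above; the field names (`id`, `content`, `metadata_storage_path`, `last_updated`) are assumptions chosen for illustration.

```python
import json

# Hypothetical field layout; only the index name is taken from the walkthrough.
index_definition = {
    "name": "insurance-claims-index-v2",
    "fields": [
        # An index needs exactly one key field, and it must be an Edm.String.
        {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
        # Text extracted from the claim document.
        {"name": "content", "type": "Edm.String", "searchable": True},
        # Blob location of the source document.
        {"name": "metadata_storage_path", "type": "Edm.String", "retrievable": True},
        # Timestamp that a sync job could filter on.
        {"name": "last_updated", "type": "Edm.DateTimeOffset",
         "filterable": True, "sortable": True},
    ],
}

# Sanity check: collect the fields marked as the document key.
key_fields = [f["name"] for f in index_definition["fields"] if f.get("key")]

print(json.dumps(index_definition, indent=2))
```

A definition like this can be pasted into the portal's JSON editor or sent to the service's REST API instead of clicking through the field designer.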

Figure: Azure Portal, Create an Index

1.3 Add Skillsets

A skillset enriches and transforms data during the indexing process. In this setup, a skillset named “insurance-claims-skillset” includes an OCR (Optical Character Recognition) skill, configured to extract text from PDF documents.

  1. Click Skillsets → Add skillset.
  2. Paste the skillset JSON definition, with the name insurance-claims-skillset, then save the skillset.
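
For reference, a skillset definition with a single OCR skill might look roughly like the following. This is a minimal sketch rather than the exact definition used here: the skillset name comes from the text, while the description, language code, and the output name `ocr_text` are assumptions.

```python
import json

# Illustrative skillset with one OCR skill; values other than the name are assumed.
skillset_definition = {
    "name": "insurance-claims-skillset",
    "description": "OCR over images found in claim documents (illustrative)",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
            # Run once per image that the indexer extracts from a document.
            "context": "/document/normalized_images/*",
            "defaultLanguageCode": "en",
            "detectOrientation": True,
            "inputs": [
                {"name": "image", "source": "/document/normalized_images/*"}
            ],
            # "text" is an output of the OCR skill; "ocr_text" is a made-up target name.
            "outputs": [
                {"name": "text", "targetName": "ocr_text"}
            ],
        }
    ],
}

print(json.dumps(skillset_definition, indent=2))
```

The `/document/normalized_images/*` path is only populated when the indexer is configured to generate normalized images, so the OCR skill depends on that indexer setting.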

Figure: Azure Portal, Add Skillsets

1.4 Create an Indexer

An indexer is an automated process that extracts data from a source, e.g. Blob Storage, processes it, and adds it to the AI Search Index.

  1. Go to Indexers: In your AI Search service, click Indexers → Create Indexer.
  2. Configure the Indexer:
    • Set the Indexer Name to insurance-claims-indexer-v2.
    • Set the target Index to insurance-claims-index-v2.
    • Set the Data Source (Blob Storage) to insurance-claims-datasource.
    • Set the Skillset for AI enrichment (OCR) to insurance-claims-skillset.
    • Click Create.
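
The same indexer settings can be expressed as a JSON definition. The sketch below reuses the four names from the steps above; the five-minute schedule and the `parameters` block are assumptions added to show where a schedule and image extraction would be configured.

```python
import json

# Names come from the walkthrough; schedule and parameters are illustrative.
indexer_definition = {
    "name": "insurance-claims-indexer-v2",
    "dataSourceName": "insurance-claims-datasource",
    "targetIndexName": "insurance-claims-index-v2",
    "skillsetName": "insurance-claims-skillset",
    # ISO 8601 duration; five minutes is the shortest interval the service accepts.
    "schedule": {"interval": "PT5M"},
    "parameters": {
        "configuration": {
            # Extract both the document text and blob metadata.
            "dataToExtract": "contentAndMetadata",
            # Required so an OCR skill has images to work on.
            "imageAction": "generateNormalizedImages",
        }
    },
}

print(json.dumps(indexer_definition, indent=2))
```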

Figure: Azure Portal, Create an Indexer

1.5 Upload Sample Documents

  1. Upload Claims to Blob Storage: Customers and agents upload claim documents (PDFs, emails, scanned images) via the app to Azure Blob Storage, ensuring centralized and secure storage.
  2. Run the Indexer: Once documents are uploaded, the indexer runs periodically at the user-defined frequency to extract text and populate the index.
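
Besides waiting for the schedule, an indexer can be started on demand through the REST API. The sketch below only builds the request with the standard library and does not send it. The API version and the `<admin-key>` placeholder are assumptions; substitute the values for your own service.

```python
import urllib.request

API_VERSION = "2024-07-01"  # assumed; use a version your service supports


def build_run_indexer_request(service_name, indexer_name, admin_key):
    """Build (but do not send) a POST request that asks an indexer to run."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexers/{indexer_name}/run?api-version={API_VERSION}"
    )
    # The run operation takes no body; the admin key authorizes the call.
    return urllib.request.Request(
        url, data=b"", method="POST", headers={"api-key": admin_key}
    )


request = build_run_indexer_request(
    "fraud-detection-search", "insurance-claims-indexer-v2", "<admin-key>"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` would send it; the run itself is asynchronous, so progress has to be checked through the indexer status afterwards.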

Figure: Azure Portal, Upload Sample Documents

Step 2: Setting Up YugabyteDB

  1. Create a YugabyteDB Cluster: Install YugabyteDB on three nodes (virtual machines or Kubernetes pods) with a replication factor of 3, following the YugabyteDB documentation to set up a distributed SQL database cluster.
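  2. Create a Claims Table and index it based on the query pattern:
yugabyte=# CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- supplies gen_random_uuid() on PostgreSQL 11-based releases
yugabyte=# CREATE TABLE claims (
    claim_id          UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    policy_id         UUID NOT NULL,
    claimant_name     TEXT,
    claim_status      TEXT CHECK (claim_status IN ('Pending', 'Approved', 'Rejected')),
    claim_amount      NUMERIC,
    claim_description TEXT,
    claim_date        TIMESTAMP,
    document_url      TEXT,
    last_sync_time    TIMESTAMP
);
yugabyte=# -- Example index for the per-claimant queries in Step 4; adjust to your own query pattern.
yugabyte=# CREATE INDEX claims_claimant_name_idx ON claims (claimant_name);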

Step 3: Syncing Data from Azure AI Search to YugabyteDB

The sync script serves as a bridge between Azure AI Search and YugabyteDB, ensuring that claims data extracted, indexed, and searched in Azure AI Search is also efficiently stored and processed in YugabyteDB for structured storage and advanced analysis.

The AzureAISearchtoYugabyteDBsync.py script automates this process: it connects to Azure AI Search, retrieves claim-related data via API calls, parses the details extracted from the PDFs (Claim ID, Policy ID, Claim Amount, Description, etc.), and inserts the structured claims data into the YugabyteDB claims table for further processing and reporting.
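
The parsing step might look something like the sketch below. It is not the actual AzureAISearchtoYugabyteDBsync.py code: it assumes the extracted text contains "Label: value" lines, and the patterns, the `parse_claim` helper, and the sample document (including its Policy ID) are all hypothetical.

```python
import re
from decimal import Decimal

# Assumed document layout: one "Label: value" pair per line in the extracted text.
FIELD_PATTERNS = {
    "claim_id": r"Claim ID:\s*([0-9a-fA-F-]{36})",
    "policy_id": r"Policy ID:\s*([0-9a-fA-F-]{36})",
    "claimant_name": r"Claimant(?: Name)?:\s*(.+)",
    "claim_amount": r"Claim Amount:\s*\$?([\d,]+(?:\.\d+)?)",
    "claim_description": r"Description:\s*(.+)",
}


def parse_claim(text):
    """Extract claim fields from document text; missing fields become None."""
    claim = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        claim[field] = match.group(1).strip() if match else None
    if claim["claim_amount"] is not None:
        # Drop thousands separators so the value fits a NUMERIC column.
        claim["claim_amount"] = Decimal(claim["claim_amount"].replace(",", ""))
    return claim


sample = """Claim ID: ad56cd2f-b735-456b-8b0e-7a327e24a9c8
Policy ID: 00000000-0000-0000-0000-000000000001
Claimant Name: Samantha Green
Claim Amount: $25,000
Description: Severe accident resulting in damage to the rear and side of the vehicle."""

claim = parse_claim(sample)
print(claim)
```

Because OCR output is noisy, a real sync job would likely need to skip or quarantine documents where required fields such as the Policy ID come back as None.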

Optimizing Data Sync: The script uses ON CONFLICT (claim_id) DO UPDATE to update existing records in YugabyteDB instead of inserting duplicates. For better efficiency in production, track the last sync timestamp and process only new or updated records.

This can be done by storing the last sync time in a database and filtering documents using the last_updated field in Azure AI Search, ensuring faster and more reliable synchronization.
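
As a minimal sketch of both ideas, the snippet below pairs an upsert statement with a helper that builds an incremental search query. The `last_updated` field and the ON CONFLICT clause come from the text; the psycopg2-style `%(name)s` placeholders, the `build_search_body` helper, and the page size are assumptions.

```python
import json
from datetime import datetime, timezone

# Upsert in psycopg2-style named-parameter form (the driver is an assumption).
UPSERT_SQL = """
INSERT INTO claims (claim_id, policy_id, claimant_name, claim_amount,
                    claim_description, last_sync_time)
VALUES (%(claim_id)s, %(policy_id)s, %(claimant_name)s, %(claim_amount)s,
        %(claim_description)s, now())
ON CONFLICT (claim_id) DO UPDATE SET
    policy_id         = EXCLUDED.policy_id,
    claimant_name     = EXCLUDED.claimant_name,
    claim_amount      = EXCLUDED.claim_amount,
    claim_description = EXCLUDED.claim_description,
    last_sync_time    = EXCLUDED.last_sync_time;
"""


def build_search_body(last_sync, page_size=1000):
    """Build a search request body that returns only documents changed since last_sync."""
    # OData date-time literals are written unquoted, in UTC.
    stamp = last_sync.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "search": "*",
        "filter": f"last_updated gt {stamp}",
        "orderby": "last_updated asc",
        "top": page_size,
    }


body = build_search_body(datetime(2025, 1, 1, tzinfo=timezone.utc))
print(json.dumps(body))
```

Ordering by `last_updated` lets the job record the timestamp of the last processed document as the new watermark, so an interrupted run can resume without rereading everything.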

Customers can schedule the AzureAISearchtoYugabyteDBsync.py script using the following methods:

  • Using a Cron Job: Customers running the script on Linux servers can use a cron job to schedule execution.
    Add a cron job to run the script every 5 minutes:
    */5 * * * * /usr/bin/python3 /path/to/AzureAISearchtoYugabyteDBsync.py >> /var/log/sync 2>&1
  • Azure Function App: This process can also be achieved using an Azure Function App, which provides a serverless and automated way to execute the sync script at scheduled intervals. Customers can deploy the script as an Azure Function and schedule it to run automatically using a Timer Trigger.
    Example Azure Function Timer Trigger Schedule (Runs Every 5 Minutes):
    "schedule": "0 */5 * * * *"

Note: Ensure the Azure AI Search indexer’s schedule matches the execution frequency of the synchronization script, so that each sync run sees freshly indexed documents.
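
For the Azure Function option, the Timer Trigger schedule usually lives in a function.json binding. The sketch below generates such a file from Python. The schedule string is the one shown above; the binding name `timer` and the `scriptFile` value are assumptions.

```python
import json

# Timer-trigger binding; the schedule is a six-field NCRONTAB expression
# (second, minute, hour, day, month, day-of-week).
function_json = {
    "scriptFile": "__init__.py",
    "bindings": [
        {
            "name": "timer",
            "type": "timerTrigger",
            "direction": "in",
            "schedule": "0 */5 * * * *",
        }
    ],
}

# Unlike a five-field cron line, the expression carries a leading seconds field.
schedule_fields = function_json["bindings"][0]["schedule"].split()

print(json.dumps(function_json, indent=2))
```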

Step 4: Running Fraud Detection Queries in YugabyteDB

Below are sample SQL queries you can use to identify potentially fraudulent activities within your data. YugabyteDB’s compatibility with PostgreSQL allows you to utilize familiar SQL constructs for this purpose.

Detecting Duplicate Claims

This query counts car-related claims per claimant and returns anyone with more than one, which helps detect duplicate or repeated claims that could indicate fraudulent activity.

yugabyte=# SELECT claimant_name, COUNT(*)
yugabyte-# FROM claims
yugabyte-# WHERE claim_description ILIKE '%car%'
yugabyte-# GROUP BY claimant_name
yugabyte-# HAVING COUNT(*) > 1;
 claimant_name | count 
---------------+-------
 John Doe      |     4
(1 row)

Detecting High-Value Claims Compared to the Average

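This query flags claims whose amount is more than twice the average “Lost luggage” claim, helping insurers detect possibly inflated claims. As written, the outer query scans all claims; to compare only similar claim types, add the same claim_description filter to the outer WHERE clause.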

yugabyte=# SELECT claim_id, claimant_name, claim_amount
yugabyte-# FROM claims
yugabyte-# WHERE claim_amount > (SELECT AVG(claim_amount) * 2 FROM claims WHERE claim_description = 'Lost luggage');
               claim_id               | claimant_name  | claim_amount
--------------------------------------+----------------+--------------
 1d939d78-040d-448c-867d-1a829e578b42 | David Brown    |        30000
 ad56cd2f-b735-456b-8b0e-7a327e24a9c8 | Samantha Green |        25000
 6f0f3c51-9b8f-4ab2-ae18-835f51a9b153 | James Peterson |        18000
(3 rows)

Detecting Specific Accidents with Damage

This query detects claims mentioning both accidents and damage, which could indicate high-severity claims that need further review.

yugabyte=# SELECT claim_id, claimant_name, claim_description
yugabyte-# FROM claims
yugabyte-# WHERE claim_description ILIKE '%accident%'
yugabyte-# AND claim_description ILIKE '%damage%';
               claim_id               | claimant_name  |                            claim_description
--------------------------------------+----------------+--------------------------------------------------------------------------
 ad56cd2f-b735-456b-8b0e-7a327e24a9c8 | Samantha Green | Severe accident resulting in damage to the rear and side of the vehicle.
 7f6c3d82-d4d7-4267-a847-7c61c7d2853b | David Wilson   | Minor car accident with no significant damage to the vehicle.
(2 rows)

Flagging Frequent Claimants

This query flags claimants who have filed more than three claims within the past year.

yugabyte=# SELECT claimant_name, COUNT(*) AS claim_count
yugabyte-# FROM claims
yugabyte-# WHERE claim_date > CURRENT_DATE - INTERVAL '1 year'
yugabyte-# GROUP BY claimant_name
yugabyte-# HAVING COUNT(*) > 3;
 claimant_name | claim_count
---------------+-------------
 John Doe      |           5
(1 row)

Step 5: Fraud Analysis Dashboard

You can develop a fraud analysis dashboard using custom web applications or BI tools for data visualization, reporting, and analytics.

By connecting to YugabyteDB, the dashboard can display detected fraud cases, enabling insurance companies to take appropriate actions. This setup allows insurance agents to retrieve real-time fraud detection results from YugabyteDB, facilitating the approval, rejection, or flagging of claims while reviewing supporting documents.

Sample Dashboard Report

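 Claim ID                             | Claimant Name | Claim Type   | Amount  | Risk Score | Flagged For
--------------------------------------+---------------+--------------+---------+------------+-----------------
 ad56cd2f-b735-456b-8b0e-7a327e24a9c8 | John Doe      | Car Accident | $45,000 | 🔴 High    | Multiple Claims
 7f6c3d82-d4d7-4267-a847-7c61c7d2853b | Alice Smith   | Lost Luggage | $18,000 | 🟠 Medium  | High Amount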
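
A "Flagged For" column like the one in the sample report could be derived from simple rules over claim rows fetched from YugabyteDB. The sketch below is one hypothetical way to do it in plain Python. The `flag_claims` helper, its thresholds, and the sample rows are assumptions for illustration, not the rules behind the report above.

```python
from collections import Counter
from statistics import mean


def flag_claims(claims, max_claims=3, amount_factor=2):
    """Return claims that trip a rule, each with a list of reasons attached."""
    if not claims:
        return []
    counts = Counter(c["claimant_name"] for c in claims)
    # Amounts above amount_factor times the overall average count as high.
    threshold = mean(c["claim_amount"] for c in claims) * amount_factor
    flagged = []
    for c in claims:
        reasons = []
        if counts[c["claimant_name"]] > max_claims:
            reasons.append("Multiple Claims")
        if c["claim_amount"] > threshold:
            reasons.append("High Amount")
        if reasons:
            flagged.append({**c, "flagged_for": reasons})
    return flagged


# Made-up rows: four small claims from one claimant and one large claim from another.
rows = [{"claimant_name": "John Doe", "claim_amount": 1000} for _ in range(4)]
rows.append({"claimant_name": "Alice Smith", "claim_amount": 18000})

flagged = flag_claims(rows)
for row in flagged:
    print(row["claimant_name"], row["claim_amount"], row["flagged_for"])
```

In practice the same rules can stay in SQL, as in Step 4, with the dashboard reading the results; computing them in application code is mainly useful when reasons from several rules need to be merged per claim.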

AI Search vs. YugabyteDB: Key Differences

YugabyteDB provides a robust platform for managing transactional workloads with strong consistency and scalability, while Azure AI Search excels at intelligent retrieval over unstructured content.

 Feature                             | Azure AI Search                 | YugabyteDB
-------------------------------------+---------------------------------+-------------------------------------------------------------
 Search Type                         | Full-text and Semantic Search   | SQL-based Queries
 Performance                         | Fast for unstructured data      | Optimized for structured data, Scalable distributed queries
 Use Case                            | Document Search and Exploration | Identify data anomalies
 Scalability                         | Horizontal Scaling              | Distributed SQL Scaling
 Text-based search (OCR, NLP)        | Yes, Fast AI-powered search     | No, Limited to SQL LIKE queries
 Structured SQL-Based Search         | No                              | Yes, Complex queries with joins
 Handles Large-Scale Relational Data | No                              | Yes
 Near Real-Time Fraud Detection      | NLP-based similarity detection  | Rule-based fraud detection

Conclusion

Integrating YugabyteDB with Azure AI Search combines the strengths of a distributed SQL database with advanced search capabilities, offering a robust solution for data-intensive applications. This integration facilitates efficient data processing and real-time analysis, enhancing operational performance.

However, challenges such as ensuring data accuracy, maintaining regulatory compliance, and managing infrastructure costs must be carefully addressed through strategic planning and optimization. By proactively managing these factors, organizations can fully leverage the benefits of this integrated approach to scalable data management, search, and operational analytics.

Integrating YugabyteDB with Azure AI Search delivers a scalable and highly available database foundation for managing structured operational data.
