AI-Powered Utility Bill Processing

A Utility Bill Management Organization

Client

A utility bill management organization that processes high volumes of bills across diverse providers, relying on accurate bill data to drive downstream financial workflows for its users.

Challenge

Utility bill PDFs arrived from dozens of providers, each with different layouts, fonts, and field structures. The organization had no automated system to extract, validate, or reconcile bill data, leaving staff to process every document by hand. Three core problems demanded a solution:

Manual Data Extraction: Key bill fields (provider name, account number, due date, and amount due) had to be keyed in by hand from every uploaded PDF, creating backlogs and data quality issues as volumes grew.
No Data Reconciliation: For accounts where bill data already existed from API integrations, there was no mechanism to compare newly extracted PDF data against existing records. Conflicting values went undetected and missing fields were never auto-filled.
No Error Handling or Escalation Path: When extraction quality was uncertain — due to corrupted files, low-resolution scans, or ambiguous content — there was no structured escalation path. Failed documents were simply lost with no archival, logging, or retry logic in place.

Key Results

Extraction Accuracy & Speed

Achieved 95%+ data extraction accuracy through a tiered confidence scoring system (1.0 for critical fields, 0.8 for secondary fields).
Reduced bill document processing time to under 60 seconds per PDF, replacing the fully manual data entry workflow and eliminating processing backlogs.
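
To make the tiered scoring concrete, here is a minimal sketch of how per-field confidence could gate human review. It assumes the 1.0 and 0.8 figures are minimum acceptance thresholds, which is one reading of the result above, and the assignment of fields to tiers is hypothetical.

```python
from dataclasses import dataclass

# Assumed reading of the tiers: the minimum confidence a field needs
# in order to be accepted without human review.
CRITICAL_THRESHOLD = 1.0
SECONDARY_THRESHOLD = 0.8

# Hypothetical split of the four bill fields into tiers.
CRITICAL_FIELDS = {"account_number", "amount_due"}


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # model-reported score between 0.0 and 1.0


def fields_needing_review(fields: list[ExtractedField]) -> list[str]:
    """Return names of fields whose confidence is below their tier's threshold."""
    flagged = []
    for field in fields:
        # Any field not listed as critical is treated as secondary.
        if field.name in CRITICAL_FIELDS:
            threshold = CRITICAL_THRESHOLD
        else:
            threshold = SECONDARY_THRESHOLD
        if field.confidence < threshold:
            flagged.append(field.name)
    return flagged
```

Under these assumptions, a bill with an empty result would be written straight through, while any flagged field would route the document to the review workflow.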

Reliability & Cost

Eliminated data loss through six categorised error types with automatic S3 archival, error metadata logging, and exponential-backoff retry logic.
Deployed a cost-effective serverless architecture processing documents at approximately $0.01 per PDF, with auto-scaling that absorbed volume spikes without infrastructure changes.
Delivered a fully integrated human-in-the-loop HubSpot review workflow: when a ticket is closed, a webhook writes the reviewer's corrections back to the database automatically, with no additional manual steps.
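
The exponential-backoff retry mentioned above can be sketched as follows. This is an illustration rather than the deployed code: TransientError, the attempt limit, and the base delay are all assumed names and values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as throttling."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, doubling the wait after each transient failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return operation()
        except TransientError:
            attempt += 1
            if attempt >= max_attempts:
                # Out of attempts: re-raise so the caller can archive the file.
                raise
            # Waits of base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))
```

Once the final attempt fails, the exception propagates, which is the point at which a caller would archive the document to S3 with its error metadata. The sleep parameter is injectable so the behaviour can be tested without real delays.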

Solution

The team designed and deployed a fully serverless, event-driven utility bill processing pipeline on AWS. The system automated the complete lifecycle — from PDF ingestion through AI extraction, intelligent data reconciliation, and human-in-the-loop review — with no manual intervention required for standard-quality documents.

Key Components

Ingestion Layer: Dual-path S3 event-driven ingestion — manual uploads write directly to the database; batch uploads route through reconciliation before any writes
AI Extraction Layer: Amazon Bedrock with Claude Sonnet 4 returning per-field confidence scores (0.0–1.0); Amazon Nova Pro as automatic fallback model
Reconciliation Layer: AWS Lambda with type-aware field comparison — decimal precision for monetary amounts, date normalisation, case-insensitive string matching
Review Layer: HubSpot-integrated ticketing with webhook handler that writes reviewer corrections back to Aurora PostgreSQL on ticket closure
Observability Layer: Amazon QuickSight dashboard connected via private VPC to Aurora PostgreSQL — surfacing active bills, overdue amounts, biller performance, and reviewer workload in real time

Technologies Used

Amazon Bedrock (Claude Sonnet 4): Primary AI model for PDF field extraction with per-field confidence scoring
Amazon Nova Pro: Automatic fallback model ensuring continuous availability
AWS Lambda (Python 3.12): Serverless compute for extraction, reconciliation, review, and webhook handling
Amazon S3 & Amazon API Gateway: Event-driven document ingestion and archival of failed files with error metadata
Aurora PostgreSQL: Production database for extracted and reconciled bill records
HubSpot CRM: Human-in-the-loop review ticketing with webhook-driven database write-back on closure
Amazon QuickSight & Amazon CloudWatch: Real-time analytics dashboard and Lambda execution monitoring
Terraform & Python: Infrastructure as code and Lambda runtime
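
The primary and fallback model arrangement listed above can be illustrated with a small sketch. It assumes each model is wrapped in a plain callable that takes PDF bytes and returns a dict of fields. The actual Bedrock invocation is not shown, and all names here are hypothetical.

```python
from typing import Callable, Sequence

Extractor = Callable[[bytes], dict]


class ExtractionFailed(Exception):
    """Raised when every configured model fails on a document."""


def extract_with_fallback(
    pdf_bytes: bytes,
    extractors: Sequence[tuple[str, Extractor]],
) -> tuple[str, dict]:
    """Try each extractor in order; return the first success and its name."""
    errors = []
    for name, extractor in extractors:
        try:
            return name, extractor(pdf_bytes)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise ExtractionFailed("; ".join(errors))
```

Returning the name of the model that succeeded lets downstream logging record how often the fallback path is taken.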

Summary

The team designed and deployed a serverless, event-driven utility bill processing pipeline on AWS to close three gaps: error-prone manual PDF data entry, the lack of reconciliation between extracted data and existing API records, and the absence of a structured escalation path for exceptions. The solution uses Claude Sonnet 4 via Amazon Bedrock with tiered confidence scoring, AWS Lambda for reconciliation and review logic, HubSpot for human-in-the-loop corrections, and Amazon QuickSight for real-time operational visibility. It achieved 95%+ extraction accuracy, processed each document in under 60 seconds at approximately $0.01 per PDF, and eliminated data loss by archiving and retrying failed documents.
