AI Driven Course Curriculum Map Generation and Searching System

A fully accredited, nonprofit medical school training physicians

Client

A fully accredited, nonprofit medical school training physicians for practice in the United States and Canada. The institution positions itself as an exceptional alternative for qualified students who face limited medical school seats in North America. The client intended to utilize Generative AI technologies on Amazon Web Services (AWS) to implement an AI-driven system for ingesting IMSCC (IMS Common Cartridge) files and automatically generate a course curriculum map aligned to the AAMC framework.

Challenge

The client faced a time-intensive and inconsistent manual process for mapping course content to the AAMC (Association of American Medical Colleges) competency framework. With 22 courses and IMSCC files containing heterogeneous content — HTML pages, PDFs, PowerPoint slides, QTI assessments, and images — manual mapping was both error-prone and unscalable. Some IMSCC files exceeded 1.3 GB in size with 200+ resources, requiring processing times far beyond what Lambda's 15-minute execution limit could accommodate. The platform also lacked a searchable, structured store for generated curriculum maps and had no mechanism for aligning content across courses to surface prerequisite dependencies.

Key Results

  • Delivered a fully automated, end-to-end curriculum mapping pipeline processing 22+ courses from raw IMSCC files to structured, AAMC-aligned curriculum maps
  • Achieved intelligent AAMC competency alignment across all four domains: Interpersonal, Intrapersonal, Thinking and Reasoning, and Science Competencies
  • Solved large-file processing challenge — files up to 1.3 GB processed within Lambda's 15-minute timeout using dynamic batch sizing and Step Functions orchestration
  • Implemented prerequisite-aware module sequencing via two-pass dependency analysis, ensuring correct course ordering in final curriculum maps
  • Deployed hybrid semantic and keyword search API over all processed curriculum content, accessible via API Gateway
  • Built automatic model fallback from Claude Sonnet 4.5 to Claude Sonnet 4 when daily Bedrock token quotas are reached, ensuring pipeline continuity
  • Delivered full infrastructure-as-code using Terraform across 50+ AWS resources for repeatable, multi-environment deployment

Solution

Event-Driven Batch Processing Pipeline

  • We designed and implemented a fully serverless, event-driven batch processing pipeline on AWS.
  • The system is not a chatbot or RAG architecture — it processes entire cohorts of courses once and generates comprehensive, structured curriculum maps that persist in Aurora PostgreSQL and S3 for downstream search and reporting.
  • IMSCC files uploaded to S3 automatically trigger the pipeline after a configurable 10-minute debounce period, ensuring all files in a cohort are fully uploaded before processing begins.

Content Extraction and Hierarchical Chunking

  • Unpacked and parsed IMSCC files, extracting imsmanifest.xml to build full resource graphs and course hierarchies
  • Implemented file-type-specific extraction: HTML via BeautifulSoup, PDF via PyMuPDF, PPTX via python-pptx, DOCX via python-docx, QTI/XML for assessments, and Claude vision OCR for image-only pages
  • Applied hierarchical chunking at 800 tokens with 100-token overlap using tiktoken, structured across three levels: Course → Module/Unit → Resource/Item
  • Expanded large PDFs (over 25 pages) into 25-page parts to fit within batch time budgets; compressed images to max 1024px JPEG quality 70, reducing Bedrock token usage by 60-70%
  • Generated Titan Embed Text v2 embeddings (1536 dimensions) for every chunk and stored them in Aurora PostgreSQL with pgvector

AAMC Competency Mapping and Course Summarization

  • During chunking, Claude extracted learning objectives from each resource and mapped them to the four AAMC competency domains
  • GenerateCourseSummary Lambda computed AAMC domain coverage percentages per course, with direct strategy (up to 120k tokens) and map_reduce strategy (batched summarization) for larger courses
  • Structured JSON course summaries captured: title, description, key topics, prerequisite topics, AAMC domain coverage, and detailed module breakdowns

Two-Pass Curriculum Map Generation

  • Pass 1 — Prerequisite Dependency Map: Claude analyzed each course's prerequisite topics against other courses' key topics to build an explicit dependency graph
  • Pass 2 — Blueprint Generation: The dependency map was injected as ordering constraints; Claude generated a module sequence respecting all prerequisite relationships
  • Pass 3 — Core and Supplementary Maps: Two full curriculum maps generated from the approved blueprint — one for required AAMC competencies, one for supplementary learning
  • Outputs delivered as Excel workbooks (openpyxl) and JSON stored in S3, with structured records written to Aurora for search

Hybrid Semantic Search API

  • Implemented hybrid search combining Titan v2 vector similarity (cosine) against chunk embeddings and PostgreSQL ILIKE keyword matching across chunk text and module titles
  • Chunks appearing in both vector and keyword results ranked highest; Claude generates a natural language narrative summarizing top results with source attribution
  • Exposed via API Gateway HTTP v2 GET /search endpoint with cohort-level filtering by input folder

Step Functions Orchestration for Scale

  • AWS Step Functions orchestrated the complete pipeline with parallel Map states — up to 5 courses processed concurrently, with nested Map states for up to 3 batches per course
  • Each Lambda invocation used ThreadPoolExecutor (3-5 workers) for concurrent Bedrock API calls, maximizing throughput within Lambda's timeout ceiling
  • Built-in retry logic, checkpoint recovery, and idempotent reprocessing (force_reprocess flag) ensured resilience across long-running cohort runs
Technologies Used
  • Amazon Bedrock — Claude Sonnet 4.5 (primary), Claude Sonnet 4 (fallback)
  • Amazon Aurora PostgreSQL Serverless v2 with pgvector
  • AWS Step Functions
  • AWS Lambda (Python 3.12)
  • Amazon S3
  • Amazon DynamoDB
  • Amazon EventBridge and EventBridge Scheduler
  • API Gateway (HTTP v2)
  • Amazon Titan Embed Text v2
  • AWS Secrets Manager
Summary

A fully accredited nonprofit medical school enhanced its curriculum operations by implementing an AI-driven course curriculum map generation system on AWS. The solution automated the extraction and parsing of IMSCC course files, performed intelligent AAMC competency mapping using Amazon Bedrock Claude Sonnet, and generated prerequisite-aware curriculum maps for an entire cohort of 22+ courses. By combining AWS Step Functions orchestration, hierarchical chunking, and a hybrid semantic search API, the system addressed large-file processing challenges and delivered structured, searchable curriculum maps aligned to the four AAMC competency domains — significantly reducing manual effort and improving consistency across the institution's medical education program.

#arocom #artificialintelligence #machinelearning #datascience

Have Any Questions?