Published Video in Under 3 Minutes - How I Built an AI Film Crew on AWS

ai ai agents ai-solutions automation Mar 30, 2026

From 493 Unedited Clips to a Published Video in Under 3 Minutes

I came home from re:Invent with 493 unedited clips and told myself I'd make a yearly recap video.

I opened the folder. I looked at the files. And then I did what every engineer does when they should be editing.

I started overengineering.

Three months later: four new enterprise customers, two new partnerships, an open-source community project, and an AI pipeline that processes a raw conference video in under three minutes for €0.34. This is that story.


The Problem Nobody Talks About

Video production has two phases everyone knows about: filming and publishing. The phase that kills most projects is the one in the middle - post-production.

For a one-hour interview, industry standard is 40-60 hours of editing. At a conference like re:Invent, where a single sponsor might film 50 interviews in three days, that number becomes catastrophic. Most of it never gets published. It sits on a hard drive, or worse, on a NAS system somewhere, untouched.

According to the Wyzowl State of Video Marketing Report, only 1 in 5 corporate video projects is published on time. The Cisco Annual Internet Report puts video at 82% of all internet traffic by 2025. Taken together, these two facts describe an industry-wide bottleneck.

The content exists. The infrastructure to process it at scale doesn't - or didn't.

For NetApp customers specifically, this is a storage problem as much as a workflow problem. Petabytes of unstructured video data live on enterprise NAS systems. Valuable content, invisible because there's no way to search, tag, or publish it efficiently.


Architecture Overview

The system has three conceptual layers: an intake layer that handles validation, an analysis layer that extracts everything machine-readable, and a creative layer that makes editorial decisions - which clips to use, in what order, with what structure.

 

The Fully Cloud Version

Infrastructure as Code

Everything runs on AWS and is managed with Terraform. All 27 Lambda functions, 6 Step Functions state machines, IAM roles, and S3 bucket policies are defined in code. The Makefile exposes a single deploy command with a confirmation prompt for safety.

make deploy       # Plan + confirm
make autodeploy   # Deploy without prompts (CI)

A key convention: Lambda functions are never packaged manually. A Terraform module wraps terraform-aws-modules/lambda/aws and automatically includes src/common/ as a shared Lambda layer. Every function gets the same domain models, S3 utilities, and configuration without repeating code.

# infra/modules/lambda_function/main.tf
module "lambda" {
  source  = "terraform-aws-modules/lambda/aws"
  version = "~> 7.0"

  function_name = "${var.name_prefix}-fn-${var.function_name}"
  handler       = "handler.lambda_handler"
  runtime       = "python3.11"

  # Common layer auto-attached - contains models.py, s3_utils.py, config.py
  layers = [aws_lambda_layer_version.common.arn]

  environment_variables = var.lambda_environment
}

Handler naming is a strict convention. Every function uses lambda_handler as its entry point, and the Terraform config always references handler = "handler.lambda_handler". If you rename it, deployment fails loudly rather than silently.
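A minimal sketch of what that convention looks like on the Python side (the event field and return shape here are illustrative, not the pipeline's actual contract):

```python
# handler.py - every function in the pipeline exposes exactly this name,
# matching the Terraform reference handler = "handler.lambda_handler"

def lambda_handler(event: dict, context) -> dict:
    """Entry point. Renaming this breaks deployment loudly, by design."""
    video_id = event.get("video_id", "unknown")  # hypothetical field
    return {"statusCode": 200, "video_id": video_id}
```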

Step Functions: The Orchestration Brain

The content generation workflow is a sequence of tasks with parallel branches and choice states.

The idempotency check is a practical detail that matters at scale. Before processing, the workflow counts objects in the output prefix. If five or more already exist, it skips - you can safely re-trigger workflows without paying for reprocessing.

{
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:s3:listObjectsV2",
  "Parameters": {
    "Bucket.$": "$.output_bucket",
    "Prefix.$": "States.Format('{}/processed/', $.video_id)"
  },
  "ResultSelector": {
    "count.$": "$.KeyCount"
  },
  "Next": "AlreadyProcessedDecision"
}

MediaConvert: Three Outputs, One Job

Every video goes through MediaConvert first. One job creates three outputs simultaneously:

Downsized video - 960x540, H.264, 1.5 Mbps; base for compilation and BDA input
Audio extraction - AAC, 96 kbps; feeds Amazon Transcribe
Frame capture - JPEG every 2 seconds, quality 80, max 100,000 frames; feeds Rekognition

The frame interval is configurable per video type. A fast-moving sports reel needs 0.5-second intervals. A static interview works fine at 3 seconds. The workflow sets this via a Choice state before dispatching to MediaConvert.
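A Choice state along these lines could set the interval before dispatch (state names, the video_type field, and the interval values for the default case are illustrative, not the pipeline's actual state machine):

```json
{
  "SelectFrameInterval": {
    "Type": "Choice",
    "Choices": [
      {"Variable": "$.video_type", "StringEquals": "sports", "Next": "FastSampling"},
      {"Variable": "$.video_type", "StringEquals": "interview", "Next": "SlowSampling"}
    ],
    "Default": "DefaultSampling"
  },
  "FastSampling": {"Type": "Pass", "Result": {"frame_interval_sec": 0.5}, "ResultPath": "$.sampling", "Next": "DispatchMediaConvert"},
  "SlowSampling": {"Type": "Pass", "Result": {"frame_interval_sec": 3}, "ResultPath": "$.sampling", "Next": "DispatchMediaConvert"},
  "DefaultSampling": {"Type": "Pass", "Result": {"frame_interval_sec": 2}, "ResultPath": "$.sampling", "Next": "DispatchMediaConvert"}
}
```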

Note on 4K source footage: MediaConvert handles the downscale cleanly. But if you use Remotion for preview rendering, always reference the downsized proxy - not the 4K original. A 4K source at 47 Mbps causes delayRender() timeouts because frame seeking takes longer than Remotion's ~30-second default timeout. The lower-bitrate proxy at ~14 Mbps renders without issues.

The Parallel Analysis Phase

After MediaConvert, three branches run concurrently - Rekognition, Transcribe, and Bedrock Data Automation.

Branch 1: Rekognition - Visual Intelligence Per Frame

Each extracted frame is passed to Rekognition's detect_labels API. The pipeline collects labels, confidence scores, and bounding box data per frame, then aggregates into a timeline of visual content.

def analyze_frame(rekognition_client, bucket: str, key: str) -> list[dict]:
    response = rekognition_client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=20,
        MinConfidence=75.0,
    )
    return [
        {
            "label": label["Name"],
            "confidence": label["Confidence"],
            "instances": label.get("Instances", []),
        }
        for label in response["Labels"]
    ]

The aggregated output tells the film crew: at timestamp 0:32, there's a conference stage with 3 people visible. At 1:14, there's a whiteboard with a diagram. This context shapes which clips the crew selects for different narrative scenes.
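A sketch of that aggregation step - mapping per-frame label results back onto video timestamps. The frame naming convention (frame_0001.jpg, captured in order at the configured interval) is an assumption for illustration:

```python
import re

def build_visual_timeline(frame_labels: dict[str, list[dict]],
                          interval_sec: float = 2.0) -> list[dict]:
    """Map per-frame Rekognition labels onto video timestamps.

    Assumes frames are named like 'frame_0001.jpg' in capture order -
    a hypothetical convention, not necessarily the pipeline's own.
    """
    timeline = []
    for key, labels in sorted(frame_labels.items()):
        index = int(re.search(r"(\d+)", key).group(1))
        timeline.append({
            "timestamp_sec": (index - 1) * interval_sec,  # frame 1 => t=0
            "labels": [lbl["label"] for lbl in labels],
        })
    return timeline
```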

A freeze frame in the demo shows the exact moment Rekognition detects faces in the video - the pipeline pauses, draws the detection boxes, and surfaces the confidence scores live.

Branch 2: Transcribe - Word-Level Timestamps

The extracted audio goes to Amazon Transcribe with a custom vocabulary file. Technical terms - "FSxN", "MediaConvert", "Rekognition", "CrewAI" - are pre-registered to prevent transcription errors that would corrupt downstream prompts.
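The request for such a job might look like the sketch below. The vocabulary name, speaker settings, and media format are illustrative assumptions; the real pipeline wires these in from its Terraform-managed configuration:

```python
def build_transcribe_request(job_name: str, media_uri: str,
                             vocabulary_name: str = "conference-tech-terms") -> dict:
    """Build parameters for transcribe.start_transcription_job().

    The vocabulary name 'conference-tech-terms' is hypothetical - it stands
    in for the pre-registered list of terms like 'FSxN' and 'CrewAI'.
    """
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "MediaFormat": "mp4",  # the AAC audio extracted by MediaConvert
        "LanguageCode": "en-US",
        "Settings": {
            "VocabularyName": vocabulary_name,
            "ShowSpeakerLabels": True,
            "MaxSpeakerLabels": 4,  # illustrative cap for interview footage
        },
    }

# The actual call would then be:
# transcribe.start_transcription_job(**build_transcribe_request("job-1", uri))
```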

A transcription enhancer Lambda runs after the Transcribe job completes. It uses Bedrock Claude to clean up filler words, standardise speaker labels, and fix any remaining technical terms the custom vocabulary missed.

Branch 3: Bedrock Data Automation (BDA) - Video-Level Understanding

BDA runs directly on the downsized video file. It extracts a structured representation of the video content: scenes, speakers, topics, sentiment, and key moments. This is distinct from Rekognition (per-frame) and Transcribe (audio-only). BDA understands the video as a whole.

BDA is asynchronous - you start a job and poll for completion. The Step Functions workflow handles this with a polling loop:

{
  "CheckBDAStatus": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Next": "BDAComplete?",
    "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3}]
  },
  "BDAComplete?": {
    "Type": "Choice",
    "Choices": [
      {"Variable": "$.bda_status", "StringEquals": "SUCCESS", "Next": "BDADone"},
      {"Variable": "$.bda_status", "StringEquals": "ERROR",   "Next": "BDAFailed"}
    ],
    "Default": "WaitForBDA"
  },
  "WaitForBDA": {
    "Type": "Wait",
    "Seconds": 10,
    "Next": "CheckBDAStatus"
  }
}

Content Generation

Once all three analysis branches complete, a Lambda calls Bedrock Claude with a structured prompt. It includes the full transcription, the Rekognition visual timeline, the BDA story summary, and a template type parameter controlling output format.

prompt = f"""
You are a professional video editor reviewing footage.

TRANSCRIPTION:
{transcription_text}

VISUAL TIMELINE (Rekognition labels per timestamp):
{json.dumps(visual_timeline, indent=2)}

STORY ANALYSIS (BDA):
{bda_summary}

Generate a JSON response with:
- title: A compelling video title (max 80 chars)
- summary: 2-sentence description of the content
- key_moments: List of {{timestamp, description, importance_score}}
- suggested_clips: List of {{start_sec, end_sec, reason}}
- template_type: One of [news_reporter, event_aftermovie, business_content, yearly_wrapped]
"""

The model returns structured JSON. The pipeline validates it against domain models before passing it downstream. No fallback data is ever generated - if the model returns invalid JSON, the workflow fails with a clear error rather than propagating synthetic content.
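A sketch of that validation step, assuming the field names from the prompt above. The KeyMoment dataclass and the exception types are illustrative choices, not the pipeline's exact domain model:

```python
import json
from dataclasses import dataclass

@dataclass
class KeyMoment:
    timestamp: float
    description: str
    importance_score: float

def parse_generation_result(raw: str) -> dict:
    """Validate Bedrock's JSON output; raise instead of inventing fallbacks."""
    data = json.loads(raw)  # malformed JSON raises here - by design
    required = {"title", "summary", "key_moments", "suggested_clips", "template_type"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"Model output missing fields: {sorted(missing)}")
    # Coerce into typed domain objects; a malformed record raises TypeError
    data["key_moments"] = [KeyMoment(**m) for m in data["key_moments"]]
    return data
```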

The Film Crew: 8 AI Agents Making Editorial Decisions

The most interesting part of the pipeline is what happens after content generation. A CrewAI crew of eight agents - deployed to Amazon AgentCore - reviews the analysis results and makes the actual editorial decisions.

Agent Responsibility
Director | Narrative structure - which story to tell, what the arc should be
Editor | Clip selection and ordering based on the Director's brief
Cinematographer | Visual quality assessment - framing, composition, lighting
Sound Designer | Audio quality, ambient noise, music sync points
Fact Checker | Validates transcription accuracy, flags potential misinformation
Story Analyst | Identifies narrative arcs, emotional beats, character moments
Pacing Analyst | Edit rhythm - when to cut fast, when to let a moment breathe
Quality Checker | Final gate before compilation - reviews all other agents' decisions

The crew is triggered by film_crew_trigger, a Lambda that builds the payload and invokes AgentCore.

Critical: AgentCore Payload Contract

This contract was learned the hard way, across three separate incidents totalling ~12 hours of debugging. If you pass only extracted fields instead of the full analysis_result dict, AgentCore falls back to loading data from S3. That S3 path doesn't exist in all execution contexts. The result is a silent 500 error that's extremely difficult to trace - AgentCore logs are in a different region from the Step Functions execution.

# film_crew_trigger/handler.py

def build_crew_payload(analysis_results: dict, clips: list, ...) -> dict:
    return {
        "bucket": output_bucket,
        "analysis_id": analysis_id,
        "template_type": template_type,
        "clips": clips,
        "video_contexts": video_contexts,
        "adaptive_mode": adaptive["mode"],
        # CRITICAL: pass the full dict, never extracted fields
        "analysis_result": analysis_results,
    }

The crew outputs a VariantSpec - a structured description of the final video: which clips to include, in what order, with what transitions, and for which output format. The compilation Lambda reads this spec and calls MediaConvert to produce the final output in three aspect ratios: 16:9 (YouTube), 9:16 (Shorts/Reels), and 1:1 (LinkedIn).
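A sketch of what that spec could look like as a domain model - the field names and the ClipRef helper are illustrative guesses, not the pipeline's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class ClipRef:
    """One clip in the final cut. 'transition' values are hypothetical."""
    source_key: str
    start_sec: float
    end_sec: float
    transition: str = "cut"  # e.g. "cut", "crossfade"

@dataclass
class VariantSpec:
    """Illustrative shape of the crew's output contract."""
    template_type: str
    aspect_ratios: list[str] = field(default_factory=lambda: ["16:9", "9:16", "1:1"])
    clips: list[ClipRef] = field(default_factory=list)

    def total_duration(self) -> float:
        return sum(c.end_sec - c.start_sec for c in self.clips)
```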

 

The FSxN Version: Enterprise Archives at Scale

The fully-cloud version assumes video files start in S3. The enterprise version - built for NetApp customers - starts from a different premise: the footage already exists, at petabyte scale, on a NetApp ONTAP NAS system.

FSx for NetApp ONTAP provides an S3-compatible access layer over enterprise NAS storage. For this pipeline, that means:

No data migration Files stay on-premises where compliance requires them
No egress costs The pipeline reads from FSxN, not from a cloud copy of the archive
Full NAS metadata access Creation timestamps, folder structure, project tags - all enriching pipeline context
Bidirectional Processed outputs can write back to ONTAP for archiving alongside the originals

The pipeline treats the FSxN S3 access point identically to a standard S3 bucket. The Step Functions workflow accepts an input_bucket parameter - pointing it at the FSxN access point requires no code changes.

# The Lambda reads from whichever bucket the workflow specifies
input_bucket = event.get("input_bucket") or event.get("bucket")
video_key    = event["key"]

s3_client.download_file(input_bucket, video_key, local_path)

"Find all interviews where our journalists mentioned climate policy between 2020 and 2025, and produce a highlights reel."

A broadcaster with 20 years of archived footage can now run queries like this. The pipeline indexes the archive via Rekognition + Transcribe + BDA, stores the results, and makes the content searchable. The editing happens automatically.

Economics at Enterprise Scale

A broadcaster processing 10,000 videos per month pays approximately:

Service | Cost per video | Monthly (10,000 videos)
MediaConvert | ~€0.09 | ~€900
Amazon Transcribe | ~€0.11 | ~€1,100
Amazon Rekognition | ~€0.07 | ~€700
Bedrock (Claude) | ~€0.05 | ~€500
Step Functions | ~€0.01 | ~€100
AgentCore | ~€0.01 | ~€100
Total | ~€0.34 | ~€3,400

For context: a single human editor costs more per month than the entire pipeline costs per 10,000 videos. The pipeline doesn't replace editors - it handles the 80% of footage that would never be touched otherwise.
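As a sanity check, the per-video total is just the sum of the line items, and the monthly figure scales linearly:

```python
# Per-video costs from the table above, in EUR
costs = {
    "MediaConvert": 0.09,
    "Amazon Transcribe": 0.11,
    "Amazon Rekognition": 0.07,
    "Bedrock (Claude)": 0.05,
    "Step Functions": 0.01,
    "AgentCore": 0.01,
}

per_video = round(sum(costs.values()), 2)   # €0.34
monthly = round(per_video * 10_000)         # €3,400 at 10,000 videos/month
```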

Anti-Hallucination: Why 6 Validation Layers

AI pipelines in production video fail in a specific way: they produce confident-sounding but incorrect output. A transcription error in a political interview becomes a fabricated quote. A Rekognition misidentification invents a story that wasn't there.

1 Input validation - FileSanitizer checks codec, container, duration, and orientation before any analysis runs
2 Clip validation - does the timestamp exist? Is the duration non-zero? Does the content match the label?
3 Timestamp validation - clips are cross-referenced against source video duration; no clip can end after the video ends
4 Fact checking - the FactChecker agent compares transcription content against BDA analysis; discrepancies trigger a review flag
5 Quality checking - the QualityChecker agent runs final checks on clip selections, transition points, and narrative coherence
6 Output validation - the final VariantSpec is validated against the domain model schema before compilation starts

The timestamp checks behind layers 2 and 3 boil down to a few lines:
def validate_clip(clip: dict, video_duration_seconds: float) -> None:
    start = clip.get("start_sec", 0)
    end   = clip.get("end_sec", 0)

    if end > video_duration_seconds:
        raise ValueError(
            f"Clip end {end}s exceeds video duration {video_duration_seconds}s"
        )
    if start >= end:
        raise ValueError(
            f"Clip start {start}s must be before end {end}s"
        )
    if (end - start) < 1.0:
        raise ValueError(
            f"Clip duration {end - start}s is too short to be useful"
        )

Observability: When Things Go Wrong at 2am

A production pipeline that processes customer footage needs clear observability. Three tools matter.

Step Functions execution console - The single best debugging tool. Every state transition is logged with input and output. When a Lambda fails, the exact error is visible inline. When the BDA polling loop runs 40 times before timing out, you can see each iteration.

CloudWatch Lambda Insights - Duration, memory usage, and cold start frequency per function. The FrameObjectDetector Lambda benefits most from provisioned concurrency because Rekognition calls are time-sensitive and cold starts add 800ms-1.2s.

Custom log tail for AgentCore - AgentCore runs in us-east-1 while the main pipeline runs in eu-central-1. Cross-region log access is easy to forget when debugging. The tail script handles region switching automatically.

bash scripts/tail-agentcore-logs.sh

bash scripts/monitor-workflow.sh <execution-arn>

What I'd Do Differently

Keyframe extraction instead of fixed-interval sampling

The current approach extracts one frame every 2 seconds. A 3-minute video produces ~90 frames. Most of them are near-identical. A keyframe extractor using scene change detection would produce 15-20 meaningful frames at far lower Rekognition cost.

# FFmpeg scene change detection - outputs only frames where the scene changes
ffmpeg -i input.mp4 \
  -vf "select=gt(scene\,0.3)" \
  -vsync vfr \
  frames/keyframe_%04d.jpg

Vector search for archive querying

The current pipeline processes video on demand. An enterprise archive version needs search before processing - find the relevant footage first, then run the deep analysis. Storing Rekognition and Transcribe results as embeddings in OpenSearch would make "find all clips of our CEO from 2024" a sub-second query rather than a batch job.
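A sketch of what such a query body could look like against OpenSearch's k-NN search, combining semantic similarity with a metadata filter. The index field names ("embedding", "recorded_year") and the filter are hypothetical:

```python
def build_knn_query(query_embedding: list[float], k: int = 25,
                    year_from: int = 2024, year_to: int = 2024) -> dict:
    """Build an OpenSearch k-NN query body.

    Assumes an index with a knn_vector field 'embedding' holding clip
    embeddings and a numeric 'recorded_year' field - both illustrative.
    """
    return {
        "size": k,
        "query": {
            "bool": {
                "must": [
                    {"knn": {"embedding": {"vector": query_embedding, "k": k}}}
                ],
                "filter": [
                    {"range": {"recorded_year": {"gte": year_from, "lte": year_to}}}
                ],
            }
        },
    }

# The body would then be passed to an OpenSearch client's search() call.
```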

Streaming compilation status

The current architecture produces the final video asynchronously. For a human editor reviewing results, a real-time status stream via WebSocket or Server-Sent Events would dramatically improve the workflow. The editor could reject the Director's narrative choice before compilation starts rather than after waiting 90 seconds.
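With Server-Sent Events, each status update is a small text frame pushed over a long-lived HTTP response. A minimal sketch of the framing (the event names and payload fields are illustrative):

```python
import json

def sse_event(event_type: str, payload: dict) -> str:
    """Format one Server-Sent Events frame for a status stream.

    Event names like 'crew_decision' are hypothetical examples of what
    a Director-approval stream might emit.
    """
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"
```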

Getting Started

Prerequisites

AWS credentials configured (pipeline uses eu-central-1 by default) · Terraform 1.5+ · Python 3.11 · ffmpeg and ffprobe on your PATH

Quick smoke test - exercises the core pipeline without real footage:

make test       # ~2-3 minutes, costs ~€0.05

make test-full  # ~10-15 minutes, costs ~€2-3

The test scripts use real AWS services - there's no mock infrastructure. This is intentional: mock tests kept passing while the production deployment failed. Real tests are the only tests that matter for a pipeline this dependent on service integration.


Closing

The first version of this pipeline existed because I had 493 unedited clips and no patience for manual editing. The production version exists because that problem scales - every company with a video team has a version of the same backlog.

The technical architecture is interesting. The AI film crew is novel. But the actual value is simpler: footage that would never be published now gets published. A journalist's best interview from a conference three years ago becomes findable, taggable, and distributable in minutes.

That's the problem worth solving.

 

Further Reading

Official AWS documentation for the services used in this pipeline:

Amazon Transcribe
Automatic speech recognition with custom vocabulary and word-level timestamps
Amazon Rekognition
Image and video analysis - labels, faces, objects, scenes, and text
Amazon Bedrock
Foundation models including Claude for content generation and analysis
Amazon Bedrock AgentCore
Managed runtime for deploying and scaling AI agent workloads
AWS Elemental MediaConvert
File-based video transcoding for broadcast and multi-screen delivery
AWS Step Functions
Visual workflow orchestration for distributed applications
AWS Lambda
Serverless compute - run code without provisioning or managing servers
FSx for NetApp ONTAP
Fully managed ONTAP file system with S3-compatible access on AWS
CrewAI
Open-source framework for orchestrating multi-agent AI crews

Linda Mohamed is an AWS Hero and cloud architect based in Vienna. She builds AI-powered media systems and speaks at AWS events across Europe. All code referenced in this post is production infrastructure running on AWS.

Connect on LinkedIn · AWS Hero profile · GitHub

 
