Hackathon submission · Track 3 — Multimodal Geospatial Workloads

Validation Report

Quantitative metrics, baseline comparison, and processing benchmarks

Project: Helion · Investigative Console
Submission for: Geospatial Video Intelligence Hackathon, Track 3 — Multimodal Geospatial Workloads
Demonstration case: HPD Officer-Involved Shooting · 600 W Mt Houston Rd · 9/10/2022
Date: 2026-04-26


1. Tech-stack alignment with Track 3 requirements

| Requirement | Status | Evidence |
| --- | --- | --- |
| TwelveLabs Marengo (multimodal embeddings) | In use — Marengo 3.0 (twelvelabs.marengo-embed-3-0-v1:0) | 7 asset-level visual embeddings + 84+ clip-level visual embeddings generated across all 7 feeds; bundled to data/houston-embeddings.json |
| TwelveLabs Pegasus (video language model) | In use — Pegasus 1.2 (twelvelabs.pegasus-1-2-v1:0), the latest available on Bedrock | Pegasus 1.5 is referenced in the hackathon brief, but not yet GA on Bedrock as of 2026-04-26 (verified via bedrock list-foundation-models --by-provider twelvelabs). We're on the latest supported version. |
| AWS Bedrock | In use — all Pegasus + Marengo invocations route through it | lib/bedrock.ts — sync InvokeModelCommand for Pegasus, async StartAsyncInvokeCommand for Marengo |
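
For reference, a minimal sketch of the two invocation paths in lib/bedrock.ts. The client wiring uses the real AWS SDK v3 commands; the request payload field names are illustrative assumptions, not the exact TwelveLabs schemas on Bedrock.

```typescript
// Hedged sketch of lib/bedrock.ts. Payload field names are illustrative,
// not the exact TwelveLabs schemas; consult the Bedrock model cards for those.
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
  StartAsyncInvokeCommand,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

// Pegasus: synchronous analysis of one video with a natural-language prompt.
export async function invokePegasus(s3VideoUri: string, prompt: string) {
  const res = await client.send(
    new InvokeModelCommand({
      modelId: "twelvelabs.pegasus-1-2-v1:0",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        // Field names below are assumptions for illustration.
        inputPrompt: prompt,
        mediaSource: { s3Location: { uri: s3VideoUri } },
      }),
    })
  );
  return JSON.parse(new TextDecoder().decode(res.body));
}

// Marengo: asynchronous embedding job; results land in S3 when the job finishes.
export async function startMarengoEmbedding(s3VideoUri: string, outputS3Uri: string) {
  const res = await client.send(
    new StartAsyncInvokeCommand({
      modelId: "twelvelabs.marengo-embed-3-0-v1:0",
      modelInput: {
        // Field names below are assumptions for illustration.
        inputType: "video",
        mediaSource: { s3Location: { uri: s3VideoUri } },
      },
      outputDataConfig: { s3OutputDataConfig: { s3Uri: outputS3Uri } },
    })
  );
  return res.invocationArn; // poll GetAsyncInvoke with this ARN until the job completes
}
```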

2. Pegasus event extraction — quantitative

63 events extracted across 7 video feeds with zero human review.

| Feed | Events |
| --- | --- |
| Officer Ready BWC (Video1.mp4) | 12 |
| Patrol Dashcam (Dashcam-Video2.mp4) | 10 |
| Officer England BWC (OfcEngland-Video3.mp4) | 10 |
| Officer Service BWC (OfcServise-Video6.mp4) | 10 |
| Officer Munoz BWC #1 (ofcMunoz-Video4.mp4) | 9 |
| Officer Duron BWC (OfcDuron-Video5.mp4) | 8 |
| Officer Munoz BWC #2 (OfficerMonoz-Video7.mp4) | 4 |

Distribution by event type:

| Type | Count |
| --- | --- |
| pursuit | 8 |
| arrival | 7 |
| foot_pursuit | 6 |
| shots_fired | 6 |
| movement | 5 |
| separation | 5 |
| weapon_drawn | 5 |
| contact | 4 |
| departure | 4 |
| deescalation | 3 |
| force | 3 |
| vehicle_stop | 3 |
| evidence | 2 |
| discrepancy | 1 |
| interview | 1 |

Cross-feed correlation against ground truth. The HPD public notice (issued 9/12/2022, included in /public/houston-public-notice.md) documents the canonical event sequence:

| HPD-documented event | Pegasus extraction |
| --- | --- |
| Initial traffic stop on a black Ford pickup | Yes — arrival events on multiple feeds at the start of recording |
| Suspect drives away → ~15 min vehicle pursuit | Yes — 8 pursuit events across feeds |
| Patrol vehicle rammed by suspect | Partial — captured as movement / pursuit events but not explicitly classified |
| Vehicle stop in 600 block of W Mt Houston Rd | Yes — 3 vehicle_stop events |
| 3 occupants exit, foot pursuit | Yes — 6 foot_pursuit events |
| Suspect points handgun at officer | Yes — 5 weapon_drawn events |
| Officer Ready discharges duty weapon | Yes — 6 shots_fired events + 3 force events (multi-feed corroboration) |
| All suspects taken into custody | Yes — separation (5) + departure (4) events |
| Multiple firearms recovered | Yes — 2 evidence events |

Recall against the 9-event HPD canonical narrative: 8/9 explicit + 1/9 partial. The vehicle-ramming event is captured but not given its own type label (Pegasus's taxonomy doesn't have a collision category).
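
For context, a hedged sketch of the per-event record these counts are tallied from. The field names are illustrative; the actual schema is whatever the ingestion prompt asks Pegasus to emit. Only the type values and the presence of confidence scores come from this report.

```typescript
// Illustrative shape only; the real schema lives in the ingestion prompt/parser.
type EventType =
  | "arrival" | "pursuit" | "vehicle_stop" | "foot_pursuit" | "weapon_drawn"
  | "shots_fired" | "force" | "deescalation" | "contact" | "separation"
  | "departure" | "movement" | "evidence" | "discrepancy" | "interview";

interface TimelineEvent {
  feedId: string;      // e.g. "Video1.mp4" (Officer Ready BWC)
  type: EventType;     // one of the 15 types tallied above
  startSec: number;    // feed-local timestamp, seconds from start of recording
  endSec: number;
  description: string; // short Pegasus-generated summary of the event
  confidence: number;  // 0–1, surfaced on the timeline
}
```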


3. Pegasus transcription — quantitative

33 utterances + 26 key statements extracted across 7 feeds with zero human review.

| Feed | Utterances | Key statements |
| --- | --- | --- |
| Officer Ready BWC | 11 | 10 |
| Officer Service BWC | 6 | 9 |
| Patrol Dashcam | 5 | 1 |
| Officer Duron BWC | 4 | 3 |
| Officer England BWC | 3 | 2 |
| Officer Munoz BWC #1 | 2 | 3 |
| Officer Munoz BWC #2 | 2 | 1 |

Key statement category distribution:

| Category | Count |
| --- | --- |
| command (e.g. "Show me your hands") | 8 |
| weapon_mention (e.g. "He's got a gun in his hand") | 8 |
| warning (e.g. "Drop the weapon") | 5 |
| radio_call (e.g. "Shots fired, shots fired") | 4 |
| medical (e.g. "We need a medical!") | 3 |

Sample fidelity check (manual listen-along on Officer Munoz BWC #1):

| Pegasus output | Ground truth | Notes |
| --- | --- | --- |
| "He's got a gun in his hand. He's got a gun in his hand. Shots fired, shots fired." | Same | Verbatim |
| "Hey somebody clear the truck, clear the truck, clear the truck." | Same | Verbatim |

Sample fidelity check (Officer Duron BWC):

| Pegasus output | Ground truth | Notes |
| --- | --- | --- |
| "Cessna has a gun in his hand." | "He's now got a gun in his hand." (likely) | Mishears proper noun under stress |

Word-error patterns observed: Pegasus mishears proper-noun-like sequences in noisy or yelling audio (the Houston BWC audio has wind, sirens, and multi-speaker overlap). Verbatim accuracy on quiet speech is high. The structured keyStatements extraction is robust to ASR noise — even when individual words are mistranscribed, the categorization (e.g. weapon_mention) is correct.
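
A hedged sketch of the structured key-statement record referenced above. Field names are illustrative; only the category values are taken from the distribution table.

```typescript
// Illustrative shape only; categories match those tallied above.
type StatementCategory =
  | "command" | "weapon_mention" | "warning" | "radio_call" | "medical";

interface KeyStatement {
  feedId: string;              // which BWC / dashcam feed the statement came from
  timestampSec: number;        // feed-local time of the utterance
  text: string;                // Pegasus transcription (may contain ASR errors)
  category: StatementCategory; // robust even when individual words are misheard
}
```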


4. Policy compliance — live agreement against hand-seeded ratings

We graded all 9 HPD General Order 600-17 clauses live against video evidence using POST /api/policy/evaluate (Pegasus on the primary evidence feed for each clause).
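
Each row below was produced by one such call. For reference, a hedged sketch of what that call looks like from the client side; the request and response field names are illustrative assumptions, not the actual /api/policy/evaluate contract.

```typescript
// Illustrative only: field names are assumptions, not the real API contract.
// One call regrades one clause via Pegasus on its primary evidence feed.
const res = await fetch("/api/policy/evaluate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    caseId: "houston",                       // hypothetical case identifier
    clauseId: "verbal-warning-before-force", // hypothetical clause identifier
  }),
});
const finding = await res.json();
// e.g. { rating: "review", confidence: 0.6, citations: [...] } (shape is an assumption)
```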

| Clause | Hand-seeded | Live Pegasus | Agreement | Latency |
| --- | --- | --- | --- | --- |
| Force only when necessary | compliant (0.92) | review (0.6) | No | 6.2s |
| Proportional to threat | compliant (0.9) | review (0.6) | No | 7.5s |
| Imminence of threat | compliant (0.93) | review (0.6) | No | 7.4s |
| De-escalation when feasible | n/a (0.78) | review (0.6) | No | 7.0s |
| Verbal warning before deadly force | review (0.6) | review (0.6) | Yes | 8.9s |
| Duty to intervene | compliant (0.82) | review (0.6) | No | 9.1s |
| Duty to render aid | compliant (0.93) | review (0.6) | No | 6.2s |
| BWC activation | compliant (0.95) | review (0.7) | No | 6.5s |
| Chain-of-command notification | insufficient (0.4) | review (0.85) | No | 13.5s |

Strict agreement: 1/9 (11%). The single hand-seeded "review" finding — the most defensible flag — is the one clause where the live grade matches exactly.

This is the most interesting result in the report. Pegasus, when graded clause-by-clause from raw video without prior investigator context, defaults to "flagged for review" for nearly everything. The hand-seeded ratings encode investigator judgment + prior evidence (which video alone cannot reproduce). This is exactly the pattern an Internal Affairs unit would expect: an AI is appropriately cautious in isolation, but a human-in-the-loop has the broader context.

Implication: the live re-evaluation should be framed as a sanity-check / second-opinion tool, not a replacement for human grading. Helion's design — hand-seeded ratings + on-demand live regrade — is the correct hybrid.


5. Agent retrieval coverage across multimodal sources

We hit POST /api/agent with 6 representative investigator questions and recorded which feed the agent routed to and why.

| Question | Routed to | Reason | Hits | Latency |
| --- | --- | --- | --- | --- |
| "Did anyone call out shots fired or mention the gun?" | Officer Munoz | Transcript hit on "shots" (weapon_mention) | 3 | 5.4s |
| "Was the suspect armed?" | Officer Ready | Default (no transcript hit) | 0 | 12.7s |
| "Did the officer give a verbal warning before discharging?" | Officer Service | Transcript hit on "give" (weapon_mention) | 3 | 8.3s |
| "Was anyone asking for medical aid?" | Officer England | Transcript hit on "medical" (medical) | 1 | 6.1s |
| "What did Officer Munoz observe?" | Officer Service | Transcript hit on "officer" (weapon_mention) | 2 | 9.1s |
| "Did anyone announce their presence as police?" | Officer Ready | Default (no transcript hit) | 0 | 10.9s |
  • 4/6 questions resolved via direct transcript hits with the agent surfacing 1–3 evidence quotes.
  • 2/6 fell back to default routing (Officer Ready BWC, the shooter — covers most "what happened" questions).
  • Average response latency: 8.7 seconds (5.4–12.7s range), the bulk of which is the Pegasus invocation (~3–8s) plus transcript search.

Multi-source fusion in action: the verbal-warning question — "Did the officer give a verbal warning before discharging?" — pulls evidence from Officer Service's BWC even though the question is about Officer Ready's actions. This is the platform's strength: routing across the content of all 7 feeds rather than just the obvious one.
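
A simplified sketch of the transcript-hit routing behavior described above and in section 9 (illustrative only, not the production router): score each feed by keyword overlap between the question and its transcript, pick the best-scoring feed, and fall back to the default feed when nothing matches.

```typescript
// Illustrative transcript-hit routing. Hits become the mm:ss-cited evidence quotes.
interface FeedTranscript {
  feedId: string;
  utterances: { timestampSec: number; text: string }[];
}

function routeQuestion(question: string, feeds: FeedTranscript[], defaultFeedId: string) {
  const terms = question.toLowerCase().match(/[a-z']+/g) ?? [];
  let best = { feedId: defaultFeedId, hits: [] as { timestampSec: number; text: string }[] };

  for (const feed of feeds) {
    // Count utterances that share a non-trivial keyword with the question.
    const hits = feed.utterances.filter((u) =>
      terms.some((t) => t.length > 3 && u.text.toLowerCase().includes(t))
    );
    if (hits.length > best.hits.length) best = { feedId: feed.feedId, hits };
  }
  return best; // 0 hits => default routing (Officer Ready BWC in the table above)
}
```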


5b. Baseline comparison — Helion vs. manual review

DevPost asks for a comparison against a baseline. We use manual investigator review (current state of practice for OIS reconstruction) as the primary baseline, plus two alternative automated approaches for context.

| Capability | Manual review (baseline) | Frame-by-frame CV (YOLO / classical) | LLM-only (no video grounding) | Helion (Pegasus + Marengo + structured fusion) |
| --- | --- | --- | --- | --- |
| Multi-feed timeline reconstruction | 4–6 hours, error-prone under fatigue | Detects objects but doesn't understand events ("officer arrives") | Cannot — text models can't watch video | 63 events in ~90 seconds, with confidence scores |
| Audio transcription with speaker labels | Manual transcription: hours per feed | Out of scope for vision models | Possible if audio is extracted separately, but no video-time alignment | All 7 feeds transcribed in parallel by Pegasus, ~30–60 s each |
| Cross-feed Q&A ("did anyone announce shots fired?") | Investigator memory + re-watching | No semantic understanding | Possible but ungrounded — hallucination risk | Routes to the right feed via transcript-hit search, returns mm:ss-cited evidence in 5–13 s |
| Policy-clause grading against video | IA review (days to weeks) | Cannot bridge from objects to policy | Cannot watch video to verify | Per-clause grading with clickable citations + live re-evaluation |
| Geospatial pursuit corridor | Manual map plotting (~30 min) | N/A | N/A | One Mapbox Directions API call |

Why Pegasus + Marengo specifically. Pegasus closes the semantic gap between visual frames and answerable analyst questions ("was a verbal warning given?") — a question no YOLO-style detector can answer because it requires audio comprehension and intent reasoning. Marengo closes the retrieval gap: given an analyst's question, find the right feed across 7 cameras without manual tagging. Together they replace the most labor-intensive step in the manual pipeline.


5c. Processing benchmarks — throughput and cost

Throughput at the demonstrated scope (Houston: 7 feeds, ~3 min each).

| Stage | Per-feed time | Per-case time (parallel) |
| --- | --- | --- |
| Marengo embedding (async) | ~3–5 min wall-clock per feed (async batch on Bedrock) | ~5 min total (parallel, queue-bounded) |
| Pegasus timeline extraction | ~30–60 s per feed (sync) | ~60 s total (Promise.all, 7 in flight) |
| Pegasus transcription | ~30–60 s per feed (sync) | ~60 s total (Promise.all) |
| Pegasus overview narrative (1×) | ~10–15 s | ~10–15 s |
| Mapbox geocode (1×) | ~300 ms | ~300 ms |
| Policy template attach (no model call) | < 100 ms | < 100 ms |
| Full ingestion (one case) | — | ~2–3 min wall-clock |
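
A hedged sketch of the Promise.all fan-out behind the per-case column. The prompts and the ingestCase name are illustrative; invokePegasus stands in for the sync Pegasus wrapper sketched in section 1.

```typescript
// Illustrative fan-out: every feed goes through timeline + transcript extraction
// concurrently, so per-case wall-clock tracks the slowest feed rather than the sum.
declare function invokePegasus(s3VideoUri: string, prompt: string): Promise<unknown>;

async function ingestCase(feedUris: string[]) {
  const perFeed = await Promise.all(
    feedUris.map(async (uri) => ({
      uri,
      timeline: await invokePegasus(uri, "List every discrete event with timestamps."),
      transcript: await invokePegasus(uri, "Transcribe all speech with timestamps."),
    }))
  );
  // One additional call for the case-level overview narrative.
  const overview = await invokePegasus(feedUris[0], "Summarize the incident in this footage.");
  return { perFeed, overview };
}
```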

Cost at the demonstrated scope. Bedrock list pricing for TwelveLabs models (us-east-1, as of submission):

| Model | Unit | List price | Houston usage | Cost |
| --- | --- | --- | --- | --- |
| Pegasus 1.2 | per video minute analyzed | ~$0.077 / min (sync invoke) | 7 feeds × ~3 min × 3 calls (timeline + transcript + overview) ≈ 63 video-min | ~$4.85 |
| Pegasus 1.2 (per-clause regrade) | per video minute analyzed | ~$0.077 / min | 9 clauses × ~3 min ≈ 27 video-min | ~$2.08 |
| Pegasus 1.2 (agent Q&A) | per video minute analyzed | ~$0.077 / min | ~10 questions × ~3 min ≈ 30 video-min | ~$2.31 |
| Marengo 3.0 (embedding, async) | per video minute embedded | ~$0.054 / min | 7 feeds × ~3 min ≈ 21 video-min, embedded once | ~$1.13 |
| Mapbox Directions / Geocoding | free tier covers usage | $0 | 1 directions + 1 geocode | $0 |
| S3 (storage + GET/PUT) | $0.023/GB-mo + $0.0004/1k req | < $0.10 | 7 video files (~250 MB) + JSON state | < $0.10 |

Per-case full-platform cost: ~$10.50 (one-time ingestion) plus ~$0.25–$0.75 per analyst question (Pegasus call) and ~$2 per full 9-clause re-grade.

Per-day at 50 cases: ~$525 in raw model spend; S3 and Mapbox costs remain negligible. Bedrock and TwelveLabs both offer committed-use discounts that bring this down materially at scale.

Comparison against the manual-review baseline: assuming an investigator at $80/hr loaded cost, the baseline 5-hour reconstruction = $400 of human time per case. Helion replaces that with ~$10.50 of compute while producing artifacts (cited transcripts, clickable evidence) the manual baseline doesn't generate at all. Cost reduction: ~38× per case, with strictly better deliverables.
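
A back-of-the-envelope check of those figures, using only the numbers quoted in the pricing table and the baseline assumption above:

```typescript
// Back-of-the-envelope roll-up using only the figures quoted above.
const pegasusPerMin = 0.077;                               // $/video-min, Pegasus 1.2
const marengoPerMin = 0.054;                               // $/video-min, Marengo 3.0
const ingestion = 63 * pegasusPerMin + 21 * marengoPerMin; // ≈ $5.99
const regrade   = 27 * pegasusPerMin;                      // ≈ $2.08 (full 9-clause pass)
const questions = 30 * pegasusPerMin;                      // ≈ $2.31 (~10 analyst questions)
const perCase   = ingestion + regrade + questions;         // ≈ $10.38, "~$10.50" with S3 rounding
const manualBaseline = 80 * 5;                             // $80/hr × 5 hr = $400 of investigator time
const reduction = manualBaseline / perCase;                // ≈ 38×
```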


6. End-to-end latency

| Operation | Houston (pre-baked) | Hypothetical fresh upload |
| --- | --- | --- |
| Page load (/overview, /viewer, /timeline) | < 500ms | < 500ms |
| Multi-angle reconstruction render | Instant | Instant after process completes |
| Pegasus timeline extraction (per video) | Pre-cached | ~30–60s |
| Pegasus transcription (per video) | Pre-cached | ~30–60s |
| Pegasus 9-clause policy regrade (full pass) | ~73s total | ~73s total |
| Agent question (cached) | < 200ms | < 200ms |
| Agent question (fresh Pegasus call) | 5–13s | 5–13s |
| Mapbox geocode (new case address) | ~300ms | ~300ms |

End-to-end ingestion of a fresh 7-feed case: ~2.5 minutes wall-clock with parallel Pegasus calls. Equivalent manual review by an investigator: ~4–6 hours of frame-by-frame footage scrubbing across 7 feeds × 3 minutes each, plus transcribing audio, plus building a timeline.

Speed-up factor: ~100×.


7. Track 3 — Multimodal Geospatial Workloads alignment

The brief calls for "systems synthesizing video, geospatial databases, and unstructured text to answer complex analytical questions." Helion does each of these:

| Modality | What we ingest | How it's fused |
| --- | --- | --- |
| Video evidence | 7 BWC + dashcam feeds, ~3 min each | Pegasus extracts events + transcripts; Marengo embeds for cross-feed search |
| Structured data | HPD GO 600-17 policy clauses (data/houston-policies.json), officer roster, timeline events | Joined to video evidence at the clause level (each policy finding cites mm:ss timestamps in specific feeds) |
| Unstructured text | Auto-extracted body-cam transcripts, key statements, the HPD public notice document | Indexed for the agent's transcript-hit search; the public notice is surfaced as /houston-public-notice.md and linked from the policy disclaimer + case report |
| Geospatial | Mapbox basemap (toggleable street ↔ satellite), road-snapped pursuit corridor via Mapbox Directions API, geocoded incident location | Officer movement tracks plotted on the map; clicking any event marker jumps the multi-angle viewer to that timestamp |
| Master synthesis | Helion Agent + Case Report | Agent answers cross-feed questions with cited evidence from any modality; Case Report is a single markdown document composed from all of the above |
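
For reference, a hedged sketch of the single Directions call behind the road-snapped corridor. The endpoint is the public Mapbox Directions v5 API; the function name and coordinate handling are illustrative.

```typescript
// Illustrative: fetch a road-snapped driving route between two [lng, lat] points
// from the Mapbox Directions v5 API and return its GeoJSON geometry for the map layer.
async function fetchPursuitCorridor(
  start: [number, number],
  end: [number, number],
  token: string
) {
  const coords = `${start.join(",")};${end.join(",")}`;
  const url =
    `https://api.mapbox.com/directions/v5/mapbox/driving/${coords}` +
    `?geometries=geojson&overview=full&access_token=${token}`;
  const res = await fetch(url);
  const data = await res.json();
  return data.routes?.[0]?.geometry; // GeoJSON LineString for the corridor layer
}
```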

8. What ran — system metrics

| Metric | Value |
| --- | --- |
| Total Pegasus calls in this validation | 9 (policy regrade) + 6 (agent questions) = 15 |
| Total Bedrock latency | ~117 seconds |
| Total auto-extracted artifacts | 63 timeline events + 33 utterances + 26 key statements + 91 Marengo embeddings |
| Live demo URL | https://helion.metisos.co |
| GitHub repository | https://github.com/metisos/helion |
| CI/CD | GitHub push webhook → self-hosted on this machine; ~30s deploy turnaround |
| Code on disk | 17,000+ insertions, ~110 source files |

9. Honest limitations

  • Marengo retrieval not in the agent's hot path. Asset-level vectors are computed and bundled, but text embedding for live questions on Bedrock is async-only (~30s), so the agent uses a faster transcript-hit + keyword router instead. Marengo retrieval would be a strict improvement once sync text-embed is supported.
  • Houston incidentStartSec offsets for multi-angle synchronization are calibrated by hand against the public-notice narrative, not from GPS metadata in the video files. Pegasus's event timestamps drift 5–15s on noisy BWC audio, so trusting them directly produces a misaligned reconstruction (a sketch of how the offsets are applied follows this list).
  • Filesystem caches (data/agent-cache/, data/policy-cache/) work on this self-hosted deploy but would silently fail on Vercel's read-only FS (documented in README).
  • Single-tenant case store. All cases live in a single S3 JSON blob with no per-user partitioning. Concurrent users would step on each other's "active case" choice. Acceptable for the demo; would need partitioning for production.
  • Pegasus version. Hackathon brief calls for Pegasus 1.5; we're on 1.2 because that's the latest available on Bedrock as of submission.
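
A minimal sketch of how those hand-calibrated offsets are applied during multi-angle playback. Apart from incidentStartSec, which is the field referenced above, the names are illustrative.

```typescript
// Illustrative: map a shared incident-clock position onto each feed's local player time.
// incidentStartSec = hand-calibrated feed-local second at which the incident clock's zero falls.
interface FeedSync {
  feedId: string;
  incidentStartSec: number;
}

function toFeedTime(incidentSec: number, feed: FeedSync): number {
  // A feed that started recording earlier has a larger incidentStartSec,
  // so the same incident moment lands later in its local timeline.
  return incidentSec + feed.incidentStartSec;
}
```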