Hackathon submission · Track 3 — Multimodal Geospatial Workloads

Validation Report

Quantitative metrics, baseline comparison, and processing benchmarks

Project: Helion · Investigative Console
Submission for: Geospatial Video Intelligence Hackathon, Track 3 — Multimodal Geospatial Workloads
Demonstration case: HPD Officer-Involved Shooting · 600 W Mt Houston Rd · 9/10/2022
Date: 2026-04-26


1. Tech-stack alignment with Track 3 requirements

| Requirement | Status | Evidence |
| --- | --- | --- |
| TwelveLabs Marengo (multimodal embeddings) | In use — Marengo 3.0 (twelvelabs.marengo-embed-3-0-v1:0) | 7 asset-level visual embeddings + 84+ clip-level visual embeddings generated across all 7 feeds; bundled to data/houston-embeddings.json |
| TwelveLabs Pegasus (video language model) | In use — Pegasus 1.2 (twelvelabs.pegasus-1-2-v1:0), the latest available on Bedrock | Pegasus 1.5 is referenced in the hackathon brief, but not yet GA on Bedrock as of 2026-04-26 (verified via bedrock list-foundation-models --by-provider twelvelabs). We're on the latest supported version. |
| AWS Bedrock | In use — all Pegasus + Marengo invocations route through it | lib/bedrock.ts — sync InvokeModelCommand for Pegasus, async StartAsyncInvokeCommand for Marengo |
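
For reference, a minimal sketch of the two invocation paths in lib/bedrock.ts. The client wiring uses the real AWS SDK v3 commands; the request payload field names are illustrative assumptions, not the exact TwelveLabs schemas on Bedrock.

```typescript
// Hedged sketch of lib/bedrock.ts. Payload field names are illustrative,
// not the exact TwelveLabs schemas; consult the Bedrock model cards for those.
import {
  BedrockRuntimeClient,
  InvokeModelCommand,
  StartAsyncInvokeCommand,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

// Pegasus: synchronous analysis of one video with a natural-language prompt.
export async function invokePegasus(s3VideoUri: string, prompt: string) {
  const res = await client.send(
    new InvokeModelCommand({
      modelId: "twelvelabs.pegasus-1-2-v1:0",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        // Field names below are assumptions for illustration.
        inputPrompt: prompt,
        mediaSource: { s3Location: { uri: s3VideoUri } },
      }),
    })
  );
  return JSON.parse(new TextDecoder().decode(res.body));
}

// Marengo: asynchronous embedding job; results land in S3 when the job finishes.
export async function startMarengoEmbedding(s3VideoUri: string, outputS3Uri: string) {
  const res = await client.send(
    new StartAsyncInvokeCommand({
      modelId: "twelvelabs.marengo-embed-3-0-v1:0",
      modelInput: {
        // Field names below are assumptions for illustration.
        inputType: "video",
        mediaSource: { s3Location: { uri: s3VideoUri } },
      },
      outputDataConfig: { s3OutputDataConfig: { s3Uri: outputS3Uri } },
    })
  );
  return res.invocationArn; // poll GetAsyncInvoke with this ARN until the job completes
}
```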

2. Pegasus event extraction — quantitative

63 events extracted across 7 video feeds with zero human review.

| Feed | Events |
| --- | --- |
| Officer Ready BWC (Video1.mp4) | 12 |
| Patrol Dashcam (Dashcam-Video2.mp4) | 10 |
| Officer England BWC (OfcEngland-Video3.mp4) | 10 |
| Officer Service BWC (OfcServise-Video6.mp4) | 10 |
| Officer Munoz BWC #1 (ofcMunoz-Video4.mp4) | 9 |
| Officer Duron BWC (OfcDuron-Video5.mp4) | 8 |
| Officer Munoz BWC #2 (OfficerMonoz-Video7.mp4) | 4 |

Distribution by event type:

| Type | Count |
| --- | --- |
| pursuit | 8 |
| arrival | 7 |
| foot_pursuit | 6 |
| shots_fired | 6 |
| movement | 5 |
| separation | 5 |
| weapon_drawn | 5 |
| contact | 4 |
| departure | 4 |
| deescalation | 3 |
| force | 3 |
| vehicle_stop | 3 |
| evidence | 2 |
| discrepancy | 1 |
| interview | 1 |

Cross-feed correlation against ground truth. The HPD public notice (issued 9/12/2022, included in /public/houston-public-notice.md) documents the canonical event sequence:

| HPD-documented event | Pegasus extraction |
| --- | --- |
| Initial traffic stop on a black Ford pickup | Yes — arrival events on multiple feeds at the start of recording |
| Suspect drives away → ~15 min vehicle pursuit | Yes — 8 pursuit events across feeds |
| Patrol vehicle rammed by suspect | Partial — captured as movement / pursuit events but not explicitly classified |
| Vehicle stop in 600 block of W Mt Houston Rd | Yes — 3 vehicle_stop events |
| 3 occupants exit, foot pursuit | Yes — 6 foot_pursuit events |
| Suspect points handgun at officer | Yes — 5 weapon_drawn events |
| Officer Ready discharges duty weapon | Yes — 6 shots_fired events + 3 force events (multi-feed corroboration) |
| All suspects taken into custody | Yes — separation (5) + departure (4) events |
| Multiple firearms recovered | Yes — 2 evidence events |

Recall against the 9-event HPD canonical narrative: 8/9 explicit + 1/9 partial. The vehicle-ramming event is captured but not given its own type label (Pegasus's taxonomy doesn't have a collision category).
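
For context, a hedged sketch of the per-event record these counts are tallied from. The field names are illustrative; the actual schema is whatever the ingestion prompt asks Pegasus to emit. Only the type values and the presence of confidence scores come from this report.

```typescript
// Illustrative shape only; the real schema lives in the ingestion prompt/parser.
type EventType =
  | "arrival" | "pursuit" | "vehicle_stop" | "foot_pursuit" | "weapon_drawn"
  | "shots_fired" | "force" | "deescalation" | "contact" | "separation"
  | "departure" | "movement" | "evidence" | "discrepancy" | "interview";

interface TimelineEvent {
  feedId: string;      // e.g. "Video1.mp4" (Officer Ready BWC)
  type: EventType;     // one of the 15 types tallied above
  startSec: number;    // feed-local timestamp, seconds from start of recording
  endSec: number;
  description: string; // short Pegasus-generated summary of the event
  confidence: number;  // 0–1, surfaced on the timeline
}
```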


3. Pegasus transcription — quantitative

33 utterances + 26 key statements extracted across 7 feeds with zero human review.

| Feed | Utterances | Key statements |
| --- | --- | --- |
| Officer Ready BWC | 11 | 10 |
| Officer Service BWC | 6 | 9 |
| Patrol Dashcam | 5 | 1 |
| Officer Duron BWC | 4 | 3 |
| Officer England BWC | 3 | 2 |
| Officer Munoz BWC #1 | 2 | 3 |
| Officer Munoz BWC #2 | 2 | 1 |

Key statement category distribution:

| Category | Count |
| --- | --- |
| command (e.g. "Show me your hands") | 8 |
| weapon_mention (e.g. "He's got a gun in his hand") | 8 |
| warning (e.g. "Drop the weapon") | 5 |
| radio_call (e.g. "Shots fired, shots fired") | 4 |
| medical (e.g. "We need a medical!") | 3 |

Sample fidelity check (manual listen-along on Officer Munoz BWC #1):

| Pegasus output | Ground truth | Notes |
| --- | --- | --- |
| "He's got a gun in his hand. He's got a gun in his hand. Shots fired, shots fired." | Same | Verbatim |
| "Hey somebody clear the truck, clear the truck, clear the truck." | Same | Verbatim |

Sample fidelity check (Officer Duron BWC):

| Pegasus output | Ground truth | Notes |
| --- | --- | --- |
| "Cessna has a gun in his hand." | "He's now got a gun in his hand." (likely) | Mishears proper noun under stress |

Word-error patterns observed: Pegasus mishears proper-noun-like sequences in noisy or yelling audio (the Houston BWC audio has wind, sirens, and multi-speaker overlap). Verbatim accuracy on quiet speech is high. The structured keyStatements extraction is robust to ASR noise — even when individual words are mistranscribed, the categorization (e.g. weapon_mention) is correct.
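
A hedged sketch of the structured key-statement record referenced above. Field names are illustrative; only the category values are taken from the distribution table.

```typescript
// Illustrative shape only; categories match those tallied above.
type StatementCategory =
  | "command" | "weapon_mention" | "warning" | "radio_call" | "medical";

interface KeyStatement {
  feedId: string;              // which BWC / dashcam feed the statement came from
  timestampSec: number;        // feed-local time of the utterance
  text: string;                // Pegasus transcription (may contain ASR errors)
  category: StatementCategory; // robust even when individual words are misheard
}
```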


4. Policy compliance — live agreement against hand-seeded ratings

We graded all 9 HPD General Order 600-17 clauses live against video evidence using POST /api/policy/evaluate (Pegasus on the primary evidence feed for each clause).
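
Each row below was produced by one such call. For reference, a hedged sketch of what that call looks like from the client side; the request and response field names are illustrative assumptions, not the actual /api/policy/evaluate contract.

```typescript
// Illustrative only: field names are assumptions, not the real API contract.
// One call regrades one clause via Pegasus on its primary evidence feed.
const res = await fetch("/api/policy/evaluate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    caseId: "houston",                       // hypothetical case identifier
    clauseId: "verbal-warning-before-force", // hypothetical clause identifier
  }),
});
const finding = await res.json();
// e.g. { rating: "review", confidence: 0.6, citations: [...] } (shape is an assumption)
```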

| Clause | Hand-seeded | Live Pegasus | Agreement | Latency |
| --- | --- | --- | --- | --- |
| Force only when necessary | compliant (0.92) | review (0.6) | No | 6.2s |
| Proportional to threat | compliant (0.9) | review (0.6) | No | 7.5s |
| Imminence of threat | compliant (0.93) | review (0.6) | No | 7.4s |
| De-escalation when feasible | n/a (0.78) | review (0.6) | No | 7.0s |
| Verbal warning before deadly force | review (0.6) | review (0.6) | Yes | 8.9s |
| Duty to intervene | compliant (0.82) | review (0.6) | No | 9.1s |
| Duty to render aid | compliant (0.93) | review (0.6) | No | 6.2s |
| BWC activation | compliant (0.95) | review (0.7) | No | 6.5s |
| Chain-of-command notification | insufficient (0.4) | review (0.85) | No | 13.5s |

Strict agreement: 1/9 (11%). The single hand-seeded "review" finding — the most defensible flag — is the one clause where the live grade matches exactly.

This is the most interesting result in the report. Pegasus, when graded clause-by-clause from raw video without prior investigator context, defaults to "flagged for review" for nearly everything. The hand-seeded ratings encode investigator judgment + prior evidence (which video alone cannot reproduce). This is exactly the pattern an Internal Affairs unit would expect: an AI is appropriately cautious in isolation, but a human-in-the-loop has the broader context.

Implication: the live re-evaluation should be framed as a sanity-check / second-opinion tool, not a replacement for human grading. Helion's design — hand-seeded ratings + on-demand live regrade — is the correct hybrid.


5. Agent retrieval coverage across multimodal sources

We hit POST /api/agent with 6 representative investigator questions and recorded which feed the agent routed to and why.

| Question | Routed to | Reason | Hits | Latency |
| --- | --- | --- | --- | --- |
| "Did anyone call out shots fired or mention the gun?" | Officer Munoz | Transcript hit on "shots" (weapon_mention) | 3 | 5.4s |
| "Was the suspect armed?" | Officer Ready | Default (no transcript hit) | 0 | 12.7s |
| "Did the officer give a verbal warning before discharging?" | Officer Service | Transcript hit on "give" (weapon_mention) | 3 | 8.3s |
| "Was anyone asking for medical aid?" | Officer England | Transcript hit on "medical" (medical) | 1 | 6.1s |
| "What did Officer Munoz observe?" | Officer Service | Transcript hit on "officer" (weapon_mention) | 2 | 9.1s |
| "Did anyone announce their presence as police?" | Officer Ready | Default (no transcript hit) | 0 | 10.9s |
  • 4/6 questions resolved via direct transcript hits with the agent surfacing 1–3 evidence quotes.
  • 2/6 fell back to default routing (Officer Ready BWC, the shooter — covers most "what happened" questions).
  • Average response latency: 8.7 seconds (5.4–12.7s range), the bulk of which is the Pegasus invocation (~3–8s) plus transcript search.

Multi-source fusion in action: the verbal-warning question — "Did the officer give a verbal warning before discharging?" — pulls evidence from Officer Service's BWC even though the question is about Officer Ready's actions. This is the platform's strength: routing across the content of all 7 feeds rather than just the obvious one.
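
A simplified sketch of the transcript-hit routing behavior described above and in section 9 (illustrative only, not the production router): score each feed by keyword overlap between the question and its transcript, pick the best-scoring feed, and fall back to the default feed when nothing matches.

```typescript
// Illustrative transcript-hit routing. Hits become the mm:ss-cited evidence quotes.
interface FeedTranscript {
  feedId: string;
  utterances: { timestampSec: number; text: string }[];
}

function routeQuestion(question: string, feeds: FeedTranscript[], defaultFeedId: string) {
  const terms = question.toLowerCase().match(/[a-z']+/g) ?? [];
  let best = { feedId: defaultFeedId, hits: [] as { timestampSec: number; text: string }[] };

  for (const feed of feeds) {
    // Count utterances that share a non-trivial keyword with the question.
    const hits = feed.utterances.filter((u) =>
      terms.some((t) => t.length > 3 && u.text.toLowerCase().includes(t))
    );
    if (hits.length > best.hits.length) best = { feedId: feed.feedId, hits };
  }
  return best; // 0 hits => default routing (Officer Ready BWC in the table above)
}
```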


5b. Baseline comparison — Helion vs. manual review

DevPost asks for a comparison against a baseline. We use manual investigator review (current state of practice for OIS reconstruction) as the primary baseline, plus two alternative automated approaches for context.

| Capability | Manual review (baseline) | Frame-by-frame CV (YOLO / classical) | LLM-only (no video grounding) | Helion (Pegasus + Marengo + structured fusion) |
| --- | --- | --- | --- | --- |
| Multi-feed timeline reconstruction | 4–6 hours, error-prone under fatigue | Detects objects but doesn't understand events ("officer arrives") | Cannot — text models can't watch video | 63 events in ~90 seconds, with confidence scores |
| Audio transcription with speaker labels | Manual transcription: hours per feed | Out of scope for vision models | Possible if audio is extracted separately, but no video-time alignment | All 7 feeds transcribed in parallel by Pegasus, ~30–60 s each |
| Cross-feed Q&A ("did anyone announce shots fired?") | Investigator memory + re-watching | No semantic understanding | Possible but ungrounded — hallucination risk | Routes to the right feed via transcript-hit search, returns mm:ss-cited evidence in 5–13 s |
| Policy-clause grading against video | IA review (days to weeks) | Cannot bridge from objects to policy | Cannot watch video to verify | Per-clause grading with clickable citations + live re-evaluation |
| Geospatial pursuit corridor | Manual map plotting (~30 min) | N/A | N/A | One Mapbox Directions API call |

Why Pegasus + Marengo specifically. Pegasus closes the semantic gap between visual frames and answerable analyst questions ("was a verbal warning given?") — a question no YOLO-style detector can answer because it requires audio comprehension and intent reasoning. Marengo closes the retrieval gap: given an analyst's question, find the right feed across 7 cameras without manual tagging. Together they replace the most labor-intensive step in the manual pipeline.


5c. Processing benchmarks — throughput and cost

Throughput at the demonstrated scope (Houston: 7 feeds, ~3 min each).

| Stage | Per-feed time | Per-case time (parallel) |
| --- | --- | --- |
| Marengo embedding (async) | ~3–5 min wall-clock per feed (async batch on Bedrock) | ~5 min total (parallel, queue-bounded) |
| Pegasus timeline extraction | ~30–60 s per feed (sync) | ~60 s total (Promise.all, 7 in flight) |
| Pegasus transcription | ~30–60 s per feed (sync) | ~60 s total (Promise.all) |
| Pegasus overview narrative (1×) | ~10–15 s | ~10–15 s |
| Mapbox geocode (1×) | ~300 ms | ~300 ms |
| Policy template attach (no model call) | < 100 ms | < 100 ms |
| Full ingestion (one case) | — | ~2–3 min wall-clock |
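
A hedged sketch of the Promise.all fan-out behind the per-case column. The prompts and the ingestCase name are illustrative; invokePegasus stands in for the sync Pegasus wrapper sketched in section 1.

```typescript
// Illustrative fan-out: every feed goes through timeline + transcript extraction
// concurrently, so per-case wall-clock tracks the slowest feed rather than the sum.
declare function invokePegasus(s3VideoUri: string, prompt: string): Promise<unknown>;

async function ingestCase(feedUris: string[]) {
  const perFeed = await Promise.all(
    feedUris.map(async (uri) => ({
      uri,
      timeline: await invokePegasus(uri, "List every discrete event with timestamps."),
      transcript: await invokePegasus(uri, "Transcribe all speech with timestamps."),
    }))
  );
  // One additional call for the case-level overview narrative.
  const overview = await invokePegasus(feedUris[0], "Summarize the incident in this footage.");
  return { perFeed, overview };
}
```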

Cost at the demonstrated scope. Bedrock list pricing for TwelveLabs models (us-east-1, as of submission):

| Model | Unit | List price | Houston usage | Cost |
| --- | --- | --- | --- | --- |
| Pegasus 1.2 | per video minute analyzed | ~$0.077 / min (sync invoke) | 7 feeds × ~3 min × 3 calls (timeline + transcript + overview) ≈ 63 video-min | ~$4.85 |
| Pegasus 1.2 (per-clause regrade) | per video minute analyzed | ~$0.077 / min | 9 clauses × ~3 min ≈ 27 video-min | ~$2.08 |
| Pegasus 1.2 (agent Q&A) | per video minute analyzed | ~$0.077 / min | ~10 questions × ~3 min ≈ 30 video-min | ~$2.31 |
| Marengo 3.0 (embedding, async) | per video minute embedded | ~$0.054 / min | 7 feeds × ~3 min ≈ 21 video-min, embedded once | ~$1.13 |
| Mapbox Directions / Geocoding | free tier covers usage | $0 | 1 directions + 1 geocode | $0 |
| S3 (storage + GET/PUT) | $0.023/GB-mo + $0.0004/1k req | < $0.10 | 7 video files (~250 MB) + JSON state | < $0.10 |

Per-case full-platform cost: ~$10.50 (one-time ingestion) plus ~$0.25–$0.75 per analyst question (Pegasus call) and ~$2 per full 9-clause re-grade.

Per-day at 50 cases: ~$525 in raw model spend; S3 and Mapbox costs remain negligible. Bedrock and TwelveLabs both offer committed-use discounts that bring this down materially at scale.

Comparison against the manual-review baseline: assuming an investigator at $80/hr loaded cost, the baseline 5-hour reconstruction = $400 of human time per case. Helion replaces that with ~$10.50 of compute while producing artifacts (cited transcripts, clickable evidence) the manual baseline doesn't generate at all. Cost reduction: ~38× per case, with strictly better deliverables.
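
A back-of-the-envelope check of those figures, using only the numbers quoted in the pricing table and the baseline assumption above:

```typescript
// Back-of-the-envelope roll-up using only the figures quoted above.
const pegasusPerMin = 0.077;                               // $/video-min, Pegasus 1.2
const marengoPerMin = 0.054;                               // $/video-min, Marengo 3.0
const ingestion = 63 * pegasusPerMin + 21 * marengoPerMin; // ≈ $5.99
const regrade   = 27 * pegasusPerMin;                      // ≈ $2.08 (full 9-clause pass)
const questions = 30 * pegasusPerMin;                      // ≈ $2.31 (~10 analyst questions)
const perCase   = ingestion + regrade + questions;         // ≈ $10.38, "~$10.50" with S3 rounding
const manualBaseline = 80 * 5;                             // $80/hr × 5 hr = $400 of investigator time
const reduction = manualBaseline / perCase;                // ≈ 38×
```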


6. End-to-end latency

| Operation | Houston (pre-baked) | Hypothetical fresh upload |
| --- | --- | --- |
| Page load (/overview, /viewer, /timeline) | < 500ms | < 500ms |
| Multi-angle reconstruction render | Instant | Instant after process completes |
| Pegasus timeline extraction (per video) | Pre-cached | ~30–60s |
| Pegasus transcription (per video) | Pre-cached | ~30–60s |
| Pegasus 9-clause policy regrade (full pass) | ~73s total | ~73s total |
| Agent question (cached) | < 200ms | < 200ms |
| Agent question (fresh Pegasus call) | 5–13s | 5–13s |
| Mapbox geocode (new case address) | ~300ms | ~300ms |

End-to-end ingestion of a fresh 7-feed case: ~2.5 minutes wall-clock with parallel Pegasus calls. Equivalent manual review by an investigator: ~4–6 hours of frame-by-frame footage scrubbing across 7 feeds × 3 minutes each, plus transcribing audio, plus building a timeline.

Speed-up factor: ~100×.


7. Track 3 — Multimodal Geospatial Workloads alignment

The brief calls for "systems synthesizing video, geospatial databases, and unstructured text to answer complex analytical questions." Helion does each of these:

| Modality | What we ingest | How it's fused |
| --- | --- | --- |
| Video evidence | 7 BWC + dashcam feeds, ~3 min each | Pegasus extracts events + transcripts; Marengo embeds for cross-feed search |
| Structured data | HPD GO 600-17 policy clauses (data/houston-policies.json), officer roster, timeline events | Joined to video evidence at the clause level (each policy finding cites mm:ss timestamps in specific feeds) |
| Unstructured text | Auto-extracted body-cam transcripts, key statements, the HPD public notice document | Indexed for the agent's transcript-hit search; the public notice is surfaced as /houston-public-notice.md and linked from the policy disclaimer + case report |
| Geospatial | Mapbox basemap (toggleable street ↔ satellite), road-snapped pursuit corridor via Mapbox Directions API, geocoded incident location | Officer movement tracks plotted on the map; clicking any event marker jumps the multi-angle viewer to that timestamp |
| Master synthesis | Helion Agent + Case Report | Agent answers cross-feed questions with cited evidence from any modality; Case Report is a single markdown document composed from all of the above |
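
For reference, a hedged sketch of the single Directions call behind the road-snapped corridor. The endpoint is the public Mapbox Directions v5 API; the function name and coordinate handling are illustrative.

```typescript
// Illustrative: fetch a road-snapped driving route between two [lng, lat] points
// from the Mapbox Directions v5 API and return its GeoJSON geometry for the map layer.
async function fetchPursuitCorridor(
  start: [number, number],
  end: [number, number],
  token: string
) {
  const coords = `${start.join(",")};${end.join(",")}`;
  const url =
    `https://api.mapbox.com/directions/v5/mapbox/driving/${coords}` +
    `?geometries=geojson&overview=full&access_token=${token}`;
  const res = await fetch(url);
  const data = await res.json();
  return data.routes?.[0]?.geometry; // GeoJSON LineString for the corridor layer
}
```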

8. What ran — system metrics

| Metric | Value |
| --- | --- |
| Total Pegasus calls in this validation | 9 (policy regrade) + 6 (agent questions) = 15 |
| Total Bedrock latency | ~117 seconds |
| Total auto-extracted artifacts | 63 timeline events + 33 utterances + 26 key statements + 91 Marengo embeddings |
| Live demo URL | https://helion.metisos.co |
| GitHub repository | https://github.com/metisos/helion |
| CI/CD | GitHub push webhook → self-hosted on this machine; ~30s deploy turnaround |
| Code on disk | 17,000+ insertions, ~110 source files |

9. Honest limitations

  • Marengo retrieval not in the agent's hot path. Asset-level vectors are computed and bundled, but text embedding for live questions on Bedrock is async-only (~30s), so the agent uses a faster transcript-hit + keyword router instead. Marengo retrieval would be a strict improvement once sync text-embed is supported.
  • Houston incidentStartSec offsets for multi-angle synchronization are calibrated by hand against the public-notice narrative, not from GPS metadata in the video files. Pegasus's event timestamps drift 5–15s on noisy BWC audio, so trusting them directly produces a misaligned reconstruction (a sketch of how the offsets are applied follows this list).
  • Filesystem caches (data/agent-cache/, data/policy-cache/) work on this self-hosted deploy but would silently fail on Vercel's read-only FS (documented in README).
  • Single-tenant case store. All cases live in a single S3 JSON blob with no per-user partitioning. Concurrent users would step on each other's "active case" choice. Acceptable for the demo; would need partitioning for production.
  • Pegasus version. Hackathon brief calls for Pegasus 1.5; we're on 1.2 because that's the latest available on Bedrock as of submission.
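
A minimal sketch of how those hand-calibrated offsets are applied during multi-angle playback. Apart from incidentStartSec, which is the field referenced above, the names are illustrative.

```typescript
// Illustrative: map a shared incident-clock position onto each feed's local player time.
// incidentStartSec = hand-calibrated feed-local second at which the incident clock's zero falls.
interface FeedSync {
  feedId: string;
  incidentStartSec: number;
}

function toFeedTime(incidentSec: number, feed: FeedSync): number {
  // A feed that started recording earlier has a larger incidentStartSec,
  // so the same incident moment lands later in its local timeline.
  return incidentSec + feed.incidentStartSec;
}
```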