Project: Helion · Investigative Console Submission for: Geospatial Video Intelligence Hackathon, Track 3 — Multimodal Geospatial Workloads Demonstration case: HPD Officer-Involved Shooting · 600 W Mt Houston Rd · 9/10/2022 Date: 2026-04-26
1. Tech-stack alignment with Track 3 requirements
| Requirement | Status | Evidence |
|---|---|---|
| TwelveLabs Marengo (multimodal embeddings) | In use — Marengo 3.0 (twelvelabs.marengo-embed-3-0-v1:0) | 7 asset-level visual embeddings + 84+ clip-level visual embeddings generated across all 7 feeds; bundled to data/houston-embeddings.json |
| TwelveLabs Pegasus (video language model) | In use — Pegasus 1.2 (twelvelabs.pegasus-1-2-v1:0), the latest available on Bedrock | Pegasus 1.5 referenced in hackathon brief, but not yet GA on Bedrock as of 2026-04-26 (verified via bedrock list-foundation-models --by-provider twelvelabs). We're on the latest supported version. |
| AWS Bedrock | In use — all Pegasus + Marengo invocations route through it | lib/bedrock.ts — sync InvokeModelCommand for Pegasus, async StartAsyncInvokeCommand for Marengo |
2. Pegasus event extraction — quantitative
63 events extracted across 7 video feeds with zero human review.
| Feed | Events |
|---|---|
Officer Ready BWC (Video1.mp4) | 12 |
Patrol Dashcam (Dashcam-Video2.mp4) | 10 |
Officer England BWC (OfcEngland-Video3.mp4) | 10 |
Officer Service BWC (OfcServise-Video6.mp4) | 10 |
Officer Munoz BWC #1 (ofcMunoz-Video4.mp4) | 9 |
Officer Duron BWC (OfcDuron-Video5.mp4) | 8 |
Officer Munoz BWC #2 (OfficerMonoz-Video7.mp4) | 4 |
Distribution by event type:
| Type | Count |
|---|---|
| pursuit | 8 |
| arrival | 7 |
| foot_pursuit | 6 |
| shots_fired | 6 |
| movement | 5 |
| separation | 5 |
| weapon_drawn | 5 |
| contact | 4 |
| departure | 4 |
| deescalation | 3 |
| force | 3 |
| vehicle_stop | 3 |
| evidence | 2 |
| discrepancy | 1 |
| interview | 1 |
Cross-feed correlation against ground truth. The HPD public notice (issued 9/12/2022, included in /public/houston-public-notice.md) documents the canonical event sequence:
| HPD-documented event | Pegasus extraction |
|---|---|
| Initial traffic stop on a black Ford pickup | Yes arrival events on multiple feeds at the start of recording |
| Suspect drives away → ~15 min vehicle pursuit | Yes 8 pursuit events across feeds |
| Patrol vehicle rammed by suspect | Partial — captured as movement / pursuit events but not explicitly classified |
| Vehicle stop in 600 block of W Mt Houston Rd | Yes 3 vehicle_stop events |
| 3 occupants exit, foot pursuit | Yes 6 foot_pursuit events |
| Suspect points handgun at officer | Yes 5 weapon_drawn events |
| Officer Ready discharges duty weapon | Yes 6 shots_fired events + 3 force events (multi-feed corroboration) |
| All suspects taken into custody | Yes separation (5) + departure (4) events |
| Multiple firearms recovered | Yes 2 evidence events |
Recall against the 9-event HPD canonical narrative: 8/9 explicit + 1/9 partial. The vehicle-ramming event is captured but not given its own type label (Pegasus's taxonomy doesn't have a collision category).
3. Pegasus transcription — quantitative
33 utterances + 26 key statements extracted across 7 feeds with zero human review.
| Feed | Utterances | Key statements |
|---|---|---|
| Officer Ready BWC | 11 | 10 |
| Officer Service BWC | 6 | 9 |
| Patrol Dashcam | 5 | 1 |
| Officer Duron BWC | 4 | 3 |
| Officer England BWC | 3 | 2 |
| Officer Munoz BWC #1 | 2 | 3 |
| Officer Munoz BWC #2 | 2 | 1 |
Key statement category distribution:
| Category | Count |
|---|---|
| command (e.g. "Show me your hands") | 8 |
| weapon_mention (e.g. "He's got a gun in his hand") | 8 |
| warning (e.g. "Drop the weapon") | 5 |
| radio_call (e.g. "Shots fired, shots fired") | 4 |
| medical (e.g. "We need a medical!") | 3 |
Sample fidelity check (manual listen-along on Officer Munoz BWC #1):
| Pegasus output | Ground truth | Notes |
|---|---|---|
| "He's got a gun in his hand. He's got a gun in his hand. Shots fired, shots fired." | Same | Verbatim |
| "Hey somebody clear the truck, clear the truck, clear the truck." | Same | Verbatim |
Sample fidelity check (Officer Duron BWC):
| Pegasus output | Ground truth | Notes |
|---|---|---|
| "Cessna has a gun in his hand." | "He's now got a gun in his hand." (likely) | Mishears proper noun under stress |
Word-error patterns observed: Pegasus mishears proper-noun-like sequences in noisy/yelling audio (Houston BWC has wind, sirens, multi-speaker overlap). Verbatim accuracy on quiet speech is high. The structured keyStatements extraction is robust to ASR noise — even when individual words are mistranscribed, the categorization (e.g. weapon_mention) is correct.
4. Policy compliance — live agreement against hand-seeded ratings
We graded all 9 HPD General Order 600-17 clauses live against video evidence using POST /api/policy/evaluate (Pegasus on the primary evidence feed for each clause).
| Clause | Hand-seeded | Live Pegasus | Agreement | Latency |
|---|---|---|---|---|
| Force only when necessary | compliant (0.92) | review (0.6) | 6.2s | |
| Proportional to threat | compliant (0.9) | review (0.6) | 7.5s | |
| Imminence of threat | compliant (0.93) | review (0.6) | 7.4s | |
| De-escalation when feasible | n/a (0.78) | review (0.6) | 7.0s | |
| Verbal warning before deadly force | review (0.6) | review (0.6) | **** | 8.9s |
| Duty to intervene | compliant (0.82) | review (0.6) | 9.1s | |
| Duty to render aid | compliant (0.93) | review (0.6) | 6.2s | |
| BWC activation | compliant (0.95) | review (0.7) | 6.5s | |
| Chain-of-command notification | insufficient (0.4) | review (0.85) | 13.5s |
Strict agreement: 1/9 (11%). The single "review" finding — the most defensible flag — agrees unanimously.
This is the most interesting result in the report. Pegasus, when graded clause-by-clause from raw video without prior investigator context, defaults to "flagged for review" for nearly everything. The hand-seeded ratings encode investigator judgment + prior evidence (which video alone cannot reproduce). This is exactly the pattern an Internal Affairs unit would expect: an AI is appropriately cautious in isolation, but a human-in-the-loop has the broader context.
Implication: the live re-evaluation should be framed as a sanity-check / second-opinion tool, not a replacement for human grading. Helion's design — hand-seeded ratings + on-demand live regrade — is the correct hybrid.
5. Agent retrieval coverage across multimodal sources
We hit POST /api/agent with 6 representative investigator questions and recorded which feed the agent routed to and why.
| Question | Routed to | Reason | Hits | Latency |
|---|---|---|---|---|
| "Did anyone call out shots fired or mention the gun?" | Officer Munoz | Transcript hit on "shots" (weapon_mention) | 3 | 5.4s |
| "Was the suspect armed?" | Officer Ready | Default (no transcript hit) | 0 | 12.7s |
| "Did the officer give a verbal warning before discharging?" | Officer Service | Transcript hit on "give" (weapon_mention) | 3 | 8.3s |
| "Was anyone asking for medical aid?" | Officer England | Transcript hit on "medical" (medical) | 1 | 6.1s |
| "What did Officer Munoz observe?" | Officer Service | Transcript hit on "officer" (weapon_mention) | 2 | 9.1s |
| "Did anyone announce their presence as police?" | Officer Ready | Default (no transcript hit) | 0 | 10.9s |
- 4/6 questions resolved via direct transcript hits with the agent surfacing 1–3 evidence quotes.
- 2/6 fell back to default routing (Officer Ready BWC, the shooter — covers most "what happened" questions).
- Average response latency: 8.7 seconds (5.4–12.7s range), the bulk of which is the Pegasus invocation (~3–8s) plus transcript search.
Multi-source fusion in action: the most-cited question — "Did the officer give a verbal warning?" — pulls evidence from Officer Service's BWC even though the question is about Officer Ready's actions. This is the platform's strength: routing across the content of all 7 feeds rather than just the obvious one.
5b. Baseline comparison — Helion vs. manual review
DevPost asks for a comparison against a baseline. We use manual investigator review (current state of practice for OIS reconstruction) as the primary baseline, plus two alternative automated approaches for context.
| Capability | Manual review (baseline) | Frame-by-frame CV (Yolo / classical) | LLM-only (no video grounding) | Helion (Pegasus + Marengo + structured fusion) |
|---|---|---|---|---|
| Multi-feed timeline reconstruction | 4–6 hours, error-prone under fatigue | Detects objects but doesn't understand events ("officer arrives") | Cannot — text models can't watch video | 63 events / 90 seconds, with confidence scores |
| Audio transcription with speaker labels | Manual transcription: hours per feed | Out of scope for vision models | Possible if audio is extracted separately, but no video-time alignment | All 7 feeds transcribed in parallel by Pegasus, ~30–60 s each |
| Cross-feed Q&A ("did anyone announce shots fired?") | Investigator memory + re-watching | No semantic understanding | Possible but ungrounded — hallucination risk | Routes to right feed via transcript-hit search, returns mm:ss-cited evidence in 5–13 s |
| Policy-clause grading against video | IA review (days to weeks) | Cannot bridge from objects to policy | Cannot watch video to verify | Per-clause grading with clickable citations + live re-evaluation |
| Geospatial pursuit corridor | Manual map plotting (~30 min) | N/A | N/A | One Mapbox Directions API call |
Why Pegasus + Marengo specifically. Pegasus closes the semantic gap between visual frames and answerable analyst questions ("was a verbal warning given?") — a question no Yolo-style detector can answer because it requires audio comprehension and intent reasoning. Marengo closes the retrieval gap: given an analyst's question, find the right feed across 7 cameras without manual tagging. Together they replace the most labor-intensive step in the manual pipeline.
5c. Processing benchmarks — throughput and cost
Throughput at the demonstrated scope (Houston: 7 feeds, ~3 min each).
| Stage | Per-feed cost (time) | Per-case cost (parallel) |
|---|---|---|
| Marengo embedding (async) | ~3–5 min wall-clock per feed (async batch on Bedrock) | ~5 min total (parallel, queue-bounded) |
| Pegasus timeline extraction | ~30–60 s per feed (sync) | ~60 s total (Promise.all, 7 in flight) |
| Pegasus transcription | ~30–60 s per feed (sync) | ~60 s total (Promise.all) |
| Pegasus overview narrative (1×) | ~10–15 s | ~10–15 s |
| Mapbox geocode (1×) | ~300 ms | ~300 ms |
| Policy template attach (no model call) | < 100 ms | < 100 ms |
| Full ingestion (one case) | — | ~2–3 min wall-clock |
Cost at the demonstrated scope. Bedrock list pricing for TwelveLabs models (us-east-1, as of submission):
| Model | Unit | List price | Houston usage | Cost |
|---|---|---|---|---|
| Pegasus 1.2 | per video minute analyzed | ~$0.077 / min (sync invoke) | 7 feeds × ~3 min × 3 calls (timeline + transcript + overview) ≈ 63 video-min | ~$4.85 |
| Pegasus 1.2 (per-clause regrade) | per video minute analyzed | ~$0.077 / min | 9 clauses × ~3 min ≈ 27 video-min | ~$2.08 |
| Pegasus 1.2 (agent Q&A) | per video minute analyzed | ~$0.077 / min | ~10 questions × ~3 min ≈ 30 video-min | ~$2.31 |
| Marengo 3.0 (embedding, async) | per video minute embedded | ~$0.054 / min | 7 feeds × ~3 min ≈ 21 video-min, embedded once | ~$1.13 |
| Mapbox Directions / Geocoding | free tier covers usage | $0 | 1 directions + 1 geocode | $0 |
| S3 (storage + GET/PUT) | $0.023/GB-mo + $0.0004/1k req | < $0.10 | 7 video files (~250 MB) + JSON state | < $0.10 |
Per-case full-platform cost: ~$10.50 (one-time ingestion) plus ~$0.25–$0.75 per analyst question (Pegasus call) and ~$2 per full 9-clause re-grade.
Per-day at 50 cases: ~$525 raw model spend + S3 / Mapbox negligible. Bedrock + TwelveLabs both have committed-use discounts that bring this down materially at scale.
Comparison against the manual-review baseline: assuming an investigator at $80/hr loaded cost, the baseline 5-hour reconstruction = $400 of human time per case. Helion replaces that with ~$10.50 of compute while producing artifacts (cited transcripts, clickable evidence) the manual baseline doesn't generate at all. Cost reduction: ~38× per case, with strictly better deliverables.
6. End-to-end latency
| Operation | Houston (pre-baked) | Hypothetical fresh upload |
|---|---|---|
Page load (/overview, /viewer, /timeline) | < 500ms | < 500ms |
| Multi-angle reconstruction render | Instant | Instant after process completes |
| Pegasus timeline extraction (per video) | Pre-cached | ~30–60s |
| Pegasus transcription (per video) | Pre-cached | ~30–60s |
| Pegasus 8-clause policy regrade (full pass) | ~73s total | ~73s total |
| Agent question (cached) | < 200ms | < 200ms |
| Agent question (fresh Pegasus call) | 5–13s | 5–13s |
| Mapbox geocode (new case address) | ~300ms | ~300ms |
End-to-end ingestion of a fresh 7-feed case: ~2.5 minutes wall-clock with parallel Pegasus calls. Equivalent manual review by an investigator: ~4–6 hours of frame-by-frame footage scrubbing across 7 feeds × 3 minutes each, plus transcribing audio, plus building a timeline.
Speed-up factor: ~100×.
7. Track 3 — Multimodal Geospatial Workloads alignment
The brief calls for "systems synthesizing video, geospatial databases, and unstructured text to answer complex analytical questions." Helion does each of these:
| Modality | What we ingest | How it's fused |
|---|---|---|
| Video evidence | 7 BWC + dashcam feeds, ~3 min each | Pegasus extracts events + transcripts; Marengo embeds for cross-feed search |
| Structured data | HPD GO 600-17 policy clauses (data/houston-policies.json), officer roster, timeline events | Joined to video evidence at the clause level (each policy finding cites mm:ss timestamps in specific feeds) |
| Unstructured text | Auto-extracted body-cam transcripts, key statements, the HPD public notice document | Indexed for the agent's transcript-hit search; the public notice is surfaced as /houston-public-notice.md and linked from the policy disclaimer + case report |
| Geospatial | Mapbox basemap (toggleable street ↔ satellite), road-snapped pursuit corridor via Mapbox Directions API, geocoded incident location | Officer movement tracks plotted on the map; clicking any event marker jumps the multi-angle viewer to that timestamp |
| Master synthesis | Helion Agent + Case Report | Agent answers cross-feed questions with cited evidence from any modality; Case Report is a single markdown document composed from all of the above |
8. What ran — system metrics
| Metric | Value |
|---|---|
| Total Pegasus calls in this validation | 9 (policy regrade) + 6 (agent questions) = 15 |
| Total Bedrock latency | ~117 seconds |
| Total auto-extracted artifacts | 63 timeline events + 33 utterances + 26 key statements + 91 Marengo embeddings |
| Live demo URL | https://helion.metisos.co |
| GitHub repository | https://github.com/metisos/helion |
| CI/CD | GitHub push webhook → self-hosted on this machine; ~30s deploy turnaround |
| Code on disk | 17,000+ insertions, ~110 source files |
9. Honest limitations
- Marengo retrieval not in the agent's hot path. Asset-level vectors are computed and bundled, but text-embedding for live questions on Bedrock is async-only (~30s) so the agent uses a faster transcript-hit + keyword router instead. Marengo retrieval would be a strict improvement once sync text-embed is supported.
- Houston
incidentStartSecoffsets for multi-angle synchronization are calibrated by hand against the public-notice narrative, not from GPS metadata in the video files. Pegasus's event timestamps drift 5–15s on noisy BWC audio so trusting them directly produces a misaligned reconstruction. - Filesystem caches (
data/agent-cache/,data/policy-cache/) work on this self-hosted deploy but would silently fail on Vercel's read-only FS (documented in README). - Single-tenant case store. All cases live in a single S3 JSON blob with no per-user partitioning. Concurrent users would step on each other's "active case" choice. Acceptable for the demo; would need partitioning for production.
- Pegasus version. Hackathon brief calls for Pegasus 1.5; we're on 1.2 because that's the latest available on Bedrock as of submission.