E6 — Pose-template cricket batting detection

YOLO11n-pose templates from CricketVision (broadcast 720p) → matched against 11 4K@120fps controlled test videos (8 Batting Pose + 3 Balling) + 3 wild YouTube broadcast clips (Wild Batting, v3.2 only). Goal: detect batting impact moments via pure pose similarity.
2026-05-26 · v1 status: method fail · 2026-05-27 · v2 update: bat-grounded batsman selection refutes the "wrong-person" hypothesis; the impact-pool gap shrinks but does not flip. · 2026-05-27 · v3.2 update: 3 user-curated action templates + bat-grounded + single-pose vis flips the per-class gap to the correct direction (Batting > Balling by +0.013). See v2 · v3.2 below.

v1 headline finding. The single-frame pose-template approach has no discriminative power between batting and balling videos.

Top-peak similarity across 11 videos (v1 with heuristic selection; per video, the higher of the impact-pool and ft-pool top-1 peaks):

Group	n	mean	min	max
Batting Pose	8	0.894	0.882	0.908
Balling	3	0.908	0.894	0.930

Balling videos score higher on batting templates than the Batting Pose videos themselves do. No threshold separates the two groups.

What the per-video montages actually show

The top-K "matches" at sim ≥ 0.87 are not batting impacts. In the Balling-134448 montage below they include: a standing batsman (not swinging), the main umpire, an outfielder running, and a celebrating player. Under 17-keypoint hip-shoulder normalization, any cricket-attired person standing in side profile becomes geometrically close to the 200+ impact poses in the template pool.

Why it fails (signal-level diagnosis)

What pose template captures

Coarse 17-joint COCO skeleton (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles)
Hip-centered, hip→shoulder-midpoint unit length normalization
Confidence-masked L2 over the joint intersection
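The normalization described above can be sketched as follows. This is a minimal sketch, not the project's code: the function name `normalize_pose` is hypothetical, the 0.30 confidence threshold is the one stated in the template table, and requiring both shoulders and both hips to be valid is an assumption (consistent with the "no valid shoulders → normalize failed" note for screenshot_batsman).

```python
import numpy as np

# COCO-17 keypoint indices used by the torso-based normalization.
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12
CONF_THRESH = 0.30  # per-joint confidence mask threshold (from the template table)


def normalize_pose(kp_xy, kp_conf, conf_thresh=CONF_THRESH):
    """Hip-centered, torso-length-normalized pose.

    kp_xy: (17, 2) pixel coordinates, kp_conf: (17,) keypoint confidences.
    Returns (kp_norm, mask), or None when the torso joints are unusable.
    """
    kp_xy = np.asarray(kp_xy, dtype=float)
    mask = np.asarray(kp_conf, dtype=float) >= conf_thresh

    # Assumption: all four torso joints must be valid to define origin and scale.
    if not mask[[L_SHOULDER, R_SHOULDER, L_HIP, R_HIP]].all():
        return None

    hip_mid = kp_xy[[L_HIP, R_HIP]].mean(axis=0)
    shoulder_mid = kp_xy[[L_SHOULDER, R_SHOULDER]].mean(axis=0)
    scale = np.linalg.norm(shoulder_mid - hip_mid)  # hip -> shoulder-midpoint length
    if scale < 1e-6:
        return None

    # Origin at the hip midpoint, one unit equals one torso length.
    return (kp_xy - hip_mid) / scale, mask
```

Because the origin and the unit length both come from the torso, the result is invariant to where the person stands in the frame and to their pixel size, which is what lets 720p broadcast templates be compared with 4K query frames.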

What it misses (the actual impact signal)

Bat position — the single most informative joint, not in COCO-17
Motion gradient — single frame, no temporal velocity
Game-context ROI — no awareness of batsman vs umpire vs fielder

Net: most of the 17 normalized joints are head/torso/leg structure that looks similar across any standing person; the wrist joints that could distinguish swinging from standing are averaged into the per-joint mean distance and contribute too little to move the score.
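A rough worked bound makes the dilution concrete. It assumes the similarity defined under "How matching works" (mean per-joint Euclidean distance, sim = 1 / (1 + d)), all 17 joints valid, and a query that matches a template exactly except for one wrist displaced by δ torso lengths (δ is a symbol introduced here for illustration only):

$$
d = \frac{1}{17}\bigl(16 \cdot 0 + \delta\bigr) = \frac{\delta}{17},
\qquad
\mathrm{sim} = \frac{1}{1 + \delta/17} = \frac{17}{17 + \delta}
$$

With δ = 1 this gives sim ≈ 0.944, and with δ = 2 it gives sim ≈ 0.895. Under these assumptions, even a wrist displaced by two full torso lengths keeps the score inside the 0.88 to 0.93 band that every test video reaches.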

v2 update — bat-grounded batsman selection 2026-05-27

The v1 failure left an obvious hypothesis: maybe the heuristic (size + centrality + KP confidence) picks the wrong person in the 4K test frames — an umpire or a standing batsman instead of the one actually playing the shot. Templates would then be matched against poses of cricket-attired non-actors. v2 tests this hypothesis directly.

What changed

For each frame: YOLO11n-pose as before → all person bboxes.
If >1 person, run OWLv2-base zero-shot with prompt "cricket bat" on a 960-px downsample.
Keep bat boxes with conf ≥ 0.15; take top-conf box.
Pick the person whose bbox contains the bat center (or nearest if none contain it).
If no bat clears 0.15, fall back to the v1 heuristic.
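The selection steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the project's BatsmanDetector: the function name `pick_batsman` and the method labels are hypothetical, the bat detections are assumed to arrive as (box, confidence) pairs already produced by OWLv2, and "nearest" is assumed to mean the smallest distance between the person-box center and the bat center.

```python
import math

BAT_CONF_MIN = 0.15  # bat boxes below this confidence are ignored


def _center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def _contains(box, pt):
    x1, y1, x2, y2 = box
    return x1 <= pt[0] <= x2 and y1 <= pt[1] <= y2


def pick_batsman(person_boxes, bat_dets, heuristic_idx):
    """Return (person index, selection method).

    person_boxes: list of (x1, y1, x2, y2) from the pose model.
    bat_dets: list of ((x1, y1, x2, y2), conf) from the bat detector.
    heuristic_idx: index the v1 heuristic would have picked (fallback).
    """
    if not person_boxes:
        return None, "none"
    if len(person_boxes) == 1:
        return 0, "single_person"  # the bat detector is only needed for >1 person

    bats = [(box, conf) for box, conf in bat_dets if conf >= BAT_CONF_MIN]
    if not bats:
        return heuristic_idx, "heuristic_fallback"

    bat_box, _ = max(bats, key=lambda bc: bc[1])  # top-confidence bat box
    bat_center = _center(bat_box)

    # Prefer persons whose bbox contains the bat center; otherwise consider all.
    inside = [i for i, p in enumerate(person_boxes) if _contains(p, bat_center)]
    candidates = inside or range(len(person_boxes))
    idx = min(candidates, key=lambda i: math.dist(_center(person_boxes[i]), bat_center))
    return idx, "bat_grounded"
```

The three return labels mirror the selection split reported on the per-video cards (bat-grounded, single, fallback).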

Why OWLv2 (not YOLO-World)

Phase-A spike: yolov8s-world with prompt "cricket bat" recalls only 22% of test-frame bats (4/18), 8% of broadcast (1/12). Phase-B: OWLv2-base jumps to 70% recall on the same frames (median top-1 score 0.31, max 0.54). Bat-bbox precision is essentially 100% in both: when fired, the box always lands on the batsman.

Per-class top-1 smoothed similarity (n=11 videos)

Class	n	Impact v1	Impact v2	Impact Δ	FT v1	FT v2	FT Δ
Batting Pose	8	0.8909	0.8936	+0.0027	0.8810	0.8832	+0.0022
Balling	3	0.9035	0.9041	+0.0006	0.8860	0.8840	-0.0021
Cross-class gap (Balling − Batting)		+0.0126	+0.0105	-0.0021	+0.0050	+0.0008	-0.0042

Balling > Batting on top-1 impact similarity means the method is "more confident this is a batting impact" on the Balling group than on the Batting group — the failure mode. v2 narrows the gap but does not flip it on the impact pool; it nearly eliminates it on the ft pool.
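As a check on how the gap is read, the class means and the cross-class gap can be recomputed from the per-video v2 impact top-1 values listed on the per-video cards. The variable names are illustrative; the numbers are copied from this report.

```python
from statistics import mean

# Per-video v2 impact top-1 (smoothed) values, copied from the per-video cards.
v2_impact_top1 = {
    "Batting Pose": [0.8897, 0.9078, 0.9076, 0.8878, 0.8917, 0.8846, 0.8790, 0.9006],
    "Balling": [0.8868, 0.9300, 0.8955],
}

class_mean = {cls: mean(vals) for cls, vals in v2_impact_top1.items()}

# Positive gap = Balling scores higher on batting templates (the failure mode).
gap = class_mean["Balling"] - class_mean["Batting Pose"]

batting = v2_impact_top1["Batting Pose"]
balling = v2_impact_top1["Balling"]
# A single threshold can separate the classes only if their ranges do not overlap.
separable = min(batting) > max(balling) or min(balling) > max(batting)

print(f"Batting Pose mean: {class_mean['Batting Pose']:.4f}")
print(f"Balling mean:      {class_mean['Balling']:.4f}")
print(f"gap (Balling - Batting): {gap:+.4f}, separable by one threshold: {separable}")
```

The ranges overlap (Batting Pose spans 0.8790 to 0.9078, Balling spans 0.8868 to 0.9300), so the sign of the mean gap is the only class-level signal available at this stage.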

v2 conclusion

Bat-grounded selection works as designed at the selection step: when the bat detector fires, it correctly identifies the batsman. But the headline numbers move by at most ~0.003 per class (impact pool: +0.0027 Batting Pose, +0.0006 Balling). The hypothesis "wrong-person selection drives the failure" is refuted: even with the right person, the 17-joint normalized pose representation still cannot distinguish a swinging batsman from a standing batsman or a running fielder. The remaining impact-pool gap (Balling +0.0105 above Batting) is intrinsic to the pose feature, not the selection. v2 is a clean diagnostic that promotes the next bottleneck from "selection" to "feature".

v3.2 update — 3 user-curated action templates · single-pose visualization 2026-05-27

v2 left a second suspect: the 201-template impact pool itself was likely contaminated. Many of the auto-built templates were probably not the actual swinging batsman (umpire, standing batsman, etc.), and bat-grounded selection on query frames cannot fix contamination on the template side. v3 attacks this from the other end: hand-pick 3 broadcast frames of visible swing motion as the entire pool.

v3.2 fix vs v3.1 (visualization only — detection unchanged): pose detection is still YOLO11n-pose on every frame. The matcher pipeline is unchanged from v3.1:

Each template image → YOLO11n-pose → OWLv2 bat-grounded pick (or sidecar hint) → one normalized 17-joint pose into the NPZ.
Each query frame → YOLO11n-pose → OWLv2 bat-grounded pick → one normalized 17-joint pose → pool_best_match against the 3 templates.

The old sanity / top-K thumbnails called the raw res.plot(), which draws every YOLO detection (e.g. all 11 persons, including the crowd, in the harmanpreet_pull broadcast shot), making it look as if the matcher consumed multiple poses. It never did. v3.2 swaps the drawing call to res[idx:idx+1].plot() on the bat-grounded idx, so the picture matches what the matcher actually sees: same YOLO native style, just filtered to the one selected batsman.

What changed (2026-05-27 second iteration)

3 user-supplied images in dataset/pose_templates/: a kid playing a drive (cricket_kid_drive.jpg), Harmanpreet Kaur playing a pull (harmanpreet_pull.jpg), a club batsman driving (screenshot_batsman.png).
Each template's batsman pose is extracted with the same OWLv2 bat-grounded selection as v2 query frames. 2 of 3 needed a manual force_bbox sidecar hint because OWLv2 / YOLO failed differently on each (see notes panel).
Test videos run unchanged from v2 — same YOLO11n-pose + BatsmanDetector pipeline, same normalization, same Gaussian smoothing + NMS. Only the matching pool differs.

Per-template selection notes

cricket_kid_drive: single person in frame → trivial single_person pick, no hint needed.
harmanpreet_pull: 11 YOLO persons (crowd). OWLv2's bat center landed 18px outside the main batsman's bbox, so pick fell to a crowd person. Hint forces the correct bbox.
screenshot_batsman: YOLO split the batsman into two overlapping half-detections (upper + lower body); lower half had no valid shoulders → normalize failed. Hint forces the upper-body detection (15/17 valid kps).
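One way such a sidecar hint could be applied is sketched below. The report only names a force_bbox hint; the JSON layout, the function name `apply_hint`, and the choice of picking the detection with the highest IoU against the forced box are all assumptions made for illustration.

```python
import json


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def apply_hint(person_boxes, hint_json, auto_idx):
    """Override the automatic pick when a sidecar hint carries a force_bbox.

    hint_json: JSON text such as '{"force_bbox": [x1, y1, x2, y2]}' (hypothetical
    layout), or None when the template has no sidecar file.
    """
    hint = json.loads(hint_json) if hint_json else {}
    forced = hint.get("force_bbox")
    if forced is None:
        return auto_idx, "auto"
    # Pick the detection that best overlaps the hand-drawn box.
    idx = max(range(len(person_boxes)), key=lambda i: iou(person_boxes[i], forced))
    return idx, "manual_hint"
```

Matching the hint to an existing detection, rather than cropping to the forced box, keeps the template's keypoints coming from the same YOLO11n-pose output as every query frame.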

cricket_kid_drive — drive follow-through (single_person)

harmanpreet_pull — pull/hook (manual hint, bbox forced)

screenshot_batsman — front-foot drive (manual hint, upper-body detection)

Notes

Absolute v3 sim values are lower than v1/v2 (~0.74 vs ~0.89). Expected: only 3 templates means the matcher rarely finds a near-perfect fit. The thing to compare is the class gap, not the absolute level.
v3 trades coverage for discriminative direction: with only 3 per-frame action templates, some batting videos contain no frame close to any template's pose configuration, so some Batting Pose top-1 sims land within the Balling group's range and no single threshold separates the classes yet.

Per-class top-1 smoothed similarity — v1 vs v2 vs v3

Class	n	v1 impact	v2 impact	v3 action	v1 ft	v2 ft
Batting Pose	8	0.8909	0.8936	0.7437	0.8810	0.8832
Balling	3	0.9035	0.9041	0.7312	0.8860	0.8840
Wild Batting (YouTube)	3	n/a	n/a	0.7988	n/a	n/a
Gap (Batting − Balling, positive = correct direction)		−0.0126	−0.0105	+0.0125	−0.0050	−0.0008
Wild Batting − controlled Balling		—	—	+0.0676	—	—

Wild-batting top-1 mean (0.7988) is the highest of any group — unexpected but explainable: YouTube reels concatenate many discrete swings back-to-back, so the NMS-top-1 peak easily finds a near-perfect match. The Batting−Balling gap (controlled set) is +0.0125; the Wild-vs-Balling gap is +0.0676 (~5× larger). v3.2's discriminative direction holds in the wild and is in fact stronger there.

v3 keeps the discriminative direction correct across two template-pool revisions (first set was +0.0133, current set is +0.0125): Batting Pose videos score higher than Balling videos on the matching pool. Magnitude is small, and absolute sim values dropped because 3 templates is a sparse pool. The signal is fragile but consistent.

v3 conclusion

v3 confirms a clean diagnosis: the failure in v1/v2 was largely about template-pool composition, not just batsman selection or COCO joint representation. When the templates are guaranteed to depict actual swing motion (small but clean pool), the matcher does separate Batting from Balling — though weakly. Caveats: (a) absolute sims dropped because 3 templates is a sparse pool; (b) within-Batting variance is large (top-1 ranges 0.711–0.836); (c) a single threshold still cannot perfectly separate the classes. v3 demonstrates the direction works; the next step is to grow the pool to ~20–30 hand-picked swing frames covering more shot types and viewpoints, then re-evaluate.

How templates are built

From each CricketVision segment (broadcast 720p, labelled GT phases), two templates are extracted and stored in the pool:

Pool	Frame source	Pipeline	Count
impact	Midpoint of `gt_phases[Execution]` phase, re-decoded from the 25-fps mp4	YOLO11n-pose → pick batsman (heuristic: size+centrality+conf) → hip-centered, hip→shoulder-midpoint unit normalization → confidence-mask each joint at threshold 0.30 → store `(kp_norm: 17×2, mask: 17, meta)`	201 / 202
ft	The picked `best_filename` JPG from the VLM grid-pick pipeline (FollowThrough frame)	Same as impact	201 / 202

Quality: mean 15.8 / 17 valid joints per template; bbox heights range 151-531 px (broadcast scale).

Sample raw template constructions

YOLO output on representative CV frames (skeleton + bbox + person confidence). The batsman heuristic picks the highest-scoring person, then that person's 17 joints get normalized into the pool.

impact pool — P1_V28 stroke_0000 (OffDrive)

impact pool — P2_V14 stroke_0043 (Cut)

ft pool — P2_V8 stroke_0044 (Glance)

ft pool — P2_V2 stroke_0020 (Cut)

How matching works

For each sampled test_data frame the same pose-extraction pipeline runs. The query pose is compared to every template in each pool via a confidence-mask-weighted L2 distance:

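$$m_j = q^{\mathrm{mask}}_j \land t^{\mathrm{mask}}_j \qquad \text{(joint usable in both)}$$

$$d(q, t) = \operatorname*{mean}_{j \in m} \lVert q_j - t_j \rVert_2 \qquad \text{(require } |m| \ge 4 \text{, else } \infty \text{)}$$

$$\mathrm{sim}(q, t) = \frac{1}{1 + d(q, t)} \qquad \text{(maps to } (0, 1] \text{)}$$

$$\mathrm{best\_match}(q, \mathrm{pool}) = \arg\min_{t \in \mathrm{pool}} d(q, t)$$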
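The distance, similarity, and best-match definitions above can be sketched with NumPy broadcasting over a whole pool at once. This is a sketch under assumptions: the per-joint term is taken to be the Euclidean norm, the array layout (T templates × 17 joints × 2) is a guess at how the NPZ is organized, and the function names are illustrative rather than the signatures of the project's pool_best_match.

```python
import numpy as np

MIN_SHARED_JOINTS = 4  # require |m| >= 4, else the distance is infinite


def pool_distances(q_kp, q_mask, pool_kp, pool_mask):
    """Masked mean per-joint distance from one query pose to every template.

    q_kp: (17, 2), q_mask: (17,) bool, pool_kp: (T, 17, 2), pool_mask: (T, 17) bool.
    Returns a (T,) array; entries are inf where fewer than 4 joints are shared.
    """
    shared = pool_mask & q_mask[None, :]                        # joints usable in both
    per_joint = np.linalg.norm(pool_kp - q_kp[None], axis=-1)   # (T, 17)
    n_shared = shared.sum(axis=1)
    total = np.where(shared, per_joint, 0.0).sum(axis=1)        # ignore masked joints

    d = np.full(len(pool_kp), np.inf)
    ok = n_shared >= MIN_SHARED_JOINTS
    d[ok] = total[ok] / n_shared[ok]
    return d


def best_match(q_kp, q_mask, pool_kp, pool_mask):
    """Return (template index, similarity) of the closest template."""
    d = pool_distances(q_kp, q_mask, pool_kp, pool_mask)
    t = int(np.argmin(d))
    return t, float(1.0 / (1.0 + d[t]))  # inf distance maps to similarity 0.0
```

Computing all template distances in one broadcasted expression is what keeps the per-frame cost negligible next to pose inference, even with the 201-template pools.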

Per query frame this yields two scores: best similarity vs impact pool and best similarity vs ft pool. The series is Gaussian-smoothed (σ=3 frames ≈ 100 ms at 30 fps) and greedy-NMS picks the top-K peaks per pool (window ±1 s).
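The smoothing and peak-picking step can be sketched as below. The report specifies σ = 3 frames, a ±1 s window and top-K peaks; the 3σ kernel truncation, the normalized-convolution way of tolerating NaN (frames with no pose), and the function names are assumptions for illustration.

```python
import numpy as np


def nan_gaussian_smooth(x, sigma=3.0, truncate=3.0):
    """Gaussian-smooth a 1-D series, ignoring NaN samples.

    Uses normalized convolution: smooth the zero-filled values and the validity
    indicator with the same kernel, then divide. Assumes len(x) is at least
    the kernel length (2 * round(truncate * sigma) + 1).
    """
    x = np.asarray(x, dtype=float)
    r = int(truncate * sigma + 0.5)
    kernel = np.exp(-0.5 * (np.arange(-r, r + 1) / sigma) ** 2)
    valid = ~np.isnan(x)
    num = np.convolve(np.where(valid, x, 0.0), kernel, mode="same")
    den = np.convolve(valid.astype(float), kernel, mode="same")
    out = np.full_like(x, np.nan)
    np.divide(num, den, out=out, where=den > 1e-8)  # stay NaN where nothing is valid
    return out


def greedy_nms_peaks(score, fps=30.0, window_s=1.0, top_k=5):
    """Pick up to top_k peaks, suppressing +/- window_s around each chosen peak."""
    score = np.asarray(score, dtype=float)
    work = np.where(np.isnan(score), -np.inf, score)
    w = int(round(window_s * fps))
    peaks = []
    for _ in range(top_k):
        i = int(np.argmax(work))
        if not np.isfinite(work[i]):
            break  # nothing left outside the suppressed windows
        peaks.append((i, float(score[i])))
        work[max(0, i - w): i + w + 1] = -np.inf
    return peaks
```

For the wild clips sampled at 5 fps, the same ±1 s window would correspond to fps=5.0, i.e. ±5 samples.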

Methodology overview

Step	What	Tools
1. Template extraction	Per CV segment: impact + ft template; total 201+201 templates	YOLO11n-pose, OpenCV decode @25fps mp4
2. Query extraction	Test video sequentially decoded; sample every 4th frame (120→30 fps)	YOLO11n-pose; v2 adds OWLv2-base for batsman selection
3. Pose match	Vectorized argmin distance over the 201-template pool per frame	NumPy broadcasting
4. Temporal smoothing	NaN-tolerant Gaussian σ=3 frames (≈100 ms)	NumPy convolve
5. Peak picking	Greedy NMS with ±1 s window, top-5 per pool	NumPy

Engineering note

Initial implementation used cap.set(CAP_PROP_POS_FRAMES, idx) to seek to each sampled frame. On H.264 4K@120fps this costs ~3 s per seek (walk from prior keyframe), giving ~0.3 fps throughput. Switching to sequential decode + modulo-skip gives ~6.5 fps (≈25× speedup, 11 videos in ~22 minutes instead of ~9 hours).
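The sequential decode + modulo-skip pattern can be sketched independently of OpenCV. The generator below takes any zero-argument reader that returns (ok, frame), which is the shape of cv2.VideoCapture.read; the name `sample_sequential` is hypothetical.

```python
def sample_sequential(read_frame, step=4):
    """Yield (frame_index, frame) for every step-th frame, decoding in order.

    read_frame: zero-argument callable returning (ok, frame), e.g. the bound
    method cap.read of a cv2.VideoCapture. No seeking is performed, so the
    decoder never has to walk forward from an earlier keyframe.
    """
    idx = 0
    while True:
        ok, frame = read_frame()
        if not ok:
            break  # end of stream or decode failure
        if idx % step == 0:
            yield idx, frame
        idx += 1


# Typical use with OpenCV (not executed here):
#   cap = cv2.VideoCapture(path)
#   for idx, frame in sample_sequential(cap.read, step=4):  # 120 fps -> 30 fps
#       ...run pose extraction on frame...
#   cap.release()
```

Every frame is still decoded, but each one is decoded exactly once, which is where the speedup over per-frame seeking comes from.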

Per-video results

Each card shows the v2 (bat-grounded) run. The card header reports v1↔v2 top-1 sims and the v2 selection-method split (green = bat-grounded, grey = single person, orange = heuristic fallback). The two side-by-side images per card show top-5 peaks for each pool: query frame (red border) → matched template.

Batting Pose (n=8) — controlled 4K@120fps

Batting_Pose__video_20260218_122230 Batting Pose

dur 22.4s sampled 672@30fps pose hits 603/672 (90%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.8938	0.8897	-0.0041	0.7316
ft top-1	0.9018	0.8989	-0.0029	0.7316

v2/v3.2 selection split: bat-grounded 55.4% · single 31.7% · fallback 12.9%

v3.2 — 3-template action pool · single-pose vis

Batting_Pose__video_20260218_122427 (sanity video) Batting Pose

dur 30.1s sampled 904@30fps pose hits 730/904 (81%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.8850	0.9078	+0.0228	0.7479
ft top-1	0.8898	0.8902	+0.0004	0.7479

v2/v3.2 selection split: bat-grounded 77.9% · single 16.8% · fallback 5.2%

v3.2 — 3-template action pool · single-pose vis

Batting_Pose__video_20260218_122901 Batting Pose

dur 13.6s sampled 406@30fps pose hits 382/406 (94%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.9077	0.9076	-0.0001	0.7258
ft top-1	0.8614	0.8846	+0.0232	0.7258

v2/v3.2 selection split: bat-grounded 46.6% · single 17.3% · fallback 36.1%

v3.2 — 3-template action pool · single-pose vis

Batting_Pose__video_20260218_122930 Batting Pose

dur 8.8s sampled 212@30fps pose hits 183/212 (86%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.8888	0.8878	-0.0010	0.6871
ft top-1	0.8527	0.8509	-0.0017	0.6871

v2/v3.2 selection split: bat-grounded 86.9% · single 0.5% · fallback 12.6%

v3.2 — 3-template action pool · single-pose vis

Batting_Pose__video_20260218_134132 Batting Pose

dur 31.5s sampled 944@30fps pose hits 872/944 (92%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.8909	0.8917	+0.0008	0.7324
ft top-1	0.8928	0.8931	+0.0003	0.7324

v2/v3.2 selection split: bat-grounded 52.9% · single 0.9% · fallback 46.2%

v3.2 — 3-template action pool · single-pose vis

Batting_Pose__video_20260218_134332 Batting Pose

dur 49.4s sampled 1480@30fps pose hits 1450/1480 (98%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.8823	0.8846	+0.0023	0.8004
ft top-1	0.8782	0.8783	+0.0001	0.8004

v2/v3.2 selection split: bat-grounded 45.8% · single 47.3% · fallback 6.9%

v3.2 — 3-template action pool · single-pose vis

Batting_Pose__video_20260218_141416 Batting Pose

dur 41.2s sampled 1236@30fps pose hits 1209/1236 (98%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.8777	0.8790	+0.0013	0.7750
ft top-1	0.8854	0.8827	-0.0028	0.7750

v2/v3.2 selection split: bat-grounded 70.9% · single 0.0% · fallback 29.1%

v3.2 — 3-template action pool · single-pose vis

Batting_Pose__video_20260218_141525 Batting Pose

dur 13.2s sampled 396@30fps pose hits 326/396 (82%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.9009	0.9006	-0.0003	0.7339
ft top-1	0.8863	0.8872	+0.0009	0.7339

v2/v3.2 selection split: bat-grounded 12.9% · single 60.1% · fallback 27.0%

v3.2 — 3-template action pool · single-pose vis

Balling (n=3) — controlled 4K@120fps

Balling__video_20260218_134448 Balling

dur 20.5s sampled 614@30fps pose hits 608/614 (99%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.8866	0.8868	+0.0002	0.7307
ft top-1	0.8998	0.8937	-0.0061	0.7307

v2/v3.2 selection split: bat-grounded 37.5% · single 26.8% · fallback 35.7%

v3.2 — 3-template action pool · single-pose vis

Balling__video_20260218_135139 Balling

dur 26.6s sampled 796@30fps pose hits 719/796 (90%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.9297	0.9300	+0.0003	0.7269
ft top-1	0.8729	0.8736	+0.0007	0.7269

v2/v3.2 selection split: bat-grounded 20.6% · single 31.8% · fallback 47.6%

v3.2 — 3-template action pool · single-pose vis

Balling__video_20260218_141224 Balling

dur 35.2s sampled 1055@30fps pose hits 770/1055 (73%)

pool	v1 (heuristic)	v2 (bat-grounded)	Δ v2-v1	v3.2 (3-tpl action)
impact top-1	0.8943	0.8955	+0.0012	0.7359
ft top-1	0.8854	0.8846	-0.0008	0.7359

v2/v3.2 selection split: bat-grounded 6.2% · single 11.6% · fallback 82.2%

v3.2 — 3-template action pool · single-pose vis

Wild Batting (n=3) — YouTube broadcast / compilation reels v3.2-only · 5fps

Long YouTube clips (5–12 min) sampled at 5 fps (vs 30 fps on the controlled set) — these are minutes-long highlight reels with many discrete swings, so coarse sampling is enough to localize peaks. v1/v2 templates don't apply; only v3.2 (3 user-curated swing templates + bat-grounded selection) was run.

HK6B2da3DPA_001_720p Wild Batting v3.2-only

dur 712.0s sampled 3560@5fps pose hits 1978/3560 (56%)

v3.2 action	top-1 (smoothed)	max raw	mean (when pose)
—	0.8091	0.8449	0.6818

v3.2 selection split: bat-grounded 37.1% · single 17.2% · fallback 45.7%

v3.2 — 3-template action pool

rYiybyiJ4w8_002_720p Wild Batting v3.2-only

dur 700.8s sampled 3501@5fps pose hits 3035/3501 (87%)

v3.2 action	top-1 (smoothed)	max raw	mean (when pose)
—	0.8043	0.8361	0.6645

v3.2 selection split: bat-grounded 3.8% · single 95.6% · fallback 0.6%

v3.2 — 3-template action pool

KExVvhIESKA_002_720p Wild Batting v3.2-only

dur 339.1s sampled 1696@5fps pose hits 1638/1696 (97%)

v3.2 action	top-1 (smoothed)	max raw	mean (when pose)
—	0.7807	0.8529	0.7134

v3.2 selection split: bat-grounded 48.8% · single 48.7% · fallback 2.5%

v3.2 — 3-template action pool

Conclusion & next steps

Three iterations land on a clean diagnosis:

v1: heuristic selection on a 201-template pool — Balling > Batting (failure).
v2: bat-grounded selection on the same pool — gap shrinks but doesn't flip.
v3.2: bat-grounded selection on a 3-template hand-curated swing pool — Batting > Balling for the first time (+0.013). Single-pose visualization now matches the matcher (only the selected pose enters pool_best_match; the v3.1 thumbs misled by drawing all YOLO detections).

The combined evidence: (1) v1 → v2 shows person-selection is not the dominant bottleneck; (2) v2 → v3.2 shows template-pool composition matters far more than the matcher's selection logic on query frames. Templates auto-built from "midpoint of Execution phase" capture many non-swing stance/recovery poses; user-curated swing frames give the matcher something semantically specific to align to.

Productive paths from here (E7+):

Grow the v3.2 pool to 20–30 templates covering pull / hook / drive / cut / sweep / glance / block, multiple camera angles. Cheapest signal-to-effort ratio — same code, more data.
Bat keypoint as a joint — concatenate the OWLv2 bat-center (now available per frame) to the 17-joint vector before normalization. Single most informative joint missing from COCO-17.
Motion gradient over ±0.2 s — swinging produces strong wrist/elbow velocity; standing does not.
Game-context ROI — pitch / wicket region gating to keep the search to where real shots happen.
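Two of the paths above, the bat keypoint and the motion gradient, could start from sketches like the following. Both are hypothetical illustrations of ideas not yet implemented in E6: the function names, the reuse of the 0.15 bat-confidence cutoff as the validity test for the extra joint, and the central-difference speed estimate are assumptions.

```python
import numpy as np


def append_bat_joint(kp_xy, joint_mask, bat_center, bat_conf, bat_conf_min=0.15):
    """Append the bat center as an 18th joint (pixel space, before normalization).

    kp_xy: (17, 2) pixel keypoints, joint_mask: (17,) bool.
    Returns (18, 2) coordinates and an (18,) bool mask; the bat joint is masked
    out (NaN coordinates) when no bat clears bat_conf_min.
    """
    ok = bat_center is not None and bat_conf >= bat_conf_min
    bat_xy = np.asarray(bat_center if ok else (np.nan, np.nan), dtype=float)
    xy = np.vstack([np.asarray(kp_xy, dtype=float), bat_xy])
    mask = np.append(np.asarray(joint_mask, dtype=bool), ok)
    return xy, mask


def wrist_speed(wrist_xy, fps=30.0, half_window_s=0.2):
    """Central-difference wrist speed over +/- half_window_s.

    wrist_xy: (N, 2) normalized wrist positions, one row per sampled frame.
    Returns (N,) speeds in torso lengths per second; NaN near the edges.
    """
    w = np.asarray(wrist_xy, dtype=float)
    h = max(1, int(round(half_window_s * fps)))
    speed = np.full(len(w), np.nan)
    if len(w) > 2 * h:
        disp = np.linalg.norm(w[2 * h:] - w[:-2 * h], axis=1)
        speed[h:-h] = disp / (2 * h / fps)
    return speed
```

The 18-row array would then go through the same hip-centered normalization as the 17 COCO joints, so the bat position is expressed in torso lengths like everything else.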

Generated 2026-05-26, v2 update 2026-05-27, v3.2 update 2026-05-27 · motionclip / cricket_highlight · v1 results: cricket_highlight/results/e6_m{1,2,3}/ · v2 results: cricket_highlight/results/e6_m2_v2/ · v3.2 results: cricket_highlight/results/e6_v3_2_m{2,3}/ + e6_v3_2_templates/ · build script: scripts/e6_m3_build_reviewer.py