E6 — Pose-template cricket batting detection

YOLO11n-pose templates from CricketVision (broadcast 720p) → matched against 11 4K@120fps controlled test videos (8 Batting Pose + 3 Balling) + 3 wild YouTube broadcast clips (Wild Batting, v3.2 only). Goal: detect batting impact moments via pure pose similarity.
2026-05-26 · v1 status: method fail  ·  2026-05-27 · v2 update: bat-grounded batsman selection refutes the "wrong-person" hypothesis; the impact-pool gap shrinks but does not flip.  ·  2026-05-27 · v3.2 update: 3 user-curated action templates + bat-grounded + single-pose vis flips the per-class gap to the correct direction (Batting > Balling by +0.013). See v2 · v3.2 below.

v1 headline finding. The single-frame pose-template approach has no discriminative power between batting and balling videos.

Top-peak similarity across 11 videos (v1 with heuristic selection):

Group        | n | mean  | min   | max
Batting Pose | 8 | 0.894 | 0.882 | 0.908
Balling      | 3 | 0.908 | 0.894 | 0.930

Balling videos score higher on batting templates than the Batting Pose videos themselves do. No threshold separates the two groups.

What the per-video montages actually show

The top-K "matches" at sim ≥ 0.87 are not batting impacts. In the Balling-134448 montage below they include: a standing batsman (not swinging), the main umpire, an outfielder running, and a celebrating player. Under 17-keypoint hip-shoulder normalization, any cricket-attired person standing in side profile becomes geometrically close to the 200+ impact poses in the template pool.

Why it fails (signal-level diagnosis)

What pose template captures

What it misses (the actual impact signal)

Net: 16 of 17 normalized joints are torso/leg structure that look similar across any standing person; the one wrist joint that could distinguish swinging from standing is L2-averaged to insignificance.
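To make the dilution concrete, here is a minimal sketch that applies the mean-per-joint distance and the 1 / (1 + d) similarity to two poses differing only at one wrist. It assumes all 17 joints are valid; the coordinates are placeholders chosen for illustration, not measured data.

```python
import numpy as np

def sim(q, t):
    """Similarity from the mean per-joint L2 distance (all 17 joints assumed valid)."""
    d = np.linalg.norm(q - t, axis=1).mean()
    return 1.0 / (1.0 + d)

template = np.zeros((17, 2))   # hypothetical impact template (placeholder coordinates)
standing = template.copy()
standing[9] += [0.5, 0.0]      # left wrist (COCO index 9) moved half a torso length

print(round(sim(template, template), 4))  # identical poses
print(round(sim(standing, template), 4))  # wrist offset only: d = 0.5 / 17
```

Under these assumptions a wrist that is half a torso length out of place lowers the similarity by less than 0.03, which is the same order as the spread between the per-video top-1 scores.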

v2 update — bat-grounded batsman selection 2026-05-27

The v1 failure left an obvious hypothesis: maybe the heuristic (size + centrality + KP confidence) picks the wrong person in the 4K test frames — an umpire or a standing batsman instead of the one actually playing the shot. Templates would then be matched against poses of cricket-attired non-actors. v2 tests this hypothesis directly.

What changed

  1. For each frame: YOLO11n-pose as before → all person bboxes.
  2. If >1 person, run OWLv2-base zero-shot with prompt "cricket bat" on a 960-px downsample.
  3. Keep bat boxes with conf ≥ 0.15; take top-conf box.
  4. Pick the person whose bbox contains the bat center (or nearest if none contain it).
  5. If no bat clears 0.15, fall back to the v1 heuristic.
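Steps 3 to 5 can be sketched as a small pure-NumPy function. This is an illustration under stated assumptions, not the project's code: the function name, the xyxy box layout, and the tie-break when several person boxes contain the bat center (nearest box center wins) are all hypothetical.

```python
import numpy as np

def pick_batsman(person_boxes, bat_boxes, bat_scores, min_conf=0.15):
    """Return the index of the person matched to the top bat box, or None for fallback.

    person_boxes: (P, 4) xyxy person boxes from the pose model.
    bat_boxes:    (B, 4) xyxy bat boxes from the open-vocabulary detector.
    bat_scores:   (B,) detector confidences.
    """
    person_boxes = np.asarray(person_boxes, dtype=float).reshape(-1, 4)
    bat_boxes = np.asarray(bat_boxes, dtype=float).reshape(-1, 4)
    bat_scores = np.asarray(bat_scores, dtype=float).reshape(-1)
    if len(person_boxes) == 0:
        return None
    keep = bat_scores >= min_conf
    if not keep.any():
        return None  # caller falls back to the v1 heuristic
    top = int(np.argmax(np.where(keep, bat_scores, -np.inf)))
    x1, y1, x2, y2 = bat_boxes[top]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    inside = ((person_boxes[:, 0] <= cx) & (cx <= person_boxes[:, 2]) &
              (person_boxes[:, 1] <= cy) & (cy <= person_boxes[:, 3]))
    centers = np.stack([(person_boxes[:, 0] + person_boxes[:, 2]) / 2,
                        (person_boxes[:, 1] + person_boxes[:, 3]) / 2], axis=1)
    dist = np.hypot(centers[:, 0] - cx, centers[:, 1] - cy)
    if inside.any():
        # Assumed tie-break: among boxes containing the bat center, nearest center wins.
        dist = np.where(inside, dist, np.inf)
    return int(np.argmin(dist))
```

A `None` return signals that no bat cleared the threshold, so the caller can apply the v1 size + centrality + confidence heuristic instead.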

Why OWLv2 (not YOLO-World)

Phase-A spike: yolov8s-world with prompt "cricket bat" recalls only 22% of test-frame bats (4/18), 8% of broadcast (1/12). Phase-B: OWLv2-base jumps to 70% recall on the same frames (median top-1 score 0.31, max 0.54). Bat-bbox precision is essentially 100% in both: when fired, the box always lands on the batsman.

Per-class top-1 smoothed similarity (n=11 videos)

Class                               | n | Impact v1 | Impact v2 | Impact Δ | FT v1   | FT v2   | FT Δ
Batting Pose                        | 8 | 0.8909    | 0.8936    | +0.0027  | 0.8810  | 0.8832  | +0.0022
Balling                             | 3 | 0.9035    | 0.9041    | +0.0006  | 0.8860  | 0.8840  | -0.0021
Cross-class gap (Balling − Batting) |   | +0.0126   | +0.0105   | -0.0021  | +0.0050 | +0.0008 | -0.0042

Balling > Batting on top-1 impact similarity means the method is "more confident this is a batting impact" on the Balling group than on the Batting group — the failure mode. v2 narrows the gap but does not flip it on the impact pool; it nearly eliminates it on the ft pool.
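The impact-pool class means and gaps can be re-derived from the per-video impact top-1 values listed in the per-video results. The sketch below does that and also checks whether any single threshold would separate the classes; the helper names are illustrative.

```python
from statistics import mean

# Per-video impact-pool top-1 values copied from the per-video cards.
v1 = {"Batting Pose": [0.8938, 0.8850, 0.9077, 0.8888, 0.8909, 0.8823, 0.8777, 0.9009],
      "Balling": [0.8866, 0.9297, 0.8943]}
v2 = {"Batting Pose": [0.8897, 0.9078, 0.9076, 0.8878, 0.8917, 0.8846, 0.8790, 0.9006],
      "Balling": [0.8868, 0.9300, 0.8955]}

def class_gap(scores):
    """Balling mean minus Batting mean; positive is the failure direction."""
    return mean(scores["Balling"]) - mean(scores["Batting Pose"])

def separable(scores):
    """True only if one threshold puts every Batting video above every Balling video."""
    return min(scores["Batting Pose"]) > max(scores["Balling"])

for name, scores in (("v1", v1), ("v2", v2)):
    print(name, round(class_gap(scores), 4), separable(scores))
```

Both runs report a positive gap and `separable` is False in each, matching the table: the gap narrows from v1 to v2 without changing sign.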

v2 conclusion

Bat-grounded selection works as designed at the selection step: when the bat detector fires, it correctly identifies the batsman. But the headline numbers move by at most ~0.003 per class (+0.0027 and +0.0006 on the impact pool, +0.0022 and -0.0021 on the ft pool). The hypothesis "wrong-person selection drives the failure" is refuted: even with the right person, the 17-joint normalized pose representation still cannot distinguish a swinging batsman from a standing batsman or a running fielder. The remaining impact-pool gap (Balling +0.0105 above Batting) is intrinsic to the pose feature, not the selection. v2 is a clean diagnostic that promotes the next bottleneck from "selection" to "feature".

v3.2 update — 3 user-curated action templates · single-pose visualization 2026-05-27

v2 pointed to a contaminated 201-template impact pool: many of the auto-built templates probably do not depict the actual swinging batsman (umpire, standing batsman, etc.), and bat-grounded selection on query frames cannot fix contamination on the template side. v3 attacks the problem from the other end: hand-pick 3 broadcast frames of visible swing motion and use them as the entire pool.

v3.2 fix vs v3.1 (visualization only — detection unchanged): pose detection is still YOLO11n-pose on every frame. The matcher pipeline is unchanged from v3.1:
  1. Each template image → YOLO11n-pose → OWLv2 bat-grounded pick (or sidecar hint) → one normalized 17-joint pose into the NPZ.
  2. Each query frame → YOLO11n-pose → OWLv2 bat-grounded pick → one normalized 17-joint pose → pool_best_match against the 3 templates.
The old sanity / topk thumbs called the raw res.plot() which drew every YOLO detection (e.g. all 11 persons including the crowd in the harmanpreet_pull broadcast shot), making it look like the matcher consumed multiple poses. It never did. v3.2 swaps the drawing call to res[idx:idx+1].plot() on the bat-grounded idx, so the picture matches what the matcher actually sees: same YOLO native style, just filtered to the one selected batsman.

What changed (2026-05-27 second iteration)

  1. 3 user-supplied images in dataset/pose_templates/: a kid playing a drive (cricket_kid_drive.jpg), Harmanpreet Kaur playing a pull (harmanpreet_pull.jpg), a club batsman driving (screenshot_batsman.png).
  2. Each template's batsman pose is extracted with the same OWLv2 bat-grounded selection as v2 query frames. 2 of 3 needed a manual force_bbox sidecar hint because OWLv2 / YOLO failed differently on each (see notes panel).
  3. Test videos run unchanged from v2 — same YOLO11n-pose + BatsmanDetector pipeline, same normalization, same Gaussian smoothing + NMS. Only the matching pool differs.

Per-template selection notes

cricket_kid_drive — drive follow-through (single_person)
harmanpreet_pull — pull/hook (manual hint, bbox forced)
screenshot_batsman — front-foot drive (manual hint, upper-body detection)
Notes

Per-class top-1 smoothed similarity — v1 vs v2 vs v3

Class                  | n | v1 impact | v2 impact | v3 action | v1 ft   | v2 ft
Batting Pose           | 8 | 0.8909    | 0.8936    | 0.7437    | 0.8810  | 0.8832
Balling                | 3 | 0.9035    | 0.9041    | 0.7312    | 0.8860  | 0.8840
Wild Batting (YouTube) | 3 | n/a       | n/a       | 0.7988    | n/a     | n/a
Gap (Batting − Balling, positive = correct direction) |   | −0.0126 | −0.0105 | +0.0125 | −0.0050 | −0.0008
Wild Batting − controlled Balling |   |   |   | +0.0676 |   |

Wild-batting top-1 mean (0.7988) is the highest of any group — unexpected but explainable: YouTube reels concatenate many discrete swings back-to-back, so the NMS-top-1 peak easily finds a near-perfect match. The Batting−Balling gap (controlled set) is +0.0125; the Wild-vs-Balling gap is +0.0676 (~5× larger). v3.2's discriminative direction holds in the wild and is in fact stronger there.

v3 keeps the discriminative direction correct across two template-pool revisions (first set was +0.0133, current set is +0.0125): Batting Pose videos score higher than Balling videos on the matching pool. Magnitude is small, and absolute sim values dropped because 3 templates is a sparse pool. The signal is fragile but consistent.

v3 conclusion

v3 confirms a clean diagnosis: the failure in v1/v2 was largely about template-pool composition, not just batsman selection or COCO joint representation. When the templates are guaranteed to depict actual swing motion (small but clean pool), the matcher does separate Batting from Balling — though weakly. Caveats: (a) absolute sims dropped because 3 templates is a sparse pool; (b) within-Batting variance is large (top-1 ranges 0.711–0.836); (c) a single threshold still cannot perfectly separate the classes. v3 demonstrates the direction works; the next step is to grow the pool to ~20–30 hand-picked swing frames covering more shot types and viewpoints, then re-evaluate.

How templates are built

From each CricketVision segment (broadcast 720p, labelled GT phases), two templates are extracted and stored in the pool:

Pool   | Frame source                                                                       | Count
impact | Midpoint of gt_phases[Execution] phase, re-decoded from the 25-fps mp4             | 201 / 202
ft     | The picked best_filename JPG from the VLM grid-pick pipeline (FollowThrough frame) | 201 / 202

Pipeline (shared by both pools): YOLO11n-pose → pick batsman (heuristic: size+centrality+conf) → hip-centered, hip→shoulder-midpoint unit normalization → confidence-mask each joint at threshold 0.30 → store (kp_norm: 17×2, mask: 17, meta)

Quality: mean 15.8 / 17 valid joints per template; bbox heights range 151-531 px (broadcast scale).
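The normalization step can be sketched as follows. This is a minimal illustration, assuming the standard COCO-17 joint order (shoulders at indices 5 and 6, hips at 11 and 12); the function name and the choice to reject poses whose torso joints fall below the confidence threshold are assumptions, not taken from the project code.

```python
import numpy as np

L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12  # COCO-17 indices

def normalize_pose(kp_xy, kp_conf, conf_thresh=0.30):
    """Hip-center a 17x2 pose and scale it so the hip-to-shoulder midline has length 1.

    Returns (kp_norm, mask), or None when the torso joints are not reliable.
    """
    kp_xy = np.asarray(kp_xy, dtype=float)
    mask = np.asarray(kp_conf, dtype=float) >= conf_thresh
    if not mask[[L_SHOULDER, R_SHOULDER, L_HIP, R_HIP]].all():
        return None  # assumption: skip poses whose torso anchors are masked out
    hip_mid = (kp_xy[L_HIP] + kp_xy[R_HIP]) / 2
    shoulder_mid = (kp_xy[L_SHOULDER] + kp_xy[R_SHOULDER]) / 2
    torso = np.linalg.norm(shoulder_mid - hip_mid)
    if torso < 1e-6:
        return None  # degenerate torso, cannot define a scale
    kp_norm = (kp_xy - hip_mid) / torso
    kp_norm[~mask] = 0.0  # masked joints are ignored downstream; zero them for tidiness
    return kp_norm, mask
```

Because the scale is the torso length, the 151-531 px range of bbox heights does not affect the stored coordinates.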

Sample raw template constructions

YOLO output on representative CV frames (skeleton + bbox + person confidence). The batsman heuristic picks the highest-scoring person, then that person's 17 joints get normalized into the pool.

impact pool — P1_V28 stroke_0000 (OffDrive)
impact pool — P2_V14 stroke_0043 (Cut)
ft pool — P2_V8 stroke_0044 (Glance)
ft pool — P2_V2 stroke_0020 (Cut)

How matching works

For each sampled test_data frame the same pose-extraction pipeline runs. The query pose is compared to every template in each pool via a confidence-mask-weighted L2 distance:

m_j = q_mask_j & t_mask_j   // joint usable in both
d(q, t) = mean_{j ∈ m} ‖q_j − t_j‖_2   // require |m| ≥ 4 else ∞
sim(q, t) = 1 / (1 + d(q, t))   // maps to (0, 1]
best_match(q, pool) = argmin_{t ∈ pool} d(q, t)
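A vectorized NumPy sketch of these four definitions is shown below. The array shapes and the function signature are assumptions made for illustration; only the formulas themselves come from the report.

```python
import numpy as np

def best_match(q, q_mask, pool, pool_mask, min_joints=4):
    """Match one query pose against a template pool.

    q: (17, 2), q_mask: (17,) bool, pool: (T, 17, 2), pool_mask: (T, 17) bool.
    Returns (template_index, similarity), or (None, nan) if no template
    shares at least `min_joints` usable joints with the query.
    """
    m = pool_mask & q_mask[None, :]                        # (T, 17) joints usable in both
    per_joint = np.linalg.norm(pool - q[None], axis=2)     # (T, 17) per-joint L2 distance
    n = m.sum(axis=1)                                      # usable joints per template
    with np.errstate(invalid="ignore", divide="ignore"):
        d = np.where(m, per_joint, 0.0).sum(axis=1) / n    # masked mean distance
    d = np.where(n >= min_joints, d, np.inf)               # too few shared joints -> inf
    best = int(np.argmin(d))
    if not np.isfinite(d[best]):
        return None, float("nan")
    return best, float(1.0 / (1.0 + d[best]))
```

Broadcasting the query against the whole pool keeps the per-frame cost to a single array operation, which matters when the pool holds 201 templates.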

Per query frame this yields two scores: best similarity vs impact pool and best similarity vs ft pool. The series is Gaussian-smoothed (σ=3 frames ≈ 100 ms at 30 fps) and greedy-NMS picks the top-K peaks per pool (window ±1 s).
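The smoothing and peak-picking stage might look like the sketch below. It assumes frames without a pose are stored as NaN and handles them by normalized convolution (smoothing the values and the validity mask separately, then dividing); the kernel radius of 3σ and both function names are illustrative choices.

```python
import numpy as np

def smooth_nan(series, sigma=3.0):
    """Gaussian smoothing that ignores NaN samples via normalized convolution."""
    x = np.asarray(series, dtype=float)
    radius = int(3 * sigma)
    k = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    valid = ~np.isnan(x)
    # mode="full" plus a centered slice keeps the output the same length as x.
    num = np.convolve(np.where(valid, x, 0.0), k, mode="full")[radius:radius + len(x)]
    den = np.convolve(valid.astype(float), k, mode="full")[radius:radius + len(x)]
    with np.errstate(invalid="ignore", divide="ignore"):
        out = num / den
    out[den < 1e-8] = np.nan  # no valid sample anywhere in the window
    return out

def greedy_nms_peaks(smoothed, fps=30.0, window_s=1.0, top_k=5):
    """Repeatedly take the highest remaining sample and suppress +/- window_s around it."""
    s = np.array(smoothed, dtype=float)
    s[np.isnan(s)] = -np.inf
    half = int(round(window_s * fps))
    peaks = []
    for _ in range(top_k):
        i = int(np.argmax(s))
        if not np.isfinite(s[i]):
            break  # fewer than top_k separable peaks
        peaks.append(i)
        s[max(0, i - half): i + half + 1] = -np.inf
    return peaks
```

Running `greedy_nms_peaks(smooth_nan(sim_series))` once per pool yields up to five frame indices per pool, each at least one second apart.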

Methodology overview

Step                  | What                                                                  | Tools
1. Template extraction | Per CV segment: impact + ft template; total 201+201 templates        | YOLO11n-pose, OpenCV decode @25fps mp4
2. Query extraction    | Test video sequentially decoded; sample every 4th frame (120→30 fps) | YOLO11n-pose; v2 adds OWLv2-base for batsman selection
3. Pose match          | Vectorized argmin distance over the 201-template pool per frame      | NumPy broadcasting
4. Temporal smoothing  | NaN-tolerant Gaussian σ=3 frames (≈100 ms)                           | NumPy convolve
5. Peak picking        | Greedy NMS with ±1 s window, top-5 per pool                          | NumPy

Engineering note

Initial implementation used cap.set(CAP_PROP_POS_FRAMES, idx) to seek to each sampled frame. On H.264 4K@120fps this costs ~3 s per seek (walk from prior keyframe), giving ~0.3 fps throughput. Switching to sequential decode + modulo-skip gives ~6.5 fps (≈25× speedup, 11 videos in ~22 minutes instead of ~9 hours).
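The sequential decode + modulo-skip pattern can be sketched as a small generator. It relies only on the `read()` method that OpenCV's `VideoCapture` provides, so any object with the same interface works; the function name and the `step` parameter are illustrative.

```python
def iter_sampled_frames(cap, step=4):
    """Yield (frame_index, frame) for every `step`-th frame without ever seeking.

    `cap` is any object whose read() returns (ok, frame), such as cv2.VideoCapture.
    """
    idx = 0
    while True:
        ok, frame = cap.read()   # decode the next frame in stream order
        if not ok:
            break                # end of stream or decode failure
        if idx % step == 0:
            yield idx, frame     # keep every step-th frame (120 fps -> 30 fps at step=4)
        idx += 1
```

With a real file this would be driven by something like `iter_sampled_frames(cv2.VideoCapture(path), step=4)`. Every frame is still decoded, but the decoder never has to rewind to a prior keyframe, which is where the per-seek cost came from.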

Per-video results

Each card shows the v2 (bat-grounded) run. The card header reports v1↔v2 top-1 sims and the v2 selection-method split (green = bat-grounded, grey = single person, orange = heuristic fallback). The two side-by-side images per card show top-5 peaks for each pool: query frame (red border) → matched template.

Batting Pose (n=8) — controlled 4K@120fps

Batting_Pose__video_20260218_122230  Batting Pose

dur 22.4s sampled 672@30fps pose hits 603/672 (90%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.8938         | 0.8897            | -0.0041 | 0.7316
ft top-1     | 0.9018         | 0.8989            | -0.0029 |
v2/v3.2 selection split: bat-grounded 55.4% · single 31.7% · fallback 12.9%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Batting_Pose__video_20260218_122427 (sanity video)  Batting Pose

dur 30.1s sampled 904@30fps pose hits 730/904 (81%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.8850         | 0.9078            | +0.0228 | 0.7479
ft top-1     | 0.8898         | 0.8902            | +0.0004 |
v2/v3.2 selection split: bat-grounded 77.9% · single 16.8% · fallback 5.2%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Batting_Pose__video_20260218_122901  Batting Pose

dur 13.6s sampled 406@30fps pose hits 382/406 (94%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.9077         | 0.9076            | -0.0001 | 0.7258
ft top-1     | 0.8614         | 0.8846            | +0.0232 |
v2/v3.2 selection split: bat-grounded 46.6% · single 17.3% · fallback 36.1%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Batting_Pose__video_20260218_122930  Batting Pose

dur 8.8s sampled 212@30fps pose hits 183/212 (86%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.8888         | 0.8878            | -0.0010 | 0.6871
ft top-1     | 0.8527         | 0.8509            | -0.0017 |
v2/v3.2 selection split: bat-grounded 86.9% · single 0.5% · fallback 12.6%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Batting_Pose__video_20260218_134132  Batting Pose

dur 31.5s sampled 944@30fps pose hits 872/944 (92%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.8909         | 0.8917            | +0.0008 | 0.7324
ft top-1     | 0.8928         | 0.8931            | +0.0003 |
v2/v3.2 selection split: bat-grounded 52.9% · single 0.9% · fallback 46.2%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Batting_Pose__video_20260218_134332  Batting Pose

dur 49.4s sampled 1480@30fps pose hits 1450/1480 (98%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.8823         | 0.8846            | +0.0023 | 0.8004
ft top-1     | 0.8782         | 0.8783            | +0.0001 |
v2/v3.2 selection split: bat-grounded 45.8% · single 47.3% · fallback 6.9%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Batting_Pose__video_20260218_141416  Batting Pose

dur 41.2s sampled 1236@30fps pose hits 1209/1236 (98%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.8777         | 0.8790            | +0.0013 | 0.7750
ft top-1     | 0.8854         | 0.8827            | -0.0028 |
v2/v3.2 selection split: bat-grounded 70.9% · single 0.0% · fallback 29.1%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Batting_Pose__video_20260218_141525  Batting Pose

dur 13.2s sampled 396@30fps pose hits 326/396 (82%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.9009         | 0.9006            | -0.0003 | 0.7339
ft top-1     | 0.8863         | 0.8872            | +0.0009 |
v2/v3.2 selection split: bat-grounded 12.9% · single 60.1% · fallback 27.0%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks
Balling (n=3) — controlled 4K@120fps

Balling__video_20260218_134448  Balling

dur 20.5s sampled 614@30fps pose hits 608/614 (99%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.8866         | 0.8868            | +0.0002 | 0.7307
ft top-1     | 0.8998         | 0.8937            | -0.0061 |
v2/v3.2 selection split: bat-grounded 37.5% · single 26.8% · fallback 35.7%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Balling__video_20260218_135139  Balling

dur 26.6s sampled 796@30fps pose hits 719/796 (90%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.9297         | 0.9300            | +0.0003 | 0.7269
ft top-1     | 0.8729         | 0.8736            | +0.0007 |
v2/v3.2 selection split: bat-grounded 20.6% · single 31.8% · fallback 47.6%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks

Balling__video_20260218_141224  Balling

dur 35.2s sampled 1055@30fps pose hits 770/1055 (73%)
pool         | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action)
impact top-1 | 0.8943         | 0.8955            | +0.0012 | 0.7359
ft top-1     | 0.8854         | 0.8846            | -0.0008 |
v2/v3.2 selection split: bat-grounded 6.2% · single 11.6% · fallback 82.2%
v1/v2 time-series + top-K peak markers
impact pool top-K peaks
ft pool top-K peaks

v3.2 — 3-template action pool · single-pose vis

v3.2 action sim curve v3.2 top-K action peaks
Wild Batting (n=3) — YouTube broadcast / compilation reels v3.2-only · 5fps

Long YouTube clips (5–12 min) sampled at 5 fps (vs 30 fps on the controlled set) — these are minutes-long highlight reels with many discrete swings, so coarse sampling is enough to localize peaks. v1/v2 templates don't apply; only v3.2 (3 user-curated swing templates + bat-grounded selection) was run.

HK6B2da3DPA_001_720p  Wild Batting  v3.2-only

dur 712.0s sampled 3560@5fps pose hits 1978/3560 (56%)
pool        | top-1 (smoothed) | max raw | mean (when pose)
v3.2 action | 0.8091           | 0.8449  | 0.6818
v3.2 selection split: bat-grounded 37.1% · single 17.2% · fallback 45.7%

v3.2 — 3-template action pool

v3.2 action sim curve v3.2 top-K action peaks

rYiybyiJ4w8_002_720p  Wild Batting  v3.2-only

dur 700.8s sampled 3501@5fps pose hits 3035/3501 (87%)
pool        | top-1 (smoothed) | max raw | mean (when pose)
v3.2 action | 0.8043           | 0.8361  | 0.6645
v3.2 selection split: bat-grounded 3.8% · single 95.6% · fallback 0.6%

v3.2 — 3-template action pool

v3.2 action sim curve v3.2 top-K action peaks

KExVvhIESKA_002_720p  Wild Batting  v3.2-only

dur 339.1s sampled 1696@5fps pose hits 1638/1696 (97%)
pool        | top-1 (smoothed) | max raw | mean (when pose)
v3.2 action | 0.7807           | 0.8529  | 0.7134
v3.2 selection split: bat-grounded 48.8% · single 48.7% · fallback 2.5%

v3.2 — 3-template action pool

v3.2 action sim curve v3.2 top-K action peaks

Conclusion & next steps

Three iterations land on a clean diagnosis:

The combined evidence: (1) v1 → v2 shows person-selection is not the dominant bottleneck; (2) v2 → v3.2 shows template-pool composition matters far more than the matcher's selection logic on query frames. Templates auto-built from "midpoint of Execution phase" capture many non-swing stance/recovery poses; user-curated swing frames give the matcher something semantically specific to align to.

Productive paths from here (E7+):

  1. Grow the v3.2 pool to 20–30 templates covering pull / hook / drive / cut / sweep / glance / block, across multiple camera angles. Best signal-to-effort ratio: same code, more data.
  2. Bat keypoint as a joint — concatenate the OWLv2 bat-center (now available per frame) to the 17-joint vector before normalization. Single most informative joint missing from COCO-17.
  3. Motion gradient over ±0.2 s — swinging produces strong wrist/elbow velocity; standing does not.
  4. Game-context ROI — pitch / wicket region gating to keep the search to where real shots happen.
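As a starting point for path 3, the motion gradient could be a central difference of the normalized wrist position over ±0.2 s. The sketch below is only one possible formulation: the function name, the NaN convention for missing poses, and the use of a single wrist track are assumptions.

```python
import numpy as np

def wrist_speed(wrist_xy, fps=30.0, half_window_s=0.2):
    """Central-difference wrist speed (torso lengths per second) over +/- half_window_s.

    wrist_xy: (N, 2) hip-normalized wrist positions, NaN where the pose is missing.
    Frames without a full window on both sides are returned as NaN.
    """
    p = np.asarray(wrist_xy, dtype=float)
    h = max(1, int(round(half_window_s * fps)))   # 6 frames at 30 fps
    speed = np.full(len(p), np.nan)
    if len(p) > 2 * h:
        delta = p[2 * h:] - p[:-2 * h]            # displacement across the full window
        speed[h:-h] = np.linalg.norm(delta, axis=1) / (2 * h / fps)
    return speed
```

A standing player yields values near zero while a swing should produce a short, sharp spike, so this series could be thresholded on its own or multiplied into the pose similarity before peak picking.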