YOLO11n-pose templates from CricketVision (broadcast 720p) → matched against 11 4K@120fps controlled test videos (8 Batting Pose + 3 Balling) + 3 wild YouTube broadcast clips (Wild Batting, v3.2 only).
Goal: detect batting impact moments via pure pose similarity.
2026-05-26 · v1 status: method fail
2026-05-27 · v2 update: bat-grounded batsman selection refutes the "wrong-person" hypothesis; the impact-pool gap shrinks but does not flip.
2026-05-27 · v3.2 update: 3 user-curated action templates + bat-grounded + single-pose vis flips the per-class gap to the correct direction (Batting > Balling by +0.013). See v2 · v3.2 below.
v1 headline finding. The single-frame pose-template approach has no discriminative power between batting and balling videos.
Top-peak similarity across the 11 videos (v1, heuristic selection; per video, the higher of the impact-pool and FT-pool top-1):
| Group | n | mean | min | max |
|---|---|---|---|---|
| Batting Pose | 8 | 0.894 | 0.882 | 0.908 |
| Balling | 3 | 0.908 | 0.894 | 0.930 |
Balling videos score higher on batting templates than the Batting Pose videos themselves do. No threshold separates the two groups.
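The overlap can be checked directly from the min and max columns of the table. A minimal sketch; the helper name `separable` is illustrative only and not part of the pipeline:

```python
# Per-group (min, max) top-peak similarity, copied from the table above.
ranges = {
    "Batting Pose": (0.882, 0.908),
    "Balling": (0.894, 0.930),
}


def separable(a, b):
    """True if two closed intervals do not overlap, i.e. a single
    threshold could split the two groups perfectly."""
    return a[1] < b[0] or b[1] < a[0]


bat, bal = ranges["Batting Pose"], ranges["Balling"]
# Intersection of the two ranges: every threshold placed here (or anywhere
# else) leaves at least one video on the wrong side.
overlap = (max(bat[0], bal[0]), min(bat[1], bal[1]))

print(separable(bat, bal))  # False
print(overlap)              # (0.894, 0.908)
```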
The top-K "matches" at sim ≥ 0.87 are not batting impacts. In the Balling-134448 montage below they include: a standing batsman (not swinging), the main umpire, an outfielder running, and a celebrating player. Under 17-keypoint hip-shoulder normalization, any cricket-attired person standing in side profile becomes geometrically close to the 200+ impact poses in the template pool.
Net: 16 of 17 normalized joints are torso/leg structure that look similar across any standing person; the one wrist joint that could distinguish swinging from standing is L2-averaged to insignificance.
The v1 failure left an obvious hypothesis: maybe the heuristic (size + centrality + KP confidence) picks the wrong person in the 4K test frames (an umpire or a standing batsman instead of the one actually playing the shot). Templates would then be matched against poses of cricket-attired non-actors. v2 tests this hypothesis directly.
"cricket bat" on a 960-px downsample.
Phase-A spike: yolov8s-world with prompt "cricket bat" recalls only 22% of test-frame bats (4/18) and 8% of broadcast bats (1/12).
Phase-B: OWLv2-base jumps to 70% recall on the same frames (median top-1 score 0.31, max 0.54). Bat-bbox precision is essentially 100% in both: when the detector fires, the box always lands on the batsman.
| Class | n | Impact top-1, v1 | Impact top-1, v2 | Impact Δ | FT top-1, v1 | FT top-1, v2 | FT Δ |
|---|---|---|---|---|---|---|---|
| Batting Pose | 8 | 0.8909 | 0.8936 | +0.0027 | 0.8810 | 0.8832 | +0.0022 |
| Balling | 3 | 0.9035 | 0.9041 | +0.0006 | 0.8860 | 0.8840 | -0.0021 |
| Cross-class gap (Balling − Batting) | | +0.0126 | +0.0105 | -0.0021 | +0.0050 | +0.0008 | -0.0042 |
Balling > Batting on top-1 impact similarity means the method is "more confident this is a batting impact" on the Balling group than on the Batting group — the failure mode. v2 narrows the gap but does not flip it on the impact pool; it nearly eliminates it on the ft pool.
Bat-grounded selection works as designed at the selection step: when the bat detector fires, it correctly identifies the batsman. But the headline numbers move by at most ~0.003 per class. The hypothesis "wrong-person selection drives the failure" is refuted: even with the right person, the 17-joint normalized pose representation still cannot distinguish a swinging batsman from a standing batsman or a running fielder. The remaining impact-pool gap (Balling +0.0105 above Batting) is intrinsic to the pose feature, not the selection. v2 is a clean diagnostic that promotes the next bottleneck from "selection" to "feature".
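The selection step can be sketched as a three-way rule (single person, bat-grounded, heuristic fallback). This is a simplified illustration under assumptions: the containment test on the bat-box center, the `margin` parameter, and the area-only fallback are stand-ins, not the report's actual implementation.

```python
def box_area(b):
    """Area of an [x1, y1, x2, y2] box."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def select_batsman(person_boxes, bat_box=None, margin=0.0):
    """Return (index, method) for the person to feed to the pose matcher.

    person_boxes: list of [x1, y1, x2, y2] person detections.
    bat_box: [x1, y1, x2, y2] of the bat detection, or None if the
             bat detector did not fire on this frame.
    margin: hypothetical pixel tolerance around each person box.
    """
    if not person_boxes:
        return None, "none"
    if len(person_boxes) == 1:
        return 0, "single_person"
    if bat_box is not None:
        cx = (bat_box[0] + bat_box[2]) / 2.0
        cy = (bat_box[1] + bat_box[3]) / 2.0
        hits = [
            i for i, (x1, y1, x2, y2) in enumerate(person_boxes)
            if x1 - margin <= cx <= x2 + margin
            and y1 - margin <= cy <= y2 + margin
        ]
        if hits:
            # Several people may contain the bat center; take the largest.
            best = max(hits, key=lambda i: box_area(person_boxes[i]))
            return best, "bat_grounded"
    # Stand-in for the size + centrality + confidence heuristic:
    # only box area is used here.
    best = max(range(len(person_boxes)), key=lambda i: box_area(person_boxes[i]))
    return best, "heuristic"
```

Returning the method label alongside the index makes it possible to report how often each branch fired per video.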
v2 showed that the 201-template impact pool was contaminated: many of the auto-built templates were probably not the actual swinging batsman (umpire, standing batsman, etc.), and bat-grounded selection on query frames cannot fix contamination on the template side. v3 attacks this from the other end: hand-pick 3 broadcast frames of visible swing motion and use them as the entire pool.
pool_best_match against the 3 templates.
res.plot() which drew every YOLO detection (e.g. all 11 persons including the crowd in the harmanpreet_pull broadcast shot), making it look like the matcher consumed multiple poses. It never did. v3.2 swaps the drawing call to res[idx:idx+1].plot() on the bat-grounded idx, so the picture matches what the matcher actually sees: same YOLO native style, just filtered to the one selected batsman.
dataset/pose_templates/: a kid playing a drive (cricket_kid_drive.jpg), Harmanpreet Kaur playing a pull (harmanpreet_pull.jpg), a club batsman driving (screenshot_batsman.png).
force_bbox sidecar hint because OWLv2 / YOLO failed differently on each (see notes panel).
cricket_kid_drive: single person in frame → trivial single_person pick, no hint needed.
harmanpreet_pull: 11 YOLO persons (crowd). OWLv2's bat center landed 18 px outside the main batsman's bbox, so the pick fell to a crowd person. Hint forces the correct bbox.
screenshot_batsman: YOLO split the batsman into two overlapping half-detections (upper + lower body); the lower half had no valid shoulders, so normalization failed. Hint forces the upper-body detection (15/17 valid kps).
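One way a force_bbox hint could override automatic selection is to pick the detection that best overlaps the hinted box. The sketch below is hypothetical: the sidecar format is not shown in this report, so the hint is assumed to be a plain [x1, y1, x2, y2] box, and `pick_with_hint` and `min_iou` are illustrative names.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pick_with_hint(person_boxes, force_bbox, min_iou=0.3):
    """Index of the detection that best overlaps the hinted box,
    or None when no detection overlaps it by at least min_iou."""
    if not person_boxes:
        return None
    scores = [iou(box, force_bbox) for box in person_boxes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= min_iou else None
```

Returning None instead of guessing keeps a stale or misplaced hint from silently selecting the wrong person.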
| Class | n | v1 impact | v2 impact | v3 action | v1 ft | v2 ft |
|---|---|---|---|---|---|---|
| Batting Pose | 8 | 0.8909 | 0.8936 | 0.7437 | 0.8810 | 0.8832 |
| Balling | 3 | 0.9035 | 0.9041 | 0.7312 | 0.8860 | 0.8840 |
| Wild Batting (YouTube) | 3 | n/a | n/a | 0.7988 | n/a | n/a |
| Gap (Batting − Balling, positive = correct direction) | | −0.0126 | −0.0105 | +0.0125 | −0.0050 | −0.0008 |
| Wild Batting − controlled Balling | | n/a | n/a | +0.0676 | n/a | n/a |
Wild-batting top-1 mean (0.7988) is the highest of any group. This is unexpected but explainable: YouTube reels concatenate many discrete swings back-to-back, so the NMS top-1 peak is more likely to land on a frame that closely matches one of the templates. The Batting−Balling gap (controlled set) is +0.0125; the Wild-vs-Balling gap is +0.0676 (~5× larger). v3.2's discriminative direction holds in the wild and is in fact stronger there.
v3 keeps the discriminative direction correct across two template-pool revisions (first set was +0.0133, current set is +0.0125): Batting Pose videos score higher than Balling videos on the matching pool. Magnitude is small, and absolute sim values dropped because 3 templates is a sparse pool. The signal is fragile but consistent.
v3 confirms a clean diagnosis: the failure in v1/v2 was largely about template-pool composition, not just batsman selection or COCO joint representation. When the templates are guaranteed to depict actual swing motion (small but clean pool), the matcher does separate Batting from Balling — though weakly. Caveats: (a) absolute sims dropped because 3 templates is a sparse pool; (b) within-Batting variance is large (top-1 ranges 0.711–0.836); (c) a single threshold still cannot perfectly separate the classes. v3 demonstrates the direction works; the next step is to grow the pool to ~20–30 hand-picked swing frames covering more shot types and viewpoints, then re-evaluate.
From each CricketVision segment (broadcast 720p, labelled GT phases), two templates are extracted and stored in the pool:
| Pool | Frame source | Pipeline | Count |
|---|---|---|---|
| impact | Midpoint of gt_phases[Execution] phase, re-decoded from the 25-fps mp4 | YOLO11n-pose → pick batsman (heuristic: size+centrality+conf) → hip-centered, hip→shoulder-midpoint unit normalization → confidence-mask each joint at threshold 0.30 → store (kp_norm: 17×2, mask: 17, meta) | 201 / 202 |
| ft | The picked best_filename JPG from the VLM grid-pick pipeline (FollowThrough frame) | same as impact | 201 / 202 |
Quality: mean 15.8 / 17 valid joints per template; bbox heights range 151-531 px (broadcast scale).
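The normalization step in the pipeline column can be sketched as follows. This is a minimal version under assumptions: joints follow the standard COCO-17 ordering, all four anchor joints (both shoulders, both hips) must pass the confidence threshold, and masked joints are zeroed. The function name `normalize_pose` is illustrative.

```python
import numpy as np

# COCO-17 keypoint indices for the anchor joints.
L_SHOULDER, R_SHOULDER, L_HIP, R_HIP = 5, 6, 11, 12
CONF_THRESHOLD = 0.30  # per-joint mask threshold from the table above


def normalize_pose(kp_xy, kp_conf, conf_threshold=CONF_THRESHOLD):
    """Hip-center a 17-joint pose and scale it so the distance from the
    hip midpoint to the shoulder midpoint equals 1.

    Returns (kp_norm, mask), or None when an anchor joint is missing.
    """
    kp_xy = np.asarray(kp_xy, dtype=float)      # shape (17, 2)
    kp_conf = np.asarray(kp_conf, dtype=float)  # shape (17,)
    mask = kp_conf >= conf_threshold
    if not mask[[L_SHOULDER, R_SHOULDER, L_HIP, R_HIP]].all():
        return None  # assumption: no torso reference, no template
    hip_mid = (kp_xy[L_HIP] + kp_xy[R_HIP]) / 2.0
    shoulder_mid = (kp_xy[L_SHOULDER] + kp_xy[R_SHOULDER]) / 2.0
    torso = np.linalg.norm(shoulder_mid - hip_mid)
    if torso < 1e-6:
        return None  # degenerate detection
    kp_norm = (kp_xy - hip_mid) / torso
    kp_norm[~mask] = 0.0  # low-confidence joints carry no position
    return kp_norm, mask
```

Dividing by torso length removes the dependence on bbox height, which matters here because templates come from 720p broadcast frames and queries from 4K frames.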
YOLO output on representative CV frames (skeleton + bbox + person confidence). The batsman heuristic picks the highest-scoring person, then that person's 17 joints get normalized into the pool.
For each sampled test_data frame the same pose-extraction pipeline runs. The query pose is compared to every template in each pool via a confidence-mask-weighted L2 distance:
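A minimal sketch of one such distance, vectorized over the whole pool with NumPy broadcasting. Several details are assumptions rather than the report's implementation: the distance is an RMS over joints valid in both poses, a match needs at least `min_joints` shared joints, and similarity uses the hypothetical mapping sim = 1 / (1 + d). The name `pool_best_match_sketch` is illustrative.

```python
import numpy as np


def pool_best_match_sketch(q_kp, q_mask, pool_kp, pool_mask, min_joints=6):
    """Compare one query pose against every template in a pool.

    q_kp: (17, 2) normalized query joints; q_mask: (17,) bool.
    pool_kp: (T, 17, 2) normalized templates; pool_mask: (T, 17) bool.
    Returns (best_template_index, best_similarity).
    """
    q_kp = np.asarray(q_kp, dtype=float)
    pool_kp = np.asarray(pool_kp, dtype=float)
    # A joint counts only if it is valid in both query and template.
    both = np.asarray(pool_mask, dtype=bool) & np.asarray(q_mask, dtype=bool)[None, :]
    # Squared per-joint distance, broadcast over the T templates: (T, 17).
    sq = ((pool_kp - q_kp[None, :, :]) ** 2).sum(axis=2)
    n = both.sum(axis=1)  # shared valid joints per template, shape (T,)
    mean_sq = np.where(n > 0, (sq * both).sum(axis=1) / np.maximum(n, 1), np.inf)
    dist = np.sqrt(mean_sq)
    dist[n < min_joints] = np.inf  # too few shared joints to trust
    best = int(np.argmin(dist))
    sim = 1.0 / (1.0 + dist[best])  # assumed mapping, not from the report
    return best, float(sim)
```

Because the distance averages over all shared joints, a difference confined to one or two joints (such as the wrists) is diluted by the joints that agree.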
Per query frame this yields two scores: best similarity vs impact pool and best similarity vs ft pool. The series is Gaussian-smoothed (σ=3 frames ≈ 100 ms at 30 fps) and greedy-NMS picks the top-K peaks per pool (window ±1 s).
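NaN-tolerant smoothing can be done with a normalized convolution: frames without a usable pose (NaN) contribute neither to the weighted sum nor to the weight total. A minimal sketch, assuming a kernel truncated at 3σ; `nan_gaussian_smooth` and `truncate` are illustrative names.

```python
import numpy as np


def nan_gaussian_smooth(series, sigma=3.0, truncate=3.0):
    """Gaussian-smooth a 1-D series, ignoring NaN samples.

    sigma is in frames (3 frames is about 100 ms at 30 fps).
    """
    x = np.asarray(series, dtype=float)
    radius = int(truncate * sigma + 0.5)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    valid = ~np.isnan(x)
    # "full" mode plus slicing keeps the output the same length as the
    # input even when the series is shorter than the kernel.
    sl = slice(radius, radius + len(x))
    num = np.convolve(np.where(valid, x, 0.0), kernel, mode="full")[sl]
    den = np.convolve(valid.astype(float), kernel, mode="full")[sl]
    out = np.full_like(x, np.nan)
    ok = den > 1e-12
    out[ok] = num[ok] / den[ok]  # weighted mean over valid neighbours only
    return out
```

Dividing by the convolved validity mask also handles the series edges, where part of the kernel falls outside the data.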
| Step | What | Tools |
|---|---|---|
| 1. Template extraction | Per CV segment: impact + ft template; total 201+201 templates | YOLO11n-pose, OpenCV decode @25fps mp4 |
| 2. Query extraction | Test video sequentially decoded; sample every 4th frame (120→30 fps) | YOLO11n-pose; v2 adds OWLv2-base for batsman selection |
| 3. Pose match | Vectorized argmin distance over the 201-template pool per frame | NumPy broadcasting |
| 4. Temporal smoothing | NaN-tolerant Gaussian σ=3 frames (≈100 ms) | NumPy convolve |
| 5. Peak picking | Greedy NMS with ±1 s window, top-5 per pool | NumPy |
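Step 5 can be sketched as a greedy non-maximum suppression over the smoothed series. This is an illustration under assumptions: a peak suppresses every frame within ±window_s seconds inclusive, and NaN frames are skipped. The name `greedy_nms_peaks` is hypothetical.

```python
import math


def greedy_nms_peaks(scores, fps=30.0, window_s=1.0, top_k=5):
    """Return up to top_k (frame_index, score) peaks, highest first.

    After a frame is picked, all frames within window_s seconds of it
    are excluded from later picks.
    """
    win = int(round(window_s * fps))
    order = sorted(
        (i for i, s in enumerate(scores) if not math.isnan(s)),
        key=lambda i: scores[i],
        reverse=True,
    )
    picked = []
    for i in order:
        if all(abs(i - j) > win for j, _ in picked):
            picked.append((i, scores[i]))
            if len(picked) == top_k:
                break
    return picked
```

With top_k=5 and a ±1 s window, the five reported peaks per pool are at least one second apart, so one swing cannot fill several slots.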
Initial implementation used cap.set(CAP_PROP_POS_FRAMES, idx)
to seek to each sampled frame. On H.264 4K@120fps this costs ~3 s per seek
(walk from prior keyframe), giving ~0.3 fps throughput. Switching to
sequential decode + modulo-skip gives ~6.5 fps (≈25× speedup,
11 videos in ~22 minutes instead of ~9 hours).
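The sequential decode with modulo-skip pattern can be sketched as a generator. It assumes an object that behaves like OpenCV's cv2.VideoCapture, where grab() advances by one frame and retrieve() returns the frame just grabbed; `iter_sampled_frames` is an illustrative name, and the path in the usage comment is hypothetical.

```python
def iter_sampled_frames(cap, step=4):
    """Yield (frame_index, frame) for every step-th frame, reading the
    stream strictly in order instead of seeking.

    cap: object with grab() -> bool and retrieve() -> (ok, frame),
         for example a cv2.VideoCapture.
    """
    idx = 0
    while cap.grab():  # advance one frame; False at end of stream
        if idx % step == 0:
            ok, frame = cap.retrieve()  # fetch only the frames we keep
            if ok:
                yield idx, frame
        idx += 1


# Usage with OpenCV (not executed here):
#   import cv2
#   cap = cv2.VideoCapture("test_video.mp4")  # hypothetical path
#   for idx, frame in iter_sampled_frames(cap, step=4):  # 120 -> 30 fps
#       ...
#   cap.release()
```

Reading in order avoids re-walking from the prior keyframe on each sampled frame, which is where the per-seek cost in the seek-based version comes from.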
Each card shows the v2 (bat-grounded) run. The card header reports v1↔v2 top-1 sims and the v2 selection-method split (green = bat-grounded, grey = single person, orange = heuristic fallback). The two side-by-side images per card show top-5 peaks for each pool: query frame (red border) → matched template.
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.8938 | 0.8897 | -0.0041 | 0.7316 |
| ft top-1 | 0.9018 | 0.8989 | -0.0029 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.8850 | 0.9078 | +0.0228 | 0.7479 |
| ft top-1 | 0.8898 | 0.8902 | +0.0004 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.9077 | 0.9076 | -0.0001 | 0.7258 |
| ft top-1 | 0.8614 | 0.8846 | +0.0232 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.8888 | 0.8878 | -0.0010 | 0.6871 |
| ft top-1 | 0.8527 | 0.8509 | -0.0017 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.8909 | 0.8917 | +0.0008 | 0.7324 |
| ft top-1 | 0.8928 | 0.8931 | +0.0003 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.8823 | 0.8846 | +0.0023 | 0.8004 |
| ft top-1 | 0.8782 | 0.8783 | +0.0001 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.8777 | 0.8790 | +0.0013 | 0.7750 |
| ft top-1 | 0.8854 | 0.8827 | -0.0028 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.9009 | 0.9006 | -0.0003 | 0.7339 |
| ft top-1 | 0.8863 | 0.8872 | +0.0009 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.8866 | 0.8868 | +0.0002 | 0.7307 |
| ft top-1 | 0.8998 | 0.8937 | -0.0061 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.9297 | 0.9300 | +0.0003 | 0.7269 |
| ft top-1 | 0.8729 | 0.8736 | +0.0007 |
| pool | v1 (heuristic) | v2 (bat-grounded) | Δ v2-v1 | v3.2 (3-tpl action) |
|---|---|---|---|---|
| impact top-1 | 0.8943 | 0.8955 | +0.0012 | 0.7359 |
| ft top-1 | 0.8854 | 0.8846 | -0.0008 |
Long YouTube clips (5–12 min) sampled at 5 fps (vs 30 fps on the controlled set) — these are minutes-long highlight reels with many discrete swings, so coarse sampling is enough to localize peaks. v1/v2 templates don't apply; only v3.2 (3 user-curated swing templates + bat-grounded selection) was run.
| v3.2 action | top-1 (smoothed) | max raw | mean (when pose) |
|---|---|---|---|
| — | 0.8091 | 0.8449 | 0.6818 |
| v3.2 action | top-1 (smoothed) | max raw | mean (when pose) |
|---|---|---|---|
| — | 0.8043 | 0.8361 | 0.6645 |
| v3.2 action | top-1 (smoothed) | max raw | mean (when pose) |
|---|---|---|---|
| — | 0.7807 | 0.8529 | 0.7134 |
Three iterations land on a clean diagnosis:
pool_best_match; the v3.1 thumbs misled by drawing all YOLO detections).
The combined evidence: (1) v1 → v2 shows person-selection is not the dominant bottleneck; (2) v2 → v3.2 shows template-pool composition matters far more than the matcher's selection logic on query frames. Templates auto-built from "midpoint of Execution phase" capture many non-swing stance/recovery poses; user-curated swing frames give the matcher something semantically specific to align to.
Productive paths from here (E7+):