# Golf datasets

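Sourced 2026-05-25 for the motion-analysis project. Four sub-corpora:
in-the-wild swings (GolfDB), controlled-camera features (CaddieSet),
lab-grade 3D pose (AthletePose3D, which turned out to contain no golf), and
broadcast highlights (placeholder, not yet populated).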

## Contents

| Path | Source | Size | What's there | What's missing |
|---|---|---|---|---|
| `golfdb/` | [wmcnally/GolfDB](https://github.com/wmcnally/golfdb) — McNally et al. 2019 + [Kaggle mirror](https://www.kaggle.com/datasets/marcmarais/videos-160) | 680 MB | metadata + **1400 trimmed swing clips at 160×160** (`videos_160/<id>.mp4`, frame counts verified to match `events[-1] - events[0] + 1`) | Full-res 580-video YouTube pull still blocked by bot-detection. Only needed for work above 160×160. |
| `caddieset/` | [damilab/CaddieSet](https://github.com/damilab/CaddieSet) — Jung et al. CVPRW 2025 | 508 KB | `CaddieSet.csv` (1757 shots × 80 cols: view, club, ball flight, 8-phase joint features) | Authors did not release raw videos. CSV is the entire public release. |
| `athletepose3d/` | [calvinyeungck/AthletePose3D](https://github.com/calvinyeungck/AthletePose3D) — Yeung et al. CVSports@CVPR 2025 | 36 GB | `data.zip` (raw video+motion, 35 GB), `cam_param.json`, 3 model checkpoints (1 GB total) | **No golf.** Verified motion types are skating jumps (Axel/Flip/Loop/Lutz/Salchow/Toeloop/Comb) + throws (Discus/Javelin/Shot_put) + Running. `pose_2d.zip` / `pose_3d.zip` blocked by Google Drive's 24h quota. Consider deleting `data.zip` if not useful for cross-sport pose reference. |
| `broadcast/` | YouTube curation (TBD) | — | empty placeholder | Tournament source to be picked (PGA / Masters / Open / LIV). Needs user direction. |
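
To re-check the CaddieSet row of the table, a minimal sketch using only the standard library is shown here. It assumes `CaddieSet.csv` has a single header row and that the "1757 shots" figure counts data rows only; `csv_shape` is a helper name invented for this example.

```python
import csv


def csv_shape(path):
    """Return (data_rows, columns) for a CSV file with one header row."""
    # utf-8-sig also reads plain UTF-8; it only strips a BOM if one is present.
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader)              # first row is assumed to be the header
        n_rows = sum(1 for _ in reader)    # remaining rows are shots
    return n_rows, len(header)


if __name__ == "__main__":
    # Path is relative to this directory; compare the result with the table above.
    print(csv_shape("caddieset/CaddieSet.csv"))
```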

## Licenses (read before downstream use)

- **GolfDB**: CC-BY-NC 4.0 (non-commercial only).
- **CaddieSet**: see `caddieset/LICENSE` in this dir.
- **AthletePose3D**: non-commercial research only, full text in `athletepose3d/LICENSE.md`.

## GolfDB videos: which route to use

- **For 160×160 trimmed swing clips (default)**: nothing to do. They're already
  on disk at `golfdb/videos_160/<id>.mp4`, pulled from the Kaggle mirror
  ([marcmarais/videos-160](https://www.kaggle.com/datasets/marcmarais/videos-160),
  CC-BY 4.0). 1400 clips, 680 MB.
- **For full-resolution from YouTube**: the 580-video native-res pull via
  `golfdb/meta/download_videos.py` is currently blocked by YouTube bot-detection.
  Unblocking it requires a user-supplied `cookies.txt` (exported from a
  logged-in browser session via the "Get cookies.txt LOCALLY" extension),
  placed at `golfdb/meta/cookies.txt`. Only worth the trouble if 160×160 is
  too small.
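
The Contents table says the trimmed clips match `events[-1] - events[0] + 1`. A sketch of that check follows, under the assumption that `events` holds frame indices and that first and last are both inclusive. `expected_frames`, `find_mismatches`, and the `count_frames` callable are hypothetical names, not part of GolfDB.

```python
def expected_frames(events):
    """Clip length implied by the annotation: first and last frame inclusive."""
    return int(events[-1]) - int(events[0]) + 1


def find_mismatches(annotations, count_frames):
    """Compare annotated clip lengths with measured ones.

    annotations:  iterable of (clip_id, events) pairs, however they were loaded.
    count_frames: callable mapping clip_id -> number of frames in the clip on disk.
    Returns a list of (clip_id, expected, actual) for clips that disagree.
    """
    mismatches = []
    for clip_id, events in annotations:
        want = expected_frames(events)
        got = count_frames(clip_id)
        if got != want:
            mismatches.append((clip_id, want, got))
    return mismatches
```

`count_frames` is left abstract so the check does not depend on a particular video library. One option is to wrap OpenCV's `cv2.VideoCapture(path).get(cv2.CAP_PROP_FRAME_COUNT)`, bearing in mind that the container-reported count can differ from the number of frames that actually decode.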

## AthletePose3D — verified motion types

Enumerated from the `data.zip` namelist (18 distinct strings, 13 motion types
once the `_error` variants are folded in):

```
Running     2796     Comb        462
Axel        1440     Javelin     456
Flip        1440     Discus      480
Salchow     1440     Spin_discus 480
Loop        1437     Shot_put    480
Toeloop     1437     Glide_shot_put 480
Lutz        1434     ( + _error variants for the throws )
```

So **no golf**. The checkpoints and `data.zip` are still useful for generic
athletic-pose pretraining; otherwise this whole subdir can be removed.
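
The enumeration above can be repeated without extracting the 35 GB archive, since `zipfile` reads only the central directory to produce the namelist. The sketch below is illustrative: it assumes the motion label is the leading non-numeric part of each file stem (for example `Axel_1_cam_1.mp4` giving `Axel`), which may not match the real layout of `data.zip`. `motion_label` and `count_motions` are hypothetical helpers.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def motion_label(name):
    """Leading '_'-separated tokens of the file stem, up to the first numeric one.

    ASSUMPTION: archive members look like '<Motion>_<n>_...'; adjust if they do not.
    """
    tokens = []
    for tok in PurePosixPath(name).stem.split("_"):
        if tok.isdigit():
            break
        tokens.append(tok)
    return "_".join(tokens)


def count_motions(names, fold_errors=True):
    """Count archive members per motion label, optionally folding '_error' variants."""
    counts = Counter()
    for name in names:
        if name.endswith("/"):          # skip directory entries
            continue
        label = motion_label(name)
        if fold_errors and label.endswith("_error"):
            label = label[: -len("_error")]
        counts[label] += 1
    return counts


def count_motions_in_zip(zip_path, fold_errors=True):
    """Same as count_motions, reading names from the zip's central directory."""
    with zipfile.ZipFile(zip_path) as zf:
        return count_motions(zf.namelist(), fold_errors)
```

Calling `count_motions_in_zip("athletepose3d/data.zip", fold_errors=False)` should list the raw strings, and the default call the folded motion types, provided the naming assumption holds.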

## How to add a broadcast corpus

Mirror the pattern in `../../cricket_highlight/` (a sibling sub-project, not
part of `data/`): one config-driven pipeline per tournament source, with
downloads landing under `dataset/raw/` and trimmed clips under
`dataset/clips/`. For golf, the natural candidates are the four majors
(Masters / PGA Championship / US Open / The Open) plus PGA Tour weekly
highlights. Pick a source first, then we set up the scraping.
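
As a starting point, a per-source config might look like the sketch below. It is not taken from `cricket_highlight`: `SourceConfig`, `slug`, `video_urls`, and the per-source subfolders are all assumptions made for illustration. Only the `dataset/raw/` and `dataset/clips/` layout comes from the description above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical description of one tournament source."""
    slug: str                        # short identifier, e.g. "masters"
    root: Path = Path("broadcast")   # sub-corpus directory (assumed)
    video_urls: tuple = ()           # filled in once a source is picked

    @property
    def raw_dir(self):
        # Untrimmed downloads for this source.
        return self.root / "dataset" / "raw" / self.slug

    @property
    def clips_dir(self):
        # Trimmed clips for this source.
        return self.root / "dataset" / "clips" / self.slug


def ensure_dirs(cfg):
    """Create the raw and clips directories for a source if they are missing."""
    for d in (cfg.raw_dir, cfg.clips_dir):
        d.mkdir(parents=True, exist_ok=True)
```

Keeping the paths as derived properties means a new source only needs a slug and a URL list, and the download and trimming steps can share the same config object.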
