# GolfDB (McNally et al. 2019)

Video database for **golf swing sequencing** — detecting 8 swing events
(Address, Toe-up, Mid-backswing, Top, Mid-downswing, Impact, Mid-follow-through,
Finish) in trimmed swing clips.

- Paper: [arXiv:1903.06528](https://arxiv.org/abs/1903.06528)
- Upstream: [github.com/wmcnally/golfdb](https://github.com/wmcnally/golfdb)
- License: CC-BY-NC 4.0
- Records: 1400 swing clip annotations across **580 unique YouTube videos**
  (videos contain regular- and slow-mo swings; multiple records per video).

## On-disk layout

```
golfdb/
├── meta/
│   ├── golfDB.mat               # records: id, youtube_id, player, sex, club, view, slow, events, bbox, split
│   ├── golfDB.pkl               # ⚠ pandas version mismatch with current env — read .mat instead
│   ├── GolfDB.csv               # subset (no youtube_id / split) shipped by the Kaggle mirror — convenient for pandas
│   ├── generate_splits.py       # upstream: .mat → DataFrame → per-fold train/val pickles
│   ├── preprocess_videos.py     # upstream: clip videos using event labels
│   ├── download_videos.py       # ours: pull 580 unique YouTube videos via yt-dlp (blocked, see below)
│   └── UPSTREAM_README.md
├── videos_160/                   # ✅ 1400 trimmed per-swing clips, 160×160, ~30 fps (Kaggle mirror)
│                                 #    filename = record id, frame count = events[-1] - events[0] + 1
└── download.log                  # per-ID status from download_videos.py (full-res route, currently blocked)
```

## Record schema (golfDB.mat)

- `id` — record index (0..1399)
- `youtube_id` — 11-char YouTube ID; 580 unique values
- `player`, `sex` — golfer (uppercase), 'm'/'f'
- `club` — driver / iron / fairway / hybrid / wedge
- `view` — face-on / down-the-line / other
- `slow` — 0 = native speed, 1 = slow-mo
- `events` — 10 frame indices into the source video: `[start_pad, 8 swing events, end_pad]`; subtract `events[0]` to index into the trimmed clip
- `bbox` — normalized `[x, y, w, h]` for the golfer
- `split` — 1..4 (4-fold CV)
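
The `events` and `split` fields are the two that downstream code usually has to transform. A minimal sketch, assuming each record is a plain dict keyed by the field names above and that every clip in `videos_160/` starts at `events[0]` (as the frame-count relation in the layout suggests). The helper names `to_clip_relative`, `swing_events`, and `train_val_split` are hypothetical and not part of upstream:

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]  # hypothetical in-memory form of one golfDB row


def to_clip_relative(events: Sequence[int]) -> List[int]:
    """Shift the 10 frame indices so that the start pad becomes frame 0."""
    if len(events) != 10:
        raise ValueError(f"expected 10 frame indices, got {len(events)}")
    start = events[0]
    return [e - start for e in events]


def swing_events(events: Sequence[int]) -> List[int]:
    """Drop the two padding entries, leaving the 8 labelled swing events."""
    return list(events[1:-1])


def train_val_split(
    records: Sequence[Record], fold: int
) -> Tuple[List[Record], List[Record]]:
    """Hold out one of the 4 folds for validation; train on the other three."""
    if fold not in (1, 2, 3, 4):
        raise ValueError("fold must be in 1..4")
    train = [r for r in records if r["split"] != fold]
    val = [r for r in records if r["split"] == fold]
    return train, val
```

For example, `swing_events(to_clip_relative(rec["events"]))` would give the 8 event frames as offsets into `videos_160/<id>.mp4`, under the assumption stated above.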

## Distributions (from this release)

```
total records:    1400
unique videos:    580
slow=1:           642
view: down-the-line 585, face-on 461, other 354
club: driver 952, iron 229, fairway 162, hybrid 34, wedge 23
```
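
These counts can be recomputed from the records with a simple tally. A sketch, assuming records are dicts with the schema's field names; `tally` is a hypothetical helper, and the constants only restate the numbers reported above so that each breakdown can be checked against the record total:

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def tally(records: Iterable[Mapping[str, object]], field: str) -> Dict[object, int]:
    """Count how many records take each value of `field`."""
    return dict(Counter(r[field] for r in records))


# Counts as reported in this README, restated for a consistency check.
REPORTED_TOTAL = 1400
REPORTED_VIEW = {"down-the-line": 585, "face-on": 461, "other": 354}
REPORTED_CLUB = {"driver": 952, "iron": 229, "fairway": 162, "hybrid": 34, "wedge": 23}

# Each record has exactly one view and one club, so both breakdowns sum to the total.
assert sum(REPORTED_VIEW.values()) == REPORTED_TOTAL
assert sum(REPORTED_CLUB.values()) == REPORTED_TOTAL
```

Calling `tally(records, "view")` on the loaded records should reproduce the `view` line if the local copy matches this release.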

## Two routes to the videos

### ✅ Route A: Kaggle mirror (already done — 160×160 preprocessed)

```bash
kaggle datasets download -d marcmarais/videos-160 -p videos_160 --unzip
```

Clips land as `videos_160/<record_id>.mp4`, which **bypasses the YouTube bot-block entirely**.
The download is 1400 trimmed swing clips, ~680 MB in total. For sample IDs
0, 1, 700, and 1399, the ffprobe frame count matches `events[-1] - events[0] + 1`.

This is the recommended route for almost all downstream work — small, fast,
already trimmed to the swing window.
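
The frame-count check mentioned above can be repeated for any record. This is one possible way to do it, not necessarily the invocation used for the original spot check: it asks ffprobe to decode the first video stream and report `nb_read_frames`. It assumes `ffprobe` is on `PATH`; the function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def expected_frame_count(events: Sequence[int]) -> int:
    """Frames in a trimmed clip: start pad through end pad, inclusive."""
    return events[-1] - events[0] + 1


def ffprobe_cmd(path: Path) -> List[str]:
    """ffprobe invocation that decodes video stream 0 and prints its frame count."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-count_frames",
        "-show_entries", "stream=nb_read_frames",
        "-of", "csv=p=0",
        str(path),
    ]


def actual_frame_count(path: Path) -> int:
    """Run ffprobe and parse the single number it prints."""
    result = subprocess.run(
        ffprobe_cmd(path), capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip().split(",")[0])


def clip_matches(
    record_id: int, events: Sequence[int], root: Path = Path("videos_160")
) -> bool:
    """Compare the decoded frame count of <root>/<record_id>.mp4 with the annotation."""
    return actual_frame_count(root / f"{record_id}.mp4") == expected_frame_count(events)
```

`-count_frames` decodes the whole stream, which is slower than reading container metadata but does not depend on the header carrying a frame count; at 160×160 that cost is small.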

### ⏸ Route B: full-resolution from YouTube (blocked)

`meta/download_videos.py` pulls the 580 unique source videos at native
resolution. At present, YouTube answers every request from this host's IP with
`Sign in to confirm you're not a bot`. Switching `player_client` (tv, ios,
web_safari, tv_simply, android_vr) did not help: every variant was blocked.
To unblock, supply a YouTube `cookies.txt` exported from a real browser
session (see `../README.md`). This route is only needed if you want
resolutions above 160×160.
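
As an illustration of the cookie-based workaround, the sketch below builds a yt-dlp command line that passes an exported cookie file via `--cookies`. It is not taken from `meta/download_videos.py`, which may be structured differently. The output directory `videos_full` and the function names are hypothetical, and whether cookies actually lift the block on this host is untested here.

```python
import subprocess
from pathlib import Path
from typing import List


def ytdlp_cmd(youtube_id: str, cookies: Path, out_dir: Path) -> List[str]:
    """Build a yt-dlp command that authenticates with an exported cookie file."""
    return [
        "yt-dlp",
        "--cookies", str(cookies),               # Netscape-format cookies.txt
        "-o", str(out_dir / "%(id)s.%(ext)s"),   # name each file after its YouTube ID
        f"https://www.youtube.com/watch?v={youtube_id}",
    ]


def download(
    youtube_id: str,
    cookies: Path = Path("cookies.txt"),
    out_dir: Path = Path("videos_full"),  # hypothetical target directory
) -> bool:
    """Return True if yt-dlp exits cleanly for this video."""
    out_dir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(ytdlp_cmd(youtube_id, cookies, out_dir)).returncode == 0
```

Looping `download` over the 580 unique `youtube_id` values and recording each return value would give a per-ID status similar in spirit to `download.log`.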
