NPS Availability Data Pipeline
Self-hosted pipeline that snapshots 627 reservable National Park Service facilities every 6 hours — building a historical dataset of campsite, permit, and timed-entry availability.
A self-hosted data pipeline (codename CampingReservation) that discovers every reservable National Park Service facility on Recreation.gov and captures real-time availability snapshots on a recurring schedule, building a historical dataset for trend analysis on campsite, permit, and timed-entry reservations.
Problem
Booking a campsite at popular national parks (Yosemite, Grand Canyon, Glacier) is essentially a race against bots and instant sellouts. Public APIs expose current availability — but no one publishes the historical record of when sites open, when they get claimed, and how booking patterns shift across seasons. This project fills that gap: it builds and continuously refreshes a private dataset of availability over time, so I can mine it later for cancellation patterns, release-day behavior, and the best windows to book specific sites.
What it does
- Discovers: every reservable NPS facility across six categories — campgrounds, permits, timed entry, tickets, activity passes, and venue reservations — via the official RIDB API
- Normalizes: a four-strategy state-code normalization cascade to handle inconsistent location metadata in the upstream feed
- Snapshots: real-time availability for every facility across a rolling six-month window via Recreation.gov's unofficial availability endpoint
- Schedules: via cron (Linux) or launchd (macOS), with lock files preventing overlapping runs and automatic log rotation
- Persists: to local disk, AWS S3 (gzip-compressed), and a mounted NAS — fan-out storage so no single backend is a single point of failure
Approach
A Python pipeline runs unattended on a Proxmox-hosted Ubuntu VM, scheduled via cron. It discovers facilities through the official RIDB API and snapshots availability via Recreation.gov's unofficial endpoint, using rate-limit-aware HTTP clients, exponential backoff, and a four-strategy state-code normalization cascade to handle inconsistent upstream metadata. Every run is persisted to local disk, gzip-compressed S3, and a NAS NFS mount — fan-out storage so no single backend is a single point of failure.
Notable engineering decisions
Rate-limit-aware client
A reusable RateLimiter token bucket caps RIDB calls at 45 req/min (10% under the documented 50/min ceiling) and the unofficial availability endpoint at 55 req/min. Every request is wrapped in retry-with-exponential-backoff for 429, 5xx, and timeout errors, so transient failures don't surface as data gaps in the historical record.
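For illustration, here is a minimal sketch of that pattern: a token-bucket limiter plus a backoff wrapper around requests. It borrows the RateLimiter name from above, but the method signatures, retry counts, and jitter are assumptions rather than the project's exact implementation.

```python
import random
import time

import requests


class RateLimiter:
    """Token bucket: allow at most rate_per_min acquisitions per minute."""

    def __init__(self, rate_per_min: int):
        self.capacity = rate_per_min
        self.tokens = float(rate_per_min)
        self.fill_rate = rate_per_min / 60.0          # tokens refilled per second
        self.last = time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.fill_rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.fill_rate)   # wait for the next token


def get_with_backoff(session: requests.Session, limiter: RateLimiter, url: str,
                     max_retries: int = 5, **kwargs) -> requests.Response:
    """GET wrapped in the limiter, retrying 429/5xx/timeouts with exponential backoff."""
    for attempt in range(max_retries):
        limiter.acquire()
        try:
            resp = session.get(url, timeout=30, **kwargs)
            if resp.status_code == 429 or resp.status_code >= 500:
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            return resp
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())       # exponential backoff with jitter


ridb_limiter = RateLimiter(rate_per_min=45)          # 10% under RIDB's documented 50/min
availability_limiter = RateLimiter(rate_per_min=55)  # unofficial availability endpoint
```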
Reliable facility discovery
A naive keyword search misses facilities (e.g. "Grand Canyon" doesn't return all three of its campgrounds). The pipeline uses a two-step pattern: search recreation areas first, then enumerate facilities under each matching area. This caught ~30% more facilities than a flat keyword scan during testing.
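A hedged sketch of the two-step pattern against RIDB's v1 endpoints follows; pagination, rate limiting, and error handling are omitted, and the RECDATA field names follow the public RIDB schema as I understand it.

```python
import requests

RIDB_BASE = "https://ridb.recreation.gov/api/v1"


def discover_facilities(api_key: str, query: str) -> list[dict]:
    """Two-step discovery: find matching recreation areas first, then
    enumerate every facility under each area, instead of a flat keyword scan."""
    headers = {"apikey": api_key}

    # Step 1: search recreation areas by keyword ("Grand Canyon", "Yosemite", ...)
    areas = requests.get(
        f"{RIDB_BASE}/recareas",
        headers=headers,
        params={"query": query, "limit": 50},
        timeout=30,
    ).json().get("RECDATA", [])

    # Step 2: list the facilities attached to each matching area
    facilities: list[dict] = []
    for area in areas:
        resp = requests.get(
            f"{RIDB_BASE}/recareas/{area['RecAreaID']}/facilities",
            headers=headers,
            params={"limit": 50},
            timeout=30,
        )
        facilities.extend(resp.json().get("RECDATA", []))
    return facilities
```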
Four-strategy state normalization
Upstream data has missing, malformed, and inconsistent state fields. utils.py cascades through four strategies — string cleaning → full-name lookup → parent-park lookup → bounding-box geocoding — to fill in a state for every facility. It ships with a --fix-states mode that re-runs normalization without re-fetching from the API.
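Sketched below under stated assumptions: the lookup tables and helper shape are hypothetical stand-ins for what utils.py actually uses, but the four strategies run in the same order.

```python
# Hypothetical sketch of the cascade; lookup tables abbreviated for illustration.
US_STATE_CODES = {"AZ", "CA", "MT", "WY"}                    # ...all 50 states + territories
FULL_NAME_TO_CODE = {"arizona": "AZ", "california": "CA"}    # ...
PARK_TO_STATE = {"Grand Canyon National Park": "AZ"}         # parent-park fallback table
STATE_BOUNDING_BOXES = {"AZ": (31.3, -114.8, 37.0, -109.0)}  # (min_lat, min_lon, max_lat, max_lon)


def normalize_state(raw: str | None, parent_park: str | None,
                    lat: float | None, lon: float | None) -> str | None:
    # 1. String cleaning: strip whitespace/punctuation, uppercase, accept valid codes
    if raw:
        cleaned = raw.strip().strip(".,").upper()
        if cleaned in US_STATE_CODES:
            return cleaned
        # 2. Full-name lookup: "Arizona" -> "AZ"
        code = FULL_NAME_TO_CODE.get(cleaned.lower())
        if code:
            return code
    # 3. Parent-park lookup: inherit the state of the enclosing park
    if parent_park in PARK_TO_STATE:
        return PARK_TO_STATE[parent_park]
    # 4. Bounding-box geocoding: match the facility's coordinates to a state's box
    if lat is not None and lon is not None:
        for code, (south, west, north, east) in STATE_BOUNDING_BOXES.items():
            if south <= lat <= north and west <= lon <= east:
                return code
    return None
```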
Operational hygiene
A lock file in the snapshot directory prevents overlapping cron runs (a 68-minute full snapshot can collide with the next 6-hour tick). Log rotation prunes anything older than 30 days. An idempotent setup script installs Python 3.12+, uv, and project deps on a fresh VPS — re-runnable safely. An interactive cron installer walks through schedule and storage choices. Dry-run mode lists every facility and request that would be made, without touching either API.
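The locking piece looks roughly like the sketch below; the path and cleanup policy are illustrative. Creating the lock file with O_EXCL is atomic, so a second cron tick exits immediately instead of doubling the load.

```python
import os
import sys

LOCK_PATH = "/data/snapshots/.snapshot.lock"   # illustrative location


def acquire_lock() -> bool:
    """Atomically create the lock file; fail if a previous run still holds it."""
    try:
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())    # record the owning PID for debugging
    os.close(fd)
    return True


if not acquire_lock():
    print("previous snapshot still running, exiting", file=sys.stderr)
    sys.exit(0)
try:
    ...  # run the full snapshot here
finally:
    os.remove(LOCK_PATH)
```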
Trend analytics dashboard
Built an interactive analytics dashboard that surfaces trends from the historical snapshot archive — release-day heatmaps, cancellation-window detection, and per-site "best time to book" recommendations. Turns raw JSON into actionable booking intelligence across hundreds of campgrounds, permits, and timed-entry facilities.
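As a sketch of how a release-day heatmap could fall out of the archive (the diff-record shape here is hypothetical, not the dashboard's actual schema): for each newly opened site-date, count the weekday and hour at which the opening was first observed.

```python
from collections import Counter
from datetime import datetime

# Hypothetical diff records: one per site-date that flipped to "Available"
# between consecutive snapshots, tagged with the snapshot time that first saw it.
openings = [
    {"facility_id": "10001", "date": "2025-07-04", "seen_at": "2025-03-01T10:02:00"},
    # ...
]

# Release-day heatmap: count openings by (weekday, hour) of first observation
heatmap = Counter()
for opening in openings:
    seen = datetime.fromisoformat(opening["seen_at"])
    heatmap[(seen.strftime("%a"), seen.hour)] += 1

for (weekday, hour), count in heatmap.most_common(5):
    print(f"{weekday} {hour:02d}:00  {count} new openings")
```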
Real-time availability + notification service
Diffs each snapshot against the prior run to detect newly opened sites and notifies users within minutes of a cancellation — enabling instant booking on otherwise sold-out high-demand reservations.
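A minimal sketch of the diff step, assuming a per-facility JSON shape that is illustrative rather than the pipeline's exact schema:

```python
import json
from pathlib import Path


def newly_available(prev_path: Path, curr_path: Path) -> list[tuple[str, str]]:
    """Return (site_id, date) pairs that were not available in the previous
    snapshot but are available now. Assumes an illustrative per-facility shape:
    {"sites": {"<site_id>": {"<YYYY-MM-DD>": "Available" | "Reserved", ...}}}"""
    prev = json.loads(prev_path.read_text())
    curr = json.loads(curr_path.read_text())
    opened = []
    for site_id, dates in curr.get("sites", {}).items():
        prev_dates = prev.get("sites", {}).get(site_id, {})
        for date, status in dates.items():
            if status == "Available" and prev_dates.get(date) != "Available":
                opened.append((site_id, date))
    return opened
```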
Architecture
Two upstream APIs (RIDB for facility metadata, Recreation.gov for availability) feed a shared Python layer with rate-limited clients, exponential backoff, and normalization utilities. Output is fanned out to three storage backends — local disk, gzip-compressed S3, and a NAS NFS mount. Every run produces a timestamped directory with a manifest.json (counts, errors, duration) and per-facility JSON files, trivially diffable and greppable.
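For a concrete picture, here is a hypothetical run directory and manifest; field names and every count other than the 627 facilities and the ~68-minute duration are made up for illustration.

```python
# Illustrative snapshot layout (names are representative, not exact):
#   snapshots/2025-03-01T10-00-00Z/
#     manifest.json            <- counts, errors, duration for the run
#     facility_10001.json      <- per-facility availability
#     facility_10002.json
#     ...
import json
from pathlib import Path

run_dir = Path("snapshots/2025-03-01T10-00-00Z")
run_dir.mkdir(parents=True, exist_ok=True)

manifest = {                       # hypothetical field names
    "facilities_total": 627,
    "facilities_ok": 625,
    "errors": 2,
    "duration_seconds": 4080,      # ~68 minutes
    "started_at": "2025-03-01T10:00:00Z",
}
(run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
```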
Outcome
Live in production: snapshotting 627 NPS facilities every 6 hours, ~3,762 API calls per full snapshot (roughly six availability requests per facility across the rolling six-month window), ~68 minutes per run under rate limits, and a 70–80% size reduction on S3 archives. The historical dataset is accumulating for downstream analysis — release-day heatmaps, cancellation-window detection, and "best time to book" recommendations from the archive.
What I learned
Rate limits are a system design problem, not just a flag.
Documented limits are upper bounds, not safe targets. Building a reusable RateLimiter + backoff utility once paid back across both APIs, and the 10% safety margin made the difference between clean runs and getting throttled into incomplete datasets.
Production cron is mostly about what happens when things go wrong.
Locking, log rotation, idempotent setup, and dry-run modes were more impactful for reliability than any algorithmic choice.
Data normalization belongs in the ingestion path.
Doing the four-strategy state normalization at write time — and shipping a --fix-states reprocessing mode — meant downstream analysis can trust the schema without rewriting it.
JSON on disk is enough until it isn't.
No DB, no ORM, no migrations — just versioned snapshot directories that are trivially diffable, greppable, and S3-uploadable. I'll move to a real store when query patterns demand it.
What I’d build next
- Web app (FastAPI + lightweight frontend) — browse the registry, build trip itineraries, and configure watches without editing JSON by hand
- Multi-agency support — the discovery script already accepts an --all-orgs flag; extend snapshots beyond NPS to USFS, BLM, Army Corps, etc.
- Push-notification fan-out — extend the existing alert service from email to Discord and iOS push for faster reaction to high-demand cancellations