# Gitcrawl
Gitcrawl is a local-first [[GitHub]] triage tool and a drop-in caching shim for the [[GitHub CLI]]. It mirrors issues, PRs, and adjacent metadata into a local [[SQLite]] database, then transparently serves read-only `gh` calls from that cache while letting mutating commands hit GitHub live.
The pitch is two-in-one: it solves API rate-limit exhaustion *and* makes triage tractable on repos with hundreds or thousands of open issues — without giving up the muscle memory of `gh`.
## The problem
GitHub's API rate limits punish exactly the workflows that need it most: monorepo triage, automated agents, dashboards, large-scale issue reviews. The web UI's search and filters are too coarse for serious triage. And every script that calls `gh issue list` or `gh pr view` is hitting the live API, redundantly, every time.
Gitcrawl flips this. The cache becomes the primary store; GitHub becomes the upstream replication source.
## How it works
- **SQLite-backed mirror.** Stores issues, PRs, comments, reviews, files, commits, checks, and workflow runs in `~/.config/gitcrawl/gitcrawl.db`.
- **`gh`-compatible shim.** Intercepts read-only `gh` commands and serves them from the local DB. Mutating commands (create, comment, merge) pass through to live GitHub unchanged. Drop-in: existing scripts don't need rewriting.
- **Cache visibility.** `gh xcache stats` exposes hit rate, freshness, and size of the local store.
- **Semantic clustering.** Groups related issues and PRs using [[Embeddings]] (OpenAI), but clusters are anchored to deterministic GitHub references (linked issues, PRs, commits) so weak similarity scores can't fuse unrelated tickets into mega-clusters.
- **TUI + JSON output.** Keyboard/mouse-driven terminal UI for human triage; structured JSON output for agents and automation pipelines.
## Design choices worth noting
- **Local-first, deliberately CLI-only.** No hosted service, no HTTP API, no web UI. The tool stays small and the data stays on your machine.
- **Drop-in over reinvention.** Rather than building a new triage CLI and asking users to relearn it, gitcrawl wraps `gh`. The cost of adoption is near zero.
- **Hybrid clustering.** Pure embedding clustering tends to glue everything together once a similarity threshold is crossed. Anchoring clusters to explicit GitHub cross-references trades recall for precision — usually the right call in triage, where false positives waste maintainer time.
- **Two layers, one promise.** The triage UI and the API cache could have shipped as separate tools. Bundling them means the cache is always populated by real workflows, not a synthetic sync job.
## Where it fits
For maintainers of busy repos, gitcrawl is the missing layer between `gh` (the command) and a real triage workflow. For agentic systems that read GitHub heavily, it's a transparent rate-limit shield: point your scripts at `gh` as usual, and gitcrawl absorbs the read traffic locally. Same family as [[Discrawl]] (Discord → SQLite + FTS5) — the broader pattern is *vendor chat or vendor issue tracker → local queryable mirror*.
## References
- Project site: <https://gitcrawl.sh/>
- Source: <https://github.com/openclaw/gitcrawl>
- License: MIT
- GitHub CLI: <https://cli.github.com/>
## Related
- [[Discrawl]]
- [[GitHub]]
- [[GitHub CLI]]
- [[SQLite]]
- [[Embeddings]]