I’ve been traveling a bunch and skipped a regularly scheduled post, so this is a three-week roundup. In that time, 23 news organizations created or opened public repositories on GitHub.
Note: I’m becoming more interested in tracking signals of AI coding-agents in newsroom repos, and will be noting interesting cases for a while.
Highlights
The Washington Post released data behind its investigation into widespread use of force at U.S. Immigration and Customs Enforcement detention centers. The Post obtained nearly 1,500 use-of-force incident reports dated January 2024 to February 2026, extracting dates, detainee counts, facility names, and pepper-spray deployments. The data backs Post reporting published May 4. The reporters, Andrew Ba Tran and Douglas MacMillan, used Google Pinpoint to help identify reports where a detainee may have been injured, then verified each injury description manually.
The Guardian published records covering immigration-court filings from fiscal year 2022 through August 2025. The data, published separately as a Google Drive link, was released after the newsroom sued the Department of Homeland Security. It covers demographics, apprehension details, criminal history, and minor-child counts, with personally identifiable information removed. The Guardian, ProPublica, and Oregon Public Broadcasting have already published reporting that draws on it. Stories are listed in the repo’s README.
Decoherence Media released three repositories backing epstein.photos, its public archive of Jeffrey Epstein court documents and photographs, including a 19-step Python pipeline that uses AWS Rekognition to detect faces in document images, clusters them by person, and builds a network of co-appearances.
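For readers curious what the last step of such a pipeline involves, here is a minimal sketch of building a co-appearance network once faces have already been detected and clustered. It is not Decoherence Media's code: the `faces_by_image` input, the cluster labels, and the function name are all hypothetical, and the face detection and clustering steps are assumed to have happened upstream.

```python
from collections import Counter
from itertools import combinations


def co_appearance_edges(faces_by_image):
    """Count how often each pair of person clusters appears in the same image.

    faces_by_image maps an image name to the list of person-cluster labels
    detected in it (hypothetical structure, assumed for this sketch).
    """
    edges = Counter()
    for cluster_ids in faces_by_image.values():
        # Deduplicate and sort so (a, b) and (b, a) count as the same edge.
        for pair in combinations(sorted(set(cluster_ids)), 2):
            edges[pair] += 1
    return edges


# Hypothetical example data: three images and the clusters found in each.
faces_by_image = {
    "img_001.jpg": ["person_a", "person_b"],
    "img_002.jpg": ["person_a", "person_b", "person_c"],
    "img_003.jpg": ["person_c"],
}
edges = co_appearance_edges(faces_by_image)
```

Each key in `edges` is a pair of clusters and each value is the number of images they share, which is the edge list a network graph needs.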
DW (Deutsche Welle) published the Jupyter notebooks and clean datasets behind its analysis of the 2026 Reporters Without Borders World Press Freedom Index. The repo documents 13 years of rankings (2013–2026) and merges in data on country demographics and development indicators. The newsroom published a story based on the data on April 30.
CalMatters released two datasets: The first tracks California’s $3.8 billion Homekey program, which converts hotels and other buildings into housing for people experiencing homelessness. The reporters, Lauren Hepler and Marisa Kendall, published a story about the data.
The second is the analysis pipeline behind its examination of Latino voter shifts on 2025’s Proposition 50, combining statewide precinct results with Census voting-age population through geographic interpolation. The Prop 50 repo ships an AGENTS.md. The data was used in a story about Latino voters in California.
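Geographic interpolation can take several forms, and the repo's README is the place to check which one CalMatters used. As a rough illustration only, the sketch below shows simple area-weighted interpolation: each Census unit's voting-age population is split across precincts in proportion to how much of the unit overlaps each precinct. The names `block_vap` and `overlap_share` and all the numbers are made up, and the overlap shares are assumed to be precomputed by a GIS tool.

```python
def interpolate_to_precincts(block_vap, overlap_share):
    """Allocate Census-unit population to precincts by overlap share.

    block_vap: Census unit -> voting-age population (hypothetical values).
    overlap_share: (Census unit, precinct) -> fraction of the unit that
        falls inside that precinct. Shares for one unit should sum to 1.
    """
    precinct_vap = {}
    for (block, precinct), share in overlap_share.items():
        allocated = block_vap[block] * share
        precinct_vap[precinct] = precinct_vap.get(precinct, 0.0) + allocated
    return precinct_vap


# Hypothetical example: block_1 straddles two precincts, block_2 sits in one.
block_vap = {"block_1": 1000, "block_2": 400}
overlap_share = {
    ("block_1", "precinct_A"): 0.75,
    ("block_1", "precinct_B"): 0.25,
    ("block_2", "precinct_B"): 1.0,
}
precinct_vap = interpolate_to_precincts(block_vap, overlap_share)
```

Area weighting assumes people are spread evenly within each Census unit, which is why real analyses often use small units or population-weighted variants.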
BTW, a big shout-out to CalMatters for structuring their README files so it’s easy to see where their data is used!
Lighthouse Reports published an analysis by Purity Mukami and Gabriel Geiger of Kenya’s Proxy Means Testing system for public healthcare eligibility. Using household survey data from thousands of Kenyan families, the project trains machine-learning models to predict household consumption and tests the current system’s fairness across urban/rural, gender, education, and county lines.
Bellingcat released a browser-based OSINT tool for finding satellite imagery of specific ships. Users upload a CSV of Automatic Identification System (AIS) ship-position records, and the tool cross-references those positions and timestamps against Sentinel and Landsat satellite archives and displays them on an interactive map.
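The core matching idea can be sketched in a few lines. This is not Bellingcat's implementation: the `Scene` structure, the bounding-box footprint, and the 30-minute tolerance are assumptions made for illustration, and real satellite footprints are polygons rather than simple boxes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Scene:
    """A satellite scene reduced to a bounding box and acquisition time."""
    scene_id: str
    west: float
    south: float
    east: float
    north: float
    acquired: datetime


def matching_scenes(lat, lon, seen_at, scenes, tolerance=timedelta(minutes=30)):
    """Return IDs of scenes that cover the position near the AIS timestamp.

    Does not handle footprints that cross the antimeridian.
    """
    return [
        s.scene_id
        for s in scenes
        if s.west <= lon <= s.east
        and s.south <= lat <= s.north
        and abs(s.acquired - seen_at) <= tolerance
    ]


# Hypothetical scenes and one hypothetical AIS position report.
scenes = [
    Scene("scene_1", 10.0, 50.0, 12.0, 52.0, datetime(2026, 1, 5, 10, 20)),
    Scene("scene_2", 10.0, 50.0, 12.0, 52.0, datetime(2026, 1, 6, 10, 20)),
    Scene("scene_3", 20.0, 50.0, 22.0, 52.0, datetime(2026, 1, 5, 10, 20)),
]
hits = matching_scenes(51.0, 11.0, datetime(2026, 1, 5, 10, 0), scenes)
```

Here only `scene_1` matches: `scene_2` covers the right area a day late, and `scene_3` was taken at the right time over a different area.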
Thomson Reuters released Claude Forge, a research-preview CLI that wraps Claude Code and adds persistent sessions, multi-model routing, and autonomous verification. Users can run it in place of the claude command and route to different model providers per session, for example GPT for planning and Gemini for review. The repo ships a CLAUDE.md and AGENTS.md.
ICIJ released Kuroi, a tool that uses large language models — Anthropic’s Claude by default — to identify and redact sensitive information from PDF documents. It includes a verification workflow that lets editors review, diff, and undo redactions before finalizing a sanitized document, making it suitable for safely publishing leaked records where personal data must be removed first.
Buried Signals released two AI-for-journalism projects. Mycroft is a Goose extension pack with curated workflows for investigative journalism. Scoutpost is an open-source local-news monitoring platform that lets journalists create automated “scouts” that track pages, search queries, social profiles, and data APIs for changes, then send email notifications when criteria are met. Both ship agent-memory docs and a skills/ directory.
By the Numbers
Beyond new repos, 86 news organizations made a combined 5,688 public commits to GitHub during this period. The most active by commit count (excluding, as best we can, commits done by bots, gh-actions, or cron):
| Organization | Commits |
|---|---|
| The Guardian | 1,519 |
| ICIJ | 723 |
| Freedom of the Press Foundation | 661 |
| OpenSanctions | 249 |
| Spotlight PA | 170 |
| PRX | 138 |
| MuckRock | 127 |
| Bloomberg | 121 |
| Buried Signals | 103 |
| OpenNews | 94 |
Data comes from the Open Journalism Bot, which monitors ~360 news organization GitHub accounts. Follow @openjournalism.bsky.social for real-time alerts. Commit counts shown here exclude commits we identified as automated (gh-actions, scrapers, dependabot, etc.).