Data
Datasets, training data, data engineering, RAG
13 links across all digests
From Week 17, 2026
IEEE Spectrum · 9 min read
IEEE Spectrum walks through the 2026 AI Index — training
compute curves, evaluation saturation, open-weights share, and
a sharp rise in domain-specific benchmarks. The least
breathless read of the Index so far.
From Week 14, 2026
Dropbox Engineering · 12 min read
The Dropbox engineering team walks through how they used the DSPy
framework to systematically optimize prompt-based relevance
judgments for Dash at production scale. Covers the full loop from
metric definition through prompt compilation, with real before-and-
after numbers. The clearest DSPy-in-production case study to date.
From Week 14, 2026
Hugging Face / OpenMed · 10 min read
An end-to-end walkthrough of training transformer-based mRNA
models across 25 organisms in 55 GPU-hours for under $165.
Compares architectures and demonstrates species-conditioned
codon modeling.
From Week 13, 2026
Hugging Face · 10 min read
NVIDIA's pipeline for rapid embedding model fine-tuning using
synthetic data generation, with step-by-step instructions for
adapting general-purpose embeddings to specialized domains.
From Week 13, 2026
Answer.AI analyzed PyPI data to test whether AI tools are actually
boosting software production and found no obvious increase in
package creation post-ChatGPT. The only measurable effect is a
concentrated surge in updates to AI-related packages themselves —
likely driven by funding cycles rather than universal productivity
gains. It's the most data-grounded skepticism of AI productivity
claims published this year.
From Week 12, 2026
Forge lets enterprises build frontier-grade models trained on their own
data, positioned as an alternative to fine-tuning and RAG for
domain-specific performance.
From Week 12, 2026
Zak El Fassi · 8 min read
A developer experiment in which an AI agent iteratively redesigned its
own memory system, improving recall accuracy from 60% to 93% for
about two dollars in API costs.
From Week 12, 2026
Andrej Karpathy · 5 min read
An interactive tool displaying growth projections, compensation,
and AI exposure metrics for hundreds of US occupations. Built by
Karpathy, it's the kind of side project that reframes how you
think about career risk.
From Week 11, 2026
Andreessen Horowitz · 10 min read
The sixth edition of a16z's consumer AI ranking reshuffles
the leaderboard. ChatGPT still dominates, but vertical apps
are climbing fast and retention patterns are shifting.
From Week 11, 2026
Search Engine Land · 6 min read
A new study quantifies AI assistants' share of global search
volume, with mobile dominance accelerating the shift away
from traditional search engines.
From Week 10, 2026
A practitioner had Claude read and categorize every AI safety paper published since 2020, producing a curated, searchable database for navigating the field's rapidly expanding literature.
From Week 10, 2026
Fortune reports on Anthropic's landmark research introducing "observed exposure" as a new metric for AI labor impact. Computer programmers show 74.5% theoretical exposure but only 33% actual usage, and employment data shows no effect on unemployment so far, though hiring of young workers in exposed occupations is slowing.
From Week 10, 2026
Apoorv Jain · 10 min read
An independent analysis finds that AI apps have crossed 1B weekly users, with ChatGPT holding 900M, but questions whether usage patterns reflect genuine habit formation or novelty-driven adoption.