Data

Datasets, training data, data engineering, RAG

13 links across all digests

From Week 17, 2026

IEEE Spectrum · 9 min read

Stanford's AI Index 2026 — inside the numbers

IEEE Spectrum walks through the 2026 AI Index — training compute curves, evaluation saturation, open-weights share, and a sharp rise in domain-specific benchmarks. The least breathless read of the Index so far.

Research Safety Data

From Week 14, 2026

Dropbox Engineering · 12 min read

Optimizing Dropbox Dash's relevance judge with DSPy

The Dropbox engineering team walks through how they used the DSPy framework to systematically optimize prompt-based relevance judgments for Dash at production scale. Covers the full loop from metric definition through prompt compilation, with real before-and-after numbers. The clearest DSPy-in-production case study to date.

Dev Data Workflow

From Week 14, 2026

Hugging Face / OpenMed · 10 min read

Training mRNA language models across 25 species for $165

An end-to-end walkthrough of training transformer-based mRNA models across 25 organisms in 55 GPU-hours for under $165. Compares architectures and demonstrates species-conditioned codon modeling.

Research Data

From Week 13, 2026

Hugging Face · 10 min read

Domain-specific embedding fine-tuning with synthetic data

NVIDIA's pipeline for rapid embedding model fine-tuning using synthetic data generation, with step-by-step instructions for adapting general-purpose embeddings to specialized domains.

Data Models

From Week 13, 2026

Answer.AI · 12 min read

So where are all the AI apps?

Answer.AI analyzed PyPI data to test whether AI tools are actually boosting software production and found no obvious increase in package creation post-ChatGPT. The only measurable effect is a concentrated surge in updates to AI-related packages themselves — likely driven by funding cycles rather than universal productivity gains. It's the most data-grounded skepticism of AI productivity claims published this year.

Dev Data

From Week 12, 2026

Mistral AI · 5 min read

Mistral launches Forge for enterprise model training

Forge lets enterprises build frontier-grade models trained on their own data, positioned as an alternative to fine-tuning and RAG for domain-specific performance.

Models Infra Data

From Week 12, 2026

Zak El Fassi · 8 min read

How do you want to remember? An AI agent self-optimizes its memory

A developer experiment where an AI agent iteratively redesigned its own memory system, improving recall accuracy from 60% to 93% for about two dollars in API costs.

Agents Data

From Week 12, 2026

Andrej Karpathy · 5 min read

US Job Market Visualizer — AI exposure across 342 occupations

An interactive tool displaying growth projections, compensation, and AI exposure metrics for hundreds of US occupations. Built by Karpathy, it's the kind of side project that reframes how you think about career risk.

Data Workflow

From Week 11, 2026

Andreessen Horowitz · 10 min read

The Top 100 Gen AI Consumer Apps (6th edition)

The sixth edition of a16z's consumer AI ranking reshuffles the leaderboard. ChatGPT still dominates, but vertical apps are climbing fast and retention patterns are shifting.

Models Data

From Week 11, 2026

Search Engine Land · 6 min read

AI assistants now equal 56% of global search engine volume

A new study quantifies AI assistants' share of global search volume, with mobile dominance accelerating the shift away from traditional search engines.

Data Tools

From Week 10, 2026

LessWrong · 8 min read

A Searchable Database of Nearly 4,000 AI Safety Papers Built With Claude

A practitioner had Claude read and categorize every AI safety paper published since 2020, producing a curated, searchable database for navigating the field's rapidly expanding literature.

Safety Data Research

From Week 10, 2026

Fortune · 8 min read

AI Labor Market Impacts: Actual Displacement Remains Limited Despite High Theoretical Exposure

Fortune reports on Anthropic's landmark research introducing "observed exposure" as a new metric for AI labor impact. Computer programmers show 74.5% theoretical exposure but only 33% actual usage, and employment data shows no unemployment impact yet — though hiring for young workers in exposed occupations is slowing.

Data Research

From Week 10, 2026

Apoorv Jain · 10 min read

The State of Consumer AI: 1 Billion Weekly Users, But Is It Real Habit Formation?

An independent analysis shows AI apps have crossed 1B weekly users, with ChatGPT holding 900M, but questions whether usage patterns reflect genuine habit formation or novelty-driven adoption.

Data