Research
6,600+ citations, h-index 23. My research spans evaluation methodology, efficient inference, embedding models, and retrieval-augmented generation. The connecting thread is making information systems smarter, faster, and more trustworthy. Google Scholar.
Research Themes
Evaluation & Benchmarks
How do you know a search or RAG system is actually working? I've spent a decade building the evaluation infrastructure the field uses to answer this question.
- MS MARCO (2016–present) — Co-created the benchmark that defined neural information retrieval. 3,400+ citations, used by 10,000+ researchers. Designed to distinguish genuine reading comprehension from shallow pattern matching.
- TREC RAG Track (2024–2026) — Co-organizer. Defining how retrieval-augmented generation systems should be evaluated: nugget-based fact extraction, support assessment comparing human vs. LLM judges, and the AutoNuggetizer framework for automated scoring.
- TREC Deep Learning Track (2018–2023) — Co-organizer. Built reusable large-scale test collections for neural retrieval. 900+ citations collectively across track overview papers.
- TREC Product Search Track (2023–2025) — Principal coordinator. Benchmarking end-to-end product retrieval.
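The nugget-based evaluation used in the RAG Track can be approximated in miniature: score a system answer by the fraction of gold "nugget" facts it supports. The word-overlap heuristic and function name below are purely illustrative; the actual AutoNuggetizer pipeline uses LLM judges for nugget extraction and support assessment.

```python
def nugget_recall(answer: str, nuggets: list[str]) -> float:
    """Fraction of gold nuggets supported by the answer.

    Toy heuristic: a nugget counts as supported when all of its
    words appear in the answer. The real pipeline replaces this
    check with an LLM judge.
    """
    answer_words = set(answer.lower().split())
    supported = sum(
        1 for n in nuggets
        if set(n.lower().split()) <= answer_words
    )
    return supported / len(nuggets) if nuggets else 0.0

answer = "the eiffel tower is in paris and opened in 1889"
nuggets = ["eiffel tower opened in 1889", "eiffel tower is in paris"]
print(nugget_recall(answer, nuggets))  # → 1.0
```

Swapping the set-containment check for a judge call is the only structural change needed to go from this sketch to an automated nugget scorer.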
Efficient Inference & Model Compression
Making language models fast and cheap enough to deploy at web scale. My work on sparsity, pruning, and distillation predates the current wave of interest in efficient LLM inference.
- The Optimal BERT Surgeon (EMNLP 2022, 214 citations) — Scalable second-order pruning for LLMs: removes most parameters using curvature (Hessian) information with minimal accuracy loss.
- Sparse*BERT (ICML 2022 Workshop) — Demonstrated that sparse models generalize robustly to new tasks and domains.
- oBERTa (SustaiNLP @ ACL 2023) — Improved sparse transfer learning through better initialization, distillation, and pruning regimes.
- STUN (ACL 2025) — Structured-then-unstructured pruning for Mixture-of-Experts models.
- SuffixDecoding (NeurIPS 2025) — Extreme speculative decoding for emerging AI applications. Model-free approach to LLM inference acceleration.
- Curriculum Learning for Language Modeling (2019–2020) — Explored data ordering strategies for language model training, predating widespread adoption of curriculum learning in LLM training pipelines.
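Second-order pruning of the kind The Optimal BERT Surgeon scales up can be sketched with a diagonal Hessian approximation: rank each weight by its estimated loss increase if removed, w_i² H_ii / 2, rather than by raw magnitude, then zero the lowest-saliency entries. This is a toy illustration under a diagonal-curvature assumption, not the paper's blocked, scalable implementation.

```python
import numpy as np

def obs_prune(weights: np.ndarray, hess_diag: np.ndarray,
              sparsity: float) -> np.ndarray:
    """Zero the fraction `sparsity` of weights with the lowest
    second-order saliency w_i^2 * H_ii / 2, the diagonal-OBS
    estimate of the loss increase from removing weight i."""
    saliency = weights**2 * hess_diag / 2
    k = int(sparsity * weights.size)
    idx = np.argsort(saliency.ravel())[:k]  # least important first
    pruned = weights.copy().ravel()
    pruned[idx] = 0.0
    return pruned.reshape(weights.shape)

w = np.array([1.0, 0.1, -2.0, 0.5])
h = np.ones(4)
print(obs_prune(w, h, 0.5))  # → [ 1.  0. -2.  0.]
```

The contrast with magnitude pruning is the `hess_diag` factor: a small weight sitting in a high-curvature direction can matter more than a large weight in a flat one.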
Embedding Models & Retrieval
Training and deploying embedding models that make retrieval work for everyone. All models are released under Apache 2.0 and have reached millions of monthly downloads.
- Arctic-Embed (2024, 85 citations) — Open-source embedding models that matched the best proprietary alternatives, trained on just two H100 nodes over a few weeks. Contrastive learning, hard-negative mining, matryoshka dimension reduction. Models on HuggingFace.
- Arctic-Embed 2.0 (2024, 67 citations) — Multilingual retrieval without compromise. Models on HuggingFace.
- CAPOT (2023) — Contrastive Alignment Post Training. Improved dense retrieval robustness on noisy queries by 55% without retraining or re-indexing.
- KALE (SustaiNLP @ ACL 2023) — Post-training KL alignment for asymmetric bi-encoders. 4x throughput with minimal accuracy loss.
- Dense Sparse Retrieval (2023) — Using sparse language models for inference-efficient dense retrieval.
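The matryoshka dimension reduction mentioned under Arctic-Embed means the model is trained so that a prefix of the embedding vector is itself a usable embedding: at inference you truncate and renormalize, trading a little accuracy for smaller indexes. A generic sketch of that inference-time step (not Arctic-Embed's own code):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and renormalize to unit
    length, so cosine similarity still behaves at the smaller size."""
    sub = vec[:dim]
    return sub / np.linalg.norm(sub)

full = np.random.default_rng(0).standard_normal(1024)
small = truncate_embedding(full, 256)
print(small.shape)  # → (256,)
```

A 256-dimensional prefix of a 1024-dimensional vector cuts index storage and similarity-computation cost by 4x, which is the practical payoff of training with the matryoshka objective.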
RAG & Agentic Systems
From one-shot answers to agents that continuously search, explore, and accumulate understanding over time.
- Ragnarok (ECIR 2025) — Reusable RAG framework and baselines for the TREC RAG Track.
- The Great Nugget Recall (SIGIR 2025) — Automating fact extraction and RAG evaluation with LLMs.
- Inference Scaling for Bridging Retrieval and Augmented Generation (NAACL 2025) — How inference-time compute scaling improves RAG quality.
- EnronQA (2025) — Towards personalized RAG over private documents.
- Zipf AI (2025–present) — Production system with closed-loop self-healing agents, information-gain scoring, and preference-based search optimization. Eight potentially patentable inventions.
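Inference-time scaling of the kind studied in the NAACL 2025 paper above can be shown in its simplest form: spend more compute by sampling several retrieve-then-generate passes and keeping the majority answer. The `run_rag` callable here is a stand-in for a full pipeline, and majority voting is just one way to aggregate; neither is claimed to be the paper's method.

```python
from collections import Counter
from typing import Callable

def scaled_rag_answer(
    question: str,
    run_rag: Callable[[str], str],  # one retrieve-then-generate pass
    n_samples: int = 5,
) -> str:
    """Majority vote over repeated RAG passes: a minimal form of
    inference-time compute scaling, where more samples buy more
    reliability at proportionally higher cost."""
    answers = [run_rag(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Because each pass can retrieve different evidence and sample a different generation, the vote smooths over both retrieval misses and generation noise.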
Community