Clarify-Then-Search

Clarify-Then-Search: A Clarification Benchmark for Deep Search with End-to-End Nugget Restoration

Deqiang Huang¹, Jingbo Zhou², Xinjiang Lu², Tong Xu¹, Hua Wu², Enhong Chen¹
¹University of Science and Technology of China    ²Baidu Inc.
🎉 Accepted to KDD 2026 Datasets & Benchmarks Track
2026-05-17: Paper accepted! Clarify-Then-Search introduces a closed-book, leakage-resistant protocol and an end-to-end nugget restoration metric for evaluating clarification in deep search. The Hard518 dataset, prompts, scripts, and evaluation artifacts are publicly released below.
Dataset size: 518 queries
Core metric: restore_score_100
Gold format: weighted nuggets
Licenses: MIT · CC BY 4.0

Framework Overview

Closed-book clarify-then-search pipeline
[Figure: Clarify-Then-Search framework overview]
Clarify-Then-Search evaluates clarification under a leakage-resistant closed-book protocol. Each benchmark instance contains a clear intent query and a blurred underspecified query. The Clarifier only sees the blurred query and asks clarification questions; the User Answerer answers strictly from the hidden intent; and the Rewriter uses only the observed Q&A pairs to construct a retrieval-ready query. This design prevents the Rewriter from directly accessing oracle intent while preserving an end-to-end deep-search evaluation. We run WebDancer on the rewritten query and measure utility by how well the final answer restores static, evidence-grounded golden nuggets.

Quickstart

Unzip, set your judge endpoint (Qianfan OpenAI-compatible), and run evaluation. The judge computes nugget coverage (full/partial/none) and outputs per-item and summary stats.

# 1) unzip
unzip clarify-then-search-518-release.zip
cd release

# 2) env (Qianfan OpenAI-compatible)
export QIANFAN_API_KEY="YOUR_KEY"
# optional:
export QIANFAN_BASE_URL="https://qianfan.baidubce.com/v2"
export EVAL_MODEL_NAME="ernie-4.5-turbo-128k"

# 3) run one candidate
python ./code/eval_gold.py \
  --gold_a data/gold_public_hard518.jsonl \
  --gold_b results/candidates/ebk1__cand_hard518.jsonl \
  --out_dir outputs/ebk1
Outputs: outputs/*/per_item.jsonl and outputs/*/summary.json
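
Once a run finishes, the two output files can be inspected with a few lines of Python. The following is a minimal sketch, not part of the release: `load_outputs` is a hypothetical helper, and it assumes only that `summary.json` holds a single JSON value and that `per_item.jsonl` holds one JSON object per line. The field names inside these files are not documented on this page, so the sketch does not rely on any.

```python
import json
from pathlib import Path


def load_outputs(out_dir):
    """Load summary.json and per_item.jsonl from one evaluation output directory.

    Hypothetical helper: only the two file names come from the release notes;
    nothing is assumed about the fields stored inside them.
    """
    out = Path(out_dir)
    summary = json.loads((out / "summary.json").read_text(encoding="utf-8"))
    per_item = []
    with (out / "per_item.jsonl").open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                per_item.append(json.loads(line))
    return summary, per_item
```

For the run above, `summary, per_item = load_outputs("outputs/ebk1")` would return the parsed summary and the list of per-item records, which can then be explored interactively.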

What’s inside the release

Core artifacts needed to evaluate any model against the static golden nuggets.

data/hard_518_queries.csv
data/gold_public_hard518.jsonl
results/candidates/*.jsonl
results/clarify_only/*.csv
code/eval_gold.py
release/
  data/
    hard_518_queries.csv
    gold_public_hard518.jsonl
  results/
    candidates/
    clarify_only/
  code/
    eval_gold.py
    make_candidate.py
    prepare_public_eval.py

Paper

KDD 2026 D&B Track

Clarify-Then-Search: A Clarification Benchmark for Deep Search with End-to-End Nugget Restoration evaluates whether LLM clarification improves downstream deep search under a closed-book protocol.


Dataset

Hard518 subset

The benchmark contains 518 information-seeking queries selected to be underspecified, so that clarification is expected to provide high utility. Each instance has an intent query (fused_query) and an underspecified query (blurred_query).

Under a closed-book protocol, a system observes only blurred_query, asks k clarification questions, receives constrained answers, and rewrites the query to q̂. It is then evaluated by running a fixed deep-search backend on q̂ and scoring nugget restoration against a static gold reference built from fused_query.
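
The information flow of this protocol can be sketched in Python. This is an illustration under stated assumptions, not the released implementation: `clarify_then_rewrite` and the three callables `ask`, `answer`, and `rewrite` are hypothetical names, questions are assumed to be asked one turn at a time with earlier answers visible, and the rewriter is assumed to receive the blurred query together with the Q&A pairs. The one property taken directly from the protocol is that only the answerer ever reads fused_query.

```python
from typing import Callable, List, Tuple

QAPairs = List[Tuple[str, str]]


def clarify_then_rewrite(
    blurred_query: str,
    fused_query: str,
    ask: Callable[[str, QAPairs], str],      # hypothetical Clarifier
    answer: Callable[[str, str], str],       # hypothetical User Answerer
    rewrite: Callable[[str, QAPairs], str],  # hypothetical Rewriter
    k: int = 1,
) -> str:
    """Run k clarification turns and return the rewritten query."""
    qa_pairs: QAPairs = []
    for _ in range(k):
        # The Clarifier sees only the blurred query and the dialogue so far.
        question = ask(blurred_query, qa_pairs)
        # Only the User Answerer may read the hidden intent (fused_query).
        reply = answer(fused_query, question)
        qa_pairs.append((question, reply))
    # The Rewriter never receives fused_query directly.
    return rewrite(blurred_query, qa_pairs)
```

Keeping fused_query out of the `ask` and `rewrite` signatures makes the leakage boundary visible in the code: any intent information in the rewritten query has to arrive through an answered question.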

Released fields: fused_query, blurred_query
No “gold clarifications”: clarification is a system output

Why no supervised “gold” clarification labels? We evaluate whether the model asks for information that is answerable under the hidden intent and useful for downstream deep search, avoiding a single canonical clarification target.

Static Golden Nuggets & Evaluation

LLM-judge coverage

We provide a static golden reference per intent query (JSONL). Each gold record contains weighted nuggets and traceability fields. At evaluation time, a candidate answer is scored by weighted nugget recall with partial credit: full=1, partial=0.5, none=0.

restore_score_100 = 100 * ( sum_j w_j * s(cov_j) ) / ( sum_j w_j )
s(full)=1, s(partial)=0.5, s(none)=0

Only gold.nuggets are required by eval_gold.py for scoring; other fields are included for debugging and analysis.
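
As a worked example of the formula, the following minimal sketch computes restore_score_100 for one item. It is an illustration rather than the code in eval_gold.py: the function signature and the representation of nuggets as parallel lists of weights and coverage labels are assumptions, while the credit values (full=1, partial=0.5, none=0) come from the definition above.

```python
# Credit per coverage label, as defined by s(full)=1, s(partial)=0.5, s(none)=0.
COVERAGE_CREDIT = {"full": 1.0, "partial": 0.5, "none": 0.0}


def restore_score_100(weights, coverages):
    """Weighted nugget recall with partial credit, scaled to 0..100.

    weights:   nugget weights w_j (assumed positive numbers)
    coverages: judged labels cov_j, each "full", "partial", or "none"
    """
    if len(weights) != len(coverages):
        raise ValueError("weights and coverages must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("the nugget weights must sum to a positive value")
    earned = sum(w * COVERAGE_CREDIT[c] for w, c in zip(weights, coverages))
    return 100.0 * earned / total
```

For three nuggets with weights 2, 1, 1 judged full, partial, none, the score is 100 * (2 + 0.5 + 0) / 4 = 62.5.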

Baseline Results

restore_score_100

Below are compact summary tables for the one-turn and three-turn clarification results reported in the paper. At k=1, GPT-5.2 achieves the highest mean and median; at k=3, ERNIE-4.5-Turbo-128K achieves the highest mean, median, and p90.

One-turn clarification results (k=1)

System                     k   mean     p50      p90
orig                       –   19.434   20.454   35.294
Qwen3-235B-A22B-Instruct   1   22.527   19.565   47.211
ERNIE-4.5-Turbo-128K       1   23.460   20.000   50.000
DeepSeek-3.2               1   23.234   21.053   48.416
Kimi-K2-Instruct           1   22.687   20.000   46.236
GPT-5.2                    1   26.474   25.000   50.000
Claude-Sonnet-4.5          1   25.911   23.333   52.996
Gemini-2.5-Pro             1   25.874   21.429   52.996

Three-turn clarification results (k=3)

System                     k   mean     p50      p90
orig                       –   19.434   20.454   35.294
Qwen3-235B-A22B-Instruct   3   26.235   22.997   51.744
ERNIE-4.5-Turbo-128K       3   28.339   26.087   57.143
DeepSeek-3.2               3   26.486   25.000   52.424
Kimi-K2-Instruct           3   26.568   23.509   56.310
GPT-5.2                    3   26.297   23.385   54.407
Claude-Sonnet-4.5          3   27.029   25.000   53.329
Gemini-2.5-Pro             3   27.126   24.194   52.628

The three-turn setting highlights that larger clarification budgets reward sustained question-selection quality: ERNIE-4.5-Turbo-128K obtains the highest mean, median, and p90 scores at k=3.
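
To read the two tables side by side, the mean gains over the orig row can be derived with a short script. The sketch below only restates the reported means and subtracts them; `mean_gains` is a hypothetical helper, and treating orig as the reference point for "gain" is an interpretation of the tables rather than a definition given on this page.

```python
# Mean restore_score_100 values copied from the tables above.
ORIG_MEAN = 19.434
MEANS = {  # system -> (mean at k=1, mean at k=3)
    "Qwen3-235B-A22B-Instruct": (22.527, 26.235),
    "ERNIE-4.5-Turbo-128K": (23.460, 28.339),
    "DeepSeek-3.2": (23.234, 26.486),
    "Kimi-K2-Instruct": (22.687, 26.568),
    "GPT-5.2": (26.474, 26.297),
    "Claude-Sonnet-4.5": (25.911, 27.029),
    "Gemini-2.5-Pro": (25.874, 27.126),
}


def mean_gains(orig_mean, means_by_system):
    """Return per-system differences in mean score, rounded to 3 decimals."""
    rows = {}
    for system, (k1, k3) in means_by_system.items():
        rows[system] = {
            "gain_k1": round(k1 - orig_mean, 3),   # k=1 mean minus orig mean
            "gain_k3": round(k3 - orig_mean, 3),   # k=3 mean minus orig mean
            "k3_minus_k1": round(k3 - k1, 3),      # effect of two extra turns
        }
    return rows
```

Calling `mean_gains(ORIG_MEAN, MEANS)` shows, for example, that ERNIE-4.5-Turbo-128K gains 8.905 points over orig at k=3, while the GPT-5.2 mean is 0.177 points lower at k=3 than at k=1.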

Citation

BibTeX

Please cite our paper if you use the benchmark, code, or evaluation artifacts.

@inproceedings{huang2026clarifythensearch,
  title     = {Clarify-Then-Search: A Clarification Benchmark for Deep Search with End-to-End Nugget Restoration},
  author    = {Huang, Deqiang and Zhou, Jingbo and Lu, Xinjiang and Xu, Tong and Wu, Hua and Chen, Enhong},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  year      = {2026}
}

License

MIT + CC BY 4.0

Code: MIT License (see LICENSE)

Data: CC BY 4.0 (see DATA_LICENSE)

This release includes model-generated outputs and automatically judged scores. Provided answers may contain errors.

© Clarify-Then-Search · Hard518 release