*Figure 1: REL benchmarks for chemistry, biology, and algebra.*

*Figure 2: Example questions from REL.*
Authors: Lukas Fesser*, Yasha Ektefaie*, Ada Fang*, Sham M. Kakade, Marinka Zitnik (* indicates equal contribution)
This repo is configured around `uv` and the local helper script `setup_uv_env.sh`:

```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
source setup_uv_env.sh
source "$UV_PROJECT_ENVIRONMENT/bin/activate"
```

If you want the full environment notes, including cache locations and troubleshooting, see UV_SETUP.md.
The questions are provided in `REL/`, or you can download them from Hugging Face:

```sh
hf download ada-f/rel --repo-type dataset --local-dir .
```
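You can also load the questions programmatically. Below is a minimal sketch using the Hugging Face `datasets` library; the split and column names it prints are whatever the dataset card defines, so inspect the output rather than assuming a layout:

```python
from datasets import load_dataset

# Download and load the REL questions from the Hugging Face Hub.
# Inspect the returned DatasetDict to see the available splits and fields.
rel = load_dataset("ada-f/rel")
print(rel)
```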
Run your LLM on the questions and evaluate the responses with the domain evaluators. Examples of how to run frontier LLMs (Claude, Gemini, GPT-5) are provided in `chem_benchmark/llm_runner.py`.
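As an illustration only (this is not the interface of `chem_benchmark/llm_runner.py`), here is a minimal sketch of querying a model over a JSONL file of questions. The file path and the `id`/`question` field names are assumptions for the example, and it uses an OpenAI-compatible client:

```python
import json
from openai import OpenAI  # assumes the openai package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical path and field names; see chem_benchmark/llm_runner.py
# for the repo's actual runner.
responses = []
with open("REL/chemistry/questions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        completion = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": record["question"]}],
        )
        responses.append({
            "id": record["id"],
            "response": completion.choices[0].message.content,
        })

# Save responses for scoring with the domain evaluators.
with open("responses.jsonl", "w") as f:
    for r in responses:
        f.write(json.dumps(r) + "\n")
```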
If you already have model responses and just want scoring, use the domain evaluators directly (a usage sketch follows this list):

- Chemistry: `chem_benchmark.evaluation`
- Biology: `bio_benchmark.evaluation`
- Algebra: `algebra_benchmark.evaluation`
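A minimal sketch of invoking an evaluator from Python, assuming each module exposes a scoring entry point; the `evaluate` function name and its argument here are hypothetical, and docs/EVALUATION.md documents the actual answer formats and commands:

```python
# Hypothetical interface: the real entry point may differ; see docs/EVALUATION.md.
from chem_benchmark import evaluation

# Assumed signature: score saved {"id", "response"} records against the
# reference answers and return per-question scores.
scores = evaluation.evaluate(responses_path="responses.jsonl")
print(scores)
```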
The expected answer formats and minimal examples are in docs/EVALUATION.md.
- docs/EVALUATION.md: scoring details and example commands
- docs/DEVELOPMENT.md: tests and how to generate new benchmark questions
- docs/DATASETS.md: unified dataset format and task layout
```bibtex
@article{fesser2026rel,
  title         = {Evaluating Relational Reasoning in LLMs with REL},
  author        = {Lukas Fesser and Yasha Ektefaie and Ada Fang and Sham M. Kakade and Marinka Zitnik},
  year          = {2026},
  journal       = {arXiv preprint arXiv:2604.12176},
  eprint        = {2604.12176},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2604.12176}
}
```