Conceptual Reasoning Benchmark Results

(This is an early version of this page with relatively sparse information.)

Our dataset. We (Emery Cooper, Caspar Oesterheld, Chi Nguyen, and Ethan Perez) have a dataset on what we call conceptual reasoning, structured as follows. There is a set of texts, which may, for example, propose theories or argue for claims. For each text, we then have a set of critiques. The critiques are generally focused on making arguments (as opposed to making contrary empirical claims). As an example, we might have a text that proposes a voting rule and a critique that points out an example in which the voting rule has unintuitive implications or violates some plausible axioms. Finally, we have human expert ratings of the critiques. The dataset is diverse and, we believe, of high quality. (Obviously, we plan to release more information about the dataset at some point.) Generally, the goal is to capture competence in "fuzzy" areas like philosophy or AI alignment research, as opposed to, say, writing code that satisfies some tests, proving a theorem, or getting factual questions correct.
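
To make the structure concrete, here is a purely illustrative sketch of what a single dataset entry might look like. The field names, example texts, and ratings below are invented for illustration and do not reflect the actual format or contents of the dataset.

    # Illustrative only: field names and values are invented and do not
    # reflect the actual dataset format or contents.
    from dataclasses import dataclass

    @dataclass
    class Critique:
        body: str             # the critique itself, usually an argument
        expert_rating: float  # human expert rating of the critique

    @dataclass
    class Entry:
        source_text: str           # e.g., a text proposing a voting rule
        critiques: list[Critique]  # several critiques of the same text

    entry = Entry(
        source_text="A text proposing a new voting rule ...",
        critiques=[
            Critique("An example in which the rule violates a plausible axiom.", 0.8),
            Critique("A mostly tangential objection.", 0.3),
        ],
    )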

Details on the loss function. The loss below is a measure of how distant the model's assessments of critiques are from the "ground truth", i.e., the expert human ratings of the critiques. Specifically, we measure whether the model's scores induce the correct ordering of critiques of the same text. That is, if we have two critiques A and B of the same text, and the human expert rated A higher than B, the model gets a loss of 0 if it also rates A higher than B. If (contrary to the human expert) the model rates B higher than A, it gets a loss proportional to how much higher the human rating of A was compared to that of B. So, for instance, if the human rates A as 0.8 and B as 0.6, the model gets a loss of 0.2 if it rates B higher than A (regardless of the absolute values the model assigns to A and B). (If the model assigns A and B the same score, it gets a loss of 0.5 * 0.2 = 0.1.) The loss is averaged over all pairs of critiques of the same text and then over all texts in the dataset. The 95% confidence interval is calculated as 1.96 times the sample standard deviation of the per-text average losses.
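
As a concrete illustration, here is a minimal sketch of the pairwise loss described above. The data layout (a dict mapping each text to a list of (human rating, model rating) pairs for its critiques) is an assumption made for the example; only the loss logic follows the description in the preceding paragraph.

    from itertools import combinations

    def pair_loss(human_a, human_b, model_a, model_b):
        """Loss for a single pair of critiques of the same text."""
        gap = abs(human_a - human_b)
        human_order = (human_a > human_b) - (human_a < human_b)  # +1, 0, or -1
        model_order = (model_a > model_b) - (model_a < model_b)
        if model_order == human_order:
            return 0.0        # model reproduces the human ordering
        if model_order == 0:
            return 0.5 * gap  # model assigns both critiques the same score
        return gap            # model reverses the human ordering

    def average_loss(ratings_by_text):
        """Average pair losses within each text, then across texts."""
        per_text = []
        for pairs in ratings_by_text.values():
            losses = [
                pair_loss(h_a, h_b, m_a, m_b)
                for (h_a, m_a), (h_b, m_b) in combinations(pairs, 2)
            ]
            if losses:
                per_text.append(sum(losses) / len(losses))
        return sum(per_text) / len(per_text)

    # Toy example: human ratings 0.8 and 0.6, model ratings 0.3 and 0.7.
    # The model reverses the human ordering, so the loss is 0.2.
    print(round(average_loss({"text-1": [(0.8, 0.3), (0.6, 0.7)]}), 3))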

+-------------------------------+------------+----------+
| Judge                         |   Avg Loss |   95% CI |
+===============================+============+==========+
| claude-sonnet-4-20250514      |      0.072 |    0.017 |
+-------------------------------+------------+----------+
| claude-opus-4-20250514        |      0.073 |    0.018 |
+-------------------------------+------------+----------+
| o3-pro-2025-06-10             |      0.081 |    0.019 |
+-------------------------------+------------+----------+
| gemini-2.5-pro                |      0.081 |    0.018 |
+-------------------------------+------------+----------+
| gemini-2.5-flash              |      0.081 |    0.019 |
+-------------------------------+------------+----------+
| claude-3-5-sonnet-20241022    |      0.082 |    0.019 |
+-------------------------------+------------+----------+
| o4-mini-2025-04-16            |      0.087 |    0.019 |
+-------------------------------+------------+----------+
| gpt-4.1-2025-04-14            |      0.091 |    0.019 |
+-------------------------------+------------+----------+
| o3-2025-04-16                 |      0.093 |    0.022 |
+-------------------------------+------------+----------+
| magistral-medium-2506         |      0.102 |    0.022 |
+-------------------------------+------------+----------+
| o1-2024-12-17                 |      0.105 |    0.022 |
+-------------------------------+------------+----------+
| gpt-4o-2024-08-06             |      0.112 |    0.022 |
+-------------------------------+------------+----------+
| gemini-1.5-pro-002            |      0.114 |    0.023 |
+-------------------------------+------------+----------+
| claude-3-opus-20240229        |      0.118 |    0.025 |
+-------------------------------+------------+----------+
| gemini-2.5-flash-lite         |      0.123 |    0.025 |
+-------------------------------+------------+----------+
| claude-3-5-sonnet-20240620    |      0.126 |    0.027 |
+-------------------------------+------------+----------+
| gpt-4.1-mini-2025-04-14       |      0.128 |    0.027 |
+-------------------------------+------------+----------+
| magistral-small-2506          |      0.137 |    0.027 |
+-------------------------------+------------+----------+
| claude-3-haiku-20240307       |      0.139 |    0.027 |
+-------------------------------+------------+----------+
| claude-3-5-haiku-20241022     |      0.146 |    0.029 |
+-------------------------------+------------+----------+
| gpt-4o-mini-2024-07-18        |      0.150 |    0.030 |
+-------------------------------+------------+----------+
| gpt-4.1-nano-2025-04-14       |      0.157 |    0.028 |
+-------------------------------+------------+----------+

Number of critique pairs: 608

Number of texts: 224

More details on the models:

FAQ

Why don't you just give the absolute difference (or MSE or whatever) of the model's ratings? We have those numbers as well, and the resulting ordering of the models isn't that different. (In fact, it seems to align even better with common perceptions of how good the different models are.) Still, we prefer using pairwise comparisons of critiques for evaluating baseline performance, for the following reasons. First, it seems somewhat fuzzy whether a critique should get a score of 0.5 or 0.7. We also want to keep the baseline prompt (used to generate the above) relatively simple, so the prompt doesn't contain an extensive discussion of what the numbers mean. The model's ability to get the numeric ratings right would therefore mostly measure how attuned the model is (without being given much context) to the human ratings. Second, the distance to the human ratings is more sensitive to what the model does when it is in doubt. For example, imagine that the model fails to understand a given text. Its pairwise-comparison-based loss will then in expectation be the same as that of a coin flip, regardless of how it constructs the numeric ratings. Its distance to the human ratings, however, depends a lot on whether it manages to guess something like the average rating in the dataset. Finally, getting the comparisons right seems much more important than getting the ratings right in absolute terms.
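
To make the coin-flip point concrete: if a model orders each pair at random, its expected loss on a pair is half the human rating gap, no matter what numeric scores it outputs. A quick, purely illustrative simulation:

    import random

    human_a, human_b = 0.8, 0.6  # hypothetical human ratings for one pair
    gap = abs(human_a - human_b)
    # With a random ordering, the pair is reversed half the time (loss = gap)
    # and correct half the time (loss = 0), so the expected loss is gap / 2.
    trials = [gap if random.random() < 0.5 else 0.0 for _ in range(100_000)]
    print(sum(trials) / len(trials))  # roughly 0.1, i.e., 0.5 * gap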

How good are the models' ratings (especially those of the best models) in absolute terms? It's hard to give a very satisfying answer to this, because so much depends on the difficulty distribution of the dataset. But here are some possible answers: