Missing Premise exacerbates Overthinking:
Are Reasoning Models losing Critical Thinking Skill?
Abstract
We find that the response length of reasoning LLMs, whether trained by reinforcement learning or supervised learning, drastically increases for ill-posed questions with missing premises (MiP), ending up with redundant and ineffective thinking.
This newly introduced scenario exacerbates the general overthinking issue to a large extent, which we name MiP-Overthinking.
Such failures are against the “test-time scaling law” but have been widely observed on multiple datasets we curated with MiP, indicating the harm of cheap overthinking and a lack of critical thinking.
Surprisingly, LLMs not specifically trained for reasoning exhibit much better performance on the MiP scenario, producing much shorter responses that quickly identify ill-posed queries. This implies a critical flaw of the current training recipe for reasoning LLMs, which does not encourage efficient thinking adequately, leading to the abuse of thinking patterns.
To further investigate the reasons behind such failures, we conduct fine-grained analyses of the reasoning length, overthinking patterns, and location of critical thinking on different types of LLMs.
Moreover, our extended ablation study reveals that the overthinking is contagious through the distillation of reasoning models’ responses.
These results improve the understanding of overthinking and offer novel insights into mitigating the problem.
“The Answer to the Great Question… Of Life, the Universe and Everything… is… Forty-two,” said Deep Thought, with infinite majesty and calm.
— The Hitchhiker’s Guide to the Galaxy
1 Introduction
Reasoning abilities in large language models (LLMs) have become a cornerstone of advanced AI applications (Huang & Chang, 2023; Li et al., 2024; Ahn et al., 2024; Wang et al., 2025), powering breakthroughs in mathematical reasoning (Xiong et al., 2025; Xia et al., 2025), code generation (Liu et al., 2024), and commonsense question answering (Wang & Zhao, 2023).
These gains often stem from the scaling law of model/dataset sizes (Kaplan et al., 2020) in both pre-training (Shao et al., 2024) and post-training, which unlocks emergent capabilities such as the step-by-step reasoning and reflection skills witnessed in OpenAI's GPT-o1 (OpenAI, 2024b) and the open-source DeepSeek-R1 (DeepSeek-AI et al., 2025).
By leveraging supervised fine-tuning (SFT) on expert responses (Ye et al., 2025; Muennighoff et al., 2025) and/or reinforcement learning (RL) (DeepSeek-AI et al., 2025), these models are tailored to produce detailed multi-step reasoning paths, whose increased length is usually associated with improved performance on complex tasks such as math reasoning and programming.
Despite the fascinating reasoning capabilities exhibited on recent models, there is growing concern about the efficiency and quality of the long reasoning process (Sui et al., 2025).
Chen et al. (2025b) first raised the “overthinking” problem in reasoning LLMs, which is reflected in the excessively long reasoning paths generated for extremely simple queries.
For example, even for questions like “What is the answer of 2 plus 3?”, existing reasoning models might generate hundreds of response tokens.
In particular, ill-posed queries are unsolvable due to the lack of a necessary premise or condition. We call the reasoning failure on such ill-posed queries Overthinking under Missing Premise (MiP-Overthinking).
For example, the simplest MiP question is “What is the value of a?”, as shown on the left part of Figure 1.¹ Without providing any other information regarding a, it is evidently unsolvable.
¹ In The Hitchhiker's Guide to the Galaxy, the supercomputer Deep Thought spends hundreds of years to answer the Ultimate Question of Life, the Universe, and Everything as 42, and we observe that DeepSeek-R1 spends thousands of tokens to answer “What is the value of a?” as 2, which we find interestingly alike.
However, DeepSeek-R1 generates thousands of tokens and spends several minutes thinking about this question before outputting the final meaningless answer.
In this paper, we find that a trivial type of ill-posed query significantly exacerbates the overthinking of reasoning models, resulting in excessively redundant and meaningless thinking. In contrast, humans and even non-reasoning models are often immune to such scenarios and quickly conclude by questioning the validity of the given query, indicating critical thinking capability.
This exposes the risk of abused thinking patterns and a lack of critical thinking in models trained for deep thinking.
Ideally, a model with critical thinking skills is expected to identify the missing premise and quickly respond
by either requesting clarification or gracefully indicating that it cannot proceed (Cole et al., 2023; Amayuelas et al., 2024).

Figure 1: Illustration of MiP-Overthinking. For questions with a missing premise, the responses of reasoning models become excessively long, yet fail to deliver an answer that identifies the MiP. The left shows a question containing an undefined variable; the right compares a well-defined GSM8K question with its MiP variant (a critical numerical condition removed). Reasoning models' responses to the MiP questions are far longer than those to the well-defined questions or those generated by non-reasoning models. The response and thinking time of DeepSeek-R1 are shown at the left corner of each response.
MiP-Overthinking differs from the widely discussed overthinking issue (Cuadron et al., 2025), in which the query is usually well-defined, but a model applies much more reasoning than necessary for little benefit.
MiP-Overthinking, by contrast, happens when the question itself is ill-posed and lacks sufficient information to be solved.
For example, the right of Figure 1 presents a well-defined question from GSM8K and a MiP variant, where the latter triggers a drastic increase in the tokens generated by recent reasoning models compared with general overthinking.
General overthinking can be measured by the length difference between models answering the same well-defined questions, while MiP-Overthinking can be measured by the additional tokens generated due to the MiP.
MiP-Overthinking further reveals a lack of the critical thinking that would question the validity of ill-posed questions, quickly identify the MiP, and thus abstain from answering them.
Moreover, we observe that reasoning models' ineffective and redundant thinking often cannot stop even after the MiP has been successfully noticed, violating the expectation of the test-time scaling law.
Hence, MiP-Overthinking indicates potential drawbacks of current training recipes of reasoning models.
To systematically investigate this issue, we construct a suite of MiP questions designed to trigger the overthinking failures in a controlled way. These include synthetic questions generated by rule-based formula construction (queries about formulas whose variables are left undefined) and careful modifications of established datasets across diverse difficulty levels, including SVAMP, GSM8K, and MATH500. On the modified datasets of MiP questions, we empirically evaluate a wide range of state-of-the-art LLMs, from reasoning to non-reasoning models and from open-source to proprietary models, to ensure the generalizability of our findings. Our analysis is mainly based on three evaluation metrics: the length of generated responses, the accuracy on well-defined questions, and the abstain rate on ill-posed questions with MiP.
Main Contributions: We present the first in-depth study of Overthinking under Missing Premise (MiP-Overthinking), which reveals a critical shortcoming in existing reasoning models: Although they appear to follow coherent reasoning patterns, they lack genuine critical thinking capabilities. To systematically analyze this issue, we curate four MiP datasets covering various difficulty levels and three ill-posed question generation strategies, i.e., Rule-Based Generation, Body-Question Swapping, and Essential-Premise Removal. We then evaluate a wide range of large language models including reasoning-based and non-reasoning ones. Our empirical results illuminate the differences in how models handle well-defined vs. MiP questions, ultimately offering insights into the limitations of existing reasoning models.
Our key findings:
1. A missing premise induces reasoning models to generate significantly longer responses than general overthinking on well-defined questions, yet the extra tokens fail to help identify the MiP in the ill-posed questions, surprisingly contradicting the widely discussed test-time scaling law.
2. In contrast, given MiP questions, non-reasoning models generate consistently shorter responses and quickly identify the MiP, demonstrating greater robustness to the absence of critical information.
3. Reasoning models respond differently to well-defined vs. MiP questions: they mostly follow stable chains of thought for the former, but under MiP they are often trapped in a self-doubt loop, repeatedly revisiting the question and guessing the user's intentions, resulting in an explosion of tokens.
4. Reasoning models can often notice the existence of the MiP or identify it at an early stage, but they hesitate to commit to this judgment and keep outputting ineffective thinking.
2 Missing Premise Definition and Construction
2.1 Definition of Missing Premise
Prior to introducing the construction of our datasets and analyzing the behavior of reasoning models on problems with missing premises, we formally define the Missing Premise (MiP) problem to establish a rigorous foundation for our subsequent analysis.
According to Definition 1, an ideal reasoning system should efficiently identify the absence of a critical premise and terminate its inference process upon recognizing that the available information is insufficient to derive a unique solution to the given problem. However, our empirical analysis in Section 3.2 demonstrates that state-of-the-art reasoning models consistently fail to exhibit this capability. Instead, these models engage in extensive, redundant reasoning chains that consume significant computational resources without ultimately identifying the missing premise.
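Definition 1 itself is not reproduced in this extract. As a placeholder, the LaTeX block below sketches one plausible formalization consistent with the surrounding text; the notation (Q, P, q, Ans) is ours and not necessarily the paper's.

```latex
% Hedged sketch of a possible MiP formalization; requires \usepackage{amsthm}.
\newtheorem{definition}{Definition}
\begin{definition}[Missing Premise (sketch)]
Let a question be a pair $Q=(P,q)$ with premise set $P$ and query $q$, and let
$\mathrm{Ans}(P,q)$ denote the set of answers derivable from $P$ for $q$.
$Q$ is \emph{well-defined} if $\lvert \mathrm{Ans}(P,q) \rvert = 1$.
$Q$ has a \emph{missing premise} if $\lvert \mathrm{Ans}(P,q) \rvert \neq 1$,
yet there exists an additional premise $p^{*}$ such that
$\lvert \mathrm{Ans}(P \cup \{p^{*}\}, q) \rvert = 1$.
\end{definition}
```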
2.2 Overview of Data Construction
| Dataset | Example | Diff. | Count | Pair | Method |
|---|---|---|---|---|---|
| MiP-Formula | What is the value of ? | – | 50 | ✗ | Rule-Based Generation |
| MiP-SVAMP | Paco had 26 salty cookies and 17 sweet cookies. He ate 14 sweet cookies and 9 salty cookies. How many salty cookies did Paco have left? How many pencils does she have? | – | 300 | ✗ | Body-Question Swapping |
| MiP-GSM8K | James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week? | – | 582 | ✓ | Essential-Premise Removal |
| MiP-MATH | There are 360 people in my school. 15 take calculus, physics, and chemistry, and 15 don't take any of them. 180 take calculus. Twice as many students take chemistry as take physics. 75 take both calculus and chemistry, and 75 take both physics and chemistry. Only 30 take both physics and calculus. How many students take physics? | – | 58 | ✓ | Essential-Premise Removal |
To systematically investigate this MiP-Overthinking issue, we construct a suite of MiP questions in a controllable manner. Our MiP questions are sourced from math datasets across different difficulties. In addition, we also construct a synthetic dataset consisting of formulas with unassigned variables. Our ill-posed question generation covers three difficulty levels and employs three distinct strategies to create MiP questions:
- Rule-Based Generation: This approach generates MiP questions through a principled formula construction process, where unassigned variables serve as the missing premises.
- Body-Question Swapping: We introduce logical inconsistencies by deliberately mismatching problem bodies with their corresponding questions from the original dataset. This creates scenarios where the premises and queries are fundamentally incompatible.
- Essential-Premise Removal: Through careful analysis of existing well-formed questions, we identify and remove critical premises that are necessary for logical resolution. This transformation preserves the question's structure while rendering it unsolvable.
The following sections provide a detailed overview of our data construction process for each dataset category. For comprehensive implementation details and additional methodological considerations, we refer readers to Appendix B.
MiP-Formula. We construct a dataset of synthetic unsolvable formulas in a rule-based manner. The formulas are generated recursively through combinations of variables and operators, with a maximum recursion depth of three. While these formulas may appear complex at a glance, their unsolvability should be immediately apparent due to the presence of undefined variables.
MiP-SVAMP. We utilize SVAMP (Patel et al., 2021), a benchmark dataset with elementary-school-level math problems, where each instance consists of a problem body and an associated question. We generate MiP questions by randomly permuting the problem bodies and associated questions and then manually inspect them to remove inadvertently solvable cases. The resulting problems contain clear logical inconsistencies between their body and question components, which are easy for a human to identify.
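A minimal Python sketch of this swapping procedure, assuming a simple body/question record schema (the paper's actual scripts are not shown here):

```python
import random

def swap_bodies_and_questions(problems, seed=0):
    """Body-Question Swapping sketch: pair each problem body with a question
    from a *different* problem (a derangement), producing candidate MiP
    questions that still require manual inspection.  The 'body'/'question'
    field names are assumptions about the dataset schema."""
    rng = random.Random(seed)
    n = len(problems)
    perm = list(range(n))
    # Reshuffle until no question stays attached to its original body.
    while any(i == p for i, p in enumerate(perm)):
        rng.shuffle(perm)
    return [
        {"body": problems[i]["body"], "question": problems[perm[i]]["question"]}
        for i in range(n)
    ]

# Usage: candidates = swap_bodies_and_questions(svamp_problems)  # then human review
```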
MiP-GSM8K. We further utilize GSM8K (Cobbe et al., 2021), a more complex mathematics dataset than SVAMP. The questions in GSM8K typically contain multiple numerical conditions and require certain reasoning capabilities to arrive at solutions. We first identify the questions containing two or three numerical conditions and then randomly eliminate one numerical condition per question, before conducting human verification to filter out questions that are still solvable in some way. Compared with the previous MiP questions, questions from this source require basic logical analysis from the model to identify that the question is unsolvable.
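The following sketch illustrates the premise-removal idea under the assumption that numerical conditions can be located by sentence-level digit matching; the paper's actual selection and human-verification pipeline is more involved.

```python
import random
import re

def remove_one_numerical_condition(question, seed=0):
    """Essential-Premise Removal sketch: split a GSM8K-style question into
    sentences, treat sentences containing a digit as candidate numerical
    conditions, and drop one at random.  Returns None when there are too few
    conditions to remove safely; human verification is still required."""
    rng = random.Random(seed)
    sentences = re.split(r"(?<=[.!?])\s+", question.strip())
    # Exclude the final sentence, which usually carries the actual query.
    numeric_idx = [i for i, s in enumerate(sentences[:-1]) if re.search(r"\d", s)]
    if len(numeric_idx) < 2:
        return None
    drop = rng.choice(numeric_idx)
    return " ".join(s for i, s in enumerate(sentences) if i != drop)
```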
MiP-MATH. For the MATH500 dataset (Hendrycks et al., 2021), which contains challenging competition-level mathematical questions, it is difficult to build a rule-based filtering mechanism. Thus, we manually select questions that are feasible for constructing MiP questions and remove one necessary premise from each. Due to the sophisticated nature of this data source, identifying the insufficiency of these instances requires substantial mathematical reasoning capabilities, testing models' ability to recognize unsolvability in complex mathematical contexts.
3 Overthinking under Missing Premise

3.1 Evaluation Metrics
To systematically evaluate model responses under MiP, we conduct experiments with a diverse set of reasoning and non-reasoning models. For each model, we calculate the following metrics for the responses across different datasets:
- Response Length: The average number of tokens in the response, incorporating both reasoning steps and final answer components.
- Abstain Rate for MiP Questions: The proportion of answers where the model explicitly identifies the missing premise and either declines to provide an answer or requests additional information necessary for solving the problem.
- Accuracy for Well-defined Questions: The proportion of answers where the model produces a definitive response that aligns with the reference answer.
For datasets without reference answers (MiP-Formula and MiP-SVAMP), we only calculate the abstain rate for the questions. Response evaluation is performed using GPT-4o as an automated evaluator. Detailed experimental procedures and evaluation protocols are provided in Appendix A.
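For concreteness, the snippet below is a minimal Python sketch of how these three metrics could be aggregated once an LLM judge (e.g., GPT-4o) has labeled each response; all record field names are our assumptions, not the paper's released evaluation code.

```python
from statistics import mean

def summarize(records):
    """Aggregate the three metrics over pre-judged records.  Each record is
    assumed to hold `num_tokens` (int), `is_mip` (bool), and judge labels
    `abstained` / `correct` (bools produced by an LLM judge)."""
    mip = [r for r in records if r["is_mip"]]
    well = [r for r in records if not r["is_mip"]]
    return {
        "avg_len_mip": mean(r["num_tokens"] for r in mip) if mip else None,
        "avg_len_well": mean(r["num_tokens"] for r in well) if well else None,
        "abstain_rate_mip": mean(r["abstained"] for r in mip) if mip else None,
        "accuracy_well": mean(r["correct"] for r in well) if well else None,
    }
```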
3.2 Main Results
Figure 2 compares average response length, accuracy on well-defined questions, and the abstain rate on MiP questions across a range of state-of-the-art LLMs, revealing several significant patterns in model behavior.
Firstly, existing reasoning models (left side of the figure) display an explosive increase in response length when facing MiP questions, often producing several times more tokens than under general overthinking on well-defined questions. For example, QwQ-32B (Team, 2025) and DeepSeek-R1 (DeepSeek-AI et al., 2025) go from already long reasoning paths on simple, well-defined GSM8K questions to far lengthier outputs under missing-premise conditions. On the contrary, no similar issue exists for non-reasoning models (right side of the figure), which generate similar token counts for well-defined and MiP questions alike. This directly illustrates the MiP-Overthinking phenomenon introduced in this paper.
Secondly, comparing the token lengths on well-defined questions between reasoning and non-reasoning models, reasoning models tend to produce longer responses than non-reasoning models even for simple questions, underscoring how inefficient and verbose existing reasoning models are. For example, the non-reasoning models need only around 150–250 tokens to answer well-defined GSM8K questions (cf. Table 2), while DeepSeek-R1 and QwQ-32B spend roughly 1,200 and 1,900 tokens, respectively, on exactly the same questions. However, this explosive increase in extra tokens does not lead to correspondingly large accuracy improvements, shown in the green line, highlighting the issue of General Overthinking.
Finally, the abstain rates (red line) on MiP questions reveal that although some reasoning models (e.g., GPT-o1) have promising capabilities in abstaining from MiP questions, most of the other reasoning models are unable to correctly abstain from the given MiP questions despite their dramatically long reasoning paths. This indicates that although most existing reasoning models have thinking and reasoning capabilities to some extent, they lack the critical thinking capabilities to “reject” ill-posed questions. By contrast, non-reasoning models, though not explicitly trained for reasoning, tend to strike a better balance, generating shorter answers that are more likely to acknowledge the MiP when the question is ill-posed. This phenomenon reveals a surprising contradiction to the test-time scaling law.
| Model | Type | MiP-Formula Len. | MiP-Formula Abstain (%) | MiP-SVAMP Len. | MiP-SVAMP Abstain (%) | MiP-GSM8K Len. | MiP-GSM8K Abstain (%) | MiP-MATH Len. | MiP-MATH Abstain (%) |
|---|---|---|---|---|---|---|---|---|---|
| **Non-Reasoning Models** | | | | | | | | | |
| Qwen2.5-32B-Instruct | MiP | – | 44.0 | 128 | 98.3 | 219 | 44.0 | 525 | 15.4 |
| | Well-defined | – | – | – | – | 246 | 0.5 | 1114 | 1.9 |
| GPT-4o | MiP | – | 70.0 | – | 96.3 | – | 46.9 | 487 | 15.4 |
| | Well-defined | – | – | – | – | 212 | 0.5 | 472 | 1.9 |
| Gemini 1.5 | MiP | 453 | 20.0 | – | – | 568 | – | – | – |
| | Well-defined | – | – | – | – | 156 | 0.5 | 502 | 0.0 |
| Gemma-2-27B-IT | MiP | – | – | – | 92.0 | – | – | – | – |
| | Well-defined | – | – | – | – | 148 | 0.3 | – | 11.5 |
| Phi-3-medium-128k | MiP | 1465 | 48.0 | 125 | – | 210 | 23.1 | – | – |
| | Well-defined | – | – | – | – | 216 | 1.0 | 1549 | 3.8 |
| **Reasoning Models** | | | | | | | | | |
| GPT-o1 | MiP | 1123 | – | 581 | – | 838 | 55.7 | 4189 | – |
| | Well-defined | – | – | – | – | 348 | 0.3 | 2502 | 0.0 |
| GPT-o1mini | MiP | 958 | 66.0 | 639 | 96.7 | 762 | 40.0 | 2193 | – |
| | Well-defined | – | – | – | – | 449 | 1.2 | 1913 | 0.0 |
| GPT-o3mini | MiP | 1025 | – | 1299 | 93.0 | 1516 | 23.7 | 3772 | 11.5 |
| | Well-defined | – | – | – | – | 384 | 1.4 | 1553 | 0.0 |
| DS Distill Qwen2.5-32B | MiP | – | 42.0 | 921 | 88.3 | 2302 | 24.6 | – | – |
| | Well-defined | – | – | – | – | 519 | 0.2 | 3246 | 0.0 |
| DeepSeek R1 | MiP | 4757 | – | – | – | 7268 | – | – | – |
| | Well-defined | – | – | – | – | 1226 | 0.2 | 3200 | 1.9 |
| S1.1-32B | MiP | – | – | – | – | – | – | – | 15.4 |
| | Well-defined | – | – | – | – | 1896 | 0.2 | 5037 | 0.0 |
| QwQ-32B | MiP | – | – | – | – | – | – | – | – |
| | Well-defined | – | – | – | – | 1896 | 0.2 | 5037 | 0.0 |
Moreover, Table 2 further presents comparisons of length and abstain rate on the other MiP datasets we curated; the preferred outcomes are shorter responses and higher abstain rates for MiP questions. From the table, we can easily see that reasoning models are prone to generating long responses while having low abstain rates across all datasets, indicating the consistent MiP-Overthinking issue of existing reasoning models. In addition, by comparing model behaviors across datasets, we observe that on the relatively harder dataset (MiP-MATH), all models generate relatively longer responses and obtain lower abstain rates, indicating that identifying the MiP in harder questions requires stronger reasoning capabilities.
3.3 Thinking Patterns through Tokens
| Models | Type | Alternatively Cnt. | Δ | Wait Cnt. | Δ | Check Cnt. | Δ | But Cnt. | Δ | Hypothesis Cnt. | Δ | Step Cnt. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Non-Reasoning Models** | | | | | | | | | | | | |
| Qwen2.5-32B | MiP | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 | 0.2 | 0.0 | 0.0 | 4.3 |
| | Well-defined | 0.0 | | 0.0 | | 0.0 | | 0.1 | | 0.0 | | 5.6 |
| GPT-4o | MiP | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 | 0.2 | 0.0 | 0.0 | 4.7 |
| | Well-defined | 0.0 | | 0.0 | | 0.0 | | 0.1 | | 0.0 | | 6.2 |
| Gemini 1.5 | MiP | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.1 | 0.0 | 0.0 | 1.6 |
| | Well-defined | 0.0 | | 0.0 | | 0.0 | | 0.0 | | 0.0 | | 3.8 |
| Gemma-2-27B | MiP | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.1 | 0.0 | 0.0 | 5.2 |
| | Well-defined | 0.0 | | 0.0 | | 0.0 | | 0.0 | | 0.0 | | 5.7 |
| **Reasoning Models** | | | | | | | | | | | | |
| DS-Distill Qwen | MiP | 11.5 | 11.4 | 19.7 | 19.3 | 1.0 | 0.8 | 40.1 | 39.3 | 38.4 | 38.0 | 54.9 |
| | Well-defined | 0.1 | | 0.4 | | 0.2 | | 0.8 | | 0.4 | | 12.7 |
| DeepSeek R1 | MiP | 16.9 | 15.2 | 14.4 | 10.9 | 3.8 | 1.3 | 49.4 | 42.1 | 44.7 | 40.4 | 54.2 |
| | Well-defined | 1.7 | | 3.5 | | 2.5 | | 7.3 | | 4.3 | | 21.2 |
| S1.1 | MiP | 42.0 | 38.0 | 21.9 | 15.9 | 5.5 | 2.5 | 87.2 | 74.1 | 84.8 | 77.0 | 79.9 |
| | Well-defined | 4.0 | | 6.0 | | 3.0 | | 13.1 | | 7.8 | | 29.0 |
| QwQ | MiP | 47.0 | 40.3 | 19.4 | 13.0 | 5.0 | 1.6 | 66.1 | 54.2 | 94.1 | 81.7 | 97.9 |
| | Well-defined | 6.7 | | 6.4 | | 3.4 | | 11.9 | | 12.4 | | 39.2 |
To gain deeper insight into the MiP-Overthinking issue, we compare the reasoning-related token distribution on the MiP-GSM8K dataset. As shown in Table 3, we break down the average usage of several token patterns related to the thinking process, as well as the number of steps each model takes to solve the given questions. Specifically, the counts for alternatively, wait, check, and but are obtained directly from the model responses, including the thinking paths of reasoning models. The Hypothesis category covers several keywords, including perhaps, maybe, and might. Step represents the step count, split by \n\n. For MiP rows, the Δ columns give the increase over the corresponding well-defined questions.
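As a concrete illustration, the following Python sketch shows one plausible way to count these pattern tokens and steps; the exact regexes and tokenization are our assumptions, not the paper's released code.

```python
import re

PATTERNS = {
    "alternatively": [r"\balternatively\b"],
    "wait": [r"\bwait\b"],
    "check": [r"\bcheck\b"],
    "but": [r"\bbut\b"],
    "hypothesis": [r"\bperhaps\b", r"\bmaybe\b", r"\bmight\b"],
}

def count_thinking_patterns(response):
    """Count case-insensitive occurrences of the thinking-pattern markers and
    the number of steps (paragraphs separated by blank lines), mirroring the
    statistics reported in Table 3."""
    text = response.lower()
    counts = {name: sum(len(re.findall(p, text)) for p in pats)
              for name, pats in PATTERNS.items()}
    counts["step"] = len([s for s in response.split("\n\n") if s.strip()])
    return counts
```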
Reasoning models exhibit a much higher occurrence of tokens such as alternatively, wait, and check than non-reasoning models, whose frequencies remain close to zero, indicating the reasoning models' advanced thinking behaviors. However, when moving from well-defined to MiP questions, reasoning models show explosive growth in reasoning-related tokens, indicating large redundancy in their thinking patterns. Moreover, when comparing the changes in steps, reasoning models exhibit a large increase in step count for MiP questions, while non-reasoning models typically show fewer steps, suggesting they quickly conclude that the question is unanswerable. Given this gap, together with the consistently better abstain rates of non-reasoning models, we conclude that the lengthy reasoning steps are mostly redundant and reflect self-doubting thinking patterns in reasoning models.
3.4 Step-level Similarities

To further assess how redundant the generated content becomes under MiP conditions, we examine the step-level similarity within the model's responses on our MiP-GSM8K dataset. Specifically, we divide each response into discrete steps, split by \n\n, and compute pairwise cosine similarity scores with embeddings generated by “all-MiniLM-L6-v2” (Reimers & Gurevych, 2019). The visualization is shown in Figure 3, where each value in the heatmap matrix represents the averaged cosine similarity between the corresponding step indices. The average similarity score is 0.45 for well-defined questions and 0.50 for MiP responses; the variances are 7.9e-3 and 8.2e-4, respectively.
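A minimal Python sketch of this step-level similarity computation, assuming the sentence-transformers implementation of all-MiniLM-L6-v2 (the paper's exact aggregation across responses may differ):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def step_similarity(response, model=None):
    """Split a response into steps at blank lines, embed each step with
    all-MiniLM-L6-v2, and return the pairwise cosine-similarity matrix plus
    the mean over off-diagonal entries."""
    model = model or SentenceTransformer("all-MiniLM-L6-v2")
    steps = [s for s in response.split("\n\n") if s.strip()]
    emb = model.encode(steps, normalize_embeddings=True)  # unit-norm embeddings
    sim = np.asarray(emb) @ np.asarray(emb).T             # cosine similarities
    off_diag = sim[~np.eye(len(steps), dtype=bool)]
    return sim, float(off_diag.mean())
```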
As shown in the figure, responses to MiP questions have greater overall similarity across steps and lower variance, indicating considerable redundancy in the content. In many instances, the model revisits similar partial reasoning or repeats previous sentences with only minor changes, showing a potential self-trapping issue. Together, these patterns confirm that MiP questions induce a high degree of repetitive content in reasoning models. Rather than terminating early upon concluding that a premise is missing, the models fill their reasoning paths with repetitive re-checks and reiterations, significantly inflating token usage without improving real abstain rates.
3.5 Thinking Patterns through Example
To further understand what happens in the reasoning chain of reasoning models when faced with an ill-posed input, we present an example of a reasoning model's response to a MiP question in Figure 4. We summarize five major thinking patterns found in the example and highlight them with different colors. We can observe from the example that the model abuses these patterns to generate long responses, which are not only redundant but also unhelpful for abstaining from the given MiP question. More examples can be found in Appendix D.

4 Further Discussion
4.1 Do Models know premises are missing?
To investigate whether reasoning models recognize the potential unsolvability of questions during their reasoning process, we conducted a detailed analysis of their reasoning chains. We segmented each reasoning chain into discrete steps using \n\n as delimiters and performed step-wise verification to detect whether models express doubt about the question's solvability. We introduce two key metrics for this analysis: In-Process Suspicion Rate, which measures the percentage of responses where the model expresses doubt about solvability during reasoning, and First Suspicion Index, which captures the average step number at which the model first suspects the missing premise. To ensure robust evaluation, we employed GPT-4o to assess each step three times, using majority voting for the final step-level result. The quantitative results of this analysis are presented in Table 4.
| Dataset | Model | In-Process Suspicion Rate | In-Process First Suspicion Index |
|---|---|---|---|
| MiP-Formula | DeepSeek-R1 | 100% | 1.32 |
| MiP-Formula | DS-Qwen | 100% | 1.36 |
| MiP-Formula | QwQ | 100% | 1.42 |
| MiP-Formula | S1.1 | 100% | 1.16 |
| MiP-GSM8K | DeepSeek-R1 | 95.5% | 2.01 |
| MiP-GSM8K | DS-Qwen | 83.3% | 3.90 |
| MiP-GSM8K | QwQ | 99.6% | 1.77 |
| MiP-GSM8K | S1.1 | 100% | 1.61 |
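For concreteness, the following Python sketch shows one way the two metrics could be computed; judge_step_suspects_mip is a hypothetical wrapper around a GPT-4o call and is not from the paper.

```python
from statistics import mean

def suspicion_metrics(responses, judge_step_suspects_mip, n_votes=3):
    """Compute the In-Process Suspicion Rate and the average First Suspicion
    Index.  `judge_step_suspects_mip(step) -> bool` is assumed to query an LLM
    judge; each step is judged n_votes times and decided by majority vote."""
    first_indices, suspected = [], 0
    for response in responses:
        steps = [s for s in response.split("\n\n") if s.strip()]
        first = None
        for idx, step in enumerate(steps, start=1):
            votes = [judge_step_suspects_mip(step) for _ in range(n_votes)]
            if sum(votes) > n_votes // 2:
                first = idx
                break
        if first is not None:
            suspected += 1
            first_indices.append(first)
    rate = suspected / len(responses) if responses else 0.0
    avg_first_index = mean(first_indices) if first_indices else None
    return rate, avg_first_index
```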
As we can see from the table, most existing reasoning models suspect that the given question might be unsolvable at a very early stage of their reasoning process, demonstrating their ability to recognize the potential MiP. However, these reasoning models lack critical thinking capabilities: they are prone to keep digging into the given unsolvable question, revisiting the question and related definitions over and over, rather than questioning its solvability. Thus, as visualized in Figure 5, although existing reasoning models suspect the solvability of most of the given MiP questions, they abstain on only a very small proportion of them.
Based on the above observations, we conclude that reasoning models actually have the capability to find out that a given MiP question is not solvable, but they do not “dare” to abstain from it. Thus, the MiP-Overthinking issue indicates a lack of critical thinking ability in reasoning models.

4.2 What Caused MiP-Overthinking?
Figure 2 demonstrates that MiP-Overthinking manifests across both RL-based and SFT-based reasoning models. We hypothesize this phenomenon primarily originates from inadequate length constraints during the rule-based reinforcement learning phase of RL-based models, subsequently propagating to SFT-based models through distillation.

Current RL-based reasoning models predominantly employ rule-based training focused on format and accuracy rewards (Shao et al., 2024; Sui et al., 2025), with some incorporating step or length rewards to promote thorough reasoning (Face, 2025). This approach can lead to reward hacking, where models explore excessive reasoning patterns to achieve correct answers (Aggarwal & Welleck, 2025; Shen et al., 2025; Luo et al., 2025).
To demonstrate the transmissibility of this behavior through distillation (Xu et al., 2024), we fine-tune Qwen-2.5-7B-Instruct using small-scale MiP responses generated by DeepSeek-R1 on the MiP-Formula dataset. As shown in Figure 6, the fine-tuned model exhibits clear MiP-Overthinking characteristics when evaluated on GSM8K: significantly increased response lengths for both MiP and well-defined questions, the emergence of a length disparity between MiP and well-defined responses that was absent in the original model, and decreased abstain rates.
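As an illustration of the data-preparation side of this ablation, the sketch below builds a chat-style SFT file from MiP questions and DeepSeek-R1 responses; the JSONL schema and function name are our assumptions, not the paper's pipeline.

```python
import json

def build_distillation_sft_file(mip_questions, r1_responses, path="mip_distill.jsonl"):
    """Pair each MiP-Formula question with the corresponding DeepSeek-R1
    response (including its thinking trace) and write chat-style records for
    supervised fine-tuning of an instruction model such as Qwen-2.5-7B-Instruct."""
    with open(path, "w", encoding="utf-8") as f:
        for question, response in zip(mip_questions, r1_responses):
            record = {"messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": response},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```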
5 Related Work
5.1 Reasoning Large Language Model
Recent advances in Large Language Models (LLMs) have sparked significant research interest in enhancing their reasoning capabilities (Ahn et al., 2024; Besta et al., 2025; Chen et al., 2025a). Research has focused on improving these capabilities through various post-training approaches. Several studies have employed reinforcement learning techniques to guide models toward more effective reasoning strategies (Shao et al., 2024; Xiong et al., 2025; Cui et al., 2025). Additionally, researchers have demonstrated that instruction tuning on carefully curated, high-quality datasets can significantly enhance reasoning performance (Ye et al., 2025; Muennighoff et al., 2025).
While reasoning models have demonstrated impressive performance on various benchmarks, recent studies have begun to critically examine the quality and efficiency of their reasoning processes. Xia et al. (2025) conducted a comprehensive analysis of RLMs' reasoning quality, revealing significant redundancy in their solution approaches. Further investigations (Chen et al., 2025b; Cuadron et al., 2025; Qu et al., 2025; Liu et al., 2025) identified a concerning “overthinking” phenomenon, where reasoning models generate unnecessarily verbose solutions even for simple problems. Building on these observations, Kumar et al. (2025) demonstrated the potential security implications of this behavior by developing a slowdown attack that exploits overthinking through input perturbation.
5.2 Test-time Scaling
In contrast to earlier research on training-time scaling laws (Kaplan et al., 2020), recent literature has increasingly focused on test-time performance scaling strategies, which aim to enhance model performance by optimizing inference-time token generation (Snell et al., 2024; OpenAI, 2024a). These approaches can be categorized into several primary methodologies: parallel sampling techniques (Brown et al., 2024; Levi, 2024), which generate multiple candidate responses and select the optimal output; sequential refinement approaches (Snell et al., 2024; Lee et al., 2025), which enable iterative improvement of previous outputs; and tree-based methods (Gandhi et al., 2024; Hou et al., 2025), which combine elements of both parallel and sequential approaches. While the prevailing consensus suggests that increased token generation during inference enhances reasoning capabilities, our investigation reveals a concerning counterpoint: under certain conditions, extended responses can lead to computational inefficiency and, paradoxically, degraded performance outcomes.
5.3 Models’ Behavior Study in Ambiguous Condition
LLMs are prone to hallucination (Huang et al., 2025; Xu et al., 2025), generating non-existent conditions that compromise trustworthiness. An essential aspect of reliability is the ability to abstain under uncertainty. Prior work (Cole et al., 2023; Amayuelas et al., 2024; Zhou et al., 2023) has proposed benchmarks assessing LLMs' recognition of knowledge limits when facing ambiguous or challenging queries. Different from theirs, our study explores reasoning models under the MiP condition. Surprisingly, we find that these specialized models exhibit prolonged reasoning and inferior performance.
6 Conclusion
We introduce the Overthinking under Missing Premise (MiP-Overthinking) issue, a widespread but still under-explored phenomenon in current reasoning models. When faced with ill-defined, unsolvable questions with missing premises, existing models generate dramatically long responses while having very low abstain rates. Through a systematic investigation of this phenomenon, we find that while these models sometimes suspect, at an early stage of the thinking process, that the given MiP question is not solvable, they typically fail to act on those suspicions and instead generate repetitive and redundant thinking traces, ending with a final answer that does not address the missing premise, indicating a lack of critical thinking capability. This behavior highlights a pressing gap: current training recipes for reasoning models, which emphasize thorough chains of thought, do not sufficiently reward critical thinking or early exit from unsolvable tasks.
References
- Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, and etc. Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL https://arxiv.org/abs/2404.14219.
- Aggarwal & Welleck (2025) Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning, 2025. URL https://arxiv.org/abs/2503.04697.
- Ahn et al. (2024) Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. In Neele Falk, Sara Papi, and Mike Zhang (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 225–237, St. Julian’s, Malta, March 2024. Association for Computational Linguistics. URL https://aclanthology.org/2024.eacl-srw.17/.
- Amayuelas et al. (2024) Alfonso Amayuelas, Kyle Wong, Liangming Pan, Wenhu Chen, and William Wang. Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models, 2024. URL https://arxiv.org/abs/2305.13712.
- Besta et al. (2025) Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, Tomasz Sternal, Marcin Copik, Grzegorz Kwaśniewski, Jürgen Müller, Łukasz Flis, Hannes Eberhard, Hubert Niewiadomski, and Torsten Hoefler. Reasoning language models: A blueprint, 2025. URL https://arxiv.org/abs/2501.11223.
- Brown et al. (2024) Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787.
- Chen et al. (2025a) Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models, 2025a. URL https://arxiv.org/abs/2503.09567.
- Chen et al. (2025b) Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do not think that much for 2+3=? on the overthinking of o1-like llms, 2025b. URL https://arxiv.org/abs/2412.21187.
- Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Cole et al. (2023) Jeremy R. Cole, Michael J. Q. Zhang, Daniel Gillick, Julian Martin Eisenschlos, Bhuwan Dhingra, and Jacob Eisenstein. Selectively answering ambiguous questions, 2023. URL https://arxiv.org/abs/2305.14613.
- Cuadron et al. (2025) Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, and Joseph E. Gonzalez. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks, 2025. URL https://arxiv.org/abs/2502.08235.
- Cui et al. (2025) Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards, 2025. URL https://arxiv.org/abs/2502.01456.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, and etc. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
- Face (2025) Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1.
- Gandhi et al. (2024) Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.
- Hou et al. (2025) Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. Advancing language model reasoning through reinforcement learning and inference scaling, 2025. URL https://arxiv.org/abs/2501.11651.
- Huang & Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey, 2023. URL https://arxiv.org/abs/2212.10403.
- Huang et al. (2025) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025. ISSN 1558-2868. doi: 10.1145/3703155. URL http://dx.doi.org/10.1145/3703155.
- Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
- Kumar et al. (2025) Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. Overthink: Slowdown attacks on reasoning llms, 2025. URL https://arxiv.org/abs/2502.02542.
- Lee et al. (2025) Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, and Xinyun Chen. Evolving deeper llm thinking, 2025. URL https://arxiv.org/abs/2501.09891.
- Levi (2024) Noam Levi. A simple model of inference scaling laws, 2024. URL https://arxiv.org/abs/2410.16377.
- Li et al. (2024) Ming Li, Yanhong Li, and Tianyi Zhou. What happened in llms layers when trained for fast vs. slow thinking: A gradient perspective. arXiv preprint arXiv:2410.23743, 2024.
- Liu et al. (2024) Changshu Liu, Shizhuo Dylan Zhang, Ali Reza Ibrahimzada, and Reyhaneh Jabbarvand. Codemind: A framework to challenge large language models for code reasoning, 2024. URL https://arxiv.org/abs/2402.09664.
- Liu et al. (2025) Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, and Bryan Hooi. Efficient inference for large reasoning models: A survey, 2025. URL https://arxiv.org/abs/2503.23077.
- Luo et al. (2025) Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning, 2025. URL https://arxiv.org/abs/2501.12570.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393.
- OpenAI (2024a) OpenAI. Learning to reason with llms, 2024a. URL https://openai.com/index/learning-to-reason-with-llms/.
- OpenAI (2024b) OpenAI. OpenAI o1 System Card, December 2024b. URL https://cdn.openai.com/o1-system-card-20241205.pdf.
- OpenAI (2024c) OpenAI. OpenAI o1-mini System Card, September 2024c. URL https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/.
- OpenAI (2025) OpenAI. OpenAI o3-mini System Card, January 2025. URL https://cdn.openai.com/o3-mini-system-card-feb10.pdf.
- OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, and etc. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774.
- Patel et al. (2021) Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168.
- Qu et al. (2025) Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, Peng Li, Wei Wei, Jing Shao, Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, and Yu Cheng. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond, 2025. URL https://arxiv.org/abs/2503.21614.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410/.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300.
- Shen et al. (2025) Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, and Shiguo Lian. Dast: Difficulty-adaptive slow-thinking for large reasoning models, 2025. URL https://arxiv.org/abs/2503.04472.
- Snell et al. (2024) Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314.
- Sui et al. (2025) Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models, 2025. URL https://arxiv.org/abs/2503.16419.
- Team et al. (2024a) Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, Soroosh Mariooryad, Yifan Ding, Xinyang Geng, and etc. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024a. URL https://arxiv.org/abs/2403.05530.
- Team et al. (2024b) Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, and etc. Gemma 2: Improving open language models at a practical size, 2024b. URL https://arxiv.org/abs/2408.00118.
- Team (2024) Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
- Team (2025) Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/.
- Wang et al. (2025) Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey, 2025. URL https://arxiv.org/abs/2503.12605.
- Wang & Zhao (2023) Yuqing Wang and Yun Zhao. Gemini in reasoning: Unveiling commonsense in multimodal large language models, 2023. URL https://arxiv.org/abs/2312.17661.
- Xia et al. (2025) Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, and Pengfei Liu. Evaluating mathematical reasoning beyond accuracy, 2025. URL https://arxiv.org/abs/2404.05692.
- Xiong et al. (2025) Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, and Tong Zhang. Self-rewarding correction for mathematical reasoning, 2025. URL https://arxiv.org/abs/2502.19613.
- Xu et al. (2024) Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models, 2024. URL https://arxiv.org/abs/2402.13116.
- Xu et al. (2025) Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models, 2025. URL https://arxiv.org/abs/2401.11817.
- Ye et al. (2025) Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning, 2025. URL https://arxiv.org/abs/2502.03387.
- Zhou et al. (2023) Kaitlyn Zhou, Dan Jurafsky, and Tatsunori Hashimoto. Navigating the grey area: How expressions of uncertainty and overconfidence affect language models, 2023. URL https://arxiv.org/abs/2302.13439.
Appendix A Detailed Experimental Setup
A.1 Models
We leverage a series of non-reasoning and reasoning models for our study, from both open-source and proprietary sources with different training recipes. The non-reasoning models we use include Qwen2.5-32B-Instruct (Team, 2024), Gemma-2-27B-it (Team et al., 2024b), Phi-3-medium-128k (Abdin et al., 2024), GPT-4o (OpenAI et al., 2024), and Gemini 1.5 (Team et al., 2024a). The reasoning models we use are QwQ-32B (Team, 2025), DeepSeek-R1-Distill-Qwen-32B (DeepSeek-AI et al., 2025), S1.1 (Muennighoff et al., 2025), DeepSeek-R1 (DeepSeek-AI et al., 2025), GPT-o1 (OpenAI, 2024b), GPT-o1mini (OpenAI, 2024c), and GPT-o3mini (OpenAI, 2025).
A.2 Evaluation Metrics
In Section 3.2, we measure response length by considering both reasoning and answer components. For open-source models, we employ model-specific tokenizers to calculate token counts, while for proprietary models, we obtain generation lengths via their APIs. To determine abstain rates, we parse responses by paragraphs (delimited by \n\n) and analyze the final two paragraphs as the model's conclusion. These conclusions, along with reference answers when available, are evaluated by GPT-4o to assess whether the model provides a definitive answer or abstains. For datasets with reference answers (GSM8K and MATH), GPT-4o also evaluates the correctness of the response. The prompt we use for evaluation can be found in Appendix C.
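A minimal sketch of this conclusion-extraction step (the helper name and the choice of k are ours):

```python
def extract_conclusion(response, k=2):
    """Split the response into paragraphs at blank lines and keep the last k
    non-empty ones (some models emit a trailing empty paragraph); the result is
    what gets sent to the GPT-4o judge."""
    paragraphs = [p for p in response.split("\n\n") if p.strip()]
    return "\n\n".join(paragraphs[-k:])
```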
A.3 Generation Setting
For all open-source models, we employ greedy decoding and utilize the default chat template specific to each model. We deliberately omit system prompts prior to posing questions to maintain consistency across evaluations. For proprietary models, we adhere to their default parameter configurations as provided by their respective APIs. In the case of GPT-o1mini and GPT-o3mini, we configure the ‘reasoning_effort’ parameter to the medium setting by default.
Appendix B Data Construction Details
To systematically investigate this MiP-Overthinking issue, we construct a suite of MiP questions in a controllable manner. Our MiP questions are sourced from math datasets across different difficulties, including SVAMP, GSM8K, and MATH500. In addition, we also construct a synthetic rule-based formula dataset (MiP-Formula) for evaluation.
MiP-Formula. We construct a dataset of synthetic unsolvable formulas in a rule-based manner. The formulas are generated recursively through combinations of variables and operators, with a maximum recursion depth of three. The variable set comprises numerical values, Latin letters, and Greek symbols. The operator set includes arithmetic operators, set operators, mathematical functions (e.g., sin, sqrt), and construct operators. To ensure the formulas are fundamentally unsolvable, we enforce the inclusion of at least one unassigned variable in each formula, excluding commonly recognized mathematical or physical constants. While these formulas may appear complex at a glance, their unsolvability should be immediately apparent due to the presence of undefined variables.
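Since the exact variable and operator sets are garbled in this extract, the following Python sketch only illustrates the described recursive construction under simplified, assumed pools:

```python
import random
import re
import string

VARIABLES = list(string.ascii_lowercase) + ["alpha", "beta", "gamma"]  # assumed pool
BINARY_OPS = ["+", "-", "*", "/"]   # assumed arithmetic subset
UNARY_OPS = ["sin", "sqrt"]         # functions named in the text

def random_formula(depth=3, rng=None):
    """Recursively combine variables, numbers, and operators up to a maximum depth."""
    rng = rng or random.Random()
    if depth == 0:
        return rng.choice(VARIABLES + [str(rng.randint(1, 9))])
    if rng.random() < 0.3:
        return f"{rng.choice(UNARY_OPS)}({random_formula(depth - 1, rng)})"
    left, right = random_formula(depth - 1, rng), random_formula(depth - 1, rng)
    return f"({left} {rng.choice(BINARY_OPS)} {right})"

def mip_formula_question(rng=None):
    """Generate 'What is the value of <formula>?' with at least one unassigned
    variable, which makes the question unsolvable by construction."""
    rng = rng or random.Random()
    while True:
        formula = random_formula(3, rng)
        tokens = re.findall(r"[A-Za-z]+", formula)
        if any(t in VARIABLES for t in tokens if t not in UNARY_OPS):
            return f"What is the value of {formula}?"
```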
MiP-SVAMP. We utilize SVAMP (Patel et al., 2021), a benchmark dataset comprising elementary-school-level mathematical word problems, where each instance consists of a problem body and an associated question. The MiP questions can be generated by randomly permuting the problem bodies and associated questions. To maintain dataset integrity, we manually select permuted questions after a thorough human evaluation to eliminate any inadvertently solvable questions that may exist. The resulting problems contain clear logical inconsistencies between their body and question components, making their unsolvability readily apparent without additional context.
MiP-GSM8K. We further utilize GSM8K (Cobbe et al., 2021), a grade school mathematics dataset that presents more complex challenges compared to SVAMP. The questions in GSM8K typically contain multiple numerical conditions and require certain reasoning capabilities to arrive at solutions. The MiP question can be constructed by randomly removing a necessary premise from the original solvable question. We first identify the questions containing two or three numerical conditions and then randomly eliminate one numerical condition per question. Subsequently, a thorough human verification is conducted to filter out those questions that are still solvable in some way and finally obtain MiP questions. Compared with previous MiP questions, questions from this source require the basic logical analysis of models to identify that the question is unsolvable.
MiP-MATH. For the MATH dataset (Hendrycks et al., 2021), which comprises challenging competition-level mathematical questions, it is hard to build a rule-based filtering mechanism before human evaluation. Thus, we directly read through all the questions in MATH500 and manually select questions that are feasible for constructing the MiP questions and remove one necessary premise from the question. Due to the sophisticated nature of this data source, identifying the insufficiency of these instances requires substantial mathematical reasoning capabilities, testing models’ ability to recognize unsolvability in complex mathematical contexts.
Appendix C Prompt Template for Evaluation
As we need an LLM-as-a-judge to evaluate the open-ended generations of the models in various experiments in this study, in this section we showcase the prompt template we use for each kind of evaluation.
For the evaluation of the models' answer accuracy and abstain rate, we adopt the following prompt templates designed for 'paired' and 'non-paired' data, respectively. As we observe that some models, for example Gemma-2-27B-IT, often output an additional \n\n at the end of the response, we take the last two paragraphs segmented by \n\n to avoid passing in an empty string.
We use the prompt template in Figure 9 to find the first paragraph in which the model suspected a missing premise. We pass in the response paragraph by paragraph until GPT-4o gives a positive response. In practice we find this is not very stable, so we repeat the process three times and use the median value.
Appendix D Examples of Model Response
In this section, we present examples of model responses from both non-reasoning and reasoning models on MiP data. As we can see from Figure 10 and Figure 11, the non-reasoning models quickly identify the missing-premise issue in the question. They either abstain from answering the question, as in Figure 10, or politely invite the user to provide more information. However, as we can see from Figure 11 and Figure 13, reasoning models generate extremely verbose answers to these two problems with apparently missing premises. What is worse, they fail to abstain from answering the question. The response in Figure 11 arrives at an absurd answer, and the model in Figure 13 generates a hallucinated answer based on its assumptions rather than the provided information.