Introduction
Recently, Apple researcher Iman Mirzadeh and colleagues published a paper titled “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” questioning the reasoning capabilities of large language models, including OpenAI’s o1.
What is GSM-Symbolic?
OpenAI introduced GSM8K (Grade School Math 8K), a dataset of elementary math word problems that became a popular benchmark for evaluating LLM mathematical reasoning. Although it contains simple problems with detailed solutions well suited to techniques like chain-of-thought (CoT) prompting, it yields only a single accuracy score on a fixed set of questions. The paper’s authors proposed GSM-Symbolic, an enhanced benchmark that generates diverse variations of GSM8K problems from symbolic templates. This lets researchers evaluate LLM performance across many controlled settings, more precisely and reliably than a single accuracy metric allows.
In short, GSM8K is a fixed exam paper of chicken-and-rabbit-style problems on which an LLM sits the test and is scored. GSM-Symbolic is a more comprehensive version of that exam, with richer question variations, more complete examination standards, and more rigorous grading.
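The idea of symbolic templates can be illustrated with a small sketch. This is not the paper’s actual template format or variable ranges, just a hypothetical example of how one fixed question becomes a family of consistent variants by sampling names and numbers into placeholders:

```python
import random

# Hypothetical template (placeholder wording, not taken from GSM-Symbolic itself).
TEMPLATE = ("{a} and {b} are in a {place}, with {heads} heads and {feet} feet. "
            "How many {a} and {b} are there?")

def make_variant(rng: random.Random) -> str:
    """Instantiate the template with random but mutually consistent values."""
    a, b = rng.choice([("chickens", "rabbits"), ("ducks", "dogs")])
    place = rng.choice(["cage", "room", "barn"])
    n_two = rng.randint(1, 50)    # count of two-legged animals
    n_four = rng.randint(1, 50)   # count of four-legged animals
    # Heads and feet are derived from the sampled counts, so every
    # generated problem is guaranteed to have a valid solution.
    heads = n_two + n_four
    feet = 2 * n_two + 4 * n_four
    return TEMPLATE.format(a=a, b=b, place=place, heads=heads, feet=feet)

print(make_variant(random.Random(0)))
```

Because the numbers are derived rather than sampled independently, every instance remains solvable, and a model can be scored across the whole family instead of on one memorizable question.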
What Does the Paper Question?
The authors tested large models using GSM-Symbolic to evaluate their performance on variations of the same problem.
The Chicken-and-Rabbit Problem
Ancient Chinese mathematical texts describe the chicken-and-rabbit problem. In the “Sunzi Suanjing” it appears as the “pheasants and rabbits in the same cage” problem. The original reads: a cage holds pheasants and rabbits, with 35 heads counted above and 94 feet below; how many of each are there? In modern terms: chickens and rabbits share a cage, with 35 heads and 94 feet in total. How many chickens and how many rabbits are there?
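The problem reduces to a two-equation linear system: chickens + rabbits = heads, and 2·chickens + 4·rabbits = feet. A minimal Python sketch of the elimination step (function name is my own):

```python
def solve_heads_feet(heads: int, feet: int) -> tuple[int, int]:
    """Solve chickens + rabbits = heads, 2*chickens + 4*rabbits = feet.

    If every animal were a chicken there would be 2*heads feet; each
    rabbit adds 2 extra feet, so rabbits = (feet - 2*heads) / 2.
    """
    rabbits, remainder = divmod(feet - 2 * heads, 2)
    if remainder != 0 or rabbits < 0 or rabbits > heads:
        raise ValueError("no non-negative integer solution")
    return heads - rabbits, rabbits

print(solve_heads_feet(35, 94))  # (23, 12): 23 chickens, 12 rabbits
```

Every variant of the problem merely substitutes different heads-and-feet values into this same computation, which is why a model with genuine reasoning should handle all of them alike.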
Returning to the question, what did Iman Mirzadeh’s paper find? His tests showed that GPT could answer “chickens and rabbits in a cage with 35 heads and 94 feet” well, but its performance dropped when the problem was superficially altered while remaining mathematically identical.
Variations of the chicken-and-rabbit problem:
- Chickens and rabbits in a cage with 100 heads and 260 feet. How many chickens and rabbits are there? (changed numbers)
- Chickens and rabbits in a room, observed to have 100 heads and 260 feet. How many chickens and rabbits are there? (changed description)
- Adding extra, irrelevant descriptive text
- Replacing the names of the subjects in the question
These variations differ on the surface but are essentially the same problem. Yet GPT’s accuracy varies significantly across them, which raises a question: if GPT had genuine reasoning ability, it should handle these variations as consistently as a human would.
Therefore, Iman Mirzadeh questions whether GPT truly possesses reasoning and computational capabilities, or merely sophisticated pattern-matching abilities.
Philosophical and Ethical Implications
OpenAI: “Just like Newton established classical mechanics and Einstein developed relativity, OpenAI has reached the summit and seen the magnificent view. ChatGPT is currently the best paradigm.”
Apple: “Why does my experiment suggest your paradigm might be flawed? Are you not at the summit? If I follow your path, will I fall into a pit? Do you want to prevent me from succeeding?”
Impact on Ordinary Users
Although the issues mentioned in the paper do exist, ChatGPT has already impressed many in practical use. This critique may help us better utilize GPT.
Optimize your prompt wording: describe the problem precisely and concisely, and avoid unnecessary words that may interfere with the model’s answer…
