CarwilBJ's avatarCarwilBJ's Twitter Archive—№ 38,169

      1. As a former math competition participant, I keep reviewing the ChatGPT o3 performance on the American Invitational Math Exam (AIME) and similar benchmarks. Here's why I'm skeptical…
    1. …in reply to @CarwilBJ
      First, full details haven't been released, but it looks like the LLM was allowed to submit thousands of answers and check whether they were right. If so, its scores cannot be compared to those of humans, who get one shot.
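A rough sketch of why many submissions inflate scores: if a model gets k independent attempts and any correct answer counts, effective accuracy climbs quickly with k. The per-attempt success rate below is illustrative, not a real o3 figure.

```python
# Illustrative only: "pass@k" vs. one-shot accuracy when a model may
# submit many candidate answers and keep any that checks out.
# p is a hypothetical per-attempt success probability, not an o3 number.

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

p = 0.05  # assumed 5% chance of success per single attempt
for k in (1, 10, 100, 1000):
    print(f"k={k:>4}: pass@k = {pass_at_k(p, k):.3f}")
```

Even a weak 5%-per-attempt solver looks near-perfect once it is allowed a thousand graded tries, which is why one-shot human scores and many-shot machine scores aren't comparable.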
  1. …in reply to @CarwilBJ
    Second, some problems on the AIME are designed to elicit conceptual solutions from the humans taking the test, but they might be solved by brute-force counting by ChatGPT o3. That approach would violate the rules of the AIME, which forbid calculators and computers.
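To make the brute-force point concrete, here is a hypothetical AIME-style counting problem (not an actual exam question) that a program settles by enumeration in milliseconds, sidestepping the inclusion-exclusion argument the problem is designed to test: "How many integers from 1 to 1000 are divisible by 3 or 5, but not both?"

```python
# Brute-force enumeration of a hypothetical AIME-style counting problem.
# A human contestant would reason: |div by 3| + |div by 5| - 2*|div by 15|
# = 333 + 200 - 2*66 = 401. A computer just checks every case.
count = sum(
    1 for n in range(1, 1001)
    if (n % 3 == 0) != (n % 5 == 0)  # XOR: divisible by exactly one
)
print(count)  # 401
```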
    1. …in reply to @CarwilBJ
      Third, the pattern of compute time suggests o3 is essentially trying thousands of algorithms in search of a solution. Conceptually, this is like memorizing the AIME's back catalog of questions (possible given the training set) and trying each one.
      1. …in reply to @CarwilBJ
        Such an effort would be impossible for humans in the time allowed, but I can safely say that no one who takes the exam would be impressed by the mathematical ability of a human who tried this strategy and came up with the correct answer on their 1,759th try days later.
        1. …in reply to @CarwilBJ
          We can't be 100% sure this is the mechanism o3 used, since that isn't being disclosed, but it would be consistent with the strategy proposed here... redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt as is OpenAI's graph of compute time vs. performance.
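The linked Redwood Research post describes sampling thousands of candidate programs from a model and keeping only those that reproduce the known examples. A minimal sketch of that generate-and-filter loop, where `sample_program` is a purely hypothetical stand-in for an LLM call:

```python
# Minimal sketch of the generate-and-filter strategy from the linked post:
# sample many candidate programs, keep those consistent with the worked
# examples, then take a majority vote over their answers.
import random
from collections import Counter

def sample_program():
    """Hypothetical stand-in for sampling a candidate solver from a model."""
    k = random.randint(-2, 2)
    return lambda x, k=k: x + k  # each candidate guesses the rule "add k"

def solve(train_pairs, test_input, n_samples=1000):
    answers = Counter()
    for _ in range(n_samples):
        prog = sample_program()
        # Filter: discard candidates that fail any worked example.
        if all(prog(x) == y for x, y in train_pairs):
            answers[prog(test_input)] += 1
    return answers.most_common(1)[0][0] if answers else None

random.seed(0)  # for reproducibility of the toy run
# Toy task: the hidden rule is "add 2", so the filter keeps only k=2.
print(solve([(1, 3), (5, 7)], 10))
```

Thousands of graded tries make this loop effective, which is the thread's point: success here measures search budget and an answer checker, not one-shot mathematical insight.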