-
As a former math competition participant, I keep coming back to ChatGPT o3's performance on the American Invitational Mathematics Examination (AIME) and similar benchmarks. Here's why I'm skeptical…
-
First, the details haven't been released, but it looks like the LLM was allowed to submit thousands of answers and check whether each was right. If so, its scores cannot be compared to those of humans, who get one shot.
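
To see how much this matters, here is a minimal sketch in Python with hypothetical numbers (the 1% per-sample accuracy is an assumption for illustration, not a measured figure): a solver that is right only 1% of the time per attempt looks nearly perfect once it can check 1,000 candidates.

```python
# Minimal sketch: how many-attempt scoring inflates results.
# p is a hypothetical chance that a single sampled answer is correct;
# pass@k is the chance that at least one of k checked answers is correct.
def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

p = 0.01  # assumed 1% per-sample accuracy, for illustration only
print(f"pass@1:    {pass_at_k(p, 1):.3f}")     # ~0.010: one shot, like a human
print(f"pass@1000: {pass_at_k(p, 1000):.3f}")  # ~1.000: 1,000 checked tries
```

Under those assumptions, the gap between one-shot and 1,000-attempt scoring is the difference between 1% and near-certainty, which is why the two protocols aren't comparable.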
-
Third, the pattern of compute time suggests o3 is essentially trying thousands of algorithms in search of a solution. Conceptually, this is like memorizing the AIME's back catalog of questions (possible given the training set) and trying each one's approach in turn, as in the toy sketch below.
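
Here is that brute-force strategy as a toy sketch. Everything in it is hypothetical: `known_solutions` stands in for a memorized back catalog, and the only check applied is that AIME answers are integers from 0 to 999.

```python
from typing import Callable, Iterable, Optional

# Hypothetical brute force over a memorized back catalog of solution
# routines. `known_solutions` is a placeholder, not a real library.
def try_back_catalog(
    problem: str,
    known_solutions: Iterable[Callable[[str], Optional[int]]],
) -> Optional[int]:
    for solve in known_solutions:
        try:
            answer = solve(problem)
        except Exception:
            continue  # this memorized approach doesn't apply here
        # The only "check" available: AIME answers are integers 0-999.
        if answer is not None and 0 <= answer <= 999:
            return answer
    return None  # catalog exhausted without an answer
```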
-
Such an effort would be impossible for a human in the time allowed, but I can safely say that no one who takes the exam would be impressed by the mathematical ability of a contestant who tried this strategy and arrived at the correct answer on their 1,759th attempt, days later.
-
We can't be 100% sure this is the mechanism o3 uses, since that isn't being disclosed, but it would be consistent both with the strategy proposed here (redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt) and with OpenAI's own graph of compute time vs. performance.
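
For reference, the core loop of the strategy described in that post reduces to sample-and-filter. The sketch below is a hypothetical reconstruction, not o3's disclosed mechanism; `generate_candidate_program` and `examples` are placeholder names:

```python
from typing import Callable, List, Optional, Tuple

def solve_by_search(
    generate_candidate_program: Callable[[], Callable],
    examples: List[Tuple[object, object]],
    budget: int = 2000,
) -> Optional[Callable]:
    """Return the first sampled program consistent with every known example."""
    for _ in range(budget):
        program = generate_candidate_program()  # e.g., code sampled from a model
        try:
            # Keep a candidate only if it reproduces every demonstrated
            # input/output pair; the "verifier" is just re-running it.
            if all(program(x) == y for x, y in examples):
                return program
        except Exception:
            continue  # discard candidates that crash
    return None  # budget exhausted without a consistent program
```

The point is that the "verifier" is simply re-running candidates against known examples, so a large enough sampling budget substitutes search for insight.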
