CarwilBJ's avatarCarwilBJ's Twitter Archive—№ 38,169

      1. As a former math competition participant, I keep reviewing the ChatGPT o3 performance on the American Invitational Math Exam (AIME) and similar benchmarks. Here's why I'm skeptical…
    1. …in reply to @CarwilBJ
      First, full details haven't been released, but it looks like the LLM was allowed to submit thousands of answers and check whether they were right. If so, its scores cannot be compared to those of humans, who get one shot.
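A rough sketch of why many submissions inflate scores: if a model gets k independent attempts and any correct answer counts, effective accuracy climbs quickly with k. The per-attempt success rate below is illustrative, not a real o3 figure.

```python
# Illustrative only: "pass@k" vs. one-shot accuracy when a model may
# submit many candidate answers and keep any that checks out.
# p is a hypothetical per-attempt success probability, not an o3 number.

def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

p = 0.05  # assumed 5% chance of success per single attempt
for k in (1, 10, 100, 1000):
    print(f"k={k:>4}: pass@k = {pass_at_k(p, k):.3f}")
```

Even a weak 5%-per-attempt solver looks near-perfect once it is allowed a thousand graded tries, which is why one-shot human scores and many-shot machine scores aren't comparable.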
  1. …in reply to @CarwilBJ
    Second, some problems on the AIME are designed to elicit conceptual solutions from the humans taking the test, but they might be solved by brute-force counting by ChatGPT o3. That approach would violate the rules of the AIME, which forbid calculators and computers.
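To make the brute-force point concrete, here is a hypothetical AIME-style counting problem (not an actual exam question) that a program settles by enumeration in milliseconds, sidestepping the inclusion-exclusion argument the problem is designed to test: "How many integers from 1 to 1000 are divisible by 3 or 5, but not both?"

```python
# Brute-force enumeration of a hypothetical AIME-style counting problem.
# A human contestant would reason: |div by 3| + |div by 5| - 2*|div by 15|
# = 333 + 200 - 2*66 = 401. A computer just checks every case.
count = sum(
    1 for n in range(1, 1001)
    if (n % 3 == 0) != (n % 5 == 0)  # XOR: divisible by exactly one
)
print(count)  # 401
```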
    1. …in reply to @CarwilBJ
      Third, the pattern of compute time suggests o3 is essentially trying thousands of algorithms in search of a solution. Conceptually, this is like memorizing the AIME's back catalog of questions (possible given the training set) and trying each one.
      1. …in reply to @CarwilBJ
        Such an effort would be impossible for humans in the time allowed, but I can safely say that no one who takes the exam would be impressed by the mathematical ability of a human who tried this strategy and came up with the correct answer on their 1,759th try days later.
        1. …in reply to @CarwilBJ
          We can't be 100% sure this is the mechanism o3 used, since that isn't being disclosed, but it would be consistent with the strategy proposed here... redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt as is OpenAI's graph of compute time vs. performance.
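The linked Redwood Research post describes sampling thousands of candidate programs from a model and keeping only those that reproduce the known examples. A minimal sketch of that generate-and-filter loop, where `sample_program` is a purely hypothetical stand-in for an LLM call:

```python
# Minimal sketch of the generate-and-filter strategy from the linked post:
# sample many candidate programs, keep those consistent with the worked
# examples, then take a majority vote over their answers.
import random
from collections import Counter

def sample_program():
    """Hypothetical stand-in for sampling a candidate solver from a model."""
    k = random.randint(-2, 2)
    return lambda x, k=k: x + k  # each candidate guesses the rule "add k"

def solve(train_pairs, test_input, n_samples=1000):
    answers = Counter()
    for _ in range(n_samples):
        prog = sample_program()
        # Filter: discard candidates that fail any worked example.
        if all(prog(x) == y for x, y in train_pairs):
            answers[prog(test_input)] += 1
    return answers.most_common(1)[0][0] if answers else None

random.seed(0)  # for reproducibility of the toy run
# Toy task: the hidden rule is "add 2", so the filter keeps only k=2.
print(solve([(1, 3), (5, 7)], 10))
```

Thousands of graded tries make this loop effective, which is the thread's point: success here measures search budget and an answer checker, not one-shot mathematical insight.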