OMEGA: A Structured Math Benchmark to Probe the Reasoning Limits of LLMs
Introduction to Generalization in Mathematical Reasoning
Large-scale language models with long chain-of-thought (CoT) reasoning, such as DeepSeek-R1, have shown strong results on Olympiad-level mathematics. However, models trained through Supervised Fine-Tuning or Reinforcement Learning often depend on a narrow set of techniques, such as reapplying familiar algebra rules or defaulting to coordinate geometry in diagram problems. Since these models follow learned…
