Kraków, Poland--(Newsfile Corp. - May 21, 2026) - Omni Calculator announced the publication of the third iteration of its Omni Research on Calculation in AI (ORCA) Benchmark, an independent benchmarking initiative designed to evaluate the mathematical reasoning and stability of publicly available Large Language Models (LLMs).
The ORCA V3 report evaluates the performance of several AI models across real-world quantitative tasks and presents updated findings on model accuracy, logical consistency, and calculation stability. The benchmark assesses AI systems using quantitative problems whose outputs can be objectively verified.
The ORCA framework evaluates LLMs across 500 quantitative problems spanning seven categories: Biology & Chemistry, Engineering & Construction, Finance & Economics, Health & Sports, Math & Conversions, Physics, and Statistics & Probability. According to Omni Calculator, the benchmark uses verified answer keys from the company's library of more than 3,800 calculators to assess model outputs.
The report states that the benchmark follows a zero-shot evaluation methodology, in which models are tested on their first response attempt without additional prompting or retries. Omni Calculator noted that the benchmark is conducted through publicly accessible interfaces to reflect the experience of general users.
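As an illustration only, zero-shot scoring against a verified answer key could be sketched as follows. The release does not describe how ORCA decides that an answer matches the key, so the 1% relative tolerance, the function names, and the sample data here are assumptions, not the report's method.

```python
import math


def is_correct(model_answer, expected, rel_tol=0.01):
    """Hypothetical check: the answer counts as correct within a 1% relative tolerance.

    The tolerance is an assumption; the release does not state ORCA's matching rule.
    """
    return math.isclose(model_answer, expected, rel_tol=rel_tol)


def accuracy(first_responses, answer_key, rel_tol=0.01):
    """Share of problems whose first (zero-shot) response matches the answer key.

    Problems with no response are counted as incorrect.
    """
    correct = sum(
        is_correct(first_responses[pid], expected, rel_tol)
        for pid, expected in answer_key.items()
        if pid in first_responses
    )
    return correct / len(answer_key)


# Made-up data for illustration; not taken from the ORCA report.
answer_key = {"p1": 100.0, "p2": 2.5, "p3": 40.0, "p4": 9.81}
first_responses = {"p1": 100.5, "p2": 3.0, "p3": 40.0}  # no response for p4

score = accuracy(first_responses, answer_key)
```

Only the first response per problem is scored, which mirrors the no-retry condition described above.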
A key component of the ORCA project is the "Instability Metric," which measures how frequently models generate different answers when presented with the same prompt multiple times. According to the report, the metric is intended to evaluate consistency in applications involving finance, engineering, and other quantitative domains.
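The release does not give the formula behind the Instability Metric. One plausible reading, shown here as a minimal sketch, is the share of problems for which repeated runs of the same prompt return more than one distinct answer. The function name, the rounding precision, and the sample data are assumptions.

```python
def instability(runs_by_problem, ndigits=6):
    """Fraction of problems whose repeated runs produced more than one distinct answer.

    Answers are rounded to `ndigits` decimals before comparison so that
    floating-point noise is not counted as disagreement (an assumed choice).
    """
    unstable = 0
    for answers in runs_by_problem.values():
        distinct = {round(a, ndigits) for a in answers}
        if len(distinct) > 1:
            unstable += 1
    return unstable / len(runs_by_problem)


# Made-up data for illustration; not taken from the ORCA report.
runs = {
    "p1": [1.0, 1.0, 1.0],   # stable
    "p2": [2.0, 2.5, 2.0],   # unstable
    "p3": [3.0, 3.0],        # stable
    "p4": [4.0, 4.1],        # unstable
}

rate = instability(runs)
```

Under this reading, a model can be unstable on a problem even when some of its runs are correct, so instability and accuracy are tracked as separate scores.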
The ORCA V3 report includes findings on ChatGPT 5.3, Claude Sonnet 4.6, and Grok 4.20. According to Omni Calculator, Grok 4.20 achieved a 70.4% math accuracy score and a 33.1% instability score. The report also states that Claude Sonnet 4.6 achieved a 53.2% math accuracy score, while ChatGPT 5.3 recorded a 48.4% score in the benchmark's quantitative testing.
The report also discusses "Regression Risk," a trend identified in prior ORCA evaluations in which newer AI model versions may perform worse on certain quantitative tasks than earlier versions. According to Omni Calculator, this version-to-version variability may affect the reliability of automated workflows and repeated calculations.
Omni Calculator stated that the ORCA initiative was developed to provide additional transparency into AI model performance in mathematical and logical reasoning tasks and to support evaluation methods focused on real-world quantitative use cases.
The full ORCA V3 report, titled "Is Claude Really the Best?", is available on the Omni Calculator website.
About the ORCA Benchmark
The ORCA Benchmark is an independent AI benchmarking initiative developed by Omni Calculator to evaluate the mathematical reasoning and logical stability of Large Language Models using quantitative testing scenarios. The benchmark is currently in its third iteration.
About Omni Calculator
Omni Calculator is a technology company based in Kraków, Poland. The company operates a library of more than 3,800 professional-grade calculators and develops benchmarking initiatives focused on quantitative AI evaluation.
Media Contact
Name: Agata Flak
Email Address: content.partnerships@omnicalculator.com
Telephone Number: +48 722 354 132
Company: Omni Calculator
Website: https://www.omnicalculator.com
###
To view the source version of this press release, please visit https://www.newsfilecorp.com/release/298195