Anthropic's AI model, Claude 3.5 Sonnet launched. How does it compare to ChatGPT-4o, Gemini?

Shreya Mundhra • June 22, 2024, 18:50:56 IST

OpenAI’s top rival, Anthropic, recently launched its newest AI model, Claude 3.5 Sonnet. Anthropic claims the new model surpasses well-known competitors like OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro. We explain how it has performed on benchmark tests and look at whether those results should be taken as an accurate indicator of practical usefulness.

Image source: Pixabay/Anthropic.com

Anthropic, a leading contender in the AI space, has just rolled out its latest innovation: Claude 3.5 Sonnet. This marks the inaugural release in the highly anticipated Claude 3.5 series. Anthropic claims this new model surpasses well-known competitors like OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro.

We explain how well the new AI model has performed on benchmark tests. We also take a look at whether the test results should be taken as an accurate indicator of practical usefulness.


About Claude 3.5 Sonnet

It’s a large language model (LLM) developed by Anthropic, part of their family of generative pre-trained transformers. These models excel at predicting the next word in a sequence based on extensive pre-training with vast amounts of text. Claude 3.5 Sonnet builds on the foundation laid by Claude 3 Sonnet, which made its debut in March this year.

Claude 3.5 Sonnet boasts a significant performance boost, operating at twice the speed of Anthropic’s previous flagship, Claude 3 Opus. This leap, paired with more budget-friendly pricing, positions Claude 3.5 Sonnet as a go-to option for intricate tasks, including context-aware customer support and managing multi-step workflows, according to Anthropic’s official statement.


Claude 3.5 Sonnet vs GPT-4o vs Gemini 1.5 Pro

Anthropic revealed Claude 3.5 Sonnet’s scores (compared with those of its peers) on benchmark tests in a post on social media platform X.

Note that terms like “0-shot”, “5-shot”, and chain of thought (CoT) appear alongside the scores. “N-shot” describes how many worked examples are placed in the prompt before the actual test question (0-shot means none), while chain-of-thought prompting asks the model to spell out its intermediate reasoning before giving a final answer.
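
To make the distinction concrete, here is a minimal sketch of how such prompts might be assembled; the sample question, the worked example and the build_prompt() helper are invented for illustration and do not reflect the actual evaluation harness any of these labs used.

# Minimal sketch of "0-shot" vs "few-shot" chain-of-thought (CoT) prompting.
# Everything here (the example, the question, build_prompt) is illustrative only.
EXAMPLES = [
    ("A farmer has 3 pens with 4 sheep in each. How many sheep are there?",
     "Each pen holds 4 sheep and there are 3 pens, so 3 * 4 = 12. The answer is 12."),
]

def build_prompt(question: str, n_shot: int = 0, cot: bool = True) -> str:
    """Place n_shot worked examples before the real question, optionally asking for CoT."""
    parts = []
    for example_question, worked_answer in EXAMPLES[:n_shot]:
        parts.append(f"Q: {example_question}\nA: {worked_answer}")
    instruction = "Let's think step by step." if cot else "Answer directly."
    parts.append(f"Q: {question}\nA: {instruction}")
    return "\n\n".join(parts)

# 0-shot CoT: no worked examples, just a nudge to reason step by step.
print(build_prompt("A train covers 60 km in 1.5 hours. What is its average speed?", n_shot=0))

# Few-shot: the same question preceded by worked examples (five of them in a "5-shot" run).
print(build_prompt("A train covers 60 km in 1.5 hours. What is its average speed?", n_shot=1))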

Claude 3.5 Sonnet vs GPT-4o vs Gemini 1.5 Pro benchmark scores
Image source: Anthropic.com

Here’s what each of the tests measures, and a comparison of how the three AI models performed.

Graduate-level reasoning
GPQA (Graduate-Level Google-Proof Q&A): This benchmark assesses an AI’s capability to answer expert-written, graduate-level science questions in biology, physics and chemistry that are designed to be hard to solve with a web search alone, testing its advanced reasoning and domain knowledge. The test was introduced by Rein et al. in a paper titled “GPQA: A Graduate-Level Google-Proof Q&A Benchmark”.

Diamond: The scores reported here are on GPQA’s “Diamond” subset, the hardest slice of the benchmark, consisting of questions that domain experts answered correctly but skilled non-experts got wrong even with web access. It is the portion most often used to compare frontier models.


  • Claude 3.5 Sonnet: 59.4% (0-shot CoT)

  • GPT-4o: 53.6% (0-shot CoT)

  • Gemini 1.5 Pro: Not available

Undergraduate-level knowledge
MMLU (Massive Multitask Language Understanding): This benchmark evaluates a model’s understanding across a wide array of undergraduate-level subjects, spanning the humanities, sciences, and social sciences. It measures the model’s breadth of knowledge and its ability to handle diverse topics. The test was introduced by Hendrycks et al. in a paper titled “Measuring Massive Multitask Language Understanding”.

  • Claude 3.5 Sonnet: 88.7% (5-shot), 88.3% (0-shot CoT)

  • GPT-4o: 88.7% (0-shot CoT)

  • Gemini 1.5 Pro: 85.9% (5-shot)
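
MMLU and most of the benchmarks below are scored as plain accuracy: the share of questions the model answers correctly. The toy scorer below shows that arithmetic; the two questions and the model_answer() stub are made up for illustration and stand in for a real model call.

# Toy illustration of multiple-choice scoring: accuracy = correct / total.
# The dataset and the model_answer() stub are invented for illustration only.
dataset = [
    {"question": "Which planet is the largest?", "answer": "B"},   # B) Jupiter
    {"question": "H2O is commonly known as what?", "answer": "C"}, # C) Water
]

def model_answer(question: str) -> str:
    # Placeholder: a real evaluation would query the model here.
    return "B"

correct = sum(model_answer(item["question"]) == item["answer"] for item in dataset)
print(f"Accuracy: {correct / len(dataset):.1%}")  # 50.0% with this stub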

Code
HumanEval: This benchmark assesses an AI model’s proficiency in generating correct and functional code snippets from natural language descriptions of programming tasks. It tests the model’s grasp of programming languages and its ability to solve problems in software development.

  • Claude 3.5 Sonnet: 92.0% (0-shot)

  • GPT-4o: 90.2% (0-shot)

  • Gemini 1.5 Pro: 84.1% (0-shot)
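
HumanEval scores are produced by giving the model a function signature plus docstring and checking whether the body it writes passes hidden unit tests. The sketch below shows that pass/fail mechanic with an invented task; the “model completion” is hand-written here, standing in for genuine model output.

# Rough illustration of a HumanEval-style check: assemble prompt + completion,
# run hidden unit tests, and count the sample as correct only if they all pass.
# The task and the "completion" below are invented for illustration.
PROMPT = '''
def running_max(numbers):
    """Return a list where each element is the maximum seen so far."""
'''

MODEL_COMPLETION = '''
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result
'''

namespace = {}
exec(PROMPT + MODEL_COMPLETION, namespace)  # define the candidate function

# Hidden unit tests: the sample passes only if every assertion holds.
assert namespace["running_max"]([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert namespace["running_max"]([]) == []
print("Sample passed the unit tests")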

Multilingual math
MGSM (Multilingual Grade School Math): This benchmark evaluates a model’s capacity to solve grade school-level math problems presented in various languages. It tests both the model’s mathematical reasoning and its ability to understand and respond in different linguistic contexts. It was introduced by Shi et al. in a paper titled “Language Models are Multilingual Chain-of-Thought Reasoners”.

  • Claude 3.5 Sonnet: 91.6% (0-shot CoT)

  • GPT-4o: 90.5% (0-shot CoT)

  • Gemini 1.5 Pro: 87.5% (8-shot)

Reasoning over text
DROP (Discrete Reasoning Over Paragraphs): This benchmark measures an AI’s ability to perform discrete reasoning tasks, such as information extraction and arithmetic operations, on paragraphs of text. The F1 score, a common metric in machine learning for assessing classification tasks, is used to evaluate the model’s accuracy in these tasks, balancing precision and recall. It was introduced by Dua et al. in a paper titled “DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs”.

  • Claude 3.5 Sonnet: 87.1% (3-shot)

  • GPT-4o: 83.4% (3-shot)

  • Gemini 1.5 Pro: 74.9% (Variable shots)
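
Unlike the accuracy-based tests above, DROP is reported as an F1 score, which balances precision (how much of the predicted answer is correct) against recall (how much of the reference answer was recovered). The helper below shows that arithmetic using simple whitespace tokenisation, a simplification of DROP’s actual answer-matching rules.

# F1 = 2 * precision * recall / (precision + recall), computed over answer tokens.
# Whitespace tokenisation here is a simplification of DROP's real matching rules.
def f1_score(predicted: str, reference: str) -> float:
    pred_tokens, ref_tokens = predicted.split(), reference.split()
    common = set(pred_tokens) & set(ref_tokens)
    if not common:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Prediction shares 2 of its 3 tokens with the 2-token reference:
# precision = 2/3, recall = 1.0, so F1 = 0.8.
print(f1_score("about 40 percent", "40 percent"))  # 0.8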

Mixed evaluations
BIG-Bench-Hard: Part of the Beyond the Imitation Game Benchmark (BIG-Bench) suite, this benchmark focuses on exceptionally challenging tasks that demand deep reasoning, understanding, and creativity. These tasks push the limits of AI capabilities across diverse domains. This was introduced by Suzgun et al. in a paper titled “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them”.

  • Claude 3.5 Sonnet: 93.1% (3-shot CoT)

  • GPT-4o: Not available

  • Gemini 1.5 Pro: 89.2% (3-shot CoT)

Math problem-solving
MATH: This benchmark tests an AI model’s ability to solve math problems ranging from high school to college level. It assesses the model’s understanding of mathematical concepts, problem-solving skills, and capacity to perform complex calculations.

  • Claude 3.5 Sonnet: 71.1% (0-shot CoT)

  • GPT-4o: 76.6% (0-shot CoT)

  • Gemini 1.5 Pro: 67.7% (4-shot)

Grade school math
GSM8K (Grade School Math 8K): This benchmark evaluates an AI model’s proficiency on roughly 8,500 grade school-level math word problems (the “8K” refers to the size of the dataset). It measures the model’s grasp of basic arithmetic and multi-step reasoning in word problems.

  • Claude 3.5 Sonnet: 96.4% (0-shot CoT)

  • GPT-4o: 90.8% (11-shot)

  • Gemini 1.5 Pro: Not available

The results

Claude 3.5 Sonnet generally outperforms GPT-4o and Gemini 1.5 Pro across most benchmarks.

Claude 3.5 Sonnet has outperformed GPT-4o on most of the reported benchmarks. Image used for representational purposes/Reuters

Claude 3.5 Sonnet leads in graduate-level reasoning, coding (HumanEval), multilingual math, reasoning over text, mixed evaluations, and grade school math.

GPT-4o performs very well in undergraduate-level knowledge, coding (HumanEval), and math problem-solving.

Gemini 1.5 Pro’s strongest showing among the reported scores comes in the mixed evaluations (BIG-Bench-Hard), where it trails only Claude 3.5 Sonnet.

Taking benchmark test scores with a pinch of salt

Most benchmarks are designed to push the limits of a model on a single task at a time, something that rarely happens in real-life use. Real-world applications typically involve complex, context-dependent tasks, whereas benchmark tasks are simplified and tightly controlled.

Benchmarks usually measure a model’s performance in isolation. However, real-life usefulness often involves interacting with humans, understanding context, and dynamically adapting responses, aspects that benchmarks might not fully capture.


Real-world environments are ever-changing, and benchmarks are static. A model’s ability to adapt and learn continuously in a dynamic setting is crucial but not typically measured by standard benchmarks.

While Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro perform differently across benchmarks, their practical effectiveness will ultimately depend on how they fare in real-world scenarios and how well they meet the specific needs of their intended applications. Who comes out on top will also depend on how much OpenAI, Anthropic, and Google spend on the computing needed to retrain their models and run inference.

Tags
artificial intelligence (AI) ChatGPT