Large Language Models Benchmarks

15 hours ago on MSN

Anthropic releases Claude Sonnet 4.6: Benchmark performance, how to try it

Anthropic's latest flagship model, Claude Sonnet 4.6, is out now.

CARDBiomedBench: a benchmark for evaluating the performance of large language models in ...

Although large language models (LLMs) have the potential to transform biomedical research, their ability to reason accurately across complex, data-rich domains remains unproven. To address this ...

TechNode

ByteDance Releases Doubao-Seed-2.0, Positions Pro Model Against GPT 5.2 and Gemini 3 Pro

Seed-2.0, the latest version of its Doubao large language model series. The company said the Pro variant is benchmarked ...

3 hours ago

Sarvam AI launches 30B and 105B models, says 105B outperforms DeepSeek R1 and Gemini Flash ...

Bengaluru-based AI startup Sarvam AI on February 18 announced the launch of two new large language models, a 30-billion-parameter model and a 105-billion-parameter model, both trained from scratch, ...

VentureBeat

Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks

As large language models (LLMs) continue to improve at coding, the benchmarks used to evaluate their performance are steadily becoming less useful. That's because though many LLMs have similar high ...

ZDNet

With AI models clobbering every benchmark, it's time for human evaluation

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...

6 days ago

Sarvam AI claims edge over larger global models on Indic benchmarks

Capable of reasoning, designed for voice, and fluent in Indian languages, the model would be ready for population-scale deployment ...

26 minutes ago

AI startup Sarvam launches two made-in-India large language models

Sarvam launches 30B and 105B parameter indigenous LLMs trained on Indian languages, positioning India closer to a sovereign, voice-first AI ecosystem ...

6 hours ago on MSN

Sarvam unveils two new large language models focused on real-time use, advanced reasoning

The company said the model is optimised for “efficient thinking”, delivering stronger responses while using fewer tokens — a key factor in reducing inference costs in production environments.

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...
