This explosion of AI models has prompted a need to benchmark them against one another.

Grading large language models and the chatbots that use them is difficult.

For now, we are stuck with subjective measurements.

GPT-4 loses its position as best LLM to Claude 3 in LMSYS benchmark

Enter LMSYS’s Chatbot Arena, a crowd-sourced leaderboard for ranking LLMs “in the wild.”

It employs the Elo rating system, which is widely used to rank players in zero-sum games like chess.

Two LLMs compete in random head-to-head matches, and human judges, who cannot see which model is which, vote for the response they prefer.
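For intuition, here is a minimal sketch of the classic Elo update that Chatbot Arena builds on. The K-factor of 32 and the 1,200 starting rating are illustrative assumptions, not LMSYS’s actual parameters.

```python
# Sketch of a single Elo update after one blind head-to-head match.
# K-factor and starting ratings below are illustrative assumptions.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new ratings after one match.

    score_a is 1.0 if the judge preferred A, 0.0 if they preferred B,
    and 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: a human judge prefers model A in one blind match.
a, b = update(1200.0, 1200.0, score_a=1.0)
print(a, b)  # 1216.0 1184.0 -- zero-sum: A gains exactly what B loses
```

Because the game is zero-sum, one model’s gain is always the other’s loss, which is what makes the resulting leaderboard a relative ranking rather than an absolute score.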

The Arena has even become the industry’s gold standard, with the highest-ranking systems described as “GPT-4-class” models.

Perhaps even more impressive is Claude 3 Haiku’s break into the top ten.

Haiku is Anthropic’s “local size” model, comparable to Google’s Gemini Nano.

It is dramatically smaller than Opus, which has trillions of parameters, making it much faster by comparison.

According to LMSYS, Haiku’s number-seven spot on the leaderboard graduates it to GPT-4 class.

Anthropic probably won’t hold the top spot for long.

Last week, OpenAI insiders leaked that GPT-5 is almost ready for its public debut and should launch “mid-year.”

The new model is reportedly leaps and bounds better than GPT-4.

Image credit: Mike MacKenzie