These models are evaluated and ranked on various leaderboards, providing a comparative analysis of their performance across different tasks and scenarios. This article will summarize the current state of AI LLM leaderboards and rankings, focusing on the Open LLM Leaderboard, the LLM Safety Leaderboard, and various other rankings.
Leaderboard Collections:
- LMSys Chatbot Arena Leaderboard, a crowdsourced benchmark platform that ranks LLMs through anonymous, randomized head-to-head battles
- Open LLM Leaderboard by HuggingFace
- Current best choices on LocalLLaMA reddit
- LLM Logic Tests by YearZero on reddit/localllama
- paperswithcode has LLM SOTA leaderboards
- Can AI Code? A self-evaluating interview for AI coding models, code
- Gotzmann LLM Score v2 by Gatzuma on Reddit
- Aviary Explorer, an open-source utility to compare leading OSS LLMs and see votes, pricing per token, etc.
- Comparative look at (ggml) quantization and parameter size part 1 by KerfuffleV2
- Updated relative comparison of ggml quantization types and effect on perplexity part 2 by KerfuffleV2
- Programming performance ranking for popular LLaMAs using HumanEval+ by ProfessionalHand9945
- llm-humaneval-benchmarks HumanEval+
- CoT Hub
- C-Eval Benchmark
- programming eval by catid from reddit, code
- HumanEval+ ranking for open vs. closed programming LLMs by ProfessionalHand9945 (a pass@k scoring sketch follows this list)
- LLM Comparison Sheet by OptimalScale/LMFlow
- llm-jeopardy Automated prompting and scoring framework to evaluate LLMs using updated human knowledge prompts
- llama presets arena, testing different generation presets by oobabooga, reddit discussion
- MTEB Leaderboard, the Massive Text Embedding Benchmark (MTEB) leaderboard (a usage sketch follows this list)
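Several of the entries above (Can AI Code?, llm-humaneval-benchmarks, the HumanEval+ rankings) report pass@k scores. As a rough illustration, the following Python sketch implements the standard unbiased pass@k estimator; the sample counts in the example are hypothetical placeholders, not numbers from any of the linked leaderboards.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (of which c pass the unit tests) is correct."""
    if n - c < k:
        # Every possible k-subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 generations per problem, 37 passed the tests.
print(pass_at_k(n=200, c=37, k=1))   # expected solve rate with 1 attempt
print(pass_at_k(n=200, c=37, k=10))  # expected solve rate with 10 attempts
```

For the MTEB Leaderboard, scores are produced with the open-source mteb package. A minimal sketch of running a single MTEB task locally might look like the following, assuming the mteb and sentence-transformers packages are installed; the model name and task choice are purely illustrative, and the exact API may differ between versions.

```python
# Minimal sketch: evaluate one embedding model on a single MTEB task.
# Assumes `pip install mteb sentence-transformers`; APIs may vary by version.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative model
evaluation = MTEB(tasks=["Banking77Classification"])  # one small classification task
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```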
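Both sketches are simplified: the real leaderboards average over many tasks and fix their own generation and few-shot settings.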
Open LLM Leaderboard

The Open LLM Leaderboard, hosted by Hugging Face, aims to track, rank, and evaluate open LLMs and chatbots. It uses the Eleuther AI Language Model Evaluation Harness, a unified framework for testing generative language models. The leaderboard scores models on four main benchmarks (ARC, HellaSwag, MMLU, and TruthfulQA). As of the latest data, Intel's neural-chat-7b model holds the #1 ranking among 7-billion-parameter models on this leaderboard.
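Because the leaderboard's numbers come from the Eleuther AI lm-evaluation-harness, a comparable score can be reproduced locally. The sketch below assumes a recent lm-eval release installed via pip; the model repository, task, and few-shot count are only examples and may not match the leaderboard's exact settings.

```python
# Rough sketch using EleutherAI's lm-evaluation-harness (pip install lm-eval).
# Model name, task, and few-shot count are illustrative; the leaderboard's
# own configuration may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=Intel/neural-chat-7b-v3-1",
    tasks=["hellaswag"],
    num_fewshot=10,
    batch_size=8,
)
print(results["results"])
```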
LLM Safety Leaderboard

The LLM Safety Leaderboard focuses on the safety evaluation of LLMs. It provides a unified evaluation to help researchers and practitioners better understand the safety and risks associated with these models. The leaderboard covers multiple trustworthiness perspectives, applies red-teaming algorithms tailored to each perspective, and ranks both open and closed models on the resulting scores.
Other Rankings
Apart from these leaderboards, various other rankings evaluate LLMs against different criteria. For instance, the Julia LLM Leaderboard evaluates and compares the Julia code-generation capabilities of various LLMs. The Galileo hallucination index identifies GPT-4 as the best-performing LLM across its use cases. The MythoMax 13B model, a fine-tune of Llama 2 13B, is one of the highest-performing models according to OpenRouter.
In terms of overall performance, the current leader is Llama 2, a collection of pretrained and fine-tuned LLMs whose chat variants are specifically optimized for dialogue applications. Other top models include LLaMA, T5, and Galactica. However, the specific use case and business requirements should ultimately guide the selection of the right model.
LLM leaderboards and rankings provide valuable insights into the performance of various models, helping users make informed decisions. However, given the rapid pace of advancements in this field, these rankings are subject to change as new models are developed and existing ones are improved.