These models are evaluated and ranked on various leaderboards, providing a comparative analysis of their performance across different tasks and scenarios. This article will summarize the current state of AI LLM leaderboards and rankings, focusing on the Open LLM Leaderboard, the LLM Safety Leaderboard, and various other rankings.
Leaderboard Collections:
- LMSys Chatbot Arena Leaderboard, a crowdsourced benchmark platform that ranks LLMs through anonymous, randomized head-to-head battles
- Open LLM Leaderboard by HuggingFace
- Current best choices on LocalLLaMA reddit
- LLM Logic Tests by YearZero on reddit/localllama
- paperswithcode has LLM SOTA leaderboards
- Can AI Code? A self-evaluating interview for AI coding models, code
- Gotzmann LLM Score v2 by Gatzuma on Reddit
- Aviary Explorer, an open-source utility to compare leading OSS LLMs and see votes, pricing per token, etc.
- Comparative look at (ggml) quantization and parameter size part 1 by KerfuffleV2
- Updated relative comparison of ggml quantization types and effect on perplexity part 2 by KerfuffleV2
- Programming performance ranking for popular LLaMAs using HumanEval+ by ProfessionalHand9945
- llm-humaneval-benchmarks HumanEval+
- CoT Hub
- C-Eval Benchmark
- programming eval by catid from reddit, code
- HumanEval+ ranking for open vs. closed programming LLMs by ProfessionalHand9945 (a pass@k scoring sketch follows this list)
- LLM Comparison Sheet by OptimalScale/LMFlow
- llm-jeopardy Automated prompting and scoring framework to evaluate LLMs using updated human knowledge prompts
- llama presets arena, testing different generation presets by oobabooga, reddit discussion
- MTEB Leaderboard, the Massive Text Embedding Benchmark (MTEB) leaderboard (a usage sketch follows this list)
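Several of the entries above (Can AI Code?, llm-humaneval-benchmarks, the HumanEval+ rankings) report pass@k scores. As a rough illustration, the following Python sketch implements the standard unbiased pass@k estimator; the sample counts in the example are hypothetical placeholders, not numbers from any of the linked leaderboards.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n generations (of which c pass the unit tests) is correct."""
    if n - c < k:
        # Every possible k-subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 generations per problem, 37 passed the tests.
print(pass_at_k(n=200, c=37, k=1))   # expected solve rate with 1 attempt
print(pass_at_k(n=200, c=37, k=10))  # expected solve rate with 10 attempts
```

For the MTEB Leaderboard, scores are produced with the open-source mteb package. A minimal sketch of running a single MTEB task locally might look like the following, assuming the mteb and sentence-transformers packages are installed; the model name and task choice are purely illustrative, and the exact API may differ between versions.

```python
# Minimal sketch: evaluate one embedding model on a single MTEB task.
# Assumes `pip install mteb sentence-transformers`; APIs may vary by version.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # illustrative model
evaluation = MTEB(tasks=["Banking77Classification"])  # one small classification task
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```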
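Both sketches are simplified: the real leaderboards average over many tasks and fix their own generation and few-shot settings.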
Open LLM Leaderboard

The Open LLM Leaderboard, hosted by Hugging Face, aims to track, rank, and evaluate open LLMs and chatbots. It uses the Eleuther AI Language Model Evaluation Harness, a unified framework for testing generative language models. The leaderboard scores models on four main benchmarks (ARC, HellaSwag, MMLU, and TruthfulQA). As of the latest data, Intel's neural-chat-7b model holds the #1 ranking among 7-billion-parameter models on this leaderboard.
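Because the leaderboard's numbers come from the Eleuther AI lm-evaluation-harness, a comparable score can be reproduced locally. The sketch below assumes a recent lm-eval release installed via pip; the model repository, task, and few-shot count are only examples and may not match the leaderboard's exact settings.

```python
# Rough sketch using EleutherAI's lm-evaluation-harness (pip install lm-eval).
# Model name, task, and few-shot count are illustrative; the leaderboard's
# own configuration may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=Intel/neural-chat-7b-v3-1",
    tasks=["hellaswag"],
    num_fewshot=10,
    batch_size=8,
)
print(results["results"])
```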
LLM Safety Leaderboard

The LLM Safety Leaderboard focuses on the safety evaluation of LLMs. It provides a unified evaluation to help researchers and practitioners better understand the safety and risks associated with these models. The leaderboard covers multiple trustworthiness perspectives, applies red-teaming algorithms tailored to each perspective, and ranks both open and closed models on the resulting scores.
Other Rankings
Apart from these leaderboards, various other rankings evaluate LLMs against different criteria. For instance, the Julia LLM Leaderboard evaluates and compares the Julia code-generation capabilities of various LLMs. The Galileo hallucination index identifies GPT-4 as the best-performing LLM across its use cases. The MythoMax 13B model, a fine-tune of Llama 2 13B, is one of the highest-performing models according to OpenRouter.
In terms of overall performance, the current leader is Llama 2, a collection of pretrained and fine-tuned LLMs whose chat variants are specifically optimized for dialogue applications. Other top models include LLaMA, T5, and Galactica. However, the specific use case and business requirements should ultimately guide the selection of the right model.
LLM leaderboards and rankings provide valuable insights into the performance of various models, helping users make informed decisions. However, given the rapid pace of advancements in this field, these rankings are subject to change as new models are developed and existing ones are improved.