Natural Language Proficiency Ranking (LLM-as-a-judge)

We used a fine-tuned GPT-3.5 Turbo model as a judge to evaluate large language models through pairwise comparisons. The judge covered four natural language proficiency sub-tasks: free Q&A, content generation, scenario simulation, and role-playing. Responses from the 14 LLMs to the same question were compared in pairs. The resulting win rates are shown below; a larger number means that Model A's response beat Model B's response to the same question more often.

Winning Rates for Pairwise Comparisons

Large Language Model Assessment in the Chinese Context
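
As an illustration of how such a win-rate table can be assembled from the judge's verdicts, here is a minimal Python sketch. The input format (tuples of `model_a`, `model_b`, `winner`), the function name `win_rate_matrix`, and the handling of ties as half a win for each side are assumptions made for this example; the evaluation's actual data format and tie policy are not described here.

```python
from collections import defaultdict


def win_rate_matrix(judgments):
    """Return {(model_a, model_b): win rate of model_a against model_b}.

    `judgments` is an iterable of (model_a, model_b, winner) tuples, where
    winner is "A", "B", or "tie". This layout is hypothetical.
    """
    wins = defaultdict(float)   # (x, y) -> wins of x over y
    games = defaultdict(int)    # (x, y) -> number of comparisons between x and y
    for a, b, winner in judgments:
        games[(a, b)] += 1
        games[(b, a)] += 1
        if winner == "A":
            wins[(a, b)] += 1.0
        elif winner == "B":
            wins[(b, a)] += 1.0
        else:
            # Assumption: a tie counts as half a win for each model.
            wins[(a, b)] += 0.5
            wins[(b, a)] += 0.5
    return {pair: wins[pair] / n for pair, n in games.items()}


if __name__ == "__main__":
    sample = [
        ("model_x", "model_y", "A"),
        ("model_x", "model_y", "B"),
        ("model_x", "model_y", "A"),
        ("model_y", "model_x", "tie"),
    ]
    for pair, rate in sorted(win_rate_matrix(sample).items()):
        print(pair, round(rate, 3))
```

Each cell is the share of comparisons that the row model won against the column model, so the two cells for any pair sum to 1 under this tie convention.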

Combining the Elo rating system with judgments from the fine-tuned GPT-3.5 Turbo produces the rankings below.

Leaderboard

Rank	Model	Version	Natural Language Proficiency	Disciplinary Expertise	Safety and Responsibility	Average
1	ERNIE-Bot 4	ERNIE-Bot4.0	80.03	73.07	68.25	74.58
2	GPT4-Turbo	gpt-4-1106-preview	82.59	67.82	67.25	73.66
3	Tongyi Qianwen 2	qwen-max	75.22	77.19	64.64	72.97
4	GPT4	gpt-4-0613	80.6	65.79	59	69.95
5	Spark 3	Spark v3.0	72.61	66.66	66.61	69.06
6	Sensenova	nova-ptc-xl-v1	71.29	63.07	63.65	66.56
7	MiniMax	abab5.5-chat	71.21	58.23	55.31	62.7
8	ChatGLM3	ChatGLM3-6B	70.38	48	62.9	61.13
9	360GPT	360GPT_S2_V9	67.5	52.78	56.04	59.64
10	GPT3.5-Turbo	gpt-3.5-turbo-0613	72.96	33.17	62.72	57.35
11	Baichuan2	baichuan2-13b-chat-v1	60.14	50.58	59.33	56.84
12	Qianfan-Chinese-Llama-2	Qianfan-Chinese-Llama-2-7B	57.04	46.37	54.01	52.78
13	AquilaChat	AquilaChat-7B	56.75	24.24	59.94	47.14
14	BLOOMZ	BLOOMZ-7B	49.8	30.27	45.85	42.43

Leaderboard

Rank	Model	Version	Elo rating
1	GPT4-Turbo	gpt-4-1106-preview	1391
2	GPT3.5-Turbo	gpt-3.5-turbo-0613	1197
3	Spark 3	Spark v3.0	1104
4	ChatGLM3	ChatGLM3-6B	1074
5	GPT4	gpt-4-0613	1048
6	ERNIE-Bot 4	ERNIEBot-4	1040
7	Tongyi Qianwen 2	qwen-max	1036
8	Sensenova	nova-ptc-xl-v1	1026
9	MiniMax	abab5.5-chat	1022
10	Baichuan2	baichuan2-13b-chat-v1	942
11	Qianfan-Chinese-Llama-2	Qianfan-Chinese-Llama-2-7B	906
12	360GPT	360GPT_S2_V9	860
13	AquilaChat	AquilaChat-7B	755
14	BLOOMZ	BLOOMZ-7B	601
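
For readers unfamiliar with Elo, the sketch below shows the standard update rule applied to a stream of pairwise verdicts. It is not the evaluation's actual implementation: the K-factor of 32, the starting rating of 1000, the processing order, and the names `expected_score` and `update_elo` are all assumptions for illustration. The ratings in the table above average roughly 1000, which is consistent with a starting rating of 1000 but does not confirm it.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (logistic curve, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, score_a, k=32.0, initial=1000.0):
    """Update `ratings` in place after one comparison.

    score_a is 1.0 if model_a won, 0.0 if it lost, 0.5 for a tie.
    k and initial are assumed values, not taken from the evaluation.
    """
    r_a = ratings.get(model_a, initial)
    r_b = ratings.get(model_b, initial)
    e_a = expected_score(r_a, r_b)
    # The winner gains exactly what the loser gives up, so the total is conserved.
    ratings[model_a] = r_a + k * (score_a - e_a)
    ratings[model_b] = r_b - k * (score_a - e_a)
    return ratings


if __name__ == "__main__":
    ratings = {}
    # Hypothetical verdicts: (model_a, model_b, score of model_a).
    verdicts = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5)]
    for a, b, s in verdicts:
        update_elo(ratings, a, b, s)
    for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{name}\t{rating:.0f}")
```

Because sequential Elo updates depend on the order in which comparisons are processed, implementations often shuffle the verdicts or average over several passes; whether that was done here is not stated.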