Evaluation Strategy
Because evaluation tasks differ in character, we match the evaluation method and scoring scheme to each task rather than applying a single procedure across the board.
Evaluation Methods (How are model responses evaluated?)
- Algorithmic Evaluation: Use existing automated algorithms to assess the generated responses from the models.
- LLM-as-a-Judge: Utilize a (fine-tuned) large language model as the judge to evaluate the generated responses (see the sketch after this list).
- Human Judges: Engage human experts with relevant backgrounds to evaluate the generated responses.
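To make the LLM-as-a-Judge setup concrete, here is a minimal sketch of the pattern: build a grading prompt, send it to a judge model, and parse the numeric rating from its reply. The prompt template, the 1–7 scale, and the `call_judge` callable are assumptions for illustration; any actual judge model and rubric would replace them.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; the actual judge prompt and scale are assumptions.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's response to the "
    "question below on a 1-7 scale for overall quality.\n\n"
    "Question: {question}\n"
    "Response: {response}\n\n"
    "Reply with a single line of the form 'Rating: <1-7>'."
)

def parse_rating(judge_reply: str) -> Optional[int]:
    """Extract the numeric rating from the judge model's reply."""
    match = re.search(r"Rating:\s*([1-7])", judge_reply)
    return int(match.group(1)) if match else None

def judge_response(
    question: str,
    response: str,
    call_judge: Callable[[str], str],  # wraps whichever judge LLM is used
) -> Optional[int]:
    """Score one model response with an LLM judge."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    return parse_rating(call_judge(prompt))

if __name__ == "__main__":
    # Stand-in judge for demonstration; replace with a real LLM call.
    fake_judge = lambda prompt: "Rating: 6"
    print(judge_response("What is 2 + 2?", "4", fake_judge))  # -> 6
```

Parsing a fixed `Rating:` line keeps the judge's free-form reasoning separate from the machine-readable score, which simplifies aggregation downstream.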
Scoring Schemes (How are the scores for the evaluated models determined?)
- Closed-ended Questions & Accuracy: For closed-ended questions, a model's final score is its accuracy across all questions.
- Single Response Scoring & Absolute Rating Scheme: For some open-ended questions, judges assign individual scores to each model's response (e.g., on a 1–7 point scale along one or more dimensions), and the final score is obtained by averaging scores across all questions. Both schemes are sketched below.
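The two scoring schemes reduce to simple aggregations, shown in the sketch below. The function names and the two-dimension example ratings are assumptions for illustration, not part of the evaluation suite.

```python
from statistics import mean

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Closed-ended scoring: fraction of exactly matching answers."""
    assert len(predictions) == len(references)
    correct = sum(p.strip() == r.strip()
                  for p, r in zip(predictions, references))
    return correct / len(references)

def absolute_rating(scores_per_question: list[list[int]]) -> float:
    """Absolute rating scheme: average the per-dimension judge scores
    for each question, then average across all questions."""
    return mean(mean(dims) for dims in scores_per_question)

# Example: three closed-ended questions, plus two open-ended questions
# each rated on two hypothetical dimensions using a 1-7 scale.
print(accuracy(["A", "C", "B"], ["A", "B", "B"]))  # -> 0.666...
print(absolute_rating([[6, 7], [5, 4]]))           # -> 5.5
```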