Evaluation Strategy
Depending on the characteristics of each evaluation task, we adopt different evaluation methods and scoring schemes to obtain the most reliable evaluation results.
Evaluation Methods (How are model responses evaluated?)
  • Algorithmic Evaluation: Use existing automated algorithms to assess the models' generated responses.
  • LLM-as-a-Judge: Use a (fine-tuned) large language model as the judge to evaluate the generated responses (see the judging sketch after this list).
  • Human Judges: Engage human experts with relevant backgrounds to evaluate the generated responses.
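To make the LLM-as-a-Judge method concrete, below is a minimal Python sketch of how a judging prompt might be assembled and its score parsed. The prompt wording, the 1–7 scale, and the judge_fn callable are illustrative assumptions, not this benchmark's actual judging setup.

```python
import re
from typing import Callable

# Hypothetical judging prompt; the benchmark's real prompt is not shown in this document.
JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the question "
    "on a 1-7 scale for overall quality. Reply with 'Score: <number>' only.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def judge_response(question: str, answer: str, judge_fn: Callable[[str], str]) -> int:
    """Ask an LLM judge (wrapped in judge_fn) to score one response on a 1-7 scale."""
    reply = judge_fn(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([1-7])", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from judge reply: {reply!r}")
    return int(match.group(1))
```

Wrapping the model call in judge_fn keeps the sketch independent of any particular LLM API; any function that takes a prompt string and returns the judge's reply can be plugged in.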
Scoring Schemes (How are the scores for the evaluated models determined?)
  • Closed-ended Questions & Accuracy: For closed-ended questions, final scores are calculated from each model's accuracy across all questions (see the scoring sketch after this list).
  • Single Response Scoring & Absolute Rating Scheme: For some open-ended questions, judges assign individual scores to each model's response (e.g., using a 1–7 point scale on one or more dimensions), and the final score is the average across all questions.
  • Pairwise Comparison & Elo Rating System: Judges compare the responses of two different models head to head and select a winner (or declare a draw); final rankings are determined with the Elo rating system (see the Elo sketch after this list).
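As a concrete reading of the first two scoring schemes, the sketch below computes a model's accuracy over closed-ended questions and its mean 1–7 rating over open-ended questions. The data shapes (lists of answers and per-question score lists) are assumptions made for illustration.

```python
from statistics import mean

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Closed-ended scheme: fraction of questions answered correctly."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def mean_rating(ratings_per_question: list[list[int]]) -> float:
    """Absolute rating scheme: average the judges' 1-7 scores per question,
    then average over all questions."""
    return mean(mean(scores) for scores in ratings_per_question)
```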
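The Elo rating system used for pairwise comparisons can be summarized by its standard update rule, sketched below. The K-factor of 32 and the initial rating of 1000 are common defaults, not values specified by this document.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, outcome: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one head-to-head comparison.

    outcome: 1.0 if model A wins, 0.0 if model B wins, 0.5 for a draw.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome - exp_a)
    new_b = rating_b + k * ((1.0 - outcome) - (1.0 - exp_a))
    return new_a, new_b

# Example: both models start at 1000; model A wins one comparison.
ra, rb = elo_update(1000.0, 1000.0, 1.0)
```

Accumulating these updates over all pairwise judgments yields the final ratings used to rank the models.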