Evaluation Strategy
Because evaluation tasks differ in character, we match the evaluation method and scoring scheme to each task rather than applying a single procedure across the board.
Evaluation Methods (How are model responses evaluated?)
- Algorithmic Evaluation: Use existing automated algorithms to assess the generated responses from the models.
- LLM-as-a-Judge: Utilize a (fine-tuned) large language model as the judge to evaluate the generated responses (see the sketch after this list).
- Human Judges: Engage human experts with relevant backgrounds to evaluate the generated responses.
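To make the LLM-as-a-Judge setup concrete, here is a minimal sketch of the pattern: build a grading prompt, send it to a judge model, and parse the numeric rating from its reply. The prompt template, the 1–7 scale, and the `call_judge` callable are assumptions for illustration; any actual judge model and rubric would replace them.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric; the actual judge prompt and scale are assumptions.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's response to the "
    "question below on a 1-7 scale for overall quality.\n\n"
    "Question: {question}\n"
    "Response: {response}\n\n"
    "Reply with a single line of the form 'Rating: <1-7>'."
)

def parse_rating(judge_reply: str) -> Optional[int]:
    """Extract the numeric rating from the judge model's reply."""
    match = re.search(r"Rating:\s*([1-7])", judge_reply)
    return int(match.group(1)) if match else None

def judge_response(
    question: str,
    response: str,
    call_judge: Callable[[str], str],  # wraps whichever judge LLM is used
) -> Optional[int]:
    """Score one model response with an LLM judge."""
    prompt = JUDGE_TEMPLATE.format(question=question, response=response)
    return parse_rating(call_judge(prompt))

if __name__ == "__main__":
    # Stand-in judge for demonstration; replace with a real LLM call.
    fake_judge = lambda prompt: "Rating: 6"
    print(judge_response("What is 2 + 2?", "4", fake_judge))  # -> 6
```

Parsing a fixed `Rating:` line keeps the judge's free-form reasoning separate from the machine-readable score, which simplifies aggregation downstream.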
Scoring Schemes (How are the scores for the evaluated models determined?)
- Closed-ended Questions & Accuracy: For closed-ended questions, a model's final score is its accuracy across all questions.
- Single Response Scoring & Absolute Rating Scheme: For some open-ended questions, judges assign individual scores to each model's response (e.g., on a 1–7 point scale along one or more dimensions), and the final score is obtained by averaging scores across all questions. Both schemes are sketched below.
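The two scoring schemes reduce to simple aggregations, shown in the sketch below. The function names and the two-dimension example ratings are assumptions for illustration, not part of the evaluation suite.

```python
from statistics import mean

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Closed-ended scoring: fraction of exactly matching answers."""
    assert len(predictions) == len(references)
    correct = sum(p.strip() == r.strip()
                  for p, r in zip(predictions, references))
    return correct / len(references)

def absolute_rating(scores_per_question: list[list[int]]) -> float:
    """Absolute rating scheme: average the per-dimension judge scores
    for each question, then average across all questions."""
    return mean(mean(dims) for dims in scores_per_question)

# Example: three closed-ended questions, plus two open-ended questions
# each rated on two hypothetical dimensions using a 1-7 scale.
print(accuracy(["A", "C", "B"], ["A", "B", "B"]))  # -> 0.666...
print(absolute_rating([[6, 7], [5, 4]]))           # -> 5.5
```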