CheckEval → a framework that uses an LLM as the evaluator
Proposed to address the ambiguity and inconsistency problems of existing LLM-based evaluation
Evaluated on the SumEval benchmark, CheckEval's judgments align closely with human judgments.
Advantage: explainability (you can see exactly which checklist items were checked)
Comparing CheckEval against human evaluation (Spearman and Kendall-tau coefficients), the correlation is very strong, and notably better than G-Eval.
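To make the correlation metrics concrete, here is a pure-Python sketch of Spearman and Kendall-tau computed between checklist scores and human ratings. The data is made up for illustration; in practice you would use `scipy.stats.spearmanr` and `scipy.stats.kendalltau`.

```python
def rank(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def kendall_tau(x, y):
    """(concordant - discordant pairs) / total pairs."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)

# Hypothetical scores: checklist pass-rates vs. 1-5 human ratings.
checkeval = [0.9, 0.4, 0.7, 0.2, 0.8]
human     = [5,   2,   4,   1,   4]

print(round(spearman(checkeval, human), 3))    # 0.975
print(round(kendall_tau(checkeval, human), 3)) # 0.9
```

A high value of either coefficient means the evaluator ranks summaries in roughly the same order as humans do, which is what the paper reports for CheckEval.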
Sub-question generation prompt:
In this task, you need to create a question to evaluate the {aspect} of the summary of the original document. The definition of {aspect} and the questions corresponding to the key component of {aspect} are provided below. Use them to generate sub-questions for each key question.
Each sub-question must satisfy the following conditions:
1. Each question must be answerable with 'Yes' or 'No'.
2. Each question must contain concepts from the key component.
3. Each question should minimize the subjectivity of the rater's judgment.
4. The semantic redundancy between sub-questions should be minimized.
5. Formulate questions so that a 'Yes' answer is a positive answer.
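The conditions above imply a simple scoring rule: since every sub-question is Yes/No and "Yes" is always positive, an aspect score can be the fraction of "Yes" answers. A minimal sketch, where the checklist questions and the `answer` stub (standing in for an actual LLM call) are illustrative assumptions, not taken from the paper:

```python
# Hypothetical consistency checklist; real checklists are generated
# by the prompt shown above.
consistency_checklist = [
    "Does the summary contain only facts stated in the document?",
    "Are all named entities in the summary present in the document?",
    "Are numbers in the summary consistent with the document?",
]

def answer(question, document, summary):
    # Placeholder for an LLM call that must return "Yes" or "No".
    # Always answers "Yes" here so the example runs without an API.
    return "Yes"

def checklist_score(checklist, document, summary):
    answers = [answer(q, document, summary) for q in checklist]
    yes = sum(a == "Yes" for a in answers)
    return yes / len(checklist)  # 1.0 means every check passed

score = checklist_score(consistency_checklist, "doc...", "summary...")
print(score)  # 1.0 with the always-Yes stub above
```

Because each item is recorded individually, the per-question answers are exactly what gives CheckEval its explainability advantage: you can inspect which checks failed, not just the final number.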