Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation

Seong Yeub Chu, Jong Woo Kim, Mun Yong Yi

Proceedings of the Conference on Human Factors in Computing Systems (CHI '25); preprint available on arXiv.

This study introduces InteractEval, a framework that integrates the outcomes of Think-Aloud (TA) sessions conducted by humans and LLMs to generate attributes for checklist-based text evaluation. By combining humans' flexibility and high-level reasoning with LLMs' consistency and extensive knowledge, InteractEval outperforms text evaluation baselines on a text summarization benchmark (SummEval) and an essay scoring benchmark (ELLIPSE). An in-depth analysis further shows that the framework promotes divergent thinking in both humans and LLMs, leading to a wider range of relevant attributes and improved text evaluation performance. A subsequent comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), whereas LLMs perform better at attributes related to external alignment (Consistency and Relevance). Leveraging both together produces the best evaluation outcomes, highlighting the need to effectively combine humans and LLMs in automated checklist-based text evaluation.
- Combines humans' and LLMs' Think-Aloud outcomes to construct checklists for text evaluation
Requirements (see `requirements.txt`):

```
accelerate
git+https://github.com/huggingface/transformers
jinja2>=3.1.0
openai==0.28.0
pandas
tiktoken
scipy
prettytable
google-generativeai
jupyter
anthropic
```
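Several of the listed SDKs expect API keys at runtime. Below is a minimal sketch of how keys might be supplied via environment variables; the variable names are the providers' conventional defaults, and the repository's actual key-loading mechanism may differ.

```python
import os

# Conventional environment variables read by each SDK's default client.
# Replace the placeholders with your own keys before running any code.
os.environ["OPENAI_API_KEY"] = "sk-..."         # used by the openai package
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # used by the anthropic package
os.environ["GOOGLE_API_KEY"] = "..."            # used by google-generativeai
```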
- Because generative models' outputs vary with conditions such as the sampling temperature and random seed, results produced by this code may differ from those reported in the paper.
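One way to reduce (though not eliminate) this run-to-run variance with the pinned `openai==0.28.0` client is to fix the sampling temperature. The snippet below is an illustrative sketch, not the repository's evaluation code; the model name and prompt are placeholders.

```python
import openai  # pinned at 0.28.0 in requirements.txt

openai.api_key = "sk-..."  # replace with your key

# temperature=0 makes decoding near-greedy and reduces variance,
# but LLM outputs can still differ across runs and model versions.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Evaluate the coherence of this summary: ..."}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```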
To reproduce the SummEval experiments:

1. Install the dependencies:

```
pip install -r requirements.txt
```

2. Generate a checklist by running `./summeval_checklist_construction.ipynb`.

3. Run the evaluation (see the driver sketch below for scoring all dimensions):

```
python ./src/main.py --model_name gpt-3.5-Turbo --dimension coherence
```
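To score all four SummEval dimensions discussed in the paper, a small driver loop could wrap the CLI. This is a hypothetical convenience script, assuming `main.py` accepts the same four dimension names used in the paper; it is not part of the repository.

```python
import subprocess

# The four SummEval dimensions analyzed in the paper.
DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

for dim in DIMENSIONS:
    # Invoke the evaluation entry point once per dimension.
    subprocess.run(
        ["python", "./src/main.py", "--model_name", "gpt-3.5-Turbo", "--dimension", dim],
        check=True,
    )
```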
- Language: Python
- Utilized LLMs: GPT-4, GPT-3.5-Turbo, Gemini-1.5-Pro, Llama-3.1-8B-Instruct, Claude-3.5-Sonnet
- Dependencies: refer to `requirements.txt`
- Datasets: SummEval, ELLIPSE
Repository structure:

```
InteractEval
├──assets
├──data
│   ├──ellipse
│   └──summeval
├──prompts
│   ├──ellipse
│   │   ├──checklist_construction
│   │   │   ├──attributes_clustering
│   │   │   ├──component_extraction
│   │   │   ├──question_generation
│   │   │   ├──question_validation
│   │   │   └──sub_question_generation
│   │   ├──evaluation
│   │   └──think_aloud
│   │       ├──claude
│   │       ├──gemini
│   │       ├──gpt
│   │       └──llama
│   └──summeval
├──src
├──summeval_think_aloud
└──ellipse_think_aloud
```
To use Llama-3.1-8B-Instruct locally, download the weights with Git LFS:

```
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
```
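Once cloned, the local checkpoint can be loaded with `transformers`. This is a minimal sketch assuming the clone sits in your working directory; the path and generation settings are illustrative, and loading the 8B model requires substantial GPU or CPU memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the local clone created by the git-lfs commands above.
model_path = "./Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # requires accelerate (listed in requirements.txt)
)

# Quick smoke test that the weights loaded correctly.
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```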