Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation

Seong Yeub Chu, Jong Woo Kim, Mun Yong Yi

Proceedings of the Conference on Human Factors in Computing Systems (CHI '25); preprint available on arXiv.

This study introduces InteractEval, a framework that integrates the outcomes of Think-Aloud (TA) sessions conducted by humans and LLMs to generate attributes for checklist-based text evaluation. By combining humans' flexibility and high-level reasoning with LLMs' consistency and extensive knowledge, InteractEval outperforms text evaluation baselines on a text summarization benchmark (SummEval) and an essay scoring benchmark (ELLIPSE). An in-depth analysis further shows that the framework promotes divergent thinking in both humans and LLMs, leading to a wider range of relevant attributes and improved text evaluation performance. A subsequent comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), whereas LLMs perform better at attributes related to external alignment (Consistency and Relevance). Leveraging both together produces the best evaluation outcomes, highlighting the need to effectively combine humans and LLMs in automated checklist-based text evaluation.
- Combines humans' and LLMs' Think-Aloud outcomes to construct checklists for text evaluation
Requirements (see `requirements.txt`):

```
accelerate
git+https://github.com/huggingface/transformers
jinja2>=3.1.0
openai==0.28.0
pandas
tiktoken
scipy
prettytable
google-generativeai
jupyter
anthropic
```
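Several of the listed SDKs expect API keys at runtime. Below is a minimal sketch of how keys might be supplied via environment variables; the variable names are the providers' conventional defaults, and the repository's actual key-loading mechanism may differ.

```python
import os

# Conventional environment variables read by each SDK's default client.
# Replace the placeholders with your own keys before running any code.
os.environ["OPENAI_API_KEY"] = "sk-..."         # used by the openai package
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # used by the anthropic package
os.environ["GOOGLE_API_KEY"] = "..."            # used by google-generativeai
```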
- Because generative models' outputs vary with conditions such as the sampling temperature and random seed, results produced by this code may differ from those reported in the paper.
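One way to reduce (though not eliminate) this run-to-run variance with the pinned `openai==0.28.0` client is to fix the sampling temperature. The snippet below is an illustrative sketch, not the repository's evaluation code; the model name and prompt are placeholders.

```python
import openai  # pinned at 0.28.0 in requirements.txt

openai.api_key = "sk-..."  # replace with your key

# temperature=0 makes decoding near-greedy and reduces variance,
# but LLM outputs can still differ across runs and model versions.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Evaluate the coherence of this summary: ..."}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```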
To reproduce the SummEval experiments:

1. Install the dependencies:

```
pip install -r requirements.txt
```

2. Generate a checklist by running `./summeval_checklist_construction.ipynb`.

3. Run the evaluation (see the driver sketch below for scoring all dimensions):

```
python ./src/main.py --model_name gpt-3.5-Turbo --dimension coherence
```
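To score all four SummEval dimensions discussed in the paper, a small driver loop could wrap the CLI. This is a hypothetical convenience script, assuming `main.py` accepts the same four dimension names used in the paper; it is not part of the repository.

```python
import subprocess

# The four SummEval dimensions analyzed in the paper.
DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

for dim in DIMENSIONS:
    # Invoke the evaluation entry point once per dimension.
    subprocess.run(
        ["python", "./src/main.py", "--model_name", "gpt-3.5-Turbo", "--dimension", dim],
        check=True,
    )
```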
- Language: Python
- Utilized LLMs: GPT-4, GPT-3.5-Turbo, Gemini-1.5-Pro, Llama-3.1-8B-Instruct, Claude-3.5-Sonnet
- Dependencies: refer to `requirements.txt`
- Datasets: SummEval, ELLIPSE
Repository structure:

```
InteractEval
├──assets
├──data
│   ├──ellipse
│   └──summeval
├──prompts
│   ├──ellipse
│   │   ├──checklist_construction
│   │   │   ├──attributes_clustering
│   │   │   ├──component_extraction
│   │   │   ├──question_generation
│   │   │   ├──question_validation
│   │   │   └──sub_question_generation
│   │   ├──evaluation
│   │   └──think_aloud
│   │       ├──claude
│   │       ├──gemini
│   │       ├──gpt
│   │       └──llama
│   └──summeval
├──src
├──summeval_think_aloud
└──ellipse_think_aloud
```
To use Llama-3.1-8B-Instruct locally, download the weights with Git LFS:

```
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
```
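Once cloned, the local checkpoint can be loaded with `transformers`. This is a minimal sketch assuming the clone sits in your working directory; the path and generation settings are illustrative, and loading the 8B model requires substantial GPU or CPU memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the local clone created by the git-lfs commands above.
model_path = "./Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",   # use the checkpoint's native precision
    device_map="auto",    # requires accelerate (listed in requirements.txt)
)

# Quick smoke test that the weights loaded correctly.
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```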