In this work, I will explore a text classification task that differs from traditional Sentiment Analysis (SA) and Named-Entity Recognition (NER). As part of the course Aplicaciones en Tecnologías del Lenguaje, we have been encouraged to review resources such as Kaggle and competitive evaluation forums in the NLP field, like IberLEF.
Given that much of my professional work in recent years has focused on developing agentic applications and chatbots using Large Language Models (LLMs), I found the IberAuTexTification task particularly interesting to explore.
The v1 and v2 versions of the dataset have not been pushed in order to save some space on GitHub. If you need them, contact me and I'll upload them.
The rise of LLMs like ChatGPT, LLaMA, Gemini, and more recently DeepSeek-V3 has revolutionized text generation across multiple languages, including those of the Iberian Peninsula, becoming integral to both individual and corporate workflows. However, their widespread availability and capabilities pose risks, as malicious users can exploit these models to generate disinformation, fake news, phishing content, and more, across various languages and domains. This growing threat highlights the urgent need for effective content moderation strategies that can detect machine-generated text (MGT) and identify the specific models behind it for forensic purposes. The IberAuTexTification task aims to advance research and development in this area, helping companies and organizations mitigate the risks posed by malicious LLM-generated content. There are two text classification tasks presented in this challenge:
- MGT Detection: a binary classification task that determines whether a text was generated by a machine or a human.
- Model Attribution: a multi-class classification task that determines which LLM generated a given text.
The task focuses on six main Iberian languages: Catalan, English, Spanish, Basque, Galician, and Portuguese.
The dataset (publicly available on HuggingFace) consists of 168k instances and includes human-written and machine-generated text across 7 domains: Chat, How-to, News, Literary, Reviews, Tweets, and Wikipedia.
The generations are obtained using six language models:
- meta-llama/Llama-2-70b-chat-hf
- cohere.command-text-v14
- ai21.j2-ultra-v1
- gpt-3.5-turbo-instruct
- mistralai/Mixtral-8x7B-Instruct-v0.1
- gpt-4
The scoreboard with the competition ranking can be found here.
A thorough analysis must be conducted of the task, the associated dataset, the proposed architectures, and the obtained results. The work follows these steps:
🔍 Data Analysis and Text Representation: Perform an initial exploratory analysis of the dataset, including visualizations, text analysis, class balance assessment, and issue identification with proposed solutions. Data cleaning and preprocessing could be carried out using normalization techniques suitable for machine learning (ML). Some text representation methods (one-hot encoding, weighting schemes, embeddings, etc.) will be explored, justifying the choice of each technique.
📏 Baseline Definition: Establish reference results using simple algorithms to compare against more complex models. Recommended baselines include classical ML techniques such as Naïve Bayes, Logistic Regression, Decision Trees, Random Forests, and/or SVMs (a minimal sketch of one such baseline is shown after this list).
🧠 Deep Learning (DL) Models: Train/fine-tune and validate neural network-based models, defining and justifying the proposed architectures.
📊 Result Analysis: Evaluate the learning process, hyperparameters, and performance of each model individually and comparatively. A critical analysis should be conducted to explain the results, concluding with a summary that compares all explored models and different fine-tuning techniques.
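To make the representation and baseline steps concrete, here is a minimal sketch of the kind of pipeline I have in mind for the binary MGT Detection subtask: a TF-IDF weighting scheme feeding a Logistic Regression classifier, evaluated with the usual classification metrics. The file paths, column names (text, label), and hyperparameters are assumptions for illustration; the actual preprocessing and splits are defined in the notebooks.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Assumed paths and column names; the real splits are produced by 1_data_exploration.ipynb.
base = "challenge/experiment_output/1_data_exploration/v1"
train = pd.read_csv(f"{base}/train.tsv.gzip", sep="\t", compression="gzip")
val = pd.read_csv(f"{base}/val.tsv.gzip", sep="\t", compression="gzip")

# Word/bigram TF-IDF representation feeding a linear classifier.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(train["text"], train["label"])

# Accuracy, precision, recall, and F1 per class on the validation split.
print(classification_report(val["label"], baseline.predict(val["text"])))
```

Swapping the final step of the pipeline for a Naïve Bayes, Random Forest, or SVM estimator gives the other recommended baselines while keeping the same representation.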
The project is organized into the following folders and files:
.
├── challenge
│   ├── experiment_output
│   │   ├── 1_data_exploration
│   │   │   ├── v1
│   │   │   │   ├── test.tsv.gzip
│   │   │   │   ├── train.tsv.gzip
│   │   │   │   └── val.tsv.gzip
│   │   │   ├── v2
│   │   │   │   ├── test.tsv.gzip
│   │   │   │   ├── train.tsv.gzip
│   │   │   │   └── val.tsv.gzip
│   │   │   ├── v3
│   │   │   │   ├── train.tsv.gzip
│   │   │   │   └── val.tsv.gzip
│   │   │   └── v4
│   │   │       ├── train.tsv.gzip
│   │   │       └── val.tsv.gzip
│   │   ├── 2_ml_baselines
│   │   ├── 3_dl_approaches
│   │   └── 4_results_and_conclusions
│   ├── notebooks
│   │   ├── 1_data_exploration.ipynb
│   │   ├── 2_ml_baselines.ipynb
│   │   ├── 3_dl_approaches.ipynb
│   │   └── 4_results_and_conclusions.ipynb
│   └── resources
│       └── MMTEB_best_encoders.png
├── README.md
└── requirements.txt
- experiment_output: Stores outputs generated in the notebooks, allowing them to be reused in subsequent flows. For example, the first notebook generates datasets that are later consumed by other notebooks for different models (a small sketch of this pattern is shown after this list).
- notebooks: The different parts of the activity are split into separate notebooks for better organization of the code.
- resources: Stores reference materials, such as images.
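As a small illustration of this reuse pattern (the directory and file names mirror the tree above; the column layout is an assumption), the first notebook persists its splits roughly like this, and later notebooks reload them:

```python
import os
import pandas as pd

out_dir = "challenge/experiment_output/1_data_exploration/v1"
os.makedirs(out_dir, exist_ok=True)

# In 1_data_exploration.ipynb: persist a cleaned split as a gzip-compressed TSV.
train_df = pd.DataFrame({"text": ["an example document"], "label": ["human"]})
train_df.to_csv(f"{out_dir}/train.tsv.gzip", sep="\t", index=False, compression="gzip")

# In later notebooks (e.g. 2_ml_baselines.ipynb): reload the same split.
train_df = pd.read_csv(f"{out_dir}/train.tsv.gzip", sep="\t", compression="gzip")
print(train_df.head())
```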
Notice that I didn't include a data folder, as the dataset can be directly downloaded and used as a HuggingFace dataset. Check the first notebook to see how this is done.
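For reference, loading the data from the Hub looks roughly like the snippet below; the dataset identifier is a placeholder, not the real one, and the first notebook contains the exact call used in this project.

```python
# Rough sketch of pulling the task data directly from the HuggingFace Hub.
# The dataset id is a placeholder; see 1_data_exploration.ipynb for the real one.
from datasets import load_dataset

dataset = load_dataset("org/iberautextification")  # hypothetical identifier
print(dataset)               # available splits and number of instances
print(dataset["train"][0])   # a single example with its text and labels
```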
To properly separate the logic and reuse outputs from some flows in others (such as dataset processing and cleaning), I will partition the implementation into multiple Jupyter Notebooks (see the Notebook Index).
Each notebook will focus on a specific task, such as data analysis and processing, defining simple ML baseline models, and finally, implementing more complex models like contextual models, Transformer-based architectures, and LLMs (if given enough time).
Additional considerations:
- 💡 To limit the scope of this project, I will base my approach on state-of-the-art architectures for this task. I could have explored simpler DL-based approaches such as RNNs, BiLSTMs, or CNNs, but Transformers will most likely outperform them (a minimal fine-tuning sketch follows these considerations).
- 👎 I will define multiple dataset versions based on the representation needs of each model or on augmentation techniques, but I will use only one cleaned-up version per case.
- 👎 I would have liked to explore the zero-shot performance of reasoning models (like DeepSeek or o3-mini-high) on this task, but I couldn't find enough time to do so.
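To illustrate the state-of-the-art approach mentioned in the first consideration, below is a minimal sketch of fine-tuning a multilingual Transformer encoder on the binary MGT Detection subtask. The checkpoint (xlm-roberta-base), the toy data, and the hyperparameters are illustrative assumptions; the architectures actually used, and their justification, live in 3_dl_approaches.ipynb.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "xlm-roberta-base"  # assumption: any strong multilingual encoder would do
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Toy examples; in practice the splits produced by 1_data_exploration.ipynb are used.
train_ds = Dataset.from_dict(
    {"text": ["a machine generated paragraph", "a human written paragraph"], "label": [1, 0]}
).map(tokenize, batched=True)
val_ds = Dataset.from_dict({"text": ["another paragraph"], "label": [0]}).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="mgt_detector",       # where checkpoints are written
    per_device_train_batch_size=16,
    num_train_epochs=2,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```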
The table below presents the set of notebooks used in this work, along with their objectives.
File Name | Objective | Description |
---|---|---|
1_data_exploration.ipynb | Data analysis, exploration, and preprocessing | Loads the dataset, performs initial exploration and data preprocessing, analyzes distribution and class balance, and proposes cleaning mechanisms. |
2_ml_baselines.ipynb | ML baselines: Naive Bayes, SVM, Random Forests, etc. | Defines baselines mainly through training and validating simple Machine Learning models. |
3_dl_approaches.ipynb | Deep Learning Models: Transformers, contextual. | Trains and validates systems based on contextual models, using Transformer-based architectures and some zero-shot and few-shot tests with LLMs. |
4_results_and_conclusions.ipynb | Comparative analysis of results | Compares the performance of models trained in previous flows, providing a comparison of metrics (accuracy, precision, recall, and F1; see the sketch after this table). Based on the comparison, a recommendation is made on the best model for production deployment. |
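As a small illustration of the comparison performed in the last notebook, the snippet below scores several models on the same validation labels with the metrics listed above. The prediction lists are placeholders; the real ones come from the outputs stored under experiment_output.

```python
# Placeholder predictions for two hypothetical models; real ones are loaded from disk.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 1, 0, 1]
predictions = {
    "tfidf_logreg": [0, 1, 0, 0, 1],
    "xlm_roberta":  [0, 1, 1, 0, 1],
}

for name, y_pred in predictions.items():
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    print(f"{name}: acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```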
As I was trying some experiments, I ended up executing different notebooks in different environments:
- For the ML-based models presented in 2_ml_baselines, I used my local laptop.
- For the DL-based models presented in 3_dl_approaches, I used a mix of Google Colab and PyCharm remote execution pointing to a remote AWS server with GPUs.
In any case, paths are configured by the RouteResolver class in the different notebooks.
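The actual RouteResolver implementation lives in the notebooks; the sketch below only illustrates the idea of resolving a base path per environment. The detection logic (a /content folder for Colab, a hypothetical REMOTE_GPU_SERVER variable for the AWS box) and the paths themselves are assumptions.

```python
import os
from pathlib import Path

class RouteResolver:
    """Resolve base paths depending on where a notebook is being executed."""

    def __init__(self):
        if os.path.isdir("/content"):                # Google Colab runtime
            self.base = Path("/content/drive/MyDrive/challenge")
        elif os.environ.get("REMOTE_GPU_SERVER"):    # remote AWS server (hypothetical flag)
            self.base = Path("/home/ubuntu/challenge")
        else:                                        # local laptop
            self.base = Path.cwd() / "challenge"

    def experiment_output(self, step: str) -> Path:
        """Return the output folder for a given notebook step, e.g. '1_data_exploration'."""
        return self.base / "experiment_output" / step

resolver = RouteResolver()
print(resolver.experiment_output("1_data_exploration"))
```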
To run the notebooks locally, it is necessary to set up an environment using conda or venv and install all dependencies imported in the various notebooks. Find the requirements file here.