In this work, I will explore a text classification task that differs from traditional Sentiment Analysis (SA) and Named-Entity Recognition (NER). As part of the course Aplicaciones en Tecnologías del Lenguaje, we have been encouraged to review resources such as Kaggle and competitive evaluation forums in the NLP field, like IberLEF.
Given that much of my professional work in recent years has focused on developing agentic applications and chatbots using Large Language Models (LLMs), I found the IberAuTexTification task particularly interesting to explore.
The v1 and v2 versions of the dataset have not been pushed in order to save some space on GitHub. If you need them, contact me and I'll upload them.
The rise of LLMs like ChatGPT, LLaMA, Gemini, and more recently DeepSeek-V3 has revolutionized text generation across multiple languages, including those of the Iberian Peninsula, becoming integral to both individual and corporate workflows. However, their widespread availability and capabilities pose risks, as malicious users can exploit these models to generate disinformation, fake news, phishing content, and more, across various languages and domains. This growing threat highlights the urgent need for effective content moderation strategies that can detect machine-generated text (MGT) and identify the specific models behind it for forensic purposes. The IberAuTexTification task aims to advance research and development in this area, helping companies and organizations mitigate the risks posed by malicious LLM-generated content. There are two text classification tasks presented in this challenge:
- MGT Detection: a binary classification task that determines whether a text was generated by a machine or a human.
- Model Attribution: a multi-class classification task that determines which LLM generated a given text.
The task focuses on six main Iberian languages: Catalan, English, Spanish, Basque, Galician, and Portuguese.
The dataset (publicly available on HuggingFace) consists of 168k instances and includes human-written and machine-generated text across 7 domains: Chat, How-to, News, Literary, Reviews, Tweets, and Wikipedia.
The generations are obtained using six language models:
- meta-llama/Llama-2-70b-chat-hf
- cohere.command-text-v14
- ai21.j2-ultra-v1
- gpt-3.5-turbo-instruct
- mistralai/Mixtral-8x7B-Instruct-v0.1
- gpt-4
The scoreboard with the competition ranking can be found here.
A thorough analysis must be conducted of the task, the associated dataset, the proposed architectures, and the obtained results. The work follows these steps:
🔍 Data Analysis and Text Representation: Perform an initial exploratory analysis of the dataset, including visualizations, text analysis, class balance assessment, and issue identification with proposed solutions. Data cleaning and preprocessing could be carried out using normalization techniques suitable for machine learning (ML). Some text representation methods (one-hot encoding, weighting schemes, embeddings, etc.) will be explored, justifying the choice of each technique.
📏 Baseline Definition: Establish reference results using simple algorithms to compare against more complex models. Recommended baselines include classical ML techniques such as Naïve Bayes, Logistic Regression, Decision Trees, Random Forests, and/or SVMs (a minimal sketch of one such baseline is shown after this list).
🧠 Deep Learning (DL) Models: Train/fine-tune and validate neural network-based models, defining and justifying the proposed architectures.
📊 Result Analysis: Evaluate the learning process, hyperparameters, and performance of each model individually and comparatively. A critical analysis should be conducted to explain the results, concluding with a summary that compares all explored models and different fine-tuning techniques.
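To make the representation and baseline steps concrete, here is a minimal sketch of the kind of pipeline I have in mind for the binary MGT Detection subtask: a TF-IDF weighting scheme feeding a Logistic Regression classifier, evaluated with the usual classification metrics. The file paths, column names (text, label), and hyperparameters are assumptions for illustration; the actual preprocessing and splits are defined in the notebooks.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Assumed paths and column names; the real splits are produced by 1_data_exploration.ipynb.
base = "challenge/experiment_output/1_data_exploration/v1"
train = pd.read_csv(f"{base}/train.tsv.gzip", sep="\t", compression="gzip")
val = pd.read_csv(f"{base}/val.tsv.gzip", sep="\t", compression="gzip")

# Word/bigram TF-IDF representation feeding a linear classifier.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(train["text"], train["label"])

# Accuracy, precision, recall, and F1 per class on the validation split.
print(classification_report(val["label"], baseline.predict(val["text"])))
```

Swapping the final step of the pipeline for a Naïve Bayes, Random Forest, or SVM estimator gives the other recommended baselines while keeping the same representation.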
The project is organized into the following folders and files:
.
├── challenge
│   ├── experiment_output
│   │   ├── 1_data_exploration
│   │   │   ├── v1
│   │   │   │   ├── test.tsv.gzip
│   │   │   │   ├── train.tsv.gzip
│   │   │   │   └── val.tsv.gzip
│   │   │   ├── v2
│   │   │   │   ├── test.tsv.gzip
│   │   │   │   ├── train.tsv.gzip
│   │   │   │   └── val.tsv.gzip
│   │   │   ├── v3
│   │   │   │   ├── train.tsv.gzip
│   │   │   │   └── val.tsv.gzip
│   │   │   └── v4
│   │   │       ├── train.tsv.gzip
│   │   │       └── val.tsv.gzip
│   │   ├── 2_ml_baselines
│   │   ├── 3_dl_approaches
│   │   └── 4_results_and_conclusions
│   ├── notebooks
│   │   ├── 1_data_exploration.ipynb
│   │   ├── 2_ml_baselines.ipynb
│   │   ├── 3_dl_approaches.ipynb
│   │   └── 4_results_and_conclusions.ipynb
│   └── resources
│       └── MMTEB_best_encoders.png
├── README.md
└── requirements.txt
- experiment_output: Stores outputs generated in the notebooks, allowing them to be reused in subsequent flows. For example, the first notebook generates datasets that are later consumed by other notebooks for different models (a small sketch of this pattern is shown after this list).
- notebooks: The different parts of the activity are split into separate notebooks for better organization of the code.
- resources: Stores reference materials, such as images.
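As a small illustration of this reuse pattern (the directory and file names mirror the tree above; the column layout is an assumption), the first notebook persists its splits roughly like this, and later notebooks reload them:

```python
import os
import pandas as pd

out_dir = "challenge/experiment_output/1_data_exploration/v1"
os.makedirs(out_dir, exist_ok=True)

# In 1_data_exploration.ipynb: persist a cleaned split as a gzip-compressed TSV.
train_df = pd.DataFrame({"text": ["an example document"], "label": ["human"]})
train_df.to_csv(f"{out_dir}/train.tsv.gzip", sep="\t", index=False, compression="gzip")

# In later notebooks (e.g. 2_ml_baselines.ipynb): reload the same split.
train_df = pd.read_csv(f"{out_dir}/train.tsv.gzip", sep="\t", compression="gzip")
print(train_df.head())
```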
Notice that I didn't include a data folder, as the dataset can be directly downloaded and used as a HuggingFace dataset. Check the first notebook to see how this is done.
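For reference, loading the data from the Hub looks roughly like the snippet below; the dataset identifier is a placeholder, not the real one, and the first notebook contains the exact call used in this project.

```python
# Rough sketch of pulling the task data directly from the HuggingFace Hub.
# The dataset id is a placeholder; see 1_data_exploration.ipynb for the real one.
from datasets import load_dataset

dataset = load_dataset("org/iberautextification")  # hypothetical identifier
print(dataset)               # available splits and number of instances
print(dataset["train"][0])   # a single example with its text and labels
```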
To properly separate the logic and reuse outputs from some flows in others (such as dataset processing and cleaning), I will partition the implementation into multiple Jupyter Notebooks (see the Notebook Index).
Each notebook will focus on a specific task, such as data analysis and processing, defining simple ML baseline models, and finally, implementing more complex models like contextual models, Transformer-based architectures, and LLMs (if given enough time).
Additional considerations:
- 💡 To limit the scope of this project, I will base my approach on state-of-the-art architectures for this task. I could have explored simpler DL-based approaches such as RNNs, BiLSTMs, or CNNs, but Transformers will most likely outperform them (a minimal fine-tuning sketch follows these considerations).
- 👎 I will define multiple dataset versions based on the representation needs of each model or on augmentation techniques, but I will use only one cleaned-up version per case.
- 👎 I would have liked to explore the zero-shot performance of reasoning models (like DeepSeek or o3-mini-high) on this task, but I couldn't find enough time to do so.
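To illustrate the state-of-the-art approach mentioned in the first consideration, below is a minimal sketch of fine-tuning a multilingual Transformer encoder on the binary MGT Detection subtask. The checkpoint (xlm-roberta-base), the toy data, and the hyperparameters are illustrative assumptions; the architectures actually used, and their justification, live in 3_dl_approaches.ipynb.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

checkpoint = "xlm-roberta-base"  # assumption: any strong multilingual encoder would do
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# Toy examples; in practice the splits produced by 1_data_exploration.ipynb are used.
train_ds = Dataset.from_dict(
    {"text": ["a machine generated paragraph", "a human written paragraph"], "label": [1, 0]}
).map(tokenize, batched=True)
val_ds = Dataset.from_dict({"text": ["another paragraph"], "label": [0]}).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="mgt_detector",       # where checkpoints are written
    per_device_train_batch_size=16,
    num_train_epochs=2,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```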
The table below presents the set of notebooks used in this work, along with their objectives.
File Name | Objective | Description |
---|---|---|
1_data_exploration.ipynb | Data analysis, exploration, and preprocessing | Loads the dataset, performs initial exploration and data preprocessing, analyzes distribution and class balance, and proposes cleaning mechanisms. |
2_ml_baselines.ipynb | ML baselines: Naive Bayes, SVM, Random Forests, etc. | Defines baselines mainly through training and validating simple Machine Learning models. |
3_dl_approaches.ipynb | Deep Learning Models: Transformers, contextual. | Trains and validates systems based on contextual models, using Transformer-based architectures and some zero-shot and few-shot tests with LLMs. |
4_results_and_conclusions.ipynb | Comparative analysis of results | Compares the performance of models trained in previous flows, providing a comparison of metrics (accuracy, precision, recall, and F1; see the sketch after this table). Based on the comparison, a recommendation is made on the best model for production deployment. |
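As a small illustration of the comparison performed in the last notebook, the snippet below scores several models on the same validation labels with the metrics listed above. The prediction lists are placeholders; the real ones come from the outputs stored under experiment_output.

```python
# Placeholder predictions for two hypothetical models; real ones are loaded from disk.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 1, 0, 1]
predictions = {
    "tfidf_logreg": [0, 1, 0, 0, 1],
    "xlm_roberta":  [0, 1, 1, 0, 1],
}

for name, y_pred in predictions.items():
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    print(f"{name}: acc={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```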
As I was trying some experiments, I ended up executing different notebooks in different environments:
- For the ML-based models presented in 2_ml_baselines, I used my local laptop.
- For the DL-based models presented in 3_dl_approaches, I used a mix of Google Colab and PyCharm remote execution pointing to a remote AWS server with GPUs.
In any case, paths are configured by the RouteResolver class in the different notebooks.
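The actual RouteResolver implementation lives in the notebooks; the sketch below only illustrates the idea of resolving a base path per environment. The detection logic (a /content folder for Colab, a hypothetical REMOTE_GPU_SERVER variable for the AWS box) and the paths themselves are assumptions.

```python
import os
from pathlib import Path

class RouteResolver:
    """Resolve base paths depending on where a notebook is being executed."""

    def __init__(self):
        if os.path.isdir("/content"):                # Google Colab runtime
            self.base = Path("/content/drive/MyDrive/challenge")
        elif os.environ.get("REMOTE_GPU_SERVER"):    # remote AWS server (hypothetical flag)
            self.base = Path("/home/ubuntu/challenge")
        else:                                        # local laptop
            self.base = Path.cwd() / "challenge"

    def experiment_output(self, step: str) -> Path:
        """Return the output folder for a given notebook step, e.g. '1_data_exploration'."""
        return self.base / "experiment_output" / step

resolver = RouteResolver()
print(resolver.experiment_output("1_data_exploration"))
```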
To run the notebooks locally, it is necessary to set up an environment using conda or venv and install all dependencies imported in the various notebooks. Find the requirements file here.