Author - Gautham Satyanarayana
Email - gsaty@uic.edu
UIN - 659368048
As part of UIC CS441 (Engineering Distributed Objects for Cloud Computing), this project demonstrates how to train an LLM.
For the second homework, we build a Spark job that tokenizes an input text corpus, generates sliding-window training examples,
trains an LLM in a distributed fashion, and monitors the training metrics.
Video Link: https://youtu.be/JOJytdxPjxQ
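The sliding-window step described above can be sketched in plain Scala. This is a simplified illustration only: the window size, slide length, and the exact input/target layout here are assumptions for demonstration (the real job uses the parameters listed below and feeds DL4J datasets).

```scala
object WindowDemo {
  // Build (input, target) training examples: each window of `windowSize`
  // consecutive token ids predicts the token that immediately follows it.
  // The actual job uses windowSize = 300 and slide = 100 (see Parameters).
  def makeExamples(tokens: Vector[Int], windowSize: Int, slide: Int): Vector[(Vector[Int], Int)] =
    (0 to tokens.length - windowSize - 1 by slide).toVector
      .map(i => (tokens.slice(i, i + windowSize), tokens(i + windowSize)))

  def main(args: Array[String]): Unit = {
    // Toy corpus of 10 token ids, window of 4, slide of 2.
    makeExamples((1 to 10).toVector, windowSize = 4, slide = 2).foreach(println)
  }
}
```

With 10 tokens, a window of 4, and a slide of 2, this yields three overlapping examples, which matches how 2,678 windows arise from the full corpus at the real parameter values.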
- macOS (ARM64)
- IntelliJ IDEA 2024.2.1
- Spark 3.5.2
- Scala v2.13.12
- SBT v1.10.2
- Parameters:
- Window Size - 300
- Slide Length - 100
- Total number of sliding windows - 2678
- Layers - LSTM and RNN layers
- Vocabulary - ~29,000 words
- Frameworks:
- Tokenization - JTokkit
- Training - DL4J on Spark (dl4j-spark)
- Testing - munit
- Config - Typesafe Config
- Logging - SLF4J
- Dataset:
- Ulysses Text from Project Gutenberg
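The actual tokenization uses JTokkit's BPE encodings, but the core idea of mapping a vocabulary of ~29,000 words to integer ids can be sketched with a simplified word-level tokenizer. Everything below (object and method names, whitespace splitting) is a hypothetical stand-in, not the project's real code:

```scala
object VocabDemo {
  // Simplified whitespace tokenizer: lowercases, splits on non-word
  // characters, and maps each distinct word to an integer id.
  // (A stand-in for JTokkit's byte-pair encoding.)
  def buildVocab(corpus: String): Map[String, Int] =
    corpus.toLowerCase
      .split("\\W+")
      .filter(_.nonEmpty)
      .distinct
      .zipWithIndex
      .toMap

  // Encode a text as the sequence of ids of its in-vocabulary words.
  def encode(text: String, vocab: Map[String, Int]): Seq[Int] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq.flatMap(vocab.get)
}
```

The token-id sequences produced this way are what the sliding-window stage consumes.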
The design is explained in the YouTube video linked above.
Tests are found in /src/test/scala/Suite.scala, or run them from the project root:
sbt test
Results, such as the training stats and the trained model itself, can be found in /results.
- Clone this repository, cd into the project root, and run
sbt update
sbt clean compile assembly
This installs all the dependencies and creates a jar file named hw2.jar in the target/scala-2.12 directory.
- Make sure Spark is running, then submit the jar as a Spark job:
spark-submit \
--class Main \
--driver-memory 12g \
--master "local[*]" \
target/scala-2.12/hw2.jar \
src/main/resources/ulyss12-sharded.txt \
src/main/resources/embeddings.txt \
<output path for saving training stats> \
<output path for saving the model>
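The four positional arguments after the jar are the corpus file, the embeddings file, the training-stats output path, and the model output path, in that order. A minimal sketch of how such arguments might be validated (the names here are hypothetical; the project's actual Main may handle this differently):

```scala
object MainArgs {
  // The spark-submit above passes four positional arguments.
  final case class Args(corpus: String, embeddings: String, statsOut: String, modelOut: String)

  // Return the parsed arguments, or an error message if the count is wrong.
  def parse(args: Array[String]): Either[String, Args] = args match {
    case Array(corpus, embeddings, statsOut, modelOut) =>
      Right(Args(corpus, embeddings, statsOut, modelOut))
    case other =>
      Left(s"expected 4 arguments, got ${other.length}")
  }
}
```

Failing fast on a bad argument count avoids launching a long Spark job that would only crash later when a path is missing.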