The Factorization Machine (FM) is a matrix-based machine learning algorithm proposed by Steffen Rendle. It can predict any real-valued vector. Its main advantages include: 1) capable of highly sparse data scenarios; 2) linear computational complexity.
- Factorization Machine Model
where: is the dot product of two k-dimensional vectors:
Model parameters are:
where
indicates that the feature i is represented by k factors, and k is the hyperparameter that determines the factorization。
FM can be used for a series of predictive tasks, such as:
- classification:
can be used directly as a predictor, and the optimization criterion is to minimize the least square difference。
- regression:can use the symbol of
to do classification prediction,parameters are estimated by hinge function or logistic regression function at any time.。
-
FM algorithm model The model of the FM algorithm consists of two parts, wide and embedding, where wide is a typical linear model. The final output is the sum of the two parts of wide and embedding.
-
FM training process Angel implements the gradient descent method to optimize, iteratively trains the FM model, and the logic on each iteration of the worker and PS is as follows:
- worker:Pull the wide and embedding matrices from the PS to the local for each iteration, calculate the corresponding gradient update value, push to PS
- PS:PS summarizes the gradient update values pushed by all workers, averages them, calculates and updates the new wide and embedding models through the optimizer.
-
FM prediction result:
- format:rowID,pred,prob,label
- caption:rowID indicates the row ID of the sample, starting from 0; pred: the predicted value of the sample; prob: the probability of the sample relative to the predicted result; label: the category into which the predicted sample is classified, when the predicted result value pred is greater than 0, Label is 1, less than 0 is -1
-
data format support Libsvm and dummy two data formats, libsvm format is as follows:
1 1:1 214:1 233:1 234:1
dummy data format:
1 1 214 233 234
Several steps must be done before editing the submitting script and running.
- confirm Hadoop and Spark have ready in your environment
- unzip sona--bin.zip to local directory (SONA_HOME)
- upload sona--bin directory to HDFS (SONA_HDFS_HOME)
- Edit $SONA_HOME/bin/spark-on-angel-env.sh, set SPARK_HOME, SONA_HOME, SONA_HDFS_HOME and ANGEL_VERSION
Here's an example of submitting scripts, remember to adjust the parameters and fill in the paths according to your own task.
#test description
actionType=train or predict
jsonFile=path-to-jsons/fm.json
modelPath=path-to-save-model
predictPath=path-to-save-predict-results
input=path-to-data
queue=your-queue
HADOOP_HOME=my-hadoop-home
source ./bin/spark-on-angel-env.sh
export HADOOP_HOME=$HADOOP_HOME
$SPARK_HOME/bin/spark-submit \
--master yarn-cluster \
--conf spark.ps.jars=$SONA_ANGEL_JARS \
--conf spark.ps.instances=10 \
--conf spark.ps.cores=2 \
--conf spark.ps.memory=10g \
--jars $SONA_SPARK_JARS \
--files $jsonFile \
--driver-memory 20g \
--num-executors 20 \
--executor-cores 5 \
--executor-memory 30g \
--queue $queue \
--class org.apache.spark.angel.examples.JsonRunnerExamples \
./lib/angelml-$SONA_VERSION.jar \
jsonFile:./fm.json \
dataFormat:libsvm \
data:$input \
modelPath:$modelPath \
predictPath:$predictPath \
actionType:$actionType \
numBatch:500 \
maxIter:2 \
lr:4.0 \
numField:39