This is the official PyTorch implementation of the paper *ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations*.
Visual text rendering is a challenging task, especially when precise font control is desired. This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations.
- **Font controls require no font label annotations:** A text segmentation model can capture nuanced font information in pixel space without requiring font label annotations in the dataset, enabling zero-shot generation on unseen languages and fonts, as well as scalable training on web-scale image datasets as long as they contain text.
- **Evaluating ambiguous fonts in the open world:** Fuzzy font accuracy can be measured in the embedding space of a pretrained font classification model, utilizing our proposed metrics `l2@k` and `cos@k` (see the first sketch after this list).
- **Supporting user-driven design flexibility:** Random perturbations can be applied to segmented glyphs (see the second sketch after this list). While this does not affect the rendered text quality, it accounts for users not precisely aligning text to the best locations and prevents the model from rigidly replicating the pixel locations in glyphs.
- **Working with foundation models:** With limited computational resources, we can still copilot foundational image generation models to perform localized text and font editing.
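The exact top-k protocol for `l2@k` and `cos@k` lives in the paper and in `eval/eval_font.sh`; the snippet below is only a minimal sketch of one plausible reading: embed each generated glyph and its paired ground-truth glyph with a pretrained font classifier, then check whether the paired ground truth falls among the `k` nearest ground-truth embeddings under l2 or cosine distance. The function name and retrieval setup here are illustrative assumptions, not the repository's API.

```python
import numpy as np

def l2_cos_at_k(gen_emb, gt_emb, k=5):
    """Sketch of l2@k / cos@k: fraction of generated glyphs whose paired
    ground-truth embedding (row i <-> row i) is within the k nearest
    ground-truth embeddings of the generated one."""
    # Pairwise l2 distances, shape (N, N).
    d_l2 = np.linalg.norm(gen_emb[:, None, :] - gt_emb[None, :, :], axis=-1)
    # Pairwise cosine distances on unit-normalized embeddings.
    g = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    t = gt_emb / np.linalg.norm(gt_emb, axis=1, keepdims=True)
    d_cos = 1.0 - g @ t.T
    n = len(gen_emb)
    l2_hits = sum(i in np.argsort(d_l2[i])[:k] for i in range(n))
    cos_hits = sum(i in np.argsort(d_cos[i])[:k] for i in range(n))
    return l2_hits / n, cos_hits / n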
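```

Likewise, the glyph perturbation can be pictured as a small random translation of the segmented glyph mask. The perturbation scheme and magnitude actually used in training are not pinned down in this README, so `max_shift` and the function name below are illustrative only.

```python
import numpy as np

def perturb_glyph(glyph, max_shift=8, rng=None):
    """Randomly translate a segmented glyph mask (H, W) by up to
    max_shift pixels in each direction, padding with background."""
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.zeros_like(glyph)
    h, w = glyph.shape[:2]
    # Copy only the region that stays inside the canvas after shifting.
    ys, xs = slice(max(dy, 0), h + min(dy, 0)), slice(max(dx, 0), w + min(dx, 0))
    yt, xt = slice(max(-dy, 0), h + min(-dy, 0)), slice(max(-dx, 0), w + min(-dx, 0))
    out[ys, xs] = glyph[yt, xt]
    return out
```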
If our work inspires you, please consider citing it. Thank you!
```bibtex
@article{jiang2025controltext,
  title={ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations},
  author={Jiang, Bowen and Yuan, Yuan and Bai, Xinyi and Hao, Zhuoqun and Yin, Alyson and Hu, Yaojie and Liao, Wenyu and Ungar, Lyle and Taylor, Camillo J},
  journal={arXiv preprint arXiv:2502.10999},
  year={2025}
}
```
Our repository is based on the code of AnyText. We build upon and extend it to enable user-controllable fonts in a zero-shot manner. Below is a brief walkthrough:
- **Prerequisites:** We use a conda environment to manage all required packages.

  ```bash
  conda env create -f environment.yml
  conda activate controltext
  ```
- **Preprocess Glyphs:** Run text segmentation over your training images to obtain the grayscale glyph conditions (an illustrative sketch follows below).
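  The actual preprocessing pipeline ships with the repository; as a rough illustration only, the loop below applies a text segmentation step to a folder of images and saves grayscale glyph masks. The folder layout, file naming, and the trivial threshold-based `segment_text` stand-in are all assumptions; in practice a pretrained text segmentation model is used.

  ```python
  import os
  import numpy as np
  from PIL import Image

  def segment_text(rgb):
      """Stand-in for a pretrained text segmentation model: naive
      dark-pixel thresholding. Replace with a real segmenter."""
      gray = np.asarray(Image.fromarray(rgb).convert("L"))
      return np.where(gray < 128, 255, 0).astype(np.uint8)

  def preprocess_glyphs(image_dir, glyph_dir):
      """Save a grayscale glyph mask for every image in image_dir."""
      os.makedirs(glyph_dir, exist_ok=True)
      for name in os.listdir(image_dir):
          rgb = np.asarray(Image.open(os.path.join(image_dir, name)).convert("RGB"))
          mask = segment_text(rgb)  # (H, W) uint8, text pixels bright
          Image.fromarray(mask).save(os.path.join(glyph_dir, name))
  ```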
- **Configuration:**
  - Adjust hyperparameters such as `batch_size`, `grad_accum`, `learning_rate`, `logger_freq`, and `max_epochs` in the training script `train.py`. Please keep `mask_ratio = 1`. An illustrative settings block is sketched below.
  - Set paths for GPUs, checkpoints, the model configuration file, image datasets, and preprocessed glyphs accordingly.
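  For orientation, here is what the settings block in `train.py` might look like; the variable names come from the list above, but the values shown are placeholders, not the repository's defaults.

  ```python
  # Illustrative hyperparameter block for train.py -- values are placeholders.
  batch_size = 4        # per-GPU batch size
  grad_accum = 2        # gradient-accumulation steps
  learning_rate = 1e-5  # optimizer learning rate
  logger_freq = 1000    # log samples every N training steps
  max_epochs = 10       # total training epochs
  mask_ratio = 1        # keep fixed at 1, per the note above
  ```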
- **Training Command:** Run the training script:

  ```bash
  python train.py
  ```
The front-end code for user-friendly text and font editing is coming soon! Stay tuned for updates as we continue to enhance the project.
- **Our Generated Data**
  - `laion_controltext` (Google Drive), `laion_controltext_gly_lines` (cropped regions for each line of text from the entire image; Google Drive), `laion_controltext_gly_lines_grayscale` (`laion_controltext_gly_lines` after text segmentation; Google Drive), `laion_gly_lines_gt` (cropped regions from input glyphs after text segmentation; Google Drive)
  - `wukong_controltext` (Google Drive), `wukong_controltext_gly_line` (Google Drive), `wukong_controltext_glylines_grayscale` (Google Drive), `wukong_gly_lines_gt` (Google Drive)
- **Our Model Checkpoint**
- **Script for evaluating text accuracy:**

  Run the following script to calculate the SenACC and NED scores for text accuracy, which will evaluate `laion_controltext_gly_lines` and `wukong_controltext_gly_line`:

  ```bash
  bash eval/eval_dgocr.sh
  ```

  Run the following script to calculate the FID score for overall image quality, which will evaluate `laion_controltext` and `wukong_controltext`:

  ```bash
  bash eval/eval_fid.sh
  ```
- **Script for evaluating font accuracy in the open world:**

  Run the following script to calculate the font accuracy:

  ```bash
  bash eval/eval_font.sh --generated_folder path/to/your/generated_folder --gt_folder path/to/your/gt_folder
  ```

  In the arguments, `path/to/your/generated_folder` should point to the directory containing your generated images, for example, `laion_controltext_gly_lines_grayscale` or `wukong_controltext_glylines_grayscale`. Similarly, `path/to/your/gt_folder` should refer to the directory containing the ground-truth glyph images, i.e., the segmented glyphs used as input conditions; we use `laion_gly_lines_gt` or `wukong_gly_lines_gt`.