A pipeline for designing the mutagenizing oligonucleotides required for Tile Region Exchange (T-Rex) Mutagenesis, a crucial step in deep mutational scanning (DMS), by harnessing data-optimized assembly design (DAD).
Deep Mutational Scanning (DMS) is a technique that combines mutagenesis,
high-throughput sequencing, and functional assays to measure the effects
of a large number of genetic variants on a protein’s function or
phenotype. In order to assess variants, generating a library of mutants
is necessary. Tile Region Exchange (T-Rex) Mutagenesis is a
high-throughput method to generate all possible single nucleotide
variants (SNVs) or missense mutations in a gene. Here is an overview of
T-Rex Mutagenesis. However, deriving the assembly
design by hand is error-prone and time-consuming, because the
fidelity of the resulting oligonucleotides does not depend on simple
rules alone. Therefore, a data-driven solution is required for designing
oligonucleotides with a high rate of correct assembly (Pryor et al.,
2020). Although there are
web tools currently available for similar mutagenesis procedures, few of
them utilize Data-optimized Assembly Design.
TRexDAD is a streamlined tool that facilitates oligonucleotide design
for T-Rex by harnessing DAD techniques, along with visualization of the
optimization process.
The package is for researchers who perform Tile Region Exchange
Mutagenesis for some genes of interest and need an efficient and
reliable pipeline to produce mutagenizing oligonucleotides. The tool
includes functions performing data processing, overhang fidelity score
calculations, and optimization of gene tile locations, along with a
visualization of the change in fidelity score of resulting overhangs
from oligonucleotide designs. The scope of the R package is to improve
the workflow in T-Rex Mutagenesis with respect to efficiency and
accuracy, facilitating open-source research for interested
bioinformaticians and computational biologists. The TRexDAD package was
developed using R version 4.3.1 (2023-06-16), Platform:
aarch64-apple-darwin20 (64-bit), and Running under: macOS Sonoma 14.1.1.
To install the latest version of the package:
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("devtools")
library("devtools")
devtools::install_github("yunyicheng/TRexDAD", build_vignettes = TRUE)
library("TRexDAD")
To run the Shiny app:
TRexDAD::run_TRexDAD()
ls("package:TRexDAD")
browseVignettes("TRexDAD")
TRexDAD contains 11 functions. Since 10 of them are parts of an
integrated pipeline, they all go into one R script, pipeline.R. The
function run_TRexDAD is a runner for the Shiny app, so it goes into
run_TRexDAD.R. Here are the workflow illustration and a detailed
explanation of the functions in pipeline.R.
- execute_and_plot is the workflow function that wraps most of the other functions. It performs an optimization process to determine the optimal tile positions in a gene sequence and plots the progression of the score over iterations. It provides two customization parameters: iteration_max and scan_rate. Unless otherwise specified, these parameters take their default values: iteration_max=30, scan_rate=7.
- split_into_codons takes a gene sequence (as a string) and splits it into codons.
- oligo_cost calculates the cost of oligonucleotides (oligos) based on the number of tiles and the number of codons.
- calculate_optimal_tiles calculates the oligo cost for a range of tile numbers and finds the number of tiles that minimizes this cost. It also computes the global length of the tiles.
- get_overhangs calculates the overhang sequences at the specified positions in a gene sequence. Depending on the flag, it returns these sequences as either DNAString objects or plain character strings.
- get_all_overhangs computes overhang sequences for a given list of positions in a gene sequence. It iteratively calls the get_overhangs function to calculate the head and tail overhangs for each position and accumulates them in a list.
- calculate_local_score calculates a score for a specific tile within a gene sequence. The score is based on palindromicity, length variation from a global standard, and on-target reactivity, using a pre-defined overhang fidelity data frame.
- calculate_global_score calculates a global score for a list of positions in a gene sequence. The score takes into account off-target reactions, repetitions, and the sum of local scores (by calling calculate_local_score).
- pick_position selects a position from a list of positions for optimization based on their scores. It uses a weighted sampling approach where the weights are inversely proportional to the scores of the positions.
- optimize_position optimizes a single position within a list of positions. It adjusts the specified position to maximize the overall score (obtained via calculate_global_score). The function supports different optimization modes, including greedy and Markov Chain Monte Carlo (MCMC) approaches.
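The selection and acceptance steps described for pick_position and optimize_position can be illustrated with a minimal base-R sketch. It is not the package's implementation: the function names pick_index_sketch and accept_move_sketch, the temperature parameter, and the assumption that scores are strictly positive are all hypothetical choices made for illustration. The MCMC branch uses the standard Metropolis acceptance rule (Hastings, 1970), which may differ from what the package does internally.

```r
# Hypothetical sketch, not the TRexDAD implementation.

# Pick the index of one position to optimize. Weights are proportional to
# 1 / score, so low-scoring positions are chosen more often.
# Assumption: all scores are strictly positive.
pick_index_sketch <- function(scores) {
  stopifnot(is.numeric(scores), length(scores) > 0, all(scores > 0))
  weights <- 1 / scores
  # sample.int() avoids the sample() pitfall for length-one input.
  sample.int(length(scores), size = 1, prob = weights)
}

# Decide whether to accept a candidate move when a higher score is better.
# "greedy": accept only if the score does not decrease.
# "mcmc":   Metropolis rule; a worse candidate is accepted with
#           probability exp((candidate - current) / temperature).
accept_move_sketch <- function(current_score, candidate_score,
                               mode = c("greedy", "mcmc"),
                               temperature = 1) {
  mode <- match.arg(mode)
  if (candidate_score >= current_score) {
    return(TRUE)
  }
  if (mode == "greedy") {
    return(FALSE)
  }
  stats::runif(1) < exp((candidate_score - current_score) / temperature)
}

# Example: position 1 has the lowest score, so it is picked most often.
set.seed(1)
picked <- pick_index_sketch(c(1, 9, 9))
accepted <- accept_move_sketch(10, 8, mode = "mcmc", temperature = 2)
```

The greedy mode can stall in a local optimum because it never accepts a worse design, whereas the Metropolis rule occasionally does, which is the usual motivation for offering both modes.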
The author of the package is Yunyi Cheng.
The author wrote the execute_and_plot function, which performs an optimization process to determine the optimal tile positions in a gene sequence and plots the change in the score over iterations.
Other user-accessible functions were also developed by the author. Most of them serve as helper functions for the execute_and_plot function, but they can also be used independently for specific tasks.
The Biostrings R package is used for calculating the reverse complement
of overhangs in the get_overhangs function.
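For readers unfamiliar with the operation, the reverse complement can be sketched in base R without any dependency. The package itself relies on Biostrings (which provides reverseComplement() for DNAString objects); the helpers below, reverse_complement_sketch and is_palindromic_sketch, are hypothetical names used only for illustration. The second helper shows one common way to express the palindromicity criterion mentioned for calculate_local_score: an overhang is palindromic when it equals its own reverse complement.

```r
# Hypothetical base-R sketch, not the TRexDAD implementation.

# Reverse complement of a single DNA string (A, C, G, T only).
reverse_complement_sketch <- function(dna) {
  stopifnot(is.character(dna), length(dna) == 1)
  # Complement each base, then reverse the order of the characters.
  complemented <- chartr("ACGT", "TGCA", toupper(dna))
  paste(rev(strsplit(complemented, "", fixed = TRUE)[[1]]), collapse = "")
}

# An overhang is palindromic if it equals its own reverse complement.
is_palindromic_sketch <- function(overhang) {
  toupper(overhang) == reverse_complement_sketch(overhang)
}

reverse_complement_sketch("AGGT")  # "ACCT"
is_palindromic_sketch("GATC")      # TRUE
is_palindromic_sketch("AGGT")      # FALSE
```

Palindromic overhangs are generally avoided in Golden Gate-style assemblies because a fragment carrying one can ligate to another copy of itself.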
The forstringr R package is used for manipulating the split target gene
(a list of codons) in the get_overhangs function.
The ggplot2 and scales R packages are used to produce the plot of the
optimization progress. They are used in execute_and_plot and the
vignette.
The shiny and shinythemes R packages are used in app.R for Shiny app
development and formatting.
The utils R package (included in R) is used for producing combinations
of overhangs and reading the CSV file overhang_fidelity.csv. It is used
in the calculate_global_score function and the first section of
pipeline.R.
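As a small illustration of how utils can enumerate overhang combinations, the sketch below lists every unordered pair from a set of overhangs with utils::combn and flags repeated overhangs with anyDuplicated. The four overhang sequences are arbitrary placeholders, and how the package scores each pair against the fidelity data is not shown here.

```r
# Hypothetical sketch; the overhang sequences are arbitrary placeholders.
overhangs <- c("AGGT", "CTTA", "GGAC", "TCAA")

# Every unordered pair of overhangs: a 2 x choose(n, 2) character matrix,
# one pair per column.
overhang_pairs <- utils::combn(overhangs, 2)

# TRUE if any overhang occurs more than once in the design.
has_repeats <- anyDuplicated(overhangs) > 0

ncol(overhang_pairs)  # 6 pairs for 4 overhangs
has_repeats           # FALSE
```

Each column of overhang_pairs is a candidate off-target pairing that a global score could check against the fidelity table.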
overhang_fidelity.csv is from the study by New England Biolabs (Pryor
et al., 2020).
- Pryor JM, Potapov V, Kucera RB, Bilotti K, Cantor EJ, Lohman GJS (2020). Enabling one-pot Golden Gate assemblies of unprecedented complexity using data-optimized assembly design. PLOS ONE, 15(9), e0238592. doi:10.1371/journal.pone.0238592
- Potapov V, Ong JL, Kucera RB, Langhorst BW, Bilotti K, Pryor JM, Cantor EJ, Canton B, Knight TF, Evans TC Jr, Lohman GJS (2018). Comprehensive profiling of four base overhang ligation fidelity by T4 DNA ligase and application to DNA assembly. ACS Synthetic Biology, 7(11), 2665–2674.
- Eremin, I. I. et al. 2-Approximation algorithm for finding a clique with minimum weight of vertices and edges (2014). Proc. Steklov Inst. Math., 284, 87–95. doi:10.1134/S0081543814040101
- Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications (1970). Biometrika 57, 97–109.
- Alberts B., Johnson A., Lewis J., et al. Molecular Biology of the Cell (2014). Garland Science. 6th edition.
- Pagès H, Aboyoun P, Gentleman R, DebRoy S (2023). Biostrings: Efficient manipulation of biological strings. Bioconductor. doi:10.18129/B9.bioc.Biostrings, R package version 2.68.1, Biostrings.
- Ogundepo E (2023). forstringr: String Manipulation Package for Those Familiar with ‘Microsoft Excel’. R package version 1.0.0, CRAN.
- Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. doi:10.1007/978-3-319-24277-4, R package version 3.3.5, ggplot2.
- Chang W, Cheng J, Allaire JJ, Xie Y, McPherson J (2021). shiny: Web Application Framework for R. R package version 1.7.1, shiny.
- Chang W, Ribeiro BB (2021). shinythemes: Themes for Shiny. R package version 1.2.0, shinythemes.
- Wickham H, Seidel D (2020). scales: Scale Functions for Visualization. R package version 1.1.1, scales.
- R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. RProject.
This package was developed as part of an assessment for the 2023 BCB410H: Applied Bioinformatics course at the University of Toronto, Toronto, Canada.
TRexDAD welcomes issues, enhancement requests, and other
contributions. To submit an issue, use the GitHub issues. Many thanks to
those who provided feedback to improve this package.
The main objective of the package is to find the optimal assembly design for mutagenizing a gene of interest. For illustration purposes, the gene is set to Rad27 by default, but users are encouraged to use this tool for any gene of interest with an ORF that is at least 60 codons in length.
To initiate workflow with the default gene of interest, run:
execute_and_plot()
For a comprehensive explanation of the workflow, please see the tool’s vignettes.
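Before running the workflow on a custom gene, it may help to confirm that the ORF meets the 60-codon minimum stated above. The following base-R sketch splits a sequence into codons and checks the count. The names split_codons_sketch and meets_min_codons_sketch are hypothetical, and the requirement that the length be a multiple of three is an assumption of this sketch; the package's own split_into_codons may handle such input differently.

```r
# Hypothetical base-R sketch, not the TRexDAD implementation.

# Split a gene sequence (single string) into consecutive 3-base codons.
# Assumption: the sequence length must be a positive multiple of 3.
split_codons_sketch <- function(gene) {
  stopifnot(is.character(gene), length(gene) == 1)
  n <- nchar(gene)
  if (n == 0 || n %% 3 != 0) {
    stop("Sequence length must be a positive multiple of 3.")
  }
  starts <- seq(1, n, by = 3)
  substring(gene, starts, starts + 2)
}

# Check the minimum ORF length (60 codons by default, as stated above).
meets_min_codons_sketch <- function(gene, min_codons = 60) {
  length(split_codons_sketch(gene)) >= min_codons
}

split_codons_sketch("ATGGCCTAA")  # "ATG" "GCC" "TAA"
```

A sequence that fails this check is shorter than the minimum the tool is described as supporting.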
The package tree structure is provided below:
📦TRexDAD
┣ 📂R
┃ ┣ 📜pipeline.R
┃ ┗ 📜run_TRexDAD.R
┣ 📂inst
┃ ┣ 📂extdata
┃ ┃ ┣ 📜TRex_overview.png
┃ ┃ ┣ 📜overhang_fidelity.csv
┃ ┃ ┗ 📜pipeline_workflow.png
┃ ┣ 📂shiny_script
┃ ┃ ┗ 📜app.R
┃ ┗ 📜CITATION
┣ 📂man
┃ ┣ 📂figures
┃ ┃ ┗ 📜README-unnamed-chunk-4-1.png
┃ ┣ 📜calculate_global_score.Rd
┃ ┣ 📜calculate_local_score.Rd
┃ ┣ 📜calculate_optimal_tiles.Rd
┃ ┣ 📜execute_and_plot.Rd
┃ ┣ 📜get_all_overhangs.Rd
┃ ┣ 📜get_overhangs.Rd
┃ ┣ 📜oligo_cost.Rd
┃ ┣ 📜optimize_position.Rd
┃ ┣ 📜pick_position.Rd
┃ ┣ 📜run_TRexDAD.Rd
┃ ┗ 📜split_into_codons.Rd
┣ 📂tests
┃ ┣ 📂testthat
┃ ┃ ┗ 📜test_pipeline.R
┃ ┗ 📜testthat.R
┣ 📂vignettes
┃ ┣ 📜.gitignore
┃ ┣ 📜tutorial_TRexDAD.R
┃ ┣ 📜tutorial_TRexDAD.Rmd
┃ ┗ 📜tutorial_TRexDAD.html
┣ 📜.RData
┣ 📜.Rbuildignore
┣ 📜.Rhistory
┣ 📜.gitignore
┣ 📜DESCRIPTION
┣ 📜LICENSE
┣ 📜NAMESPACE
┣ 📜README.Rmd
┣ 📜README.md
┗ 📜TRexDAD.Rproj