An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre.
IUPACpal is an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences as substrings of a large text, allowing also for potential mismatches and gaps.
FILE / FOLDER | PURPOSE |
---|---|
Makefiles |
Folder of Makefiles enabling alternative compilation parameters |
sdsl-lite |
Succinct Data Structure Library (unzipped folder) |
test_data |
Folder containing test data that may be used as input for IUPACpal |
test_results |
Folder containing results from test runs |
tools |
Folder containg various Python tools for testing purposes |
Makefile.gcc |
Makefile for compiling the project |
README.md |
This file |
main.cc |
IUPACpal main code |
main.h |
IUPACpal header code |
pre-install.sh |
Pre-installation script |
sdsl-lite.tar.gz |
Succinct Data Structure Library (zipped folder) |
timing_tests.sh |
Execute timing tests (requires timing_tests.cfg) |
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
To run IUPACpal requires the pre-installation of:
- cmake
- C++ compiler
- libdivsufsort (https://github.com/y-256/libdivsufsort)
- sdsl (https://github.com/simongog/sdsl-lite)
To build testing configuration files requires pre-installation of:
- Python (https://www.python.org/)
The necessary files for libdivsufsort and sdsl are already include in this repository.
Pre-installation of the above tools and libraries may be easily completed by running the following commands on a Ubuntu system within the project directory:
$ sudo apt-get install cmake
$ sudo apt-get install gcc
$ sudo apt-get install python3
$ sudo ./pre-install.sh
The IUPACpal software may then be compiled:
$ make -f Makefile.gcc
This will compile the program for 64-bit integers using the recommended type of Range Minimum Queries (rmq1). This requires double the amount of memory as 32-bit integers.
There are alternative makefiles available in the Makefiles
folder, which provide an alternative implementation of Range Minimum Queries (rmq2) and a choice between 32-bit and 64-bit integers .
After compilation the binary file IUPACpal
will be created in the working
directory, you may run it within this directory via the command:
$ ./IUPACpal <arguments>
To see the full list of required inputs, run the command with no arguments:
$ ./IUPACpal
FLAG | PARAMETER | TYPE | DEFAULT | DESCRIPTION |
---|---|---|---|---|
-f | input_file | string | input.fasta | Input filename (FASTA). |
-s | seq_name | string | seq0 | Input sequence name. |
-m | min_len | integer | 10 | Minimum length. |
-M | max_len | integer | 100 | Maximum length. |
-g | max_gap | integer | 100 | Maximum permissible gap. |
-x | mismatches | integer | 0 | Maximum permissible mismatches. |
-o | output_file | string | IUPACpal.out | Output filename. |
Run on the input file test_data/d00596.fasta
with default parameters:
$ ./IUPACpal -f test_data/d00596.fasta
Run on the input file test_data/test2.fasta
with parameters:
- sequence name in fasta file: seq2
- Maximum permissible gap: 3
- Minimum length: 5
- all other parameters default
$ ./IUPACpal -f test_data/test2.fasta -s seq2 -g 3 -m 5
Run on the input file test_data/rand1000000.fasta
with parameters:
- Minimum length: 20
- Maximum length: 25
- Maximum permissible gap: 8
- Maximum permissible mismatches: 5
- Output filename: output.txt
$ ./IUPACpal -f test_data/rand1000000.fasta -s seq0 -m 20 -M 25 -g 8 -x 5 -o output.txt
IUPACpal.py
is a Python module that may be imported as part of a Python script via:
import IUPACpal
This module is a Python wrapper that first requires the 'IUPACpal' software to be compiled as described in the Installation section. The IUPACpal.py
module contains a function find_inverted_repeats
that may be used to run the IUPACpal software. find_inverted_repeats
returns a list of tuple pairs, representing the identified inverted repeats. An example of this Python wrapper being used can be tested by running:
python example.py
When calling find_inverted_repeats
, the function returns a list of tuples of pairs in the format ((left_strand_start, left_strand_end), (right_strand_start, right_strand_end))
. Each tuple represents an inverted repeat, specified by the 4 indexes within the original DNA string that correspond to the left and right strands.
If no inverted repeats are found, find_inverted_repeats
returns None
.
If invalid parameters are given, find_inverted_repeats
returns an error string.
Automated testing can be performed through the use of timing_tests.sh
and a configuration file timing_tests.cfg
.
To run the default series of test on IUPACpal, first create the timing_tests.cfg
file by running:
$ python3 tools/timing_tests_builder.py
This creates the default configruation file timing_tests.cfg
. Testing of IUPACpal can then be performed by running:
$ ./timing_tests.sh
Ensure timing_tests.sh
and timing_tests.cfg
are in the same directory. To modify the test parameters, you can either directly modify timing_tests.cfg
or alternatively modify the Python script tools/timing_tests_builder.py
to enable the script to build a different set of tests over the desired range of input parameters.
Test results are written to an output file test_results_<CURRENT_DATE_AND_TIME>.csv
.
Steven Watts - steven31415
This project is licensed under the GNU GPL License.