Representative-Sample-Chooser (RSC) is a computational tool for hierarchically filtering tab-delimited files.
This Haskell script takes an Identifier Field String, a Hierarchical Filter String, and a TSV file, and captures the best record for each identifier.
RSC outputs a single filtered TSV file on successful exit, or an error message or log file, depending on the issue (described below).
rsc.hs assumes that the GHC compiler and the packages it imports are installed. The easiest way to get GHC is to download the Haskell Platform.
To install the additional packages that rsc.hs requires, run the following command. It assumes that cabal, a package manager and build system for Haskell, is installed on your system (it comes with the Haskell Platform).
$ cabal install [packagename]
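Required modules (the package that provides each module, which is the name to pass to cabal install, is shown in parentheses; modules from base ship with GHC)
- Control.DeepSeq (deepseq)
- Data.ByteString (bytestring)
- Data.ByteString.Char8 (bytestring)
- Data.ByteString.Lazy (bytestring)
- Data.Char (base)
- Data.Functor (base)
- Data.List (base)
- Data.List.Split (split)
- Data.Ord (base)
- System.Console.GetOpt (base)
- System.Process (process)
- System.Environment (base)
- System.Exit (base)
- System.IO (base)
- System.IO.Temp (temporary)
- Text.PrettyPrint.Boxes (boxes)
- Text.Regex.TDFA (regex-tdfa)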
RSC requires three inputs:
- Identifier Field String - This string names the field (column) whose values identify groups of records. RSC reduces each group to a single best record.
The Identifier Field String has the following structure:
;[Identifier Field String];
- Hierarchical Filter String - This string defines a hierarchy of fields on which to filter. Within each field, a hierarchy of values defines how the filtering occurs (see the examples below).
The Hierarchical Filter String has the following structure:
;[FIELDNAME1]:[FIELDVALUE1]>=[FIELDVALUE2]>=...>=[FIELDVALUEN];[FIELDNAME2]:[FIELDVALUE1]>=[FIELDVALUE2]>=...>=[FIELDVALUEN];
The order in which the user provides the field names and their corresponding field values is important: the first field name specified is the first field used for comparison, and so on. The order of these fields in the input TSV file is irrelevant.
The following keywords can be used to identify fields which are numeric in nature (float or int):
- Float -> FLOAT
- Int -> INT
The following key characters can be combined with the above keywords, separated by a period (for example, FLOAT.>), to define the type of comparison:
- Maximum -> >
- Minimum -> <
If the user has not specified all possible values for a given field, the program exits early and writes a file named rsc_ERROR.log, which details every field with missing values and what those missing values are:
Name_of_Field_(Column) Values_Found_in_Input_TSV_Not_Found_in_Hierarchical_Filter_String Values_Found_in_Hierarchical_Filter_String_Not_Found_in_Input_TSV
- TSV file - This is the tab-delimited file on which the hierarchical filtering will occur.
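The structure of the two strings described above can be sketched in a few lines of Haskell. This is a minimal illustration that uses only base, not the parser found in rsc.hs: the names `Ranking`, `parseIdentifier`, `parseFilter`, and `parseRanking` are hypothetical, and the `INT.>`, `INT.<`, and `FLOAT.<` spellings are assumed by analogy with the `FLOAT.>` form shown in this document.

```haskell
-- Hypothetical sketch of the input-string layout; not taken from rsc.hs.

data NumType = FloatField | IntField deriving (Eq, Show)
data Pick    = Largest | Smallest    deriving (Eq, Show)

-- How the values of one field are ranked.
data Ranking
  = Ordered [String]      -- categorical values, best first
  | Numeric NumType Pick  -- FLOAT or INT keyword with > or <
  deriving (Eq, Show)

-- Split a string on every occurrence of a non-empty separator.
splitOn :: String -> String -> [String]
splitOn sep = go
  where
    n = length sep
    go s = case breakOn s of
      (chunk, Nothing)   -> [chunk]
      (chunk, Just rest) -> chunk : go rest
    breakOn [] = ([], Nothing)
    breakOn s@(c:cs)
      | take n s == sep = ([], Just (drop n s))
      | otherwise       = let (chunk, rest) = breakOn cs in (c : chunk, rest)

-- ";Sample_Group_ID;" names a single field between semicolons.
parseIdentifier :: String -> String
parseIdentifier = filter (/= ';')

-- Each non-empty clause between semicolons is "FIELDNAME:VALUES".
-- The result keeps the clauses in the order the user wrote them.
parseFilter :: String -> [(String, Ranking)]
parseFilter spec =
  [ (name, parseRanking (drop 1 rest))
  | clause <- splitOn ";" spec
  , not (null clause)
  , let (name, rest) = break (== ':') clause
  ]

-- Numeric keywords are assumed to be joined to the comparison by a period.
parseRanking :: String -> Ranking
parseRanking "FLOAT.>" = Numeric FloatField Largest
parseRanking "FLOAT.<" = Numeric FloatField Smallest
parseRanking "INT.>"   = Numeric IntField Largest
parseRanking "INT.<"   = Numeric IntField Smallest
parseRanking values    = Ordered (splitOn ">=" values)
```

Keeping the result as an ordered list rather than a map preserves the order of the clauses, which is what determines the order in which fields are compared.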
rsc.hs is easy to use.
You can call it using the runghc command provided by the GHC compiler as follows:
$ runghc rsc.hs -o name_of_filtered_file.tsv ";UPN_clinical;" ";Clinical_T_sequenced:T1>=T2>=T3>=T4>=T5>=T6>=T7>=T9>=Tn>=NA;model_group_reagent:combined_exome_capture>=exome>=capture_v2>=capture;common_name:relapse flow sorted>=relapse>=tumor>=normal;mean_depth:FLOAT.>;" ../path/to/input/file.tsv
For maximum performance, please compile and run the source code as follows:
$ ghc -O2 -o RSC rsc.hs
$ ./RSC -o name_of_filtered_file.tsv ";UPN_clinical;" ";Clinical_T_sequenced:T1>=T2>=T3>=T4>=T5>=T6>=T7>=T9>=Tn>=NA;model_group_reagent:combined_exome_capture>=exome>=capture_v2>=capture;common_name:relapse flow sorted>=relapse>=tumor>=normal;mean_depth:FLOAT.>;" ../path/to/input/file.tsv
RSC accepts a few command-line options:
Representative Sample Chooser, Copyright (c) 2020 Matthew Mosior.
Usage: rsc [-vV?o] [Identifier Field String] [Hierarchical Filter String] [TSV file]
-v --verbose Output on stderr.
-V, -? --version Show version number.
-o OUTFILE --outputfile=OUTFILE The output file to which the results will be printed.
--nonexhaustive First sample will be returned for identifiers with
non-exhaustive hierarchical filtering values.
--help Print this help message.
The -v option, the verbose option, will provide a full error message.
The -V option, the version option, will show the version of rsc in use.
The -o option, the outputfile option, is used to specify the file to which the filtered lines will be printed.
The --nonexhaustive option tells rsc to print the first record for every identifier for which non-exhaustive filtering occurred.
Finally, the --help option outputs the help message seen above.
The following examples will help illustrate the way the hierarchical filtering algorithm chooses a best record for each given identifier.
The following two examples assume the following inputs:
Identifier Field String: ;Sample_Group_ID;
Hierarchical Filter String: ;Time_point:T1>=T2>=T3;Type_of_data:complex>=simple>=NA;Data_depth:FLOAT.>;
For simplicity's sake, each of the following examples illustrates the hierarchical filtering on a single identifier.
The first example illustrates a scenario where a single record is returned (the user-defined hierarchical filter determined it was the best record for that identifier).
The hierarchical filtering starts on the most important field as described by the Hierarchical Filter String, Time_point.
Line | Sample_Group_ID | Time_point | Type_of_data | Data_depth |
---|---|---|---|---|
1 | 200ABC | T1 (tied) | simple | 100.19 |
2 | 200ABC | T1 (tied) | complex | 65.32 |
3 | 200ABC | T1 (tied) | complex | 106.78 |
There is a three-way tie between all three lines due to the values in the Time_point field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Type_of_data, and all three lines are still being compared.
Line | Sample_Group_ID | Time_point | Type_of_data | Data_depth |
---|---|---|---|---|
1 | 200ABC | T1 (tied) | simple | 100.19 |
2 | 200ABC | T1 (tied) | complex (tied) | 65.32 |
3 | 200ABC | T1 (tied) | complex (tied) | 106.78 |
There is a two-way tie between lines 2 and 3 due to the values in the Type_of_data field, so the filtering then moves on to the next most important field as described by the Hierarchical Filter String, Data_depth, and is restricted to just lines 2 and 3.
Line | Sample_Group_ID | Time_point | Type_of_data | Data_depth |
---|---|---|---|---|
1 | 200ABC | T1 (tied) | simple | 100.19 |
2 | 200ABC | T1 (tied) | complex (tied) | 65.32 |
3 | 200ABC | T1 (tied) | complex (tied) | 106.78 (wins) |
Because the Hierarchical Filter String specifies FLOAT.> for Data_depth, the largest float value in the field wins the comparison between lines 2 and 3.
So line 3 is the chosen record for this identifier.
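The narrowing steps walked through above can be sketched as a fold over the filter fields. This is a minimal illustration under stated assumptions, not the algorithm as implemented in rsc.hs: the names `Record`, `Ranking`, `score`, `narrow`, and `survivors` are hypothetical, only the `FLOAT.>` numeric case is modelled, and a record whose value is missing or unlisted is simply ranked last (RSC itself reports unlisted values in rsc_ERROR.log instead).

```haskell
import Data.List (elemIndex)
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- A record maps field names to values (hypothetical representation).
type Record = [(String, String)]

data Ranking
  = Ordered [String]  -- categorical values, best first
  | LargestFloat      -- stands in for FLOAT.>

-- Score a record on one field; lower is better.
-- Missing or unlisted values score infinity (an assumption of this sketch).
score :: (String, Ranking) -> Record -> Double
score (field, ranking) record = fromMaybe (1 / 0) $ do
  value <- lookup field record
  case ranking of
    Ordered values -> fromIntegral <$> elemIndex value values
    LargestFloat   -> negate <$> readMaybe value

-- Keep only the records tied for the best score on one field.
narrow :: (String, Ranking) -> [Record] -> [Record]
narrow _ [] = []
narrow rule records = [r | r <- records, score rule r == best]
  where best = minimum (map (score rule) records)

-- Apply the fields in the order given; later fields only break earlier ties.
survivors :: [(String, Ranking)] -> [Record] -> [Record]
survivors rules records = foldl (flip narrow) records rules

exampleRules :: [(String, Ranking)]
exampleRules =
  [ ("Time_point",   Ordered ["T1", "T2", "T3"])
  , ("Type_of_data", Ordered ["complex", "simple", "NA"])
  , ("Data_depth",   LargestFloat)
  ]

-- The three lines for identifier 200ABC from the first example.
exampleRecords :: [Record]
exampleRecords =
  [ [("Line", "1"), ("Time_point", "T1"), ("Type_of_data", "simple"),  ("Data_depth", "100.19")]
  , [("Line", "2"), ("Time_point", "T1"), ("Type_of_data", "complex"), ("Data_depth", "65.32")]
  , [("Line", "3"), ("Time_point", "T1"), ("Type_of_data", "complex"), ("Data_depth", "106.78")]
  ]
```

In GHCi, `map (lookup "Line") (survivors exampleRules exampleRecords)` evaluates to `[Just "3"]`, matching the walkthrough: Time_point keeps all three lines, Type_of_data keeps lines 2 and 3, and Data_depth keeps line 3.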
The second example illustrates a scenario where no record is returned (the user-defined hierarchical filter could not determine a best record for that identifier).
The hierarchical filtering starts on the most important field as described by the Hierarchical Filter String, Time_point.
Line | Sample_Group_ID | Time_point | Type_of_data | Data_depth |
---|---|---|---|---|
1 | 200ABC | T1 (tied) | complex | 101.10 |
2 | 200ABC | T1 (tied) | complex | 101.10 |
3 | 200ABC | T1 (tied) | complex | 101.10 |
There is a three-way tie between all three lines due to the values in the Time_point field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Type_of_data, and all three lines are still being compared.
Line | Sample_Group_ID | Time_point | Type_of_data | Data_depth |
---|---|---|---|---|
1 | 200ABC | T1 (tied) | complex (tied) | 101.10 |
2 | 200ABC | T1 (tied) | complex (tied) | 101.10 |
3 | 200ABC | T1 (tied) | complex (tied) | 101.10 |
There is a three-way tie between all three lines due to the values in the Type_of_data field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Data_depth, and all three lines are still being compared.
Line | Sample_Group_ID | Time_point | Type_of_data | Data_depth |
---|---|---|---|---|
1 | 200ABC | T1 (tied) | complex (tied) | 101.10 (tied) |
2 | 200ABC | T1 (tied) | complex (tied) | 101.10 (tied) |
3 | 200ABC | T1 (tied) | complex (tied) | 101.10 (tied) |
There is a three-way tie between all three lines due to the values in the Data_depth field, so there is no best record for this identifier.
By default, in this scenario, no record will be returned for this identifier.
With the optional --nonexhaustive option, the first record for this identifier is returned instead.
The user could also provide additional fields and corresponding values in the Hierarchical Filter String, in the hope of breaking the remaining ties.
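The tie-handling rule described above can be sketched as a small decision function. This is an illustration only: `chooseRecord` is a hypothetical name, not a function from rsc.hs, and it assumes that "first record" means the first record for the identifier in input order.

```haskell
import Data.Maybe (listToMaybe)

-- Hypothetical helper; not taken from rsc.hs.
-- nonExhaustive: whether --nonexhaustive was given.
-- records:       all records for one identifier, in input order (assumed).
-- remaining:     the records still tied after every filter field was applied.
chooseRecord :: Bool -> [r] -> [r] -> Maybe r
chooseRecord nonExhaustive records remaining =
  case remaining of
    [winner] -> Just winner                -- a unique best record exists
    _ | nonExhaustive -> listToMaybe records  -- fall back to the first record
      | otherwise     -> Nothing           -- default: return no record
```

For the second example, all three lines remain tied, so the result is `Nothing` by default and line 1 when the flag is set.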
A Docker image that contains all the software necessary to run RSC is available: matthewmosior/representativesamplechooser:final
Documentation was added April 2020.
Author : Matthew Mosior