Representative-Sample-Chooser: A Hierarchical Filtering Tool

Introduction

Representative-Sample-Chooser (RSC) is a computational tool for hierarchically filtering tab-delimited files.
This Haskell script takes an identifier field string, a hierarchical filter string, and a TSV file, and selects the best record for each identifier.

RSC outputs a single, filtered TSV file on successful exit, or an error message or log file, depending on the issue (see the Input section).

Prerequisites

rsc.hs assumes you have the GHC compiler and the packages it imports installed. The easiest way to get them is to download the Haskell Platform.

Installing required packages

To install the additional packages rsc.hs requires, run the following command. This assumes cabal, a package manager and build system for Haskell, is installed on your system (it comes with the Haskell Platform).

$ cabal install [packagename]

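Required packages

rsc.hs imports the modules listed below. Note that cabal installs packages, not modules: the modules outside base are provided by the deepseq, bytestring, split, process, temporary, boxes, and regex-tdfa packages.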

Control.DeepSeq
Data.ByteString
Data.ByteString.Char8
Data.ByteString.Lazy
Data.Char
Data.Functor
Data.List
Data.List.Split
Data.Ord
System.Console.GetOpt
System.Process
System.Environment
System.Exit
System.IO
System.IO.Temp
Text.PrettyPrint.Boxes
Text.Regex.TDFA

Input

RSC requires three inputs:

Identifier Field String - This string names the field (column) whose values identify groups of records. RSC reduces each group to a single record.

The Identifier Field String has the following structure:
;[Identifier Field String];

Hierarchical Filter String - This string defines a hierarchy of fields on which to filter. Within each field, a hierarchy of values defines how the filtering occurs (see example below).

The Hierarchical Filter String has the following structure:
;[FIELDNAME1]:[FIELDVALUE1]>=[FIELDVALUE2]>=...>=[FIELDVALUEN];[FIELDNAME2]:[FIELDVALUE1]>=[FIELDVALUE2]>=...>=[FIELDVALUEN];

The order in which the user provides the field names and corresponding field values is important: the first field name specified is the first field used for comparison, and so on. The order of these fields in the input TSV file is irrelevant.
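To make the structure concrete, here is a minimal sketch of how a Hierarchical Filter String could be split into an ordered list of fields and ranked values. It is not taken from rsc.hs; the helper names (splitOnChar, splitOnGe, parseFilter) are hypothetical, and it assumes field names and values contain neither ";" nor ":".

```haskell
-- Split a string on a single-character delimiter.
splitOnChar :: Char -> String -> [String]
splitOnChar d s = case break (== d) s of
  (chunk, [])       -> [chunk]
  (chunk, _ : rest) -> chunk : splitOnChar d rest

-- Split a string on the two-character token ">=".
splitOnGe :: String -> [String]
splitOnGe = go ""
  where
    go acc ('>' : '=' : rest) = reverse acc : go "" rest
    go acc (c : rest)         = go (c : acc) rest
    go acc []                 = [reverse acc]

-- Parse ";F1:a>=b;F2:c>=d;" into [(field, ranked values)].
-- The list order is the comparison order: the first field is compared first.
parseFilter :: String -> [(String, [String])]
parseFilter str =
  [ (field, splitOnGe (drop 1 rest))
  | clause <- splitOnChar ';' str
  , not (null clause)
  , let (field, rest) = break (== ':') clause
  ]

main :: IO ()
main =
  mapM_ print
    (parseFilter ";Time_point:T1>=T2>=T3;Type_of_data:complex>=simple>=NA;Data_depth:FLOAT.>;")
```

A numeric specification such as FLOAT.> contains no ">=" token, so in this sketch it comes back as a single-element value list that a later step would have to interpret.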

The following keywords can be used to identify fields which are numeric in nature (float or int):

Float -> FLOAT
Int -> INT

The following key characters are appended to the above keywords, separated by a period (for example, FLOAT.>), to define the type of comparison:

Maximum -> >
Minimum -> <
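As an illustration of how a keyword and key character might combine, the sketch below parses a specification such as FLOAT.> or INT.< and returns the winning value(s) from a column. The type and function names are hypothetical and not taken from rsc.hs. It also assumes that a FLOAT field accepts anything Haskell can read as a Double and an INT field anything it can read as an Integer.

```haskell
import Text.Read (readMaybe)

data NumKind = FloatField | IntField deriving (Eq, Show)
data Direction = Maximum | Minimum deriving (Eq, Show)

-- Recognise "FLOAT.>", "FLOAT.<", "INT.>" and "INT.<"; anything else is Nothing.
parseNumericSpec :: String -> Maybe (NumKind, Direction)
parseNumericSpec spec = case break (== '.') spec of
  (kw, ['.', c]) -> (,) <$> kind kw <*> dir c
  _              -> Nothing
  where
    kind "FLOAT" = Just FloatField
    kind "INT"   = Just IntField
    kind _       = Nothing
    dir '>' = Just Maximum
    dir '<' = Just Minimum
    dir _   = Nothing

-- Return every raw value tied for best, or Nothing if the list is empty
-- or any value fails to parse as the requested numeric kind.
bestValues :: (NumKind, Direction) -> [String] -> Maybe [String]
bestValues _ [] = Nothing
bestValues (k, d) vs = do
  nums <- mapM (parse k) vs
  let best = case d of
        Maximum -> maximum nums
        Minimum -> minimum nums
  pure [v | (n, v) <- zip nums vs, n == best]
  where
    parse FloatField v = readMaybe v :: Maybe Double
    parse IntField v   = fromIntegral <$> (readMaybe v :: Maybe Integer)

main :: IO ()
main = do
  print (parseNumericSpec "FLOAT.>")
  print (parseNumericSpec "FLOAT.>" >>= \s -> bestValues s ["100.19", "65.32", "106.78"])
```

Returning all tied values, rather than a single one, keeps ties visible so that a later field in the hierarchy can break them.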

If the user has not specified all possible values for a given field, the program exits early and writes a file named rsc_ERROR.log, which lists each field with missing values and what those values are. The log has the following columns:
Name_of_Field_(Column) Values_Found_in_Input_TSV_Not_Found_in_Hierarchical_Filter_String Values_Found_in_Hierarchical_Filter_String_Not_Found_in_Input_TSV
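The check described above amounts to two set differences per field. The following sketch shows one way to compute them. The function names are hypothetical, and the exact layout of rsc_ERROR.log (tab-separated columns with comma-separated values) is an assumption made for illustration only.

```haskell
import Data.List (intercalate, nub, (\\))

-- For one field, return
--   (values in the TSV but not in the filter string,
--    values in the filter string but not in the TSV).
missingValues :: [String] -> [String] -> ([String], [String])
missingValues tsvValues filterValues =
  (nub tsvValues \\ filterValues, nub filterValues \\ tsvValues)

-- The filter is exhaustive for a field when every TSV value is ranked.
isExhaustive :: ([String], [String]) -> Bool
isExhaustive (notInFilter, _) = null notInFilter

-- Render one report line (assumed layout: tab-separated, values joined by commas).
logLine :: String -> ([String], [String]) -> String
logLine field (notInFilter, notInTsv) =
  intercalate "\t" [field, intercalate "," notInFilter, intercalate "," notInTsv]

main :: IO ()
main = do
  let report = missingValues ["T1", "T2", "T4", "T1"] ["T1", "T2", "T3"]
  print report
  print (isExhaustive report)
  putStrLn (logLine "Time_point" report)
```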

TSV file - This is the tab-delimited file on which the hierarchical filtering will occur.
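To show how the Identifier Field String and the TSV file relate, here is a small sketch that locates the identifier column in the header row and groups the data rows by its values. It is an illustration only, not the rsc.hs implementation: the function names are hypothetical, rows are kept as plain lists of strings, and the identifier field name is assumed to contain no ";".

```haskell
import Data.List (elemIndex, nub)

-- Strip the surrounding semicolons from ";Field;".
identifierField :: String -> String
identifierField = filter (/= ';')

-- Split one TSV line into cells.
splitOnTab :: String -> [String]
splitOnTab s = case break (== '\t') s of
  (cell, [])       -> [cell]
  (cell, _ : rest) -> cell : splitOnTab rest

-- Group data rows by the identifier column, keeping first-seen order.
-- Returns Nothing if the input is empty or the column is not in the header.
groupByIdentifier :: String -> String -> Maybe [(String, [[String]])]
groupByIdentifier idField contents = case map splitOnTab (lines contents) of
  []            -> Nothing
  header : rows -> do
    i <- elemIndex idField header
    let keys = nub [row !! i | row <- rows, length row > i]
    pure [(k, [row | row <- rows, length row > i, row !! i == k]) | k <- keys]

main :: IO ()
main = do
  let tsv = unlines
        [ "Sample_Group_ID\tTime_point"
        , "200ABC\tT1"
        , "300XYZ\tT2"
        , "200ABC\tT3"
        ]
  print (groupByIdentifier (identifierField ";Sample_Group_ID;") tsv)
```

Each resulting group is the set of candidate records from which one best record would then be chosen.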

Usage

rsc.hs is easy to use.

You can call it using the runghc command provided by the GHC compiler as follows:
$ runghc rsc.hs -o name_of_filtered_file.tsv ";UPN_clinical;" ";Clinical_T_sequenced:T1>=T2>=T3>=T4>=T5>=T6>=T7>=T9>=Tn>=NA;model_group_reagent:combined_exome_capture>=exome>=capture_v2>=capture;common_name:relapse flow sorted>=relapse>=tumor>=normal;mean_depth:FLOAT.>;" ../path/to/input/file.tsv

For maximum performance, please compile and run the source code as follows:
$ ghc -O2 -o RSC rsc.hs
$ ./RSC -o name_of_filtered_file.tsv ";UPN_clinical;" ";Clinical_T_sequenced:T1>=T2>=T3>=T4>=T5>=T6>=T7>=T9>=Tn>=NA;model_group_reagent:combined_exome_capture>=exome>=capture_v2>=capture;common_name:relapse flow sorted>=relapse>=tumor>=normal;mean_depth:FLOAT.>;" ../path/to/input/file.tsv

Arguments

RSC accepts the following command-line arguments:

Representative Sample Chooser, Copyright (c) 2020 Matthew Mosior.
Usage: rsc [-vV?o] [Identifier Field String] [Hierarchical Filter String] [TSV file]

  -v          --verbose             Output on stderr.
  -V, -?      --version             Show version number.
  -o OUTFILE  --outputfile=OUTFILE  The output file to which the results will be printed.
              --nonexhaustive       First sample will be returned for identifiers with
                                    non-exhaustive hierarchical filtering values.
              --help                Print this help message.

The -v (verbose) option provides a full error message.
The -V (version) option shows the version of rsc in use.
The -o (outputfile) option specifies the file to which the filtered lines are written.
The --nonexhaustive option prints the first record for every identifier for which non-exhaustive filtering occurred.
Finally, the --help option outputs the help message seen above.
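For readers curious how such an option table is typically declared, the sketch below reproduces the options above with System.Console.GetOpt, which is among the modules rsc.hs imports. The Flag type and the parseArgs function are hypothetical; the actual definitions in rsc.hs may differ.

```haskell
import System.Console.GetOpt

-- Hypothetical flag type mirroring the documented options.
data Flag
  = Verbose
  | Version
  | OutputFile String
  | NonExhaustive
  | Help
  deriving (Eq, Show)

options :: [OptDescr Flag]
options =
  [ Option ['v'] ["verbose"] (NoArg Verbose) "Output on stderr."
  , Option ['V', '?'] ["version"] (NoArg Version) "Show version number."
  , Option ['o'] ["outputfile"] (ReqArg OutputFile "OUTFILE")
      "The output file to which the results will be printed."
  , Option [] ["nonexhaustive"] (NoArg NonExhaustive)
      "First sample will be returned for identifiers with non-exhaustive hierarchical filtering values."
  , Option [] ["help"] (NoArg Help) "Print this help message."
  ]

-- Separate flags from positional arguments, or return an error with usage text.
parseArgs :: [String] -> Either String ([Flag], [String])
parseArgs argv = case getOpt Permute options argv of
  (flags, positional, []) -> Right (flags, positional)
  (_, _, errs)            -> Left (concat errs ++ usageInfo header options)
  where
    header = "Usage: rsc [-vV?o] [Identifier Field String] [Hierarchical Filter String] [TSV file]"

main :: IO ()
main = print (parseArgs ["-o", "out.tsv", "--nonexhaustive", ";ID;", ";F:a>=b;", "in.tsv"])
```

With the Permute ordering, options may appear before or after the three positional inputs, which are returned in their original order.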

Some Examples

The following examples will help illustrate the way the hierarchical filtering algorithm chooses a best record for each given identifier.

The following two examples assume the following inputs:

Identifier Field String: ;Sample_Group_ID;
Hierarchical Filter String: ;Time_point:T1>=T2>=T3;Type_of_data:complex>=simple>=NA;Data_depth:FLOAT.>;

For simplicity, each example illustrates the hierarchical filtering on a single identifier.

Example 1:

This example will illustrate a scenario where a single record is returned (user-defined hierarchical filter determined it was the best record for said identifier).

The hierarchical filtering starts on the most important field as described by the Hierarchical Filter String, Time_point.

	Sample_Group_ID	Time_point	Type_of_data	Data_depth
1	200ABC	T1 tied	simple	100.19
2	200ABC	T1 tied	complex	65.32
3	200ABC	T1 tied	complex	106.78

There is a three-way tie between all three lines due to the values in the Time_point field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Type_of_data, and all three lines are still being compared.

	Sample_Group_ID	Time_point	Type_of_data	Data_depth
1	200ABC	T1 tied	simple	100.19
2	200ABC	T1 tied	complex tied	65.32
3	200ABC	T1 tied	complex tied	106.78

There is a two-way tie between lines 2 and 3 due to the values in the Type_of_data field, so the filtering then moves on to the next most important field as described by the Hierarchical Filter String, Data_depth, and is restricted to just lines 2 and 3.

	Sample_Group_ID	Time_point	Type_of_data	Data_depth
1	200ABC	T1 tied	simple	100.19
2	200ABC	T1 tied	complex tied	65.32
3	200ABC	T1 tied	complex tied	106.78 wins

Because the Data_depth field was defined as FLOAT.> in the Hierarchical Filter String, the largest float value in the field wins the comparison between lines 2 and 3.

So line 3 is the chosen record for this identifier.
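The walk-through in Example 1 can be reproduced with a short sketch. This is an illustration of the idea, not the algorithm in rsc.hs: the Rule type and function names are hypothetical, only the FLOAT.> case is modeled for numeric fields, and values missing from a ranking are simply scored worst (RSC itself exits with rsc_ERROR.log in that situation).

```haskell
import Data.List (elemIndex)
import Text.Read (readMaybe)

-- A record maps field names to raw string values.
type Record = [(String, String)]

-- Hypothetical rule type: a ranked value list, or "largest float wins" (FLOAT.>).
data Rule = Ranked [String] | MaxFloat

-- Keep only the records tied for the best score under one rule (lower is better).
keepBest :: [Record] -> (String, Rule) -> [Record]
keepBest recs (field, rule) = [r | (s, r) <- scored, s == best]
  where
    scored = [(score r, r) | r <- recs]
    best = minimum (map fst scored)
    worst = 1 / 0 :: Double
    score r = case (rule, lookup field r) of
      (Ranked order, Just v) -> maybe worst fromIntegral (elemIndex v order)
      (MaxFloat, Just v)     -> maybe worst negate (readMaybe v)
      _                      -> worst

-- Apply the rules from most to least important; the result holds the survivors.
survivors :: [(String, Rule)] -> [Record] -> [Record]
survivors rules recs = foldl keepBest recs rules

main :: IO ()
main = do
  let mk tp ty d =
        [ ("Sample_Group_ID", "200ABC"), ("Time_point", tp)
        , ("Type_of_data", ty), ("Data_depth", d) ]
      records =
        [ mk "T1" "simple" "100.19"
        , mk "T1" "complex" "65.32"
        , mk "T1" "complex" "106.78" ]
      rules =
        [ ("Time_point", Ranked ["T1", "T2", "T3"])
        , ("Type_of_data", Ranked ["complex", "simple", "NA"])
        , ("Data_depth", MaxFloat) ]
  mapM_ print (survivors rules records)
```

Each rule narrows the candidate list to the records tied for best, so a later rule only ever compares records that tied on every earlier one. Here the single survivor is the record with Data_depth 106.78, matching line 3 in the tables.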

Example 2:

This example will illustrate a scenario where no record is returned (user-defined hierarchical filter could not determine a best record for said identifier).

The hierarchical filtering starts on the most important field as described by the Hierarchical Filter String, Time_point.

	Sample_Group_ID	Time_point	Type_of_data	Data_depth
1	200ABC	T1 tied	complex	101.10
2	200ABC	T1 tied	complex	101.10
3	200ABC	T1 tied	complex	101.10

There is a three-way tie between all three lines due to the values in the Time_point field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Type_of_data, and all three lines are still being compared.

	Sample_Group_ID	Time_point	Type_of_data	Data_depth
1	200ABC	T1 tied	complex tied	101.10
2	200ABC	T1 tied	complex tied	101.10
3	200ABC	T1 tied	complex tied	101.10

There is a three-way tie between all three lines due to the values in the Type_of_data field, so the filtering moves on to the next most important field as described by the Hierarchical Filter String, Data_depth, and all three lines are still being compared.

	Sample_Group_ID	Time_point	Type_of_data	Data_depth
1	200ABC	T1 tied	complex tied	101.10 tied
2	200ABC	T1 tied	complex tied	101.10 tied
3	200ABC	T1 tied	complex tied	101.10 tied

There is a three-way tie between all three lines due to the values in the Data_depth field, so there is no best record for this identifier.

By default, in this scenario, no record will be returned for this identifier.

If the optional --nonexhaustive flag is set, the first record for this identifier is returned instead.

Alternatively, the user can add further fields and corresponding values to the Hierarchical Filter String in the hope of breaking the remaining ties.
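The outcome for a single identifier, once every field in the hierarchy has been applied, can be summarised in a few lines. The sketch below is an assumption-laden illustration: resolve is a hypothetical function, its Bool argument stands for the --nonexhaustive flag, and "first record" is taken to mean the first surviving record in input order.

```haskell
-- Decide what to return for one identifier, given the records that survived
-- every level of the hierarchical filter (in input order).
resolve :: Bool -> [record] -> Maybe record
resolve _ [winner] = Just winner        -- a unique best record
resolve True (first : _) = Just first   -- tie, --nonexhaustive set
resolve _ _ = Nothing                   -- tie without the flag, or no records

main :: IO ()
main = do
  let tied = ["line 1", "line 2", "line 3"]
  print (resolve False tied)
  print (resolve True tied)
```

For the three tied lines of Example 2, the first call yields no record and the second yields line 1.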

Docker

A Docker image containing all the software needed to run RSC is available: matthewmosior/representativesamplechooser:final

Credits

Documentation was added April 2020.
Author : Matthew Mosior
