Explore how to get a manifest of sequence files into Galaxy from ENA #310

Open
NoopDog opened this issue Feb 19, 2025 · 5 comments

@NoopDog
Collaborator

NoopDog commented Feb 19, 2025

Need

Similar to #309, we need to explore importing a file manifest from ENA into a Galaxy workspace and using it to supply the sequencing files for workflows such as Paired-end variant calling in haploid system.

Questions

  1. Where would the user go to search ENA for sequences?
  2. Does ENA have the ability to exclude files used in a given assembly or include them from a given time period or geographic region?
  3. How does the user export a manifest of the search results to their local file system or possibly directly to Galaxy if that connection exists?
  4. How could the manifest be uploaded to a given Galaxy history from the local file system?
  5. Given the manifest, how can the workflow recognize it and access the referenced files?
@NoopDog
Collaborator Author

NoopDog commented Feb 20, 2025

@nekrut to refine and assign.

@Smeds
Collaborator

Smeds commented Feb 20, 2025

ENA has a pretty nice REST API:

URL for Search & Discovery
https://www.ebi.ac.uk/ena/portal/api/swagger-ui/index.html#/Search%20%26%20Discovery/search

Documentation can be found at ENA-Portal-API-doc

List of fields that one can filter on: https://www.ebi.ac.uk/ena/portal/api/searchFields?result=read_run

Like most providers, they impose usage limits:

  • 50 requests per second
In order to ensure a smooth and fair user experience, we have implemented rate limits on our REST services. 
It helps us in maintaining optimal performance and preventing overload on our servers. By regulating the number of requests from individual users, we can ensure that everyone gets a consistent and responsive experience. It also acts as a protective measure against malicious activities such as DDoS attacks and brute-force attempts.
At present we have set the upper limit at 50 requests per second which we think should be sufficient for most use-cases. If the number of requests breaches this limit then the subsequent requests may be rejected with the error "Too Many Requests" (HTTP status code 429).
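A client that issues many searches could back off when it sees the 429 status described above. The following is only a sketch: the helper names (`fetch_text`, `fetch_with_backoff`) and the retry defaults are my own assumptions, not anything ENA prescribes.

```python
import time
import urllib.error
import urllib.request


def fetch_text(url, timeout=30.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_with_backoff(url, fetch=fetch_text, max_attempts=5,
                       base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on HTTP 429.

    max_attempts and base_delay are illustrative values (assumptions),
    not ENA recommendations.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Only "Too Many Requests" is retried; any other error,
            # or a 429 on the final attempt, is re-raised to the caller.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

`fetch` and `sleep` are parameters so the retry logic can be exercised without touching the network.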

1 Where would the user go to search ENA for sequences?

There are a few GUI options:

The API, which can be used to access the data programmatically, can be explored at the following URL: https://www.ebi.ac.uk/ena/portal/api/swagger-ui/index.html#/Search%20%26%20Discovery/search

2 Does ENA have the ability to exclude files used in a given assembly or include them from a given time period or geographic region?

It's possible to filter on many fields, for example scientific_name, date, or country.
filter using scientific_name
input for "Search & Discovery"

  • result: read_run
  • query: scientific_name="Taeniopygia guttata"

which translates into the following URL

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=scientific_name%3D%22Taeniopygia%20guttata%22&fields=run_accession%2Cscientific_name%2Cfastq_ftp%2Cread_count&limit=10&format=tsv
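For reference, the same URL can be assembled programmatically instead of copying it from Swagger. This is a minimal sketch using only the Python standard library; `build_search_url` is a hypothetical helper name, while the parameter names are taken from the URL above.

```python
from urllib.parse import quote, urlencode

ENA_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_url(result, query, fields, limit=10, fmt="tsv"):
    """Build a search URL from the parameters shown in the URL above."""
    params = {
        "result": result,
        "query": query,
        "fields": ",".join(fields),
        "limit": limit,
        "format": fmt,
    }
    # quote_via=quote encodes spaces as %20; the default (quote_plus)
    # would encode them as "+".
    return ENA_SEARCH_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url(
    result="read_run",
    query='scientific_name="Taeniopygia guttata"',
    fields=["run_accession", "scientific_name", "fastq_ftp", "read_count"],
)
print(url)
```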

filter on date and sample accession
input for "Search & Discovery"

  • result: read_run
  • query: (sample_accession=SAMN02981239 OR sample_accession=SAMN37045233 OR sample_accession=SAMN12623621) AND (last_updated>=2023-08-23 AND last_updated<=2023-08-25 country=chile)

which translates into the following URL

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=%28sample_accession%3DSAMN02981239%20OR%20sample_accession%3DSAMN37045233%20OR%20sample_accession%3DSAMN12623621%29%20AND%20%28last_updated%3E%3D2023-08-23%20AND%20last_updated%3C%3D2023-08-25%20country%3Dchile%29&fields=run_accession%2Clast_updated%2Cscientific_name%2Ccountry%2Cinstrument_platform%2Cinstrument_model%2Cread_count%2Ctax_id&limit=0&format=tsv&download=false

filter on countries and sample accession
input for "Search & Discovery"

  • result: read_run
  • query: (sample_accession=SAMN02981239 OR sample_accession=SAMN37045233 OR sample_accession=SAMN12623621) AND (last_updated>=usa OR country=chile)

which translates into the following URL

https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&query=%28sample_accession%3DSAMN02981239%20OR%20sample_accession%3DSAMN37045233%20OR%20sample_accession%3DSAMN12623621%29%20AND%20%28country%3Dusa%20OR%20country%3Dchile%29&fields=run_accession%2Cscientific_name%2Ccountry%2Cinstrument_platform%2Cinstrument_model%2Cread_count%2Ctax_id&limit=0&format=tsv&download=false
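Writing these compound queries by hand is error-prone, so an integrated component would probably compose them from smaller pieces. A rough sketch follows; the helper names (`any_of`, `date_between`, `all_of`) are made up for illustration, and only the operators that appear in the queries above are used.

```python
from urllib.parse import quote


def any_of(field, values):
    """Build "(field=a OR field=b ...)" for a list of accepted values."""
    return "(" + " OR ".join(f"{field}={value}" for value in values) + ")"


def date_between(field, start, end):
    """Build an inclusive date-range clause such as the last_updated filter."""
    return f"({field}>={start} AND {field}<={end})"


def all_of(*clauses):
    """Join already-parenthesised clauses with AND."""
    return " AND ".join(clauses)


samples = ["SAMN02981239", "SAMN37045233", "SAMN12623621"]
query = all_of(
    any_of("sample_accession", samples),
    any_of("country", ["usa", "chile"]),
)
print(query)
# Percent-encoded form, as it appears in the query= parameter of the URL.
print(quote(query, safe=""))
```

The encoded output can be dropped into the `query=` parameter of the search URL; `date_between("last_updated", "2023-08-23", "2023-08-25")` would produce the date clause used in the earlier example.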

3 How does the user export a manifest of the search results to their local file system or possibly directly to Galaxy if that connection exists?

Both the GUI and the REST API can export a JSON or TSV file containing run accessions (SRR IDs) and FASTQ URLs. I haven't found a forwarding button, like the one in the SRA Run Selector, that can send a manifest directly to Galaxy.

@Smeds
Collaborator

Smeds commented Feb 20, 2025

The returned result is a TSV or JSON document that can contain both the actual file path (fastq_ftp) and the run_accession. fastq_md5 can also be included for verification.

run_accession	last_updated	scientific_name	fastq_ftp	country	instrument_platform	instrument_model	read_count	tax_id
SRR25728136	2023-08-23	Aplochiton taeniatus	ftp.sra.ebi.ac.uk/vol1/fastq/SRR257/036/SRR25728136/SRR25728136.fastq.gz	Chile: Santo Domingo River, Valdivia, Los Rios district	PACBIO_SMRT	Sequel II	749636	946358
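To turn such a TSV result into something a workflow could consume, it can be parsed into one record per run. This is a sketch with two assumptions on my side: that multiple files for one run are separated by ";" in fastq_ftp, and that the paths need a scheme such as "ftp://" prepended. The function names are hypothetical.

```python
import csv
import hashlib
import io


def parse_manifest(tsv_text):
    """Parse TSV search output into a list of dicts, one per run."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    runs = []
    for row in reader:
        # Assumption: several files for one run are separated by ";".
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        # Assumption: the paths carry no scheme, so "ftp://" is prepended.
        row["fastq_urls"] = ["ftp://" + p for p in paths]
        runs.append(row)
    return runs


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Check a downloaded file against a checksum such as fastq_md5."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large FASTQ files are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

All values stay strings as returned by the API; converting read_count or tax_id to integers is left to the caller.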

Smeds moved this from Todo to In Progress in BRC development tasks on Feb 21, 2025
@Smeds
Collaborator

Smeds commented Feb 21, 2025

4 How could the manifest be uploaded to a given Galaxy history from the local file system?

I must say I don't like the solution of going to a separate page, downloading a file, and then uploading it to Galaxy; it doesn't feel smooth. I would prefer an integrated component.

@nekrut
Contributor

nekrut commented Feb 24, 2025

See #309 (comment)
