Slurm Cluster Migration for Python Infrastructure #1179

Closed
11 of 12 tasks
sprintell opened this issue Oct 26, 2023 · 24 comments · Fixed by EBISPOT/gwas-sumstats-harmoniser#82

Comments

@sprintell
Member

sprintell commented Oct 26, 2023

  • gwas-sumstats-harmoniser

  • Summary statistics with HDF5

  • Summary Statistics File Validator

  • gwas-sumstats-tools

  • sum-stats-formatter

  • eQTL-SumStats

  • gwas-template-services

  • gwas-sumstats-service

  • gwas-utils

  • gwas-curation-utils

  • gwas-ebi-search-index

  • gwas-solr-slim

@sprintell
Member Author

@karatugo should have a session with @jdhayhurst before commencing this

@ljwh2
Contributor

ljwh2 commented Nov 15, 2023

Do harmoniser last so Yue can have time to complete her work

@karatugo
Member

karatugo commented Nov 21, 2023

| Repo | PR | Status | Notes |
| --- | --- | --- | --- |
| gwas-sumstats-harmoniser | EBISPOT/gwas-sumstats-harmoniser#82 & EBISPOT/gwas-sumstats-harmoniser#83 & EBISPOT/gwas-utils#159 | Done. Release needed for harmoniser & PRE_GWAS-SSF harmoniser. | 1) Yue suggested 48h time limit in SLURM. 2) Development done, testing done in sandbox by Yue, pull requests for harmoniser and pre-gwas-ssf harmoniser merged to respective main branches. Glue scripts migrated to SLURM and added to GitHub for better tracking. |
| Summary statistics with HDF5 | | Skipped | Discussed with Yomi and we agreed not to invest time in this as it will be replaced by another technology soon. |
| Summary Statistics File Validator | | Skipped | Skipped as it was deprecated. |
| gwas-sumstats-tools | | Done | No LSF usage was found. |
| sum-stats-formatter | EBISPOT/sum-stats-formatter#86 | Done | Merged with the temp sbatch script file implementation and created the following backlog item: EBISPOT/sum-stats-formatter#88 |
| eQTL-SumStats | | Skipped | Postponed. Will check in the next release cycle if it needs an update. |
| gwas-template-services | | Done | No LSF usage was found. |
| gwas-sumstats-service | EBISPOT/gwas-sumstats-service#273 & EBISPOT/gwas-sumstats-service#274 & EBISPOT/gwas-sumstats-service#275 & EBISPOT/gwas-sumstats-service#276 | Done. Need to do tag release for the migration. | Test OK for Celery workers start and refresh with scrontab. Created new START_CELERY_WORKERS_SLURM.sh in dev and prod. Also new start_celery_worker_slurm.sh script in dev and prod. Tested OK in the sandbox env. |
| gwas-utils | EBISPOT/gwas-utils#158 | Done | LSF is not used anymore, cleaned up the old LSF code. |
| gwas-curation-utils | | Done | No LSF usage was found. |
| gwas-ebi-search-index | | Done | No LSF usage was found. |
| gwas-solr-slim | EBISPOT/gwas-solr-slim#52 | Done. AFAIK no releases used but the new script start_slurm.sh. | Test OK in dev. Also, created ${bamboo.sw_dir}/${bamboo.env_dir}/scripts/gwas-solr-slim/start_slurm.sh. |
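
For reference, the 48h limit suggested for the harmoniser could be passed to `sbatch` roughly as sketched below. This is only an illustration, not the actual glue script: the script name `start_harmonisation_slurm.sh` and the job name are hypothetical, and only the standard `sbatch` options `--time` and `--job-name` are assumed.

```python
import shlex


def build_sbatch_command(script_path, time_limit="48:00:00", job_name="harmoniser"):
    """Return an sbatch argument list with a wall-clock limit (HH:MM:SS)."""
    return [
        "sbatch",
        f"--time={time_limit}",     # wall-clock limit; 48:00:00 is 48 hours
        f"--job-name={job_name}",   # hypothetical job name, for illustration only
        script_path,
    ]


# Hypothetical script name; the real glue scripts live in the repos listed above.
command = build_sbatch_command("start_harmonisation_slurm.sh")
print(shlex.join(command))
```

Building the command as a list keeps it safe to hand to `subprocess.run(command, check=True)` on a submission node without shell quoting issues.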

@karatugo
Member

All done. Releases needed for the migration to SLURM.

@sprintell
Member Author

This will be released with the metadata YAML update feature.

@ljwh2
Contributor

ljwh2 commented Mar 6, 2024

Error in SLURM - waiting for input from TSC

@karatugo
Member

karatugo commented Mar 18, 2024

Prepared scrontab entries for harmoniser.

  • Enable them before deployment
  • Disable crontab entries also

@karatugo
Member

For gwas-sumstats-harmoniser migration:

@karatugo
Member

For gwas-sumstats-harmoniser migration:

The test was submitted to codon-slurm but failed. @jiyue1214 is helping me investigate the problem.

@karatugo
Member

For gwas-sumstats-harmoniser migration:

The test submitted to codon-slurm succeeded. There is one small mistake in the meta.yaml files; @jiyue1214 is helping me investigate it.

@karatugo
Member

Thanks to @jiyue1214's fix, v1.1.7 and v1.0.7 are now released and I am testing again in codon-slurm.

[gwas_lsf@codon-dm-06 cron]$ ./start_harmonisation_slurm_test_goci1179.sh 
Submitted batch job 65232999
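
If the job ID needs to be captured for later monitoring, the `Submitted batch job <id>` line that `sbatch` prints can be parsed. A minimal sketch, assuming the output format shown above; the function name is hypothetical:

```python
import re

# sbatch prints "Submitted batch job <id>" on success (as in the log above).
SBATCH_OUTPUT_PATTERN = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's standard output."""
    match = SBATCH_OUTPUT_PATTERN.search(sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in sbatch output: {sbatch_output!r}")
    return int(match.group(1))


job_id = parse_job_id("Submitted batch job 65232999")
print(job_id)
```

The resulting ID can then be passed to standard Slurm tools such as `squeue -j` or `sacct -j` to follow the job.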

@karatugo
Member

For gwas-sumstats-harmoniser migration:

I compared the output of the harmonisation pipeline in SLURM and LSF.

  • .h.tsv.gz, .h.tsv.gz.tbi, md5sum.txt files are identical.
  • In running.log, we have a higher percentage of sites that are carried forward.
  • In meta yaml, @jiyue1214 fixed a few bugs (coordinate system and samples). (thanks @jiyue1214 !)

I suggest we deploy this after the Easter long weekend. I'll coordinate it with Yue.
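
A comparison like the one above can be scripted. The following is a rough sketch rather than the procedure actually used: it assumes the LSF and Slurm outputs sit in two separate directories with matching file names, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_run_outputs(lsf_dir, slurm_dir, suffixes=(".h.tsv.gz", ".h.tsv.gz.tbi")):
    """Map each matching file name in lsf_dir to True if the Slurm copy is identical."""
    results = {}
    for lsf_file in sorted(Path(lsf_dir).iterdir()):
        if not lsf_file.name.endswith(suffixes):
            continue  # ignore logs and other files that are expected to differ
        slurm_file = Path(slurm_dir) / lsf_file.name
        results[lsf_file.name] = (
            slurm_file.is_file() and md5_of_file(lsf_file) == md5_of_file(slurm_file)
        )
    return results
```

Any entry mapped to `False` is either missing from the Slurm directory or differs in content, and is worth inspecting by hand.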

@sprintell
Member Author

This is waiting for final update from @jiyue1214

@jiyue1214 jiyue1214 self-assigned this Apr 17, 2024
@jiyue1214

Issue:
In running.log, we have a higher percentage of sites that are carried forward.

Primary investigation:
Percentage of sites that are carried forward = Carried forward variants / ( Carried forward variants + Unmapped variants).
Based on the log file, the number of sites that are carried forward is the same in both runs, which means the difference is caused by the unmapped variants. To find out why the unmapped variants differ, I need to rerun the pipeline and inspect the intermediate files.
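
To illustrate why the unmapped count alone can move the percentage, here is a small sketch of the formula above. The counts are made up for illustration and are not taken from the actual logs.

```python
def carried_forward_percentage(carried_forward, unmapped):
    """Percentage of sites carried forward, per the formula above."""
    total = carried_forward + unmapped
    if total == 0:
        raise ValueError("no carried-forward or unmapped variants to compare")
    return 100.0 * carried_forward / total


# Hypothetical counts: same carried-forward count, different unmapped counts.
run_a = carried_forward_percentage(carried_forward=900, unmapped=100)
run_b = carried_forward_percentage(carried_forward=900, unmapped=50)
print(run_a, run_b)  # run_b is higher because fewer variants are unmapped
```

With an identical numerator, a smaller unmapped count gives a higher percentage, which matches the pattern described in the investigation.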

@jiyue1214

I reran the pipeline with the intermediate files and found:

  1. The intermediate files are the same (the md5sums of the two unmapped files are identical).
  2. I can reproduce the slight difference between LSF and Slurm, but the Slurm result is the correct number.
  3. On LSF, Nextflow read GCST90293086's unmapped file into the GCST90293085 log work folder. On Slurm, it read the correct one.

This is not explained by the code difference. However, to double-check, Yue can switch the LSF code to Slurm (changing only the executor).

@jiyue1214

I confirm the Slurm result is correct. We can close this ticket. As for the cause of the problem on LSF (the harmonisation result is correct; only the unmapped file did not match the GCST), I will open another ticket to look into it in more detail.

@ljwh2
Contributor

ljwh2 commented May 1, 2024

@karatugo to release

@ljwh2
Contributor

ljwh2 commented May 22, 2024

@jiyue1214 added an additional feature; waiting for Yue before releasing.

@jiyue1214

  1. All scripts are ready and will start running today via crontab.
  2. One small follow-up action: I will activate scrontab instead of crontab, based on the ITSC info.

@jiyue1214

jiyue1214 commented Jun 11, 2024

The Nextflow pipeline is running on Slurm and can be monitored daily via Nextflow Tower.
Question: @karatugo, according to the scrontab, we have not activated the entries for refresh harmonisation queue, queue GWAS-SSF files for harmonisation, and queue pre-GWAS-SSF files for harmonisation. Should we activate them as well?
(Attachment: Screenshot 2024-06-11 at 21.47.49.png)

@jiyue1214

We migrated all crontab jobs to Slurm this morning. This ticket can be moved to Done.
We just need to double-check tomorrow that they are running successfully.

@sprintell
Member Author

This has now been released; the ticket is due to be closed at the end of the sprint.

@ljwh2 ljwh2 closed this as completed Jun 26, 2024