Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to best call variants with ska lo? #93

Open
rrwick opened this issue Mar 5, 2025 · 2 comments
Open

How to best call variants with ska lo? #93

rrwick opened this issue Mar 5, 2025 · 2 comments

Comments

@rrwick
Copy link

rrwick commented Mar 5, 2025

Hello, and thank you for developing SKA and the new ska lo subcommand!

I'm interested in using SKA to call variants against a reference using a single assembly. This is what I've done previously (in this preprint) using ska map:

ska build -o ska.skf -k 31 assembly.fasta
ska weed --filter no-ambig ska.skf
ska map reference.fasta ska.skf -f vcf | bcftools view -e'ALT="."' > variants.vcf

This works well for most SNPs but not for closely-spaced SNPs or indels. But then @johnlees let me know about ska lo, which sounds like it could solve these shortcomings! But I can't quite figure out how to get the VCF I need using ska lo. These commands almost work:

ska build -o ska.skf -k 31 assembly.fasta reference.fasta
ska lo ska.skf test -r reference.fasta

Except I have these two issues:

  1. The indels appear in a separate VCF file that don't have positions.
  2. ska lo only seems to allow a single sequence in the reference, so it errors out when reference.fasta contains multiple sequences (e.g. a chromomsome and plasmids).

Based on the documentation, the first issue seems to be an inherent limitation of ska lo, is that right?

For the second issue, the best solution I've found is to run ska lo separately on each reference sequence and then merge the results together. This is what I've come up with:

samtools faidx reference.fasta
ska build -o ska.skf -k 31 assembly.fasta reference.fasta

# Get reference seq names
seqs=($(grep ">" "$ref" | sed 's/>//; s/ .*//'))

# Run ska lo using each reference sequence
for seq in "${seqs[@]}"; do
    ska lo ska.skf "$seq" -r <(samtools faidx reference.fasta "$seq")
done

# Reheader and concatenate VCFs
vcf_files=()
for seq in "${seqs[@]}"; do
    bcftools reheader -f "${ref}.fai" "${seq}_snps.vcf" > "${seq}_snps_fixed.vcf"
    vcf_files+=("${seq}_snps_fixed.vcf")
done

# Merge all VCFs
bcftools concat -o variants.vcf "${vcf_files[@]}"

It's clunky, but it seems to work. Is there a better way?

Thanks!
Ryan

@rderelle
Copy link
Collaborator

rderelle commented Mar 6, 2025

Hi Ryan,

Thanks for your message. The command 'ska lo' calls variants in reference-free mode and only subsequently attempts to position SNPs on a reference genome, mostly designed for recombination analyses. So, as opposed to other methods that call variants directly from the reference genome, its SNP positioning is prone to errors (which does not necessarily imply that the SNPs are incorrect). More details here: https://www.biorxiv.org/content/10.1101/2024.10.02.616334v2

The current variant positioning is clearly limited, mostly due to a lack of time, and I/someone will have at some point to improve it. You are correct regarding some of its limitations:

  • no indel positioning: it can be done in the future but will probably require to profoundly modify the positioning function as indel positioning is tricky.
  • only 1 reference sequence allowed: this limitation shouldn't exist ... it's just me being a lazy coder. :(

Regarding quick fixes:

  • indels: I could provide you with a Python script that combines the insert with left and right sequences ('before' and 'after' in indel vcf) and then Blast these sequences on a reference genome for positioning. I used this approach in the 1st version of the skalo paper so must have such script somewhere and could modify/test it. Please let me know if you need it.
  • multiple reference sequences: yes, your approach should work. I wouldn't use it in the case of a fragmented genome assembly, with the risk of a given SNP being positioned multiple times independently. But I trust this approach will work here with a full assembly and two small plasmids. Or alternatively you could remove the two small plasmids from the reference and use the standard ska lo command line -> a few more false negatives for ska lo, which would be fair.

All that said, ska map might be sufficient to address the question sequencing reads vs assemblies (I really enjoyed reading your preprint!).

best,
Romain

@johnlees
Copy link
Member

johnlees commented Mar 6, 2025

We should fix the multi-contig reference, the code in ska map does this and we should be able to port it over to ska lo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants