the rule of barcode correction #241
Comments
The correction is to match with a barcode whitelist. It corrects one mismatch, and if there is a tie, TRUST4 will use the whitelist barcode frequency observed in the dataset from those perfect barcodes to break ties. If there is no whitelist given, TRUST4 will not conduct error correction. |
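As a rough illustration of the rule described above, the following is a minimal sketch and not TRUST4's actual code. The function name `correct_barcodes` and the alphabetical fallback for candidates with equal frequency are assumptions made for this example.

```python
from collections import Counter

BASES = "ACGT"

def correct_barcodes(observed, whitelist):
    """Map each observed barcode to a whitelist barcode, or None if it cannot be corrected."""
    whitelist = set(whitelist)
    # Frequency of each whitelist barcode among reads whose barcode matches perfectly.
    perfect_counts = Counter(bc for bc in observed if bc in whitelist)

    def correct(bc):
        if bc in whitelist:
            return bc
        # Collect every whitelist barcode that differs at exactly one position.
        candidates = set()
        for i, base in enumerate(bc):
            for alt in BASES:
                if alt != base:
                    cand = bc[:i] + alt + bc[i + 1:]
                    if cand in whitelist:
                        candidates.add(cand)
        if not candidates:
            return None
        # Break ties with the perfect-match frequency; sorting first makes
        # equal-frequency ties deterministic (an assumption of this sketch).
        return max(sorted(candidates), key=lambda c: perfect_counts[c])

    return [correct(bc) for bc in observed]

whitelist = ["AAAA", "AATA", "CCCC"]
observed = ["AAAA", "AAAA", "AATA", "AACA", "CCCC", "GGGG"]
corrected = correct_barcodes(observed, whitelist)
```

Here `AACA` is one mismatch away from both `AAAA` and `AATA`; `AAAA` was seen more often as a perfect match, so it wins the tie. `GGGG` has no whitelist neighbor and stays uncorrected.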
Thank you so much for your answer. When I provide a barcodeTranslate file in which two different barcodes stand for the same spot, will TRUST4 also treat them as the same spot? Thanks again. |
Do you mean that in the barcode translate table, two barcodes, say A and B, will be translated into the same barcode C? In that case, TRUST4 will just output barcode C. |
Thank you so much for your answer. A ATCGATCGATCG <- In other words, two barcodes (marked by arrow), differing by perhaps two bases, stand for the same spot A. Will TRUST4 also output A? Thank you so much! |
Yes, both barcodes will be translated into A, and A will be in the final output. |
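The many-to-one translation discussed here can be pictured with a small lookup table. This is only a sketch: it assumes a two-column, tab-separated translate file with the spot label first and the raw barcode second (the layout mentioned later in this thread), and `load_translation` is a hypothetical helper, not part of TRUST4.

```python
import csv
import io

def load_translation(handle):
    """Read 'spot<TAB>barcode' rows into a barcode -> spot dictionary."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        spot, barcode = row[0], row[1]
        table[barcode] = spot
    return table

# Two different raw barcodes that both belong to spot A.
example = "A\tATCGATCGATCG\nA\tTTCGTTCGATCG\n"
table = load_translation(io.StringIO(example))

raw = ["ATCGATCGATCG", "TTCGTTCGATCG"]
translated = [table[bc] for bc in raw]
```

Both raw barcodes map to the same label, so everything downstream sees a single barcode `A`.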
Just want to confirm. Do the two barcodes ATCGATCGATCG and TTCGTTCGATCG still correspond to two cells in your data, and it just means they are from the same spatial spot A? |
Thank you so much for your answer. Thanks again. and I wish you a Happy New Year in advance! |
Yes, the correction step will be skipped and the output will report the representative TCR or BCR for spot A. |
Thanks a lot. My original single spots are small, and I want to merge several small spots into a bigger one (similar to a cell) so that it runs faster. Yuyu |
Just now, I found a warning during annotation: Use of uninitialized value $tmpGermline[43] in string ne at ./TRUST4/trust-airr.pl line 368, line 423598. (There are many rows of the same warning.) I am not sure where this warning comes from. By the way, I really want to ask whether there is any way to efficiently assemble about 500,000 barcodes, each with about 10,000 reads? Thank you so much for your patience. Best wishes, |
1,2. cid is the contig id/name. It has the barcode information if it is single-cell data. Since there could be multiple contigs assembled for a cell, the last number in the ID provides the information to distinguish the contigs from the same cell. They are useful to link the CDR3s to the underlying sequences in the annot.fa file. |
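To group contigs by cell, the contig ID can be split at its last separator. The sketch below assumes the ID has the form `<barcode>_<number>`; that layout, the example IDs, and the helper name `split_cid` are assumptions for illustration and should be checked against your own output files.

```python
from collections import defaultdict

def split_cid(cid):
    """Split a contig ID into (barcode, contig_index), cutting at the LAST underscore."""
    barcode, _, index = cid.rpartition("_")
    return barcode, int(index)

# Hypothetical contig IDs; barcodes such as "12_34" may themselves contain underscores.
cids = ["12_34_0", "12_34_1", "56_78_0"]

contigs_per_cell = defaultdict(list)
for cid in cids:
    barcode, index = split_cid(cid)
    contigs_per_cell[barcode].append(index)
```

Cutting at the last underscore keeps spot names like `x_y` intact.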
Thank you very much for your detailed answer. Regarding the fifth point: yes, they are all VDJ-targeted amplified reads. However, I have not aligned them. Do you recommend aligning these reads first? And is STAR a suitable tool for this, or which alignment tool do you suggest for VDJ data? Thank you so much again! Best wishes, |
I don't think you need to do alignment first. Since these are enriched VDJ reads, you may try the --repseq option to speed up the process. |
Thank you so much for your help. I will try them again with this option. Best wishes, |
Hello, I recently downloaded the latest version of TRUST4 via conda. However, I got an issue when attempting to assemble with the parameter --readFormat bc:0:15. The error message was: Unknown parameter --readFormat. I was wondering if you could kindly assist me in troubleshooting this problem? Thank you so much! Best wishes, |
Could you please show me your full command? |
YES. Following is my full command: *The format of test_assemble_1_barcodeTranslate.tsv is: "x_y" \t "15bp_barcode" Moreover, I want to ask about the output again.
Thank you so much again! Best wishes, |
Could you please run "./run-trust4" without any parameters? It will show the version number. Sometimes you need to specify the version on conda to get the newest version. For other questions:
I am sorry, I have checked the version of trust4 just now, which is 1.0.5.1. I will try again to install the latest one. Thank you so much again for your patience! Best wishes, |
This is indeed an older version of 1.0.5. You can try something like conda install -c bioconda trust4=1.0.13 to install the latest version. Otherwise, you can also download the github version. TRUST4 does not have many dependencies, it is straightforward to compile from source code. |
Okay! Thank you so much for your tips! :) |
Does the run of "single-barcode assemble" mean that you extract all the reads from that barcode and run it in the bulk RNA-seq mode? The assembly procedure is slightly different in single-cell mode and bulk mode, such as contig overlap criteria. Different underlying contigs will result in different abundance estimations. |
The Single-Barcode assembly was performed in bulk mode. Could you please explain how the assembly procedure differs between single-cell mode and bulk mode (maybe with a simple example)? I am particularly using the read count used for assembly to determine the reliability of the final assembly. For instance, if there are 100 reads associated with a specific barcode, but only 1 read is used to determine the CDR3, I would consider this resulting CDR3 as potentially inaccurate due to the possibility of bias. Thank you so much. Best wishes, |
There could be many subtle differences. One is that read sorting will be different. Using the whole data set gives a different k-mer abundance distribution, so the read sorting and the downstream assembly will be different. Furthermore, in single-cell mode, we use a more lenient threshold when determining whether a read overlaps with a contig. In bulk mode, the threshold is at least 21bp (also depending on the read length); in single-cell mode, it can be as low as 13bp. |
Thank you so much for your explanation. Then, I am facing a slight challenge in deciding which mode is suitable for the spatial sequencing data. Should I still use the single-cell mode or assemble each spot as a bulk (but I have more than 5k spots)? I think the final top 1 assembly for each barcode should be same, while the number of reads used to assemble is calculated differently. However, I am confused as to why I am observing a lower number of reads used to assemble the same clonotype for the specific barcode in single-cell mode than in the bulk mode, even when the threshold (used to determine whether a read overlaps with a contig) is set low? Best wishes, |
Does one spot contain a single cell or multiple cells in your data? Ideally, it shall be in single-cell mode. I think it is mainly affected by the read order, which may put some other contigs first to be assembled, which will drag away some alignments. Do you see this abundance discrepancy in many other spots as well? |
Yes, each spot should be considered as a single cell. I have recently made a comparison for a particular barcode. However, I plan to examine more barcodes to gain a deeper understanding of their differences. Once again, thank you sincerely for providing such a detailed explanation. Best regards, |
Dear Professor Li, Thank you once again for your detailed explanation. Firstly, I would like to explain an alternative step I have taken for my spatial RNA-seq data. Before running the TRUST4 assembly in single-cell mode, I split this large data set (.fq) into 5 parts according to the barcodes to speed up the assembly step. I then run the TRUST4 assembly in single-cell mode on each of these 5 parts separately. Finally, I merge the assembly results. However, I have several questions regarding the results obtained from TRUST4 in single-cell mode.
Thank you so much for your assistance. Best wishes, |
Dear Prof. Li, Thank you so much for your response.
Thank you once again for your patience and assistance. Best regards, |
I guess some of the barcodes fail to translate and will be marked as "missing_barcode" in the toassemble_bc.fa file. These reads will be filtered in the assembly by default. |
Thanks for the quick answer and tips. |
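A quick way to see how many reads are affected is to count the "missing_barcode" entries. The sketch below assumes toassemble_bc.fa is FASTA-like, with a ">" header line per read followed by a line holding either the barcode or the literal string missing_barcode; `count_missing` is a hypothetical helper, not a TRUST4 tool.

```python
def count_missing(lines):
    """Return (missing, total) over the non-header lines of a FASTA-like barcode file."""
    total = 0
    missing = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and read-name headers
        total += 1
        if line == "missing_barcode":
            missing += 1
    return missing, total

# Small in-memory example; for a real run, pass an open file handle instead.
example = [
    ">read1", "A",
    ">read2", "missing_barcode",
    ">read3", "A",
]
missing, total = count_missing(example)
```

With a real file this could be called as `count_missing(open("toassemble_bc.fa"))`, where the path is whatever prefix your run produced.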
Dear Prof. Li, I am sorry to disturb you again about the alignment step. I noticed that some of the reads were identified as VDJC genes by bowtie2 but not by TRUST4. Upon further investigation, I found that most of these reads are not in the C-regions, but rather in the V-regions, and their mapping scores were also relatively high. I have a few examples of these reads and I was wondering if you could help me understand them better. The species I'm working with is human, and I used the hg38_bcrtcr.fa reference file provided in the TRUST4 package.
I have observed that Bowtie2 identifies mapped reads twice as often as TRUST4 does. This discrepancy is concerning to me, and I would appreciate your input on this matter. Thank you! Best wishes, |
Thank you for scrutinizing this step. If you directly put the two reads as input to TRUST4, they will be recognized. How did you search the reads? Is your data paired-end? TRUST4 will merge read pairs, so if you are searching the read sequence, you may not be able to find it directly. Another possibility is the read quality. There are some read quality trimming inside of TRUST4. This will make either the read too short to be used in the assembly stage, or affect your read search (if you are searching using read sequence). |
Dear Prof. Li, thank you so much for the answer.
Thanks again for your assistance and patience. Best wishes, |
I just checked the running command you posted before, I think the command should be:
You shall also add r1:16:-1 in readFormat, otherwise the read itself will include the barcode sequence. Though the command before is for mouse data, I guess you used the same command this time? |
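To make the field syntax concrete, here is a small sketch of how a specification such as bc:0:15,r1:16:-1 could be read as slices of a sequence. It assumes 0-based, inclusive start and end coordinates with -1 meaning "to the end of the read"; that interpretation and the helper names `parse_read_format` and `extract` are assumptions for illustration, not TRUST4 code.

```python
def parse_read_format(spec):
    """Parse 'name:start:end,...' into a {name: (start, end)} dictionary."""
    fields = {}
    for item in spec.split(","):
        name, start, end = item.split(":")
        fields[name] = (int(start), int(end))
    return fields

def extract(seq, start, end):
    """Slice seq using 0-based inclusive coordinates; end == -1 means end of read (assumed)."""
    stop = len(seq) if end == -1 else end + 1
    return seq[start:stop]

fields = parse_read_format("bc:0:15,r1:16:-1")

# Toy read: 16 barcode bases followed by 4 bases of insert.
read1 = "A" * 16 + "CCCC"
barcode = extract(read1, *fields["bc"])
insert = extract(read1, *fields["r1"])
```

Under these assumptions the barcode field takes the first 16 bases and the r1 field keeps only what follows, which is why r1 matters only when the barcode and the biological sequence share one read.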
Oh, yes, I have always used this command. To make it clear, my data looks like this: read1 is 15bp (barcode) and read2 is 90bp (TCR only). So, although my read2 does not include the barcode sequence, do I still have to add r1:16:-1 to make that explicit? Thank you so much! This is definitely important to me! Yuyu |
I see. I thought both of them were from read1. You don't need the r1:16:-1 then. I just remembered you used the "repseq" option. There might be a bug in processing TCR region in this mode. Let me look into this issue. Thank you for providing the example! |
Thank you for your continued support and feedback too! I will also follow the read quality! :) |
I think I've found the issue and pushed an update to the main branch. Could you pull the github repo, recompile trust4 and give it a try? |
Thank you once again for your support. |
Dear Prof. Li, By the way, is it possible that I can only run the first step, namely reads extraction? Thanks |
If you have the log/output on screen from the run-trust4, you can find the command for each step. You can get the command of running "fastq-extractor" there. The updated code does not affect the read extraction stage though. |
Dear Dr. Li, I have tested the updated code. Initially, there were approximately 4M assembled reads, but now there are around 9M reads, slightly exceeding the recognition of bowtie2. Based on these results, I believe the outcome is reasonable. Thanks a lot for your support throughout this process. Thank you once again! Best regards, |
By the way, is it possible to synchronize this update with Conda? When I attempted to install TRUST4 on Linux using the zip file, I encountered an issue where the zlib package was missing, so samtools could not be used properly. As a workaround, I had to comment out the steps related to bamextractor (namely the steps relating to samtools) in the Makefile in order to proceed with the other steps. Could you please consider incorporating this update into Conda, as it would ensure a smoother installation process and avoid such complications? Thank you for your attention to this matter. Yuyu |
How's the assembly speed after relaxing the filtering? I recall that I removed those reads because in TCR assembly we only need to know their V, J assignment and there was no need to infer the full-length sequence. The conda version requires releasing a new version of TRUST4. I want to make sure there are no other urgent issues before drafting a new release. |
Dear Prof. Li, I have reviewed the time cost and it seems acceptable. Originally, it took about 22 minutes to complete the entire process for approximately 9M reads, while the updated version now takes about 30 minutes. I hope these data can help you. For my project, I aim to assemble the full length of VDJ for TCR. Therefore, it would greatly benefit me if TRUST4 could retain all essential reads mapped to VDJ-regions. It would be even better if TRUST4 could provide users with an overview of their raw input data, specifically the number of reads mapped to VDJ-regions or containing CDR3 motifs. This feature would greatly support us in assessing the reliability of the final assembly result. Thank you once again for your support. Best wishes, |
Thank you for the testing. It's not bad and makes sense to have the full-length TCR assembly.
What's the difference between this and the abundance of the CDR3? |
Dear Prof. Li, Although I have not looked into the abundance of the CDR3, there were more spots, in which CDR3 were successfully assembled. Best wishes, |
Dear Prof. Li, I have double-checked the reads used for assembly with the newest TRUST4 (the new update to the main branch). However, I have noticed that there are still certain reads, potentially mapping to TCR genes (TRAV and TRAJ), which were not included in the assembly. I can provide you with these reads in fq format, along with the mapping results obtained from STAR in sam format. Additionally, I would like to mention that these reads were sequenced from human samples with TRA enrichment only. Before assembly, I performed several quality control steps, including the use of fastp to filter out low-quality reads (Q30>80%) and cutadapt to trim adapter sequences. As a result, the lengths of these reads vary. Regarding the reasons why these reads were not used, I have a few hypotheses:
I kindly request your assistance in investigating these matters further. star_mapped2tra_read_2.fq.gz Best wishes, |
Sorry for the delayed reply. You are right that the majority of the alignments have long introns "xxxN" or soft clips in the CIGAR field, so I think these reads will be aligned poorly to the TRA genes as a whole read and will be filtered by TRUST4. For the remaining reads, I checked some of them manually, and it seems they overlap a lot with the UTR regions, which are not part of the reference sequences. These reads will be filtered during the assembly stage if there is no valid contig for them to anchor on, i.e. the V gene is lowly expressed/used. The read length and the quality filtering should be fine. |
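To check how many STAR alignments carry long introns or soft clips, the CIGAR strings can be summarized directly. This is a minimal sketch: the `max_clip` cutoff is an arbitrary value chosen for illustration and is not TRUST4's filtering criterion.

```python
import re

# One CIGAR element is a length followed by an operation letter (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_summary(cigar):
    """Sum the bases covered by introns (N), soft clips (S) and aligned ops (M, =, X)."""
    ops = CIGAR_OP.findall(cigar)
    return {
        "intron": sum(int(n) for n, op in ops if op == "N"),
        "soft_clipped": sum(int(n) for n, op in ops if op == "S"),
        "aligned": sum(int(n) for n, op in ops if op in "M=X"),
    }

def looks_spliced_or_clipped(cigar, max_clip=10):
    """Flag alignments with any intron gap or more than max_clip soft-clipped bases."""
    summary = cigar_summary(cigar)
    return summary["intron"] > 0 or summary["soft_clipped"] > max_clip

spliced = cigar_summary("30M5000N60M")
flags = [looks_spliced_or_clipped(c) for c in ["90M", "20S70M", "30M5000N60M"]]
```

Applied to column 6 of the SAM file, this gives a rough count of reads that align only in pieces and are therefore unlikely to match a V or J gene as a whole read.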
Thank you sincerely for your answer. Then I think TRUST4 works truly fine. I am grateful once again for this useful tool and your support. |
Hello,
I am using TRUST4 on my spatial RNA-seq data, which has a 20bp barcode. However, it is not clear to me how TRUST4 handles single-base errors within these 20bp barcodes. In other words, when two barcodes standing for the same spatial spot differ by a single base, will TRUST4 treat them as the same barcode?
Thank you so much!
Best wishes,
Yuyu