[DOCKTESTERS] BWA-Mem validation of DO51057 (normal BAM only). 96.3% matches, 0.013% mismatches, and 3.7% soft-matches
Miguel Vazquez
miguel.vazquez at cnio.es
Thu Apr 6 10:28:21 EDT 2017
Dear all,
This is just an advance teaser for the BWA-Mem validation after the latest
changes. It is currently running over the tumor BAM, but the normal BAM has
completed and the *mismatches are two orders of magnitude lower* than in
our two previous attempts. Before further discussion, here are the raw
numbers:
Lines: 1125172217
Matches: 1083221794
*Misses: 143716*
Soft: 41806707
If my calculations are correct, this means 96.3% matches, *0.013%
mismatches*, and 3.7% soft-matches.
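The percentages follow directly from the raw counts above. A minimal sketch of the arithmetic in Python (the function name `percentages` is only illustrative and not part of the validation scripts):

```python
def percentages(lines, matches, misses, soft):
    """Return the (match, miss, soft) shares of all compared lines, in percent."""
    # The three categories should account for every compared line.
    assert matches + misses + soft == lines
    return tuple(100.0 * n / lines for n in (matches, misses, soft))


# Raw numbers for the DO51057 normal BAM reported above.
match_pct, miss_pct, soft_pct = percentages(1125172217, 1083221794, 143716, 41806707)
print(f"{match_pct:.1f}% matches, {miss_pct:.3f}% misses, {soft_pct:.1f}% soft")
# prints: 96.3% matches, 0.013% misses, 3.7% soft
```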
The fix came in two parts. The first was realizing that the input to this
process should not be a single unaligned version of the output BAM, but
several input BAMs. Unfortunately, breaking down the output BAM into its
constituent BAMs, by a process implemented by Jonas, did not address the
problem. After this first attempt it was pointed out to us, I think by
Keiran, that the order of the reads matters, and so our attempt to work back
from the output BAM was not going to work. Junjun came back to us with the
second part of the fix: he located a subset of the original unaligned BAMs
at the DKFZ that we could use. Downloading these BAM files and submitting
them to BWA-Mem in the same order as specified in the output BAM header
achieved these promising results.
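As an illustration of recovering the input order from the output BAM header, here is a rough Python sketch. It assumes, and this is only an assumption, that the order is reflected in the `@RG` (read group) lines of the header; the header text and the read group IDs below are hypothetical.

```python
def read_group_order(header_text):
    """Return the @RG IDs in the order they appear in a SAM/BAM header."""
    order = []
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "ID" in fields:
            order.append(fields["ID"])
    return order


# Hypothetical header text, e.g. as printed by `samtools view -H output.bam`.
example_header = "\n".join([
    "@HD\tVN:1.4\tSO:coordinate",
    "@RG\tID:run_B\tSM:normal",
    "@RG\tID:run_A\tSM:normal",
])
print(read_group_order(example_header))
```

The unaligned BAMs would then be submitted in the order of the returned list.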
I will reply to this message in a few days with the corresponding numbers for
the other BAM, the tumor, which is currently running.
Best regards
Miguel
On Sun, Feb 19, 2017 at 1:43 PM, Miguel Vazquez <miguel.vazquez at cnio.es>
wrote:
> Dear all,
>
> Great news! The BWA-Mem test on a real PCAWG donor succeeded in running,
> achieving an overlap with the original BAM alignment similar to the
> HCC1143 test. The numbers are:
>
> Lines: 1708047647
> Matches: 1589172843
> Misses: 62726130
> Soft: 56148674
>
> This means 93% matches, 3.6% mismatches, and 3.2% soft-matches. Compared
> to the HCC1143 test, a few percentage points of matches turn
> into soft-matches (95% and 1.3% to 93% and 3.2%), but the ratio of misses
> is very close, at 3.6%.
>
> I'm running this test on a second donor.
>
> Best regards
>
> Miguel
>
> On Tue, Feb 14, 2017 at 3:30 PM, Miguel Vazquez <miguel.vazquez at cnio.es>
> wrote:
>
>> Dear colleagues,
>>
>> I'm very happy to say that the BWA-Mem pipeline finished for the HCC1143
>> data.
>>
>> I think what solved the problem was setting the headers in the unaligned
>> BAM files. I'm currently trying it out with the DO35937 donor, but it's too
>> early to say whether it's working or not.
>>
>> To compare BAM files I've followed some advice that I found on the
>> internet https://www.biostars.org/p/166221/. I will detail the approach a
>> bit below because I would like some advice on how appropriate it is, but
>> first here are the numbers:
>>
>> *Lines*: 74264390
>> *Matches*: 70565742
>> *Misses*: 2693687
>> *Soft*: 1004961
>>
>>
>> This means *95% matches, 3.6% mismatches, and 1.3% soft-matches*.
>> Matches are when the chromosome and position are the same, soft-matches are
>> when they are not the same but the position from one of the alignments is
>> included in the list of alternative positions for the other alignment (e.g.
>> XA:Z:15,-102516528,76M,0), and misses are the rest.
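The three categories defined above can be sketched in Python as follows. This is only an illustration of the definitions, not the code of the comparison script; the function names are made up, and ignoring the strand sign in the XA entries is an assumption.

```python
def xa_positions(xa_value):
    """Parse an XA value such as '15,-102516528,76M,0;' into {(chrom, pos)}."""
    positions = set()
    for entry in xa_value.split(";"):
        if not entry:
            continue
        chrom, signed_pos, _cigar, _nm = entry.split(",")
        # The sign encodes the strand; only the coordinate is compared here
        # (an assumption of this sketch).
        positions.add((chrom, abs(int(signed_pos))))
    return positions


def classify(chrom_a, pos_a, xa_a, chrom_b, pos_b, xa_b):
    """Classify a pair of alignments of the same read as match, soft or miss."""
    if (chrom_a, pos_a) == (chrom_b, pos_b):
        return "match"
    # Soft-match: one alignment's position is an alternative hit of the other.
    if (chrom_a, pos_a) in xa_positions(xa_b) or (chrom_b, pos_b) in xa_positions(xa_a):
        return "soft"
    return "miss"


print(classify("15", 102516528, "", "7", 5000, "15,-102516528,76M,0"))  # prints: soft
```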
>>
>> Here is the detailed process from the start. The comparison script is
>> here: https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_bwa_bam.sh
>>
>> 1) Un-align the tumor and normal BAM files, retaining the original aligned
>> BAM files
>> 2) Run BWA-Mem, which produces a file called HCC1143.merged_output.bam with
>> alignments from both tumor and normal
>> 3) Use samtools to extract the entries, limited to the first in pair
>> (?), cut the read-name, chromosome, position (??) and extra information
>> (for additional alignments) and sort them. We do this for the original
>> files and for the BWA-Mem merged_output file, but separating tumor and
>> normal entries (marked with the codes 'tumor' and 'normal', I believe from
>> the headers I set when un-aligning them)
>> 4) Join the lines by read-name, separately for the tumor and normal pairs
>> of files, and check for matches
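Steps 3 and 4 can be sketched in Python over SAM text lines (for example the output of `samtools view`). This is a rough illustration rather than the shell script itself; the function names and the example reads are hypothetical, and skipping secondary and supplementary alignments is an assumption made so that each read name appears once.

```python
def first_in_pair_records(sam_lines):
    """Map read name -> (chromosome, position, XA value) for first-in-pair reads."""
    records = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, chrom, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if not flag & 0x40:
            continue  # 0x40 marks the first read of a pair
        if flag & 0x900:
            continue  # assumption: drop secondary (0x100) and supplementary (0x800)
        # Optional tags start at the 12th column; keep XA if present.
        xa = next((f[5:] for f in fields[11:] if f.startswith("XA:Z:")), "")
        records[name] = (chrom, pos, xa)
    return records


def join_by_read_name(original, realigned):
    """Yield (name, original_record, realigned_record) for reads present in both."""
    for name in sorted(original.keys() & realigned.keys()):
        yield name, original[name], realigned[name]


# Hypothetical reads: r1 aligns identically, r2 moves to another chromosome.
original = first_in_pair_records([
    "r1\t99\t1\t1000\t60\t76M\t=\t1200\t276\t*\t*",
    "r1\t147\t1\t1200\t60\t76M\t=\t1000\t-276\t*\t*",
    "r2\t67\t15\t500\t0\t76M\t=\t700\t276\t*\t*\tXA:Z:2,+900,76M,0;",
])
realigned = first_in_pair_records([
    "r1\t99\t1\t1000\t60\t76M\t=\t1200\t276\t*\t*",
    "r2\t67\t2\t900\t0\t76M\t=\t1100\t276\t*\t*",
])
same = sum(1 for _, a, b in join_by_read_name(original, realigned) if a[:2] == b[:2])
print(same)  # prints: 1
```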
>>
>> I have two questions:
>> (?) Is it OK to select only the first in pair? It's what the guy in the
>> example did, and it simplified the code by avoiding repeated read-names.
>> (??) I guess it's OK to check only chromosome and position; the CIGAR
>> would necessarily be the same.
>>
>> Best regards
>>
>> Miguel
>>
>> On Mon, Jan 16, 2017 at 3:24 PM, Miguel Vazquez <miguel.vazquez at cnio.es>
>> wrote:
>>
>>> Dear all,
>>>
>>> Let me summarize the status of the testing for Sanger and DKFZ. The
>>> validation has been run for two donors for each workflow: DO50311 and DO52140
>>>
>>> Sanger:
>>> ----------
>>>
>>> Sanger calls only somatic variants. The results are *identical for
>>> Indels and SVs* but only *almost identical for SNV.MNV and CNV*. The
>>> discrepancies are reproducible (on the same machine at least), i.e. the
>>> same ones are found after running the workflow a second time.
>>>
>>> DKFZ:
>>> ---------
>>> DKFZ calls somatic and germline variants, except germline CNVs. For both
>>> germline and somatic variants the results are *identical for SNV.MNV
>>> and Indels* but with *large discrepancies for SV and CNV*.
>>>
>>> Kortine Kleinheinz and Joachim Weischenfeldt are in the process of
>>> investigating this issue, I believe.
>>>
>>> BWA-Mem failed for me and has also failed for Denis Yuen and Jonas
>>> Demeulemeester. Denis, I believe, is investigating this problem further. I
>>> haven't had the chance to investigate this much myself.
>>>
>>> Best
>>>
>>> Miguel
>>>
>>>
>>>
>>>
>>> ---------------------
>>> RESULTS
>>> ---------------------
>>>
>>> ubuntu@ip-10-253-35-14:~/DockerTest-Miguel$ cat results.txt
>>>
>>> Comparison of somatic.snv.mnv for DO50311 using DKFZ
>>> ---
>>> Common: 51087
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.indel for DO50311 using DKFZ
>>> ---
>>> Common: 26469
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.sv for DO50311 using DKFZ
>>> ---
>>> Common: 231
>>> Extra: 44
>>> - Example: 10:20596800:N:<TRA>,10:56066821:N:<TRA>,11:16776092:N:<TRA>
>>> Missing: 48
>>> - Example: 10:119704959:N:<INV>,10:13116322:N:<TRA>,10:47063485:N:<TRA>
>>>
>>>
>>> Comparison of somatic.cnv for DO50311 using DKFZ
>>> ---
>>> Common: 731
>>> Extra: 213
>>> - Example: 10:132510034:N:<DEL>,10:20596801:N:<NEUTRAL>,10:47674883:N:<NEUTRAL>
>>> Missing: 190
>>> - Example: 10:100891940:N:<NEUTRAL>,10:104975905:N:<NEUTRAL>,10:119704960:N:<NEUTRAL>
>>>
>>>
>>> Comparison of germline.snv.mnv for DO50311 using DKFZ
>>> ---
>>> Common: 3850992
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of germline.indel for DO50311 using DKFZ
>>> ---
>>> Common: 709060
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of germline.sv for DO50311 using DKFZ
>>> ---
>>> Common: 1393
>>> Extra: 231
>>> - Example: 10:134319313:N:<DEL>,10:134948976:N:<DEL>,10:19996638:N:<DEL>
>>> Missing: 615
>>> - Example: 10:101851839:N:<TRA>,10:101851884:N:<TRA>,10:10745225:N:<DUP>
>>>
>>> File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO50311//output//DO50311.germline.cnv.vcf.gz
>>>
>>> Comparison of somatic.snv.mnv for DO52140 using DKFZ
>>> ---
>>> Common: 37160
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.indel for DO52140 using DKFZ
>>> ---
>>> Common: 19347
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.sv for DO52140 using DKFZ
>>> ---
>>> Common: 72
>>> Extra: 23
>>> - Example: 10:132840774:N:<DEL>,11:38252019:N:<TRA>,11:47700673:N:<TRA>
>>> Missing: 61
>>> - Example: 10:134749140:N:<DEL>,11:179191:N:<TRA>,11:38252005:N:<TRA>
>>>
>>>
>>> Comparison of somatic.cnv for DO52140 using DKFZ
>>> ---
>>> Common: 275
>>> Extra: 94
>>> - Example: 1:106505931:N:<LOH>,1:109068899:N:<DEL>,1:109359995:N:<DEL>
>>> Missing: 286
>>> - Example: 10:88653561:N:<LOH>,11:179192:N:<LOH>,11:38252006:N:<LOH>
>>>
>>>
>>> Comparison of germline.snv.mnv for DO52140 using DKFZ
>>> ---
>>> Common: 3833896
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of germline.indel for DO52140 using DKFZ
>>> ---
>>> Common: 706572
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of germline.sv for DO52140 using DKFZ
>>> ---
>>> Common: 1108
>>> Extra: 1116
>>> - Example: 10:102158308:N:<DEL>,10:104645247:N:<DEL>,10:105097522:N:<DEL>
>>> Missing: 2908
>>> - Example: 10:100107032:N:<TRA>,10:100107151:N:<TRA>,10:102158345:N:<DEL>
>>>
>>> File not found /mnt/1TB/work/DockerTest-Miguel/tests/DKFZ/DO52140//output//DO52140.germline.cnv.vcf.gz
>>>
>>> Comparison of somatic.snv.mnv for DO50311 using Sanger
>>> ---
>>> Common: 156299
>>> Extra: 1
>>> - Example: Y:58885197:A:G
>>> Missing: 14
>>> - Example: 1:102887902:A:T,1:143165228:C:G,16:87047601:A:C
>>>
>>>
>>> Comparison of somatic.indel for DO50311 using Sanger
>>> ---
>>> Common: 812487
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.sv for DO50311 using Sanger
>>> ---
>>> Common: 260
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.cnv for DO50311 using Sanger
>>> ---
>>> Common: 138
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.snv.mnv for DO52140 using Sanger
>>> ---
>>> Common: 87234
>>> Extra: 5
>>> - Example: 1:23719098:A:G,12:43715930:T:A,20:4058335:T:A
>>> Missing: 7
>>> - Example: 10:6881937:A:T,1:148579866:A:G,11:9271589:T:A
>>>
>>>
>>> Comparison of somatic.indel for DO52140 using Sanger
>>> ---
>>> Common: 803986
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.sv for DO52140 using Sanger
>>> ---
>>> Common: 6
>>> Extra: 0
>>> Missing: 0
>>>
>>>
>>> Comparison of somatic.cnv for DO52140 using Sanger
>>> ---
>>> Common: 36
>>> Extra: 0
>>> Missing: 2
>>> - Example: 10:11767915:T:<CNV>,10:11779907:G:<CNV>
>>>
>>
>>
>
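The Common / Extra / Missing counts in the quoted results can be read as set operations over variant keys of the form chromosome:position:ref:alt. A minimal Python sketch of that reading follows. It assumes that "Extra" means present only in the re-run and "Missing" means present only in the original call set; the function name and the example keys are hypothetical.

```python
def compare_variants(original_keys, rerun_keys):
    """Count variants shared by both call sets and those unique to each."""
    original, rerun = set(original_keys), set(rerun_keys)
    return {
        "Common": len(original & rerun),
        "Extra": len(rerun - original),    # assumption: only in the re-run
        "Missing": len(original - rerun),  # assumption: only in the original
    }


# Hypothetical keys, for illustration only.
original_calls = ["1:100:A:G", "1:200:C:T", "2:300:G:A"]
rerun_calls = ["1:100:A:G", "2:300:G:A", "3:400:T:C"]
print(compare_variants(original_calls, rerun_calls))
# prints: {'Common': 2, 'Extra': 1, 'Missing': 1}
```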