[DOCKTESTERS] Help understand the small discrepancies in Sanger pipeline with GNOS VCF (99.9905% accuracy)

Keiran Raine kr2 at sanger.ac.uk
Mon Dec 12 10:38:42 EST 2016


Hi Miguel,

ASCAT is *.somatic.cnv.vcf.gz

Pindel is *.somatic.indel.vcf.gz

Are you not using vcftools to do comparisons on all generated VCF files?

All variants:
vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --diff-site --out in1_v_in2

Passed variants:
vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --diff-site --out in1_v_in2 --remove-filtered-all
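
For a quick cross-check without vcftools, the same kind of site-level comparison can be sketched in Python. This is only an illustrative sketch, not part of the Sanger pipeline: the helper names (read_lines, site_keys, compare_sites) are hypothetical, sites are keyed on (chr, pos, ref, alt), and treating a FILTER of "." as passed is an assumption of this sketch.

```python
import gzip


def read_lines(path):
    """Yield text lines from a plain or gzipped VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from handle


def site_keys(lines, passed_only=False):
    """Collect (chrom, pos, ref, alt) keys from an iterable of VCF lines."""
    keys = set()
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, flt = fields[0], fields[1], fields[3], fields[4], fields[6]
        # Assumption: "PASS" and "." both count as passed.
        if passed_only and flt not in ("PASS", "."):
            continue
        keys.add((chrom, int(pos), ref, alt))
    return keys


def compare_sites(lines1, lines2, passed_only=False):
    """Count sites shared by both inputs and sites unique to each."""
    a = site_keys(lines1, passed_only)
    b = site_keys(lines2, passed_only)
    return {"common": len(a & b), "only_in_1": len(a - b), "only_in_2": len(b - a)}
```

Usage would be along the lines of compare_sites(read_lines("input_file1.vcf.gz"), read_lines("input_file2.vcf.gz"), passed_only=True). Because the key includes ref and alt, the counts may differ slightly from a purely position-based comparison.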

(Unfortunately, a sort instability in Pindel may require the indel VCF to be re-sorted first on: chr, pos, ref, alt.)
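
A minimal sketch of such a re-sort, assuming the VCF has already been read into a list of lines. The function name sort_vcf_lines is hypothetical, and chromosomes are ordered as plain strings here, which is enough to make two files comparable as long as both are sorted the same way.

```python
def sort_vcf_lines(lines):
    """Return header lines unchanged, then records sorted on chr, pos, ref, alt."""
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line.strip() and not line.startswith("#")]

    def sort_key(line):
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM, POS, ID, REF, ALT, ...
        return (fields[0], int(fields[1]), fields[3], fields[4])

    return header + sorted(records, key=sort_key)
```

Applying the same function to both indel VCFs before the comparison removes ordering differences between records that share a chromosome and position.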

Regards,


Keiran Raine
Principal Bioinformatician
Cancer Genome Project
Wellcome Trust Sanger Institute

kr2 at sanger.ac.uk
Tel:+44 (0)1223 834244 Ext: 4983
Office: H104

> On 12 Dec 2016, at 14:56, Miguel Vazquez <miguel.vazquez at cnio.es> wrote:
> 
> Hi Keiran,
> 
> I don't know how to check the Pindel and ASCAT VCFs. I have not saved the docker image. If you give me detailed instructions I can save it on my next run and get them for you.
> 
> As for the difficulties with this donor (just my luck to choose this one at random), I'm running the pipeline on another donor; perhaps it will show no discrepancies, or perhaps it's a better subject for our inquiries. We should see soon, I hope; it's 4 days into the analysis.
> 
> Best
> 
> Miguel
> 
> On Mon, Dec 12, 2016 at 3:38 PM, Keiran Raine <kr2 at sanger.ac.uk> wrote:
> Hi,
> 
> I'd need access to the full set of result files from the run, but can you confirm that the Pindel and ASCAT VCFs are exactly the same?  Both feed into the CaVEMan analysis.
> 
> ASCAT is the least stable of the algorithms as it randomly assigns the B-allele, and if this donor is known to have an unusual copy-number/rearrangement state it is likely to be the cause (I wouldn't consider a sample like this to be particularly good for testing though).
> 
> What were the results on the other samples? I assume cleaner data has also been run?
> 
> Regards,
> 
> Keiran Raine
> 
>> On 12 Dec 2016, at 14:12, Brian O'Connor <Brian.OConnor at oicr.on.ca> wrote:
>> 
>> Hi Francis,
>> 
>> I agree with you. I think Miguel is showing what this group needs to show: that someone else can run the tools from Dockstore successfully, and that the results are largely in agreement with previous results (or duplicate runs).  I think maybe a statement about the possibility of stochastic results in the README for each tool would be sufficient.  This could be something that Keiran can craft/comment for Sanger’s pipeline since he’s in the best position for this one.
>> 
>> Brian
>> 
>>> On Dec 12, 2016, at 7:38 AM, Francis Ouellette <francis at oicr.on.ca> wrote:
>>> 
>>> I know I'm not supposed to be there (and I'm not :-), but one slippery slope I want this Dockstore testing working group to be wary of (and Christina, this is really directed at you, chairing the discussion today) is this: the request from Lincoln to reproduce what we are doing is fine, but I don't think it is this working group's task to reproduce and explain all of the discrepancies we see. I don't think we ever saw that kind of data from the people who ran the original workflow.
>>> 
>>> If this group can ascertain that a Dockstore container basically works, I think we need to call that test a success and move on to the next one. What Miguel is suggesting/asking below is very good, but I could see this becoming a very slippery slope, which I would advise us against slipping down.
>>> 
>>> Anyway, going off to my day off,
>>> 
>>> Have a great discussion,
>>> 
>>> Francis
>>> 
>>> -- 
>>> B.F. Francis Ouellette          http://oicr.on.ca/person/francis-ouellette
>>> 
>>> On Dec 12, 2016, at 05:44, Miguel Vazquez <miguel.vazquez at cnio.es> wrote:
>>> 
>>>> Dear all,
>>>> 
>>>> I was wondering if someone here was acquainted with the Sanger workflow and could help explain these discrepancies. I've skimmed through the code, and it seems to use EM, but I didn't find anything random in it, such as during initialization, which was my initial guess. The other thing I thought is that when it splits the work for parallel processing it might choose a different number of splits to accommodate the number of CPUs, and that this might affect the calculations.
>>>> 
>>>> Is there someone here that could help shed some light? As soon as some other tests finish I'll be running the process again, but since it takes so long perhaps a little insight would help.
>>>> 
>>>> Best regards
>>>> 
>>>> Miguel
>>>> 
>>>> 
>>>> 
>>>> On Mon, Dec 5, 2016 at 1:41 PM, Miguel Vazquez <miguel.vazquez at cnio.es> wrote:
>>>> Dear all,
>>>> 
>>>> The Sanger pipeline completed, after about 2 weeks of computing, for donor DO50311
>>>> 
>>>> The results are the following:
>>>> 
>>>> Comparison for DO50311 using Sanger
>>>> ---
>>>> Common: 156299
>>>> Extra: 1
>>>>    - Example: Y:58885197:G
>>>> Missing: 14
>>>>    - Example: 1:102887902:T,1:143165228:G,16:87047601:C
>>>> 
>>>> 
>>>> The donor results for DKFZ yielded
>>>> 
>>>> Comparison for DO50311 using DKFZ
>>>> ---
>>>> Common: 51087
>>>> Extra: 0
>>>> Missing: 0
>>>> 
>>>> 
>>>> In both cases I'm comparing against the VCF file downloaded from GNOS. I've updated the information here:
>>>> 
>>>> https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data
>>>> 
>>>> 
>>>> Best regards
>>>> 
>>>> Miguel
>>>> 
>>>> 
>> 
> 
> 
> -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
> 




