[DOCKTESTERS] Help understand the small discrepancies in Sanger pipeline with GNOS VCF (99.9905% accuracy)

Miguel Vazquez miguel.vazquez at cnio.es
Mon Dec 12 11:16:20 EST 2016


Thanks Keiran, that is what Christina asked us to do, so I'll check it next.

On Dec 12, 2016 4:38 PM, "Keiran Raine" <kr2 at sanger.ac.uk> wrote:

> Hi Miguel,
>
> ASCAT is *.somatic.cnv.vcf.gz
>
> Pindel is *.somatic.indel.vcf.gz
>
> Are you not using vcftools to do comparisons on all generated VCF files?
>
> All variants:
> vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz
> --diff-site --out in1_v_in2
>
> Passed variants:
> vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz
> --diff-site --out in1_v_in2 --remove-filtered-all
>
> (unfortunately a sort instability in Pindel may require the indel vcf to
> be resorted first on: chr, pos, ref, alt)
>
> Regards,
>
>
> Keiran Raine
> Principal Bioinformatician
> Cancer Genome Project
> Wellcome Trust Sanger Institute
>
> kr2 at sanger.ac.uk
> Tel: +44 (0)1223 834244 Ext: 4983
> Office: H104
>
> On 12 Dec 2016, at 14:56, Miguel Vazquez <miguel.vazquez at cnio.es> wrote:
>
> Hi Keiran,
>
> I don't know how to check the pindel and ASCAT VCFs. I have not saved the
> docker image. If you give me detailed instructions I can save it on my next
> run and get them for you.
>
> As for the difficulties on this donor (just my luck to choose this one at
> random), I'm running the pipeline on another donor, perhaps it will show no
> discrepancies, or perhaps it's a better subject for our inquiries. We should
> see soon, I hope; it's 4 days into the analysis.
>
> Best
>
> Miguel
>
> On Mon, Dec 12, 2016 at 3:38 PM, Keiran Raine <kr2 at sanger.ac.uk> wrote:
>
>> Hi,
>>
>> I'd need access to the full set of result files from the run, but can you
>> confirm the pindel and ASCAT VCFs are exactly the same?  Both feed into
>> caveman analysis.
>>
>> ASCAT is the least stable of the algorithms as it randomly assigns the
>> B-allele, and if this donor is known to have an unusual
>> copynumber/rearrangement state it is likely to be the cause (I wouldn't
>> consider a sample like this to be particularly good for testing though).
>>
>> What were the results on the other samples? I assume cleaner data has
>> also been run.
>>
>> Regards,
>>
>> Keiran Raine
>> Principal Bioinformatician
>> Cancer Genome Project
>> Wellcome Trust Sanger Institute
>>
>> kr2 at sanger.ac.uk
>> Tel: +44 (0)1223 834244 Ext: 4983
>> Office: H104
>>
>> On 12 Dec 2016, at 14:12, Brian O'Connor <Brian.OConnor at oicr.on.ca>
>> wrote:
>>
>> Hi Francis,
>>
>> I agree with you, I think Miguel is showing what this group needs to
>> show, that someone else can run the tools from Dockstore, have that be
>> successful, and the results are largely in agreement with previous results
>> (or duplicate runs).  I think maybe a statement about the possibility of
>> stochastic results in the README for each tool would be sufficient.  This
>> could be something that Keiran can craft/comment for Sanger’s pipeline
>> since he’s in the best position for this one.
>>
>> Brian
>>
>> On Dec 12, 2016, at 7:38 AM, Francis Ouellette <francis at oicr.on.ca>
>> wrote:
>>
>> I know I'm not supposed to be there (and I'm not :-), but one slippery
>> slope I want this Dockstore testing working group to be wary about (and
>> Christina, this is really directed at you, chairing the discussion today)
>> is that the request from Lincoln for this to reproduce what we are doing is
>> fine, but I don't think it is this working group's task to reproduce and
>> explain all of the discrepancies we see. I don't think we ever saw that
>> kind of data from the people that ran the original workflow.
>>
>> If this group can ascertain that a Dockstore container basically works,
>> I think we need to call that test a success, and move on to the next one.
>> What Miguel is suggesting/asking below is very good, but I could see this
>> becoming a very slippery slope, which I would advise us against
>> slipping down.
>>
>> Anyway, going off to my day off,
>>
>> Have a great discussion,
>>
>> Francis
>>
>> --
>> B.F. Francis Ouellette          http://oicr.on.ca/person/francis-ouellette
>>
>> On Dec 12, 2016, at 05:44, Miguel Vazquez <miguel.vazquez at cnio.es> wrote:
>>
>> Dear all,
>>
>> I was wondering if someone here was acquainted with the Sanger workflow
>> and could help explain these discrepancies. I've skimmed through the code,
>> and it seems to use EM, but I didn't find anything random in it, such as
>> during initialization, which was my initial guess. The other thing I thought
>> is that when it splits the work for parallel processing it might choose a
>> different number of splits to accommodate the number of CPUs, and that this
>> might affect the calculations.
>>
>> Is there someone here that could help shed some light? As soon as some
>> other tests finish I'll be running the process again, but since it takes so
>> long perhaps a little insight would help.
>>
>> Best regards
>>
>> Miguel
>>
>>
>>
>> On Mon, Dec 5, 2016 at 1:41 PM, Miguel Vazquez <miguel.vazquez at cnio.es>
>> wrote:
>> Dear all,
>>
>> The Sanger pipeline completed, after about 2 weeks of computing, for
>> donor DO50311
>>
>> The results are the following:
>>
>> Comparison for DO50311 using Sanger
>> ---
>> Common: 156299
>> Extra: 1
>>    - Example: Y:58885197:G
>> Missing: 14
>>    - Example: 1:102887902:T,1:143165228:G,16:87047601:C
>>
>>
>> The donor results for DKFZ yielded
>>
>> Comparison for DO50311 using DKFZ
>> ---
>> Common: 51087
>> Extra: 0
>> Missing: 0
>>
>>
>> In both cases I'm comparing against the VCF file downloaded from GNOS.
>> I've updated the information here:
>>
>> https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data
>>
>>
>> Best regards
>>
>> Miguel
>>
>>
>> _______________________________________________
>> docktesters mailing list
>> docktesters at lists.icgc.org
>> https://lists.icgc.org/mailman/listinfo/docktesters
>>
>>
>>
>>
>> -- The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a company
>> registered in England with number 2742969, whose registered office is 215
>> Euston Road, London, NW1 2BE.
>>
>
>
>
>
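
The site-level check discussed in the thread (re-sort each VCF on chr, pos, ref, alt, then report what is common, extra, or missing relative to the GNOS file) can be sketched in Python roughly as follows. This is only an illustration, not how vcftools or the testing scripts used in this thread work: the names Site, read_sites and compare_sites are hypothetical, the "passed only" option simply keeps records whose FILTER column is PASS, and chromosomes are ordered as plain strings, which gives a stable order but not genome order.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Site(NamedTuple):
    """Hypothetical record key: the four fields named as the sort key in the thread."""
    chrom: str
    pos: int
    ref: str
    alt: str


def read_sites(lines: Iterable[str], passed_only: bool = False) -> list[Site]:
    """Collect (chrom, pos, ref, alt) from VCF text lines, skipping header lines."""
    sites = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # blank line, meta-information line, or column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        if passed_only and flt != "PASS":
            continue  # assumption: "passed" means FILTER is exactly PASS
        sites.append(Site(chrom, int(pos), ref, alt))
    # Sorting on chr, pos, ref, alt removes any dependence on record order.
    # Note: chrom is compared as a string, so "10" sorts before "2".
    return sorted(sites)


def compare_sites(reference: list[Site], test: list[Site]) -> dict[str, list[Site]]:
    """Split sites into those shared, only in the test run, or only in the reference."""
    ref_set, test_set = set(reference), set(test)
    return {
        "common": sorted(ref_set & test_set),
        "extra": sorted(test_set - ref_set),
        "missing": sorted(ref_set - test_set),
    }
```

To use it on real files, the lines could come from gzip.open(path, "rt") for a *.vcf.gz file; the lengths of the three lists then correspond to the Common, Extra and Missing counts in the kind of report shown earlier in the thread.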