[DOCKTESTERS] Thanks!

Miguel Vazquez mikisvaz at gmail.com
Mon Mar 13 14:31:47 EDT 2017


Hi Junjun

About the unaligned BAM files: I do in fact have them for the two tests I've
run. I could make them available to George, but I think he could just as
well produce them on site, since he might have to do that anyway. We
can always explore that option, though right now I don't know of a simple
way to move these files around.

About the number of lanes, let me just say good grief! This is the first
time I've heard about it. So if I understand you correctly, I need to:

1- Download the metadata for the BAM file
2- Determine the read_groups
3- Split the BAM file according to these read_groups
4- Unalign these BAM files and produce header files with different lanes
5- Run BWA-Mem
6- Compare collectively the reads from these BAM files with the original BAM

Could you please confirm that this is the case? Is this consistent with the
3% mismatches? A similar percentage was found in HCC1143; could this be
the reason for that as well? Also, I asked Keiran about these headers and he
said they were OK. If you could please confirm that I need to do this
extended process I'd be grateful, because it's quite involved and there are
concepts here I'm not familiar with.
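
To make step 6 concrete, here is a minimal sketch of how the reads could be compared, assuming both the original BAM and the realigned BAMs have been dumped to SAM text (for example with samtools view). The function names are hypothetical, and this is not necessarily how the 3% figure was computed: it only looks at primary alignments and counts the reads whose chromosome or position changed.

```python
def primary_alignments(sam_lines):
    """Map (read name, mate number) -> (chromosome, position) for primary records."""
    placements = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag, chrom, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if flag & 0x100 or flag & 0x800:
            continue  # ignore secondary and supplementary alignments
        # SAM flag bits 0x40 and 0x80 mark the first and last read of a pair.
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        placements[(name, mate)] = (chrom, pos)
    return placements


def mismatch_fraction(original_lines, realigned_lines):
    """Fraction of reads present in both inputs that are placed differently."""
    original = primary_alignments(original_lines)
    realigned = primary_alignments(realigned_lines)
    shared = original.keys() & realigned.keys()
    if not shared:
        return 0.0
    differing = sum(1 for key in shared if original[key] != realigned[key])
    return differing / len(shared)
```

The realigned input would be the records from all lane-level BAMs taken together, so that the comparison is against the whole original BAM.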

Regards

Miguel


On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang <Junjun.Zhang at oicr.on.ca>
wrote:

> Hi Miguel,
>
> I thought you kept the unaligned sequence you prepared for the testing.
>
> Following your link about preparing unaligned input, I found this:
> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35,
> which actually could explain the high mismatch rate.
>
> When the BWA MEM workflow runs, the alignments are done one lane-level BAM at
> a time, and the aligned BAMs are merged later:
> https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201
>
> I see the script prepare_unaligned.sh always generates one read group
> (i.e., lane) for normal or tumour, no matter how many read groups (lanes)
> are in the aligned BAMs. This has a big impact on the alignment result when
> lanes are aligned independently compared with being aligned all together.
>
> The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM, but
> it only works when the input is a *single lane BAM file*:
> https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles
>
> So, I think in order to test the alignment workflow properly, we
> will need to prepare *lane level* unaligned BAMs (one lane, one BAM) as
> inputs. For example, this aligned BAM:
> https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63
> has 7 read groups (search for read_group). It needs to be converted to 7
> individual lane-level BAM files.
>
> Not sure whether it's the best way to do BAM splitting, but here is
> someone's Python code to do it: https://gist.github.com/seandavi/2014542
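
As an illustration of what the splitting involves, here is a minimal sketch that works on SAM text rather than BAM, so it needs no extra libraries. It is not the code from the gist above; the function names are hypothetical, and it assumes every record carries an RG:Z: tag matching an @RG header line (records without one end up under the key None).

```python
from collections import defaultdict


def header_read_groups(sam_lines):
    """Return read-group IDs in the order their @RG header lines appear."""
    ids = []
    for line in sam_lines:
        if line.startswith("@RG"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("ID:"):
                    ids.append(field[3:])
    return ids


def split_by_read_group(sam_lines):
    """Group alignment records by their RG:Z: tag (one group per lane)."""
    lanes = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at column 12 (index 11) of a SAM record.
        tag = next((f[5:] for f in fields[11:] if f.startswith("RG:Z:")), None)
        lanes[tag].append(line)
    return dict(lanes)
```

For the example BAM above, header_read_groups should return 7 IDs if the header matches the metadata. In practice a tool such as samtools split may do the same job directly on BAM files; that has not been checked against the PCAWG SOP here.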
>
> Hope this helps,
> Junjun
>
>
>
> From: Miguel Vazquez <mikisvaz at gmail.com>
> Date: Monday, March 13, 2017 at 1:01 PM
> To: George Mihaiescu <George.Mihaiescu at oicr.on.ca>
> Cc: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>, Junjun Zhang
> <junjun.zhang at oicr.on.ca>, "docktesters at lists.icgc.org" <
> docktesters at lists.icgc.org>
> Subject: Re: [DOCKTESTERS] Thanks!
>
> Hi George,
>
> The unaligned BAM files are not available as far as I know; rather, you
> must unalign the final BAM files, the normal ones you get from ICGC or
> GNOS. This process is also in my scripts, as you can see here:
>
> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32
>
> About the steps in the workflows, I don't know them myself. I think you'll
> need to ask the developers, and not all workflows use the same underlying
> workflow enactment tool. There is no easy answer.
>
>
>
> On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu <
> George.Mihaiescu at oicr.on.ca> wrote:
>
>> Junjun told me this would provide value to the testing process, so I
>> would like to kick off a test of the BWA_mem docker.
>> Can somebody provide some quick instructions and the location of the
>> unaligned BAM files that were used already?
>>
>> Also, do we have somewhere the steps involved in each workflow, so I can
>> get an idea of how far they are while running?
>> For example, s58_cgpPindel_pin2vcf_95 is three steps from finish, or 50
>> steps from finish…
>>
>> Thank you,
>> George
>>
>> From: Miguel Vazquez <mikisvaz at gmail.com>
>> Date: Monday, March 13, 2017 at 8:52 AM
>>
>> To: George Mihaiescu <George.Mihaiescu at oicr.on.ca>
>> Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca>, Jonas Demeulemeester <
>> Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" <
>> docktesters at lists.icgc.org>
>> Subject: Re: [DOCKTESTERS] Thanks!
>>
>> Hi George,
>>
>> Answers inline
>>
>> On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu <
>> George.Mihaiescu at oicr.on.ca> wrote:
>>
>>> Hi Miguel,
>>>
>>> I've started the test by running "bin/run_test.sh Sanger DO50398", so I
>>> guess with just one workflow running it should complete faster than two
>>> weeks.
>>>
>>
>> I think it still should take a long time. My scripts will run one
>> workflow after another.
>>
>>
>>>
>>> Because I'm running in Collaboratory I've changed the
>>> "get_icgc_donor.sh" script to use a docker container that has the icgc
>>> client inside and pulls data from Collaboratory. There is no "bam.bas" file
>>> downloaded, just ".bam" and ".bam.bai" files; not sure if this is an
>>> issue.
>>>
>>>
>> I wondered the same thing the first time I did this, but this file is
>> produced by the pipeline. There was some problem with this that was dealt
>> with by the developers and updated in the docker. So I think you won't have
>> a problem.
>>
>>
>>> By looking at the "bin/compare_result_type.sh" it looks like it's using
>>> the gnos client to pull down the existing VCF files for comparison reasons,
>>> but I think we store those files in Collaboratory as well, so I'll work
>>> with Junjun to adapt the script for this.
>>>
>>>
>> Let me know if you need any help
>>
>>
>>> I think I initially tried to run the DKFZ workflow, but it complained
>>> about having to run Delly first, so I abandoned this for now.
>>>
>>
>> Yes, if you look at run_batch.sh you will see that when using DKFZ it
>> will always run Delly first. Delly prepares some files that the DKFZ
>> workflow needs, namely related to copy number I believe.
>>
>>
>>>
>>> I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor.
>>>
>>
>> Remember that you will need to add the relevant hash keys for the
>> different files in etc/donor_files.csv. It's a bit tedious right now:
>> you need to go to the ICGC DCC and find these codes manually for the files
>> you need. Ask me if you need help. Once you have them all, you can run all
>> the workflows for that donor and evaluate the results.
>>
>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv
>>
>>
>> Regards
>>
>> Miguel
>>
>>
>>>
>>> George
>>>
>>> From: Miguel Vazquez <mikisvaz at gmail.com>
>>> Date: Monday, March 13, 2017 at 6:53 AM
>>> To: George Mihaiescu <George.Mihaiescu at oicr.on.ca>
>>> Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca>, Jonas Demeulemeester <
>>> Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" <
>>> docktesters at lists.icgc.org>
>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>
>>> Hi George,
>>>
>>> The Sanger workflow is very lengthy; it takes about two weeks in my
>>> tests.
>>>
>>> About correctness, my scripts also cover that part. Even if you are not
>>> using them, they might still help clarify how we do it. The idea is to take
>>> each of the output files produced (SNV_MNV, Indel, SV, and CNV, for both
>>> germline and somatic) and compare it with the result uploaded to GNOS (not
>>> all pipelines produce all files). This is the relevant part of the
>>> run_batch.sh script:
>>>
>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46
>>>
>>> The bin/compare_result_type.sh script will take care of downloading the
>>> correct file from GNOS and running the comparison. The comparison itself is
>>> simple since all files are VCFs: it consists of extracting the variants in
>>> terms of chromosome, position, reference and alternative allele, and
>>> measuring the overlaps.
>>>
>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh
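
The comparison described above can be sketched in a few lines. This is only an illustration, assuming uncompressed VCF text and treating each alternative allele of a multi-allelic record as its own variant; the function names are hypothetical, and the actual compare_result_type.sh script may differ in the details.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) tuples, one per alternative allele."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def overlap_summary(test_lines, reference_lines):
    """Count variants shared by both call sets and those unique to each."""
    test = variant_keys(test_lines)
    reference = variant_keys(reference_lines)
    return {
        "common": len(test & reference),
        "only_test": len(test - reference),
        "only_reference": len(reference - test),
    }
```

Here test_lines would come from the VCF produced by the docker run and reference_lines from the VCF downloaded from GNOS.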
>>>
>>> About which donors to test, DO52140 is one that Jonas and I have both
>>> tested, so it could be interesting to get a third opinion. Alternatively,
>>> any other donor could be interesting to see if something new comes up. I'm
>>> not sure which option is best.
>>>
>>> Miguel
>>>
>>>
>>>
>>>
>>> On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu <
>>> George.Mihaiescu at oicr.on.ca> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've started Sanger on DO50398 and it's been running for more than 24
>>>> hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59"
>>>>
>>>> I just started a second run on a different VM on the same donor, just to
>>>> compare run times.
>>>> The VM used has 8 cores, 48 GB of RAM and a 1.1 TB disk. I'll send
>>>> some monitoring graphs when it finishes the workflow, but I have no idea
>>>> how to check its correctness.
>>>>
>>>> Give me a list of donors and what workflows you want me to run and I'll
>>>> try to schedule them tomorrow.
>>>>
>>>> George
>>>>
>>>>
>>>> From: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
>>>> Date: Sunday, March 12, 2017 at 10:45 PM
>>>> To: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>, George
>>>> Mihaiescu <George.Mihaiescu at oicr.on.ca>
>>>> Cc: Miguel Vazquez <miguel.vazquez at cnio.es>, Denis Yuen <
>>>> Denis.Yuen at oicr.on.ca>, "docktesters at lists.icgc.org" <
>>>> docktesters at lists.icgc.org>
>>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>>
>>>> Thanks Miguel and Jonas for your help here!
>>>>
>>>> Do you have any update on the latest testing? Please feel free to update
>>>> the wiki with any news:
>>>> https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference
>>>>
>>>> Regards,
>>>> Junjun
>>>>
>>>>
>>>>
>>>> From: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>
>>>> Date: Saturday, March 11, 2017 at 7:15 PM
>>>> To: George Mihaiescu <George.Mihaiescu at oicr.on.ca>
>>>> Cc: Miguel Vazquez <miguel.vazquez at cnio.es>, Junjun Zhang <
>>>> junjun.zhang at oicr.on.ca>, Denis Yuen <Denis.Yuen at oicr.on.ca>, "
>>>> docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
>>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>>
>>>> Hi George,
>>>>
>>>> Yup, I've been running the PCAWG dockers mainly using Miguel's set of
>>>> scripts.
>>>> Give them a go and if you run into issues, just let us know!
>>>>
>>>> Cheers,
>>>> Jonas
>>>>
>>>>
>>>> On 11 Mar 2017, at 17:00, George Mihaiescu <George.Mihaiescu at oicr.on.ca>
>>>> wrote:
>>>>
>>>> Sure, I'll give it a try and report later.
>>>>
>>>> Thank you,
>>>>
>>>> *George Mihaiescu*
>>>> Senior Cloud Architect
>>>>
>>>> *Ontario Institute for Cancer Research*
>>>> MaRS Centre
>>>> 661 University Avenue
>>>> Suite 510
>>>> Toronto, Ontario
>>>> Canada M5G 0A3
>>>>
>>>> Email: George.Mihaiescu at oicr.on.ca
>>>> Toll-free: 1-866-678-6427
>>>> Twitter: @OICR_news
>>>>
>>>> www.oicr.on.ca
>>>>
>>>> This message and any attachments may contain confidential and/or
>>>> privileged information for the sole use of the intended recipient. Any
>>>> review or distribution by anyone other than the person for whom it was
>>>> originally intended is strictly prohibited. If you have received this
>>>> message in error, please contact the sender and delete all copies.
>>>> Opinions, conclusions or other information contained in this message may
>>>> not be that of the organization.
>>>>
>>>>
>>>>
>>>> From: Miguel Vazquez <miguel.vazquez at cnio.es>
>>>> Date: Saturday, March 11, 2017 at 10:57 AM
>>>> To: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
>>>> Cc: Denis Yuen <Denis.Yuen at oicr.on.ca>, Jonas Demeulemeester <
>>>> jonas.demeulemeester at crick.ac.uk>, George Mihaiescu <
>>>> George.Mihaiescu at oicr.on.ca>, "docktesters at lists.icgc.org" <
>>>> docktesters at lists.icgc.org>
>>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>>
>>>> Hi Junjun,
>>>>
>>>> I think Jonas has been using my scripts to run some of the tests. Maybe
>>>> George could try them as well; it should be very easy for him to try the
>>>> Sanger, Delly+DKFZ, BWA-Mem, and BiasFilter workflows.
>>>>
>>>> https://github.com/mikisvaz/PCAWG-Docker-Test
>>>>
>>>> He would just need to update the tokens for DACO access and the scripts
>>>> will take care of downloading the BAM files, running the workflows and
>>>> evaluating the result.
>>>>
>>>> The documentation there is reasonably up to date, but if this sounds good
>>>> then perhaps he could contact me and I could walk him through the details.
>>>>
>>>> Best regards
>>>>
>>>> Miguel
>>>>
>>>> On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang <Junjun.Zhang at oicr.on.ca>
>>>> wrote:
>>>>
>>>>> Dear Docktesters,
>>>>>
>>>>> George Mihaiescu, cloud architect of the Collaboratory at OICR, plans
>>>>> to run some bioinformatics workflows to test the Collab environment.
>>>>>
>>>>> I thought this would be a good opportunity to get extra help
>>>>> testing the PCAWG dockerized workflows.
>>>>>
>>>>> Miguel, Denis and others, what workflows / datasets do you think would
>>>>> be good for George to run?
>>>>>
>>>>> Thanks,
>>>>> Junjun
>>>>>
>>>>>
>>>>>
>>>>> From: <docktesters-bounces+junjun.zhang=oicr.on.ca at lists.icgc.org> on
>>>>> behalf of Denis Yuen <Denis.Yuen at oicr.on.ca>
>>>>> Date: Wednesday, March 1, 2017 at 10:26 AM
>>>>> To: "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
>>>>> Subject: [DOCKTESTERS] Thanks!
>>>>>
>>>>> Hi,
>>>>>
>>>>> Just wanted to say thanks to Miguel and Jonas for keeping the workflow
>>>>> testing data page up-to-date.
>>>>>
>>>>> https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data
>>>>>
>>>>>
>>>>> As we work on new versions or debugging, it is invaluable to know what
>>>>> versions of the workflows have worked outside OICR, thanks!
>>>>>
>>>>>
>>>>>
>>>>> *Denis Yuen*
>>>>> Senior Software Developer
>>>>>
>>>>>
>>>>> *Ontario Institute for Cancer Research*
>>>>> MaRS Centre
>>>>> 661 University Avenue
>>>>> Suite 510
>>>>> Toronto, Ontario, Canada M5G 0A3
>>>>>
>>>>> Toll-free: 1-866-678-6427
>>>>> Twitter: @OICR_news
>>>>> *www.oicr.on.ca <http://www.oicr.on.ca/>*
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> docktesters mailing list
>>>>> docktesters at lists.icgc.org
>>>>> https://lists.icgc.org/mailman/listinfo/docktesters
>>>>>
>>>>>
>>>> The Francis Crick Institute Limited is a registered charity in England
>>>> and Wales no. 1140062 and a company registered in England and Wales no.
>>>> 06885462, with its registered office at 1 Midland Road London NW1 1AT
>>>>
>>>>
>>>>
>>>
>>
>