[DOCKTESTERS] Thanks!
Miguel Vazquez
mikisvaz at gmail.com
Tue Mar 14 08:44:13 EDT 2017
Hi Junjun and Keiran,
I'm sorry guys, but this is too alien for me; this was never my area of
expertise. I'm going to need someone to write a script for me that takes a
BAM file and turns it into whatever I need to run BWA-Mem on. At least
pseudo-code or something that I can start with.
I think perhaps someone more knowledgeable than me should consider whether this
procedure as a whole is acceptable in terms of reproducibility, how best to
document it, and whether it could be improved.
Also, I don't think I understand the nature of the problem, because from
what I can fathom this problem should have either broken the process or
produced a much larger share of discrepancies than 3%. Can someone explain in
layman's terms how only 3% of reads can be affected?
Best regards
Miguel
On Tue, Mar 14, 2017 at 1:28 PM, Junjun Zhang <Junjun.Zhang at oicr.on.ca>
wrote:
> Hi Keiran,
>
> Thanks for the detailed explanation. So, in order to reproduce the PCAWG BWA
> MEM alignment result, one must use lane-level BAMs (one lane, one BAM) as
> input.
>
> A processing step is needed to prepare lane-level BAMs from the merged BAM.
>
> @Miguel, hope this is helpful. Let us know if you have any other
> questions.
>
> Best regards
> Junjun
>
> On Mar 14, 2017, at 5:16 AM, Keiran Raine <kr2 at sanger.ac.uk> wrote:
>
> Hi Junjun,
>
> You won't be able to separate out the read groups in the headers if the
> input is a merged BAM file. If there are different libraries, read
> lengths etc. it will cause problems for insert-size determination (used in
> determining proper pairs) and result in inter-library duplicate removal (by
> definition, reads from different libraries can't be duplicates).
>
> If you really need to do it this way you'd have to add a pre-processing
> step; bamtofastq can split a BAM into its component read groups in a single
> pass.
>
> Regards,
>
> Keiran Raine
> Principal Bioinformatician
> Cancer Genome Project
> Wellcome Trust Sanger Institute
>
> kr2 at sanger.ac.uk
> Tel: +44 (0)1223 834244 Ext: 4983
> Office: H104
>
> On 13 Mar 2017, at 21:16, Junjun Zhang <Junjun.Zhang at oicr.on.ca> wrote:
>
> Hi Keiran,
>
> Can you please comment on this, i.e., a comparison between alignment done
> lane by lane vs. done with all lanes mixed?
>
> Basically, we are trying to prepare input BAMs for testing the PCAWG BWA MEM
> workflow. The starting point is the aligned BAM because we don't have the
> unaligned lane BAMs any more. The key point here is: should the input BAMs be
> organized by lane, one lane per BAM? Or just one BAM containing all lanes?
>
> Thanks,
> Junjun
>
>
>
> From: Miguel Vazquez <mikisvaz at gmail.com>
> Date: Monday, March 13, 2017 at 2:31 PM
> To: Junjun Zhang <junjun.zhang at oicr.on.ca>
> Cc: George Mihaiescu <George.Mihaiescu at oicr.on.ca>, Jonas Demeulemeester <
> Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" <
> docktesters at lists.icgc.org>
> Subject: Re: [DOCKTESTERS] Thanks!
>
> Hi Junjun
>
> About the unaligned BAM files, in fact I do have them for the two tests
> I've run. I could make them available for George, but I think he could just
> as well produce them on site, since he might have to do that anyway. But we
> can always explore that option, though right now I don't know of a simple
> way to move these files around.
>
> About the number of lanes, let me just say good grief! This is the first
> time I've heard about it. So if I understand you correctly I need to:
>
> 1- Download the metadata for the BAM file
> 2- Determine the read_groups
> 3- Split the BAM file according to these read_groups
> 4- Unalign these BAM files and produce header files with different lanes
> 5- Run BWA-Mem
> 6- Compare collectively the reads from these BAM files with the original
> BAM
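For steps 2 and 3 above, here is a minimal Python sketch of how reads could be grouped by read group. It is an illustration only, not the PCAWG procedure: it works on SAM text lines rather than a binary BAM, the function name split_by_read_group and the example records are made up, and it does not cover the unaligning in step 4. In practice a tool such as bamtofastq (mentioned elsewhere in this thread) or samtools split would do the splitting on the BAM itself.

```python
from collections import defaultdict


def split_by_read_group(sam_lines):
    """Group SAM text lines into one record set per read group (lane).

    Returns {rg_id: [header lines..., alignment lines...]}, where each
    per-lane header keeps only that lane's @RG line.
    """
    shared_header = []          # @HD, @SQ, @PG, @CO lines copied to every lane
    rg_header = {}              # rg_id -> its @RG header line
    reads = defaultdict(list)   # rg_id -> alignment lines of that read group

    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            if line.startswith("@RG"):
                fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
                rg_header[fields["ID"]] = line
            else:
                shared_header.append(line)
            continue
        # Optional tags start at column 12; RG:Z:<id> names the read group.
        columns = line.split("\t")
        rg_id = next((t[5:] for t in columns[11:] if t.startswith("RG:Z:")), None)
        if rg_id not in rg_header:
            raise ValueError("read %s has no RG tag matching an @RG line" % columns[0])
        reads[rg_id].append(line)

    return {
        rg_id: shared_header + [rg_header[rg_id]] + reads[rg_id]
        for rg_id in rg_header
    }


# Invented example: two lanes, three reads.
example = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@RG\tID:lane1\tLB:libA\tSM:tumour",
    "@RG\tID:lane2\tLB:libB\tSM:tumour",
    "r1\t99\tchr1\t10\t60\t5M\t=\t50\t45\tACGTA\tIIIII\tRG:Z:lane1",
    "r2\t99\tchr1\t20\t60\t5M\t=\t60\t45\tACGTA\tIIIII\tRG:Z:lane2",
    "r3\t0\tchr1\t30\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tRG:Z:lane1",
]
lanes = split_by_read_group(example)
for rg_id, lines in sorted(lanes.items()):
    n_reads = sum(1 for l in lines if not l.startswith("@"))
    print(rg_id, n_reads)
```

Each value in the returned dict holds the shared header lines, the single @RG line for that lane, and that lane's reads, so each one could be written out as its own lane-level file.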
>
> Could you please confirm that this is the case? Is this consistent with
> the 3% mismatches? A similar percentage was found in the HCC1143, could
> this be the reason for that as well? Also, I asked Keiran about these
> headers and he said they were OK. If you could please confirm that I need
> to do this extended process I'd be grateful, because it's quite involved and
> there are concepts here I'm not familiar with.
>
> Regards
>
> Miguel
>
>
> On Mon, Mar 13, 2017 at 6:51 PM, Junjun Zhang <Junjun.Zhang at oicr.on.ca>
> wrote:
>
>> Hi Miguel,
>>
>> I thought you kept the unaligned sequence you prepared for the testing.
>>
>> Following your link about preparing unaligned input, I found this:
>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/prepare_unaligned.sh#L16-L35,
>> which actually could explain the high mismatch rate.
>>
>> When the BWA MEM workflow runs, the alignments are done one lane-level BAM at
>> a time, and the aligned BAMs are merged later:
>> https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow/blob/develop/src/main/java/com/github/seqware/WorkflowClient.java#L201
>>
>> I see the script prepare_unaligned.sh always generates one read group
>> (i.e., lane) for normal or tumour, no matter how many read groups (lanes)
>> are in the aligned BAMs. This has a big impact on the alignment result: lanes
>> aligned independently give different results from lanes aligned all together.
>>
>> The PCAWG Sequence Submission SOP has a step to prepare unaligned BAM,
>> but it only works when the input is a *single lane BAM file*:
>> https://wiki.oicr.on.ca/display/PANCANCER/PCAWG+%28a.k.a.+PCAP+or+PAWG%29+Sequence+Submission+SOP+-+v1.0#PCAWG(a.k.a.PCAPorPAWG)SequenceSubmissionSOP-v1.0-a)FollowthisifyoustartfromsinglelaneBAMfiles
>>
>> So, I think in order to test the alignment workflow properly, we
>> will need to prepare *lane-level* unaligned BAMs (one lane, one BAM) as
>> inputs. For example, this aligned BAM:
>> https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/c9fa1c22-6432-4851-af67-30f4b4812c63
>> has 7 read groups (search for read_group). It needs to be converted to 7
>> individual lane-level BAM files.
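As a rough illustration of counting read groups, the sketch below parses the @RG lines of a SAM/BAM header in Python. The header text is invented for the example (three read groups, not the seven of the BAM above); a real header could be obtained with `samtools view -H`, and read_groups is a hypothetical helper name.

```python
def read_groups(header_text):
    """Return one dict per @RG line of a SAM header, keyed by tag (ID, LB, PU, ...)."""
    groups = []
    for line in header_text.splitlines():
        if line.startswith("@RG\t"):
            # Each field after "@RG" looks like TAG:value; split on the first colon only.
            groups.append(dict(field.split(":", 1) for field in line.split("\t")[1:]))
    return groups


# Invented header with three read groups from two libraries.
header = "\n".join([
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@RG\tID:rg1\tPU:run1_lane1\tLB:libA\tSM:sampleX",
    "@RG\tID:rg2\tPU:run1_lane2\tLB:libA\tSM:sampleX",
    "@RG\tID:rg3\tPU:run2_lane1\tLB:libB\tSM:sampleX",
])

groups = read_groups(header)
print(len(groups), "read groups")
for rg in groups:
    print(rg["ID"], rg.get("LB"), rg.get("PU"))
```

The number of entries returned is the number of lane-level BAM files one would expect to end up with after splitting.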
>>
>> Not sure whether it's the best way to do BAM splitting, but here is
>> someone's Python code to do it: https://gist.github.com/seandavi/2014542
>>
>> Hope this helps,
>> Junjun
>>
>>
>>
>> From: Miguel Vazquez <mikisvaz at gmail.com>
>> Date: Monday, March 13, 2017 at 1:01 PM
>> To: George Mihaiescu <George.Mihaiescu at oicr.on.ca>
>> Cc: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>, Junjun
>> Zhang <junjun.zhang at oicr.on.ca>, "docktesters at lists.icgc.org" <
>> docktesters at lists.icgc.org>
>> Subject: Re: [DOCKTESTERS] Thanks!
>>
>> Hi George,
>>
>> The unaligned BAM files are not available as far as I know; rather, you
>> must unalign the final BAM files, the normal ones you get from ICGC or
>> GNOS. This process is also in my scripts, as you see here:
>>
>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L32
>>
>> About the steps in the workflows, I don't know them myself. I think
>> you'll need to ask the developers, and not all workflows use the same
>> underlying workflow enactment tool. Not an easy answer
>>
>>
>>
>> On Mon, Mar 13, 2017 at 5:57 PM, George Mihaiescu <
>> George.Mihaiescu at oicr.on.ca> wrote:
>>
>>> Junjun told me this would provide value to the testing process, so I
>>> would like to kick off a test of the BWA_mem docker.
>>> Can somebody provide some quick instructions and the location of the
>>> unaligned BAM files that were used already?
>>>
>>> Also, do we have somewhere the steps involved in each workflow, so I can
>>> get an idea of how far they are while running?
>>> For example, s58_cgpPindel_pin2vcf_95 is three steps from finish, or 50
>>> steps from finish…
>>>
>>> Thank you,
>>> George
>>>
>>> From: Miguel Vazquez <mikisvaz at gmail.com>
>>> Date: Monday, March 13, 2017 at 8:52 AM
>>>
>>> To: George Mihaiescu <George.Mihaiescu at oicr.on.ca>
>>> Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca>, Jonas Demeulemeester <
>>> Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" <
>>> docktesters at lists.icgc.org>
>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>
>>> Hi George,
>>>
>>> Answers inline
>>>
>>> On Mon, Mar 13, 2017 at 2:43 PM, George Mihaiescu <
>>> George.Mihaiescu at oicr.on.ca> wrote:
>>>
>>>> Hi Miguel,
>>>>
>>>> I've started the test by running "bin/run_test.sh Sanger DO50398", so I
>>>> guess with just one workflow running it should complete faster than two
>>>> weeks.
>>>>
>>>
>>> I think it still should take a long time. My scripts will run one
>>> workflow after another.
>>>
>>>
>>>>
>>>> Because I'm running in Collaboratory I've changed the
>>>> "get_icgc_donor.sh" script to use a docker container that has the icgc
>>>> client inside and pull data from Collaboratory. There is no "bam.bas" file
>>>> downloaded, just ".bam" and ".bam.bai" files; not sure if this is an
>>>> issue.
>>>>
>>>>
>>> I wondered the same thing the first time I did this, but this file is
>>> produced by the pipeline. There was some problem with this that was dealt
>>> with by the developers and updated in the docker. So I think you won't have
>>> a problem.
>>>
>>>
>>>> By looking at the "bin/compare_result_type.sh" it looks like it's using
>>>> the gnos client to pull down the existing VCF files for comparison reasons,
>>>> but I think we store those files in Collaboratory as well, so I'll work
>>>> with Junjun to adapt the script for this.
>>>>
>>>>
>>> Let me know if you need any help
>>>
>>>
>>>> I think I initially tried to run the DKFZ workflow, but it complained
>>>> about having to run Delly first, so I abandoned this for now.
>>>>
>>>
>>> Yes, if you look at the run_batch.sh you will see that when using DKFZ
>>> it will always run Delly first. Delly prepares some files that the DKFZ
>>> workflow needs, namely related to copy number I believe.
>>>
>>>
>>>>
>>>> I'll set up a new VM and run the "run_batch.sh" on the DO52140 donor.
>>>>
>>>
>>> Remember that you will need to add the relevant hash keys for the
>>> different files in the etc/donor_files.csv. It's a bit tedious right now.
>>> You need to go to the ICGC DCC and find these codes manually for the files
>>> you need. Ask me if you need help. Once you have them all you can run all the
>>> workflows for that donor and evaluate results.
>>>
>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/etc/donor_files.csv
>>>
>>>
>>> Regards
>>>
>>> Miguel
>>>
>>>
>>>>
>>>> George
>>>>
>>>> From: Miguel Vazquez <mikisvaz at gmail.com>
>>>> Date: Monday, March 13, 2017 at 6:53 AM
>>>> To: George Mihaiescu <George.Mihaiescu at oicr.on.ca>
>>>> Cc: Junjun Zhang <Junjun.Zhang at oicr.on.ca>, Jonas Demeulemeester <
>>>> Jonas.Demeulemeester at crick.ac.uk>, "docktesters at lists.icgc.org" <
>>>> docktesters at lists.icgc.org>
>>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>>
>>>> Hi George,
>>>>
>>>> The Sanger workflow is very lengthy, it takes about two weeks in my
>>>> tests.
>>>>
>>>> About correctness, my scripts also cover that part, if you are not
>>>> using them they might still help you to clarify how we do it. The idea is
>>>> to take each of the output files produced: SNV_MNV, Indel, SV, and CNV, for
>>>> both germline and somatic and compare it with the result uploaded to GNOS
>>>> (not all pipelines produce all files). This is the relevant part in the
>>>> run_batch.sh script:
>>>>
>>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/run_batch.sh#L42-L46
>>>>
>>>> The bin/compare_result_type.sh script will take care of downloading the
>>>> correct file from GNOS and running the comparison. The comparison itself is
>>>> simple since all files are VCFs: it consists of extracting the variants in
>>>> terms of chromosome, position, reference and alternative allele and
>>>> measuring the overlaps.
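The comparison described above can be sketched roughly as follows in Python. This is not the code in compare_result_type.sh, only an illustration of the idea under some assumptions: the function names and the toy VCF records are invented, and splitting a multi-allelic ALT field into one key per allele is a choice made for this sketch.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) for every record, one key per ALT allele."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def overlap(test_lines, truth_lines):
    """Count variants shared by both call sets and those unique to each."""
    test, truth = variant_keys(test_lines), variant_keys(truth_lines)
    return {
        "common": len(test & truth),
        "only_test": len(test - truth),
        "only_truth": len(truth - test),
    }


# Invented records standing in for a new run and the stored reference result.
new_run = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tT\t.\tPASS\t.",
    "1\t200\t.\tG\tC,T\t.\tPASS\t.",
    "2\t300\t.\tC\tG\t.\tPASS\t.",
]
reference = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tT\t.\tPASS\t.",
    "1\t200\t.\tG\tC\t.\tPASS\t.",
]
result = overlap(new_run, reference)
print(result)
```

Comparing on these four fields alone ignores quality, filter and annotation columns, which keeps the measure independent of how each run annotates its calls.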
>>>>
>>>> https://github.com/mikisvaz/PCAWG-Docker-Test/blob/master/bin/compare_result_type.sh
>>>>
>>>> About which donors to test, DO52140 is one Jonas and I have both tested,
>>>> and it could be interesting to get a third opinion. Also, any other donor
>>>> could be interesting to see if something new comes up. I'm not sure which
>>>> option is best.
>>>>
>>>> Miguel
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Mar 13, 2017 at 5:12 AM, George Mihaiescu <
>>>> George.Mihaiescu at oicr.on.ca> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've started Sanger on DO50398 and it's been running for more than 24
>>>>> hours, currently at "Workflow step succeeded: s58_bbAllele_merge_59"
>>>>>
>>>>> I just started a second run on a different VM on the same donor, just to
>>>>> compare run times.
>>>>> The VM used has 8 cores, 48 GB of RAM and 1.1 TB disk and I'll send
>>>>> some monitoring graphs when it finishes the workflow, but I have no idea
>>>>> how to check its correctness.
>>>>>
>>>>> Give me a list of donors and what workflows you want me to run and
>>>>> I'll try to schedule them tomorrow.
>>>>>
>>>>> George
>>>>>
>>>>>
>>>>> From: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
>>>>> Date: Sunday, March 12, 2017 at 10:45 PM
>>>>> To: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>, George
>>>>> Mihaiescu <George.Mihaiescu at oicr.on.ca>
>>>>> Cc: Miguel Vazquez <miguel.vazquez at cnio.es>, Denis Yuen <
>>>>> Denis.Yuen at oicr.on.ca>, "docktesters at lists.icgc.org" <
>>>>> docktesters at lists.icgc.org>
>>>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>>>
>>>>> Thanks Miguel and Jonas for your help here!
>>>>>
>>>>> Do you have any update on the latest testing? Please feel free
>>>>> to update the wiki:
>>>>> https://wiki.oicr.on.ca/display/PANCANCER/2017-03-13+PCAWG-TECH+Teleconference
>>>>>
>>>>> Regards,
>>>>> Junjun
>>>>>
>>>>>
>>>>>
>>>>> From: Jonas Demeulemeester <Jonas.Demeulemeester at crick.ac.uk>
>>>>> Date: Saturday, March 11, 2017 at 7:15 PM
>>>>> To: George Mihaiescu <George.Mihaiescu at oicr.on.ca>
>>>>> Cc: Miguel Vazquez <miguel.vazquez at cnio.es>, Junjun Zhang <
>>>>> junjun.zhang at oicr.on.ca>, Denis Yuen <Denis.Yuen at oicr.on.ca>, "
>>>>> docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
>>>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>>>
>>>>> Hi George,
>>>>>
>>>>> Yup, I've been running the PCAWG dockers mainly using Miguel's set of
>>>>> scripts.
>>>>> Give them a go and if you run into issues, just let us know!
>>>>>
>>>>> Cheers,
>>>>> Jonas
>>>>>
>>>>>
>>>>> On 11 Mar 2017, at 17:00, George Mihaiescu <
>>>>> George.Mihaiescu at oicr.on.ca> wrote:
>>>>>
>>>>> Sure, I'll give it a try and report later.
>>>>>
>>>>> Thank you,
>>>>> *George Mihaiescu*
>>>>> Senior Cloud Architect
>>>>>
>>>>> *Ontario Institute for Cancer Research*
>>>>> MaRS Centre
>>>>> 661 University Avenue
>>>>> Suite 510
>>>>> Toronto, Ontario
>>>>> Canada M5G 0A3
>>>>>
>>>>> Email: George.Mihaiescu at oicr.on.ca
>>>>> Toll-free: 1-866-678-6427
>>>>> Twitter: @OICR_news
>>>>>
>>>>> www.oicr.on.ca
>>>>>
>>>>> This message and any attachments may contain confidential and/or
>>>>> privileged information for the sole use of the intended recipient. Any
>>>>> review or distribution by anyone other than the person for whom it was
>>>>> originally intended is strictly prohibited. If you have received this
>>>>> message in error, please contact the sender and delete all copies.
>>>>> Opinions, conclusions or other information contained in this message may
>>>>> not be that of the organization.
>>>>>
>>>>>
>>>>>
>>>>> From: Miguel Vazquez <miguel.vazquez at cnio.es>
>>>>> Date: Saturday, March 11, 2017 at 10:57 AM
>>>>> To: Junjun Zhang <Junjun.Zhang at oicr.on.ca>
>>>>> Cc: Denis Yuen <Denis.Yuen at oicr.on.ca>, Jonas Demeulemeester <
>>>>> jonas.demeulemeester at crick.ac.uk>, George Mihaiescu <
>>>>> George.Mihaiescu at oicr.on.ca>, "docktesters at lists.icgc.org" <
>>>>> docktesters at lists.icgc.org>
>>>>> Subject: Re: [DOCKTESTERS] Thanks!
>>>>>
>>>>> Hi Junjun,
>>>>>
>>>>> I think Jonas has been using my scripts to run some of the tests,
>>>>> maybe George could try them as well, it should be very easy for him to try
>>>>> the Sanger, Delly+DKFZ, BWA-Mem, and the BiasFilter.
>>>>>
>>>>> https://github.com/mikisvaz/PCAWG-Docker-Test
>>>>>
>>>>> He would just need to update the tokens for DACO access and the
>>>>> scripts will take care of downloading the BAM files, running the workflows
>>>>> and evaluating the result.
>>>>>
>>>>> The documentation there is reasonably updated, but if this sounds good
>>>>> then perhaps he could contact me and I could walk him through the details.
>>>>>
>>>>> Best regards
>>>>>
>>>>> Miguel
>>>>>
>>>>> On Fri, Mar 10, 2017 at 9:51 PM, Junjun Zhang <Junjun.Zhang at oicr.on.ca> wrote:
>>>>>
>>>>>> Dear Docktesters,
>>>>>>
>>>>>> George Mihaiescu, cloud architect of the Collaboratory at OICR, plans
>>>>>> to run some bioinformatics workflows to test the Collab environment.
>>>>>>
>>>>>> Just thought this is a good opportunity for extra help with
>>>>>> testing out the PCAWG dockerized workflows.
>>>>>>
>>>>>> Miguel, Denis and others, what workflows / datasets do you think
>>>>>> would be good for George to run?
>>>>>>
>>>>>> Thanks,
>>>>>> Junjun
>>>>>>
>>>>>>
>>>>>>
>>>>>> From: <docktesters-bounces+junjun.zhang=oicr.on.ca at lists.icgc.org>
>>>>>> on behalf of Denis Yuen <Denis.Yuen at oicr.on.ca>
>>>>>> Date: Wednesday, March 1, 2017 at 10:26 AM
>>>>>> To: "docktesters at lists.icgc.org" <docktesters at lists.icgc.org>
>>>>>> Subject: [DOCKTESTERS] Thanks!
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Just wanted to say thanks to Miguel and Jonas for keeping the
>>>>>> workflow testing data page up-to-date.
>>>>>>
>>>>>> https://wiki.oicr.on.ca/display/PANCANCER/Workflow+Testing+Data
>>>>>>
>>>>>> As we work on new versions or debugging, it is invaluable to know
>>>>>> what versions of the workflows have worked outside OICR, thanks!
>>>>>>
>>>>>>
>>>>>> *Denis Yuen*
>>>>>> Senior Software Developer
>>>>>>
>>>>>>
>>>>>> *Ontario Institute for Cancer Research*
>>>>>> MaRS Centre
>>>>>> 661 University Avenue
>>>>>> Suite 510
>>>>>> Toronto, Ontario, Canada M5G 0A3
>>>>>>
>>>>>> Toll-free: 1-866-678-6427
>>>>>> Twitter: @OICR_news
>>>>>> www.oicr.on.ca
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> docktesters mailing list
>>>>>> docktesters at lists.icgc.org
>>>>>> https://lists.icgc.org/mailman/listinfo/docktesters
>>>>>>
>>>>>>
>>>>> The Francis Crick Institute Limited is a registered charity in England
>>>>> and Wales no. 1140062 and a company registered in England and Wales no.
>>>>> 06885462, with its registered office at 1 Midland Road London NW1 1AT
>>>>>
>>>>>