An efficient and extensible approach for compressing phylogenetic trees

Handle URI:
http://hdl.handle.net/10754/596985
Title:
An efficient and extensible approach for compressing phylogenetic trees
Authors:
Matthews, Suzanne J; Williams, Tiffani L
Abstract:
Background: Biologists require new algorithms to efficiently compress and store their large collections of phylogenetic trees. Our previous work showed that TreeZip is a promising approach for compressing phylogenetic trees. In this paper, we extend our TreeZip algorithm by handling trees with weighted branches. Furthermore, by using the compressed TreeZip file as input, we have designed an extensible decompressor that can extract subcollections of trees, compute majority and strict consensus trees, and merge tree collections using set operations such as union, intersection, and set difference.
Results: On unweighted phylogenetic trees, TreeZip is able to compress Newick files by more than 98%. On weighted phylogenetic trees, TreeZip is able to compress a Newick file by at least 73%. TreeZip can be combined with 7zip with little overhead, allowing space savings in excess of 99% (unweighted) and 92% (weighted). Unlike TreeZip, 7zip is not immune to branch rotations, and performs worse as the level of variability in the Newick string representation increases. Finally, since the TreeZip compressed text (TRZ) file contains all the semantic information in a collection of trees, we can easily filter and decompress a subset of trees of interest (such as the set of unique trees), or build the resulting consensus tree in a matter of seconds. We also show the ease with which set operations can be performed on TRZ files, at speeds quicker than those performed on Newick or 7zip compressed Newick files, and without loss of space savings.
Conclusions: TreeZip is an efficient approach for compressing large collections of phylogenetic trees. The semantic and compact nature of the TRZ file allows it to be operated upon directly and quickly, without a need to decompress the original Newick file. We believe that TreeZip will be vital for compressing and archiving trees in the biological community. © 2011 Matthews and Williams; licensee BioMed Central Ltd.
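Illustration (not part of the published abstract): the abstract notes that a text compressor is sensitive to branch rotations in Newick strings, while a semantic representation is not, and that tree collections can be merged with set operations. The sketch below is a minimal illustration of that idea, not the TreeZip algorithm or the TRZ format. It assumes rooted, unweighted trees with plain leaf names, and the helper names `parse_newick`, `clades`, and `unique_trees` are hypothetical. Each tree is reduced to the set of leaf groups under its internal nodes, so two rotated Newick strings for the same tree compare equal.

```python
def parse_newick(s):
    """Parse an unweighted Newick string into nested tuples of leaf names."""
    s = s.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume "("
            children = [node()]
            while s[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]  # a leaf name

    return node()


def clades(tree):
    """Return the leaf sets under every internal node (root included).

    The result ignores child order, so it is unchanged by branch rotations.
    """
    found = set()

    def leaves(n):
        if isinstance(n, str):
            return frozenset([n])
        under = frozenset().union(*(leaves(c) for c in n))
        found.add(under)
        return under

    leaves(tree)
    return frozenset(found)


def unique_trees(newicks):
    """Collapse a collection of Newick strings to its distinct topologies."""
    return {clades(parse_newick(n)) for n in newicks}


# Two collections; the first two strings in `a` are rotations of one tree.
a = unique_trees(["((A,B),(C,D));", "((B,A),(D,C));", "((A,C),(B,D));"])
b = unique_trees(["((C,A),(D,B));", "((A,D),(B,C));"])

print(len(a))      # distinct trees in a
print(len(a & b))  # intersection
print(len(a | b))  # union
print(len(a - b))  # set difference
```

Because each tree becomes a hashable value, union, intersection, and set difference on collections reduce to Python's built-in set operators. Unrooted trees and branch weights would need a richer representation than this sketch uses.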
Citation:
Matthews SJ, Williams TL (2011) An efficient and extensible approach for compressing phylogenetic trees. BMC Bioinformatics 12: S16. Available: http://dx.doi.org/10.1186/1471-2105-12-s10-s16.
Publisher:
Springer Nature
Journal:
BMC Bioinformatics
KAUST Grant Number:
KUS-C1-016-04
Issue Date:
2011
DOI:
10.1186/1471-2105-12-s10-s16
PubMed ID:
22165819
Type:
Article
ISSN:
1471-2105
Sponsors:
Funding for this project was supported by the National Science Foundation under grants DEB-0629849, IIS-0713168, and IIS-1018785. Moreover, this publication is based in part on work supported by Award No. KUS-C1-016-04, made by King Abdullah University of Science and Technology (KAUST). This article has been published as part of BMC Bioinformatics Volume 12 Supplement 10, 2011: Proceedings of the Eighth Annual MCBIOS Conference. Computational Biology and Bioinformatics for a New Decade. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S10.
Appears in Collections:
Publications Acknowledging KAUST Support

Full metadata record

DC Field | Value | Language
dc.contributor.author | Matthews, Suzanne J | en
dc.contributor.author | Williams, Tiffani L | en
dc.date.accessioned | 2016-02-23T13:51:51Z | en
dc.date.available | 2016-02-23T13:51:51Z | en
dc.date.issued | 2011 | en
dc.identifier.citation | Matthews SJ, Williams TL (2011) An efficient and extensible approach for compressing phylogenetic trees. BMC Bioinformatics 12: S16. Available: http://dx.doi.org/10.1186/1471-2105-12-s10-s16. | en
dc.identifier.issn | 1471-2105 | en
dc.identifier.pmid | 22165819 | en
dc.identifier.doi | 10.1186/1471-2105-12-s10-s16 | en
dc.identifier.uri | http://hdl.handle.net/10754/596985 | en
dc.publisher | Springer Nature | en
dc.rights | This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. | en
dc.rights.uri | http://creativecommons.org/licenses/by/2.0 | en
dc.title | An efficient and extensible approach for compressing phylogenetic trees | en
dc.type | Article | en
dc.identifier.journal | BMC Bioinformatics | en
dc.contributor.institution | Texas A and M University, College Station, United States | en
kaust.grant.number | KUS-C1-016-04 | en
