Archive for July, 2010
Prediction of Deleterious Non-Synonymous SNPs Based on Protein Interaction Network and Hybrid Properties
Written by Scott Christley et al. on July 30, 2010 – 7:00 am – Non-synonymous SNPs (nsSNPs), also known as Single Amino acid Polymorphisms (SAPs), account for the majority of human inherited diseases. It is important to distinguish deleterious SAPs from neutral ones. Most traditional computational methods for classifying SAPs are based on sequence or structural features. However, these features cannot fully explain the association between a SAP and the observed pathophysiological phenotype. We believe a better rationale for deleterious SAP prediction is: if a SAP lies in a protein with important functions and severely changes the protein's sequence and structure, it is more likely to be disease-related. We therefore established a method to predict deleterious SAPs based on both the protein interaction network and traditional hybrid properties. Each SAP is represented by 472 features comprising sequence, structural and network features. The Maximum Relevance Minimum Redundancy (mRMR) method and Incremental Feature Selection (IFS) were applied to obtain the optimal feature set, and the Nearest Neighbor Algorithm (NNA) served as the prediction model. In jackknife cross-validation, 83.27% of SAPs were correctly predicted using the optimized set of 263 features. The optimized predictor with 263 features was also tested on an independent dataset, where accuracy remained 80.00%. In contrast, SIFT, a widely used predictor of deleterious SAPs based on sequence features, achieved 71.05% accuracy on the same dataset. In our study, network features proved most important for accurate prediction and significantly improved prediction performance. Our results suggest that the protein interaction context could provide important clues to better explain a SAP's functional associations. This research will facilitate post-genome-wide association studies.
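The evaluation scheme described in the abstract, Nearest Neighbor classification assessed by jackknife (leave-one-out) cross-validation, can be illustrated with a small self-contained sketch. The toy two-dimensional feature vectors below are illustrative stand-ins, not the paper's 472-feature SAP encoding, and the mRMR/IFS feature-selection stage is omitted.

```python
# Hypothetical sketch: 1-nearest-neighbor classification with
# leave-one-out (jackknife) cross-validation. Toy data, not the
# paper's SAP features.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_predict(train_X, train_y, x):
    # Nearest Neighbor Algorithm (NNA): label of the closest training point.
    best = min(range(len(train_X)), key=lambda i: euclidean(train_X[i], x))
    return train_y[best]

def jackknife_accuracy(X, y):
    # Leave each sample out once and predict it from the rest.
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if nn_predict(rest_X, rest_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)

# Toy example: two well-separated classes ("deleterious" vs "neutral").
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
y = ["deleterious"] * 3 + ["neutral"] * 3
print(jackknife_accuracy(X, y))  # → 1.0
```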
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
A Modeling Study on How Cell Division Affects Properties of Epithelial Tissues Under Isotropic Growth
Written by Scott Christley et al. on July 30, 2010 – 7:00 am – Cell proliferation affects both cellular geometry and topology in a growing tissue, and hence rules for cell division are key to understanding multicellular development. Epithelial cell layers have long been used to investigate how cell proliferation leads to tissue-scale properties, including organism-independent distributions of cell areas and neighbor numbers. We use a cell-based two-dimensional tissue growth model that includes mechanics to investigate how different cell division rules produce different statistical properties of the cells at the tissue level. We focus on isotropic growth and division rules suggested for plant cells, and compare the models with data from the Arabidopsis shoot. We find that several division rules can lead to the correct distribution of neighbor numbers, as seen in recent studies. In addition, when geometrical properties are also taken into account, further constraints on the cell division rules emerge. We find that division rules favoring equally sized and symmetrically shaped daughter cells best describe the statistical tissue properties.
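As a toy illustration of the "equally sized daughters" criterion the abstract highlights, the snippet below measures how unequal the daughter areas become when a rectangular cell is split off-center; a division plane through the centroid gives perfectly equal daughters. The rectangular geometry is an assumption for illustration, not the paper's mechanical tissue model.

```python
# Toy geometry (assumed, not the paper's model): split a w x h
# rectangular cell with a vertical plane at position x and report
# the relative asymmetry of the two daughter areas.
def daughter_asymmetry(width, height, x):
    """|A1 - A2| / (A1 + A2) for a vertical split at horizontal position x."""
    a1 = x * height
    a2 = (width - x) * height
    return abs(a1 - a2) / (a1 + a2)

print(daughter_asymmetry(4.0, 2.0, 2.0))  # → 0.0, plane through the centroid
print(daughter_asymmetry(4.0, 2.0, 1.0))  # → 0.5, off-center plane
```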
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
Correcting the Bias of Empirical Frequency Parameter Estimators in Codon Models
Written by Scott Christley et al. on July 30, 2010 – 7:00 am – Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators is biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the F3×4 approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the F3×4-style estimators.
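The de facto standard empirical estimator discussed above builds codon frequencies as products of position-specific nucleotide frequencies, renormalized so that stop codons carry zero probability (the construction commonly called F3×4). The sketch below illustrates that construction only; it is not the paper's corrected estimator, which instead adjusts the underlying nucleotide frequency estimates for stop-codon composition.

```python
# Illustrative F3x4-style construction (not the paper's corrected
# estimator): codon frequency = product of position-specific nucleotide
# frequencies, renormalized over the 61 sense codons of the universal
# genetic code (stop codons TAA, TAG, TGA excluded).
from itertools import product

STOP = {"TAA", "TAG", "TGA"}

def f3x4(pos_freqs):
    """pos_freqs: three dicts mapping 'ACGT' to frequencies, one per codon position."""
    raw = {}
    for c1, c2, c3 in product("ACGT", repeat=3):
        codon = c1 + c2 + c3
        if codon in STOP:
            continue
        raw[codon] = pos_freqs[0][c1] * pos_freqs[1][c2] * pos_freqs[2][c3]
    total = sum(raw.values())  # mass assigned to stop codons is redistributed
    return {codon: p / total for codon, p in raw.items()}

# Uniform nucleotide frequencies at every position:
uniform = {n: 0.25 for n in "ACGT"}
freqs = f3x4([uniform, uniform, uniform])
print(len(freqs))                               # → 61 sense codons
print(abs(sum(freqs.values()) - 1.0) < 1e-12)   # → True: frequencies sum to 1
```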
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
People Efficiently Explore the Solution Space of the Computationally Intractable Traveling Salesman Problem to Find Near-Optimal Tours
Written by Scott Christley et al. on July 29, 2010 – 7:00 am – Humans need to solve computationally intractable problems such as visual search, categorization, and simultaneous learning and acting, yet an increasing body of evidence suggests that their solutions to instantiations of these problems are near optimal. Computational complexity offers an explanation for this apparent paradox: (1) only a small portion of instances of such problems are actually hard, and (2) successful heuristics exploit structural properties of the typical instance to selectively improve parts that are likely to be sub-optimal. We hypothesize that these two ideas largely account for the good performance of humans on computationally hard problems. We tested part of this hypothesis by studying the solutions of 28 participants to 28 instances of the Euclidean Traveling Salesman Problem (TSP). Participants were given feedback on the cost of their solutions and were allowed unlimited solution attempts (trials). We found significant improvement between the first and last trials, and that solutions differ significantly from random tours in that they follow the convex hull and avoid self-crossings. More importantly, participants modified their current best solutions in such a way that edges belonging to the optimal solution (“good” edges) were significantly more likely to stay than other edges (“bad” edges), a hallmark of structural exploitation. We found, however, that additional trials eroded the participants' ability to distinguish good from bad edges, suggesting that after too many trials the participants “ran out of ideas.” In sum, we provide the first demonstration of significant performance improvement on the TSP under repetition and feedback, and evidence that human problem-solving may exploit the structure of hard problems, paralleling the behavior of state-of-the-art heuristics.
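The "bad edge" repair the abstract attributes to human solvers parallels the classical 2-opt move, which removes a tour crossing by reversing a segment of the tour. A minimal sketch on a toy four-city instance (not the study's stimuli):

```python
# Hedged sketch: 2-opt local improvement for Euclidean TSP. Reversing a
# tour segment removes edge crossings, the same kind of "bad edge"
# repair described in the abstract. Toy instance for illustration.
import math

def tour_length(points, tour):
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(points, tour):
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                # Reverse the segment tour[i:j] and keep the change if shorter.
                new = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(points, new) < tour_length(points, tour) - 1e-12:
                    tour, improved = new, True
    return tour

# Four corners of a square; the initial tour [0, 1, 2, 3] crosses itself.
pts = [(0, 0), (1, 1), (1, 0), (0, 1)]
best = two_opt(pts, [0, 1, 2, 3])
print(round(tour_length(pts, best), 4))  # → 4.0, the crossing is removed
```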
Tags: computer, news, science
Posted in Computer Science | Comments Off
Transcriptional Profiles of Leukocyte Populations Provide a Tool for Interpreting Gene Expression Patterns Associated with High Fat Diet in Mice
Written by Scott Christley et al. on July 29, 2010 – 7:00 am – Microarray experiments in mice have shown that a high fat diet can lead to elevated expression of genes that are disproportionately associated with immune functions. These effects of a high fat (atherogenic) diet may be due to infiltration of tissues by leukocytes in coordination with inflammatory processes.
Methodology/Principal Findings: The Novartis strain-diet-sex microarray database (GSE10493) was used to evaluate the hepatic effects of a high fat diet (4 weeks) in 12 mouse strains and both genders. We developed and applied an algorithm that identifies “signature transcripts” for many different leukocyte populations (e.g., T cells, B cells, macrophages) and uses this information to derive an in silico “inflammation profile”. Inflammation profiles highlighted monocytes, macrophages and dendritic cells as key drivers of gene expression patterns associated with high fat diet in liver. In some strains (e.g., NZB/BINJ, B6), we estimate that 50–60% of transcripts elevated by high fat diet may be due to hepatic infiltration by these cell types. Interestingly, DBA mice appeared to exhibit resistance to localized hepatic inflammation associated with atherogenic diet. A common characteristic of infiltrating cell populations was elevated expression of genes encoding components of the toll-like receptor signaling pathway (e.g., Irf5 and Myd88).
Conclusions/Significance: A high fat diet promotes infiltration of hepatic tissue by leukocytes, leading to elevated expression of immune-associated transcripts. The intensity of this effect is genetically controlled and sensitive to both strain and gender. The algorithm developed in this paper provides a framework for computational analysis of tissue remodeling processes and can usefully be applied to any in vivo setting in which inflammatory processes play a prominent role.
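One plausible reading of the "signature transcript" idea can be sketched as follows: call a gene a signature for a leukocyte population when its expression there exceeds every other population by a fold-change threshold, then score a tissue sample by the mean expression of each population's signatures. The threshold, gene names, and scoring rule below are hypothetical illustrations, not the paper's actual algorithm or data.

```python
# Hypothetical sketch of signature-based inflammation profiling.
# The fold threshold and the toy marker genes are assumptions.
def signatures(profiles, fold=5.0):
    """profiles: {population: {gene: expression}} from purified cell types."""
    sigs = {p: [] for p in profiles}
    genes = next(iter(profiles.values())).keys()
    for g in genes:
        for p in profiles:
            others = [profiles[q][g] for q in profiles if q != p]
            if profiles[p][g] >= fold * max(others):
                sigs[p].append(g)   # g is specific to population p
    return sigs

def inflammation_profile(sigs, tissue):
    """Mean tissue expression of each population's signature genes."""
    return {p: (sum(tissue[g] for g in gs) / len(gs) if gs else 0.0)
            for p, gs in sigs.items()}

# Toy purified-cell profiles with two classic markers:
profiles = {
    "macrophage": {"Emr1": 100.0, "Cd3e": 1.0},
    "T_cell":     {"Emr1": 2.0,   "Cd3e": 80.0},
}
sigs = signatures(profiles)
print(sigs)  # {'macrophage': ['Emr1'], 'T_cell': ['Cd3e']}
print(inflammation_profile(sigs, {"Emr1": 30.0, "Cd3e": 4.0}))
```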
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
Quantifying the Proteolytic Release of Extracellular Matrix-Sequestered VEGF with a Computational Model
Written by Scott Christley et al. on July 29, 2010 – 7:00 am – VEGF proteolysis by plasmin or matrix metalloproteinases (MMPs) is believed to play an important role in regulating vascular patterning in vivo by releasing VEGF from the extracellular matrix (ECM). However, a quantitative understanding of the kinetics of VEGF cleavage and of the efficiency of cell-mediated VEGF release is currently lacking. To address these uncertainties, we developed a molecularly detailed quantitative model of VEGF proteolysis, used here in the context of an endothelial sprout.
Methodology and Findings: To study a cell's ability to cleave VEGF, the model captures MMP secretion, VEGF–ECM binding, VEGF proteolysis from VEGF165 to VEGF114 (the expected MMP cleavage product of VEGF165) and VEGF receptor-mediated recapture. Using experimental data, we estimated the effective bimolecular rate constant of VEGF165 cleavage by plasmin to be 328 M−1s−1 at 25°C, which is relatively slow compared with typical MMP–ECM proteolysis reactions. While previous studies have implicated cellular proteolysis in growth factor processing, we show that single cells do not individually have the capacity to cleave VEGF to any appreciable extent (less than 0.1% conversion). In addition, we find that a tip cell's receptor system will not efficiently recapture the cleaved VEGF because cleaved VEGF cannot associate with Neuropilin-1.
Conclusions: Overall, VEGF165 cleavage in vivo is likely to be mediated by the combined effect of numerous cells rather than occurring in a single-cell-directed, autocrine manner. We show that heparan sulfate proteoglycans (HSPGs) potentiate VEGF cleavage by increasing the VEGF clearance time in tissues. In addition, we find that the VEGF–HSPG complex is more sensitive to proteases than is soluble VEGF, which may imply its potential relevance in receptor signaling. Finally, according to our calculations, experimentally measured soluble protease levels are approximately two orders of magnitude lower than those needed to account for the levels of VEGF cleavage seen in pathological situations.
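To get a feel for the scale of the fitted rate constant, note that with protease in excess, bimolecular cleavage behaves as a pseudo-first-order process with rate k_eff = k · [protease], giving a VEGF165 half-life of ln(2) / k_eff. The rate constant is the plasmin value from the abstract; the protease concentration below is an arbitrary illustrative choice, not a measured value.

```python
# Back-of-the-envelope pseudo-first-order kinetics using the abstract's
# fitted rate constant. The protease concentration is an assumption
# chosen purely for illustration.
import math

k = 328.0         # M^-1 s^-1, effective bimolecular rate constant (from abstract)
protease = 1e-7   # M, assumed free protease concentration (illustrative)

k_eff = k * protease                      # s^-1, pseudo-first-order rate
half_life_h = math.log(2) / k_eff / 3600  # hours until half the VEGF is cleaved
print(round(half_life_h, 1))  # → 5.9
```

Even at this generous protease level the half-life is measured in hours, consistent with the abstract's point that cleavage is slow and unlikely to be driven by a single cell.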
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
Occupancy Modeling, Maximum Contig Size Probabilities and Designing Metagenomics Experiments
Written by Scott Christley et al. on July 29, 2010 – 7:00 am – Mathematical aspects of coverage and gaps in genome assembly have received substantial attention from bioinformaticians. Typical problems under consideration suppose that reads can be experimentally obtained from a single genome and that the number of reads will be set to cover a large percentage of that genome at a desired depth. In metagenomics experiments, genomes from multiple species are analyzed simultaneously and obtaining large numbers of reads per genome is unlikely. We propose, as a metric for metagenomic experimental design, the probability of obtaining at least one contig of a desired minimum size from each novel genome in the pool, without restrictions based on depth of coverage. We derive an approximation to the distribution of maximum contig size for single-genome assemblies using relatively few reads. This approximation is verified in simulation studies and applied to a number of metagenomic experimental design problems, ranging in difficulty from detecting a single novel genome in a pool of known species to detecting each of a random number of novel genomes whose collective sizes and abundances follow given distributions in a single pool.
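The quantity the abstract analyzes can also be estimated by direct simulation: place reads uniformly on a genome, merge overlapping reads into contigs, and count how often the largest contig reaches a minimum size. The parameters below are illustrative choices, not values from the paper, and the simulation ignores assembly errors and repeats.

```python
# Monte Carlo sketch of the design metric: probability that the maximum
# contig reaches a minimum size. Parameters are illustrative assumptions.
import random

def max_contig(genome, n_reads, read_len, rng):
    """Largest contig from n_reads uniform single-end reads on one genome."""
    starts = sorted(rng.randrange(genome - read_len + 1) for _ in range(n_reads))
    best = 0
    cur_start, cur_end = starts[0], starts[0] + read_len
    for s in starts[1:]:
        if s <= cur_end:                       # read overlaps the running contig
            cur_end = max(cur_end, s + read_len)
        else:                                  # gap: close contig, start a new one
            best = max(best, cur_end - cur_start)
            cur_start, cur_end = s, s + read_len
    return max(best, cur_end - cur_start)

def p_contig_at_least(genome, n_reads, read_len, min_size, trials=2000, seed=1):
    rng = random.Random(seed)
    hits = sum(max_contig(genome, n_reads, read_len, rng) >= min_size
               for _ in range(trials))
    return hits / trials

p = p_contig_at_least(genome=100_000, n_reads=200, read_len=400, min_size=1_000)
print(0.0 <= p <= 1.0)  # → True: a Monte Carlo estimate of the design metric
```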
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
Unlocking Short Read Sequencing for Metagenomics
Written by Scott Christley et al. on July 28, 2010 – 7:00 am – Different high-throughput nucleic acid sequencing platforms are currently available, but a trade-off exists between the cost and number of reads that can be generated and the read length that can be achieved.
Methodology/Principal Findings: We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp, with quality scores approaching those of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read.
Conclusions/Significance: This strategy is broadly applicable to sequencing applications that benefit from low-cost high-throughput sequencing but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the Illumina Genome Analyzer, with low error rates and at a fraction of the cost of pyrosequencing.
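The overlap-joining idea can be sketched in a few lines: reverse-complement the 3' mate, find the best suffix/prefix overlap above a minimum length, and concatenate the non-overlapping remainder. This is a simplified illustration of the concept, not SHERA's actual algorithm, which also weighs base quality scores; mismatch tolerance is omitted here.

```python
# Hedged sketch of overlap-based mate-pair joining (concept only, not
# SHERA itself): exact suffix/prefix matching with a minimum overlap.
def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def join_pair(fwd, rev, min_overlap=10):
    mate = revcomp(rev)  # put the 3' read on the same strand as the 5' read
    # Longest suffix of fwd that exactly matches a prefix of the mate.
    for k in range(min(len(fwd), len(mate)), min_overlap - 1, -1):
        if fwd[-k:] == mate[:k]:
            return fwd + mate[k:]
    return None  # no confident overlap: leave the reads unjoined

# Simulate a 36 bp insert sequenced from both ends with 24 bp reads:
fragment = "ACGTACGGTTCAGGACCTTGAACCGTAGGCTTAACG"
fwd = fragment[:24]            # 5' read
rev = revcomp(fragment[12:])   # 3' read, from the opposite strand
print(join_pair(fwd, rev) == fragment)  # → True: composite read recovered
```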
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
Genome Sequencing Reveals Widespread Virulence Gene Exchange among Human Neisseria Species
Written by Scott Christley et al. on July 28, 2010 – 7:00 am – Commensal bacteria comprise a large part of the microbial world, playing important roles in human development, health and disease. However, little is known about the genomic content of commensals or how closely related they are to their pathogenic counterparts. The genus Neisseria, containing both commensal and pathogenic species, provides an excellent opportunity to study these issues. We undertook comprehensive sequencing and analysis of human commensal and pathogenic Neisseria genomes. Commensals have an extensive repertoire of virulence alleles, a large fraction of which have been exchanged among Neisseria species. Commensals also have the genetic capacity to donate DNA to, and take up DNA from, other Neisseria. Our findings strongly suggest that commensal Neisseria serve as reservoirs of virulence alleles, and that they engage extensively in genetic exchange.
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
Diversity of HIV-1 Subtype B: Implications to the Origin of BF Recombinants
Written by Scott Christley et al. on July 28, 2010 – 7:00 am – The HIV-1 subtype B epidemic in Brazil is peculiar because of the high frequency of isolates carrying the GWGR tetramer in the V3 loop region. It has been suggested that GWGR is a distinct variant, less pathogenic than other subtype B isolates.
Methodology/Principal Findings: Ninety-four percent of the HIV-1 subtype B worldwide sequences (7689/8131) obtained from the Los Alamos HIV database contain proline at the tetramer of the V3 loop of the env gene (GPGR), and only 0.74% (60/8131) have tryptophan (GWGR). By contrast, 48.4% (161/333) of subtype B isolates from Brazil have proline, 30.6% (102/333) contain tryptophan and 10.5% (35/333) have phenylalanine (F) at the second position of the V3 loop tip. The proportions of tryptophan and phenylalanine in Brazilian isolates are much higher than in worldwide subtype B sequences (chi-square test, p = 0.0001). The combined proportion of proline, tryptophan and phenylalanine (GPGR + GWGR + GFGR) in Brazilian isolates corresponds to 89% of all amino acids at this position of the V3 loop. Phylogenetic analysis revealed that almost all subtype B isolates in Brazil have a common origin regardless of their motif (GWGR, GPGR, GGGR, etc.) at the V3 tetramer. This shared ancestral origin was also observed in CRF28_BF and CRF29_BF in a genome region (free of recombination) derived from parental subtype B. These results imply that tryptophan substitution (e.g., GWGR-to-GxGR), previously associated with a change in coreceptor usage within the host, also occurs at the population level.
Conclusions/Significance: Based on the current findings and a previous study showing that tryptophan and phenylalanine in the V3 loop are related to coreceptor usage, we propose that tryptophan and phenylalanine in subtype B isolates in Brazil are maintained by selective mechanisms due to the distinct coreceptor preferences in target cells of GWGR and GFGR viruses.
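The kind of comparison reported above (a chi-square test on motif counts in Brazilian versus worldwide sequences) is easy to reproduce from the abstract's own numbers. The sketch below uses the GWGR counts quoted in the abstract (102 of 333 Brazilian isolates, 60 of 8131 worldwide) in a 2×2 table; the 10.83 threshold is the standard 1-degree-of-freedom critical value for p < 0.001.

```python
# Pearson chi-square on a 2x2 table of GWGR vs non-GWGR counts,
# using the counts quoted in the abstract.
def chi_square(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = rows[i] * cols[j] / total  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

table = [[102, 333 - 102],    # Brazil: GWGR vs other motifs
         [60, 8131 - 60]]     # worldwide: GWGR vs other motifs
stat = chi_square(table)
print(stat > 10.83)  # → True: exceeds the 1-df critical value for p < 0.001
```

The statistic is enormous, consistent with the abstract's p = 0.0001 for the combined tryptophan/phenylalanine comparison.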
Tags: biology, computing, news
Posted in Computational Biology | Comments Off
