Research Statement

Galileo's Telescope

The last ten years have witnessed a revolution in the acquisition of high-throughput data in almost every aspect of biological and biomedical science. Examples of the vast reach of this revolution are manifold, from personalized genomics to electronic health records. Future progress in biomedical science and clinical care will rely heavily on generating meaningful knowledge by integrating distinct sources that yield vast amounts of data, a problem dubbed “big data.” In light of the growing amount of relevant data, it is becoming increasingly evident that most diseases are not the result of single genetic alterations but instead derive from complex interactions between multiple factors (genetic, environmental, etc.). To disentangle the relevant factors leading to disease in a statistically robust way, large numbers of samples and high-throughput techniques are necessary.

These data are presently being generated by large projects, providing an opportunity to transcend the single-lab approach and form a bigger picture of the problem. However, analyzing the huge amount of data generated by these technologies, and integrating it with other data sources, demands strong quantitative approaches and a solid understanding of the scientific aspects of the specific biological and clinical questions. These new approaches raise questions that lie outside the realm of traditional biology and could illuminate new paths to problem solving.

Our main scientific interests lie in modeling and understanding the dynamics of biological systems through the lens of genomics. Our work focuses on three distinct topics:

  • Cancer. Next-generation sequencing technologies provide an extraordinary opportunity to identify somatic mutations that contribute to the development of tumors. We are developing methods to identify cancer-driving mutations in high-throughput sequencing datasets.
  • Infectious diseases. Evolution is a dynamic process that shapes genomes. Our team at Columbia is developing algorithms and software to analyze genomic data, with a view to understanding the molecular biology, population genetics, phylogeny, and epidemiology of viruses.
  • Electronic Health Records. Clinical databases constitute a rich and complex source of raw data. We are using the power of statistics and computers to tease out important clinical patterns in these diverse datasets.

In particular, we develop mathematical, statistical and computational approaches that range from the analysis of high-throughput data to the more abstract identification of global patterns in evolutionary processes. The three main questions that we are addressing are:

  1. Identify the driver mechanisms of evolutionary processes: what are the specific genetic alterations that contribute to genesis, progression and spread?
  2. Reconstruct the history and characterize the dynamics of progression: what is the evolutionary context within which alterations occur? How do different selection pressures determine genetic alterations?
  3. Uncover the combinatorial patterns of alterations that drive the evolutionary process.

For more details, please check the latest publications from our group.

"Philosophy is written in this vast book, which lies continuously open before our eyes (I mean the universe). But it cannot be understood unless you have first learned to understand the language and recognize the characters in which it is written. It is written in the language of mathematics, and the characters are triangles, circles, and other geometrical figures. Without such means, it is impossible for us humans to understand a word of it, and to be without them is to wander around in vain through a dark labyrinth."

Galileo Galilei, Opere 6:232