Population genomics
Duplication of differentiated alleles
It has been suggested that gene duplication is much more likely to occur if the duplicates are not identical in sequence, but instead represent two alleles that have already diverged in sequence and in function. The latter is often the case under balancing selection, where two alleles at the same locus are both maintained in a population.
To test this hypothesis, we will identify genes in the fly Drosophila that evolved under balancing selection (based on fully sequenced fly populations), and check if these genes were more likely to be duplicated in related fly species.
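One simple way to frame the final comparison is a 2x2 contingency test: genes under balancing selection versus other genes, duplicated versus not duplicated. The sketch below is only an illustration of that idea, assuming a one-sided Fisher's exact test is an acceptable choice; the function name and the counts in the usage line are hypothetical and not taken from any dataset.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: balancing-selection genes that are duplicated
    b: balancing-selection genes that are not duplicated
    c: other genes that are duplicated
    d: other genes that are not duplicated
    Returns P(X >= a) under the hypergeometric null of no association.
    """
    row1 = a + b            # all genes under balancing selection
    col1 = a + c            # all duplicated genes
    n = a + b + c + d       # all genes
    total = comb(n, col1)
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / total

# Hypothetical counts, for illustration only.
print(fisher_exact_greater(12, 88, 300, 9600))
```

A small p-value would indicate that duplicated genes are over-represented among the genes under balancing selection; it would not by itself rule out confounders such as gene length or expression level.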
Relationship between population size and evolution of different sequence types
Relative to the mutation rate, gene control regions in humans have accumulated more mutations than homologous regions in rodents [Keightley 2005]. This is probably caused by a less efficient removal of slightly deleterious mutations, because humans have had very small population sizes in the distant past. Is the same true more generally? Are the regulatory regions of species with small population sizes ‘unstable’? This project will involve species selection, extraction and alignment of regulatory sequences from publicly available whole-genome data, and application of statistical tests to the alignments.
References
- Evidence for widespread degradation of gene control regions in hominid genomes. Keightley PD, Lercher MJ, Eyre-Walker A. PLoS Biol. 2005;3:e42.
Are Neanderthals a subpopulation of Homo sapiens? (with Thomas Wiehe)
The Neanderthal genome is more similar to non-African than to African human DNA [Green 2010]. This has two possible explanations: (i) introgression (sex between non-African humans and Neanderthals), or (ii) long-standing population structure in African humans, with Neanderthals splitting from a different lineage than the non-African humans [Durand 2011]. Assuming that the admixture in (i) occurred only outside Africa, these two hypotheses can be distinguished by analysing SNPs from the closest African relatives of non-African humans, to see if they show the same pattern as non-Africans (⇒ model (ii)) or the same pattern as other Africans (⇒ model (i)).
Appropriate published African and non-African sequences should be available.
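A common way to quantify such allele-sharing patterns is the D statistic (ABBA-BABA test) discussed by Durand et al. The sketch below counts ABBA and BABA sites from four aligned haplotypes; the function name and the single-haplotype input format are assumptions made for illustration, and real analyses would use population allele frequencies and a block jackknife for significance.

```python
def abba_baba_d(p1, p2, p3, outgroup):
    """D statistic from four aligned haplotypes (strings of equal length).

    Here P1 and P2 would be two human samples, P3 the Neanderthal, and the
    outgroup an ape sequence that defines the ancestral allele (A).
    Returns None if no informative site is found.
    """
    abba = baba = 0
    for a1, a2, a3, anc in zip(p1, p2, p3, outgroup):
        alleles = {a1, a2, a3, anc}
        # Use only clean biallelic sites where P3 carries the derived allele.
        if not alleles <= set("ACGT") or len(alleles) != 2 or a3 == anc:
            continue
        if a1 == anc and a2 == a3:
            abba += 1          # P2 shares the derived allele with P3
        elif a2 == anc and a1 == a3:
            baba += 1          # P1 shares the derived allele with P3
    if abba + baba == 0:
        return None
    return (abba - baba) / (abba + baba)
```

With an African sample as P1, D could be computed once with a non-African sample as P2 and once with the closest African relatives of non-Africans as P2, and the two values compared.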
References
- Green RE, Krause J, Briggs AW, et al. (56 co-authors). 2010. A draft sequence of the Neandertal genome. Science. 328(5979):710–722.
- Testing for Ancient Admixture between Closely Related Populations, Durand et al., MBE 2011
- Dienekes' Anthropology Blog
- Mitochondrial genome variation and the origin of modern humans
Phylogenomics
Ancient hybridisations among yeast species
Yeast genomes show convincing evidence of ancient whole-genome duplications (as do plant and animal genomes). These are generally interpreted as simple genome doublings, i.e., the combination of two identical genomes. On the other hand, we know from the study of plants that allopolyploids (polyploids derived from the merging of two different genomes via hybridisation) are much more frequent than autopolyploids (simple genome doublings), at least in producing new species [Mallet 2007; Stace 1975]. We (Daniel Hartleb, Christian Esser, Martin Lercher) have developed a range of phylogenomic methods to detect ancient hybridisations. In this project, we aim to apply them to fully sequenced yeast genomes to detect evidence for ancient hybridisations among yeast species.
References
- MALLET, J., 2007 Hybrid speciation. Nature 446: 279-283.
- STACE, C., 1975 Hybridization and the flora of the British Isles. Academic Press, London.
Independence of NADP-ME origins in plant evolution (with Vero Maurino)
NADP-ME, an enzyme also present in C3 plants, has acquired a key role in C4 metabolism, where it forms a central enzyme of the CO2 pump. To perform this new function efficiently, the kinetics of NADP-ME differ between the C3 and the C4 version. This requires several changes in the encoding sequence.
It is widely believed that C4 metabolism evolved independently at least 50 times in diverse plant lineages [Sage 2004]. However, it is at least conceivable that horizontal gene transfer (e.g., through hybridisations between C3 and C4 species) also contributed to the dispersion of C4 metabolism.
To test for non-independence of allegedly independent origins of C4, we need to obtain many C3 and C4 NADP-ME sequences. We will then use the corresponding sequence alignment to construct several phylogenies:
- one using all sites,
- one using only synonymous sites (those that are not relevant for enzyme kinetics),
- one using only sites that are fixed among all C3 plants, but vary between C3 and C4 (and potentially among C4 plants).
In the case of independent evolution, tree 2 should closely reflect the accepted taxonomy, i.e., C4 plants should cluster with their C3 relatives. Tree 3 should in this case show no particular structure; in particular, C4 species should not cluster with their taxonomic relatives that allegedly evolved C4 independently.
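The site selection for the third tree can be sketched as follows. This is a minimal illustration, assuming the alignment is available as two lists of equal-length aligned strings (one for C3, one for C4 sequences) with "-" as the gap character; the function name is hypothetical.

```python
def c3_fixed_c4_variable_sites(c3_seqs, c4_seqs, gap="-"):
    """Return 0-based alignment columns that are fixed among all C3
    sequences but where at least one C4 sequence carries a different state.
    Gaps are ignored in C4 sequences; columns with a gap in C3 are skipped.
    """
    sites = []
    for i, column in enumerate(zip(*c3_seqs)):
        c3_states = set(column)
        if len(c3_states) != 1 or gap in c3_states:
            continue                      # not fixed among C3 plants
        c3_state = column[0]
        c4_states = {seq[i] for seq in c4_seqs} - {gap}
        if any(state != c3_state for state in c4_states):
            sites.append(i)
    return sites

def extract_columns(seq, sites):
    """Build the reduced sequence used as input for tree reconstruction."""
    return "".join(seq[i] for i in sites)
```

The reduced sequences produced by `extract_columns` could then be passed to any phylogeny program; the same filtering could be applied to codons or amino acids instead of nucleotides.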
References
- SAGE, R. F., 2004. The evolution of C4 photosynthesis. New Phytologist 161: 341–370.
Insect phylogeny
In 2006, we found that the true evolutionary relationships between insects were different from what was generally believed [Savard et al 2006]. At the time, only a few fully sequenced insect genomes were available. Now is a good time to repeat that analysis with a larger and more reliable dataset. This will involve species selection, download of whole genome data, ortholog identification, sequence alignment, phylogeny reconstruction, and statistical tests to exclude common problems with reconstructed phylogenetic trees.
References
- Phylogenomic analysis reveals bees and wasps (Hymenoptera) at the base of the radiation of Holometabolous insects. Savard J, Tautz D, Richards S, Weinstock GM, Gibbs RA, Werren JH, Tettelin H, Lercher MJ. Genome Res. 2006, 16:1334–8.
- Regier JC, Shultz JW, Zwick A, Hussey A, Ball B, Wetzer R, Martin JW, Cunningham CW. Arthropod relationships revealed by phylogenomic analysis of nuclear protein-coding sequences. Nature 2010; 463:1079–83.
- Phylogenetic relationships among insect orders based on three nuclear protein-coding gene sequences. Ishiwata K, Sasaki G, Ogawa J, Miyata T, Su ZH. Mol Phylogenet Evol. 2011 Feb;58(2):169–80
- Single-copy nuclear genes resolve the phylogeny of the holometabolous insects. Wiegmann BM, Trautwein MD, Kim JW, Cassel BK, Bertone MA, Winterton SL, Yeates DK. BMC Biol. 2009 Jun 24;7:34
Does parental phylogenetic distance affect success of allopolyploids?
Allopolyploids are polyploid species that are derived from hybridisations between parents from different species. Typically, the parent species are genetically as divergent as the average species pair in their clade [Chapman 2007; Buggs 2008; Buggs 2009]. Soltis and coworkers interpret this pattern as 'random', i.e., they assume that allopolyploids can form between any parents. Conversely, Paun et al. see the same observation as evidence that allopolyploid parents 'prefer' to have an average distance (in particular, polyploids tend not to form between closely related parents). The latter seems unlikely, as individual species should not care about any average in their clade. However, the correlation between allopolyploid parent difference and average species difference is indeed very strong.
To distinguish between these two interpretations, we will look beyond average distances, and will test statistically if the parents are closer to the average than are random pairs. In other words: are the parents drawn randomly from the distribution of distances in the clade, or are they preferentially drawn from the middle of the distribution?
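One possible form of this test is a resampling scheme. The sketch below is a simplification that treats a single clade: it compares the mean absolute deviation of the parental distances from the clade mean with the same quantity for randomly drawn species pairs. The function name, the choice of test statistic, and the input format (plain lists of pairwise distances) are assumptions for illustration.

```python
import random
from statistics import mean

def closeness_to_mean_test(all_distances, parent_distances, n_draws=10000, seed=0):
    """Resampling test: are parental distances closer to the clade mean
    than distances of randomly drawn species pairs?

    all_distances: list of all pairwise distances in the clade
    parent_distances: distances between the parents of each allopolyploid
    Returns (observed statistic, one-sided p-value).
    """
    rng = random.Random(seed)
    centre = mean(all_distances)

    def spread(distances):
        # Mean absolute deviation from the clade-wide mean distance.
        return mean(abs(d - centre) for d in distances)

    observed = spread(parent_distances)
    k = len(parent_distances)
    hits = sum(
        spread(rng.sample(all_distances, k)) <= observed for _ in range(n_draws)
    )
    # Add-one correction so that the p-value is never exactly zero.
    return observed, (hits + 1) / (n_draws + 1)
```

A small p-value would support preferential drawing from the middle of the distribution; a p-value near 0.5 would be compatible with random pairing. With several clades, the resampling would need to be done within each clade.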
References
- CHAPMAN, M. A., and J. M. BURKE, 2007 Genetic divergence and hybrid speciation. Evolution 61: 1773–1780.
- BUGGS, R. J., P. S. SOLTIS, E. V. MAVRODIEV, V. VAUGHAN SYMONDS and D. E. SOLTIS, 2008 Does Phylogenetic Distance Between Parental Genomes Govern the Success of Polyploids? CASTANEA 73: 74–93
- BUGGS, R. J. A., P. S. SOLTIS and D. E. SOLTIS, 2009 Does hybridization between divergent progenitors drive whole-genome duplication? Molecular Ecology 18: 3334–3339.
Comparative genomics
Causes of GC bias in bacterial (and other) genomes
Each bacterial species has a typical proportion of G and C in its genome, and this GC content varies widely between species. It appears that this is partly due to natural selection favouring high GC content [Hildebrand 2010]. On the other hand, synthesising one AT pair costs the bacterium 2 ATPs and one nitrogen (N) atom less than synthesising one GC pair. There is indeed evidence that nitrogen limitation can influence GC composition (as well as amino acid usage) in bacteria and plants [Elser 2011]. We hypothesize that nitrogen (and possibly energy) limitation is responsible for selection for low GC content in prokaryotes in general. Variation between species in GC content would then stem from the species-specific balance between three factors: selection on nitrogen and energy use efficiency, selection on the ability to encode all amino acids (which makes GC=0 impossible), and the efficiency of selection (which is higher for large effective population sizes). Thus, we predict that low GC content should be correlated with
- low environmental nitrogen availability,
- low environmental energy availability,
- high replication rate (growth rate),
- high effective population size (more efficient selection).
This can be tested by analyzing publicly available data on genomic GC content and population size for bacterial species, in combination with information on nitrogen availability and on energy availability (e.g., light intensity at different ocean depths for photoautotrophic bacteria) in the corresponding environments.
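As a rough illustration of the first analysis step, the following sketch computes genomic GC content and a Spearman rank correlation between GC content and an environmental variable. It uses only the Python standard library; the function names are hypothetical, and a real analysis would additionally have to correct for phylogenetic non-independence between species.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else float("nan")

def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

For example, `spearman(gc_values, nitrogen_values)` over a set of species would be expected to be positive under the hypothesis above, where both input lists are assumed to be in the same species order.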
References
- Hildebrand F, Meyer A, Eyre-Walker A. Evidence of selection upon genomic GC-content in bacteria. PLoS Genet. 2010 Sep 9;6(9). pii: e1001107.
- Elser JJ, Acquisti C, Kumar S. Stoichiogenomics: the evolutionary ecology of macromolecular elemental composition. Trends Ecol Evol. 2011 Jan;26(1):38–44.
- Acquisti C, Elser JJ, Kumar S. Ecological nitrogen limitation shapes the DNA composition of plant genomes. Mol Biol Evol. 2009.
Does tRNA abundance drive codon usage bias – or vice versa?
Codon usage bias (just as GC bias) is strong within species, but the preferred codons (or GC content) vary across species. It has been suggested that this is caused by differential tRNA abundance for synonymous codons. This abundance, in turn, is implicitly or explicitly assumed to be largely the consequence of chance. Here, we hypothesize that tRNA abundance is under selection to match the GC distribution in coding sequences (which in turn is determined by the strength and efficiency of selection stemming from limited availability of nitrogen, or by selection for differential GC content between co-living species to limit horizontal gene transfer).
Looking at several (phylogenetically independent) pairs of closely related prokaryotic species, we will test if their difference in GC (or codon bias) is larger than their difference in tRNA abundance. If this is indeed the case, we will conclude that GC diversifies first, and tRNA abundance follows to match it. If we find the opposite pattern, we will conclude that GC follows tRNA abundance. One way to perform this comparison would be to calculate optimal tRNA abundances for each genome based on their codon bias (or GC), and then test if actual tRNA copy numbers are more similar between species than expected from the optimal abundances.
Alternatively, we will examine if the question can be more powerfully addressed using comparative phylogenetic methods (e.g., Felsenstein’s independent contrasts).
PopGenome
What is PopGenome?
The study of genetic diversity is essential for understanding the nature of evolutionary processes at the molecular level.
The PopGenome library provides data analysis tools for population genetics and is written in R, a powerful, open-source statistical computing environment. Several polymorphism statistics can be calculated, such as the number of segregating sites and FST measures based on nucleotide or haplotype diversity. In addition, PopGenome offers many neutrality statistics, such as Tajima's D and the Rozas R2 test. Testing the significance of these statistics requires generating simulated samples from a neutral model using a coalescent approach. For this purpose, PopGenome provides an interface to the MS program written by Hudson (2002).
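To make one of these neutrality statistics concrete, here is a minimal Python sketch of Tajima's D computed from aligned sequences. It is an independent illustration of the published formula, not PopGenome's R interface; the function name and the input format (a list of equal-length strings without missing data) are assumptions.

```python
from collections import Counter
from math import comb, sqrt

def tajimas_d(seqs):
    """Tajima's D for a list of aligned sequences of equal length.

    Returns None when D is undefined (fewer than 4 sequences or no
    segregating sites).
    """
    n = len(seqs)
    seg_cols = [col for col in zip(*seqs) if len(set(col)) > 1]
    s = len(seg_cols)                     # number of segregating sites
    if n < 4 or s == 0:
        return None
    pairs = comb(n, 2)
    # Nucleotide diversity: mean number of pairwise differences.
    pi = sum(
        pairs - sum(comb(c, 2) for c in Counter(col).values())
        for col in seg_cols
    ) / pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of rare variants gives negative values, an excess of intermediate-frequency variants gives positive values; significance would be assessed against coalescent simulations as described above.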
The sliding window method can be used to scan genetic data with different window and jump sizes. Bayesian methods are becoming increasingly important for population genetic studies and will be implemented in the next release. The PopGenome environment will also have the data handling and analysis capabilities needed for genome-wide resequencing projects.
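The idea of scanning with a window size and a jump size can be sketched in a few lines. The example below counts segregating sites per window; it is a generic Python illustration with hypothetical function names, not the PopGenome implementation.

```python
def sliding_windows(length, window, jump):
    """Yield half-open (start, end) intervals of size `window`,
    moving forward by `jump` positions; incomplete windows are dropped."""
    for start in range(0, length - window + 1, jump):
        yield start, start + window

def segregating_sites_per_window(seqs, window, jump):
    """Count segregating sites in each window of an alignment."""
    is_segregating = [len(set(col)) > 1 for col in zip(*seqs)]
    return [
        (start, end, sum(is_segregating[start:end]))
        for start, end in sliding_windows(len(is_segregating), window, jump)
    ]
```

Setting `jump` equal to `window` gives non-overlapping windows, while a smaller `jump` gives overlapping windows and a smoother profile along the sequence.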
References
- Hudson, R. R. (2002). Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337-338.
FSTAT-Integration of F-statistics in an R environment
F-statistics provide a measure of the genetic structuring of populations and were developed by Sewall Wright. Many variants of the FST measure have been published; the most common definition is the following:
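One widely used form, taken here as an assumption since several definitions exist, is the heterozygosity-based FST = (HT - HS) / HT, where HT is the expected heterozygosity of the pooled population and HS is the mean expected heterozygosity within subpopulations. A minimal Python sketch for a single locus, with hypothetical function names and equal weighting of subpopulations:

```python
from collections import Counter

def expected_heterozygosity(freqs):
    """1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())

def fst_heterozygosity(populations):
    """FST = (HT - HS) / HT for one locus.

    populations: list of lists of allele labels, one list per subpopulation.
    Subpopulations are weighted equally and no sample-size correction is
    applied (unlike, e.g., the Weir & Cockerham estimator).
    """
    pop_freqs = []
    for alleles in populations:
        counts = Counter(alleles)
        pop_freqs.append({a: c / len(alleles) for a, c in counts.items()})
    all_alleles = set().union(*pop_freqs)
    mean_freqs = {
        a: sum(f.get(a, 0.0) for f in pop_freqs) / len(pop_freqs)
        for a in all_alleles
    }
    h_s = sum(expected_heterozygosity(f) for f in pop_freqs) / len(pop_freqs)
    h_t = expected_heterozygosity(mean_freqs)
    return (h_t - h_s) / h_t if h_t > 0 else float("nan")
```

The value is 0 when all subpopulations have identical allele frequencies and 1 when they are fixed for different alleles.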
FSTAT is a computer package that estimates and tests gene diversities and differentiation statistics from codominant genetic markers. It computes both the Nei and the Weir & Cockerham families of estimators of gene diversities and F-statistics, and tests them using randomisation methods [Goudet 1995]. The main work in this project will be to become familiar with this software and to discuss the advantages and disadvantages of the methods it includes. PopGenome already integrates some FST measurements. We will determine which further calculations are necessary and how to organise the methods efficiently in an R environment.
References
- Goudet, J. (1995). FSTAT (version 1.2): A computer program to calculate F-statistics. Journal of Heredity 86(6): 485-486.
Estimating recombination rates
Recombination is one of the main forces shaping genome diversity, but the information it generates is often overlooked [Mele et al. 2010].
The Rm statistic of Hudson and Kaplan gives the minimum number of recombination events in the history of a sample of DNA sequences. One way of inferring that at least one recombination event took place between two sites in the history of the sample is the "four-gamete" test [Hudson and Kaplan 1985]. The four-gamete test cannot detect all recombination events, because detection requires that the history of the observed sequences has a specific structure and that mutations occurred on appropriate lineages of the genealogy. Most of the available methods, implemented in programs such as PHASE and LDhat, do not specify which sequences carry the information about the recombination events.
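The four-gamete test and the resulting Rm count can be sketched as follows. This is a Python illustration with hypothetical function names (the project itself targets R), assuming aligned sequences of equal length without missing data; Rm is obtained here as the largest set of non-overlapping site intervals that each show all four gametes.

```python
from itertools import combinations

def four_gametes(seqs, i, j):
    """True if all four two-site haplotypes occur at biallelic sites i and j."""
    return len({(s[i], s[j]) for s in seqs}) == 4

def hudson_kaplan_rm(seqs):
    """Minimum number of recombination events implied by the four-gamete test."""
    length = len(seqs[0])
    # Only biallelic sites are informative under the infinite-sites model.
    sites = [i for i in range(length) if len({s[i] for s in seqs}) == 2]
    intervals = [
        (i, j) for i, j in combinations(sites, 2) if four_gametes(seqs, i, j)
    ]
    # Greedy selection by right endpoint yields the maximum number of
    # disjoint intervals; intervals may share an endpoint because each
    # event lies strictly between the two sites.
    intervals.sort(key=lambda interval: interval[1])
    rm, last_end = 0, -1
    for start, end in intervals:
        if start >= last_end:
            rm += 1
            last_end = end
    return rm
```

Because only events that leave a four-gamete signature are counted, the result is a lower bound on the true number of recombination events.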
The method called IRiS (Identifying Recombination in Sequences) published in the paper "A New Method to Reconstruct Recombination Events at a Genomic Scale" is based on a combinatoric algorithm and uses the patterns created by the polymorphic positions in the extant DNA sequences to infer recombinant sequences and to locate the breakpoint [Mele et al. 2010].
In this project you will implement some methods from Hudson (Rm …) and the IRiS method in the programming language R. We will discuss recombination events and their significance for population genetic analysis. We will assess the advantages and disadvantages of several methods.
References
- Mele M, Javed A, Pybus M, Calafell F, Parida L, et al. (2010). A New Method to Reconstruct Recombination Events at a Genomic Scale. PLoS Comput Biol 6(11): e1001010. doi:10.1371/journal.pcbi.1001010
- Hudson, R. R. and N. L. Kaplan. (1985). Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147-164.


