The Grand Locus / Life for statistical sciences

## Is there a gene for alcoholism? (2)

In the post "Is there a gene for alcoholism?", I explained how claims to have discovered the gene for such and such complex behavior (mostly alcoholism and homosexuality) are based on correlations that are never confirmed by experimentation. We will have to wait until neurogenetics comes of age before we can seriously tackle this kind of question. But when that happens, how likely is it that we will really discover a gene for alcoholism?

To get my point across, I will have to say a few words about the problem of missing heritability. According to current estimates, the human genome consists of ~ 25,000 protein-coding genes and about as many non protein-coding RNAs, the function of which still remains to be established. The implicit meaning of "gene for alcoholism" is actually a mutation that would somehow affect one of these ~ 50,000 functional entities.

"Mutation" is somewhat inaccurate in this context; we should rather speak of polymorphism. A piece of our genome is monomorphic if everybody has exactly the same sequence; otherwise, it is polymorphic. The vast majority of polymorphic sequences in humans are SNPs (single-nucleotide polymorphisms), i.e. sequences that differ by only one nucleotide among individuals. There is one SNP every 1,000 base pairs on average, which for a genome of about 3 billion base pairs amounts to around 3 million SNPs in the human population. The other kinds of variation (deletions, insertions, inversions and translocations) are rare in comparison.

The traditional way of measuring the heritability of a complex trait, such as autism or alcoholism, was to compare the occurrence of the trait in identical and fraternal twins (an approach known as twin studies). Since identical twins have exactly the same genome, when one has a trait, the other should also have it; otherwise, the trait is not genetic. Fraternal twins constitute an ideal control group because they grow up in identical cultural conditions, yet they have different genomes. The heritability of a trait is then defined as the fraction of its variance explained by the genome. The heritability of autism in twin studies is around 90%, whereas that of alcoholism is controversial.

The problem of missing heritability appeared with the proliferation of GWAS (genome-wide association studies) that followed the recent explosion of sequencing technologies. The gist of a GWAS is to determine the SNPs of many individuals of a population, find which SNPs are associated with a disorder, and claim that the disorder has a genetic component if there is a strong correlation between at least one SNP and the disorder. Since we can make the approximation that your 3 million SNPs constitute your genetic self, the heritability of autism and other genetic disorders should be explained by the SNPs. Yet, when heritability is measured this way, it peaks at a puny 5% in the best cases. So up to 85 percentage points of the 90% heritability disappear in GWAS. The question is: why?

The most convincing answer so far is epistasis. Epistasis designates interactions between alleles of different genes, but the precise definition varies with the context. Here I will take it as a synergistic effect between SNPs. If we call $A$ and $a$ the variants of SNP #1 and $B$ and $b$ the variants of SNP #2, we would say that there is epistasis between the two SNPs if $(A/A, b/b)$ and $(a/a, B/B)$ individuals are autistic with a frequency of 1%, whereas $(A/A, B/B)$ individuals are autistic 100% of the time. In that case, alleles $A$ and $B$ synergize to trigger the condition. Epistasis is an issue in complex genetic disorders, because it means that the outcome of the combination of alleles of different SNPs is unpredictable (check out the next technical section for a description of epistasis as non-linear effects in logistic regression models).

A way of modeling complex disorders is to estimate their penetrance (the probability that the carrier of a genotype will be affected by the disorder) by logistic regression. The logistic model is

$$g(x) = \log \left( \frac{p(x)}{1-p(x)} \right) = \beta_0 + \beta_1 X_1 + ... + \beta_n X_n,$$

where $X_i$ is the allelic form at SNP $i$ ($0$ or $1$, assuming everybody is homozygous for every SNP), $x = (X_1, ..., X_n)$ is the genotype and $p(x)$ is the penetrance of that genotype. In other words, the logistic regression model assumes that the logit of the penetrance is a linear function of the genotype. $\beta_0$ is the intercept, i.e. the logit penetrance of individuals who have allele $0$ at every SNP, and $\beta_i$ is the effect on the logit penetrance of having allele $1$ at SNP $i$.
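As a minimal sketch of the additive model above, the penetrance of a genotype can be computed by summing the coefficients on the logit scale and mapping the result back to a probability. The function names and the coefficient values below are hypothetical and chosen only for illustration; they are not estimates from any study.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def penetrance(beta0, betas, genotype):
    """Penetrance p(x) under the additive logistic model.

    beta0    -- intercept (logit penetrance of the all-0 genotype)
    betas    -- one coefficient per SNP
    genotype -- sequence of 0/1 allelic forms, one per SNP
    """
    z = beta0 + sum(b * x for b, x in zip(betas, genotype))
    return inv_logit(z)

# Hypothetical coefficients for three SNPs (illustration only).
beta0 = logit(0.01)          # 1% background penetrance
betas = [1.0, 0.5, -0.3]

for genotype in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    print(genotype, round(penetrance(beta0, betas, genotype), 4))
```

Because the effects add up on the logit scale rather than on the probability scale, the same coefficient changes the penetrance by a different amount depending on the rest of the genotype.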

Let's illustrate this model with an example with only two SNPs. Suppose the disorder is present at a background level of 1% among individuals of genotype $(0,0)$. This immediately gives $\beta_0 = \text{logit}(0.01) = -4.60$. Also suppose that the penetrance rises to 10% for individuals carrying allele 1 at exactly one of the two SNPs. We now have $\beta_0 + \beta_1 = \text{logit}(0.1) = -2.20$, which gives $\beta_1 = 2.40$, which is also the value of $\beta_2$. These values determine that the penetrance of individuals with genotype $(1,1)$ has to be $\frac{1}{1+\exp(-(-4.60+2.40+2.40))} = 0.55$. If the observed penetrance turns out to be higher (or lower), the model is invalid and has to be updated to

$$g(x) = \log \left( \frac{p(x)}{1-p(x)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{1\cdot 2} X_1 X_2.$$

It now incorporates an interaction term $\beta_{1\cdot2}$, which represents the epistasis between the two SNPs as an extra increase (or decrease) in logit penetrance for the genotype $(1,1)$. In a similar way, the general logistic model shown above ignores epistasis between SNPs. To incorporate it, we have to update the model by adding the terms $\beta_{i\cdot j}X_iX_j$, of which there are $n(n-1)/2$. This only considers pairwise epistatic interactions. It is also possible to include three-way epistasis by adding the terms $\beta_{i\cdot j \cdot k}X_iX_jX_k$, of which there are $n(n-1)(n-2)/6$, and actually nothing prevents us from including all possible combinations, except the insane number of parameters that such a model would have in the case of the human genome.
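The two-SNP example can be checked numerically. The sketch below recomputes the coefficients from the stated penetrances (1% for genotype $(0,0)$, 10% with allele 1 at a single SNP), derives the penetrance that the additive model predicts for genotype $(1,1)$, and shows how an interaction coefficient would be obtained from an observed penetrance. The helper `interaction_term` and the observed value of 99% are hypothetical, introduced only for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients implied by the stated penetrances.
beta0 = logit(0.01)            # about -4.60
beta1 = logit(0.10) - beta0    # about  2.40
beta2 = beta1                  # same effect at the second SNP

# Penetrance of genotype (1,1) if the two effects are purely additive
# on the logit scale (no epistasis).
additive_p11 = inv_logit(beta0 + beta1 + beta2)
print(round(beta0, 2), round(beta1, 2), round(additive_p11, 2))  # -4.6 2.4 0.55

def interaction_term(observed_p11):
    """Interaction coefficient needed to reproduce an observed penetrance
    for genotype (1,1); zero means no epistasis on the logit scale."""
    return logit(observed_p11) - (beta0 + beta1 + beta2)

# Hypothetical observation: 99% of (1,1) individuals are affected.
beta12 = interaction_term(0.99)
print(round(beta12, 2))
```

A positive interaction coefficient means the two alleles synergize beyond what their separate effects predict, and a negative one means they partly cancel each other.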

The point here is that this type of synergy is common in multi-factorial genetic disorders, and yet it is ignored in most GWAS, which can explain why they predict only 5% of the variation of such traits. So, why would they not take epistasis into account? Well, because they cannot. If a combination of two SNP variants has an unpredictable effect on a trait, you need at least one individual with this combination in order to know what the effect is. With ~ 3 million SNPs there are ~ 4,500 billion pairwise combinations, i.e. far more than there are humans. This means that in spite of its large size, the human population is far from exploring all its genetic possibilities. This also means that there is no way we can predict the effect of some combinations of SNPs, because no human being ever had them. You might argue that SNPs are just proxies for the ~ 50,000 functional elements, of which there are only 1.25 billion pairs. But because the human population is not perfectly mixed, some of these combinations are not present in a single person on this planet.
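The counting argument can be reproduced in a few lines. The SNP and functional-element counts are the round figures used in this post; the world population of 8 billion is an assumption added here for comparison.

```python
from math import comb

N_SNPS = 3_000_000                # approximate number of SNPs (from the text)
N_FUNCTIONAL = 50_000             # approximate number of functional elements
WORLD_POPULATION = 8_000_000_000  # assumption: rough current figure

snp_pairs = comb(N_SNPS, 2)              # n(n-1)/2 pairwise terms
snp_triples = comb(N_SNPS, 3)            # n(n-1)(n-2)/6 three-way terms
functional_pairs = comb(N_FUNCTIONAL, 2)

print(f"SNP pairs:          {snp_pairs:,}")          # 4,499,998,500,000
print(f"SNP triples:        {snp_triples:.2e}")
print(f"Functional pairs:   {functional_pairs:,}")   # 1,249,975,000
print(f"SNP pairs per person: {snp_pairs / WORLD_POPULATION:.0f}")
```

Even under the most favorable assumption that every person carried a distinct pairwise combination, there would be hundreds of SNP pairs per living human. The number of pairs of functional elements is smaller than the population, but that alone does not guarantee that each pair is actually carried by someone.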

So if epistasis is indeed the cause of missing heritability, some complex genetic disorders will never be fully understood, because the possible combinations of variants will never be explored in full. There will always be an unpredictable part in humans as genetic entities.

In conclusion, I would like to make a comment that should have been a word of caution in my previous post. Here I fight extravagant claims that scientists have found a gene for alcoholism, because such claims bring no understanding whatsoever and carry a false idea of neurogenetic determinism. I do not claim that alcoholism has no genetic component, but instead that such results should be handled with agnosticism: if alcoholism does have a not-yet-understood genetic component, the genome carries some information that can indeed be used to identify individuals at risk for purposes of prevention. Publicly extrapolating beyond this is calling for unnecessary media attention.