The spectrum of point mutations - consistent with auto-mutagenesis.

[See ref 5]

What is the meaning of 'random point mutations'?
Point mutations rank high on the list of causes for human genetic disorders and developmental lethality, but also as mechanisms of variation that drove evolution for eons. Yet, many basic questions about them remain unanswered.
For example, numerous mechanisms such as replication errors, genomic repair, radiation damage, random chemical inter-conversion and others have been suggested as potential causes of point mutations. However, it is not clear which of them are the main causes. Are the main sources the exogenous ones such as cosmic radiation or are the main ones generated from within the cell (e.g. as the result of chemical conversion by certain enzymes)?
In particular, it is not clear whether these sources act randomly. To decide this, one has to answer two questions experimentally.
(a) Is the location of the point mutations random?

(b) Is the type of point mutation random?
(i.e. if a base X was replaced with another base Y, was Y randomly chosen, or was it determined by X and its neighbors?)
The question of the randomness of the location of point mutations seems relatively unproblematic. In this case most scientists would presumably be satisfied, if there was an even distribution of the different kinds of point mutation along a genome.
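One way to make "an even distribution along a genome" testable is a chi-square comparison of binned mutation positions against a uniform expectation. The following is only a minimal sketch: the function name, the number of bins and the 0-based position convention are assumptions made for illustration, not part of the study.

```python
def chi_square_uniform(positions, genome_length, n_bins=10):
    """Chi-square statistic for the hypothesis that positions are evenly spread.

    positions: 0-based mutation positions (hypothetical input).
    Returns (statistic, counts_per_bin).
    """
    if genome_length <= 0 or n_bins <= 0:
        raise ValueError("genome_length and n_bins must be positive")
    if not positions:
        raise ValueError("at least one position is required")
    counts = [0] * n_bins
    for pos in positions:
        if not 0 <= pos < genome_length:
            raise ValueError(f"position {pos} lies outside the genome")
        # Integer arithmetic maps each position to one of n_bins equal windows.
        counts[pos * n_bins // genome_length] += 1
    expected = len(positions) / n_bins  # uniform expectation per bin
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, counts
```

The statistic would then be compared with a chi-square distribution with n_bins - 1 degrees of freedom; a large value argues against an even distribution.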
The situation is very different if we ask whether the different kinds of point mutations [X->Y], i.e. which base X turned into which other base Y, occur randomly in genomes. In order to answer this question we must determine the random expectation for each possible kind of point mutation, i.e. we must know how likely it is that any or all of the above-mentioned mechanisms generated it. Subsequently, we would have to compare the spectrum of all point mutations in a genome with its random expectation. Unfortunately, we know neither the spectrum of all point mutations nor the random expectation of the above-mentioned mechanisms, not to mention that there are likely mechanisms of point mutation that are as yet unknown. Therefore, it seems quite important to establish at least the spectrum of all naturally occurring point mutations in different genomes.

The use of 'punctuated' repetitive sequences to determine the spectrum of point mutations.
How can we find these spectra? One may try to compile them from the large number of individual point mutations that have been described in the literature. However, to my knowledge nobody has undertaken the Herculean task of searching the entire literature and archiving all of them. But even if a comprehensive list of all point mutations in genes existed, it might not provide unbiased spectra, as natural selection probably eliminated every point mutation that impacted negatively on gene function.
Therefore, it would be important to analyze well-defined DNA sequences besides genes, if
(a) They exist in large enough numbers in the genome to permit the identification of thousands of point mutations and to establish their natural spectrum.
(b) They are directly available through published genome sequences.
(c) It is unlikely that they are as vigorously subjected to natural selection as genes are.
I tackled this task through a study of point mutations within stretches of repetitive DNA: Simply comparing the mutated sequence period with its preceding and following neighbors allows one to determine which base the mutated one once was, and into which it was mutated.
Most vertebrate genomes contain large amounts of repetitive DNA that can provide large sample sizes for all types of point mutations. These kinds of repetitive sequences will be called 'punctuated repetitive sequences' or simply 'punctuated sequences'. After writing the necessary software and using it to detect and categorize some 51,000 such point mutations in repetitive vertebrate DNA, I found several peculiar patterns and rules in their spectra, which are reported and interpreted here. Figure 1a shows a typical example of a punctuated sequence, and Fig. 1b shows the point mutations found in it.

Fig.1. Identification of point mutations from repetitive DNA containing point mutations ('Punctuated sequences').
(a). Punctuated sequence in human chr. 1 starting at position 643,635. The punctuating base is a 'G' (highlighted base at the end of each line). It repeats 30 times in a row every 77 bases (only 12 of the repeats are shown). It is also an example of occasional shifts in inter-punctuation segments as a result of small insertions and deletions which moved them out of register, as indicated by the highlighted motif 'GGAAC'.
(b). Identification of point mutations in the manually aligned inter-punctuation segments of Example 1a. Comparing the bases within every vertical column shows that there are exactly 7 individual point mutations, namely 3 cases of [G->A], 2 cases of [G->C], one case of [A->C], and one case of [A->G].
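
The column-wise comparison described for Fig. 1b can be sketched in a few lines. This is not the software used in the study; it is a minimal illustration which assumes that the repeats are already aligned to equal length (no insertions or deletions) and that the majority base of each column is the original base. The function name and the tie-handling rule are hypothetical.

```python
from collections import Counter


def call_point_mutations(aligned_repeats):
    """Return (repeat_index, column, original_base, mutated_base) tuples.

    Assumption: the most common base in a column is taken as the original
    base, and every deviating base in that column counts as one mutation.
    """
    if not aligned_repeats:
        return []
    length = len(aligned_repeats[0])
    if any(len(repeat) != length for repeat in aligned_repeats):
        raise ValueError("repeats must be aligned to the same length")
    mutations = []
    for col in range(length):
        column = [repeat[col] for repeat in aligned_repeats]
        ranked = Counter(column).most_common()
        # Columns without a unique majority base are skipped as ambiguous.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue
        consensus = ranked[0][0]
        for index, base in enumerate(column):
            if base != consensus:
                mutations.append((index, col, consensus, base))
    return mutations
```

For example, calling it on the four made-up repeats "GGAACT", "GGAACT", "GAAACT" and "GGAACT" reports a single [G->A] mutation in the third repeat.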

How many kinds of point mutations are there?
At first sight it may seem that there are 16 kinds of point mutations because there are 16 possibilities to substitute 4 bases with 4 others. However, assuming strand-independence, there are actually only 6 essential point mutations [X->Y]. The argument goes as follows.
As mentioned, theoretically one can substitute 4 bases X with 4 bases Y, (X=A,C,G,T, and Y=A,C,G,T) in 16 different ways. However, 4 of them, namely [X->X] with (X=A,C,G,T) represent the replacement of a base with itself and, therefore, cannot be considered as point mutations.
The remaining 12 possible point mutations are

[C->A], [T->A], [G->A], [A->C], [T->C], [G->C],
[A->T], [C->T], [G->T], [A->G], [C->G], [T->G].

The above list disregards the fact that point mutations can and do occur on either of the 2 complementary DNA strands. For example, while counting all the cases of [C->A] in a genome, one should include the cases of [G->T], as each of them is actually just another [C->A] mutation, except that it occurred on the complementary strand.
One may go even further. Assuming that the mechanisms of point mutation are strand-independent, i.e. that they do not prefer one strand over the other, one may predict that the number of every point mutation c([X->Y]) should be approximately the same as the number of its complementary point mutation c([Xcompl->Ycompl]).
In order to test this prediction, a data set of 51,000 point mutations was collected from the genomes of human, chimpanzee, zebrafish, sea squirt, rat, mouse, and pufferfish. A correlation plot of the pooled counts of all 12 kinds of point mutations [X->Y] and the counts of their complements yielded a correlation coefficient of 0.953. The value is close enough to unity to confirm the prediction. It justifies excluding the complementary point mutation [Xcompl->Ycompl] of each mutation [X->Y] from the above list of 12, while combining their counts. As a result, the present study considers only the following six essentially different point mutations, namely
[A->T], [C->T], [G->T], [A->G], [C->G], and [T->G].
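
The reduction from 12 to 6 point mutations, and the strand-symmetry check behind it, can be sketched as follows. The rule used to pick a representative (keep the member of each complementary pair whose new base Y is G or T) is simply an observation that reproduces the six mutations listed above; it is not necessarily how the original software chose them. All function names are hypothetical, and the Pearson formula is written out by hand to keep the sketch self-contained.

```python
from itertools import product
from math import sqrt

BASES = "ACGT"
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_mutation(mutation):
    """[X->Y] seen from the opposite strand: [Xcompl->Ycompl]."""
    x, y = mutation
    return (COMPLEMENT[x], COMPLEMENT[y])


def canonical(mutation):
    """Representative of a complementary pair (assumed rule: Y is G or T)."""
    return mutation if mutation[1] in "GT" else complement_mutation(mutation)


# 16 ordered pairs minus the 4 identities [X->X] leaves 12 point mutations.
ALL_12 = [(x, y) for x, y in product(BASES, repeat=2) if x != y]
ESSENTIAL_6 = sorted({canonical(m) for m in ALL_12})


def fold_counts(counts):
    """Add the count of each mutation to that of its complement."""
    folded = {m: 0 for m in ESSENTIAL_6}
    for mutation, n in counts.items():
        folded[canonical(mutation)] += n
    return folded


def strand_symmetry_correlation(counts):
    """Pearson correlation between c([X->Y]) and c([Xcompl->Ycompl])."""
    xs = [counts[m] for m in ALL_12]
    ys = [counts[complement_mutation(m)] for m in ALL_12]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Here `counts` is a dictionary keyed by (X, Y) tuples; a correlation close to 1 corresponds to the strand-independence described above.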
The numbers of each of these point mutations [X->Y] were normalized in order to obtain their frequency spectra. The normalization procedure took into account the number of 'target' bases X and the requirement that all frequencies must add up to unity. The details of the normalization procedure are explained in [ref 5].
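
Since the details of the normalization are given only in ref 5, the following sketch shows just one plausible reading of the two stated requirements: each pooled count is divided by the number of available target bases, and the resulting rates are rescaled so that they sum to unity. The function name, the input format and the pooling of X with its complement are assumptions, not the documented procedure.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def normalize_spectrum(folded_counts, target_counts):
    """Turn pooled counts of [X->Y] into frequencies that sum to 1.

    folded_counts: {(X, Y): count pooled with the complementary mutation}
    target_counts: {base: number of such bases in the analyzed sequences}
    Assumption: because [X->Y] is pooled with [Xcompl->Ycompl], the pool of
    target bases is the number of X bases plus the number of Xcompl bases.
    """
    rates = {}
    for (x, y), count in folded_counts.items():
        targets = target_counts[x] + target_counts[COMPLEMENT[x]]
        if targets <= 0:
            raise ValueError(f"no target bases available for {x}")
        rates[(x, y)] = count / targets
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no mutations to normalize")
    # Rescale so that all frequencies add up to unity.
    return {mutation: rate / total for mutation, rate in rates.items()}
```

Under this reading, six equally likely mutations would each receive the frequency 1/6, which is the average level against which the spectra can be compared.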

The spectra of vertebrate point mutations.
In addition to the spectrum of the complete human genome (8103 point mutations), I collected the spectra of the complete genomes of chimpanzee (3098 point mutations), zebrafish (17914 point mutations), sea squirt (1419 point mutations), rat (10106 point mutations), mouse (8039 point mutations), and pufferfish (fugu) (2393 point mutations). The spectra of human, chimpanzee, zebrafish, and sea squirt were almost identical. Likewise, the spectra of rat and mouse were strikingly similar (not shown).

Fig.2. Normalized spectra of point mutations S([X->Y]).
The data were derived from the punctuated sequences of the complete genomes of human (8103 point mutations), chimpanzee (3098 point mutations), zebrafish (17914 point mutations), and sea squirt (1419 point mutations). The horizontal lines are drawn at the average level of frequency = 1/6. The error bars represent standard errors of the counts. For better visualization of the patterns, the largest peaks were rendered black.

The auto-mutagenic mechanisms that may explain the spectra.
In all cases, the point mutations [C->T] and [A->G] were the most frequent, while [A->T] and [C->G] displayed intermediate frequencies. The least frequent point mutations were [G->T] and [T->G].
One may express this relationship symbolically in the following way:

{[A->G],[C->T]} ≥ {[A->T],[C->G]} ≥ {[G->T],[T->G]}.
This hierarchy of incidence was best seen by computing the average frequency of all tested species (Fig. 3).

Fig.3. Average spectrum of the spectra S([X->Y]) shown in Fig. 2 ordered by amplitude. Error bars indicate standard deviations of the averages.
The grouping of the 6 essential point mutations into the 2 most frequent (black columns), the 2 intermediately frequent (gray columns) and the 2 least frequent (white columns) mutations will be used to identify the different underlying mechanisms as auto-mutagenic (see text).
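
The averaging behind Fig. 3 and the hierarchy of incidence can be expressed as a short sketch. It assumes that each species' spectrum is a dictionary keyed by strings such as "C->T", that the error bars are sample standard deviations across species, and that the relation "≥" between groups means that every member of a higher group is at least as frequent as every member of the next lower group. These are interpretations made for illustration, and the function names are hypothetical.

```python
from statistics import mean, stdev

# The three groups of the hierarchy, from most to least frequent.
GROUPS = [("A->G", "C->T"), ("A->T", "C->G"), ("G->T", "T->G")]


def average_spectrum(spectra_by_species):
    """Return (mutation, mean, standard deviation) rows ordered by amplitude."""
    species = list(spectra_by_species)
    rows = []
    for mutation in spectra_by_species[species[0]]:
        values = [spectra_by_species[name][mutation] for name in species]
        spread = stdev(values) if len(values) > 1 else 0.0
        rows.append((mutation, mean(values), spread))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def follows_hierarchy(spectrum):
    """Check that each group is at least as frequent as the next lower one."""
    for upper, lower in zip(GROUPS, GROUPS[1:]):
        if min(spectrum[m] for m in upper) < max(spectrum[m] for m in lower):
            return False
    return True
```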

The close similarities between the spectra of point mutations do not necessarily point to a single underlying mechanism, but could be the effective result of a multiplicity of disparate mechanisms. Indeed, it seems that the simplest explanation of the spectra would be a combination of mechanisms, which may be described as follows.
  • (a) The 2 largest peaks at [C->T] and [A->G] may be the result of enzymatic inter-conversion such as the de-amination of 5-methyl-cytosine or the de-methylation of 2-aminated-adenine.
  • (b) The 2 next smaller peaks at [A->T] and [C->G] are in effect single base pair inversions.
    For example, if A is changed into T on one strand, then there was initially a complementary T on the opposite strand that was subsequently changed into A, which means that the AT-pair effectively flipped around. Therefore, in order to explain these peaks, I make the ad hoc assumption that there are natural mechanisms of single base-pair inversions. Of course, the base pairs cannot physically flip around, as their 5'-ends would collide with 5'-ends and their 3'-ends with 3'-ends. Hence, the flipping around must effectively be the excision of an AT- or GC-pair followed by the insertion of the corresponding inverted pair. There is no generally known mechanism of single base-pair inversion. However, each of the 4 possible cases has been observed as a naturally occurring gene mutation related to human disease.
  • (c) The 2 remaining peaks [G->T] and [T->G] can be explained as the result of a combination of the 2 described mechanisms, namely an inter-conversion followed by a single base-pair inversion.
    For example, assume an initial GC-pair, whose C on the opposite strand undergoes a [C->T] inter-conversion. At the next replication, the resulting mismatched GT-pair may be corrected to become an AT-pair. If subsequently, the AT-pair is subjected to a single base-pair inversion it turns into a TA-pair. The combination of these steps has effectively turned the initial G into a T and, therefore represents a [G->T] point mutation.
    Similarly, the [T->G] point mutation can be explained as a [T->C] conversion (i.e. an [A->G] inter-conversion on the opposite strand) followed by a single base-pair inversion of the resulting CG-pair.
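
The three proposed mechanisms can be written down as operations on a base pair, which makes it easy to check that they reproduce all six essential point mutations. The sketch below is a toy model built only on the assumptions stated in the text: the two inter-conversions are [C->T] and [A->G], they may act on either strand, the resulting mismatch is resolved in favor of the converted base, and an inversion swaps the two bases of a pair. The names and the breadth-first search are illustrative choices.

```python
from collections import deque

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
CONVERSION = {"C": "T", "A": "G"}  # the two inter-conversions named in the text


def single_steps(pair):
    """Yield (step_name, new_pair) for every single step the model allows."""
    top, bottom = pair
    if top in CONVERSION:
        new_top = CONVERSION[top]
        yield "inter-conversion", (new_top, COMPLEMENT[new_top])
    if bottom in CONVERSION:
        new_bottom = CONVERSION[bottom]
        yield "inter-conversion (opposite strand)", (COMPLEMENT[new_bottom], new_bottom)
    yield "inversion", (bottom, top)


def pathway(x, y):
    """Shortest list of steps that turns base X on the top strand into Y."""
    start = (x, COMPLEMENT[x])
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        pair, steps = queue.popleft()
        if pair[0] == y:
            return steps
        for name, new_pair in single_steps(pair):
            if new_pair not in seen:
                seen.add(new_pair)
                queue.append((new_pair, steps + [name]))
    raise ValueError(f"no pathway from {x} to {y}")
```

In this toy model, [C->T] and [A->G] need one inter-conversion, [A->T] and [C->G] need one inversion, and [G->T] and [T->G] need an inter-conversion on the opposite strand followed by an inversion, as in the examples above.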

  • The above model also predicts qualitatively the hierarchy of the relative frequencies of the point mutations.
  • Each inter-conversion {[C->T] and [A->G]} would involve only one base, at least initially. Therefore, inter-conversions should be the most readily achieved and, consequently, the most frequent mutations.
  • Base-pair inversions such as {[A->T] and [C->G]} would involve deletions and insertions on both strands and, therefore, should be less frequent.
  • Finally, as combinations of inter-conversions and base-pair inversions, the 2 point mutations {[G->T] and [T->G]} are the most complex and, thus, should be the least frequent of the three cases.
Significance for the 'functional anarchy' of genomes.

    Do the spectra have genome-wide validity?
    Although derived from punctuated sequences, the proposed mechanisms do not require repetitive DNA in order to function. Therefore, if the hypothesis is correct, the resulting spectra of point mutation should be the same in every region of the genome, at least initially. Later on, after natural selection had had time to eliminate detrimental mutations, the spectra may have become altered, especially in the coding regions.
    Another argument in favor of a genome-wide validity of the spectra could be the remarkable similarities of the observed spectra between different species. It is difficult to see how these supposedly 'blind' mechanisms of mutation would be able to differentiate between different regions of a genome, but not between the genomes of different species. In this context it should also be noted that the observed point mutations may not only apply to vertebrates, even though the present study focused on vertebrate genomes, mostly because of their large size.
    Arguments against genome-wide validity of the observed spectra rest mainly on the uncertainty whether punctuated sequences have functions that make them subject to natural selection. At the present time it seems unlikely that they do, although there is growing evidence that repetitive sequences may, indeed, have significant biological functions. But even if the punctuated sequences had no selectable biological function, replication and repair mechanisms could conceivably operate differently on them than on other sequences and potentially distort their spectrum of point mutations.
    As to the other regions of the genomes that are neither genes nor punctuated sequences, there is no method available as yet to determine the spectra of their point mutations. It is also not known whether they are subject to natural selection. Therefore, the answer to the question of the genome-wide validity of the reported spectra must ultimately be decided experimentally, by artificially mutating punctuated sequences and studying the effects on the development and phenotype of the resulting organism.
    Point mutations - yet another case of self-inflicted 'anarchy'?

    At any rate, the possibility of explaining the entire spectrum of point mutations as the result of endogenous mechanisms suggests that these mechanisms contribute substantially to the 'functional anarchy' of genomes. On the one hand, they seem to act indiscriminately and randomly, thus adding to the 'anarchy'. On the other, they drive variation endogenously and thus accelerate evolution. In other words, it appears that genomes did not leave the generation of enough blind point mutations to cosmic radiation, but developed the above 'reckless' mechanisms to create them all by themselves.