The Evolution of Alu-mutations - a clever defense.

[See ref 8]

The deadly threat of Alu-elements and other retro-transposons.

Among the known genome hazards, retro-transposons may be the most dangerous. They should pose a lethal threat to every genome they invade. Their method of amplification via transcripts that reinsert into the host genome through reverse transcription could conceivably lead to an exponential 'explosion' of copy numbers that would completely fragment and thus destroy the host genome. In the case of the Alu retro-transposon, this catastrophe did not befall the ancestral genomes it invaded, although its copy number in, e.g., the human genome exceeds 1 million (4, 5). Lucky for us, our ancestral genomes appear to have found effective defense strategies that limited the proliferation of the Alu-elements to harmless levels and, in the process, may even have created a selective advantage.

One of the defense strategies may have been the mutation of the Alu-elements, possibly aimed at crippling their ability to proliferate. Since the entire spectrum of conceivable point mutations is consistent with the interpretation that all point mutations were caused by auto-mutagenic mechanisms, it would make sense if the genomes had unleashed this arsenal for their defense. As will be shown in this presentation, the Alu mutants in the human genome that contain 50 or more base substitutions outnumber the 'original' Alu-copies by a wide margin. Considering that the Alu-elements are only approximately 280 bases long, such large numbers of base substitutions must have had a substantial impact on their functionality.

One might expect that random Alu-proliferation in the host genome, followed by random base substitutions in each Alu sequence, would result in poorly reproducible, rather chaotic distributions of Alu-mutants. Surprisingly, however, the process created precisely defined frequency distributions of Alu-mutants that were the same for all human chromosomes (and chimpanzee chr.1) and depended only on the specific family to which the Alu-element belonged. In order to explain this finding, this presentation offers a simple mathematical model of the dynamics of the proliferation of Alu-elements while their capacity to proliferate is increasingly inhibited by point mutations. If correct, this model will permit us to reconstruct the evolutionary past of the Alu mutants and also to predict their evolutionary future. It may even serve to justify the interpretation of Alu mutants as time stamps on the host genome.

NOTE: The following study focuses on the number of mutations in an Alu-sequence regardless of their position. In view of the high level of sophistication of today's sequence analysis of Alu-elements, this approach may seem rather crude. However, similar to aerial photography, the omission of details may sometimes offer a depiction of large-scale features that might otherwise go undetected. As a further simplification, we will distinguish only between the 3 major Alu-families, AluY, AluS and AluJ, while ignoring their division into a total of 217 sub-families. In other words, the members of all sub-families will not be discarded but treated as mutants of their major family.

The genome pixel image (GPxI) of Alu-elements and their mutations.

In order to find the Alu-mutants in the human genome, I used my search program 'GA-dnaorg.exe' and specific search primers, 209 bases in size, for the 3 major Alu-families, whose GPxIs are shown in Fig. 1.

Fig.1. GPxI images of the sequences of the 3 major Alu-families.
Note how similar the defining sequences are. Therefore, the present study distinguishes only between these 3 Alu-families and ignores their further division into 217 sub-families (see text).

The search algorithm used in the search program was a simple base-by-base comparison between a search primer and a genome sequence while the search primer was moved along the genome. The success of the search was defined as a match where the number of base substitutions remained below a certain threshold N. Whenever the program found a suitable match it recorded its position, sequence and exact number of base substitutions in a data file.
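The base-by-base comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of 'GA-dnaorg.exe': the function names `count_substitutions` and `find_mutants` are hypothetical, sequences are assumed to be plain strings over A, C, G, T, and a window is accepted when it shows at most `max_subs` substitutions (the source does not say whether the threshold itself is included).

```python
def count_substitutions(primer, window, limit):
    """Count mismatching positions; give up early once `limit` is exceeded."""
    mismatches = 0
    for a, b in zip(primer, window):
        if a != b:
            mismatches += 1
            if mismatches > limit:
                return None  # too different to count as a mutant
    return mismatches


def find_mutants(genome, primer, max_subs):
    """Slide the primer along the genome and record every acceptable match.

    Returns a list of (position, sequence, number_of_substitutions).
    Assumption: only substitutions are considered, no insertions or deletions.
    """
    size = len(primer)
    hits = []
    for pos in range(len(genome) - size + 1):
        window = genome[pos:pos + size]
        n = count_substitutions(primer, window, max_subs)
        if n is not None:
            hits.append((pos, window, n))
    return hits


if __name__ == "__main__":
    # Toy data; a real search would use a ~200-base primer and max_subs = 100.
    print(find_mutants("TTTACGAACGTTT", "ACGTACG", max_subs=1))
```

With a generous threshold, neighbouring window positions may also pass the test, so a practical search would presumably keep only the best hit per locus; the sketch leaves that step out.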

The threshold N must not be chosen too small, lest the search miss too many Alu-mutants. It must also not be too large, lest the search accept sequences that could no longer be considered Alu-mutants. As shown by their GPxIs (Fig. 2), the patterns of the sequences identified by the search program were, indeed, easily recognizable Alu-mutants even for values as high as N = 100 base substitutions, provided the size of the search primer was 200 bases or larger. The same criterion of yielding recognizable Alu-patterns was applied to the selection of a suitable size of the search primers. While a search primer with a size of 200 bases searched with a threshold of N = 100 yielded clear Alu-patterns, a search primer with a size of 50 bases did not yield recognizable Alu-patterns, even if the threshold N was as low as 25 base substitutions.

These and similar criteria led to the choice of a search primer size of 209 or 213 bases and a threshold of N = 100 acceptable base substitutions throughout the following.

Fig.2. Effect of the maximal number of tolerated base substitutions on the Alu-mutant sequences found by the search program.
The scale on top represents the positions of the bases of each found sequence, beginning with the down-stream end of the AluY-sequences in the human genome. The GPxIs show small portions of the upstream and downstream flanks of the various Alu-mutants. Note the appearance of poly-A stretches (= black pixel stretches) at the start of each down-stream flank of each Alu-sequence found by the search program.

The images display the GPxI of 100 AluY-sequences in human chr.1 obtained by the search program using the AluY search primer (size 200) and tolerating
  • (a) up to 5 base substitutions.
  • (b) up to 25 base substitutions.
  • (c) up to 100 base substitutions.

    The universal frequency distribution of Alu-mutants.

    Applying the described search method individually to the human chromosomes 1 - 22 and X yielded 389,956 AluY mutants. If normalized for the same chromosome size, their frequency distributions were virtually identical for all chromosomes (Fig. 3a), as evidenced by the very small standard deviations between the values of different chromosomes (bars in Fig. 3). Similarly, the search program found 171,066 AluS mutants and 172,240 AluJ mutants in human chromosomes 1 - 7. Although the average distribution curves differed characteristically between the members of the Alu family, different chromosomes again yielded surprisingly similar distribution curves for each member (Fig. 3b, c). The distribution curve of the AluJ mutants consisted almost exclusively of heavily mutated elements, confirming that the AluJ-elements are the oldest of the family.

    Fig.3. The remarkably high degree of reproducibility of the mutant distributions of Alu elements in the human genome.
    Bars indicate the standard deviations of the values of the tested number of chromosomes. Abscissa: number n of base substitutions of the Alu-mutants; ordinate: Frequency of the various Alu-mutants with exactly n base substitutions, normalized to a maximum amplitude of 100.

  • (a) Distribution of AluY-mutants (averaged over chromosomes 1 - 22, and X)
  • (b) Distribution of AluS-mutants (averaged over chromosomes 1 - 7)
  • (c) Distribution of AluJ-mutants (averaged over chromosomes 1 - 7)

    Another surprising feature was the appearance of multiple peaks in the distributions, suggesting that there had been several waves of increased replication in the evolutionary past of the Alu-elements.
    In all cases, the frequency distributions showed a pronounced dominance of Alu-mutants with 50 or more base substitutions over Alu-elements that contained fewer than 50 mutations. If large numbers of base substitutions are equated with large evolutionary age, this suggests that most Alu-elements in the human genome are quite 'old'.
    Testing chimpanzee chromosome 1 yielded the same distributions as the human chromosomes.
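
The distributions of Fig. 3 could be tabulated roughly as follows. This is a sketch, not the procedure actually used: the function names are hypothetical, each chromosome is assumed to be represented by the list of substitution counts reported by the search, only the normalization to a maximum amplitude of 100 is implemented (not the normalization for chromosome size), and the population standard deviation is an assumption, since the text does not say which estimator the bars represent.

```python
from collections import Counter
from statistics import mean, pstdev


def mutant_distribution(substitution_counts, n_max=100):
    """Frequency of mutants with exactly n substitutions, scaled to a peak of 100."""
    counts = Counter(substitution_counts)
    hist = [counts.get(n, 0) for n in range(n_max + 1)]
    peak = max(hist)
    if peak == 0:
        return [0.0] * (n_max + 1)
    return [100.0 * h / peak for h in hist]


def average_over_chromosomes(distributions):
    """Mean and standard deviation of each bin across chromosomes."""
    columns = list(zip(*distributions))
    means = [mean(col) for col in columns]
    deviations = [pstdev(col) for col in columns]
    return means, deviations


if __name__ == "__main__":
    # Toy substitution counts for two imaginary chromosomes.
    chr_a = mutant_distribution([0, 0, 1, 2, 2, 2, 2], n_max=3)
    chr_b = mutant_distribution([0, 1, 1, 2, 2, 2, 2], n_max=3)
    print(average_over_chromosomes([chr_a, chr_b]))
```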

    Deciding which of 2 Alu-elements is more similar to the 'original', based on their mutant distributions.

    The search for mutated Alu-elements poses a fundamental question. How can we know whether the sequence AluY that we used as a search primer is the 'original' sequence? Why should not another mutant Alu-sequence AluYm with (say) m base substitutions be the 'original' while AluY was one of its m-fold mutants?

    To be sure, there is clear evidence that Alu sequences are part of the 7SL RNA gene of numerous species, including Drosophila m. and Xenopus l. However, among them only certain primates have processed it into a retro-transposon, whereas Xenopus and Drosophila have highly analogous 7SL RNA genes but no Alu-elements. Therefore, early in the evolution of these primates there must have been a mutation of the primate 7SL RNA gene, or an invasion from the 7SL RNA gene of another species, that laid the foundation of the Alu-elements as we know them today.

    Obviously, we can never decide whether a particular Alu-sequence is the 'true original' because the original may not even exist any more today. Therefore, many authors in the literature have placed the terms 'original' or 'source' sequence in inverted commas, as is done in the present article. Nevertheless, based on the set {M} of all Alu-mutants known today, it is quite possible to determine which of 2 Alu-mutants is more similar to the 'original'. Traditionally, the students of Alu-elements have solved the problem by detailed studies of homologies between domains of different Alu-sequences, which can determine which sequence pre-dates the other and, thus, identify the earliest among them as the most 'original'.

    The mutant distributions presented here offer another rather simple way to tell which of two Alu-mutants is more similar to the 'original' sequence. Consider the set {M} of mutants that all arose from a common original sequence Alu0 in the human or any other genome. Using Alu0 as a search primer will yield a specific mutant distribution A0[n] similar to the ones in Fig. 3. Now select one of the mutant Alu0-sequences X which, unbeknownst to you, differs from Alu0 by m base substitutions. Using X as a search primer will yield its mutant distribution AX[n] from the same set {M} of Alu-mutants.

    The comparison between the distributions A0[n] and AX[n] will show quite easily that the search primer Alu0 is more similar to the 'original' Alu-sequence than X by the following criteria: Compared to A0[n] the distribution AX[n] will be shifted towards large numbers of base substitutions while lacking mutants of X with 1, 2, 3,… and other small numbers of base substitutions.

    For example, Fig. 4 shows the mutant distributions obtained from certain AluY-mutants X(i) with i = 0, 15, 30, and 63 base substitutions, which were used as search primers on human chr.1. Clearly, the more base substitutions the search primer X(i) contained, the fewer mutants could be found that contained fewer than (say) 30 base substitutions.

    Explanation: The finding can be explained by the fact that {M} contains many mutants of Alu0 with 1, 2, 3,… and other small numbers of base substitutions, as the presence of such mutants is the definition of {M}. Hence, A0[n] will contain substantial numbers of mutants with fewer than 30 base substitutions.

    On the other hand, it is extremely unlikely to find in {M} a single sequence that could qualify as a mutant of (say) X(30) that contains an additional 1, 2, 3,… or other small number of base substitutions. Such sequences would have to be mutants of Alu0 with the exact same 30 base substitutions as X(30) in exactly the same positions, but contain 1, 2, 3,… additional ones in other (or the same) positions. The probability of finding such mutants is very small, since the number of all possible mutants with 30 base substitutions is on the order of 10^60, while {M} contains only a minuscule fraction of them, namely approximately 10^6. Hence, AX(30)[n] will contain almost no mutants with only 1, 2, 3, … base substitutions.
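
The size of this mutant space can be estimated with a short calculation. The sketch below assumes a counting convention that the text does not spell out: choose which k of the L positions are substituted and allow 3 alternative bases at each of them. The function name `count_mutants` is hypothetical. The exact exponent depends on the assumed sequence length and on the convention, so the point is the scale relative to the roughly 10^6 members of {M}, not a reproduction of the figure quoted above.

```python
import math


def count_mutants(length, k):
    """Number of distinct sequences differing from a reference in exactly k positions.

    Assumption: C(length, k) choices of positions times 3 alternative bases each.
    """
    return math.comb(length, k) * 3 ** k


if __name__ == "__main__":
    for length in (209, 280):  # primer size and approximate full Alu length
        total = count_mutants(length, 30)
        print(f"L = {length}: about 10^{math.log10(total):.0f} possible 30-fold mutants")
```

Under either assumed length, the count exceeds the size of {M} by dozens of orders of magnitude, which is all the argument requires.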

    This reasoning was used earlier, in order to conclude that AluJ was much older than AluY because it contained almost no mutants with fewer than 30 base substitutions (Fig. 3c). Likewise, one can see immediately that the age of AluS is in between AluY and AluJ but more similar to AluY, as its mutant distribution has fewer such mutants than AluY but many more than AluJ.

    Most importantly, the above criteria can be used to determine the evolutionary age of any given Alu-mutant and, consequently, the minimal evolutionary age of the region of a chromosome where this particular Alu-mutant was found. After all, that region of the chromosome cannot be younger than the Alu-element that invaded it.

    Fig.4. Dependence of the mutant distribution on the number of initial base substitutions contained in its search primer.
    The search program using the different search primers allowed up to 100 base substitutions for each mutant. Abscissa: number n of base substitutions of the Alu-mutants; ordinate: absolute count of the various Alu-mutants with exactly n base substitutions.

  • (a) Mutant distributions resulting from the original AluY search primer with 0 (dark line) and 15 (gray line) additional base substitutions.
  • (b) Mutant distributions resulting from the original AluY search primer with 30 (dark line) and 63 (gray line) additional base substitutions. The more mutations the search primer has suffered, the more its mutant distribution curve loses mutants with low numbers of base substitutions and shifts to the right.

    A mathematical model of the dynamics of Alu mutations.

    The mathematical model makes very simple, common sense assumptions. They are explained in detail in ref 8. Specifically, it proposes that the number of Alu-mutants A[n,R] that contain n base substitutions at any given time R
  • (a) increases through replication, although the probability of replication diminishes rapidly with n,
  • (b) increases because some of the (n-1)-fold mutants acquire one new base substitution proportional to the fraction of their un-mutated bases,
  • (c) decreases because some of the (n)-fold mutants acquire one new base substitution proportional to the fraction of their un-mutated bases.

    It expresses these increases and decreases quantitatively as a function of A[n,R] and a certain time interval δR in which they occurred.
    In addition, it assumes that there were several episodes of new bursts of fully proliferative Alu-elements ('seedings') in the evolutionary past of primate genomes. These rather minimal assumptions were sufficient to reproduce the details of the actual mutant distributions of AluY, AluS and AluJ to a high degree of accuracy. None of these assumptions expresses any chromosome specificity. Therefore, the mathematical model explains one of the main findings of this study, namely the remarkable similarity between the Alu-mutant distributions of different chromosomes.
    A major attraction of the mathematical model is the possibility of reconstructing the past mutant distributions and predicting their future (Figure 5). (See also the animated versions of the reconstruction of the experimental Alu-mutant distributions, AluY and AluJ; the thick lines show the present-day distributions of the mutations.) Of course, the model cannot predict whether and when future seedings may occur.
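
To make assumptions (a) to (c) and the seedings concrete, here is a minimal numerical sketch of such a recursion. It is not the model of ref 8: the exponential form of the replication term, the parameter names `r0`, `decay` and `mu`, and all numerical values are hypothetical placeholders chosen only so that the code runs.

```python
import math


def step(A, length=280, r0=0.05, decay=10.0, mu=0.3, dR=1.0):
    """Advance the mutant spectrum A[n] by one round of recursion (interval dR)."""
    new = list(A)
    for n, a in enumerate(A):
        # (a) replication, with a probability that drops rapidly with n (assumed form)
        replication = r0 * math.exp(-n / decay) * a
        # (b) gain from (n-1)-fold mutants, proportional to their un-mutated fraction
        gain = mu * (length - (n - 1)) / length * A[n - 1] if n > 0 else 0.0
        # (c) loss of n-fold mutants that acquire one more substitution
        loss = mu * (length - n) / length * a
        new[n] = a + dR * (replication + gain - loss)
    return new


def simulate(rounds, seedings, length=280, **params):
    """Run the recursion; `seedings` maps a round number to newly seeded un-mutated copies."""
    A = [0.0] * (length + 1)
    for r in range(rounds):
        A[0] += seedings.get(r, 0.0)
        A = step(A, length=length, **params)
    return A


if __name__ == "__main__":
    spectrum = simulate(rounds=240, seedings={0: 1000.0, 120: 500.0})
    peak = max(range(len(spectrum)), key=spectrum.__getitem__)
    print("most frequent number of substitutions:", peak)
```

Nothing in `step` refers to a chromosome, which mirrors the argument that the resulting distributions should be chromosome-independent. With replication switched off (`r0=0`), the terms (b) and (c) merely move elements from n-1 to n, so the total number of elements stays constant.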

    Fig.5. The time development of the distribution of the AluY-mutants in the human genome, reconstructed and predicted by the calibrated mathematical model. The panel marked 'present' matches the AluY mutant distribution of Fig. 3a quite well. Watch also the animation of the evolution of the AluY and the AluJ mutations (the thick lines show the present-day distributions of the mutations).

    The calibration of the evolutionary age of Alu-elements.

    In practice, the mathematical model calculates the new values of the number of Alu-mutants A[n,R] that contain n base substitutions at any given time R (i.e. the spectrum of Alu-mutants at time R) after each time interval δR, which was assumed to have constant length. Subsequently, it uses these new values to calculate the next set of values, i.e. the spectrum at time R+δR. In this way, it develops the spectrum recursively, round by round, from its beginning into its future, each round of recursion corresponding to one unit of time δR.

    There is no evidence that the actual evolutionary time intervals δT that correspond to a unit δR of the mathematical time always had the same value. It is quite possible that, at some times in evolution, the progression towards the next computed spectrum of Alu-mutants took much longer than at others. However, we are making here the same assumption as the field in general, namely that the rate of mutations is approximately constant, i.e.

    δT = τ δR, and, consequently,

    T = τ R

    with a calibration constant τ. As explained in detail in ref 8, the generally accepted time of about 60 million years ago for the first appearance of Alu-elements in primates yields a value of τ = 250,000 years/recursion.
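
As a worked example of this calibration, the conversion between rounds of recursion and years is a single multiplication. The sketch below only restates T = τ R with the value of τ given above; the function names are hypothetical.

```python
TAU_YEARS_PER_ROUND = 250_000  # calibration constant tau from the text


def rounds_to_years(rounds):
    """Evolutionary time T corresponding to R rounds of recursion (T = tau * R)."""
    return TAU_YEARS_PER_ROUND * rounds


def years_to_rounds(years):
    """Number of recursion rounds R corresponding to an evolutionary time T."""
    return years / TAU_YEARS_PER_ROUND


if __name__ == "__main__":
    # 60 million years since the first Alu-elements correspond to 240 rounds.
    print(years_to_rounds(60_000_000))
```

In other words, with this value of τ the whole history of the Alu-elements spans 240 rounds of recursion.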

    Using this calibration to determine the times at which new bursts ('seedings') of fully proliferative Alu-elements appeared yielded the results shown in Figure 6. It turned out that these times coincided quite well with the appearance of new sub-families (see ref 8).

    Fig.6. Timing and relative magnitude of the various seedings of Alu-elements in the evolutionary past of the AluY, AluS, and AluJ families reconstructed by the mathematical model.

    Significance for the "functional anarchy" of genomes:

    It is conceivable that the numerous base substitutions of the Alu-elements effectively reduced or even inhibited the proliferation of these retro-transposons which, otherwise, might have fragmented and destroyed the host genome. The situation is quite remarkable because the Alu-elements are less than 300 bases long and, yet, a majority of them contain 60 or more point mutations. How did the genomes manage to aim such a barrage of point mutations specifically at the Alu-elements and not at all other 300-base-long segments of the genome?
    The obvious answer is that they did not need to. The safety net of natural selection guaranteed that the Alu-elements accumulated many more base substitutions than did the vital parts of their host genome. After all, if a point mutation damaged a vital part of the genome, that organism and its genome were eventually eliminated by natural selection. In contrast, if a point mutation hit an Alu-element, it even helped the survival of that genome, because it crippled a dangerous invader even more. Hence, selection favoured genomes which concentrated base substitutions in the Alu-sequences.

    Therefore, it seems that the massive base substitutions in Alu-elements may represent a case where 'reckless' point mutations were not only tolerated and tamed, but even became a life-saving defense against the 'reckless' attacks of another genome hazard, namely retro-transpositions.
