The explicit identification of a fractal set of genome sequences. The existence of fractal sets of DNA sequences has long been suspected on the basis of statistical analyses of genome data. Here we explicitly identify, for the first time, the GA-sequences as a class of fractal genomic sequences that are easy to recognize and to extract, and that are scattered densely throughout the chromosomes of a large number of genomes from different species and kingdoms, including the human genome. Their existence and their fractality may have significant consequences for our understanding of the origin and evolution of genomes. Furthermore, as universal and natural markers they may be used to chart and explore the non-coding regions.
General background. Fractals are very well known in many fields. Ranging from clouds and mountain ranges to the floret organization of 'Queen Anne's Lace' and the packaging of DNA into chromatin, fractals have been found throughout nature. Their hallmark is the self-similarity of patterns across large ranges of size, and their creation is often based on certain recursive mechanisms.
Intuitively one would not expect that there is enough room for fractal sequence patterns in the dense information coding of a genome. Still, as demonstrated here, they not only exist, but they are fundamental elements of every genome. Of course, this claim refers merely to the architecture of genomes, not to their content: Different genomes contain different fractal sets.
It could be most revealing to know whether the fractal sequence sets of different species are related to each other or whether they share species-independent properties. Changes in their fractality may be biologically significant. For example, in the case of the intervals between heartbeats it has been shown not only that they are fractal in nature, but also that their fractal properties are altered in cases of heart disease. Would the fractality of the human genome sequence sets be altered in a similar way in cases of genome-related diseases? Unfortunately, such questions could not be answered because, so far, no fractal sets of genome sequences had been explicitly identified. The present chapter tries to close this gap by explicitly identifying, for the first time, a class of fractal genomic sequences that are scattered throughout the chromosomes of a large number of genomes from different species and kingdoms, including the human genome. They are easy to recognize and to extract from the genomes.
The existence of such large, stochastically extremely unlikely, yet universal sets of fractal sequences is quite startling and is likely to have an impact on our efforts to understand the origin of genomes and, as alluded to earlier, possibly also on our diagnosis of genome-related diseases. It also seems important for our exploration and understanding of the structure of non-coding regions, as it offers a set of universal natural markers that are spread all over genomes. It may also be relevant for the mechanisms of transferring genomic information between species, and for the evolution of genomes and organisms. In order to identify, describe and analyze fractal genomic sequences, this chapter will initially focus on the specific example of pure GA-sequences and demonstrate their fractality. The results will lend themselves easily to a more general understanding of other similar kinds of fractal sequence types.
The expansion of the definition of GA-sequences. The previous chapters defined GA-sequences as genome sequences that consist exclusively of G's and A's and are larger than 50 bp. The following drops the latter condition and allows any sequence length, including the trivial length of 1.
This more general definition turned out to be the key to finding the fractal character of the GA-sequences. In addition, it simplified the architecture of genomes into a set of 2 classes of interlaced sequences, namely GA-sequences and their intervals. By definition, the intervals between GA-sequences contain neither G's nor A's and, therefore, they may contain only T's or C's. In other words, the intervals are nothing but the pure TC-sequences.
Thus, every genome can be considered as the interlacing of the 2 fundamental sets of sequences, namely the GA-sequences and their intervals, the TC-sequences. We will show later that both of these 2 sets of sequences are fractal sets.
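The decomposition described above can be illustrated with a minimal Python sketch. It is not the extraction software used for this chapter; the function names (base_class, split_ga_tc, size_distribution) are hypothetical, and the decision to let any character other than A, C, G or T (for example N) simply terminate a run is an assumption made here for illustration.

```python
from collections import Counter
from itertools import groupby


def base_class(base):
    """Classify one base as 'GA', 'TC', or None (e.g. N or other codes)."""
    if base in ("G", "A"):
        return "GA"
    if base in ("T", "C"):
        return "TC"
    return None


def split_ga_tc(sequence):
    """Split a strand into maximal GA-runs and maximal TC-runs (the intervals)."""
    ga_runs, tc_runs = [], []
    for kind, group in groupby(sequence.upper(), key=base_class):
        run = "".join(group)
        if kind == "GA":
            ga_runs.append(run)
        elif kind == "TC":
            tc_runs.append(run)
        # Assumption: unknown bases are dropped and merely break the runs.
    return ga_runs, tc_runs


def size_distribution(runs):
    """Count how many runs exist for each run length."""
    return Counter(len(run) for run in runs)


ga_runs, tc_runs = split_ga_tc("GGATCCTAGNAC")
print(ga_runs)  # ['GGA', 'AG', 'A']
print(tc_runs)  # ['TCCT', 'C']
```

Applied to a whole chromosome, size_distribution(ga_runs) and size_distribution(tc_runs) would give the segment and interval counts that the following sections discuss.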
The universal power law behavior of the GA-sequences. A necessary condition for any set of sequences to be fractal is that it must obey a power law, although several types are acceptable. For example, it could follow an exponential distribution of the type $N(s,\lambda) = A\,e^{-\lambda s}$ ($\lambda > 0$; example: radioactive decay), or a Pareto distribution of the type $N(s,\lambda) = A\,s^{-\lambda}$ ($\lambda > 0$; example: the distribution of the sizes of grains of sand at a beach), or others. The exponent λ of the power law, or an appropriately defined equivalent of λ, is related to the so-called 'Hausdorff dimension'. In more recent times Benoit Mandelbrot called it the 'fractal dimension'.
The GA-sequences of various genomes were extracted as described in previous chapters, and the distributions of their sizes and intervals were obtained by straightforward counting. Logarithmic plots of the distributions of human GA-sequences turned out to resemble an exponential distribution in some aspects, but most closely a Pareto distribution (Fig. 1a). This resemblance became particularly obvious when the data were plotted in a logarithm/logarithm graph, where a true Pareto distribution would yield a straight line (see Fig. 1b, line). As illustrated in Fig. 1b, the data from human chromosome 5 closely resembled a straight line with a slope of σ = -2.9. The corresponding fractal dimension is fD = -(σ+1) = 1.9.
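As an illustration of how such a slope can be estimated, the following Python sketch fits a least-squares line to log10(count) versus log10(size) and applies the relation fD = -(σ+1) quoted above. The fitting procedure is an assumption (the text does not state how the slope was obtained), the function names are hypothetical, and the input is a synthetic Pareto distribution with exponent -2.9, not the chromosome 5 data.

```python
import math


def loglog_slope(sizes, counts):
    """Least-squares slope of log10(count) against log10(size)."""
    points = [(math.log10(s), math.log10(n))
              for s, n in zip(sizes, counts) if s > 0 and n > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def fractal_dimension(slope):
    """Relation used in the text: fD = -(sigma + 1)."""
    return -(slope + 1.0)


# Synthetic stand-in data: an ideal Pareto distribution N(s) = A * s**-2.9.
sizes = list(range(4, 1001))
counts = [1e8 * s ** -2.9 for s in sizes]

sigma = loglog_slope(sizes, counts)
print(round(sigma, 2), round(fractal_dimension(sigma), 2))  # -2.9 1.9
```

Real count data are noisy at large sizes, where only a few sequences fall into each size class, so a fit of this kind would normally be restricted to the range in which the points visibly follow a straight line.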
Fig.1. The similarity between the size distributions of GA-sequences and a Pareto-distribution.
a. Comparison between the size distributions of the GA-sequences of human chr.5 (marked y), an exponential distribution (marked x; exponent: -0.3/bp; max. number: $10^7$), and a Pareto distribution (marked z; exponent: -2.9; max. number: $10^8$) in logarithm/linear plots.
b. Logarithm/logarithm plot of the size distributions of the GA-sequences of human chr.5 (marked with data points) and a Pareto distribution with exponent -2.9 and a max. number of $10^8$ (straight line), showing their close resemblance.
In the case of ideal fractals the sizes should vary between 0 and infinity. Of course, fractals in the real world cover only finite ranges of sizes. As seen in Figure 1, the sizes of GA-sequences selected here for analysis ranged between 4 and approximately 1000, thus covering almost 3 orders of magnitude which qualified them as potential fractals in a practical sense, though not in a rigorously mathematical sense.
Very similar results were obtained when we analyzed the pure GA-sequences of many other genomes, including the human chromosomes 1 - 7, rhesus chr.1, mouse chr.1, dog chr.1, cat chr.3, chicken chr.1, zebrafish chr.1, Xenopus l. chr.0, Anopheles, Drosophila melanogaster chr.X, maize (Zea mays), poplar chr.1, and the entire genomes of C. elegans and C. briggsae. A surprising result of this analysis was the finding that not only did the sizes of the GA-sequences resemble a Pareto distribution, but the intervals between consecutive GA-sequences also obeyed the identical power law (see Fig. 2).
Fig.2. Universal characteristics of the fractal GA-sequences of a wide range of species and kingdoms.
Common logarithm/logarithm plots of the numbers of GA-segments (panel a) and intervals (panel b) of the various genomes as functions of their sizes. Abscissa: segment (interval) sizes; ordinate: numbers of segments (intervals). The data of the different genomes were normalized to the same total genome size of 100 Mbp. Depicted are the data from human chr.1, chr.2, chr.3, chr.4, chr.5, chr.6, chr.7, rhesus chr.1, mouse chr.1, dog chr.1, cat chr.3, chicken chr.1, zebrafish chr.1, Xenopus l. chr.0, Anopheles, Drosophila melanogaster chr.X, maize (Zea mays), poplar chr.1, and the entire genomes of C. elegans and C. briggsae.
The 'Genome Pixel Images' (GPxI) of GA-sequences. The described power law behavior is not sufficient to identify the GA-sequences as fractals. In order to do this, one must demonstrate that they form patterns that are scale invariant. In other words, the basic architecture of short GA-sequences should be the same as the architecture of very long ones. It is not necessary that the patterns remain rigorously identical for all sizes. As in the examples of coast lines or clouds, a certain irregularity is acceptable as long as the patterns repeat their essential characteristics at all sizes. In order to test this requirement we turned to a method of depicting genome sequences graphically that we had introduced in an earlier chapter. The method is called 'Genome Pixel Images'. It turns DNA sequences into compact and detail-rich visual patterns by mapping each of the 4 bases onto one of 4 different gray-tone pixels and 'writing' the resulting pixel sequences like text, from left to right and top to bottom. The method uses black pixels for adenine and white ones for guanine. Cytosine and thymine are depicted as light and dark gray values. The method is both sensitive and intuitive, as it takes advantage of the exceptional ability of the human visual sense to detect patterns in images.
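The pixel mapping can be sketched in a few lines of Python. This is only an illustration of the principle, not the original GPxI software. Black for adenine and white for guanine follow the text; the specific gray levels for cytosine (192, light) and thymine (64, dark), the mid-gray padding value, and the function names are assumptions.

```python
# Gray values per base: A = black and G = white as described in the text.
# The exact gray levels for C (light) and T (dark) are assumed here.
GRAY = {"A": 0, "G": 255, "C": 192, "T": 64}


def genome_pixel_image(sequence, width, background=128):
    """Write the bases like text: left to right, top to bottom, `width` per row."""
    rows = []
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width].upper()
        row = [GRAY.get(base, background) for base in chunk]
        row += [background] * (width - len(row))  # pad the last row
        rows.append(row)
    return rows


def to_pgm(rows):
    """Serialize the pixel rows as a plain-text PGM (P2) grayscale image."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    lines = ["P2", f"{width} {height}", "255"]
    lines += [" ".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"


rows = genome_pixel_image("GATC", width=3)
print(rows)  # [[255, 0, 64], [192, 128, 128]]
```

The string returned by to_pgm(rows) can be written to a file with the extension .pgm and opened in most image viewers.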
The 'Lα-order' of the GA-sequences of actual chromosomes. Obviously, the GPxI patterns of any set of sequences are determined in the horizontal direction by the actual nucleotide sequences involved. In contrast, the patterns are arbitrary in the vertical direction, depending on the specific order in which the individual sequences were placed. The following order of the pixel images of GA-sequences will be used throughout this article ('Lα-order').
The 'baseline' GPxI pattern of Lα-ordered, hypothetical GA-sequences. As a 'baseline' pattern with which to compare the experimental patterns of equally ordered, actual genomic GA-sequences, we constructed the GPxI patterns of hypothetical, Lα-ordered, random, computer-generated GA-sequences. For simplicity's sake their sizes were constructed to follow an exponential distribution, unlike the real GA-sequences, which were more similar to a Pareto distribution.
Figure 3a shows such an exponential distribution. Translating it into a histogram of the actual counts of GA-sequences with a certain size s yielded a graph similar to Figure 3b. It showed, for example, that this hypothetical genome contained 4 GA-sequences of size 76, 13 sequences of size 44, 32 sequences of size 20, etc. Creating the GPxI of the Lα-ordered GA-sequences meant turning each of the four 76 bp long sequences into its pixel image and writing them left-adjusted underneath each other in the order dictated by the Lα-order. Subsequently, one could do the same with the five 72 bp long sequences, …, the thirteen 44 bp long sequences, the thirty-two 20 bp long sequences, and so forth. The result is shown in Figure 3c.
The most prominent features of the plot are the right-hand outlines of the pixel images, which follow a curve in distinct steps of size. The curved outline is obviously a reflection of the exponential size distribution. Meant only as an illustration, the numbers of the hypothetical GA-sequences were chosen far lower than the numbers found in actual genomes. Also, the amplitudes of the size steps were greatly exaggerated in order to show more clearly that the plot generates 'blocks' of pixel images, each of which contains all GA-sequences of the same size (Figure 3c, label 'block'). Since the GA-sequences followed an exponential distribution, the number of GA-sequences of a certain size increased exponentially with decreasing size, so that the shorter the blocks were in the horizontal direction, the longer they were in the vertical direction.
It is obvious from Figure 3c that this set of random GA-sequences showed no specific scale-invariant pattern. Therefore, it illustrates the above remark that compliance with a power law is not sufficient to qualify a set of objects as fractals.
Fig.3. The GPxI of a hypothetical, Lα-ordered set of random GA-sequences with an exponential size distribution
Subsequently, the GPxI's of each size block were ordered logically a second time, with G (= white pixels) ranking before A (= black pixels). The effect of the logical ordering of each block is noticeable only at the left side of each block, which depicts the typically very short runs of poly-A or poly-G. Otherwise no specific pattern is detectable within a block, as is expected from a set of random GA-sequences.
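The construction of such a baseline set can be sketched as follows. The ordering function is only a stand-in inferred from the description of Figure 3 (blocks of equal size from the longest to the shortest sequences, and within each block G ranking before A at every position); it may differ from the exact Lα-order. The rate parameter, the rounding of sizes to whole base pairs, and all function names are likewise assumptions.

```python
import random
from itertools import groupby


def random_ga_sequences(count, rate, rng):
    """Random G/A strings whose sizes follow an exponential distribution."""
    sequences = []
    for _ in range(count):
        size = max(1, round(rng.expovariate(rate)))  # at least 1 bp
        sequences.append("".join(rng.choice("GA") for _ in range(size)))
    return sequences


def size_then_g_first(sequences):
    """Assumed ordering: longest first, then G before A position by position."""
    rank = {"G": 0, "A": 1}
    return sorted(sequences, key=lambda s: (-len(s), [rank[b] for b in s]))


def blocks(ordered):
    """Group consecutive sequences of equal size into 'blocks'."""
    return [(size, list(group)) for size, group in groupby(ordered, key=len)]


print(size_then_g_first(["AG", "GA", "GGA", "A", "G"]))
# ['GGA', 'GA', 'AG', 'G', 'A']

rng = random.Random(0)
baseline = size_then_g_first(random_ga_sequences(500, rate=0.1, rng=rng))
for size, members in blocks(baseline)[:3]:
    print(size, len(members))
```

Each row of a block would then be rendered left-adjusted as one line of a pixel image, which produces the stepped right-hand outline described for Figure 3c.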
The fractal GPxI pattern of the GA-sequences of whole chromosomes. The result was dramatically different when we applied the same protocol to all GA-sequences from human chromosome 1 (Figure 4) or any other real chromosome. As expected, the resulting GPxI revealed a series of blocks that increased in height as they decreased in width, while creating an exponentially curved outline with their corners.
Entirely unexpected, however, was the finding that the visual pattern of every block was strikingly similar to that of its neighboring blocks. It appeared, therefore, that the sequence architecture that had generated the GA-sequences of each block was common to all of them, and that the blocks were self-similar elements of the set of GA-sequences over a large range of block sizes.
Fig.4. The GPxI of the GA-sequences of human chr.1. Application of the method described in Figure 3 to the largest 5200 actual GA-sequences of human chr.1. The index numbers 1 - 5230 of the GA-sequences are noted by the scale to the left side of the images. One of the blocks is marked with a double arrow. For space reasons, all blocks beyond index 5230 were omitted, as they were far too long.
A characteristic feature of the scale invariance of fractal patterns is the appearance of sub-domains that express a pattern similar to that of the whole. In the case of the blocks of GA-sequences this feature is most obvious in very large blocks. Figure 5 shows the example of the large block consisting of the 1843 ordered GA-sequences of human chr.1 of size = 26 bp. (Note: This block and many similarly large ones were omitted from Figure 4 because of their excessive sizes.) One can recognize numerous typical sub-block patterns nested within the single block shown in the illustration, and sub-sub-blocks nested within sub-blocks.
Fig.5. The appearance of sub-block patterns nested within single large blocks.
The figure shows the GPxI of the single large block of the 1843 GA-sequences of human Chr.1 which are 26 bp long. The block is split into 2 halves. Dotted arrows indicate a sub-block and a sub-sub-block with similar patterns as the whole block.
The characteristics of block patterns. According to the Lα-order, the vertical pixel line forming the left-most edge of each block begins with an uninterrupted white line (i.e. the first base of these GA-sequences is a G), which is followed by an uninterrupted black line (i.e. the first base of these GA-sequences is an A). This criterion was applied to identify the beginning and end of each block. As pointed out earlier, very few GA-sequences of actual genomes were identical to each other. Therefore, the patterns of different blocks were never rigorously identical if one compared them pixel by pixel. Nevertheless, when the block patterns are viewed at a lower resolution of detail, their striking similarity despite the irregularities in their details may justify defining an irregularly self-similar block pattern, common to the different block sizes of the GA-sequences of the genome in question, by a list of its specific characteristics. One of the typical blocks marked in Figure 4 (white double arrow) is enlarged in Figure 6a, and its basic appearance is analyzed in Figure 6b. It displays a number of features that we found present in the GPxI's of the GA-sequences of all genomes analyzed.
Fig.6. Characteristics of block patterns of GA-sequences
The range of block sizes with similar patterns.
The similarity of the block patterns was particularly obvious in the cases of blocks of GA-sequences that were shorter than 50 bp. In contrast, the blocks with longer GA-sequences contained too few members to identify a specific pattern. Nevertheless, the pixel patterns of the few members of these blocks were consistent with the much more obvious patterns of the larger blocks. In order to compare the patterns of blocks of different sizes, we applied a method to standardize the scales. Using this method, we found that in the case of the human genome the sizes of blocks with similar patterns ranged between 4 and 3900, covering almost 3 orders of magnitude. The total range of block sizes can actually be increased considerably. For example, the present study restricted the size of the GA-sequences to values larger than 25 bp. If one allows GA-sequences as short as 10 bp, according to their Pareto distribution the block sizes may increase by another 2 orders of magnitude. The total range of block sizes also depends on the total number of GA-sequences of the genome in question. For example, the mouse genome contains about 5 times as many GA-sequences as the human genome. Accordingly, the range of block sizes of mouse GA-sequences may increase by another order of magnitude. Still, many naturally occurring fractals such as coast lines cover much larger ranges of sizes. Yet, we submit that even 3-5 orders of magnitude of self-similar patterns of the effectively Pareto-distributed GA-sequences appear sufficiently different from random patterns to deserve a classification as fractals. Please note that we are not claiming that the individual GA-sequences are fractals, but that the subsets of GA-sequences that have the same length are.
The pattern blocks of other genomes. The GA-sequences of the other human chromosomes yielded GPxI patterns quite similar to that of chromosome 1 (Figure 4). Therefore, we examined the patterns of other genomes from a variety of species and kingdoms such as maize, poplar, C. elegans, Ciona intestinalis, zebrafish, Xenopus laevis, chicken, rat, mouse, and Rhesus. They included plants, invertebrates and vertebrates at different levels of evolution. Similar to the patterns of human GA-sequences, they also yielded pattern blocks that were similar to each other. It appeared, therefore, that the general architecture of the GA-sequences is universally a fractal similar to the one shown in Fig. 4. Consequently, instead of characterizing all the GA-sequences of a genome, it may be sufficient to describe only the typical block pattern of that genome's GA-sequences. As shown in Figure 7, these block patterns were generally different for different species. The most prominent differences between species were found in the extent of the poly-A domains and the often considerable width of the Type C GA-ladders.
Fig.7. Typical block patterns of GA-sequences from chromosomes of different species belonging even to different kingdoms
An attempt to explain the origin of the fractal set of GA-sequences. As shown earlier, it is utterly inconceivable that those members of the fractal set of GA-sequences that are many hundreds of nucleotides long arose by random concatenations of nucleotides. That leaves only 2 ways in which they could have arisen.
Using computer simulations to distinguish between these possibilities, we found that only the assumption of fragmentation of large, initial GA-sequences was able to yield the experimentally observed size distributions of fragments and intervals.
For simplicity's sake this method assumed a single, very large 'seed' GA-sequence whose fragmentation by inversions would give rise to the entire set of fractal GA-sequences. Similar results would be obtained if the initial conditions included several different large 'seed' sequences.
In order to simulate only the fractal size distribution of the GA-sequences no further specific assumption was necessary about the composition of G's and A's of the postulated initial 'seed' sequence. However, preliminary results of our attempts to simulate not only the size distribution as described here, but also the actual sequences of the GA-sequences in the human genome suggested that the initial 'seed' sequences may have been a large stretch of poly-GAGA.
This kind of large repetitive sequence is not a problematic assumption. On the contrary, most genomes sequenced to date were found to contain numerous large repetitive sequences composed of a variety of small motifs. They are located predominantly in the centromeric, but also in the telomeric regions of chromosomes. (In many cases they prevented the unambiguous formulation of contigs and forced the sequencers to replace them with large stretches of N's in the published sequences.) In addition, shorter repetitive sequences of codons have been described in exons and introns, which appear to be significant for the evolution and function of proteins.
In order to simplify the analysis, each strand of double-stranded DNA was depicted using only 2 symbols, 'gray' and 'black' (see Fig.8a), which represent either A's and G's or T's and C's. The computer simulation subjected these simplified depictions to a large number of recursive inversions. Subsequently, the size distributions of segments and intervals were obtained by counting the numbers of different 'gray' and 'black' segments as a function of their size (Fig.8b).
Fig.8. Schematic representation of genomes as interlacing of the GA-sequences with the TC-sequences (= intervals). All G's and A's are depicted in gray ('segments') and all T's and C's in black ('intervals').
Simulation of the segment- and interval distribution by recursive inversions. In the following we consider the effects of recursive inversions on a genome initially containing one very large 'seed' GA-sequence. Every inversion must exchange some 'blacks' with some 'grays' between the strands and increasingly equalize all the statistical aspects of 'blacks' and 'grays', not only between strands but also for each strand individually.
In order to simulate the process, we developed special software which subjected computer-constructed DNA strands to increasing numbers of inversions (random sizes <= a limit) while evaluating the emerging segment- and interval-distributions. As human chr.1 is 2,380 times larger than the simulated genomes (100,000 [bp]), the simulated segment- and interval-counts were multiplied by this normalization factor.
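The principle of such a simulation can be sketched in Python as follows. This is not the software mentioned above but a minimal illustration under stated assumptions: only one strand is tracked (the other strand is its complement), the 'seed' GA-segment is placed at the start of the strand and everything else is taken to be TC, and inversion positions and sizes are drawn uniformly at random. An inversion replaces a stretch by its reverse complement; because the complement of a purine is a pyrimidine, the stretch is reversed and every 'gray' becomes 'black' and vice versa. The default parameters are the values quoted in the caption of Fig.9; all function names are hypothetical.

```python
import random
from collections import Counter
from itertools import groupby


def make_seed_genome(genome_size, seed_size):
    """One strand: 1 = 'gray' (G or A), 0 = 'black' (T or C). Layout is assumed."""
    return [1] * seed_size + [0] * (genome_size - seed_size)


def invert(strand, start, length):
    """Reverse-complement a stretch in place: reverse it and swap gray/black."""
    stretch = strand[start:start + length]
    strand[start:start + length] = [1 - symbol for symbol in reversed(stretch)]


def run_length_counts(strand):
    """Count 'gray' runs (segments) and 'black' runs (intervals) by size."""
    segments, intervals = Counter(), Counter()
    for symbol, group in groupby(strand):
        size = sum(1 for _ in group)
        (segments if symbol == 1 else intervals)[size] += 1
    return segments, intervals


def simulate(genome_size=100_000, seed_size=20_000, n_inversions=20_000,
             max_inversion=1_000, rng_seed=0):
    """Apply random inversions (max_inversion <= genome_size) and count runs."""
    rng = random.Random(rng_seed)
    strand = make_seed_genome(genome_size, seed_size)
    for _ in range(n_inversions):
        length = rng.randint(1, max_inversion)
        start = rng.randrange(genome_size - length + 1)
        invert(strand, start, length)
    return run_length_counts(strand)


def normalize(counts, factor=2380):
    """Scale simulated counts to the size of human chr.1 (factor from the text)."""
    return {size: number * factor for size, number in counts.items()}
```

Calling simulate() with the default values and plotting normalize(segments) and normalize(intervals) on logarithmic axes would give curves of the kind discussed below; smaller values of n_inversions correspond to the earlier stages of the fragmentation.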
Fig.9 shows a typical simulation. Although initially complex and unstructured, the distributions of segments and intervals became almost identical to each other with increasing numbers of inversions (Fig.9b). At the same time the quality of the match between the simulated data and the data of the human genome (depicted as black square marks) increased to a high level of accuracy, which was reached at an optimal number of inversions Nm of approximately 20,000. For N > Nm the simulations deviated again from the human genome data, as the distribution curves of segments and intervals became steeper and shifted towards smaller segment sizes, while nevertheless remaining almost identical to each other.
Fig.9. Simulation of the fragmentation by inversions of an initially large, continuous GA-sequence into segments (white lines) and intervals (black lines) that match precisely the segment distribution of human chr.1 (square marks). Abscissa: segment (interval) sizes; ordinate: numbers of segments (intervals). Total simulated genome size: 100,000 [bp]. Initial unfragmented GA-segment: 20,000 [bp] (continuous segment). Size of the inversions: random <= 1000 [bp]. The panels show the log-log plot of the simulated segment distribution between 1000 (panel a) and 20,000 (panel b) randomly located inversions. (Note: Starting the simulations with one large, continuous initial GA-segment was not necessary to create fractal GA-sequences, but only to match the human data above segment sizes of 30 [bp].)
Fig.10. Animated display of the above simulation.
Alternative simulations. Instead of an initial "seed" genome configuration that consisted of a single large continuous GA-sequence, we also used large numbers of smaller, randomly sized and randomly located GA-sequences as "seeds". They generated equally well fitting distributions.
Alternatively, we used initial genomes that were created as a Markov chain of nucleotides. These simulations yielded excellent matches with the actual human genome data, but only for the small GA-sequences below sizes of 30 [bp]. They deviated considerably at larger segment and interval sizes. Replacing the inversions with other kinds of mutations, including insertions, deletions, and even hypothetical mechanisms such as growth and aggregation of small GA-motifs, changed the results of the simulations dramatically. In particular, none of these mutations was able to generate segment and interval distributions that were even similar to each other, let alone identical as in the case of inversions.
Simulation of the self-similarity patterns of the fractal GA-sequences. As illustrated in Fig.8, the simulation of the size distribution of the GA-sequences did not need to differentiate between G's and A's. It was only necessary to measure the lengths of the many fragments into which the many inversions had splintered the initially continuous 'seed' GA-sequence.
In contrast, the simulation of the self-similar sequence patterns of the fractal GA-sequences of actual genomes must, of course, offer a model to illustrate how these specific sequences may have arisen from the fragmentation of the initial GA-sequence(s).
Using the same simulation parameters as above (i.e. model genome size: 100 kbp; "seed" GA-sequence size: 50 kbp; 20,000 inversions with random sizes <= 1 kbp), we found that the following additional rules were able to generate sequence patterns quite similar to the ones observed in actual genomes.
Figure 11 shows an example of the GPxI pattern of a simulated genome. The high degree of similarity with the patterns of human chromosome 1 (Fig. 4) is rather striking.
Fig.11. Computer simulation of the sequence patterns resulting from the fragmentation by inversions of an initially large, continuous poly-GAGA sequence (50% of the model genome) into segments, using the above additional rules to convert the GA-motifs.
Speculations about the evolutionary and biochemical mechanisms involved. The striking similarity between the simulated and the actual sequence patterns suggests that one can find rather simple rules for the processing of an assumed simple "ur"-genome (a large poly-GAGA segment connected to an equally long poly-TC segment) which can yield the complex fractal sequence patterns of actual genomes. Obviously, there may be several equally effective sets of rules, although our efforts to find any such rules at all have convinced us that there are probably not many.
Of course, the rules must not only be simple. One should also be able to translate them into conceivable evolutionary and biochemical mechanisms. As to the rules presented here, they require no more complex mechanisms than inversions followed by the insertion, replacement or interconversion of purine bases. Such mechanisms are known to exist, but, of course, there is no experimental evidence yet that they occurred on the massive scale postulated by the presented model.
The following figures are to illustrate the basic assumptions and methods used above to simulate the fractal sequence patterns of Fig.11.
Fig.12. Illustration of a conceivable way to create an "ur"-genome that consists of large stretches of poly-GAGA motifs which add up to 50%.
Assume a starting poly-GAGA sequence that is paired with a matching poly-TC sequence. If a large primordial temperature fluctuation melts the strands, they may afterwards concatenate and eventually generate a matching complementary strand pair provided that the ambient temperature returned to a lower value. The resulting double stranded "ur"-genome which now contains 50% GA-motifs in a long contiguous stretch may repeat the same procedure and so forth. Thus an "ur"-genome may result that contains many long stretches of GA-motifs which add up to exactly 50%.
Fig.13. Illustration of the purine conversion required by Method I
Upper panel: Conversion of both G's of a GAGA motif to A's of a random number of contiguous upstream GAGA motifs (w/wo frameshift). Lower panel: Conversion of one or both G's into A's of the remaining GAGA-motifs (w/wo frameshift) of the same GA-segment.
Fig.14. Illustration of the purine conversion required by Method II
Random conversion of one or both G's of a GAGA motif to A's or one or both A's of a GAGA-motif to G's.