__ The explicit identification of a fractal set of genome sequences.__

Intuitively one would not expect that there is enough room for fractal sequence patterns in the dense information coding of a genome. Still, as demonstrated here, they not only exist, but they are fundamental elements of every genome. Of course, this claim refers merely to the architecture of genomes, not to their content: Different genomes contain different fractal sets.

It could be most revealing to know whether the fractal sequence sets of different species are related to each other or whether they share species-independent properties. Changes in their fractality may be biologically significant. For example, in the case of the intervals between heartbeats it has been shown not only that they are fractal in nature, but also that their fractal properties are altered in cases of heart disease. Would the fractality of the human genome sequence sets be altered in a similar way in cases of genome-related diseases? Unfortunately, such questions could not be answered because, so far, no fractal sets of genome sequences had been explicitly identified. The present chapter tries to close this gap: it explicitly identifies, for the first time, a class of fractal genomic sequences that are scattered throughout the chromosomes of a large number of genomes from different species and kingdoms, including the human genome. They are easy to recognize and extract from the genomes.

The existence of such large, stochastically extremely unlikely, yet universal sets of fractal sequences is quite startling and is likely to have an impact on our efforts to understand the origin of genomes and, as alluded to earlier, possibly also on our diagnosis of genome-related diseases. It also seems important for our exploration and understanding of the structure of non-coding regions, as it offers a set of universal natural markers that are spread all across genomes. It may also be relevant for the mechanisms of transferring genomic information between species, and for the evolution of genomes and organisms. In order to identify, describe and analyze fractal genomic sequences, this chapter will initially focus on the specific example of pure GA-sequences and demonstrate their fractality. The results will lend themselves easily to a more general understanding of other similar kinds of fractal sequence types.

__ The expansion of the definition of GA-sequences.__

This more general definition turned out to be the key to finding the fractal character of the GA-sequences. In addition, it simplified the architecture of genomes into a set of 2 classes of interlaced sequences, namely **GA-sequences and their intervals.** By definition, the intervals between GA-sequences contain neither G's nor A's and, therefore, they may contain only T's or C's. In other words, the intervals are nothing but the pure TC-sequences.

*Thus, every genome can be considered as the interlacing of the 2 fundamental sets of sequences, namely the GA-sequences
and their intervals, the TC-sequences. We will show later that both of these 2 sets of sequences are fractal sets.*
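The interlacing described above can be illustrated with a minimal sketch. It is not the extraction software used for this study; the function name `split_ga_tc` and the example strand are hypothetical, and the sketch assumes that any other symbol (such as N) simply ends a run.

```python
import re

def split_ga_tc(seq):
    """Split one DNA strand into its pure GA-runs and pure TC-runs.

    Assumption: characters other than G, A, T, C (e.g. N) end a run
    and are ignored.
    """
    seq = seq.upper()
    ga_runs = [m.group() for m in re.finditer(r"[GA]+", seq)]
    tc_runs = [m.group() for m in re.finditer(r"[TC]+", seq)]
    return ga_runs, tc_runs

# Hypothetical example strand: GGA | TCC | AGA | TTTC | GA
strand = "GGATCCAGATTTCGA"
ga_runs, tc_runs = split_ga_tc(strand)
print(ga_runs)  # GA-sequences
print(tc_runs)  # their intervals (TC-sequences)
```

Because every nucleotide is either a purine (G, A) or a pyrimidine (T, C), the lengths of the two lists of runs add up to the length of the strand whenever the strand contains only these four letters.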

__ The universal power law behavior of the GA-sequences.__

The GA-sequences of various genomes were extracted as described in previous chapters and the distributions of their sizes and intervals were obtained by straightforward counting. Logarithmic plots of the distributions of human GA-sequences turned out to resemble in some aspects an exponential distribution, but most closely a Pareto distribution (Fig. 1a). This resemblance became particularly obvious if the data were plotted in a logarithm/logarithm graph, where a true Pareto distribution would yield a straight line (see Fig. 1b line). As illustrated in Fig. 1b, the data from human chromosome 5 closely resembled a straight line with a slope of σ = -2.9. The corresponding fractal dimension is fD = -(σ+1) = 1.9.
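As a rough sketch of how such a slope could be estimated, the following fits a straight line to the logarithm/logarithm data by ordinary least squares and applies fD = -(σ+1). The fitting procedure is an assumption for illustration (the text does not state how the slope was obtained), and the data are synthetic values generated from an exact power law, not the chr.5 counts.

```python
import math

def loglog_slope(sizes, counts):
    """Least-squares slope of log10(count) versus log10(size)."""
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def fractal_dimension(sigma):
    """fD = -(sigma + 1), as defined in the text."""
    return -(sigma + 1)

# Synthetic counts following an exact power law with exponent -2.9
sizes = [25, 50, 100, 200, 400, 800]
counts = [1e8 * s ** -2.9 for s in sizes]

sigma = loglog_slope(sizes, counts)
fd = fractal_dimension(sigma)
print(round(sigma, 3), round(fd, 3))
```

With real counts the points scatter around the line, so the fitted slope depends on the size range included in the fit.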

**Fig.1. The similarity between the size distributions of GA-sequences and a Pareto-distribution.**

**a. Comparison between the size distributions of the GA-sequences of human chr.5 (marked y), an exponential distribution (marked x; exponent: -0.3/bp; max. number: 10^{7}), and a Pareto distribution (marked z; exponent: -2.9; max. number: 10^{8}) in logarithm/linear plots.**

**b. Logarithm/logarithm plot of the size distributions of the GA-sequences of human chr.5 (marked with data points) and a Pareto distribution with exponent -2.9 and a max. number of 10^{8} (straight line) showing their close resemblance.**

In the case of ideal fractals the sizes should vary between 0 and infinity. Of course, fractals in the real world cover only finite ranges of sizes. As seen in Figure 1, the sizes of GA-sequences selected here for analysis ranged between 4 and approximately 1000, thus covering almost 3 orders of magnitude which qualified them as potential fractals in a practical sense, though not in a rigorously mathematical sense.

Very similar results were obtained when we analyzed the pure GA-sequences of many other genomes including the human chromosomes 1 - 7, rhesus chr.1, mouse chr.1, dog chr.1, cat chr.3, chicken chr.1, zebrafish chr.1, xenopus l. chr.0, anopheles, drosophila melanogaster chr. X, zea maize, poplar chr.1, and the entire genomes of C. elegans and C. briggsae. A surprising result of this analysis was the finding that not only the sizes of the GA-sequences resembled a Pareto distribution, but also the intervals between consecutive GA-sequences obeyed the identical power law (see Fig. 2).

**Fig.2. Universal characteristics of the fractal GA-sequences of a wide range of species and kingdoms.**

**Common logarithm/logarithm plots of the numbers of GA-segments (panel a) and intervals (panel b) of the various genomes as functions of their sizes. Abscissa: segment (interval) sizes; ordinate: numbers of segments (intervals). The data of the different genomes were normalized to the same total genome size of 100 Mbp. Depicted are the data from human chr.1, chr.2, chr.3, chr.4, chr.5, chr.6, chr.7, rhesus chr.1, mouse chr.1, dog chr.1, cat chr.3, chicken chr.1, zebrafish chr.1, xenopus l. chr.0, anopheles, drosophila melanogaster chr. X, zea maize, poplar chr.1, and the entire genomes of C. elegans and C. briggsae.**

__ The 'Genome Pixel Images' (GPxI) of GA-sequences.__

__ The 'Lα -order' of the GA-sequences of actual chromosomes.__

- In the vertical direction the GA-sequences were arranged by decreasing size.
- Their GPxI pixel lines were left-adjusted.
- Each set of GA-sequences of the same size was ordered alphabetically with G's ranking before A's starting at the 5' end.
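The three ordering rules listed above can be sketched as a single sort. This is only an illustration: the function name `l_alpha_order` and the example sequences are hypothetical, and the sequences are assumed to be written 5' to 3' and to contain only G and A.

```python
def l_alpha_order(ga_sequences):
    """Sort GA-sequences by decreasing size; within one size class,
    sort alphabetically from the 5' end with G ranking before A."""
    rank = {"G": 0, "A": 1}
    return sorted(ga_sequences,
                  key=lambda s: (-len(s), [rank[base] for base in s]))

# Hypothetical example sequences
ordered = l_alpha_order(["AGA", "GGAA", "GAG", "AAAA", "GAA"])
for seq in ordered:
    print(seq)  # printing each sequence on its own line keeps the rows left-adjusted
```

With G ranking before A, the sequences that start with a run of A's collect at the bottom of each size class.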

__ The 'baseline' GPxI pattern of Lα -ordered, hypothetical GA-sequences. __

Figure 3a shows such an exponential distribution. Translating it into a histogram of the actual counts of GA-sequences with a certain size s yielded a graph similar to Figure 3b. It showed, for example, that this hypothetical genome contained 4 GA-sequences with size 76, 13 sequences with size 44, 32 sequences with size 20, etc. Creating the GPxI of the Lα -ordered GA-sequences meant turning each of the four 76 bp long sequences into its pixel image and writing them left-adjusted underneath each other in the order dictated by the Lα -order. Subsequently, one could do the same with the five 72 bp long sequences, …, the thirteen 44 bp long sequences, the thirty-two 20 bp long sequences and so forth. The result is shown in Figure 3c. The most prominent features of the plot are the right-hand outlines of the pixel images, which follow a curve in distinct steps of size. The curved outline is obviously a reflection of the exponential size distribution. Meant only as an illustration, the numbers of the hypothetical GA-sequences were chosen far lower than the numbers found in actual genomes. Also, the amplitudes of the size steps were greatly exaggerated in order to show more clearly that the plot generates 'blocks' of pixel images of the GA-sequences that contain all GA-sequences of the same size (Figure 3c, label 'block'). Since the GA-sequences followed an exponential distribution, the number of GA-sequences of a certain size increased exponentially with decreasing size, rendering the blocks the longer in the vertical direction, the shorter they were in the horizontal direction. It is obvious from Figure 3c that this set of random GA-sequences showed no specific scale-invariant pattern. Therefore, it illustrates the above remark that compliance with a power law is not sufficient to qualify a set of objects as fractals.
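A baseline of this kind can be sketched as follows. The values `n_max=67` and `lam=0.037` are illustrative guesses, chosen only so that the rounded counts come out as the 4, 13 and 32 sequences mentioned above; they are not taken from the figure. Rendering A as `#` (dark) and G as `.` (light) is likewise an assumption made for a plain-text display.

```python
import math
import random

def baseline_blocks(n_max, lam, sizes, seed=0):
    """Random GA-sequences whose counts follow N(s) = n_max * exp(-lam * s)."""
    rng = random.Random(seed)
    blocks = {}
    for s in sizes:
        count = round(n_max * math.exp(-lam * s))
        blocks[s] = ["".join(rng.choice("GA") for _ in range(s))
                     for _ in range(count)]
    return blocks

def pixel_rows(blocks):
    """Left-adjusted text 'pixel' rows: largest sizes first, and within
    each block G ranking before A."""
    rank = {"G": 0, "A": 1}
    rows = []
    for s in sorted(blocks, reverse=True):
        for seq in sorted(blocks[s], key=lambda q: [rank[b] for b in q]):
            rows.append(seq.replace("A", "#").replace("G", "."))
    return rows

blocks = baseline_blocks(n_max=67, lam=0.037, sizes=[20, 44, 76])
rows = pixel_rows(blocks)
print("\n".join(rows))
```

Because the bases are drawn independently and with equal probability, the blocks differ in height and width but show no recurring internal pattern.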

**Fig.3. The GPxI of a hypothetical, Lα -ordered set of random GA-sequences with an exponential size distribution**

**a. Graph of an assumed exponential size distribution p = N(s,λ) = N_{max}e^{-λs} of a set of hypothetical GA-sequences in a logarithm/linear plot.**

**b. Histogram of the counts of GA-sequences for the different values s of their sizes, derived from the distribution in panel a.**

**c. The GPxImages of the computer-generated, random GA-sequences that belong to the above distribution. Each GPxI of a GA-sequence is displayed in the horizontal direction. These GPx images were ordered in the vertical direction according to size, thus defining blocks (marked 'block') of GPxI's that represent each one of the columns of the histogram in panel b.**

__ The fractal GPxI pattern of the GA-sequences of whole chromosomes.__

Entirely unexpected, however, was the finding that the visual pattern of every block was strikingly similar to that of its neighboring blocks. It appeared, therefore, that the sequence architecture that had generated the GA-sequences of each block was common to all of them, and that the blocks were self-similar elements of the set of GA-sequences over a large range of block sizes.

**Fig.4. The GPxI of the GA-sequences of human chr.1**

A characteristic feature of the scale invariance of fractal patterns is the appearance of sub-domains that express a pattern similar to that of the whole. In the case of the blocks of GA-sequences this feature is very obvious in the very large blocks. Figure 5 shows the example of the large block consisting of the 1843 ordered GA-sequences of human chr.1 of size = 26 bp. (Note: This block and many similarly large ones were omitted from Figure 4 because of their excessive sizes.) One can recognize numerous typical sub-block patterns nested within the single block shown in the illustration, and sub-sub-blocks nested within sub-blocks.

**Fig.5. The appearance of sub-block patterns nested within single large blocks.**

**The figure shows the GPxI of the single large block of the 1843 GA-sequences of human Chr.1 which are 26 bp long. The block is split
into 2 halves. Dotted arrows indicate a sub-block and a sub-sub-block with similar patterns as the whole block.
**

__ The characteristics of block patterns.__

- Poly-A-domain: The most prominent feature is a consequence of the upstream poly-A segments of increasing length (black area at the lower left side with a curved upper border) indicating that a large number of GA-sequences of every size class have an upstream poly-A segment.
- AGA-border: With very few exceptions each of the poly-A segments was terminated with an AGA motif that appeared as an isolated white border line in the GPxI's.
- Type A patterns: Further downstream each of these GA-sequences that began with a poly-A segment continues with individually different GA-sequences, which, however, are A-rich. Therefore, their GPxI's appeared darker than the upper portions of the block.
- Type B patterns: The GA-sequences of each block which began upstream with only short oligo-A segments or with oligo-G segments appeared whitish as they had more balanced G/A ratios.
- Type C patterns: Frequently, the blocks contain long runs of poly-GA sequences. They appear as repetitive 'ladders' of varying width.
- Type D patterns: In contrast to the pronounced poly-A portions, the upstream poly-G portions (white domains) were usually very short. They are marked as 'Type D' patterns.

**Fig.6. Characteristics of block patterns of GA-sequences**

**a. Enlargement of the marked block of Figure 5.**

**b. Schematic, low resolution drawing of the block in panel a in order to identify the various typical pixel patterns marked as poly-A, Types A - D, and AGA-border.**

**c. The block shown in Fig. 5 contains 13 times more GA-sequences than the one shown in panel a. Therefore, 1/13th of its GA-sequences were randomly selected, reordered according to Lα, and its GPxI was produced. The result is a block pattern similar to panel a.**

__ The range of block sizes with similar patterns.__

The similarity of the block patterns was particularly obvious in the cases of blocks of GA-sequences that were shorter than 50 bp. In contrast, the blocks with longer GA-sequences contained too few members to identify a specific pattern. Nevertheless, the pixel patterns of the few members of these blocks were consistent with the much more obvious patterns of the larger blocks. In order to compare the patterns of blocks of different sizes, we applied a method to standardize the scales. Using this method, we found that in the case of the human genome the sizes of blocks with similar patterns ranged between 4 and 3900, covering almost 3 orders of magnitude. The total range of block sizes can actually be increased considerably more. For example, the present study restricted the size of the GA-sequences to values larger than 25 bp. If one allows GA-sequences as short as 10 bp, according to their Pareto distribution the block sizes may increase another 2 orders of magnitude. The total range of block sizes is also a matter of the total number of GA-sequences of the genome in question. For example, the mouse genome contains about 5 times as many GA-sequences as the human genome. Accordingly, the range of block sizes of mouse GA-sequences may increase another order of magnitude. Still, many naturally occurring fractals such as coast lines cover much larger ranges of sizes. Yet, we submit that even 3-5 orders of magnitude of self-similar patterns of the effectively Pareto-distributed GA-sequences appear sufficiently different from random patterns to deserve a classification as fractals. Please note that we are not claiming that the individual GA-sequences are fractals, but that the subsets of GA-sequences that have the same length are.

__ The pattern blocks of other genomes.__

**Fig.7. Typical block patterns of GA-sequences from chromosomes of different species belonging
even to different kingdoms**

__ An attempt to explain the origin of the fractal set of GA-sequences.__

- All GA-sequences began as one or several very large GA-sequences that were fragmented later during evolutionary times. Their subsequent fragmentation may have occurred by conversions of G- or A-nucleotides, or by inversions. Alternatively,
- All GA-sequences began as a large number of small GA-motifs that were scattered all over the genome and later grew or aggregated into much larger sizes.

Using computer simulations to distinguish between these possibilities, we found that only the assumption of fragmentation of large, initial GA-sequences was able to yield the experimentally observed size distributions of fragments and intervals.

For simplicity's sake this method assumed a single, very large 'seed' GA-sequence whose fragmentation by inversions would give rise to the entire set of fractal GA-sequences. Similar results would be obtained if the initial conditions included several different large 'seed' sequences.

In order to simulate only the fractal size distribution of the GA-sequences no further specific assumption was necessary about the composition of G's and A's of the postulated initial 'seed' sequence. However, preliminary results of our attempts to simulate not only the size distribution as described here, but also the actual sequences of the GA-sequences in the human genome suggested that the initial 'seed' sequences may have been a large stretch of poly-GAGA.

This kind of large repetitive sequence is not a problematic assumption. On the contrary, most genomes sequenced to date were found to contain numerous large repetitive sequences composed of a variety of small motifs. They are located predominantly in the centromeric, but also in the telomeric regions of chromosomes. (In many cases they prevented the unambiguous formulation of contigs and forced the sequencers to replace them with large stretches of N's in the published sequences.) In addition, shorter repetitive sequences of codons have been described in exons and introns which appear to be significant for the evolution and function of proteins.

In order to simplify the analysis, each strand of double stranded DNA was depicted using only 2 symbols, 'gray' and 'black' (see Fig.8a), which represent either A's and G's or T's and C's. The computer simulation subjected these simplified depictions to a large number of recursive inversions. Subsequently, the size distributions of segments and intervals were obtained by counting the numbers of different 'gray' and 'black' segments as a function of their size (Fig.8b).

**Fig.8. Schematic representation of genomes as interlacing of the GA-sequences with the TC-sequences (=intervals).**

**All G's and A's are depicted in gray ('segments') and all T's and C's in black ('intervals').**

**a. Schematic of a double stranded DNA. Segments and intervals divide each strand into exactly 2 non-overlapping domains. Together, the black and gray domains make up each entire strand of the double stranded DNA. The reverse complement of each segment is a member of the set of intervals of the opposite strand and vice versa.**

**b. Separate depiction of the segments and intervals of the right-hand strand of panel (a).**

__ Simulation of the segment- and interval distribution by recursive inversions.__

In order to simulate the process, we developed special software which subjected computer-constructed DNA strands to increasing numbers of inversions (random sizes <= a limit) while evaluating the emerging segment and interval distributions. As human chr.1 is 2,380 times larger than the simulated genomes (100,000 [bp]), the simulated segment and interval counts were multiplied by this normalization factor.
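The software itself is not reproduced here. The following is only a minimal sketch of the idea under stated assumptions: one strand is stored in the two-symbol form, with `R` for G/A and `Y` for T/C; an inversion is modelled as replacing a stretch by its reverse complement, which reverses the stretch and swaps `R` and `Y`; and the seed is placed at the start of the strand. All function and parameter names are hypothetical.

```python
import random
from collections import Counter

def run_lengths(strand):
    """Count runs of 'R' (GA-segments) and 'Y' (TC-intervals) by size."""
    segments, intervals = Counter(), Counter()
    start = 0
    for i in range(1, len(strand) + 1):
        if i == len(strand) or strand[i] != strand[start]:
            target = segments if strand[start] == "R" else intervals
            target[i - start] += 1
            start = i
    return segments, intervals

def invert(strand, start, length):
    """Replace a stretch by its reverse complement (reversed, R <-> Y)."""
    swap = {"R": "Y", "Y": "R"}
    piece = strand[start:start + length]
    strand[start:start + length] = [swap[b] for b in reversed(piece)]

def simulate(genome_size=100_000, seed_size=20_000, n_inversions=20_000,
             max_inversion=1_000, seed=0):
    """Fragment one continuous seed segment by random inversions."""
    rng = random.Random(seed)
    strand = ["R"] * seed_size + ["Y"] * (genome_size - seed_size)
    for _ in range(n_inversions):
        length = rng.randint(1, max_inversion)
        start = rng.randrange(0, genome_size - length + 1)
        invert(strand, start, length)
    return run_lengths(strand)

# Small demonstration run; the default arguments mirror the values quoted in the text
segments, intervals = simulate(genome_size=5_000, seed_size=1_000,
                               n_inversions=500, max_inversion=100)
print(sorted(segments.items())[:5])
print(sorted(intervals.items())[:5])
```

To compare such counts with a chromosome, each count would be multiplied by the normalization factor mentioned above (2,380 for human chr.1 and a 100,000 bp model genome).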

Fig.9 shows a typical simulation. Although initially complex and unstructured, with increasing numbers of inversions the distributions of segments and intervals became almost identical to each other (Fig.9b). At the same time the quality of the match between the simulated data and the data of the human genome (depicted as black square marks) increased to a high level of accuracy, which was reached at an optimal number of inversions N_{m} of approximately 20,000. For N > N_{m} the simulations deviated again from the human genome data as the distribution curves of segments and intervals became steeper and shifted to the smaller segment sizes, while nevertheless remaining almost identical to each other.

**Fig.9. Simulation of the fragmentation by inversions of an initially large, continuous GA-sequence into segments (white lines) and intervals (black lines) that match precisely the segment distribution of human chr.1 (square marks).**

**Abscissa: segment (interval) sizes; ordinate: numbers of segments (intervals). Total simulated genome size: 100,000 [bp]. Initial unfragmented GA-segment: 20,000 [bp] (continuous segment). Size of the inversions: random <= 1000 [bp]. The panels show the log-log plot of the simulated segment distribution after 1000 (panel a) and 20,000 (panel b) randomly located inversions. (Note: Starting the simulations with one large, continuous initial GA-segment was not necessary to create fractal GA-sequences, but only to match the human data above segment sizes of 30 [bp].)**

**Fig.10. Animated display of the above simulation.**

__Alternative simulations.__
Instead of an initial "seed" genome configuration that consisted of a single large continuous GA-sequence, we also used large numbers of smaller,
randomly sized and randomly located GA-sequences as "seeds". They generated equally well fitting distributions.

**Alternatively, we used initial genomes that were created as a Markov chain of nucleotides. These simulations yielded excellent matches with the actual human genome data but
only for the small GA-sequences below sizes of 30 [bp]. They deviated considerably at larger segment and interval sizes.
Replacing the inversions with other kinds of mutations, including insertions, deletions, and even hypothetical mechanisms
such as growth and aggregation of small GA-motifs changed the results of the simulations dramatically. Especially, none of
these mutations was able to generate segment and interval distributions that were even similar to each other, let alone
identical as in the case of inversions.
**

__ Simulation of the self-similarity patterns of the fractal GA-sequences.__

In contrast, the simulation of the self-similar sequence patterns of the fractal GA-sequences of actual genomes must, of course, offer a model to illustrate how these specific sequences may have arisen from the fragmentation of the initial GA-sequence(s).

Using the same simulation parameters as above (i.e. model genome size: 100 kbp; "seed" GA-sequence size: 50 kbp; 20,000 inversions with random sizes <= 1 kbp), we found that the following additional rules were able to generate sequence patterns quite similar to the ones observed in actual genomes.

- The "seed" GA-sequence was a poly-GAGA sequence that constituted 50% of the "ur"-genome. It was not necessary that the "seed"-sequence represented a single, continuous stretch. The results were qualitatively the same if the "ur"-genome contained multiple separate "seed"-sequences, as long as their total size amounted to 50% of the "ur"-genome.
- This "ur"-genome was subjected to a large number (e.g. 20,000) of inversions that generated the many GA-fragments described in the above distributions.
- 50% of the resulting GA-segments were randomly selected and modified by Method I. 97% of the remaining GA-segments were subjected to Method II and 3% remained unchanged.
- Method I: A stretch of random size of the original GAGA-motifs starting at the 5' end was converted to AAAA; the remaining downstream GAGA-motifs of the same segment were converted randomly to any one of the A-rich motifs (AAAA, AAGA, AGAA, or GAAA). This method generated the pattern of Type A (Fig.6b).
- Method II: Every GAGA-motif was converted randomly into one of the 16 tetra GA-motifs such as AAAA, AAAG, AAGA,...,AGGG, GGGG. This method generated the patterns of Type B and D (Fig.6b). The final 3% of the poly-GAGA segments remained unchanged. They represented the pattern of Type C (Fig.6b).
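The conversion rules above can be sketched as follows. This simplified illustration assumes that every GA-segment consists of a whole number of GAGA-motifs and ignores frameshifts; the function names are hypothetical.

```python
import random

A_RICH = ["AAAA", "AAGA", "AGAA", "GAAA"]
ALL_TETRA = [a + b + c + d
             for a in "AG" for b in "AG" for c in "AG" for d in "AG"]

def method_one(n_motifs, rng):
    """Upstream stretch of random size -> AAAA; each remaining motif ->
    a randomly chosen A-rich motif (Type A pattern)."""
    k = rng.randint(0, n_motifs)  # number of upstream motifs turned into AAAA
    tail = "".join(rng.choice(A_RICH) for _ in range(n_motifs - k))
    return "AAAA" * k + tail

def method_two(n_motifs, rng):
    """Every motif -> one of the 16 possible tetra GA-motifs (Types B and D)."""
    return "".join(rng.choice(ALL_TETRA) for _ in range(n_motifs))

def convert_segment(n_motifs, rng):
    """50% Method I; of the remainder, 97% Method II and 3% unchanged."""
    if rng.random() < 0.5:
        return method_one(n_motifs, rng)
    if rng.random() < 0.97:
        return method_two(n_motifs, rng)
    return "GAGA" * n_motifs  # unchanged poly-GAGA (Type C)

rng = random.Random(1)
converted = [convert_segment(6, rng) for _ in range(10)]
for seq in converted:
    print(seq)
```

Here each segment is assigned to a method by an independent random draw, so the 50% / 97% / 3% split holds only on average; selecting exact fractions of the segments would be an equally plausible reading of the rules.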

Figure 11 shows an example of the GPxI pattern of a simulated genome. The high degree of similarity with the patterns of human chromosome 1 (Fig. 4) is rather striking.

**Fig.11. Computer simulation of the sequence patterns resulting from fragmentation by inversions of an initially continuous poly-GAGA-sequence (50% of the genome) into segments, using the above additional rules to convert the GA-motifs.**

__ Speculations about the evolutionary and biochemical mechanisms involved.__

Of course, the rules must not only be simple. One should also be able to translate them into conceivable evolutionary and biochemical mechanisms. As to the rules presented here, they require no more complex mechanisms than inversions followed by the insertion, replacement or interconversion of purine bases. Such mechanisms are known to exist, but, of course, there is no experimental evidence yet that they occurred on the massive scale as postulated by the presented model.

The following figures are to illustrate the basic assumptions and methods used above to simulate the fractal sequence patterns of Fig.11.

**Fig.12. Illustration of a conceivable way to create an "ur"-genome that consists of large stretches of poly-GAGA motifs which add up to 50%.**

**Assume a starting poly-GAGA sequence that is paired with a matching poly-TC sequence. If a large primordial temperature fluctuation melts the strands, they may afterwards concatenate and eventually generate a matching complementary strand pair, provided that the ambient temperature returned to a lower value. The resulting double stranded "ur"-genome, which now contains 50% GA-motifs in a long contiguous stretch, may repeat the same procedure and so forth. Thus an "ur"-genome may result that contains many long stretches of GA-motifs which add up to exactly 50%.**

**Fig.13. Illustration of the purine conversion required by Method I**

**Upper panel: Conversion of both G's of a GAGA motif to A's of a random number of contiguous upstream GAGA motifs (w/wo frameshift).
Lower panel: Conversion of one or both G's into A's of the remaining GAGA-motifs (w/wo frameshift) of the same GA-segment.
**

**Fig.14. Illustration of the purine conversion required by Method II**

**Random conversion of one or both G's of a GAGA motif to A's or one or both A's of a GAGA-motif to G's.
**