I. Pure GA-sequences - genome sign posts?

[See ref 6, ref 7]

Pure GA-sequences as candidates for genomic sign posts.

A simple way to search for natural genomic sign posts would be to look for unexpected, non-coding sequences that exist in large numbers. Here, we describe a specific type, namely sequences between 50 and 1000 bases long that consist of only 2 bases.

Indeed, such sequences would be quite unexpected. Crudely estimated, assuming that each position is an A or a G with probability 1/2, the probability p that a given sequence of 50 bases contains only (say) A's and G's would be p = (1/2)^50 ≈ 8.9 × 10^-16. In other words, such sequences should practically never be found.
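The crude estimate can be checked with a few lines of Python. This is only a sketch: the per-base probability of 1/2 is the simplification used in the text, and the genome size of 3 billion bases is a round figure assumed here purely for illustration.

```python
# Crude estimate: every position is independently an A or a G with probability 1/2.
p_purine = 0.5      # assumed per-base probability of "A or G"
run_length = 50     # length of the window that must be pure GA

# Probability that one given 50-base window contains only A's and G's.
p = p_purine ** run_length

# Hypothetical genome size, assumed only for illustration (about 3 billion bases).
genome_size = 3_000_000_000
expected_windows = p * (genome_size - run_length + 1)

print(f"p = {p:.3e}")
print(f"expected number of pure-GA 50-base windows: {expected_windows:.3e}")
```

Under these assumptions the expected number of such windows in a whole genome is far below 1, which is the sense in which they should not occur by chance.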

Nevertheless, as will be reported here, the chromosomes of human, chimpanzee, dog, cat, rat, and mouse contain many tens of thousands of such sequences. They will be called 'pure GA-sequences'. The presentation describes their frequency, distribution, composition from smaller motifs, and apparent protection from point mutation, and speculates about their potential functions as genomic sign posts and spatial linkers of chromatin. In addition, it discusses several results from the field of heat shock biology that could be viewed as experimental support for this interpretation.

The following is an example of a 1305 [b] long pure GA-sequence at position 717,235 in human chr.7.

The phenomenon of pure GA-sequences

Plotting the sizes of all pure GA-sequences as a function of their position along a random, computer-generated DNA strand showed that such sequences are hardly ever longer than 15 bases (Fig.1a). In stark contrast, a similar plot of an 8 Mb large section of human chr.1 between 24 Mb and 32 Mb displayed numerous much longer sequences, including 76 that were between 50 and 269 bases long (Fig.1b).

Fig.1. Length of pure GA-sequences as a function of their starting position along an 8 [Mb] long stretch of DNA sequence. (Abscissa: position in [Mb]; Ordinate: length of pure GA-sequence in [b]).
a. Computer generated random sequence of 20% G's and 30% A's (similar to the human genome).
b. Human chr.1 between positions 24 [Mb] and 32 [Mb]. (Gaps in the density of GA-sequences are due to un-sequenced regions).
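A random control of the kind shown in Fig.1a can be sketched as follows. The 20% G and 30% A composition is taken from the caption; the split of the remainder into 30% T and 20% C, the sequence length of 1 Mb, and the function names are assumptions made for this illustration.

```python
import random
import re

def random_dna(n, weights, seed=0):
    """Return a random DNA string of length n; weights are given in the order G, A, T, C."""
    rng = random.Random(seed)
    return "".join(rng.choices("GATC", weights=weights, k=n))

def ga_run_lengths(seq):
    """Lengths of all maximal runs that consist only of G's and A's."""
    return [m.end() - m.start() for m in re.finditer(r"[GA]+", seq)]

# 20% G and 30% A as in Fig.1a; 30% T and 20% C are assumed for the remainder.
seq = random_dna(1_000_000, weights=[0.2, 0.3, 0.3, 0.2])
lengths = ga_run_lengths(seq)
longest = max(lengths)
print("longest GA run in the random control:", longest)
```

In such a control the longest run stays far below the 50-base threshold used for pure GA-sequences.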

Every pure GA-sequence was naturally paired with a pure TC-sequence on the opposite strand. Likewise, each pure TC-sequence on a strand corresponds to a pure GA-sequence on the complementary strand. In view of the countless inversions in the evolutionary past of each chromosome, one may expect that there are approximately as many pure GA-sequences as there are pure TC-sequences on each chromosome. After all, each inversion that contained a GA-sequence simply exchanged it with its reverse complementary TC-sequence from the opposite strand, and vice versa, thus asymptotically equalizing their numbers on each strand [see 2]. Indeed, human chr.1, which contained 1667 pure GA-sequences, also contained 1734 pure TC-sequences. The corresponding numbers for human chr.3 were 1155 and 1118, and for the human X chromosome 1115 and 1059.
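The pairing of GA- and TC-sequences follows directly from base complementarity, as the following small sketch illustrates. The example sequence is a toy string, not taken from any genome.

```python
# Watson-Crick complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5'->3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

ga = "GAAAGGAGAAAG"            # toy pure GA-sequence
tc = reverse_complement(ga)    # what an inversion would place on the same strand
print(ga, "->", tc)
```

The reverse complement of any sequence made only of G's and A's contains only T's and C's, so an inversion converts a GA-sequence on one strand into a TC-sequence on that strand.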

It should be noted that pure GA-sequences represented only a small fraction of every chromosome, in spite of their abundance. For example, the total sequence length of the mentioned 1667 pure GA-sequences of human chr.1 amounted to as little as 0.0642 [%] of this chromosome.

The size-distribution of pure GA-sequences.

The probability that N consecutive bases are all A's or G's should be a rapidly diminishing exponential function of N. Indeed, a logarithmic plot of the frequencies of the lengths of the pure GA-sequences of a computer-generated, random DNA sequence yielded a straight line (Fig. 2a). In contrast, a logarithmic plot of the frequencies of the actual sizes of pure GA-sequences followed a power law distribution. Figure 2b shows the example of the human genome. After normalization for sizes starting at size = 1, i.e. after dividing each raw count by the sum of all raw counts, the probability density function became

p(size) = (a-1) · size^(-a), with a = 3.2.

The exponent a = 3.2 was determined from a double logarithmic plot of the normalized function. The equation fitted the data with high accuracy, especially for sizes > 50. Thus, it seemed that the sizes of GA-sequences follow a Pareto distribution. Initially defined by the Italian economist Vilfredo Pareto to describe the distributions of income and wealth, this distribution was later found to describe a great many other real-world phenomena, such as the size distributions of sand or meteorites and the sizes of human settlements. Compared to exponential distributions, including Poisson distributions for small probabilities, it is characterized by a much slower decline of frequencies at large size values.
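Reading an exponent off a double logarithmic plot amounts to fitting a straight line to log(count) versus log(size). The sketch below does this by ordinary least squares. It is an assumed procedure for illustration, not necessarily the fitting method used for Fig. 2, and the counts are synthetic values generated from an exact power law.

```python
import math

def fit_power_law_exponent(sizes, counts):
    """Least-squares slope of log(count) versus log(size), returned as a positive exponent a."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx

# Synthetic counts that follow size^(-3.2) exactly (illustration only, not genome data).
sizes = list(range(50, 500))
counts = [1e9 * s ** -3.2 for s in sizes]
a_hat = fit_power_law_exponent(sizes, counts)
print(f"fitted exponent a = {a_hat:.3f}")
```

For real counts, bins with zero entries would have to be skipped or pooled before taking logarithms.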

Fig.2. Distribution of GA-sequence lengths. Abscissa: GA-sequence length [b]. Ordinate: logarithm of counts. The vertical line indicates the defining threshold of 50 [b] for pure GA-sequences.
a. Exponential distribution of the computer-generated GA-sequence of Fig.1a.
b. The Pareto-distribution of GA-sequence lengths of the entire human genome.

The exclusion of poly-A and poly-G sequences.

Obviously, one may consider poly-A and poly-G sequences as special cases of pure GA-sequences that contain only one of the 2 bases. As these are well-described in the literature, the present examination intended to exclude them while focusing specifically on pure GA-sequences that contain both nucleotides. As it turned out, this could be accomplished quite simply by setting a length threshold, because poly-A and poly-G sequences were very rarely longer than 50 bases.

For example, among the mentioned 76 pure GA-sequences in the 8 Mb large section of human chr.1 between 24 Mb and 32 Mb there were no poly-A or poly-G sequences that were larger than 50 bases. Even within the entire 240 Mb large human chr.1, which contained 1667 such GA-sequences there were no poly-G sequences and only 7 poly-A sequences longer than 50 bases. Therefore, in the following,
we define as pure GA-sequence a DNA sequence that consists exclusively of G's and A's and is longer than 50 bases.
It should be noted that this definition drastically reduces the number of poly-A and poly-G sequences among the pure GA-sequences, but does not eliminate them completely.
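The definition can be turned into a short search routine. The following is a minimal sketch, assuming the sequence is available as a plain string; the function name, the tuple layout, and the homopolymer flag are choices made here for illustration.

```python
import re

def find_pure_ga(seq, min_len=51):
    """Return (start, length, is_homopolymer) for every maximal run of G/A with length >= min_len.

    min_len=51 encodes 'longer than 50 bases'. The flag marks the poly-A and
    poly-G runs that the length threshold alone does not eliminate.
    """
    hits = []
    for m in re.finditer(r"[GA]+", seq.upper()):   # upper() also accepts lower-case input
        run = m.group()
        if len(run) >= min_len:
            hits.append((m.start(), len(run), len(set(run)) == 1))
    return hits

# Toy sequence: a 60-base GA run, a 55-base poly-A run, and a 20-base run that is too short.
toy = "TTC" + "GA" * 30 + "CT" + "A" * 55 + "T" + "GAAA" * 5 + "C"
print(find_pure_ga(toy))
```

Lowering `min_len` relaxes the size restriction without any other change to the search.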

Relationship between chromosome size and numbers of pure GA-sequences.

The numbers of pure GA-sequences were approximately proportional to the size of the chromosome to which they belonged. A correlation plot between the size of each human chromosome and its number of pure GA-sequences yielded a correlation coefficient of 0.907 (Fig.3). Its slope corresponded to an average density of GA-sequences of 6.9 [sequences/Mb].

Fig.3. Relationship between chromosome size and number of pure GA-sequences among the 23 human chromosomes (Y-chromosome was omitted).

The species-dependent spatial density of pure GA-sequences.

The densities of pure GA-sequences from chromosomes of different species differed to a much greater degree than the densities of different chromosomes from the same species. For example, the average density of the 23 human chromosomes (excluding the Y-chromosome) was 6.9 [sequences/Mb] (std.dev. = 2.1 [sequences/Mb]). In contrast, the densities of mouse chr.2 and rat chr.3 were 3 and 4.7 times larger, respectively (Fig.4a,b). On the other hand, maize, Arabidopsis and C. elegans had 10 to 50 times smaller densities of pure GA-sequences than human chr.1 (Fig.4c). Thus, densities could vary up to 250-fold between species. It should be noted that maize, Arabidopsis, and C. elegans not only contained very few pure GA-sequences; the few they contained were effectively simple poly-GA sequences.

Fig.4. Density of pure GA-sequences for various mammals and other organisms.

a. Human chr.1 between positions 8 [Mb] and 16 [Mb]. (axes as in Fig.1).
b. Mouse chr.2 between positions 16 [Mb] and 24 [Mb]. (axes as in Fig.1).
c. Density of pure GA-sequences for individual chromosomes of various mammalian and non-mammalian species (Ordinate: number of pure GA-sequences per [Mb] of sequence). The tags along the abscissa indicate the various chromosomes in the order from left to right: mouse chr.1, rat chr.1, dog chr.1, human chr.1, chimpanzee chr.1, zebrafish chr.1, cat genome segment 1, maize chr.1, Arabidopsis chr.1, and Caenorhabditis elegans chr.X.

The individuality of pure GA-sequences.

There are 2^93 ≈ 10^28 different ways to generate different pure GA-sequences that are on average 93 bases long. This astronomically large number would be able to afford each pure GA-sequence its own, individual sequence.
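The size of this number can be verified directly, since each of the 93 positions can hold one of 2 bases. The short check below uses only the average length of 93 bases quoted in the text.

```python
import math

average_length = 93                   # average length of pure GA-sequences [b], from the text
n_sequences = 2 ** average_length     # two choices (G or A) at each position

# Express the count as a power of ten: 2^93 = 10^(93 * log10(2)).
exponent = average_length * math.log10(2)
print(f"2^93 = {n_sequences} which is about 10^{exponent:.1f}")
```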

In order to test whether each pure GA-sequence was in fact an individual, I measured the degree of homology between every pure GA-sequence and every other that was found in human chromosomes 1, 2, 3, 7, 17 and X. The tests used the Needleman-Wunsch algorithm [3]. The resulting frequency distribution of a total of 34,123,491 individual tests is shown in Figure 5.
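For readers unfamiliar with the Needleman-Wunsch algorithm, the following sketch shows a global alignment that reports the fraction of identical aligned columns. The scoring values (match = 1, mismatch = -1, gap = -1) and the definition of the homology measure as 'identical columns divided by alignment length' are assumptions made here; the text does not state which parameters were used in the actual tests.

```python
def needleman_wunsch_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b and return the fraction of aligned columns that are identical."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back one optimal alignment and count identical columns.
    i, j = n, m
    identical = columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            identical += a[i - 1] == b[j - 1]
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1          # gap in b
        else:
            j -= 1          # gap in a
        columns += 1
    return identical / columns if columns else 1.0

print(needleman_wunsch_identity("GAAAGAAAGAAAG", "GAAAGAAAGAAAG"))
print(needleman_wunsch_identity("GAGGAAGAGAGGA", "AAGAGGGAAGAGA"))
```

The quadratic cost per pair is the reason why tens of millions of pairwise tests are computationally demanding.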

On average there were only 0.5% (stddev: 0.4%) cases of identity among the pure GA-sequences within the same chromosomes, and 0.2% (stddev: 0.1%) cases between different chromosomes of the same species. Most other pairs of pure GA-sequences were approximately 50% homologous, as one would expect statistically from sequences of only 2 bases.

Similarly, I tested for the homologies between pure GA-sequences of chromosomes of human, chimp, dog, mouse, rat, and cat (97,151,955 tests). In this case there were on average 0.5% (stddev: 0.4%) cases of identical pure GA-sequences among the chromosomes of different mammalian species. The rare cases of 100% homology belonged uniformly to 4 special types of pure GA-sequences, namely poly-A, poly-G, poly-GA, or poly-GAAA sequences. All other pure GA-sequences were unique individuals. Please note that the mentioned poly-A and poly-G sequences were left among the pure GA-sequences because the size restriction for pure GA-sequences of 50 bases or longer eliminated most but not all of them.

Are pure GA-sequences coding?

There are 8 codons that consist exclusively of G's and A's, namely AAA, AAG, AGA, AGG, GAA, GAG, GGA, and GGG. They code for Arg, Gly, Glu, and Lys. As it seemed conceivable that some pure GA-sequences coded for proteins that contained chains of these 4 amino acids, I tested how many pure GA-sequences were transcribed into mRNAs. Searching all human transcripts (440 [Mb]) for pure GA-sequences longer than 50 bases and excluding, of course, all poly-A sequences, I found 394 cases. In contrast, the entire human genome contained a total of 19,139 pure GA-sequences, indicating that at most about 2% (394 of 19,139) of them are transcribed into messenger RNAs. Thus, for all practical purposes the pure GA-sequences may be considered non-coding for proteins.
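The 8 codons and their amino acids can be enumerated as a small lookup table taken from the standard genetic code. The helper `translate_ga` and its choice of reading frame are hypothetical and serve only to show what a translated pure GA-sequence would look like.

```python
from itertools import product

# Standard genetic code, restricted to the codons that contain only G's and A's.
GA_CODONS = {
    "AAA": "Lys", "AAG": "Lys",
    "AGA": "Arg", "AGG": "Arg",
    "GAA": "Glu", "GAG": "Glu",
    "GGA": "Gly", "GGG": "Gly",
}

# All 2^3 = 8 triplets over the alphabet {A, G}.
all_ga_codons = ["".join(c) for c in product("AG", repeat=3)]

def translate_ga(seq):
    """Translate a pure GA-sequence in reading frame 0, ignoring an incomplete last codon."""
    usable = len(seq) - len(seq) % 3
    return [GA_CODONS[seq[i:i + 3]] for i in range(0, usable, 3)]

print(translate_ga("GAAAGGAAGGG"))
```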

The tetra-GA motifs of pure GA-sequences.

Concatenating end-to-end all 1667 pure GA-sequences of human chr.1 yielded the GPxI shown in Figure 5b. The comparison with a computer-constructed random GA-sequence file (Figure 5a) confirmed that the pure GA-sequences contain many repetitive patterns.

The period length of the common motifs can easily be determined by yet another application of the GPxI method. Adopting the rationale of the so-called Markham rotation, one can superimpose pixel-by-pixel a particular GPxI with other GPxIs that were created by frame-shifts of 1, 2, 3, … [b] of the original sequence.

Assume a motif has the size of N bases and forms strings of various lengths. Every time the original GPxI is superimposed with one that was frame shifted by N or an integral multiple of N, the images of the motif strings coincide and thus appear reinforced. As illustrated in the GPxI of the pure GA-sequences of human chr.1 (Fig.5b), frame shifts of 4, but not of 1, 2, and 3, reinforced the patterns, indicating that the prevalent repeated motifs of pure GA-sequences are tetra-GA motifs. These motifs were not only present, but constituted a significant part of the pure GA-sequences. Furthermore, the 4-fold patterns seem to repeat over several lines in the vertical direction of the GPxI, as if consecutive GA-sequences shared similar chains of tetra-GA motifs.
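The reinforcement argument has a simple numerical analogue at the level of the sequence itself: one counts how often a base equals the base found a fixed number of positions further downstream. This is a sketch of the underlying idea only, not the pixel-based GPxI superposition described above, and the function name is an assumption.

```python
def shift_agreement(seq, shift):
    """Fraction of positions i for which seq[i] equals seq[i + shift]."""
    pairs = len(seq) - shift
    same = sum(seq[i] == seq[i + shift] for i in range(pairs))
    return same / pairs

# Toy chain built from a single tetra-GA motif.
chain = "AAAG" * 25
scores = {k: shift_agreement(chain, k) for k in range(1, 5)}
for k, value in scores.items():
    print(f"shift {k}: agreement {value:.2f}")
```

Only the shift that equals the motif length (or a multiple of it) gives complete agreement; the other shifts stay near the level expected for this base composition.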

Many of the 16 different tetra-GA motifs (AAAA, AAAG, AAGA, AGAA, GAAA, GAAG, GGAA, AAGG, AGGA, AGAG, GAGA, GAGG, AGGG, GGAG, GGGA, GGGG) give rise to the same repetitive chains, provided one disregards the first 1 to 3 bases with which the chains begin. For example, chains of any of the 4 tetra-GA-motifs AAAG, AGAA, AAGA, and GAAA will generate essentially the same sequence …AAAGAAAGAAAGAAAGAAAG…. Only the beginnings and ends of the chains may differ.

Similar considerations suggest that in addition to AAAG among the remaining tetra-GA-motifs only AAGG, AGAG, and GGGA were able to generate essentially different chains (AAAA and GGGG are excluded by definition of the pure GA-sequences). These tetra-GA-motifs occurred with different frequencies in the pure GA-sequences. Evaluating the 206,450 occurrences of tetra-GA motifs in the 19,139 pure GA-sequences of the entire human genome yielded the following probabilities of their occurrence: AAAG (10.4%), AAGG (7.1%), AGAG (5.1%), and GGGA (3%). Together all of the tetra-GA motifs made up 46 - 47% of the entire length of the pure GA-sequences of the human genome. The rest were individual sequences that guarantee the individuality of the GA-sequences.
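The grouping of the 16 tetra-GA motifs into essentially different chains can be reproduced by treating motifs as equivalent when they are cyclic rotations of each other. In this sketch each group is labelled by its alphabetically smallest rotation, which is a convention chosen here (so the group containing GGGA appears under the label AGGG).

```python
from itertools import product

def canonical_rotation(motif):
    """Alphabetically smallest cyclic rotation of a motif, used as the label of its group."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))

# Group all 2^4 = 16 tetra-GA motifs by rotation equivalence.
classes = {}
for bases in product("AG", repeat=4):
    motif = "".join(bases)
    classes.setdefault(canonical_rotation(motif), []).append(motif)

for label, members in sorted(classes.items()):
    print(label, members)
```

This yields 6 groups: the two homopolymers AAAA and GGGG, plus the 4 groups that generate essentially different chains.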

Fig.5. Predominance of tetra-GA motifs in the pure GA-sequences of human chr. 1 as demonstrated by the GPxI method. The highlighted fields in the left hand panels are enlarged in the right hand panels. (Scales: 50 [b]/division)
a. The GPxI of a computer-constructed DNA file consisting of random sequences of G (white pixels) and A (black pixels). Therefore, no pixels with other gray-values are visible. The randomness of the sequences is expressed by the lack of any detectable patterns.
b. GPxI of the end-to-end concatenated pure GA-sequences of human chr. 1 shows clearly a number of patterns. Although different, they seem to share a periodicity of 4.
c., d. Use of a modified Markham rotation [3] to demonstrate the prevalence of the 4-periodicity. In panel c the GPxI of panel b is superimposed on itself although frame shifted by 2 bases. The result is a rather featureless gray image. In panel d the applied frame shift is 4. The result is the almost identical re-appearance of the original GPxI, indicating that a frame-shift of 4 reinforces the prevalent patterns.

The genomic 'neighborhood' of pure GA-sequences.

In addition to the pure GA-sequences themselves, I also recorded their 400 [b] large flanks in various chromosomes of human, chimpanzee, rhesus monkey, mouse, and zebrafish. It should be noted that some of the GA-sequences and their flanks had to be omitted to avoid duplications, for the following reason. If 2 consecutive GA-sequences were closer together than the flank size of 400 [b], their flanks would overlap and thus be recorded twice, at least in part. Therefore, the flanks of all GA-sequences closer than 1 [Kb] were eliminated throughout this presentation.
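The recording of flanks and the exclusion of closely spaced GA-sequences can be sketched as follows. Several details are assumptions made for this illustration: the distance is measured from the end of one GA-sequence to the start of the next, both members of a too-close pair are dropped, and GA-sequences without room for full flanks at the chromosome ends are skipped.

```python
def ga_complexes(seq, runs, flank=400, min_gap=1000):
    """Return (upstream flank, GA-sequence, downstream flank) for isolated GA-sequences.

    runs is a list of (start, length) tuples sorted by start position.
    """
    kept = []
    for k, (start, length) in enumerate(runs):
        end = start + length
        # Distance to the previous and to the next GA-sequence, if they exist.
        close_prev = k > 0 and start - (runs[k - 1][0] + runs[k - 1][1]) < min_gap
        close_next = k + 1 < len(runs) and runs[k + 1][0] - end < min_gap
        if close_prev or close_next:
            continue    # flanks would overlap with those of a neighbour
        if start < flank or end + flank > len(seq):
            continue    # not enough sequence for two full flanks
        kept.append((seq[start - flank:start], seq[start:end], seq[end:end + flank]))
    return kept

# Toy chromosome: 5000 T's with four 60-base GA-sequences, two of them only 240 [b] apart.
toy_runs = [(500, 60), (2000, 60), (2300, 60), (4000, 60)]
bases = ["T"] * 5000
for run_start, run_length in toy_runs:
    bases[run_start:run_start + run_length] = "GA" * 30
toy_seq = "".join(bases)
toy_complexes = ga_complexes(toy_seq, toy_runs)
print(len(toy_complexes), "GA-complexes kept")
```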

The GPxI of the first 1,100 GA-complexes (= upstream flank + GA-sequence + downstream flank) of human chr. 1, displayed in their natural order of occurrence, is shown in Fig. 6a. The upstream (= left hand) ends of all GA-complexes were aligned in the vertical direction, which automatically also aligned the upstream ends of the GA-sequences. In contrast, the downstream flanks were not aligned in this GPxI, because the lengths of the pure GA-sequences were variable [see 1], thus pushing the ends of the downstream flanks to variable positions.

There were 5 striking results of the depicted GPxI of the aligned GA-complexes.
  • 1. The pure GA-sequences appeared to contain many non-random patterns.
  • 2. Neighboring GA-sequences seemed to share many patterns as evidenced by the enhanced visibility of the patterns after alignment as in Fig. 6b.
  • 3. Alternating stripes ('upstream stripes') appeared in the upstream flanks of certain primates.
  • 4. In contrast, similarly aligned downstream flanks showed no pattern of any kind.
  • 5. The upstream end of most GA-sequences appeared black in the GPxIs, indicating that these began with poly-A sequences of variable length.

Fig.6. Typical appearance of the GPxI of the GA-complexes (= upstream flank of 400 [b] + GA-sequence + downstream flank of 400 [b]) of human chromosomes.
The GA-complexes are vertically aligned with the upstream ends of their GA-sequences. While the ends of all upstream flanks are automatically aligned, because they extend the same distance from the GA-sequences, the ends of the downstream flanks are not and appear frayed, as the length of each GA-sequence varies. The aligned GA-sequences in their natural order of occurrence in the chromosome are labeled as 'GA-ribbon'.
a. GPxI of the first 1,100 GA-complexes of human chr.1 in their natural order of occurrence in the chromosome. Note the appearance of the 'upstream stripes' (see text) in the aligned upstream flanks and the predominantly black (= poly-A) upstream beginnings of the aligned GA-sequences. (Scale: 50 [b]/division)
b. Enlargement of the frame shown in panel a. The arrow points to the border between upstream flank and GA-sequence. By definition, it consists of T's or C's. (Scale: 50 bases).

A relationship to poly(A)- and Alu-sequences.

A closer inspection of Fig. 6a suggests that the stripe patterns appeared upstream of a pure GA-sequence whenever its upstream end began with a certain stretch of poly(A) (i.e. with many black pixels). In order to test this conjecture, I extended the definition of GA-sequences to include more cases with poly(A) stretches. At this point the reader is reminded that pure GA-sequences were defined as GA-sequences longer than 50 bases in order to exclude poly(A) and poly(G) sequences which, of course, trivially fulfill the definition of a GA-sequence, namely to contain no C's or T's. Therefore, more poly(A)-containing GA-sequences were included by simply easing the size restriction to 20 bases and longer. The resulting GA-sequences will be called 'common' GA-sequences in the following. By definition, the common GA-sequences included the pure ones. Reducing the length restriction yielded a much larger number of GA-sequences. For example, human chromosome 1 contained 1667 pure GA-sequences and 19,513 common GA-sequences. As a result, the ribbon of GA-sequences became much darker in the GPxI and the upstream stripes became much more pronounced (Fig. 7).

Fig.7. Architecture of the upstream flanks of selected chromosomes of various vertebrates.
The GPxIs were obtained by aligning the upstream ends of the common GA-sequences in their natural order of occurrence in the chromosomes. It appears that only human and chimpanzee chromosomes express upstream stripes. Moreover, the upstream stripes of human and chimpanzee were identical. (Scale: 50 [b]/division).

Upstream stripes appeared in identical form in the GPxIs of the (common) GA-complexes of human chromosomes 1 (Fig. 7), 7 and X and even in the GPxIs of chimpanzee chromosomes (Fig. 7). In contrast, chromosomes of rhesus monkey, dog, mouse and zebrafish showed no obvious patterns in the upstream flanks (Fig. 7).

The GPxIs generated from the common GA-sequences of human and chimpanzee chromosomes after re-ordering them by the size of their upstream poly(A)-segment confirmed that the upstream poly(A) stretches were required for the appearance of upstream stripes: Whenever the GA-sequences did not end in an upstream poly(A) motif, upstream stripes were not visible in the GA-complex, either (Fig. 8a). In contrast, when the GPxI of a GA-sequence displayed a predominantly black stretch, the upstream stripes were strongly expressed in its upstream flank (Fig. 8b). The GPxIs also demonstrated that the poly(A)-segments (depicted black in the GPxIs) were located almost exclusively at the upstream ends of the GA-sequences (see e.g. Fig. 7). In this way, the poly(A)-segments created a certain asymmetry and directionality of the GA-sequences, which supports the idea that they may be markers for a reading direction of the GA-sequences. Apparently, in exceptional cases GA-complexes can suffer inversions. After sorting the GA-complexes according to the poly(A) content of their downstream flanks, I found in human chr.7 a handful of GA-complexes whose upstream stripes were absent, while their exact mirror images appeared in the GA-sequences.
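The re-ordering by the size of the upstream poly(A)-segment can be sketched in a few lines. It is assumed here that the segment is simply the uninterrupted run of A's at the upstream end of the GA-sequence; the text does not specify whether interruptions were tolerated. The example sequences are toy strings.

```python
def upstream_poly_a_length(ga_seq):
    """Number of consecutive A's at the upstream end of a GA-sequence."""
    return len(ga_seq) - len(ga_seq.lstrip("A"))

# Toy GA-sequences with upstream poly(A) segments of 0, 6 and 2 bases.
examples = ["GAGAGAAAGG", "AAAAAAGAGG", "AAGGAGAAAG"]

# Sort by decreasing size of the upstream poly(A) segment, as in Fig. 8.
ordered = sorted(examples, key=upstream_poly_a_length, reverse=True)
print(ordered)
```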

In an unrelated study, I searched human chromosome 1 for the locations of Alu sequences. The search used the AluY-sequence as template and tolerated up to 10 point mutations at arbitrary locations for successful matches. Once found, the matching sequences and their 400 [b] large up- and downstream flanks were recorded and used to generate the GPxI of the corresponding Alu-complexes (Fig. 8c). Surprisingly, the upstream stripes of human and chimpanzee appeared identical to the stripe pattern of the Alu-sequences (Fig. 8b,c). A further surprise was the absence of any Alu-patterns in the upstream flanks of the GA-sequences of rhesus monkeys (Fig. 7) or anywhere else in the rhesus genome, as Alu-sequences are generally believed to be shared by all primates.
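A template search that tolerates a limited number of point mutations can be sketched as a sliding-window comparison that counts mismatches. This is a straightforward but slow approach, shown only to make the matching criterion concrete; the actual search program is not described in the text. The 20-base template below is a made-up placeholder, not the AluY sequence.

```python
def find_approx_matches(seq, template, max_mismatches=10):
    """Return (start, mismatches) for every window of seq that differs from
    template by at most max_mismatches substitutions (no insertions or deletions)."""
    hits = []
    t = len(template)
    for start in range(len(seq) - t + 1):
        mismatches = 0
        for a, b in zip(seq[start:start + t], template):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break               # too many substitutions, give up on this window
        else:
            hits.append((start, mismatches))
    return hits

# Placeholder template (hypothetical) and a copy carrying 2 point mutations.
template = "ACGTTGCAACGGTTAACCGG"
mutated = "C" + template[1:10] + "T" + template[11:]
target = "TTTT" + mutated + "TTTT"
print(find_approx_matches(target, template, max_mismatches=2))
```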

Fig.8. Expression of upstream stripes as a function of poly(A) segments located at the upstream end of the GA-sequences. Identity between upstream stripes and Alu-sequences. The GPxIs show portions of the GA-complexes of human chr.1 after sorting them by the decreasing size of poly(A) segments at the upstream end of the GA-sequences. The aligned GA-sequences are labeled as 'GA-alignment' because they are not depicted in their natural order. (Scale: 50 [b]/division)
a. Absence of upstream stripes where the upstream ends of the GA-sequences contained no poly(A) segments.
b. Strong expression of upstream stripes where the GA-sequences ended in large upstream poly(A) segments (black stretches).
c. GPxI of the matches of the Alu-consensus sequence cited in the text and their 400 base large up- and downstream flanks found in human chr.1. Note that the Alu-pattern extends upstream beyond the limit of the consensus sequences. Numerous point mutations can be seen as individual pixels that have a different gray value than the consensus pattern above and below. Furthermore, each Alu-sequence seems to terminate downstream in a stretch of black pixels, i.e. in a poly(A) sequence.

Other kinds of pure base-restricted sequences

In order to place the above findings in a larger perspective, other kinds of sequences should be mentioned that, like pure GA-sequences, are restricted in their base composition. They must belong to one of the following cases:
  • (a) The sequence is restricted to 3 bases, i.e. exactly 1 base is missing (e.g. pure GAT-sequences).
  • (b) The sequence is restricted to 2 bases, i.e. exactly 2 bases are missing (e.g. pure GA-restricted sequences, such as pure GA-sequences).
  • (c) The sequence is restricted to 1 base, i.e. exactly 3 bases are missing (e.g. pure A-restricted sequences which, of course, are also known as poly-A sequences).

The following gives a brief overview of their numbers of occurrence in the example of human chr.1, without going into the same details as in the case of pure GA-sequences.
  • Pure A-, C-, G-, and T-sequences. I found only 7 poly-A sequences > 50 bases in human chr.1. The longest among them measured 69 [b]. Likewise, there were 11 poly-T sequences with the longest measuring 57 [b]. There were no poly-C or poly-G sequences longer than 50 bases.
  • Pure AC-, AT-, GC-, CT-, and GT-restricted sequences. With the exception of pure GC-restricted sequences, all other kinds were found in large numbers. As mentioned earlier, the complements of pure base-restricted sequences were probably base-restricted sequences that were placed there by a previous inversion. Therefore, the following overview of their numbers counts them together, unless they were already combined because they were their own complements, such as pure GC- and TA-restricted sequences. Counting the numbers of such sequences in human chr.1 and 3, I found that pure GA/TC sequences were by far the most frequent (2273 cases), and that pure GC sequences effectively did not exist (5 cases) (Fig.9). The predominance of pure GA-sequences was one of the reasons why this presentation focused on them. As mentioned earlier, random computer-generated control sequences contained not a single case of any of these base-restricted sequences.
  • Pure CGT-, ACG-, AGT-, and ACT-restricted sequences. Their defining property, namely to miss exactly one base, is much less restrictive than the requirements of the pure 1- and 2-base-restricted sequences. Consequently, they were found in much larger numbers. Again combining their numbers with the numbers of their complements, there were 4 times as many pure 3-base-restricted sequences that were missing the C or the G (37,271 cases) than there were sequences that missed the A or the T (9601 cases) in human chr.1. Their average length was 73 [b] (std.dev. 39 [b]).

Fig.9. Numbers of the 4 possible pure 2-base-restricted sequences in human chr.1.
The counts of each are combined with the counts of its complementary sequences. It appears that the pure GA/CT-sequences are the most frequent while pure GC-sequences are too few to appear on the scale of the figure.
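The search for base-restricted sequences generalizes easily to any set of allowed bases. The sketch below requires that a run uses all bases of the allowed set, which is one reading of 'exactly 2 bases are missing'; this requirement, the function names, and the threshold of more than 50 bases are assumptions for illustration.

```python
import re
from itertools import combinations

def restricted_runs(seq, alphabet, min_len=51):
    """Return (start, length) of maximal runs that use only, and all of, the bases in alphabet."""
    pattern = re.compile(f"[{alphabet}]+")
    return [
        (m.start(), len(m.group()))
        for m in pattern.finditer(seq)
        if len(m.group()) >= min_len and set(m.group()) == set(alphabet)
    ]

def count_by_pair(seq, min_len=51):
    """Number of pure 2-base-restricted sequences for each of the 6 possible base pairs."""
    return {
        "".join(pair): len(restricted_runs(seq, "".join(pair), min_len))
        for pair in combinations("ACGT", 2)
    }

# Toy sequence: one GA run of 60 bases, then an AT run that continues into 60 A's.
toy_restricted = "GA" * 30 + "C" + "AT" * 30 + "A" * 60
pair_counts = count_by_pair(toy_restricted)
print(pair_counts)
```

To combine a type with its complement, as done for Fig.9, the counts for AG and CT, and for AC and GT, would be added, while AT and CG are their own complements.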