Pure GA-sequences as candidates for genomic sign posts
A simple way to search for natural genomic sign posts would be to look for unexpected, non-coding sequences that occur in large numbers. Here, we describe one specific type, namely sequences between 50 and 1000 bases long that consist exclusively of 2 bases.
The phenomenon of pure GA-sequences
Plotting the lengths of all pure GA-sequences as a function of their position along a random, computer-generated DNA strand showed that such sequences are hardly ever longer than 15 bases (Fig.1a). In stark contrast, a similar plot of an 8 Mb section of human chr.1 between 24 Mb and 32 Mb displayed numerous much longer sequences, including 76 that were between 50 and 269 bases long (Fig.1b).
Fig.1. Length of pure GA-sequences as a function of their starting position along an 8 [Mb] long stretch of DNA sequence. (Abscissa: position in [Mb]; Ordinate: length of pure GA-sequence in [b].)
a. Computer generated random sequence of 20% G's and 30% A's (similar to the human genome).
b. Human chr.1 between positions 24 [Mb] and 32 [Mb]. (Gaps in the density of GA-sequences are due to un-sequenced regions).
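The scan behind Fig.1 can be sketched as follows. This is a minimal illustration, not the program used for the figures: the function names are hypothetical, and the 20% C / 30% T share of the random strand is an assumption (only the G and A percentages are given in the caption).

```python
import random
import re

def ga_runs(seq, min_len=1):
    """Return (start, length) of every maximal run of G/A with length >= min_len."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"[GA]+", seq.upper())
            if m.end() - m.start() >= min_len]

def random_dna(n, seed=0):
    """Random strand with 20% G and 30% A (Fig.1a); C/T split is an assumption."""
    rng = random.Random(seed)
    return "".join(rng.choices("GACT", weights=[20, 30, 20, 30], k=n))

strand = random_dna(200_000)
longest = max(length for _, length in ga_runs(strand))
print("longest GA run in the random strand:", longest)
print("runs of 50 bases or more:", len(ga_runs(strand, min_len=50)))
```

Plotting the `(start, length)` pairs of a real chromosome section in the same way gives the kind of comparison shown in panels a and b.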
The size-distribution of pure GA-sequences
The probability that A's and G's occur N times in a row should be a rapidly diminishing exponential function of N. Indeed, a logarithmic plot of the length frequencies of the pure GA-sequences of a computer-generated, random DNA sequence yielded a straight line (Fig. 2). In contrast, a logarithmic plot of the observed length frequencies of pure GA-sequences in actual genomes followed a power-law distribution. Figure 2 shows the example of the human genome. After normalization for the values of size = 1 to by dividing each raw count by the sum of all raw counts the probability density function became
Fig.2. Distribution of GA-sequence lengths. Abscissa: GA-sequence length [b]. Ordinate: logarithm of counts. The vertical line indicates the defining threshold of 50 [b] for pure GA-sequences.
a. Exponential distribution of the computer-generated GA-sequence of Fig.1a.
b. The Pareto-distribution of GA-sequence lengths of the entire human genome.
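The exponential expectation for a random strand can be checked numerically. If each base is G or A with probability p = 0.5 (20% G plus 30% A, as in Fig.1a), maximal run lengths are geometrically distributed, P(N) = (1 - p) p^(N-1), so the logarithm of the counts should fall on a straight line with slope ln p. The sketch below is only an illustration under that assumption; the helper names and the C/T composition are hypothetical.

```python
import math
import random
import re
from collections import Counter

def run_length_counts(seq):
    """Histogram of maximal G/A run lengths: {length: count}."""
    return Counter(len(m.group()) for m in re.finditer(r"[GA]+", seq))

def log_slope(counts, max_len=10):
    """Least-squares slope of ln(count) versus length for lengths 1..max_len."""
    xs = [n for n in range(1, max_len + 1) if counts.get(n, 0) > 0]
    ys = [math.log(counts[n]) for n in xs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

rng = random.Random(1)
strand = "".join(rng.choices("GACT", weights=[20, 30, 20, 30], k=500_000))
slope = log_slope(run_length_counts(strand))
print(slope, math.log(0.5))  # the fitted slope should lie close to ln(0.5)
```

Applying the same histogram to a real genome and plotting it on doubly logarithmic axes is one way to look for the power-law behaviour described above.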
The exclusion of poly-A and poly-G sequences
Obviously, one may consider poly-A and poly-G sequences as special cases of pure GA-sequences that contain only one of the 2 bases. As these are well described in the literature, the present examination intended to exclude them while focusing specifically on pure GA-sequences that contain both nucleotides. As it turned out, this could be accomplished quite simply by setting a length threshold, because poly-A and poly-G sequences were very rarely longer than 50 bases.
Relationship between chromosome size and numbers of pure GA-sequences
The numbers of pure GA-sequences were approximately proportional to the size of the chromosome to which they belonged. A correlation plot between the size of each human chromosome and its number of pure GA-sequences yielded a correlation coefficient of 0.907 (Fig.3). Its slope corresponded to an average density of GA-sequences of 6.9 [sequences/Mb].
Fig.3. Relationship between chromosome size and number of pure GA-sequences among the 23 human chromosomes (Y-chromosome was omitted).
The species-dependent spatial density of pure GA-sequences
The densities of pure GA-sequences from chromosomes of different species differed to a much greater degree than the densities of different chromosomes from the same species. For example, the average density of the 23 human chromosomes (excluding the Y-chromosome) was 6.9 [sequences/Mb] (std.dev. = 2.1 [sequences/Mb]). In contrast, the densities of mouse chr.2 and rat chr.3 were 3 and 4.7 times larger, respectively (Fig.4a,b). On the other hand, maize, Arabidopsis, and C. elegans had 10 to 50 times smaller densities of pure GA-sequences than human chr.1 (Fig.4c). Thus, densities could vary up to 250-fold between species. It should be noted that maize, Arabidopsis, and C. elegans not only contained very few pure GA-sequences; the few they contained were effectively just poly-GA sequences.
Fig.4. Density of pure GA-sequences for various mammals and other organisms.
a. Human chr.1 between positions 8 [Mb] and 16 [Mb]. (axes as in Fig.1).
b. Mouse chr.2 between positions 16 [Mb] and 24 [Mb]. (axes as in Fig.1).
c. Density of pure GA-sequences for individual chromosomes of various mammalian and non-mammalian species (Ordinate: number of pure GA-sequences per [Mb] of sequence). The tags along the abscissa indicate the various chromosomes in order from left to right: mouse chr.1, rat chr.1, dog chr.1, human chr.1, chimpanzee chr.1, zebrafish chr.1, cat genome segment 1, maize chr.1, Arabidopsis chr.1, and Caenorhabditis elegans chr.X.
The individuality of pure GA-sequences
There are $2^{93} \approx 10^{28}$ different ways to generate different pure GA-sequences that are on average 93 bases long. This astronomically large number would be able to afford each pure GA-sequence its own, individual sequence.
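The size of this number is easy to verify: a sequence of 93 positions, each of which is either G or A, can be written in 2 to the power of 93 ways. A short check (variable names are illustrative only):

```python
import math

mean_length = 93                 # average pure GA-sequence length quoted in the text
n_sequences = 2 ** mean_length   # each position is either G or A
print(n_sequences)               # exact integer with 28 digits
print(math.log10(n_sequences))   # close to 28, i.e. roughly 10^28
```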
Are pure GA-sequences coding?
There are 8 codons that consist exclusively of G's and A's, namely AAA, AAG, AGA, AGG, GAA, GAG, GGA, and GGG. They code for Arg, Gly, Glu, and Lys. As it seemed conceivable that some pure GA-sequences coded for proteins that contained chains of these 4 amino acids, I tested how many pure GA-sequences were transcribed into mRNAs. Searching all human transcripts (440 [Mb]) for pure GA-sequences longer than 50 bases and excluding, of course, all poly-A sequences, I found 394 cases. In contrast, the entire human genome contained a total of 19,139 pure GA-sequences, indicating that at most about 2% of them are transcribed into messenger RNAs. Thus, for all practical purposes the pure GA-sequences may be considered non-coding for proteins.
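The list of GA-only codons and their amino acids can be reproduced by enumerating all triplets over the alphabet {A, G} and looking them up in the standard genetic code. The table below contains only the 8 relevant entries; its name is chosen for this illustration.

```python
from itertools import product

# Standard genetic code, restricted to codons built from G and A only.
GA_CODON_TABLE = {
    "AAA": "Lys", "AAG": "Lys",
    "AGA": "Arg", "AGG": "Arg",
    "GAA": "Glu", "GAG": "Glu",
    "GGA": "Gly", "GGG": "Gly",
}

codons = ["".join(c) for c in product("AG", repeat=3)]
amino_acids = sorted({GA_CODON_TABLE[c] for c in codons})
print(codons)       # the 8 pure GA codons, AAA through GGG
print(amino_acids)  # ['Arg', 'Glu', 'Gly', 'Lys']
```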
The tetra-GA motifs of pure GA-sequences
Concatenating end-to-end all 1667 pure GA-sequences of human chr.1 yielded the GPxI shown in Figure 5b. The comparison with a computer-constructed random GA-sequence file (Figure 5a) confirmed that the pure GA-sequences contain many repetitive patterns.
Fig.5. Predominance of tetra-GA motifs in the pure GA-sequences of human chr. 1 as demonstrated by the GPxI method. The highlighted fields in the left-hand panels are enlarged in the right-hand panels. (Scales: 50 [b]/division)
a. The GPxI of a computer-constructed DNA file consisting of random sequences of G (white pixels) and A (black pixels). Therefore, no pixels with other gray-values are visible. The randomness of the sequences is expressed by the lack of any detectable patterns.
b. GPxI of the end-to-end concatenated pure GA-sequences of human chr. 1 shows clearly a number of patterns. Although different, they seem to share a periodicity of 4.
c., d. Use of a modified Markham rotation to demonstrate the prevalence of the 4-periodicity. In panel c the GPxI of panel b is superimposed on itself, but frame-shifted by 2 bases. The result is a rather featureless gray image. In panel d the applied frame shift is 4. The result is the almost identical re-appearance of the original GPxI, indicating that a frame shift of 4 reinforces the prevalent patterns.
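The idea behind panels c and d can also be expressed numerically, without images. The sketch below is a simple stand-in for the superposition, not the GPxI or Markham procedure itself: it measures the fraction of positions at which a sequence agrees with a copy of itself shifted by k bases. The function name and the toy input are hypothetical.

```python
def shift_agreement(seq, shift):
    """Fraction of positions i where seq[i] equals seq[i + shift]."""
    pairs = list(zip(seq, seq[shift:]))
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy input (not real data): a tetra-GA motif repeated end to end.
toy = "GAAA" * 25
print(shift_agreement(toy, 2))  # 0.5
print(shift_agreement(toy, 4))  # 1.0
```

For a sequence dominated by motifs of period 4, agreement at a shift of 4 should be markedly higher than at a shift of 2, which corresponds to the reinforced pattern in panel d and the featureless image in panel c.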
The genomic 'neighborhood' of pure GA-sequences
In addition to the pure GA-sequences themselves, I also recorded their 400 [b] flanks in various chromosomes of human, chimpanzee, rhesus monkey, mouse, and zebrafish. Some of the GA-sequences and their flanks had to be omitted to avoid duplications: if 2 consecutive GA-sequences were closer together than the flank size of 400 [b], their flanks would overlap and thus be recorded twice, at least in part. Therefore, the flanks of all GA-sequences closer than 1 [Kb] to each other were eliminated throughout this presentation.
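The recording of flanks and the removal of closely spaced GA-sequences might be sketched as follows. Two points are assumptions of this sketch rather than statements of the text: the distance is measured from the end of one GA-sequence to the start of the next, and both members of a close pair are dropped. All names are hypothetical.

```python
import re

FLANK = 400      # flank size in bases, as in the text
MIN_GAP = 1000   # GA-sequences closer together than this are dropped

def ga_complexes(seq, min_len=50, flank=FLANK, min_gap=MIN_GAP):
    """Return (upstream flank, GA-sequence, downstream flank) for isolated runs."""
    runs = [(m.start(), m.end())
            for m in re.finditer(r"[GA]{%d,}" % min_len, seq)]
    out = []
    for i, (start, end) in enumerate(runs):
        near_prev = i > 0 and start - runs[i - 1][1] < min_gap
        near_next = i + 1 < len(runs) and runs[i + 1][0] - end < min_gap
        if near_prev or near_next:
            continue  # flanks could overlap a neighbour, so skip this run
        out.append((seq[max(0, start - flank):start],
                    seq[start:end],
                    seq[end:end + flank]))
    return out
```

Aligning the returned triples at the boundary between upstream flank and GA-sequence gives the layout described for the GA-complexes in Fig.6.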
Fig.6. Typical appearance of the GPxI of the GA-complexes (= upstream flank of 400 [b] + GA-sequence + downstream flank of 400 [b]) of human chromosomes.
The GA-complexes are vertically aligned with the upstream ends of their GA-sequences. While the ends of all upstream flanks are automatically aligned, because they extend the same distance from the GA-sequences, the ends of the downstream flanks are not and appear frayed, as the length of each GA-sequence varies. The aligned GA-sequences in their natural order of occurrence in the chromosome are labeled as 'GA-ribbon'.
a. GPxI of the first 1,100 GA-complexes of human chr.1 in their natural order of occurrence in the chromosome. Note the appearance of the 'upstream stripes' (see text) in the aligned upstream flanks and the predominantly black (= poly-A) upstream beginnings of the aligned GA-sequences. (Scale: 50 [b]/division)
b. Enlargement of the frame shown in panel a. Arrow points to the border between upstream flank and GA-sequence. By definition, it consists of T's or C's. (Scale: 50 bases).
A relationship to poly(A)- and Alu-sequences
A closer inspection of Fig. 6a suggests that the stripe patterns appeared upstream of a pure GA-sequence whenever its upstream end began with a certain stretch of poly(A) (i.e. with many black pixels). In order to test this conjecture, I extended the definition of GA-sequences to include more cases with poly(A) stretches. At this point the reader is reminded that pure GA-sequences were defined as GA-sequences longer than 50 bases in order to exclude poly(A) and poly(G) sequences, which, of course, trivially fulfill the definition of a GA-sequence, namely to contain no C's or T's. Therefore, the inclusion of more poly(A)-containing GA-sequences was achieved by simply easing the size restriction down to 20 bases and longer. The resulting GA-sequences will be called 'common' GA-sequences in the following. By definition, the common GA-sequences included the pure ones. Reducing the length restriction yielded a much larger number of GA-sequences. For example, human chromosome 1 contained 1667 pure GA-sequences and 19,513 common GA-sequences. As a result, the ribbon of GA-sequences became much darker in the GPxI and the upstream stripes became much more pronounced (Fig. 7).
Fig.7. Architecture of the upstream flanks of selected chromosomes of various vertebrates.
The GPxIs were obtained by aligning the upstream ends of the common GA-sequences in their natural order of occurrence in the chromosomes. It appears that only human and chimpanzee chromosomes express upstream stripes. However, the upstream stripes of human and chimpanzee were identical. (Scale: 50 [b]/division).
Fig.8. Expression of upstream stripes as a function of poly(A) segments located
at the upstream end of the GA-sequences. Identity between upstream stripes and Alu-sequences.
The GPxIs show portions of the GA-complexes of human chr.1 after sorting them by the
decreasing size of poly(A) segments at the upstream end of the GA-sequences. The aligned
GA-sequences are labeled as 'GA-alignment' because they are not depicted in their
natural order. (Scale: 50[b]/division)
a. Absence of upstream stripes where the upstream ends of the GA-sequences contained no poly(A) segments.
b. Strong expression of upstream stripes where the GA-sequences ended in large upstream poly(A) segments (black stretches).
c. GPxI of the matches of the Alu-consensus sequence cited in the text and their 400 base large up- and down-stream flanks found in human chr.1. Note that the Alu-pattern extends upstream beyond the limit of the consensus sequences. Numerous point mutations can be seen as individual pixels that have a different gray value than the consensus pattern above and below. Furthermore, each Alu-sequence seems to terminate downstream in a stretch of black pixels, i.e. in a poly(A) sequence.
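The sorting used for Fig.8 can be illustrated with a few lines of code. The sketch assumes that the 'poly(A) segment at the upstream end' means the uninterrupted run of A's with which a GA-sequence begins; the exact criterion is not specified here, and the function names and toy input are hypothetical.

```python
import re

def leading_poly_a(ga_seq):
    """Length of the uninterrupted run of A's at the upstream end."""
    return len(re.match(r"A*", ga_seq).group())

def sort_by_poly_a(ga_seqs):
    """Order GA-sequences by decreasing upstream poly(A) length (stable sort)."""
    return sorted(ga_seqs, key=leading_poly_a, reverse=True)

# Toy input, not real data.
toy = ["GAGAAG", "AAAAGAG", "AAGGA"]
print(sort_by_poly_a(toy))  # ['AAAAGAG', 'AAGGA', 'GAGAAG']
```

Sorting the full GA-complexes with the same key keeps each GA-sequence together with its flanks, so that sequences without an upstream poly(A) segment (panel a) and those with long ones (panel b) end up in separate parts of the image.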
Other kinds of pure base-restricted sequences
In order to place the above findings in a larger perspective, other kinds of sequences should be mentioned that, like pure GA-sequences, are restricted in their base composition. They must belong to one of the following cases: The following gives a brief overview of their numbers of occurrence in the example of human chr.1, without going into the same details as in the case of pure GA-sequences.
Fig.9. Numbers of the 4 possible pure 2-base-restricted sequences in human chr.1.
The counts of each are combined with the counts of its complementary sequences. It appears that the pure GA/CT-sequences are the most frequent while pure GC-sequences are too few to appear on the scale of the figure.
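A count of this kind could be obtained along the following lines. The 4 classes follow from pairing each 2-base alphabet with its complement (GA with CT, GT with CA, while AT and GC are their own complements). Whether the same 50-base threshold was applied to all classes is an assumption of this sketch, as are the names used.

```python
import re

# Each class pairs a 2-base alphabet with its complementary alphabet.
CLASSES = {
    "GA/CT": ("GA", "CT"),
    "GT/CA": ("GT", "CA"),
    "AT": ("AT",),   # self-complementary alphabet
    "GC": ("GC",),   # self-complementary alphabet
}

def count_restricted(seq, min_len=50):
    """Count maximal runs of at least min_len bases for each 2-base class."""
    counts = {}
    for name, alphabets in CLASSES.items():
        counts[name] = sum(
            len(re.findall(r"[%s]{%d,}" % (alpha, min_len), seq))
            for alpha in alphabets)
    return counts
```

Note that this simple count also includes runs made of a single base, such as long poly-A stretches; according to the text these are very rarely longer than 50 bases.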