The genome pixel image (GPxI)

[See ref. 7 (Appendix I)]

For some of the examinations of genomes presented here, it will be advantageous to depict genome sequences in a novel way. Instead of describing them as strings of letters, we will present them as optical images by the GPxI method described earlier. Briefly, the method assigns to the bases of a DNA sequence the following gray-tone values: A: black, G: white, C: dark gray and T: light gray (Fig.A1a). This assignment is, of course, arbitrary, but must remain the same throughout. It transforms the consecutive bases of the sequence into a continuous line of pixels with varying gray values. In addition, the method requires the choice of an arbitrary, but also fixed image width W. Whenever the line of pixels reaches W, it wraps around like a line of text would, and continues at the beginning of the next line immediately underneath. For example, the GPxI of a computer-constructed, random DNA sequence appears as the featureless dot-pattern shown in Fig.A1b.
It is, of course, also possible to choose the image width equal to the size of the depicted sequences. In this way, an array of sequences (e.g. the Alu-sequences) can be written in register.
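
The mapping and the wrap-around described above can be sketched in a few lines of Python. This is only a minimal illustration, not the program used to produce the figures: the function name gpxi, the numeric gray values chosen for 'dark gray' and 'light gray', and the decision to skip any character other than A, C, G and T are assumptions made for this example.

```python
# Gray values on a 0 (black) to 255 (white) scale. A = black and G = white
# follow the text; the two intermediate values are assumed for illustration.
GRAY = {"A": 0, "C": 85, "T": 170, "G": 255}


def gpxi(sequence, width):
    """Return the image as a list of rows, each a list of gray values."""
    if width <= 0:
        raise ValueError("width must be positive")
    # One pixel per base; characters other than A, C, G, T are skipped
    # (an assumption of this sketch).
    pixels = [GRAY[base] for base in sequence.upper() if base in GRAY]
    # Wrap the pixel line at the image width W, like text on a page.
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


for row in gpxi("ACGTACGTAC", 4):
    print(row)
```

With a width of 4, the ten bases of the example fill two complete rows and leave a final row of two pixels. The rows can be handed to any imaging library as 8-bit grayscale data.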

Fig.A1. Basic principle of the 'genome pixel image' (GPxI) method.
(a). DNA sequence written in the traditional way and the assignment of a certain graytone to each base (insert).
(b). Writing the above DNA sequence from left to right while expressing each base as a single pixel with the assigned gray-value yields a line of pixels with varying graytones.
(c). Whenever the pixel line has reached the edge of the image, it wraps around and continues on the left margin.
(d). By omitting the white spaces between the consecutive lines, the Genome Pixel Image (GPxI) emerges.
(e). Examples of the GPxIs of a random DNA file and a highly structured part of the human X-chromosome.

For the visitor who finds the above figure too busy, I add an animated version of the generation of GPxIs.

Fig.A1f. Animation of the basic principle of the 'genome pixel image' (GPxI) method.

Examples of GPxIs and their interpretations

Figure A2 shows the GPxI of the first 150 Kb of the human X chromosome (Fig.A2a). While a size of 150 Kb is already too large for many applications of the traditional alignment methods, the striking patterns visible in the GPxI near the 5’ end immediately highlight the exact locations of candidate repetitive sequences, without any prior knowledge of special properties of the sequences in this location. Furthermore, one can see immediately that these special sequences occur in 2 clusters separated by a large stretch of non-repetitive DNA. Their pseudo-repetitive character becomes obvious through the 2 consecutive magnifications shown in Figures A2b and A2c: the larger the magnification, the less detectable the repetition of any pattern becomes.

Fig.A2. GPxI of the first 150 Kb of the human X chromosome (Un-sequenced portions are omitted). (Scales: 50[b]/division)
(a). Several pseudo-repetitive sequences appear as various, seemingly repetitive patterns. The appearance of identical repetition vanishes with increasing magnification of the GPxI, demonstrating the power of the human visual sense to detect rules and relationships between DNA sequences even after mutations and variations have obliterated them to a large degree.
(b). Enlargement of the portion of the GPxI within the black frame in panel a.
(c). Enlargement of the portion of the GPxI within the black frame in panel b.

The distinction between repetitive and pseudo-repetitive sequences can be tested by GPxIs in a much more objective way, too. Obviously, the appearance of any patterns depends on the width of the GPxI, as it determines which downstream part of a sequence is written directly below it. Given a series of truly repetitive motifs there will exist specific values for the GPxI-width where the motifs fall into perfect register and, thus, generate a pattern of vertical lines. As shown in Figure A3a, at a GPxI-width of 610 [b] the obliquely striped patterns shown in Fig.A2a seem to turn into vertical lines (marked as ‘1’ in Figure A3). However, as shown by the magnified inset at the right hand side, the vertical lines are not perfect. Instead, they fall into 3 groups shifted out of register by 2 insertions. Furthermore, they are interrupted by numerous point mutations that appear as differently colored pixels within many vertical lines. Both properties identify them not only as pseudo-repeats, but also identify the causes of their differences.

Fig.A3. Effect of GPxI-width on pattern appearance and recognition on a portion of the GPxI of Figure A2. The numbers 1, 2, and 3 indicate the same domains on each panel. Enlargements of these domains are shown on the right hand side. (Scale: 50[b]/division)
(a). GPxI-width = 610 [b]. The pattern at '1' turns vertical but, as shown by the enlargement, contains deviations in the form of 2 shifts (=insertions) and single deviant pixels (=point mutations).
(b). GPxI-width = 568 [b]. The domains '2' and '3' appear almost random.
(c). GPxI-width = 551 [b]. Domain '2' shows a clear periodicity with few deviations. Domain '3' shows pseudo-repetitive patterns.

The method of changing the width of the GPxI may also bring out the existence of otherwise easily overlooked relationships. For example, the domains labeled as ‘2’ and ‘3’ in Figure A3 may appear rather unstructured and, thus, unrelated at a GPxI-width of 568 [b], whereas pseudo-repetitive patterns become clearly visible at 551 [b] GPxI-width.
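
The search for a width at which motifs fall into register can also be expressed numerically. The following Python sketch is not part of the GPxI method as described here; it is one possible way, assumed for illustration, to score a candidate width by the fraction of bases that are identical to the base written directly underneath. The function names column_agreement and best_widths are hypothetical.

```python
def column_agreement(sequence, width):
    """Fraction of bases identical to the base exactly one image row below."""
    pairs = len(sequence) - width
    if width <= 0 or pairs <= 0:
        return 0.0
    # In a GPxI of width W, base i sits directly above base i + W.
    matches = sum(1 for i in range(pairs) if sequence[i] == sequence[i + width])
    return matches / pairs


def best_widths(sequence, widths, top=3):
    """Return the highest-scoring (width, score) pairs among the candidates."""
    scored = [(column_agreement(sequence, w), w) for w in widths]
    # Highest score first; ties are broken in favor of the smaller width.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(w, round(score, 3)) for score, w in scored[:top]]


# A perfectly repetitive test sequence with a period of 5 bases.
repeat = "ACGGT" * 20
print(best_widths(repeat, range(2, 12), top=2))  # [(5, 1.0), (10, 1.0)]
```

A perfect repeat scores 1.0 at its period and at multiples of it, which corresponds to a pattern of unbroken vertical lines. Pseudo-repeats would be expected to score below 1.0 because of point mutations and insertions, while a sequence with equal base frequencies and independent positions would be expected to score near 0.25.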

Depiction of selected DNA segments placed in register

In the above examples, the GPx Images depicted continuous genome sequences that continued in the next line underneath whenever the string of pixels reached the margin of the image. In this way, the above patterns became quite visible. However, most parts of natural genomes do not contain as many pseudo-repetitive sequences as the above examples. Nevertheless, visual pattern recognition with the GPxI method can very easily detect homologies, similarities, or deviations from homology. After placing isolated segments of related genome sequences underneath each other and in register, one can compare them visually, even though they belong to different parts of a genome or to the genomes of different species. For example, Fig. A4 shows a collection of Alu-sequences from human chromosome 1 placed in register. It is easy to detect their common features as well as their individual differences (mutations) in this way.

Fig.A4. Visual comparison of 100 selected segments of human chromosome 1, each containing an Alu-sequence with up to 50 point mutations compared to AluY. The segments were placed in register at their downstream ends (=right hand side where the poly-A ('pA') portion begins).
The labels indicate uf: upstream flank, df: downstream flank, pA: typical poly-A portion at the downstream end of Alu-sequences.
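
Placing segments in register at their downstream ends amounts to right-aligning them. The Python sketch below illustrates this under stated assumptions: the function names in_register and mismatches, the padding character '-', and the short test segments are all invented for the example and are not taken from the Alu data of Fig.A4.

```python
def in_register(segments, pad="-"):
    """Right-align segments so that their downstream ends coincide."""
    width = max((len(s) for s in segments), default=0)
    # Shorter segments are padded on the left (upstream) side.
    return [s.rjust(width, pad) for s in segments]


def mismatches(row, reference, pad="-"):
    """Count positions where two aligned rows differ, ignoring padding."""
    return sum(
        1 for a, b in zip(row, reference)
        if pad not in (a, b) and a != b
    )


rows = in_register(["ACGT", "GT", "CAT"])
for row in rows:
    print(row, mismatches(row, rows[0]))
```

Each padded row can then be written as one line of a GPxI whose width equals the length of the longest segment, with the padding shown as blank pixels. Note that simple right-alignment does not compensate for insertions or deletions inside a segment; these would show up as shifts, as in Fig.A3a.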

Depiction of statistical properties of genome sequences

Statistical properties such as AT- or GC-richness can also be depicted by the GPxI method. In this case one may use color coding and superimpose the color on the GPx Image. As the GPx Image is generated line by line, the computer tallies up the ratio r = (A+T)/(G+C) cumulatively as it writes a line of the image. As long as r remains close to unity, the computer tints the pixels green. However, if a particular line is AT-rich, r will rise above unity from left to right and the tint of the pixels will correspondingly become increasingly blue. On the other hand, if the line is GC-rich, the tint will become increasingly red, corresponding to the decrease of r from left to right.

Fig.A5. Line by line assessment of the ratio r = (A+T)/(G+C) in the human Y-chromosome between positions 2,496,000 and 2,808,000.
As the ratio increases from left to right in AT-rich genome segments, the color becomes increasingly blue. If it decreases from left to right in GC-rich genome segments, the color becomes increasingly red. Green lines indicate balanced ratios along the line.
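
The cumulative tally of r along one image line can be sketched as follows. This Python example makes several simplifying assumptions that are not stated in the text: the ratio is reported as undefined (None) until the first G or C of the line has been counted, and the graded tint is reduced to three discrete classes with an arbitrary tolerance of 0.1 around unity. The names cumulative_ratio and tint are hypothetical.

```python
def cumulative_ratio(line):
    """Running value of r = (A+T)/(G+C) after each base of one image line."""
    at = gc = 0
    ratios = []
    for base in line.upper():
        if base in ("A", "T"):
            at += 1
        elif base in ("G", "C"):
            gc += 1
        # r is undefined until the first G or C has been seen (assumption).
        ratios.append(at / gc if gc else None)
    return ratios


def tint(r, tolerance=0.1):
    """Coarse color class for a ratio r; the tolerance is an assumed value."""
    if r is None:
        return "none"
    if r > 1 + tolerance:
        return "blue"   # AT-rich so far
    if r < 1 - tolerance:
        return "red"    # GC-rich so far
    return "green"      # balanced


for r in cumulative_ratio("GATTA"):
    print(r, tint(r))
```

A graded tint, as shown in Fig.A5, would replace the three classes with a continuous color scale driven by the distance of r from unity.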

Animation of GPx Images

GPx Images, like all other images, can be animated, provided that a computer-generated or observed time sequence of the depicted DNA sequence is available. Fig. A6 shows the example of a computer-generated depiction of random transpositions on a portion of the human X-chromosome.

Fig.A6. Animation of a segment of the human X-chromosome to illustrate the effect of (computer-generated) transpositions acting on the segment.
Transpositions appear as shifts of a portion of the sequence, preceded by the temporary appearance of a white segment where the transposon originated.
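
A time sequence of this kind can be generated by repeatedly moving a random portion of the sequence to a random new position. The Python sketch below assumes a simple cut-and-paste model with a fixed transposon length; the actual procedure used for Fig.A6 is not described here, and the function names transpose and random_transposition are invented for the example.

```python
import random


def transpose(sequence, start, length, target):
    """Cut sequence[start:start+length] and reinsert it at index `target`
    of the remaining sequence."""
    segment = sequence[start:start + length]
    remainder = sequence[:start] + sequence[start + length:]
    return remainder[:target] + segment + remainder[target:]


def random_transposition(sequence, length, rng):
    """Apply one transposition with random origin and random destination."""
    if length > len(sequence):
        raise ValueError("transposon longer than sequence")
    positions = len(sequence) - length + 1
    start = rng.randrange(positions)
    target = rng.randrange(positions)
    return transpose(sequence, start, length, target)


# Each frame of the animation is the sequence after one more transposition.
rng = random.Random(0)
frames = ["AAAACCCCGGGGTTTT"]
for _ in range(3):
    frames.append(random_transposition(frames[-1], 4, rng))
for frame in frames:
    print(frame)
```

Each frame keeps the length and base composition of the original sequence; only the order of the bases changes. Rendering every frame as a GPxI of the same width yields the individual images of the animation.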

In summary, the method offers 4 important advantages over the traditional homology-based methods.
(A). No prior knowledge or suspicion of any special relationships between the tested sequences is required.
(B). No special data-processing is required beyond the simple and fast reading of the sequences in question and their base-by-base translation into a line of gray-tone pixels.
(C). In contrast to traditional homology-based methods, which become increasingly cumbersome, time consuming, and difficult to interpret as the numbers and sizes of the tested sequences increase, the detection of patterns in GPxIs becomes even easier and more meaningful under the same circumstances.
(D). Patterns and, thus, relationships between sequences remain easily detectable, even when mutations and other scrambling and distorting influences have randomized the sequences and, thus, reduced the possibility of demonstrating these relationships mathematically to almost nil.