Abstract

The genetic code is the set of rules by which DNA stores the genetic information for the formation of protein molecules. In this paper we discuss an algebraic structure of the genetic code in terms of the four DNA bases (A, C, G, T). We obtain some relations between transition/transversion mutations of codons and the algebraic properties of the respective codons in the group structure. We also construct a distance matrix of the amino acids and establish some relations between this distance matrix and the physico-chemical properties of amino acids. Further, we argue that the distance matrix reflects the evolutionary pattern of amino acids.

Key Words: DNA, genetic code, mutation, algebra, distance matrix.

Introduction

The genetic code is the set of rules by which information encoded in genetic material (DNA or RNA sequences) is translated into proteins. Proteins are the basic constructional blocks and functional elements of living organisms. Amino acids are the building blocks of proteins.

Each protein is formed by a linear chain of amino acids. Twenty different amino acids are known to occur in proteins. The molecular repositories of genetic information are the nucleic acids, of which two types are found in cells: deoxyribonucleic acid (DNA) and ribonucleic acid (RNA). The information is stored in their sequences of nucleotides. Chains of amino acids are assembled according to RNA, and RNA is obtained from DNA. Each amino acid is coded by a triplet (codon) of the four possible bases (A, C, G, T) of DNA. The chain of amino acids takes on different shapes to form different proteins. DNA consists of two complementary long chains of nucleotides. Watson and Crick (1953) proposed a model of the way in which the nucleotide chains may produce copies of themselves. According to Watson and Crick, an adenine (A) of one strand is always paired with a thymine (T) on the other. Similarly, a guanine (G) of one strand is paired with a cytosine (C) on the other. Therefore, the sequence of bases of one strand is enough to deduce the other. For example, if the bases along one strand are AGTCGCTA, then the other strand has the complementary sequence TCAGCGAT. Mathematically, DNA can be considered as a sequence of the four letters A, C, G, T.
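The pairing rule can be illustrated with a minimal Python sketch. The function name `complementary_strand` is an illustrative choice and not part of the original work; the strand is complemented base by base in the same reading direction, as in the example above.

```python
# Watson-Crick pairing for DNA bases, as described in the text.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand)


print(complementary_strand("AGTCGCTA"))  # TCAGCGAT
```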

The transmission of information from DNA to protein building goes through two processes: transcription and translation. Due to mutation, the sequence of bases is not always copied precisely when a strand of DNA is replicated. This results in a change in protein formation. Different kinds of mutations are possible in codons, namely point mutation, frame-shift mutation, deletion, insertion and inversion. In this paper we consider only point mutations. In a point mutation there is a simple change in one base of the gene sequence: a single nucleotide of the genetic material, DNA or RNA, is replaced with another nucleotide. Such mutations may occur at a single point, at two points, and so on. A point mutation from a purine (A, G) to a purine, or from a pyrimidine (C, T) to a pyrimidine, is known as a transition mutation, and a point mutation from a purine to a pyrimidine or vice versa is known as a transversion mutation. Point mutations usually take place during DNA replication.
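The distinction between transitions and transversions can be sketched in Python as follows. The name `mutation_type` is hypothetical, and U is included among the pyrimidines as an assumption so that the same function also covers RNA codons.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}  # U is included so RNA bases are covered too


def mutation_type(old: str, new: str) -> str:
    """Classify a single-base change as 'none', 'transition' or 'transversion'."""
    if old == new:
        return "none"
    same_class = {old, new} <= PURINES or {old, new} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(mutation_type("A", "G"))  # transition (purine to purine)
print(mutation_type("A", "C"))  # transversion (purine to pyrimidine)
```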

In RNA, 64 codons make up the genetic code, though there are only 20 amino acids. This means that there is some redundancy, i.e., more than one codon codes for the same amino acid. We can regard this as a many-to-one function carrying codons to amino acids. It is therefore of interest to find out whether the genetic code has any mathematical property which is optimized when the number of codons is nearly thrice the number of amino acids (Balakrishnan, 2002). For this purpose, different formal mathematical models of the genetic code have been proposed.

Any three bases among the four DNA bases form a codon, and the importance of each base position is suggested by the error (accepted mutation) frequency found in the codons. The frequency of errors decreases from the third base to the first base and then to the second base. That is, the second base is biologically the most relevant and the third base the least relevant base in the codon. The second position (the most significant base position) of codons is also connected with the hydrophobicity of amino acids. The amino acids having A at the second position of their codons are hydrophilic: {D, E, H, N, K, Q, Y}, and those with U at the second position are hydrophobic: {I, L, M, F, V} (Watson & Crick, 1953).
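The two amino acid sets quoted above can be read off the standard genetic code. The following Python sketch is only an illustration: the string `AMINO` lists the one-letter amino acid symbols of the standard code with the bases ordered A, C, G, U at each position ("-" marks stop codons), and the helper name `amino_acids_with_second_base` is hypothetical.

```python
from itertools import product

BASES = "ACGU"
# One-letter amino acid symbols of the standard code, codons ordered
# AAA, AAC, AAG, AAU, ACA, ..., UUU ("-" marks stop codons).
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "-Y-YSSSS-CWCLFLF")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def amino_acids_with_second_base(base):
    """Amino acids (stop codons excluded) whose codons have `base` in position 2."""
    return {aa for codon, aa in CODE.items() if codon[1] == base and aa != "-"}


print(sorted(amino_acids_with_second_base("A")))  # D, E, H, K, N, Q, Y
print(sorted(amino_acids_with_second_base("U")))  # F, I, L, M, V
```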

Many attempts (Antoneli et al., 2003; Balakrishnan, 2002; Bashford et al., 1998; Bashford et al., 2000; Beland and Allen, 1994; Lehmann, 2000; Robin et al., 1999; Schuster et al., 1994; Stadler et al., 2001; Siemion et al., 1995, and so on) have been made to introduce a formal algebraic characterization of the genetic code. Hornos and Hornos (1993) were the first to introduce group theoretical methods in the study of the genetic code. Sanchez et al. (2004, 2005a, 2005b, 2005c) brought a new idea for describing the quantitative relationship between DNA genomic sequences. Sanchez et al. (2004, 2005a) proposed a Boolean structure of the genetic code in which the partial order of the codon set and the Boolean deductions between codons are connected to the physico-chemical properties of amino acids. Again, in Sanchez et al. (2005c) it was shown that, using the same base properties, it is possible to infer a different codon order and a different algebraic structure of the genetic code. Working in the same field, Ali and Phukan (2013) discussed another algebraic structure of the genetic code. From both of these algebraic structures, some interesting connections between algebraic and biological properties have been observed. In Gohain et al. (2015), a lattice structure of the genetic code was developed, wherein some interesting relations between the lattice structure and biological aspects were observed. In this paper we investigate some concepts of algebra in the genetic code. We also explore the evolutionary pattern of amino acids based on the Hamming distance.

Algebraic structure on genetic code

Sanchez et al. (2005c) observed that the four RNA (or DNA) bases can be ordered by considering the codon-anticodon interactions between them. The hydrogen bond number and the chemical type (purine or pyrimidine) of the bases play an important role in this. From this, two orders of the base set, {A, C, G, U} and {U, G, C, A}, are obtained, and a sum operation (Table 1) is defined on these two base sets. With this operation the two sets are isomorphic to the cyclic group Z4 (the group of integers modulo 4). They considered the following as important criteria for determining the orders:

1. Chemical types cause the main difference between bases.

2. The greatest difference between one element and the next serves as the criterion to select arrangements.

3. The starting base needs a minimum hydrogen bond number.

Working in the same field and considering the same order of bases, Ali and Phukan (2013) defined a product operation (Table 1) on the base set P = {A, C, G, U}. With these two binary operations (sum and product) the set P fulfils the axioms of a ring. In the ring (P, +, •), A is the additive identity and C is the multiplicative identity. Thus P is a commutative ring with identity.

Sum +   A  C  G  U        Product •   A  C  G  U
A       A  C  G  U        A           A  A  A  A
C       C  G  U  A        C           A  C  G  U
G       G  U  A  C        G           A  G  A  G
U       U  A  C  G        U           A  U  G  C

Table 1: Sum and product operations on P = {A, C, G, U}
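Both operations of Table 1 coincide with addition and multiplication modulo 4 under the identification A = 0, C = 1, G = 2, U = 3. The following Python sketch regenerates the two tables under that identification; the function names `base_sum` and `base_product` are illustrative choices.

```python
BASES = "ACGU"  # positions 0, 1, 2, 3 in the ordered base set P


def base_sum(x, y):
    """Sum of two bases: addition of their positions modulo 4."""
    return BASES[(BASES.index(x) + BASES.index(y)) % 4]


def base_product(x, y):
    """Product of two bases: multiplication of their positions modulo 4."""
    return BASES[(BASES.index(x) * BASES.index(y)) % 4]


# Print each row of the sum table and of the product table side by side.
for x in BASES:
    sums = " ".join(base_sum(x, y) for y in BASES)
    products = " ".join(base_product(x, y) for y in BASES)
    print(x, "|", sums, "|", products)
```

The printed rows can be compared entry by entry with Table 1; for instance G • G = A, which shows that G is a zero-divisor of P.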

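Ali and Phukan (2013) arranged all the codons in the genetic code table (Table 2) by using the Cartesian product of the ring P, i.e., $P \times P \times P$, and denote it as $C_G$, where

$C_G = \{XYZ : X, Y, Z \in P\}$.

Each codon of the form XYZ is associated with the element (X, Y, Z) of $P \times P \times P$, and thus a one-to-one correspondence can be established between the set $C_G$ and $P \times P \times P$. Next, a sum and a product operation are defined between codons componentwise:

$X_1Y_1Z_1 + X_2Y_2Z_2 = (X_1 + X_2)(Y_1 + Y_2)(Z_1 + Z_2)$,

$X_1Y_1Z_1 \cdot X_2Y_2Z_2 = (X_1 \cdot X_2)(Y_1 \cdot Y_2)(Z_1 \cdot Z_2)$.

With these two operations $C_G$ possesses a ring structure and is isomorphic to $Z_4 \times Z_4 \times Z_4$. For example, the element AUG of $C_G$ corresponds to the element 032 of $Z_4 \times Z_4 \times Z_4$. The genetic code with the corresponding amino acids is shown in Table 2.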
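The correspondence between codons and triples of integers modulo 4 can be sketched as follows. This is a minimal illustration that assumes the sum and product of codons are taken position by position, which is what the isomorphism with the direct product of three copies of Z4 suggests; all function names are hypothetical.

```python
BASES = "ACGU"  # A = 0, C = 1, G = 2, U = 3


def to_z4(codon):
    """Map a codon such as 'AUG' to its triple of integers modulo 4."""
    return tuple(BASES.index(b) for b in codon)


def from_z4(triple):
    """Map a triple of integers (reduced modulo 4) back to a codon."""
    return "".join(BASES[i % 4] for i in triple)


def codon_sum(c1, c2):
    """Position-wise sum of two codons (assumed componentwise)."""
    return from_z4(a + b for a, b in zip(to_z4(c1), to_z4(c2)))


def codon_product(c1, c2):
    """Position-wise product of two codons (assumed componentwise)."""
    return from_z4(a * b for a, b in zip(to_z4(c1), to_z4(c2)))


print(to_z4("AUG"))                # (0, 3, 2)
print(codon_sum("CAG", "UAG"))     # AAA
print(codon_product("CCC", "AUG")) # AUG
```

Under this reading AAA is the additive identity and CCC is the multiplicative identity of the codon ring.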

                          Second base
First   A             C             G             U             Third
base    (1) (2) (3)   (1) (2) (3)   (1) (2) (3)   (1) (2) (3)   base

A       000 AAA K     010 ACA T     020 AGA R     030 AUA I     A
        001 AAC N     011 ACC T     021 AGC S     031 AUC I     C
        002 AAG K     012 ACG T     022 AGG R     032 AUG M     G
        003 AAU N     013 ACU T     023 AGU S     033 AUU I     U

C       100 CAA Q     110 CCA P     120 CGA R     130 CUA L     A
        101 CAC H     111 CCC P     121 CGC R     131 CUC L     C
        102 CAG Q     112 CCG P     122 CGG R     132 CUG L     G
        103 CAU H     113 CCU P     123 CGU R     133 CUU L     U

G       200 GAA E     210 GCA A     220 GGA G     230 GUA V     A
        201 GAC D     211 GCC A     221 GGC G     231 GUC V     C
        202 GAG E     212 GCG A     222 GGG G     232 GUG V     G
        203 GAU D     213 GCU A     223 GGU G     233 GUU V     U

U       300 UAA –     310 UCA S     320 UGA –     330 UUA L     A
        301 UAC Y     311 UCC S     321 UGC C     331 UUC F     C
        302 UAG –     312 UCG S     322 UGG W     332 UUG L     G
        303 UAU Y     313 UCU S     323 UGU C     333 UUU F     U

(1) Corresponding elements of $Z_4 \times Z_4 \times Z_4$, (2) the base triplets (codons), (3) the one-letter symbols of amino acids; "–" corresponds to stop codons.

Table 2: Genetic code table

We propose the following definition.

Definition: Codons in which all bases are purines are termed even codons, and codons in which at least one base is a pyrimidine are termed odd codons.

We use the term even because every base of such a codon occupies an even position in P (A = 0, G = 2).

It is observed that the set of all even codons, that is, {AAA, AAG, GAA, GAG, AGA, AGG, GGA, GGG}, is a subgroup of the group $(C_G, +)$.

The orders of the elements of the group $(C_G, +)$ divide the group into three classes. The following table gives the order of the codons.

Order   Related codons
1       AAA
2       AAG, GAA, GAG, AGA, AGG, GGA, GGG
4       AAC, AAU, CAA, CAC, CAG, CAU, GAC, GAU, UAA, UAC, UAG, UAU,
        ACA, ACC, ACG, ACU, CCA, CCC, CCG, CCU, GCA, GCC, GCG, GCU,
        UCA, UCC, UCG, UCU, AGC, AGU, CGA, CGC, CGG, CGU, GGC, GGU,
        UGA, UGC, UGG, UGU, AUA, AUC, AUG, AUU, CUA, CUC, CUG, CUU,
        GUA, GUC, GUG, GUU, UUA, UUC, UUG, UUU

Table 3: Partition of the group $(C_G, +)$ into three classes w.r.t. the orders of its elements
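The class sizes of Table 3 and the subgroup property of the even codons can be checked by brute force. The sketch below assumes the position-wise sum of codons modulo 4 with A = 0, C = 1, G = 2, U = 3; the names `order` and `codon_sum` are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "ACGU"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]


def codon_sum(c1, c2):
    """Position-wise sum of two codons modulo 4 (assumed componentwise)."""
    return "".join(BASES[(BASES.index(a) + BASES.index(b)) % 4]
                   for a, b in zip(c1, c2))


def order(codon):
    """Additive order: the smallest n >= 1 such that n * codon = AAA."""
    total, n = codon, 1
    while total != "AAA":
        total, n = codon_sum(total, codon), n + 1
    return n


# Number of codons of each order.
order_counts = Counter(order(c) for c in CODONS)
print(sorted(order_counts.items()))  # [(1, 1), (2, 7), (4, 56)]

# The even codons are closed under the sum, hence form a subgroup.
even = {"".join(c) for c in product("AG", repeat=3)}
closed = all(codon_sum(x, y) in even for x in even for y in even)
print(closed)  # True
```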

The transition mutations (purine to purine or pyrimidine to pyrimidine) and transversion mutations of codons are connected with changes in parity (change from an odd codon to an even codon or vice versa) and in the order of codons (order as an element of the additive group). Following are a few connections that we have observed:

1. A one-point transition of any base keeps the codon parity and, except for transitions to or from AAA (the only codon of order 1), the codon order. These mutations generally do not introduce extreme changes in physico-chemical properties.

2. A transversion of any base of an even codon changes the codon parity as well as the codon order.

3. A transversion at the first or third base of a codon having a pyrimidine as second base (the biologically most significant position) keeps the codon parity as well as the codon order.

4. All odd codons have maximal order and all even codons have order less than that.

5. During a single base transversion, even codons are always mutated to odd codons, and for each even codon and each base position the two resulting mutated codons are algebraic inverses of one another. For example, the first base transversions of the even codon AAG are CAG and UAG, which are algebraic inverses of one another.

6. By a first base transversion, the even codons are changed to codons that code for a polar amino acid; by a second base transversion, to codons that code for a hydrophobic amino acid; and by a third base transversion, to codons that code for a small amino acid. Also, under a third base transversion, a hydrophilic (hydrophobic) even codon changes to a hydrophilic (hydrophobic) codon.
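Some of the algebraic observations listed above can be checked exhaustively. The following Python sketch assumes the position-wise sum of codons modulo 4 (A = 0, C = 1, G = 2, U = 3); the helper `point_mutants` and the other names are hypothetical. It tests observation 5 for all even codons and lists the transitions that do not preserve the order.

```python
from itertools import product

BASES = "ACGU"
PURINES = set("AG")
CODONS = ["".join(c) for c in product(BASES, repeat=3)]


def codon_sum(c1, c2):
    return "".join(BASES[(BASES.index(a) + BASES.index(b)) % 4]
                   for a, b in zip(c1, c2))


def order(codon):
    """Additive order: the smallest n >= 1 such that n * codon = AAA."""
    total, n = codon, 1
    while total != "AAA":
        total, n = codon_sum(total, codon), n + 1
    return n


def is_even(codon):
    return all(b in PURINES for b in codon)


def point_mutants(codon, position, transversion):
    """Codons reachable by a transition or a transversion at one position."""
    old = codon[position]
    return [codon[:position] + new + codon[position + 1:]
            for new in BASES
            if new != old
            and ((old in PURINES) != (new in PURINES)) == transversion]


even = [c for c in CODONS if is_even(c)]

# Observation 5: the two transversions at any position of an even codon
# are odd codons that sum to AAA, i.e. they are mutually inverse.
observation_5 = all(
    not is_even(m1) and not is_even(m2) and codon_sum(m1, m2) == "AAA"
    for c in even for i in range(3)
    for m1, m2 in [point_mutants(c, i, True)]
)

# Observation 1: transitions keep parity; collect those that change the order.
parity_kept = all(is_even(m) == is_even(c)
                  for c in CODONS for i in range(3)
                  for m in point_mutants(c, i, False))
order_changed = [(c, m) for c in CODONS for i in range(3)
                 for m in point_mutants(c, i, False) if order(m) != order(c)]

print(observation_5, parity_kept)
print(order_changed)
```

In this sketch the only transitions that change the order are those leading to or from AAA, whose order is 1 while every other even codon has order 2.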

We have eight even codons, and the substitution of the bases of these codons according to the Watson-Crick base pairs (A↔U, G↔C) gives another eight codons, which are exactly the elements of the ring $(C_G, +, \cdot)$ that are not zero-divisors.

Even codon   ↔   Not zero-divisor codon
AAA          ↔   UUU
AAG          ↔   UUC
GAA          ↔   CUU
GAG          ↔   CUC
AGA          ↔   UCU
AGG          ↔   UCC
GGA          ↔   CCU
GGG          ↔   CCC

Table 4: Substitution of the bases of all even codons w.r.t. Watson-Crick base pairs

The even codons with their mutated codons (transversion) and the not zero-divisor codons with their mutated codons (transversion) partition the whole set of codons into two equal, disjoint subsets. The following table gives the even and not zero-divisor codons with their mutated codons:

Even codon   Mutated codons (transversion)   Not zero-divisor codon   Mutated codons (transversion)
AAA          CAA, UAA, ACA, AUA, AAC, AAU    UUU                      AUU, GUU, UAU, UGU, UUA, UUG
AAG          CAG, UAG, ACG, AUG, AAC, AAU    UUC                      AUC, GUC, UAC, UGC, UUA, UUG
GAA          CAA, UAA, GCA, GUA, GAC, GAU    CUU                      AUU, GUU, CAU, CGU, CUA, CUG
GAG          CAG, UAG, GCG, GUG, GAC, GAU    CUC                      AUC, GUC, CAC, CGC, CUA, CUG
AGA          CGA, UGA, ACA, AUA, AGC, AGU    UCU                      ACU, GCU, UAU, UGU, UCA, UCG
AGG          CGG, UGG, ACG, AUG, AGC, AGU    UCC                      ACC, GCC, UAC, UGC, UCA, UCG
GGA          CGA, UGA, GCA, GUA, GGC, GGU    CCU                      ACU, GCU, CAU, CGU, CCA, CCG
GGG          CGG, UGG, GCG, GUG, GGC, GGU    CCC                      ACC, GCC, CAC, CGC, CCA, CCG

Table 5: Transversion of even codons and not zero-divisor codons

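Thus, we can define a function $f$ from the set of even codons to $C_G$ such that for every even codon $XYZ$,

$f(XYZ) = c(X)\,c(Y)\,c(Z)$,

where $c(A) = U$ and $c(G) = C$.

An alternative way of defining the function is $f(XYZ) = XYZ + UUU$ for every even codon $XYZ$, since $A + U = U$ and $G + U = C$ in P.

It is observed that $f$ maps every element having order less than 4 to an element of order 4, and that the image of $f$ is the set of all not zero-divisors of $C_G$. The function $f$ represents the triple base mutation of all even codons in terms of Watson-Crick base pairs.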
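The statement that the Watson-Crick images of the even codons are exactly the not zero-divisors can be verified by brute force. The sketch below assumes position-wise sum and product of codons modulo 4 and takes a zero-divisor to be a codon whose product with some codon other than AAA equals AAA; the names used are illustrative. It also checks that, on even codons, the map agrees with adding UUU.

```python
from itertools import product

BASES = "ACGU"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}


def codon_sum(c1, c2):
    return "".join(BASES[(BASES.index(a) + BASES.index(b)) % 4]
                   for a, b in zip(c1, c2))


def codon_product(c1, c2):
    return "".join(BASES[(BASES.index(a) * BASES.index(b)) % 4]
                   for a, b in zip(c1, c2))


def f(codon):
    """Replace every base by its Watson-Crick partner."""
    return "".join(WATSON_CRICK[b] for b in codon)


def is_zero_divisor(codon):
    """True if codon * other = AAA for some codon `other` different from AAA."""
    return any(codon_product(codon, other) == "AAA"
               for other in CODONS if other != "AAA")


even = ["".join(c) for c in product("AG", repeat=3)]
image = {f(c) for c in even}
not_zero_divisors = {c for c in CODONS if not is_zero_divisor(c)}

print(sorted(image))
print(image == not_zero_divisors)
print(all(f(c) == codon_sum(c, "UUU") for c in even))
```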

The domain of f (the even codons) together with its one-point transversions, and the range of f (the set of all not zero-divisors) together with its one-point transversions, partition the whole set $C_G$ into two disjoint sets. In other words, if M is the set of even codons and their one-point transversions, and N is the set of all not zero-divisors and their one-point transversions, then

$M \cup N = C_G$ and $M \cap N = \emptyset$.
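The partition into the sets M and N can be confirmed by enumeration. This is a minimal Python sketch; the helper name `transversions` is hypothetical, and the not zero-divisors are taken to be the codons built only from C and U, as in Table 4.

```python
from itertools import product

BASES = "ACGU"
PURINES = set("AG")
CODONS = {"".join(c) for c in product(BASES, repeat=3)}


def transversions(codon):
    """All codons obtained from `codon` by a one-point transversion."""
    result = set()
    for i, old in enumerate(codon):
        for new in BASES:
            if (old in PURINES) != (new in PURINES):
                result.add(codon[:i] + new + codon[i + 1:])
    return result


even = {"".join(c) for c in product("AG", repeat=3)}
not_zero_divisors = {"".join(c) for c in product("CU", repeat=3)}

# M and N as defined in the text: each set together with its transversions.
M = even.union(*(transversions(c) for c in even))
N = not_zero_divisors.union(*(transversions(c) for c in not_zero_divisors))

print(len(M), len(N))            # 32 32
print(M | N == CODONS, M & N)    # True set()
```

M consists of the codons with at most one pyrimidine and N of the codons with at most one purine, which is why the two sets have equal size.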

Distances between amino acids and their biological significance

We define a distance matrix of the codons by considering the Hamming distance. This distance matrix allows us to define a distance matrix of the amino acids by considering the mean distance between the codons that code for the respective amino acids.

For example, the Hamming distance between the codons GGU and UGC is the number of base positions at which the two codons differ, which is 2 here (the first and third positions). Next, the distance between a pair of amino acids is found by computing the mean distance between their respective codons.

For example, the distance between the amino acids Glycine (G) and Tryptophan (W) is calculated to be 1.75.

The codons that code for G are GGA, GGC, GGG, GGU.

The codon that codes for W is UGG.

The distances between the codons are:

        GGA   GGC   GGG   GGU
UGG     2     2     1     2

The mean distance between the above codons, (2 + 2 + 1 + 2)/4 = 1.75, gives us the distance between the respective amino acids.
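The worked example can be reproduced with a short Python sketch. The function names `hamming` and `mean_distance` are illustrative choices, not taken from the original work.

```python
def hamming(c1, c2):
    """Number of base positions at which two codons differ."""
    return sum(a != b for a, b in zip(c1, c2))


def mean_distance(codons_1, codons_2):
    """Mean Hamming distance over all pairs of codons from the two lists."""
    total = sum(hamming(x, y) for x in codons_1 for y in codons_2)
    return total / (len(codons_1) * len(codons_2))


glycine = ["GGA", "GGC", "GGG", "GGU"]
tryptophan = ["UGG"]

print([hamming("UGG", c) for c in glycine])  # [2, 2, 1, 2]
print(mean_distance(glycine, tryptophan))    # 1.75
```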

The following table gives the distance between each pair of amino acids:

Table 6: The distance between amino acid pairs computed as the mean distance between their respective codons.
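A distance matrix of this kind can be computed directly from the genetic code. The Python sketch below is one literal reading of the construction described above (mean Hamming distance over all codon pairs); it is not a reproduction of the published table, and the names `codons_of`, `amino_distance` and `matrix` are hypothetical. The string `AMINO` lists the amino acids of Table 2 row by row, with "-" for stop codons, which are excluded.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGU"
# Amino acids of Table 2, codons ordered AAA, AAC, AAG, AAU, ACA, ..., UUU.
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "-Y-YSSSS-CWCLFLF")

# Group the codons by the amino acid they code for (stop codons skipped).
codons_of = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    if aa != "-":
        codons_of[aa].append("".join(codon))


def hamming(c1, c2):
    return sum(a != b for a, b in zip(c1, c2))


def amino_distance(a, b):
    """Mean Hamming distance between the codons of amino acids a and b."""
    pairs = [(x, y) for x in codons_of[a] for y in codons_of[b]]
    return sum(hamming(x, y) for x, y in pairs) / len(pairs)


amino_acids = sorted(codons_of)
matrix = {(a, b): amino_distance(a, b) for a in amino_acids for b in amino_acids}

print(matrix[("G", "W")])  # 1.75
print(matrix[("F", "K")])  # 3.0
print(matrix[("K", "N")])  # 1.0
```

Under this literal reading, the entry of an amino acid with itself is the mean over its own synonymous codons and need not be 0; how the diagonal is treated in Table 6 cannot be determined from the text.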

It is noticed that in most cases the differences in the physico-chemical properties of amino acids increase with increasing distance values. Between most of the hydrophilic and hydrophobic amino acids the distance values are large. For example, the distance between the amino acids Phenylalanine (strongly hydrophobic) and Lysine (strongly hydrophilic) is the maximum distance 3. Mutations with small differences between amino acids correspond to small distance values of the corresponding amino acids. This distance matrix also defines a metric on the set of amino acids.

It is interesting to note that similar results were also obtained by Sanchez et al. (2004), who used a different approach to obtain the distance table.

From the distance matrix of Table 6 we can obtain a graph of the amino acids. We take the amino acids as the set of vertices, where two vertices (amino acids) α and β are connected by an edge if their distance is less than some threshold value. At first we consider the average distance (2.21) as the threshold value; the corresponding graph is depicted in Fig. 1. Then we examine the graph structures for different thresholds. The graphs of the amino acids for the different threshold values are shown below.

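Fig. 1: Graph of the amino acids, Case 1 (threshold value 2.21, the average distance)

Fig. 2: Graph of the amino acids, Case 2

Fig. 3: Graph of the amino acids, Case 3

Fig. 4: Graph of the amino acids, Case 4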

From the graph structures in Fig. 1, Fig. 2, Fig. 3 and Fig. 4 we observe that as we decrease the threshold value, the accessibility of one amino acid from another decreases as well. The graphs in Fig. 1 and Fig. 2 are connected, while the others are disconnected. In Fig. 3 it is observed that the amino acids A, G, P, T, V are isolated. These differ from the remaining amino acids in that each of them is coded by four codons having the same bases in the first and second positions.

In Fig. 4 we observe that the amino acids V, L, F, R, G, S, A, T, P, Y are isolated and the amino acids I, W, K, E, Q are connected with M, C, N, D, H respectively. As already mentioned, the second base is considered the biologically most significant base, whereas the third base is the least significant base in a codon, according to evolutionary importance in the genetic code. Here the non-isolated amino acids differ from the remaining 10 isolated amino acids in that one amino acid of each connected pair is obtained from the other by a third base mutation of a codon, and the corresponding codons of the connected amino acids have the same bases in the first and second positions. In the case of the isolated amino acids, a third base mutation of the corresponding codons produces synonymous codons, that is, the mutated codon codes for the same amino acid.
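The threshold graphs can be sketched in Python as follows. The distance is the mean Hamming distance between codon sets as described above, and the threshold 1.5 used here is an illustrative value chosen for this sketch, since the thresholds behind Fig. 2 to Fig. 4 are not stated in the text. The names `threshold_graph` and `isolated` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product

BASES = "ACGU"
# Amino acids of Table 2, codons ordered AAA, AAC, AAG, AAU, ACA, ..., UUU.
AMINO = ("KNKNTTTTRSRSIIMI" "QHQHPPPPRRRRLLLL"
         "EDEDAAAAGGGGVVVV" "-Y-YSSSS-CWCLFLF")

codons_of = defaultdict(list)
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    if aa != "-":  # stop codons are not amino acids
        codons_of[aa].append("".join(codon))


def amino_distance(a, b):
    """Mean Hamming distance between the codons of amino acids a and b."""
    pairs = [(x, y) for x in codons_of[a] for y in codons_of[b]]
    return sum(sum(p != q for p, q in zip(x, y)) for x, y in pairs) / len(pairs)


def threshold_graph(threshold):
    """Edges between amino acids whose distance is below the threshold."""
    return {frozenset((a, b))
            for a, b in combinations(sorted(codons_of), 2)
            if amino_distance(a, b) < threshold}


def isolated(edges):
    """Amino acids that are not an endpoint of any edge."""
    linked = set().union(*edges) if edges else set()
    return set(codons_of) - linked


edges = threshold_graph(1.5)  # illustrative threshold, an assumption
print(sorted("".join(sorted(e)) for e in edges))
print(sorted(isolated(edges)))
```

With this illustrative threshold the sketch yields the five pairs I-M, C-W, K-N, D-E and H-Q and leaves the other ten amino acids isolated, which matches the structure described for Fig. 4.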

An amino acid is coded by codons. Evolution of one amino acid from another is mediated by mutation in the corresponding codons. Mutations at different positions of a codon do not produce the same effect. As the distance between a pair of amino acids is based on the distances between their corresponding codons, we can say that the nearer one amino acid is to another, the greater is its likelihood of evolving from the other. Thus the corresponding graphs give a picture of the evolution of amino acids. For example, the likelihood of the amino acid N evolving from K is much greater than that of N evolving from G.

Next we discuss a real-life example which shows that frequently occurring codon mutations usually have small distances. For this we consider the distances between single-point drug-resistance mutations in the HIV-1 protease gene and the respective gene of the HXB2 strain, and the distances of single-point mutations in the human beta-globin gene. It can be seen that in both cases the distance between most of the codons is equal to 1. Also, in the human beta-globin gene, a small change in the physico-chemical properties of the amino acids results in a change in the biological function of hemoglobin.

Amino acid   Codon       Distance   Antiviral drug
mutation     mutation    value
A71I         GCU→AUU     2          ABT-378
A71L         GCU→CUC     3          ABT-378
A71T         GCU→ACU     1          Indinavir, Crixivan
A71V         GCU→GUU     1          Nelfinavir, Viracept
D30N         GAU→AAU     1          Nelfinavir, Viracept
D60E         GAU→GAA     1          DMP 450
G16E         GGG→GAG     1          ABT-378
G48V         GGG→GUG     1          Telinavir, MK-639
G52S         GGU→AGU     1          AG1343
G73S         GGU→AGU     1          AG1343, MK-639
H69Y         CAU→UAU     1          Aluviran, Lopinavir
I47V         AUA→GUA     1          ABT-378
I50L         AUU→CUU     1          BMS 232632
I54L         AUC→CUC     1          ABT-378
I54M         AUU→AUG     1          BILA 2185 BS
I54T         AUC→ACC     1          ABT-378
I54V         AUC→GUC     1          ABT-378, MK-639
I82T         AUC→ACC     1          A-77003
I84A         AUA→GCA     2          BILA 1906 BS
I84V         AUA→GUA     1          Nelfinavir, Viracept
K20M         AAG→AUG     1          Indinavir, Crixivan
K20R         AAG→AGG     1          Indinavir, Crixivan
K45I         AAA→AUA     1          DMP-323
K55R         AAA→AGA     1          AG1343
L10I         CUC→AUC     1          Indinavir, Crixivan
L10R         CUC→CGC     1          Indinavir, Crixivan
L10V         CUC→GUC     1          Indinavir, Crixivan
L10F         CUC→UUC     1          Lopinavir
L10Y         CUC→UAC     2          BMS 232632
L23I         CUA→AUA     1          BILA 2185 BS
L24I         UUA→AUA     1          Indinavir, Crixivan
L24V         UUA→GUA     1          Telinavir
L33F         UUA→UUC     1          ABT-538
L63P         CUC→CCC     1          ABT-378, AG1343
L90M         UUG→AUG     1          Nelfinavir, Viracept
L97V         UUA→GUA     1          DMP-323
M36I         AUG→AUA     1          Nelfinavir, Viracept
M46F         AUG→UUC     2          A-77009
M46I         AUG→AUA     1          Indinavir, Crixivan
M46L         AUG→UUC     2          Indinavir, Crixivan
M46V         AUG→GUG     1          A-77006
N88D         AAU→GAU     1          Nelfinavir, Viracept
N88S         AAU→AGU     1          BMS 232632
P81T         CCU→ACU     1          Telinavir
R8K          CGA→AAA     2          A-77003
R8Q          CGA→CAA     1          A-77004
R57K         AGA→AAA     1          AG1343
T91S         ACU→UCU     1          ABT-378
V32I         GUA→AUA     1          A-77005, Telinavir
V75I         GUA→AUA     1          Telinavir
V77I         GUA→AUA     1          AG1343
V82A         GUC→GCC     1          Ritonavir, Norvir
V82F         GUC→UUC     1          Ritonavir, Norvir
V82I         GUC→AUC     1          A-77011
V82S         GUC→UCC     2          Ritonavir, Norvir
V82T         GUC→ACC     2          Ritonavir, Norvir

Table 7: The distance values of the mutations found in the HIV protease gene that confer drug resistance, with regard to the wild type HXB2. Most of the reported mutations in the HIV protease gene have a distance value equal to or less than 3.

Amino acid   Codon       Distance   Biological effect         Reference
mutation     mutation    value
P36H         CCT→CAT     1          High oxygen affinity      [11939509] Hemoglobin. 2002, 26, 21-31
T123I        ACC→ATC     1          Asymptomatic              [11300351] Hemoglobin. 2001, 25, 67-78
V20E         GTG→GAG     1          High oxygen affinity      [7914875] Eur J Haematol. 1994, 53, 21-5
V20M         GTG→ATG     1          High oxygen affinity      [7914875] Eur J Haematol. 1994, 53, 21-5
V126L        GTG→CTG     1          Neutral                   [11939515] Hemoglobin. 2002, 26, 7-12
V111F        GTC→TTC     1          Low oxygen affinity       [10975442] Hemoglobin. 2000, 24, 227-37
H97Q         CAC→CAA     1          High oxygen affinity      [8571935] Am J Hematol. 1996, 51, 32-6
V34F         GTC→TTC     1          High oxygen affinity      [10846826] Int J Hematol. 2000, 71, 221-6
E121Q        GAA→CAA     1                                    [8095930] Hemoglobin. 1993, 17, 9-17
L114P        CTG→CCG     1          Non-functional            [11300352] Hemoglobin. 2001, 25, 79-89
A128V        GCT→GTT     1          Mild instability          [11300349] Hemoglobin. 2001, 25, 45-56
H97Q         CAC→CAG     1          High oxygen affinity      [8890707] Ann Hematol. 1996, 73, 183-8
D99E         GAT→GAA     1          High oxygen affinity      [1814856] Hemoglobin. 1991, 15, 487-96
D21N         GAT→AAT     1                                    [8507722] Ann Hematol. 1993, 66, 269-72
N139Y        AAT→TAT     1          High oxygen affinity      [8718692] Hemoglobin. 1995, 19, 335-41
V34D         GTC→GAC     1          Unstable                  [1260309] Hemoglobin. 2003, 27, 31-5
E121K        GAA→AAA     1                                    [7908281] Hemoglobin. 1993, 17, 523-35
A140V        GCC→GTC     1          Mild polycythemia         [7908281] Hemoglobin. 1993, 17, 523-35
K82E         AAG→GAG     1          Altered oxygen affinity   [9028820] Hemoglobin. 1997, 21, 17-26
G83D         GGC→GAC     1          Hb Pyrgos (Normal)        [9255613] Hemoglobin. 1997, 21, 345-61
D99N         GAT→AAT     1          High oxygen affinity      [11843288] Int J Hematol. 2002, 75, 35-9
G15R         GGT→CGT     1          Neutral                   [11939517] Hemoglobin. 2002, 26, 77-81
V111L        GTC→CTC     1          Fannin-Lubbock variant    [7852084] Hemoglobin. 1994, 18, 297-306
G119D        GGC→GAC     1          Fannin-Lubbock variant    [7852084] Hemoglobin. 1994, 18, 297-306
E26K         GAG→AAG     1                                    [9140717] Hemoglobin. 1997, 21, 205-18
N108I        AAC→ATC     1          Low affinity              [12010673] Haematologica. 2002, 87, 553-4
H146P        CAC→CCC     1          High oxygen affinity      [11475152] Ann Hematol. 2001, 80, 365-7
H92Y         CAC→TAC     1          Cyanosis                  [9494043] Hemoglobin. 1998, 22, 1-10
C112W        TGT→TGG     1          Silent and unstable       [8936462] Hemoglobin. 1996, 20, 361-9
A111V        GCC→GTC     1          Silent                    [7615398] Hemoglobin. 1995, 19, 1-6
A123S        GCC→TCC     1          Silent                    [7615398] Hemoglobin. 1995, 19, 1-6
D52G         GAT→GGT     1          Silent                    [9730366] Hemoglobin. 1998, 22, 355-71
V126G        GTG→GGG     1          Mild beta-thalassaemia    [1954392] Blood. 1991, 78, 3070-5
W15Stop      TGG→TAG     1          Beta-thalassaemia         [10722110] Hemoglobin. 2000 Feb;24(1):1-13
F42L         TTT→TTG     1          Hemolytic anemia          [11920235] Hematol J. 2001;2(1):61-6

Table 8: The distance values of the mutations found in the human beta-globin gene.

Therefore we can conclude that the Hamming distances determined between codons are connected with the physico-chemical properties of amino acids.

Conclusion

In this paper we discussed an algebraic structure of the genetic code which exhibits some interesting connections between the physico-chemical properties of amino acids and the algebraic structure. We observed that there is a close connection between the order of the codons and transition/transversion mutations. We have shown that the set of all codons which are not zero-divisors can be obtained from the even codons, and that these two sets together with their transversions partition the whole set of codons into disjoint subsets. Next we obtained a distance matrix of codons, from which we derived a distance matrix of amino acids. It is observed that the difference in the physico-chemical properties of amino acids is reflected in this distance matrix. The distance matrix of the amino acids generates a graph of the amino acids. As the evolution of amino acids is mediated by mutation of the corresponding codons, the graph structures reflect the evolutionary pattern of amino acids, in the sense that two amino acids connected by an edge are more likely to have evolved from each other than two that are not.