Sankar Chatterjee and Surya Yadav have created a simulation showing how the genetic code may have evolved.
Six years ago, Texas Tech University's Sankar Chatterjee released a groundbreaking theory on the beginning of life on Earth, what he called "the Holy Grail of science." He claimed that a heavy bombardment of icy comets and carbon-rich asteroids 4 billion years ago left young Earth's surface pockmarked with craters, similar to the surface of the moon.
Filled with water and the cosmic building blocks for life, delivered by these meteorites, these craters eventually became the primitive cradles in which the first simple organisms grew.
Based on theories of chemical evolution and evidence from the Earth's early geology, Chatterjee's proposal still left one gaping question unanswered: exactly how these primordial organisms developed information systems.
"It's become clear in recent years that the biological world is computational at its core," said Chatterjee, a Horn Professor in the Department of Geosciences and Curator of Paleontology at the Museum of Texas Tech University. "Algorithms, or instruction sets, are found in every cell and in the manner in which information flows through and between cells. Digital storage of molecular information is the key to defining life and understanding its origin. The key mechanism is the origin of the genetic code."
As Chatterjee explains, the genetic code was deciphered in the 1960s, and the many scientists responsible for cracking the code were awarded Nobel Prizes. But since that time, there has been no comprehensive theory about why the genetic code evolved in the first place, before the origin of DNA and the first life.
Until now.
In collaboration with Surya Yadav, a professor of information systems in the Jerry S. Rawls College of Business, Chatterjee has built upon his former theory.
"The question of the origin of the code is the greatest challenge in modern molecular biology and origin-of-life research," Chatterjee said. "We have provided a novel model: how the genetic code might have evolved gradually with the improvement of the translation machine during protein synthesis."
Origin of the genetic code
In the craters on Earth's surface 4 billion years ago was what Chatterjee calls a prebiotic soup: a combination of water and biomolecules deposited there by comets and meteorites, all stewing together thanks to the hydrothermal energy from erupting vents. Among the biomolecules were likely several dozen types of amino acids and an assortment of nucleotides. Four specific nucleotide bases – uracil (U), cytosine (C), adenine (A) and guanine (G) – began combining into chains of ribonucleic acid (RNA). Similarly, about 12 kinds of amino acids were joined together to form peptide chains.

Because RNA contains a sequence of these nucleotide bases that is analogous to the letters in a word, it can function as an information-containing molecule, Chatterjee explained. Moreover, RNA, as a single chain, is free to take any kind of shape. From this basic architecture of a single-stranded RNA molecule, different species of RNA – such as ribozymes, transfer RNA (tRNA), messenger RNA (mRNA) and ribosomal RNA (rRNA) – evolved inside protocells. Each species contained a supply of information, distinct in attribute and configuration, in response to the specific amino acids it collected.
"The advent and multifunction of different species of RNA molecules signal the transition from the age of chemistry to the age of information," Chatterjee said.
Among the molecular milieu, mRNAs began to encode the recipe for proteins, while tRNAs carried different amino acids and tried to match the three-nucleotide sequences – called codons – of mRNA, each of which corresponds to a specific amino acid.
But mRNA languages and protein languages are different. A bilingual translator was needed to read the message in mRNA and a molecular machine was needed to manufacture protein according to the recipe. The translators are special kinds of enzymes, called aminoacyl-tRNA synthetases (aaRS), that help convert the code to the right language. Then the mRNA is fed into the ribosome, and the ribosome reads the message and makes a protein accordingly.
The genetic code is essentially a set of rules defining how the four-letter code of mRNA is translated into the 20-letter code of amino acids, which are the building blocks of proteins. Proteins, in turn, are the "hardware" – the main enzymes and structural material – for cells.
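As a rough illustration of that rule set, the short Python sketch below builds the standard 64-codon table and uses it to translate a made-up mRNA string. The names CODON_TABLE and translate, the example sequence and the one-letter amino acid symbols are assumptions for illustration and do not come from Chatterjee and Yadav's model.

```python
from itertools import product

# Standard genetic code with codons in U, C, A, G order.
# Each character is a one-letter amino acid symbol; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every three-letter codon (UUU, UUC, ..., GGG) to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read an mRNA string three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon: translation ends here
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical message: start codon, four codons, then a stop codon.
print(translate("AUGGGCGCCGACGUCUAA"))  # MGADV
```

The lookup table stands in for the whole tRNA/aaRS/ribosome machinery described above; in a cell, that mapping is enforced chemically rather than by a dictionary.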
Evolution of the genetic code

The genetic code developed in three distinct stages that coevolved with the refinement of the translation machine. The primitive genetic code used only four amino acids and four codons to make a simple strand of protein.
"In the primitive translation machine, a symbiotic relationship was established among three components – pre-tRNA, pre-aaRS, and pre-mRNA – to create a short chain of amino acids, which form the biosynthetic protein," Chatterhee said. "The protein chain grew through the addition of further amino acids in the same manner. By linking the amino acids carried by the pre-tRNAs, the first protein synthesis occurred. But at this stage of the primitive code, the translation machine was simple and made errors during protein synthesis."
The transitional genetic code was the second generation, employing 10 amino acids and 16 codons. Compared to the primitive translation machine, the transitional translation machine was somewhat refined to minimize errors. In this stage of translation, pre-tRNA evolved into tRNA through gene duplication. Pre-mRNA evolved into mRNA by linking several strands of pre-mRNA to increase the storage capacity. Pre-aaRS joined to specific tRNA and became aaRS. The protein chain in this stage was moderately long.

The universal genetic code was the final stage of code development, evolving along with the translation machine and maximizing its efficiency. It contains 64 codons specifying 20 amino acids. Chatterjee says the universal code proved more reliable than the primitive or transitional codes, with minimal errors, so natural selection favored it.
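The codon and amino acid counts for the three stages can be checked with a little arithmetic. The sketch below assumes the codon patterns GNC, SNS and NNN (N is any base, S is G or C), which are discussed in the origin-of-code literature. The article itself does not name the codons used at each stage, so the patterns and the helper names expand and summarize are assumptions.

```python
from itertools import product

# Standard genetic code with codons in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Pattern letters: N = any base, S = G or C; G and C stand for themselves.
PATTERN_LETTERS = {"N": "UCAG", "S": "GC", "G": "G", "C": "C"}


def expand(pattern: str) -> list[str]:
    """List every codon that matches a three-letter pattern such as 'GNC'."""
    choices = (PATTERN_LETTERS[letter] for letter in pattern)
    return ["".join(codon) for codon in product(*choices)]


def summarize(pattern: str) -> tuple[int, int]:
    """Return (number of codons, number of distinct amino acids encoded)."""
    codons = expand(pattern)
    amino_acids = {CODON_TABLE[c] for c in codons} - {"*"}
    return len(codons), len(amino_acids)


# Assumed pattern for each stage; see the note above.
for stage, pattern in [("primitive", "GNC"),
                       ("transitional", "SNS"),
                       ("universal", "NNN")]:
    n_codons, n_amino_acids = summarize(pattern)
    print(f"{stage:12s} {pattern}: {n_codons} codons, {n_amino_acids} amino acids")
```

Under these assumed patterns, the counts come out to 4 codons and 4 amino acids, 16 codons and 10 amino acids, and 64 codons and 20 amino acids, matching the figures given for the three stages.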
The final and most important component of the translation machine, the ribosome, was a hybrid of rRNAs and r-proteins. With the participation of the ribosome, the translation machinery became more elaborate with tRNA/aaRS/mRNA/ribosome complexes, which enabled higher specificity in the genetic coding. The protein chain in this stage is long and complex, with a biological information system that adds rules, instruction, feedback and algorithm to its repertoire.
Chatterjee and Yadav hypothesize that the genetic code evolved as pathways for the synthesis of new amino acids became available – and these, in turn, were the results of progressive refinement of the translation machine.
"Through successive refinement, the universal code has optimized functional efficiency to minimize coding errors," Chatterjee said. "Once the universal code evolved, the protein synthesis became highly coordinated, beautifully orchestrated and universally adopted by all life."

Chatterjee and Yadav proposed that the coevolution of the genetic code and the translation machine marks the beginning of Darwinian evolution at the molecular level, an interplay between information and its supporting structure. This hypothesis provides the logical and incremental steps for the origin of programmed protein synthesis.
The code obviously is not the result of a random assignment of codons to amino acids, Chatterjee said, because it has a specific, organized structure with a large number of codons to provide redundancy; that is, several codons may specify the same amino acid.
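The redundancy is easy to see by counting how many codons in the standard code map to each amino acid. This is a minimal sketch for illustration; the variable names and one-letter symbols are not taken from the researchers' work.

```python
from collections import Counter
from itertools import product

# Standard genetic code with codons in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Count how many codons specify each amino acid (or the stop signal).
codons_per_symbol = Counter(CODON_TABLE.values())

for symbol, count in sorted(codons_per_symbol.items(), key=lambda kv: -kv[1]):
    label = "stop" if symbol == "*" else symbol
    print(f"{label}: {count} codon(s)")
```

Leucine (L), serine (S) and arginine (R) each have six codons, while methionine (M) and tryptophan (W) have only one, so the 64 codons cover 20 amino acids plus the stop signal with room to spare.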
"The expanded genetic code is so universal that there is strong evidence that all life on Earth had a single origin in the universal code before the last universal common ancestor evolved," he said. "This universal genetic code has been operating for the last 4 billion years and has remained unchanged since it was perfected."
Information system of life
To simply explain the origin of the genetic code, Chatterjee compares it to the evolution of personal computing – ironic, since the idea of computing originally came from trying to imitate information processing in living systems. First came the Apple II, one of the earliest Apple computers. Then Macintosh computers built upon the Apple II by adding a graphical interface.
Today, we have iPhones, which allow us to carry more information in our pockets than NASA had during the Apollo missions. As the computer systems have been modified and refined, so has the information processing on them – just as the genetic code has been modified and refined with the increasing complexity of the lifeforms that rely on it.
"However, the analogy ends there," Chatterjee said. "We know very well that in 50 years, this iPhone would be obsolete. But once life's computing machinery and its software fully evolved, they have remained the same for the last 4 billion years. Isn't that amazing?"
Chatterjee and Yadav point out that life is more sophisticated than any manmade computer: in living systems, the software/hardware dichotomy is blurred and the two are integrated. They find the computer analogy too simplistic.
"Both the informational (RNA/DNA) and functional polymers (proteins) in the translational machinery can be viewed as highly mobile nanobots, which are fully equipped with both the information and the material needed to accomplish their task," Chatterjee said. "These nanobots 'know' how to put themselves together by self-assembly or by cooperation with other molecules."
Information-directed mRNA and protein synthesis are remarkable feats of early protocells. All of the information is stored in RNA genomes, and when a new protein or mRNA is needed, the information is read and used to direct its construction. Some essential proteins, which perform central tasks, remain unchanged for billions of years.
"The beauty of this information system of life is that if there is any minor spelling mistake in A, U, C and G during translation in protein synthesis, this mistake – called mutation – will create variation among population: the raw material for Darwinian natural selection," Chatterjee said. "Because of this occasional spelling mistake in the software during the last 4 billion years, today we see the biodiversity of life. However, the genetic code remains unchanged."
Proteins, Chatterjee said, are regarded as among the "nanobots" of a cell. They do most of the work – such as controlling metabolism, transport, communication, structure, catalysis and many other aspects of cell function – and they are constructed for many different functions. With the availability of proteins, there was a gradual evolution of the components of protocells.
"Life is more than a computer," Chatterjee said. "Unlike computers, life creates its own custom-made components. No computer can achieve this remarkable feat. Thus, a significant part of the process for creating an organism's components is essentially bootstrapped from its own DNA and mRNA. An early protocell innovated the most powerful technology ever created on this planet."
After spending more than 10 years investigating the origins of life, Chatterjee is proud to have found a potential answer to one of mankind's biggest questions, but he emphasizes that this information system analogy is not the complete story.
"Life is the most sophisticated and durable computer system in the universe, which can create its own copy," he said. "In our computers, we have to upgrade our software every year or so and buy a new model every few years. But for life, the code was so well designed by evolution and became so near foolproof, and the translation machine so sophisticated, that they did not need any upgrading; they are still working perfectly."