By comparing data about genome sizes of more than 1,000 publicly available genomes from the three extant domains of life, researchers from Harvard University's Origins of Life initiative, Los Alamos National Laboratory, and SFI quantified regularities and global genomic differences across domains, finding mathematical relationships that link Prokaryotes (one-celled bacteria and archaea) with Eukaryotes (one- and many-celled organisms containing nuclear DNA).

The results offer fresh insights on several fronts and will impact ecology, genome sequencing, medicine, genetics, evolution, and synthetic biology, says Juan Perez-Mercader of Harvard, an SFI External Professor:

  • They shed new light on the way DNA is organized in a genome and how this organization is linked to the generic use of information by all living systems.
  • They suggest an essential difference between the way information is handled by Prokaryotes and Eukaryotes and why Prokaryotes cannot be larger than a particular size.
  • They may help clarify the reasons for the existence of so-called "junk DNA," the largest portion of our genomes.
  • And they shed new light on one of the most important transitions in the evolution of life: the emergence of Eukaryotes about 2 billion years ago.

The authors' interpretation is based both on the properties of the Benford distribution and on the properties of information as a physical quantity. (The Benford distribution, which is central to this work, has a long and colorful history in science and mathematics. It is known to fit a wide variety of data.)
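For readers unfamiliar with it, the most widely known form of Benford's law is the first-digit law, which gives the probability that a number in a data set starts with the digit d. The short Python sketch below computes those probabilities. It only illustrates the general law; the function name is made up for this example, and the study's application to ORFs is more specialized and is not reproduced here.

```python
import math

def benford_first_digit_probabilities():
    """Return P(d) = log10(1 + 1/d) for leading digits d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Illustrative only: the generic first-digit law, not the paper's analysis.
probs = benford_first_digit_probabilities()
for digit, p in probs.items():
    print(f"{digit}: {p:.3f}")
```

Under this law a leading 1 is the most likely digit, a leading 9 the least likely, and the nine probabilities sum to one.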

In the study, published in the May 18 issue of PLoS One, the authors examine how the number of Open Reading Frames (ORFs) varies with overall genome size for each domain. ORFs are regions of the genome that are transcribed and translated into proteins; they are the precursors of messenger RNA (mRNA) and are closely related to genes. ORFs contain all of the coding DNA in the genome (called cDNA) and some of the genome’s non-coding DNA, called ncDNA (sometimes referred to by the unfortunate label "junk DNA").

Read the PLoS One paper (May 18, 2012)

Together, cDNA and all the ncDNA form the entire genome and provide each organism with its particular features. ORFs, cDNA, and ncDNA are common measures for assessing the functional complexity of a living system.

The authors find that all of these features can be accommodated by a single unifying description suggested by computation theory when put together with some recent work in statistics. This description results if ORFs are distributed according to the Benford probability distribution, which gives rise to a specific logarithmic form for the dashed curve that represents the Eukaryote data in the figure below. For the smaller genomes, such as those of small Eukaryotes and the Prokaryotes, the logarithmic form of the Benford distribution is equivalent to a straight line, which the authors find to be surprisingly consistent with the Prokaryote line. (This, the authors note, can be interpreted in a simple way by using information theory.)
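Why a logarithmic curve looks like a straight line for small genomes can be seen from a generic logarithmic form. In the sketch below, N is the number of ORFs, G is the genome size, and a and G_0 are hypothetical constants introduced only for illustration; this is not necessarily the expression fitted in the paper.

```latex
N(G) = a \ln\!\left(1 + \frac{G}{G_0}\right)
\;\approx\; \frac{a}{G_0}\, G
\qquad \text{for } G \ll G_0 ,
\quad \text{since } \ln(1 + x) \approx x \text{ when } x \ll 1 .
```

In this assumed form, N grows in direct proportion to G while G is small compared with G_0, and the growth slows to logarithmic once G becomes comparable to or larger than G_0.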

The authors relied on data from more than 1,000 genomes publicly available in 2010. While the number of ORFs in the data set varied by a factor of more than a hundred, genome sizes varied by a factor of ten thousand. Because both quantities span such wide ranges, it is most effective to plot the logarithm of the number of ORFs against the logarithm of the genome size (measured in thousands of DNA base pairs), as shown in the figure. The Bacteria (in blue) lie along the straight red line, as do the Archaea (in magenta). Examples of Bacteria are shown above the line. The Eukaryotes (in orange) display a very different non-linear behavior for large genomes (as indicated by the dashed black line), but merge smoothly with the Prokaryotes for small genomes. Examples of Eukaryotes are shown below the curves. Prokaryotes hug the diagonal red line because their genomes are composed almost entirely of coding DNA (with only a small percentage of ncDNA). Another feature of the Prokaryotes is that the data end abruptly, perhaps indicating a maximum genome size for them. In other words, Prokaryotes "hit the wall" when their genomes get large. (Credit: Sharon Mikkelson, Los Alamos National Laboratory)
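The log-log comparison described above can be sketched in a few lines of Python. The genome sizes and ORF counts here are invented for illustration and are not values from the study's data set; the helper name is likewise hypothetical. The sketch takes base-10 logarithms of both quantities and estimates the slope of a straight-line fit by ordinary least squares.

```python
import math

# Hypothetical (genome size in kilobase pairs, ORF count) pairs, made up for
# illustration only; they are NOT values from the study's data set.
genomes = [(500, 480), (2000, 1900), (4600, 4300), (8000, 7400)]

def loglog_slope(points):
    """Least-squares slope of log10(ORFs) against log10(genome size)."""
    xs = [math.log10(size) for size, _ in points]
    ys = [math.log10(orfs) for _, orfs in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

slope = loglog_slope(genomes)
print(f"log-log slope: {slope:.2f}")
```

A slope close to 1 on such a plot means that the number of ORFs grows roughly in proportion to genome size, which is the kind of behavior the article attributes to the Prokaryotes.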

 

One can think of the ORFs in a genome as the bearers of the information used in the "biological computations" executed by the cDNA present in every form of life, say the authors. Because the genomic data used in the study are well fitted by the Benford distribution, they suggest that life, through its evolution, has optimized the cDNA computing channel to be as efficient and error-free as is allowed by the external and internal environment in which life happens.

This means that living systems have maximized the rate at which the information content of their genomes is transmitted. This is accounted for by the Principle of Maximal Information, which is a universal feature of information applying to all systems, natural and artificial, and expresses the restriction that “you cannot get more information out than you put in.” In the figure, this limitation is reflected by the red line, which for a given genome size indicates the maximum rate at which information can be handled. The authors expect that, except for statistical fluctuations, no organism can lie above the red line, which marks the maximum information-handling ability of a genome allowed by the rules of information.

For organisms with small genomes, this linear, mostly cDNA-based scheme works very efficiently, but increasing genomic size and organismal complexity (associated with the presence of many additional interactions and components in a cell) degrades the ability of cDNA alone to control and efficiently process the extra information. Prokaryotes with increasing genome size eventually "hit the wall" of viability and cannot get larger, as is also indicated in the figure. To remain viable, a more complex organism must "invent" more sophisticated forms of information handling.

The findings of the study imply that a Prokaryote and its “wetware” can get by with roughly 90 percent of its genome being cDNA. For genome sizes larger than about 8 million base pairs, life as a bacterium becomes progressively more difficult. The change in wetware architecture for Eukaryotes allowed a huge increase in information control through a much larger fraction of ncDNA in the genome. This ncDNA enables control of the non-linearities associated with a larger genome size, and it is constrained by the rules of information and by the Benford distribution for the ORFs.

The “information management crisis” that developed as Prokaryote genomes became larger may have expedited the appearance of the domain of Eukaryotes in the history of life. In Eukaryotes, ncDNA exerts control to the maximum possible levels allowed by the properties of information itself. It does this in a way that is compatible with the operation of an underlying cDNA channel. Thus, say the authors, nature is efficient and uses the principle summarized by the quip “If it ain’t broke, don’t fix it!”

The study concludes that information theory allows genomic data to be interpreted in a way that encompasses all three domains of life, and serves as a vehicle for the common understanding of some basic genomic properties.

“Life is no different in its use of information from any other physico-chemical or human-engineered system,” says Perez-Mercader. “Life follows the same rules, but life employs them using chemistry. This study raises a multitude of questions, and spotlights an area with multiple applications that collaborations of biologists, chemists, engineers and physicists can explore together.”
