July 14, 2022 -- Using ribosome profiling (Ribo-seq), researchers from 20 institutions worldwide have identified more than 7,200 previously unrecognized gene segments of the human genome that may code for new proteins. The initiative could build on the understanding of gene-protein relationships gained 20 years ago through the Human Genome Project and reveal factors that contribute to traits and diseases.
In a July 13 correspondence to Nature Biotechnology, the researchers described their effort to bring together data on open reading frames (ORFs), spans of DNA that may code for proteins, into a standardized catalog. The goal of the international consortium is to encourage the scientific community to integrate the data into the major human genome databases.
The first sequencing of the human genome revealed around 20,000 genes that code for proteins. Since then, researchers have questioned whether the initiative identified all the DNA that contains instructions for making proteins. Because protein-coding regions are identified by comparing DNA from different species, researchers may have missed relevant regions that arose in relatively recent evolutionary steps.
Working independently, researchers have been discovering ORFs that may code for proteins, which led some of them to come together and assemble their findings into a standardized catalog. The work entailed finding ways to combine data that multiple labs had generated using different methods.
"Our intention is for the Ribo-seq phase I catalog to be seen as a pragmatic interim solution to a long-term problem. We believe that reference annotation databases can advance both scientific and clinical research through the propagation and standardization of Ribo-seq ORF datasets, even -- and perhaps especially -- while the phenotypic impact of these features remains uncertain," state the authors of the correspondence.
Having created the catalog, the team annotated it onto GENCODE, the gene sets used by the National Institutes of Health's Encyclopedia of DNA Elements (ENCODE) public research consortium. The annotation is the first step in a push to raise the profile of ORFs and thereby set the stage for more research based on the discovery of these potentially important spans of DNA.
"For too long, the scientific community has been mostly left in the dark about these ORFs. This is the point at which they enter the mainstream of genomic and medical science -- an effort which we expect to have wide-ranging ripple effects," said Jonathan Mudge, PhD, of the European Molecular Biology Laboratory-European Bioinformatics Institute, who co-led the effort, in a statement.
The researchers believe the ORFs, which are exclusive to primates, are likely to contribute to human traits and diseases, including rare conditions and common afflictions such as cancer. However, they have yet to show which ORFs relate to which traits and diseases. Integrating the ORFs into genome databases is intended to support work to understand the role of these spans of DNA.
In the initial dataset, GENCODE has classified 10 ORFs as protein coding. The evolutionary profile of many of the ORFs "remains hard to interpret," the researchers said, in part because "distinguishing ORF selection at the protein and DNA levels can be especially difficult for very small regions, and Ribo-seq ORFs are typically much smaller than those of known annotated proteins."