Summary

Scientists have uncovered thousands of DNA segments that were missing from the reference sequence of the human genome.

The Human Genome Project, which concluded in 2003, sought to completely and accurately determine the sequence of the three billion DNA letters that exist in every human cell. The project sequenced DNA segments from several individuals and stitched together a reference sequence that could serve as a gold standard for future genetic studies. But seven years later, many gaps remain in that reference genome. In particular, the original project overlooked large blocks of DNA that exist in some people, but not in others.

A new study of nine individuals with ancestors from different parts of the world has uncovered thousands of DNA segments that were missing from the reference sequence. “We’ve been trying to find the DNA regions that are in some humans but not all humans,” said Evan Eichler, a Howard Hughes Medical Institute (HHMI) investigator at the University of Washington. “That’s the only way to understand the full spectrum of genetic diversity for every spot of the human genome.” The study was published online April 18, 2010, in the journal Nature Methods.

The genetic code differs from one person to another at an average of about one in every thousand DNA letters. At these variable sites, one person might have a G at a particular location while another person has an A, for example. Over the past few years, geneticists have realized that human DNA also varies on larger scales. Their research shows that it is possible for some people to have a unique piece of DNA thousands or even millions of letters long that is not present in other people.

These types of differences among individuals are known as structural variation, and they contribute to differences in risk for many human diseases, including lupus, prostate cancer, schizophrenia, and autism. “The more we have learned about structural variation, the more we’ve been struck by its importance to human health and disease,” said Eichler.

Much of the structural variation in different human populations may have arisen by chance as modern humans migrated from eastern Africa over the past 100,000 years and colonized the rest of the world. But other structural variants might be related to adaptations different populations made as they moved to new environments. These are questions that are still waiting to be answered, Eichler said.

To find structural variants that might account for the missing DNA in the current reference genome, Eichler and his colleagues carefully surveyed the genomes of four people with Nigerian ancestors, two with Chinese and Japanese ancestors, two with European ancestors, and one of unknown ancestry. To sequence these individuals’ DNA, they used the same sequencing technology that produced the first draft of the human genome in 2001. They found more than 700 DNA regions exceeding a thousand letters in length that were missing from the reference sequence, along with thousands of smaller regions.

Many of these regions were in parts of our DNA that are not thought to influence the form or function of proteins in our bodies. But hundreds were in or near genes that encode the instructions for making proteins, as well as in areas of our DNA that have not changed much over evolutionary time, suggesting that they have an important biological function.

For example, the team found a segment of the lactase gene—which encodes the protein that humans use to digest milk—that is not present in the reference sequence. Among the nine individuals in the new analysis, the segment was present only in those of non-European ancestry. In many parts of the world, most people stop producing lactase sometime in childhood. Many Europeans, however, retain the ability to digest milk as adults; scientists suspect this might be because it helped their pastoral ancestors rely on milk from livestock. The function of the newly discovered segment of the lactase gene is still unknown, but Eichler says its variable presence among different populations “makes you wonder if it has some functional significance for lactase persistence.”

Eichler says the new results will have immediate implications for today’s DNA sequencing projects. Many of these projects use so-called next-generation technologies, which sequence short segments of DNA and then depend on computer programs to assemble the segments into a whole sequence. But these computer programs use the reference sequence to put the pieces together, and Eichler’s analysis has shown that they often make mistakes in variable regions of the genome.

“That’s bad karma,” says Eichler, “because it’s telling us that we can’t trust the next generation methods alone to work out particular regions of the genome. The message we got is that if genome projects depend totally on these new technologies to sequence these missing pieces of the human genome, you’re not going to pick up the full range of human genetic variation—and that’s an unacceptable loss.”

The new results from Eichler’s study help solve the problem by providing a scaffold for assembling the DNA segments from next-generation technologies. Then assays can be developed to look for structural variation in regions that might be involved in disease. “Once you know where a piece of DNA is located and what it is, you can devise an assay to detect it,” Eichler said. “You don’t know what you’re missing or its importance if you can’t assay for it.”

For More Information

Jim Keeley 301.215.8858 keeleyj@hhmi.org