1000 Genomes Project reaches new frontiers in human genetics
27 Oct 2010
The 1000 Genomes Project, a major international collaboration
to build a more detailed map of human genetic variation and genetic
association with diseases, has completed its pilot phase.
The original human genome project only gave relatively crude detail
of the human genome and no indication of the variation between
individuals. Since then, with technological advances, the cost and
resources needed to map the genome have fallen many-fold, from over $1bn
to just tens of thousands of dollars, enabling many research teams to
study genetic associations with diseases. Over 1000 regions on the
genome have now been associated with traits such as disease
susceptibility, response to medication or physical characteristics.
However, recent research has highlighted important gaps in the
databases that contain all this genetic information. To fill the
gaps, the 1000 Genomes Project has undertaken a thorough and
systematic investigation of genetic variation between individuals.
The results of the pilot phase are now published in the journal
Nature and are freely available through the European
Molecular Biology Laboratory's European Bioinformatics Institute
(EMBL-EBI) and the US National Center for Biotechnology Information
(NCBI).
Launched in 2008, the Project first conducted three pilot studies,
led by Paul Flicek, to determine the best strategy for
characterising more than 95% of the genetic variants that can be
found in 1% or more of three different geographic population groups
(Europeans, East Asians and West Africans).
Disease researchers will use the catalog, which is being
developed over the next two years, to study the contribution of
genetic variation to illness. In addition to being distributed on the
Project’s own web sites, the pilot data set is
available via the Amazon Web Services (AWS) computing cloud to
enable anyone to access this unprecedentedly large data set, even if
they do not have capacity to download it locally.
A previous public project, the International HapMap Project,
provided an initial database of over 3 million human DNA variants
present in 270 DNA samples. Information and methods developed by the
HapMap Project fuelled a first generation of so-called “Genome Wide
Association Studies” (abbreviated GWAS) that have localized over 600
novel genetic risk factors for common diseases such as diabetes,
heart attack, inflammatory bowel disease, breast cancer,
schizophrenia, and other disorders. These studies were limited by
technology, however, to studying a subset of more common DNA
variants (those with frequency greater than 5-10%).
The 1000 Genomes Project exploits next-generation DNA sequencing
technologies to develop a much more complete database — one that
goes much lower in frequency, and one that is extended to more human
populations. This database will contain all forms of variation —
single letter changes (termed SNPs), small insertions and deletions
(termed “indels”) and large changes in the structure and copy number
of chromosomes (termed “copy number variations”). This integrated
map is a novel contribution, as previous studies have focused
exclusively on one form of DNA variation (even though each of our
genomes contains every kind of variation).
“The increased resolution of the 1000 Genomes map will provide
researchers with far more detailed sequence information beyond
common variants, including millions of less-common and rare
variants”, said Elaine Mardis, PhD, co-director of the Washington
University Genome Center and member of the project steering
committee. “Researchers who have found regions of the genome
associated with disease will be able to look at this data to see an
almost complete set of genetic variants in those regions that
might contribute directly to disease.”
The project partners, working in nine different centres, plan to
sequence the genomes of more than 2500 people from five large
population groups by the project’s completion in 2012.
Considering that one person’s genome contains around 3 billion
DNA base pairs, that’s a lot of data. In this pilot phase alone, a
total of 4.9 terabases of DNA sequence were generated (1 terabase is
1000 gigabases, about the size of 300 human genomes).
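As a rough check on these figures, the short sketch below converts raw sequence totals into genome lengths. It assumes a genome of exactly 3 billion base pairs, as quoted above. The helper name is purely illustrative and is not part of any Project software.

```python
# Rough unit check for the sequence totals quoted in the text.
GENOME_BASES = 3_000_000_000   # ~3 billion base pairs per genome (figure from the text)
GIGABASE = 10**9
TERABASE = 1000 * GIGABASE     # 1 terabase = 1000 gigabases


def genome_equivalents(total_bases, genome_bases=GENOME_BASES):
    """Return how many genome lengths fit into total_bases (hypothetical helper)."""
    return total_bases / genome_bases


per_terabase = genome_equivalents(TERABASE)
pilot_total = genome_equivalents(4.9 * TERABASE)

print(f"1 terabase    = about {per_terabase:.0f} genome lengths")
print(f"4.9 terabases = about {pilot_total:.0f} genome lengths")
```

Under that assumption one terabase comes to a little over 300 genome lengths, in line with the rounded figure given in the text.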
"The amount of information delivered by this first stage of the
project is remarkable," said Richard Durbin of the Sanger Institute
in the UK. "In less than two years, we identified 15 million
single-letter changes, 1 million small deletions or insertions and
20,000 larger variants. The majority of these variants — around 8
million — had never been seen before. This is the largest catalogue
of its kind, and having it in the public domain will help maximise
the efficiency of human genetics research."
Collecting, storing and analysing this data would be impossible
without highly sophisticated computing resources — specialised
software and huge amounts of processing power and data storage.
Thanks to innovations in DNA sequencing technology, genomic data
is being generated at rates previously unimaginable to life
scientists. This poses significant challenges not only for storing
and moving the information among different partners, but also for
its analysis. The EBI group developed a robust new computing
platform and several software innovations that made this pioneering
project possible, and will also pave the way for other sequencing
projects on an even larger scale.
"Having a systematic catalogue of human variation changes the way
we can study human genetics, much in the same way as having a
catalogue of human genes did," said Dr Flicek. "Among other things,
it also gives us a platform for analysing the connections between
genes and an individual’s disease risks." The results of the
collaboration extend well beyond the scope of the 1000 Genomes
Project, he said, and represent the beginning of a new era in human
genetics using genome-wide sequencing.
"This work shows the power of very recent advances in sequencing
to generate maps of genetic variation that bridge different scales,"
added Jan Korbel from EMBL in Heidelberg, Germany, who helped
analyse the larger variants. "It’s an exciting first step, which
paves the way for looking at the relationship between genetic
variations and diseases like cancer."
Uses of the project
The uses of Project data will be many. All of the variants described in the pilot study can now be
tested for their association with any given disease or trait (eg
susceptibility to addictive behaviour such as smoking). Indeed, the
data are already being used to inform a number of medical studies.
The results of the pilot study offer a much deeper, more uniform
picture of human genetic variation than was previously available,
and offer new insights into functional variation, genetic
association and natural selection in humans.
One clear use is to track down the causal mutations underlying
initial localizations from GWAS. A second is to make it possible to
test less common DNA variants for contributions to disease. And a
third is to help
identify rare mutations that cause strongly inherited diseases: in
studies aiming to find such rare mutations, it is very helpful to
have a complete database of common variants that can be screened out
to focus attention on those mutations that are unique to an
individual or family.
But before such uses could be realized, many technical and
analytical challenges had to be overcome. These were the focus of
the pilot projects.
Pilot projects — testing essential aspects of project feasibility
The first pilot project involved sequencing the genomes of six
people (two nuclear families each with two parents and a daughter)
at high coverage. Each sample was sequenced an average of 20 to 60
times using a variety of sequencing technologies. Previous
“personal genomes” were each based on only a single sequencing
method, and thus were limited to what that method could detect.
By using multiple methods, the Project has not only uncovered a
more complete picture of DNA variation in these individuals, but
also learned about the strengths and limitations of each of the
current technologies. These data also served as a comparison group
for the genome sequences analyzed in the other pilot projects.
The six genomes were sequenced by academic centers in China,
Germany, the UK, and the US, as well as by three companies, using
platforms from the companies: 454 Life Sciences, a Roche company;
Applied Biosystems, an Applera Corp. business; and Illumina Inc. All
of the platforms were able to sequence 85-90 percent of a genome and
produce high-quality data.
The second pilot project sequenced the genomes of 179 people at
low coverage — an average of three passes of the genome. Although
sequencing costs are dropping, it is still very expensive to
sequence the genomes of hundreds of people deeply enough to find all
of the genetic variants in each genome accurately.
An alternative approach is to sequence many genomes at light
coverage, and then combine the data from many people to discover
genetic variants that they share. The results of the pilot project
confirmed that this strategy is effective and will allow the project
to meet its goal of discovering sequence variants that are shared
among individuals.
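The intuition behind pooling light coverage across many people can be illustrated with a toy calculation. This is a simplified sketch, not the Project's actual variant-calling method. It assumes reads at a site follow a Poisson distribution, that the number of variant-carrying reads equals its expected value for a given allele frequency, that there is no sequencing error, and that a variant counts as "seen" once at least two reads show it. The two-read threshold and all function names are assumptions introduced here for illustration.

```python
import math


def prob_at_least(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def expected_variant_reads(n_people, depth, allele_freq):
    """Expected reads showing the variant allele at one site, pooled over everyone."""
    return n_people * depth * allele_freq


MIN_READS = 2  # assumed evidence threshold, not a figure from the Project

# One heterozygous carrier sequenced alone at 3x: half the reads carry the variant.
single = prob_at_least(MIN_READS, expected_variant_reads(1, 3, 0.5))

# 179 people at 3x, variant present at 1% frequency in the sample.
pooled = prob_at_least(MIN_READS, expected_variant_reads(179, 3, 0.01))

print(f"single genome at 3x:  {single:.2f}")
print(f"179 genomes pooled:   {pooled:.2f}")
```

In this toy model a single lightly sequenced genome misses the variant more often than not, while the pooled sample almost always shows it. That is the qualitative effect the low-coverage pilot set out to test.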
The third pilot project involved sequencing the coding regions,
called exons, of 1,000 genes in about 700 people to explore how best
to obtain a detailed catalog in the approximately 2% of the genome
that is composed of protein-coding genes.
This project provided an unprecedented sample size to learn about
the patterns of rare variation in the human population.
Data analysis and access — the first major
release of biomedical data on the Amazon Web Services Cloud.
The amount of data produced by the 1000 Genomes Project is unprecedented
in biomedical research. Currently, the total size of the datasets is
over 50 terabytes, or 50,000 gigabytes. That corresponds to almost
eight trillion DNA base pairs, or eight terabases,
of sequence data. Early in the project, merely copying the vast
quantities of data between the European Bioinformatics Institute
(EBI) in the U.K. and the National Center for Biotechnology Information
(NCBI), part of the U.S. National Library of Medicine,
consumed large fractions of both groups' capacity on the Internet
for several days.
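A back-of-envelope sketch shows why a transfer on this scale takes days. The dataset size and base count come from the text. The 1 Gbit/s link speed is a hypothetical value chosen for illustration and is not a figure reported by the Project.

```python
DATASET_BYTES = 50 * 10**12        # "over 50 terabytes" (decimal units, from the text)
SEQUENCED_BASES = 8 * 10**12       # "almost eight trillion DNA base pairs" (from the text)
ASSUMED_LINK_BITS_PER_S = 10**9    # hypothetical 1 Gbit/s link; NOT from the text
SECONDS_PER_DAY = 86_400


def transfer_days(n_bytes, bits_per_second):
    """Days needed to move n_bytes over a fully dedicated link of the given speed."""
    seconds = n_bytes * 8 / bits_per_second
    return seconds / SECONDS_PER_DAY


bytes_per_base = DATASET_BYTES / SEQUENCED_BASES
days = transfer_days(DATASET_BYTES, ASSUMED_LINK_BITS_PER_S)

print(f"storage per sequenced base: about {bytes_per_base:.2f} bytes")
print(f"transfer time at the assumed speed: about {days:.1f} days")
```

With the assumed link fully dedicated to the copy, the transfer takes between four and five days, which is consistent with the "several days" described above. The ratio of bytes to bases also shows that the stored data include more than the bare sequence letters.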
Researchers can freely access the 1000
Genomes Project pilot data through the 1000 Genomes website and can
also download the data from NCBI.
For many researchers and
institutions, especially those who lack the computer and analytical
power to study such a massive data set, an economical option is
being tested to access and analyze the pilot data.
The pilot datasets of the 1000 Genomes Project (7.3 TB of data) are
available as a public dataset through Amazon Web Services (AWS) and
integrated into the company’s Elastic Compute Cloud (Amazon EC2)
and Simple Storage Service (Amazon S3). As new data become available
and usage of these data increases on AWS, it is anticipated that
additional data sets will be made available in AWS.
The cost to researchers for computing through Amazon EC2 can be
counted in tens of dollars per day, compared to the hundreds of
thousands of dollars it would cost to purchase the computer
needed to download and analyze this amount of data locally. Because
1000 Genomes Project data are publicly available from EBI and
NCBI, other companies that provide similar computing services are
also free to download and provide the data to their clients.
The 1000 Genomes Consortium. A map of human genome variation from
population scale sequencing. Published online in Nature on
28 October 2010. DOI: 10.1038/nature09534.