How accurate is the new Ion Torrent genome, really?

Thursday, 21 July 2011 13:57

New sequencing technology Ion Torrent has made a splash with a paper in today’s issue of Nature. There’s no question the high-impact publication is a massive boost for the young platform, now nestled within the embrace of the giant Life Technologies (who acquired the startup for a surprisingly large price last August) and bracing for the impending launch of its most serious competitor, Illumina’s MiSeq.

The paper jumps the new platform through the standard hoops: some basic kicking-the-wheels, a test run on three bacterial genomes (Vibrio fischeri, Escherichia coli, and Rhodopseudomonas palustris), and then the traditional main event: the sequencing of a complete human genome. The genome in question is that of Intel co-founder Gordon Moore, the eponymous originator of Moore’s Law. There’s some pleasing symmetry here: Moore’s Law is frequently cited in the context of the massive decline in the costs of DNA sequencing; in addition, the Ion Torrent technology is based on the same kind of semiconductor technology pioneered by Moore. Refreshingly, the paper refers to Moore by name, which is a pleasant change from the rather affected pseudo-anonymity of other published genomes (e.g. Patient Zero).

Anyway, I’m not going to comment at all here on the technical and bacterial work, which I have no doubt will be covered in detail by my esteemed colleagues Keith Robison and Nick Loman. My main interest in this paper is what it tells us about the potential of Ion Torrent as a platform for large-scale sequencing of human genomes, and as a rival to current sequencing market leader Illumina. I also want to spend some time berating the authors of the paper for a thoroughly misleading piece of statistical sleight-of-hand that makes their accuracy numbers sound far better than they actually are.

What did they do?

The company sequenced Moore’s genome using their technology to an average coverage of 10.6x. This just means that on average each base in the genome was covered by 10.6 separate Ion Torrent reads, albeit with substantial variation: some bases had lots more reads, and some had fewer. You can see the distribution of read counts per base (in red), compared with the ideal distribution (a Poisson distribution, in green) in Figure 4b of the paper. It’s clear that there are plenty of positions in the genome with substantially fewer than 10 reads.
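For a rough feel for what the ideal curve implies, here is a minimal Python sketch of the Poisson model at a mean depth of 10.6. It is only the idealised model, not the paper's observed distribution, and the function names are my own.

```python
import math

MEAN_COVERAGE = 10.6  # average depth reported for the Ion Torrent genome


def poisson_pmf(depth, mean):
    """P(depth == k) under an idealised Poisson coverage model."""
    return math.exp(-mean) * mean ** depth / math.factorial(depth)


def poisson_cdf(depth, mean):
    """P(depth <= k) under the same model."""
    return sum(poisson_pmf(k, mean) for k in range(depth + 1))


# Fraction of positions with no reads at all, and with fewer than 10 reads,
# if coverage were perfectly Poisson-distributed.
frac_uncovered = poisson_pmf(0, MEAN_COVERAGE)
frac_below_10 = poisson_cdf(9, MEAN_COVERAGE)

print(f"uncovered under ideal Poisson: {frac_uncovered:.6%}")
print(f"fewer than 10 reads under ideal Poisson: {frac_below_10:.1%}")
```

Even in this best case a large minority of positions sits below 10 reads; real coverage is more uneven than Poisson, so the observed tail is heavier still.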

Let’s be very clear about this up front: by modern standards, this is a poor-quality genome. An average coverage of 10x means that most positions in the genome will be covered by at least one read – 99.21%, in this case – but in many of those locations, the number of reads will be too low to have any chance of accurately calling a heterozygous SNP (a base change where both different versions are present, one on the maternal and one on the paternal chromosome). This isn’t a function of the raw data quality – it’s simply a statistical consequence of sampling error at small sample sizes, that can only be overcome by additional sequencing.
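The sampling argument can be sketched numerically. The Python below assumes each read at a heterozygous site samples either allele with probability 0.5, that depth is Poisson-distributed, and that a heterozygote is "detected" when each allele appears in at least two reads. That last rule is a hypothetical threshold chosen for illustration, not the caller used in the paper.

```python
import math


def binom_pmf(k, n, p=0.5):
    """P(exactly k of n reads carry a given allele)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def het_detection_prob(depth, min_reads_per_allele=2):
    """P(both alleles are seen in at least min_reads_per_allele reads)."""
    if depth < 2 * min_reads_per_allele:
        return 0.0
    lo, hi = min_reads_per_allele, depth - min_reads_per_allele
    return sum(binom_pmf(k, depth) for k in range(lo, hi + 1))


def expected_het_sensitivity(mean_coverage, min_reads_per_allele=2, max_depth=200):
    """Average detection probability over Poisson-distributed depth."""
    total = 0.0
    for depth in range(max_depth + 1):
        # Poisson pmf computed in log space to avoid overflow at large depths.
        log_p = -mean_coverage + depth * math.log(mean_coverage) - math.lgamma(depth + 1)
        total += math.exp(log_p) * het_detection_prob(depth, min_reads_per_allele)
    return total


for coverage in (10.6, 15.0, 30.0):
    print(f"{coverage:>5}x: expected het sensitivity {expected_het_sensitivity(coverage):.3f}")
```

Under these assumptions sensitivity climbs with depth regardless of per-base accuracy, which is the point: the shortfall at low coverage comes from sampling, and real (non-Poisson) coverage would make it worse.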

It’s also an extremely expensive genome: even at this low coverage the sequencing burned through around 1,000 Ion Torrent chips, and in an NY Times piece yesterday sequencing guru George Church estimated the total cost of this project at around $2 million. That would be substantially lower at today’s prices, but still north of $200,000 for a poor-quality genome compared to less than $5,000 for a high-quality sequence from Complete Genomics. The yield of the Ion platform (in terms of bases per dollar) is of course going up rapidly, but I think it’s important to emphasise that Ion Torrent is not yet a remotely competitive technology for affordable whole human genome sequencing.

So how accurate is the genome sequence, really?

The authors attempted to explicitly estimate their error rate by sequencing Moore’s genome a second time using an independent technology: in this case, Life Technologies’ SOLiD platform, to a total coverage of around 15x. (The higher depth of the SOLiD sequencing understates the far higher yield from that platform compared to Ion Torrent; for this paper the authors ran over 1,000 chips on the Ion Torrent, whereas the SOLiD coverage was presumably achieved in a single run.) 15x coverage isn’t much better than 10x, so the SOLiD sequence would be expected to be missing plenty of heterozygous sites as well.

So, the authors have two separate low-coverage genomes, both of which would be expected to be missing plenty of SNPs – that means we would expect to see plenty of sites that differ between the two sequences (reflecting changes that by chance were detected by one platform but missed by the other). Yet the paper appears to cite a “validation rate” for the SNPs called by the Ion Torrent that is implausibly high:

To confirm the accuracy of our analysis, we also sequenced the G. Moore genome using ABI SOLiD Sequencing⁴³ to 15-fold coverage and validated 99.95% of the heterozygous and 99.97% of the homozygous genotypes (Supplementary Tables 1 and 2). [my emphasis]

There’s absolutely no conceivable way that a comparison between a 10x genome sequence and a 15x genome sequence could possibly result in a “validation rate” of 99.95% for heterozygous sites, at least not for any reasonable definition of the term “validation rate”. It takes some digging in the supplementary data to figure out what’s going on here. This is the definition of the term in the legend of Table S2, where the metric is referred to as the “percent same genotype”:

In cases where both datasets call the same type of SNP (heterozygote or homozygous variant) the proportion for which the genotype call is the same

The only way I can parse that sensibly is as follows: for sites that are called as heterozygous in both the Ion Torrent and SOLiD data, the “validation rate” is the proportion where the same two alleles are present. In other words, non-validated sites would only be sites where both platforms called a heterozygous SNP, but one platform said it was an A/G SNP while the other said it was an A/C SNP.

This is a near-useless metric, and does not correspond to any meaningful definition of the term “validation rate”. It gives us no information about what we actually want to know: the proportion of sites where a SNP is called by one platform but not by the other – those are simply excluded from the comparison entirely. Instead, it measures the platform’s ability to call the correct non-reference base at sites that are genuinely polymorphic, something that would be extremely high for virtually any half-decent sequencing technology. The only useful thing this metric does is provide a percentage with lots of convincing nines in it, which I’m sure the investors love, but I’m seriously perplexed that it managed to sneak past the manuscript reviewers.

Let’s take a more sensible definition of the term “validated”: for instance, let’s say it’s the proportion of sites called as heterozygous by Ion Torrent that also show some evidence of variation in SOLiD (we’ll generously say that the variant can be either homozygous or heterozygous in the SOLiD calls). Using this more plausible definition, the validation rate for Ion Torrent SNPs is just 88.0% at homozygous sites and 84.4% at heterozygous sites.
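To make the difference between the two definitions concrete, here is a small Python sketch run on made-up call sets. The site names, genotypes and function names are all hypothetical and bear no relation to the paper's data. The first function follows my reading of the Table S2 legend; the second follows the more sensible definition above.

```python
def zygosity(genotype):
    """Classify a variant genotype (pair of alleles) as 'het' or 'hom'."""
    return "het" if len(set(genotype)) == 2 else "hom"


def same_genotype_rate(calls_a, calls_b):
    """Table S2-style metric: among sites where both call sets report the
    same type of SNP, the share whose alleles also match."""
    comparable = [
        site for site in calls_a
        if site in calls_b and zygosity(calls_a[site]) == zygosity(calls_b[site])
    ]
    if not comparable:
        return None
    same = sum(sorted(calls_a[s]) == sorted(calls_b[s]) for s in comparable)
    return same / len(comparable)


def cross_platform_support_rate(calls_a, calls_b, kind):
    """Share of calls_a sites of the given zygosity that have any variant
    call (het or hom) in calls_b."""
    sites = [s for s in calls_a if zygosity(calls_a[s]) == kind]
    if not sites:
        return None
    return sum(s in calls_b for s in sites) / len(sites)


# Toy variant calls (hypothetical): site -> pair of called alleles.
ion_calls = {
    "chr1:100": ("A", "G"), "chr1:200": ("C", "T"), "chr1:300": ("G", "G"),
    "chr1:400": ("A", "C"), "chr1:500": ("T", "T"),
}
solid_calls = {
    "chr1:100": ("A", "G"), "chr1:200": ("T", "C"), "chr1:300": ("G", "G"),
    "chr1:600": ("A", "T"),
}

print(same_genotype_rate(ion_calls, solid_calls))
print(cross_platform_support_rate(ion_calls, solid_calls, "het"))
print(cross_platform_support_rate(ion_calls, solid_calls, "hom"))
```

In this toy example the first metric comes out perfect because sites called by only one platform never enter its denominator, while the second metric counts every unsupported Ion Torrent call against it.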

Ion Torrent could no doubt argue that this calculation is unfair to them: in many (probably most) cases, a discrepancy between Ion Torrent and SOLiD will be due to SNPs that were missed by the SOLiD technology, and thus aren’t really errors made by Ion Torrent. This is absolutely true, and in response I say: so do a proper job of validating your variants. Being a part of Life Technologies, one might imagine, should give the chaps at Ion Torrent a decent amount of access to SOLiD machines, and one more run of Moore’s genome on a SOLiD 4 would have given a far cleaner genome sequence for comparison. LIFE might even have one or two of those old capillary sequencing machines around that they used to sell: just 100-200 targeted capillary reactions around sites discrepant between the Ion Torrent sequence and a high-quality SOLiD sequence would have given plenty of data for an accurate estimation of the platform’s real false positive and false negative rates.

Lack of proper validation is even more of an issue for larger structural variants. Here the authors steer clear of attempting to discover new variants, focusing instead on figuring out whether Moore carries any of the known structural variants called by the 1000 Genomes pilot project (PDF). Of 7,565 large deletions and inversions found by 1000 Genomes, the authors find evidence for 3,413 of them in Moore’s genome. That seems like a surprisingly large proportion to me, and it’s unclear how many of these calls are real: the authors report the results of a simulation using random genomic regions to estimate that 99.94% of their called events are real, but this number is not particularly meaningful as true deletion breakpoints are not well-represented by random chunks of the genome. And here there is absolutely no experimental evidence brought to bear – for instance, as far as I can tell no attempt was made to see how many of these apparent deletions also showed support in the SOLiD data, and certainly no attempt to independently validate the variants using a simple PCR assay.

All in all, a disappointing showing. This clearly isn’t a great genome sequence – it simply can’t be at 10x coverage, no matter how good the raw accuracy is – but the authors haven’t done enough experimental work to get a good sense of how accurate it really is. That means there’s very little we can say about the utility of Ion Torrent for whole human genome sequencing, apart from the fact that it’s currently too expensive to be practical.

What does Moore’s genome tell us about him?

Not much. The authors make a fairly cursory attempt at genome interpretation, pulling annotations from 23andMe’s database and OMIM, but their results aren’t particularly useful. That’s not a criticism, by the way: the point of this paper was demonstrating a sequencing technology, not a functional annotation pipeline. (Incidentally, 23andMe’s database was apparently used without any formal collaboration with the company, suggesting the researchers simply scraped the information from the company’s website: it’s intriguing to see one of the companies attacked by the FDA and Congress as “snake oil” being used as the go-to source for functional annotation.)

However, I note that the indefatigable Mike Cariaso has already run Moore’s genome through his interpretation pipeline Promethease – you can get the results here. It appears Moore has an increased risk of baldness (check), altered responses to various drugs, and a potentially highly elevated risk of age-related macular degeneration. However, nothing that he couldn’t have learnt from a 23andMe test, at less than 0.1% of the cost.

Where to next for Ion Torrent genomes?

This has been a pretty negative post, because I’ve focused solely on a section of the paper that – I’ll be frank – was done pretty badly. It’s not intended to be a critique of the Ion Torrent technology as a whole, and I’ll leave an evaluation of the technical merits of the platform to others who know it far better than I.

Still, I can’t help but wonder if Torrent made a mistake in including a human genome in this paper at all. I mean, I know it’s traditional, and sequencing Moore makes for some easy headlines, but the Torrent platform simply isn’t currently suited to whole-genome sequencing and won’t be until its yield improves substantially (there are clear signs in the paper that this is happening, albeit perhaps a little slower than we were promised). In sequencing a human genome with this early-stage, low-yield technology, Ion Torrent was forced into a dilemma of its own making: either spend an obscene amount of money to generate a high-quality sequence, or spend a simply lewd amount of cash to generate a crappy sequence. In the end they opted for the second approach, and I suspect they would have been better off simply leaving Moore’s genome out of the paper entirely.

In any case, I should emphasise that given the slow pace of publishing, this is a genome that was put together using the technology of maybe 12 months ago. There’s no question that Torrent technology has been improving over that time, and while it’s still not at the stage of competing with Illumina on cost right now, it’s certainly possible that this will be more viable in 12 months’ time. Hopefully the next genome sequence published using this technology comes complete with sufficient validation data to get a real impression of its quality.

Top image: The Ion Semiconductor Sequencing Chip. (Ion Torrent)

Published in News Technologique-Tech News
