It seems to me as though the authors are unfamiliar with standard bioinformatics practice - this is not an unsolved problem, and the existing solutions are faster, more computationally elegant, and more flexible. Actually computing and searching for every string within edit distance X of the query is an incredibly poor way of solving the alignment problem (guess how long their strategy takes when you increase the edit distance to 5?), and it throws out 40+ years of research on sequence alignment. The real solutions to this problem solve it in tenths of a second in vastly superior ways.
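To make the blowup concrete, here's a back-of-the-envelope sketch (my own illustration, not from either post) counting just the Hamming neighborhood of a 20-mer - the number of DNA strings reachable by at most d substitutions. The true edit-distance neighborhood, with indels, is larger still:

```cpp
#include <cstdint>

// C(n, k) via the multiplicative formula; each intermediate division is exact.
uint64_t binomial(uint64_t n, uint64_t k) {
    if (k > n) return 0;
    uint64_t r = 1;
    for (uint64_t i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

// Number of DNA strings within Hamming distance d of a fixed length-n query:
// sum over k of C(n, k) * 3^k (choose k positions to change, 3 alternative
// bases at each). Substitutions only - indels make the count even bigger.
uint64_t neighborhood_size(uint64_t n, uint64_t d) {
    uint64_t total = 0, pow3 = 1;
    for (uint64_t k = 0; k <= d; ++k) {
        total += binomial(n, k) * pow3;
        pow3 *= 3;
    }
    return total;
}
```

At distance 3 that's ~33k candidate strings for a 20-mer; at distance 5 it's ~4.2 million, a roughly 130x blowup before indels are even considered.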
BLAT [0] is the most obvious and preferred solution designed for alignment against a reference sequence - a 20 BP search against the human genome should be essentially instantaneous. BLAST [1] is more versatile and a bit slower than BLAT, but would also align sequences of this size against the human genome in ~1-2 seconds; it's the traditional solution to the alignment problem and has no license restrictions. The default settings of BWA [2] and Bowtie [3] could also be modified for the task (they're optimized for performing this task with larger strings, on the order of thousands of times per second).
More generally, it would not be difficult to re-implement any of the algorithms behind these software implementations if the authors really wanted to. It's strange - this is the second post I've seen recently where software folks now working in the bioinformatics space seem completely unaware of both the basic algorithms we use in computational biology and their common implementations, like Smith-Waterman and Burrows–Wheeler. These are complicated problems with 40+ years of research behind them, and the actual solutions are elegant, fast algorithms that solve the problem in a far superior way within reasonable computational time.
Thanks for bringing up all of these alternatives. We definitely would have preferred using an existing solution over building our own.
Unfortunately, a lot of the existing software is not intended for the kind of search we're trying to do, or doesn't perform well under these conditions. We did in fact experiment with some of these tools before building our own. Bowtie, for example, doesn't allow more than 3 mismatches, and is also intended for alignments where the number of matches is very small (close to 1).
Since we need to be able to support multiple genomes (see Josh's comment), the amount of RAM we need to run a particular set of alignments is relatively important. Tools like BLAT (which also seems intended for queries > 25 bases) need to keep the entire genome index in memory. This means we would need to spin up a lot more servers to handle parallel requests, especially with different genomes.
FWIW our search is only a couple hundred lines of C++, and does the search with very little memory requirements.
Fair points about both the memory requirement and Bowtie. Also, 20 BP is really small - while BLAT seems to find exact 20 BP hits correctly, it doesn't find 20 BP hits with mismatches. Why be so memory-stringent, though? This is genomics, after all... If you're averse to rolling your own, I'd look into BWA-MEM as a solution (it has more adjustable mismatch scoring than Bowtie), or figure out how Primer-BLAST manages to do specificity checks for small sequences so well.
We care about reducing memory usage because it gives us greater flexibility in our infrastructure. We're offering this to hundreds of scientists a day - not huge numbers in the grand scheme of things, but usage is bursty and by paying customers. This means we can mix in CRISPR work with our other processing infrastructure, and we can choose between many servers with low memory vs a few servers with high memory.
Maybe genomics doesn't have to be so memory hungry. 8)
How is a fast approximate algorithm better than a slightly slower exact one?
Most software written by bio"informaticians" seems sub-par, both in implementation and in theoretical properties. I'd suspect this is caused by the fact that a lot of these people don't come from CS, but from biology or other fields.
We are talking about a brute-force, exhaustive search method versus algorithms with significantly better time complexities. I don't understand how the word "approximate" applies to these algorithms, which are largely or entirely deterministic - but sequence alignment is an inherently approximate question. You want your answer to be approximate, because mismatches and gaps are real things. In fact, finding the "approximate" answers is what the OP is after - places in the genome that look like, but aren't exactly, the query sequence. And as soon as you expand the query length or edit distance even a little, the brute-force approach becomes completely infeasible. The OP lucked out because this is a very special circumstance (the method is computationally infeasible for 99/100 applications of this problem).
Rabin–Karp with rolling hashes is actually (not exactly, but almost) what tools like rsync or BitTorrent use to find chunks and differences in files. So it scales really well.
The algorithms you cite come from a time when computers looked rather different in their architectures.
Linearly scanning the entire genome from RAM, for example, could be significantly better than performing multiple index lookups from disk.
Many bioinformatics problems (the text-processing ones, at least) aren't that hard or special anymore - a genome is tiny compared to the amounts of data we have lying around elsewhere.
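For reference, a minimal sketch of the rolling-hash idea behind Rabin–Karp mentioned above (my own illustration; the base is an arbitrary choice, and hash collisions are handled by an explicit recheck):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Polynomial rolling hash over a fixed-size window, the core of Rabin-Karp
// (and, in spirit, of rsync's weak rolling checksum): sliding the window by
// one character costs O(1), not O(window). Arithmetic wraps mod 2^64.
struct RollingHash {
    static constexpr uint64_t B = 1000000007ULL;  // base (illustrative choice)
    uint64_t hash = 0, pow_top = 1;               // pow_top = B^(w-1)
    size_t w;
    explicit RollingHash(size_t window) : w(window) {
        for (size_t i = 1; i < w; ++i) pow_top *= B;
    }
    void push(unsigned char c) { hash = hash * B + c; }
    void roll(unsigned char out, unsigned char in) {
        hash = (hash - out * pow_top) * B + in;   // drop old char, add new
    }
};

// Find all positions where `pattern` occurs exactly in `text`.
std::vector<size_t> rk_find(const std::string& text, const std::string& pattern) {
    std::vector<size_t> hits;
    const size_t w = pattern.size();
    if (w == 0 || text.size() < w) return hits;
    RollingHash target(w), cur(w);
    for (char c : pattern) target.push((unsigned char)c);
    for (size_t i = 0; i < w; ++i) cur.push((unsigned char)text[i]);
    for (size_t i = 0; ; ++i) {
        // A hash match may be a collision, so verify before reporting.
        if (cur.hash == target.hash && text.compare(i, w, pattern) == 0)
            hits.push_back(i);
        if (i + w == text.size()) break;
        cur.roll((unsigned char)text[i], (unsigned char)text[i + w]);
    }
    return hits;
}
```

This finds exact matches; the rsync-style use for diffs hashes fixed-size chunks and looks them up in a table, but the sliding-window mechanics are the same.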
This is exciting stuff. However, the sad takeaway to me is the broken patent system is already stifling what can be done with this innovation.
Consider this: the patent was awarded to a group that could not have invested more than a few thousand dollars of incremental time and resources (in fact, probably the majority of the costs were in the patent application and process itself). And yet the license is worth billions.
Patents were created - and the US Patent Charter still states this - to encourage and enhance the economic stature of the nation. Instead we use patents to throttle it. Imagine if this patent were only good for a few years, or capped at a license fee commensurate with the incremental investment needed to produce and validate the research (even if that fee were 3x, 5x, 10x, etc. of the costs). Everyone could contribute to the work and the pace of innovation would accelerate. Instead we've got a couple of universities (and, inevitably, follow-on corporate licensors) locking it down for all but the publicly funded and litigious.
There is so much opportunity out there, there are so many brilliant minds, eager innovators, and great startups. Why do we shoot ourselves in the foot with patent nonsense that hasn't been significantly rethought since (in the US) its 18th-century English law origins?
I would not stress about this. Patenting works of nature seems increasingly tenuous even in this broken patent environment. Will we really be able to place patents on ubiquitous biological systems that are 500 million years old and present in billions of copies even in our own guts?
Yep, GPUs were the first thing we looked at. :) (And to be fair, our current approach is pretty close to a brute-force search!)
The main reasons we decided against a GPU-based approach were cost and scalability:
- We support a dozen+ reference genomes (e.g. for different species), and plan to support a lot more (including eventually supporting custom genomes that users provide). Assuming we want to support a few concurrent searches against the same genome, we'd need a few GPUs per genome, and this gets expensive pretty quickly on AWS.
- Our fleet becomes non-homogeneous: if machine X fails, we need to restore a machine X' with the same set of genomes.
- If certain genomes are more popular than others, we'll likely have GPUs spun up that aren't being used much (only one lab might be investigating a certain genome, for example). I suppose you could swap genomes in and out of memory as they're accessed, but again it's more complex to manage resources.
- Our current approach allows us to add genomes ad-hoc - hypothetically, you could point us to your own genome on S3 and we'd be able to work with it.
We hint at it towards the end, but we're actually switching to AWS Lambda soon - based on early calculations, it could cost us as little as $50 a month to run everything!
Second, you can encode the search as a DFA - look up the Aho–Corasick algorithm. Then just run the DFA over the genome. It means that you don't need to match every string at every position. If you've read AAAAA and your string starts with CCCCC with an edit distance of 4, you can skip ahead for a while before you need to start reading again.
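A minimal sketch of that "stop reading early" idea (my own illustration, using plain Hamming mismatches rather than full edit distance): as soon as the mismatch budget is exhausted, the candidate position is abandoned, so most positions die after a few characters instead of all 20.

```cpp
#include <string>
#include <vector>

// Report positions in `text` where `query` matches with at most `max_mm`
// mismatches (substitutions only - no indels). The inner loop bails out the
// moment the budget is exceeded.
std::vector<size_t> scan_with_budget(const std::string& text,
                                     const std::string& query,
                                     int max_mm) {
    std::vector<size_t> hits;
    if (query.empty() || text.size() < query.size()) return hits;
    for (size_t i = 0; i + query.size() <= text.size(); ++i) {
        int mm = 0;
        for (size_t j = 0; j < query.size(); ++j)
            if (text[i + j] != query[j] && ++mm > max_mm) break;  // give up early
        if (mm <= max_mm) hits.push_back(i);
    }
    return hits;
}
```

A real DFA-based matcher goes further by sharing work between overlapping positions; this sketch only shows the per-position early exit.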
Third, you could build a suffix tree (O(n) preprocessing), and then use the standard fuzzy string matching algorithm using suffix trees on it.
Why did BLAST not work for you? In the comments here you keep mentioning memory, but we've never had that be the bottleneck (and we use BLAST for things much larger than the human genome).
I'm really concerned that the team behind a bioinformatics tool is talking about searching sequences without even a mention of BLAST. It should have been solution #1!
We use the scoring function published by Hsu et al[1], which most scientists seem to be using. This function takes into account both the number of mismatches and where they occur in the guide. There's a more readable version here: http://crispr.mit.edu/about .
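For illustration only, here's the shape of the position-weighted term of such a score. The weights below are placeholders of my own (a simple ramp rising toward the PAM-proximal end), not the published Hsu et al. values; the published score also penalizes total mismatch count and mismatch spacing, which this sketch omits.

```cpp
#include <vector>

// Position-weighted term of an Hsu-style off-target score: each mismatch
// multiplies the score by (1 - w[pos]), with weights rising toward the
// PAM-proximal end of the 20-base guide (mismatches near the PAM hurt
// binding more, so they drive the score down harder).
// NOTE: placeholder weights - see Hsu et al. / crispr.mit.edu for real ones.
double hsu_style_score(const std::vector<int>& mismatch_positions) {
    std::vector<double> w(20);
    for (int i = 0; i < 20; ++i) w[i] = 0.8 * i / 19.0;  // illustrative ramp
    double score = 1.0;
    for (int p : mismatch_positions) score *= 1.0 - w[p];
    return score;
}
```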
Haha, Vineet appreciates the compliment, I think...
I mentioned it in a comment below, but our constraints are a bit different once you start supporting multiple genomes - we could've been clearer about that in the post for sure.
Nice post. What tools do you use to understand which line of the code is creating a bottleneck? For instance, the "if statement has too many false positives" in Solution #4. Do you locate these bottlenecks simply by manual inspection of the code?
A lot of these are actually pretty easy to spot with tools like gcov. To determine false positives, we can just look at how many times each "if" statement was hit, and compare those to the final count.
Have you thought about using the Hamming distance instead of the array?
It should give you the same answer in 3 CPU instructions on registers instead of 2 array lookups and arithmetic in RAM: XOR, Hamming weight/popcount, and an equality check.
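A sketch of that register-level approach, assuming sequences are packed 2 bits per base (my own illustration; `__builtin_popcountll` is a GCC/Clang intrinsic):

```cpp
#include <cstdint>
#include <string>

// Pack a DNA string (up to 32 bases) into a 64-bit word, 2 bits per base.
uint64_t pack2bit(const std::string& seq) {
    uint64_t v = 0;
    for (char c : seq) {
        uint64_t code = (c == 'A') ? 0 : (c == 'C') ? 1 : (c == 'G') ? 2 : 3;
        v = (v << 2) | code;
    }
    return v;
}

// Hamming distance between two packed sequences: XOR leaves a nonzero 2-bit
// group at every mismatching base; fold each group's two bits into one,
// then popcount. A handful of register ops, no memory traffic.
int hamming2bit(uint64_t a, uint64_t b) {
    uint64_t x = a ^ b;
    uint64_t mask = 0x5555555555555555ULL;   // low bit of each 2-bit group
    uint64_t mism = (x | (x >> 1)) & mask;   // one set bit per mismatched base
    return __builtin_popcountll(mism);       // GCC/Clang; std::popcount in C++20
}
```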
We actually do this in our "expensive" check. The reason we use the 2 arrays is that with 2 array lookups, we can check for matches against all 200 guides at once. I probably could've made that clearer in the post - the array contains matches for ALL of the guides.
PAM sites? I'm working on a genome assembly now and I want to try to identify potential gRNA sites around genic regions. This post is pretty helpful.
I left PAM sites out of the blog post (it's actually mentioned briefly in the footnotes), as it made the problem slightly more complicated.
The final algorithm actually keeps track of the last 20 bases + PAM length, and checks both the edit distance and PAM before deciding if something is a match. The Benchling CRISPR tool will do this for you :)
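A simplified sketch of such a windowed check (my own illustration, not Benchling's code), assuming an SpCas9-style NGG PAM and checking only the forward strand - a real tool would also scan the reverse complement and handle indels:

```cpp
#include <string>
#include <vector>

// Slide over the genome; at each position require both (a) an NGG PAM
// immediately 3' of the 20-base protospacer and (b) at most `max_mm`
// mismatches against the guide. The PAM check runs first because it's
// cheap and rejects most positions.
std::vector<size_t> find_sites(const std::string& genome,
                               const std::string& guide,  // 20 bases
                               int max_mm) {
    std::vector<size_t> hits;
    const size_t glen = guide.size();
    if (glen == 0 || genome.size() < glen + 3) return hits;
    for (size_t i = 0; i + glen + 3 <= genome.size(); ++i) {
        // NGG: position glen is the 'N' (anything), then two G's.
        if (genome[i + glen + 1] != 'G' || genome[i + glen + 2] != 'G')
            continue;
        int mm = 0;
        for (size_t j = 0; j < glen && mm <= max_mm; ++j)
            mm += (genome[i + j] != guide[j]);
        if (mm <= max_mm) hits.push_back(i);
    }
    return hits;
}
```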
Couldn't you use one of the available short-read aligners (BWA, Bowtie, ...) for this? Most of these aligners use some kind of FM-index for indexing the genome.
We messed around with Bowtie - it seems like most of these are optimized for cases where the number of alignments is small (i.e. close to 1). Unfortunately, the number of matches for a 20-base guide on the human genome is closer to 1000.
[0] http://genome.ucsc.edu/cgi-bin/hgBlat
[1] http://blast.ncbi.nlm.nih.gov/
[2] http://bio-bwa.sourceforge.net/
[3] http://bowtie-bio.sourceforge.net/index.shtml