It is widely known that you can trade memory for speed in some circumstances. For instance, instead of repeating a lengthy computation each time, you can precompute the results, store them in memory, and look them up at run time. Another example is caching, in RAM or on disk. A less well-known fact is that you can sometimes speed things up significantly at the price of allowing more latency. For instance, instead of sending data over the wire, burn it onto DVDs and ship them.
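To make the memory-for-speed trade concrete, here is a minimal sketch of the precompute-and-lookup idea. The bit-counting task and the names `POPCOUNT` and `popcount32` are my own illustration, not something from a particular library: a 256-entry table is built once, and each later call does four table lookups instead of recounting bits.

```python
# Built once at import time: the number of set bits for every byte value.
# This costs 256 list entries of memory.
POPCOUNT = [bin(i).count("1") for i in range(256)]


def popcount32(x):
    """Count the set bits of a 32-bit unsigned integer via table lookups."""
    return (
        POPCOUNT[x & 0xFF]
        + POPCOUNT[(x >> 8) & 0xFF]
        + POPCOUNT[(x >> 16) & 0xFF]
        + POPCOUNT[(x >> 24) & 0xFF]
    )
```

Whether the table actually wins depends on the language and the hardware, so treat this as a picture of the pattern rather than a tuned routine.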
Here is an interesting technique that is often used in crawlers and was described in the IRLbot paper, "IRLbot: Scaling to 6 Billion Pages and Beyond". The problem is that for each new URL you need to determine, among other things, whether you have already crawled it. Logically, you need a database of seen URLs, and the number of lookups and/or inserts per second it can sustain is a potential crawler bottleneck. The conventional technique of one DB lookup per URL soon hits a scaling wall as the number of seen URLs reaches into the billions. Instead, lookup requests can be batched and then executed all at once by doing a merge. A merge is a much more complex operation than a lookup. However, many requests can be executed in one merge operation, whereas otherwise each request would require a separate lookup. The price is latency: a URL does not learn whether it is new until its batch has been merged.
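The batching idea can be sketched in a few lines. This is a toy in-memory version under my own assumptions, not the data structure from the paper: the seen-URL store is modeled as a sorted Python list standing in for a sorted file on disk, and `merge_batch` is a hypothetical name. The point it shows is that one sequential pass over the store answers every request in the batch and produces the updated store at the same time.

```python
def merge_batch(seen_sorted, batch):
    """Check a whole batch of URLs against the seen set in one merge pass.

    seen_sorted: sorted list of already-seen URLs (stand-in for a disk file).
    batch: URLs collected since the last merge, in any order, maybe repeated.
    Returns (new_urls, merged_seen), both sorted.
    """
    pending = sorted(set(batch))  # sort and dedupe the batch once
    new_urls, merged = [], []
    i = j = 0
    # Classic two-pointer merge: each input is read once, front to back.
    while i < len(seen_sorted) and j < len(pending):
        s, p = seen_sorted[i], pending[j]
        if s < p:
            merged.append(s)
            i += 1
        elif s > p:
            new_urls.append(p)  # not in the seen set, so it is new
            merged.append(p)
            j += 1
        else:
            merged.append(s)  # already seen, drop the request
            i += 1
            j += 1
    # At most one of these two tails is non-empty.
    merged.extend(seen_sorted[i:])
    new_urls.extend(pending[j:])
    merged.extend(pending[j:])
    return new_urls, merged


seen = ["a.com/", "b.com/", "d.com/"]
batch = ["c.com/", "a.com/", "e.com/", "c.com/"]
new_urls, seen = merge_batch(seen, batch)
```

With a store on disk, the pass is sequential I/O whose cost is shared by the whole batch, so the larger the batch, the cheaper each individual check becomes, at the cost of waiting longer for an answer.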
