Monday, September 8, 2008

Relational Databases are Considered Harmful for IR Systems

When building a new system, most developers would throw in  a relational database without thinking too hard bout it. After all, RDBMS are designed to deal with data. However, they may be not a good fit for some kind of systems. I'm talking about systems which deal with terabytes of (text) data. I'm referring to such systems as IR (Informational Retrieval) systems, but I'm not limiting this discussion only to IR. Of course, there are tons of systems where SQL server is still useful. Let's talk about concrete examples of both. We will use a web search engine as an example of an IR system. We will use a banking software (ie the software which bank runs) as an example of a traditional system. I'm going to briefly compare two in several dimensions. Later on, I will go into details on each.

  WEB Search Engine Bank System
Number of simultaneous writers one thousands
Number of simultaneous readers thousands millions
Cost of a Data Loss (per GB of data) low extremely high
Data Consistency not required required
Data Volume Terabytes Gigabytes
Transactions Few long-running Multiple short-running

 

In a bank system, thousands of users make deposits, transfers and so on. At the same time, in a web search engine there is only one (or a handful) of processes which update the index.

A Web search engine has an extremely high number of simultaneous users. A banking system has a moderate number.

A loss of any data my be catastrophic for bank and may potentially disrupt the business. A search engine can afford loosing some of its index. At the very worst, it will recrawl affected pages.

It is important that data is consistent across all bank databases. A canonical transactional processing example involves a money transfer from one account to another. Transactions like this should be either completed or not. Money should be withdrawn from one account and deposited into another. Cases when the money is only withdrawn or only deposited are not allowed. For the web search engine, on contrary, it is quite alright when the index doesn't reflect the latest changes on the web. The update may be lagging several days. Different copies of the index may have different versions of the same page and this is fine.

In a banks there are lots of simultaneous short running transactions  such as money transfers, withdrawals, deposits. A web search engine does transactions on big batches of data. I must say, this item is more of an implementation detail.

Finally, the volume of data in a search engine is much higher.

Clearly, a RDBMS supports banking scenario well. A wealth of thought has been put into allowing multiple simultaneous transactions.Under no circumstances, these transactions will put a database into inconsistent state. A typical RDBMS allows setups with high protection against data loss. These properties of RDBMS are not needed for a search engine. The also come with a high price both in terms of money and performance. Paying this price when not needed is hardly justified. For instance, typical SQL server configuration is several times more expensive then a typical commodity server. SQL server overhead in terms of a disk storage may be an order of a magnitude (SQL overhead, RAID mirroring, dump space, log space, etc..).

What can be used as an alternative to RDBMS for building large IR systems. I'm not sure as I'm only starting in this field. Feel free to suggest. One alternative would be to build your own custom data structures. I'm going to demonstrate this on several examples in coming posts. Right now, I'm thinking about describing an inverted index and a web crawler. Feel free to suggest your own.

1 comments:

ng2000 said...

Newdatabases.com hosts free msaccess databases look-alikes for windows. Might offer something helpful.