28 March 2009

Opinion and Speculation: Log Structured vs Traditional Block

At least since early 2006, at the MySQL Athens meeting where I first met and listened to Jim Starkey, I have been firmly convinced that log structured databases are the future for disc-based storage devices. Their largely sequential write pattern ideally suits modern drives, which are optimized to write whole tracks of data at a time. I even proposed such a project to Monty and Brian at that meeting. Of course, they said that I should go ahead and write one, but circumstances conspired against me and I never really progressed much beyond the experimental proof of concept stage, my time having been occupied with the aborted Amira project and with providing some assistance to the early Falcon project.

For that reason, when I first heard of PBXT, I was very excited. I have told many people to keep an eye on that project because, although it was slower at the time, it would catch up with and then surpass traditional block storage databases such as InnoDB.
It's taken a while, but ... a big THANK YOU to Paul and his team at Primebase for ensuring that I do not have to eat my words after all the talking up of PBXT I have done over the past couple of years.

The game-changer in the near future is Flash storage and other solid state media. Such technologies mean that there is no seek or head settle time, so optimization strategies like clustered indexes become obsolete. However, Flash does have some overhead in writing, and it is preferable to write whole flash blocks at a time. Right now, I believe that most Flash media use 64 KB blocks, but as Flash media increase in size and performance by increasing the bit-width, the effective block size of the media will increase. Log-structured storage can optimize for this because all writes are consolidated into a block, and, with no seek penalty, there is little cost to the index being scattered across many segments.
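To make the "all writes are consolidated into a block" idea concrete, here is a minimal toy sketch in Python. It is not how PBXT or any real engine works; the class name, the in-memory list standing in for the flash device, and the 64 KB default (taken from my guess above) are all assumptions made purely for illustration.

```python
BLOCK_SIZE = 64 * 1024  # assumed flash block size, per the 64 KB guess above


class LogStructuredStore:
    """Toy append-only store that only ever writes whole blocks (hypothetical)."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = []            # stands in for the medium; one entry per block write
        self.buffer = bytearray()   # records waiting to fill the current block
        self.index = {}             # key -> (block number, offset, length) of newest version

    def put(self, key, value: bytes):
        if len(value) > self.block_size:
            raise ValueError("record larger than a block")
        # If the record does not fit in the current block, write that block out first.
        if len(self.buffer) + len(value) > self.block_size:
            self.flush()
        offset = len(self.buffer)
        self.buffer += value
        # The buffered block will become block number len(self.blocks) when flushed.
        # Older versions of the key stay in the log; only the index moves.
        self.index[key] = (len(self.blocks), offset, len(value))

    def flush(self):
        if self.buffer:
            # Pad so that every write to the medium is exactly one full block.
            self.blocks.append(bytes(self.buffer).ljust(self.block_size, b"\0"))
            self.buffer = bytearray()

    def get(self, key):
        block_no, offset, length = self.index[key]
        # A record may still be in the write buffer if its block has not been flushed.
        data = self.buffer if block_no == len(self.blocks) else self.blocks[block_no]
        return bytes(data[offset:offset + length])
```

Updates never overwrite in place: a new version is appended and the index is repointed, so versions of a row end up scattered across segments. On media with no seek penalty, following the index to any segment costs about the same, which is the point argued above. A real engine would also need garbage collection of dead versions and a durable index, both omitted here.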

After Flash... Let us imagine a future where we have some form of ultra-fast memristor-based storage which supplants DRAM, Flash and disc media. Then all this talk about database storage engines becomes practically moot ... just wire up the memristor memory to your 64-bit CPU with its 64-bit address bus. Provide record version control, perhaps by some form of in-memory index, perhaps vaguely log-structured but with no need for contiguous segments, in order to scale on NUMA architectures (which Intel is now transitioning to with their new HyperTransport-inspired workalike).


knielsen said...

I'm wondering if it is really true that SSD will make the use of clustering obsolete? SSD will optimize the random reads against a clustered index as well as the random reads against a heap-based table. So reading 50 rows from a single clustered block will still be much faster than reading 50 rows from 50 different blocks. And further, index scans, which can be expensive with clustering as the index blocks will tend not to be in sorted order on disk, will be much improved for a clustered index due to eliminating the seek penalty, while there will not be a similar effect for non-clustered.

Of course, the general improvement of random access time that SSD gives will remove the need to do any optimization at all in many medium-loaded databases.

But agree, PBXT, and log-structured in general, is very interesting, and something to keep an eye on.

I'd really like to see a shared-nothing storage engine using log-structured storage. Like NDB, but geared towards InnoDB-type applications, and using BTree disk storage rather than in-memory hash tables. Log-structured btrees provide very fast sorted scans of data that could be very useful to do on-line DDL and replica-cloning. But as you say, it requires a _lot_ of time to build such a thing.

J Chris A said...

What you describe, down to the replication, does take a lot of time to implement. Luckily it's been in progress for a few years and it's called CouchDB.