Friday, January 23, 2015

Writing a low level graph database

I've been interested in graph databases for a long time, and I've developed several applications that offer something close to a graph API on top of relational storage. I've also played with off-the-shelf graph databases, but I thought it would be fun to try to implement my own, in Haskell of course.
I found that the literature is, in general, quite succinct about how database products manage their physical storage, so I've borrowed some of the ideas behind the Neo4J database, as explained in the Graph Databases book and in a few slideshows online.
So I've written the start of a very low-level graph database, writing directly to disk via Handles and some Binary instances. I try to use fixed-length records so that their IDs translate easily into offsets in the file. Mostly everything ends up looking like linked lists on disk: vertices have a pointer to their first property and their first edge, and in turn these have pointers to the next property or edge. Vertices also have pointers to the edges linking to and from them.
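
To give an idea of what that looks like, here is a minimal sketch; the record layout, field names and helper functions below are made up for illustration and don't match the actual code in the repository. A vertex record holding two Word64 pointers is always 16 bytes, so reading or writing one is just a seek to ID times record size:

{-# LANGUAGE RecordWildCards #-}
import Data.Binary (Binary(..), encode, decode)
import Data.Binary.Get (getWord64le)
import Data.Binary.Put (putWord64le)
import Data.Word (Word64)
import System.IO (Handle, SeekMode(AbsoluteSeek), hSeek)
import qualified Data.ByteString.Lazy as BL

type VertexID = Word64
type EdgeID   = Word64
type PropID   = Word64

-- A vertex record only stores the heads of two on-disk linked lists.
data Vertex = Vertex
  { vFirstEdge :: EdgeID   -- first edge in the vertex's edge list (0 = none)
  , vFirstProp :: PropID   -- first property in its property list (0 = none)
  }

-- Fixed-length encoding: two little-endian Word64s, 16 bytes per record.
instance Binary Vertex where
  put Vertex{..} = putWord64le vFirstEdge >> putWord64le vFirstProp
  get            = Vertex <$> getWord64le <*> getWord64le

vertexSize :: Integer
vertexSize = 16

-- Because every record has the same size, an ID is just an offset.
readVertex :: Handle -> VertexID -> IO Vertex
readVertex h vid = do
  hSeek h AbsoluteSeek (fromIntegral vid * vertexSize)
  decode <$> BL.hGet h (fromIntegral vertexSize)

writeVertex :: Handle -> VertexID -> Vertex -> IO ()
writeVertex h vid v = do
  hSeek h AbsoluteSeek (fromIntegral vid * vertexSize)
  BL.hPut h (encode v)

The nice property of fixed-length records is that no lookup structure is needed to find a record: the ID is the address.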
I've also had some fun trying to implement an index trie on disk.
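
The trie can use the same trick of fixed-length, pointer-chasing records. Again, this is only an illustrative sketch with made-up names and layout, not the format used in the repository: each node stores one byte of the key, an optional payload, and first-child and next-sibling pointers, so a lookup is a chain of seeks.

import Data.Binary (Binary(..), decode)
import Data.Binary.Get (getWord8, getWord64le)
import Data.Binary.Put (putWord8, putWord64le)
import Data.Word (Word8, Word64)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import System.IO (Handle, SeekMode(AbsoluteSeek), hSeek)

-- A fixed-length trie node: 1 key byte + three Word64s = 25 bytes.
data TrieNode = TrieNode
  { tKey         :: Word8   -- one byte of the indexed key
  , tValue       :: Word64  -- payload, e.g. a vertex ID (0 = no value here)
  , tFirstChild  :: Word64  -- ID of the first child node (0 = none)
  , tNextSibling :: Word64  -- ID of the next sibling node (0 = none)
  }

instance Binary TrieNode where
  put (TrieNode k v c s) =
    putWord8 k >> putWord64le v >> putWord64le c >> putWord64le s
  get = TrieNode <$> getWord8 <*> getWord64le <*> getWord64le <*> getWord64le

nodeSize :: Integer
nodeSize = 25

readNode :: Handle -> Word64 -> IO TrieNode
readNode h nid = do
  hSeek h AbsoluteSeek (fromIntegral nid * nodeSize)
  decode <$> BL.hGet h (fromIntegral nodeSize)

-- Look a key up starting from the first node of a sibling list:
-- scan the siblings for the next key byte, then descend into the children.
lookupKey :: Handle -> Word64 -> BS.ByteString -> IO (Maybe Word64)
lookupKey _ 0   _   = pure Nothing           -- no more nodes to try
lookupKey h nid key = case BS.uncons key of
  Nothing        -> pure Nothing             -- empty key: nothing to match
  Just (b, rest) -> do
    n <- readNode h nid
    if tKey n /= b
      then lookupKey h (tNextSibling n) key  -- wrong byte, try next sibling
      else if BS.null rest
        then pure (if tValue n == 0 then Nothing else Just (tValue n))
        else lookupKey h (tFirstChild n) rest -- matched, descend one level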
All in all, it's quite fun, even though I realize my implementations are quite naive, and I just hope that the OS disk caching is enough to make performance acceptable. I've written a small benchmark using the Hackage graph of packages as sample data, but I would need to write the same benchmark against a relational backend to have something to compare with.

If anybody is interested in looking at the code or even participating, everything is of course on Github!

2 comments:

Unknown said...

BTW, I've tried to write a binary driver in Haskell for OrientDB, but I got stuck trying to understand the format of some commands. The documentation of the binary protocol is very poor.

The idea of writing a native Haskell graph DB is very interesting.

Unknown said...

I also find this idea very interesting. I had a quick look at the code but did not try to do anything with it yet, although I would be eager to contribute and to use it.

I believe the greatest problem with graph databases is providing an efficient query engine, which implies some kind of efficient storage and analysis of such queries: graph traversal can get pretty inefficient if you have to jump all over the disk, I guess.

Do you have experience with, or plans for, implementing such a query analyser and executor?