NoSQL and Its Importance

I just attended a conference at the ThoughtWorks office in Delhi. It was a great talk. Neal Ford was phenomenal and really showed how technical presentations should be given: they do not have to be boring. To my surprise, he has also written a book about presentations.

Anyway, coming to the point. The talk started with an introduction to NoSQL: what it is and what kind of use cases it fits. As expected, a lot of people came from an RDBMS background, so it was initially very hard for them to grasp the concept of NoSQL.

Fortunately that was not the case with me, as I have been exploring these technologies for the last couple of years and have delivered some successful projects using Neo4J and MongoDB.

So I would like to put my thought process forward.

NoSQL means that people in the SQL world should look for alternative persistence technologies when the need arises. A lot of the time, when data needs to be stored, SQL does not provide a natural way of storing it.

Take, for example, hierarchical data and unstructured data. Many-to-many relationships are not a pretty sight either.

I found SQL to be very limited in features and capabilities when it comes to storing hierarchical data.

All you can do is create a parent-child relationship and run recursive queries. As we all know, 7 out of 10 times in a big application the database is the first culprit, and you need real experts to fine-tune SQL queries once you start feeling that the application is not performing up to expectations and users are drifting away.
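To make the parent-child approach concrete, here is a minimal sketch using Python's built-in sqlite3 module and a recursive query. The subject table, its columns and the sample rows are made up for illustration and do not come from any real project.

```python
import sqlite3

# Hypothetical subject hierarchy; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)"
)
conn.executemany(
    "INSERT INTO subject (id, name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "Science", None),
        (2, "Physics", 1),
        (3, "Mechanics", 2),
        (4, "Chemistry", 1),
        (5, "History", None),
    ],
)


def descendants(conn, root_id):
    """Return (name, depth) for every node below root_id, by depth then name."""
    return conn.execute(
        """
        WITH RECURSIVE tree(id, name, depth) AS (
            SELECT id, name, 0 FROM subject WHERE id = ?
            UNION ALL
            SELECT s.id, s.name, t.depth + 1
            FROM subject AS s JOIN tree AS t ON s.parent_id = t.id
        )
        SELECT name, depth FROM tree WHERE depth > 0 ORDER BY depth, name
        """,
        (root_id,),
    ).fetchall()


science_subtree = descendants(conn, 1)
print(science_subtree)
```

It works, but every extra level of the hierarchy is another self-join that the database performs behind the scenes, and this is the kind of query that tends to need tuning as the data grows.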

NoSQL databases can be divided primarily into four categories:

Document Based (Mongo, Couch)
Key Value (Redis, Memcache, Dynamo, Riak)
Columnar Database (Cassandra, HBase)
Graph (InfiniteDB, Neo4J)

Of these four, graph databases have a unique place and are the easiest to decide on, at least in my experience. Whenever data is hierarchical and the relations cannot be modeled easily in an RDBMS, one can go for Neo4J. Hierarchical data may require deep traversals, and an RDBMS definitely does not rock at this.
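For a feel of what a deep traversal means, here is a rough sketch in plain Python. It is not Neo4J's API; the graph dict, the names in it and the reachable_within function are all invented for illustration. A graph database keeps relationships in a form close to this adjacency list, so walking several hops does not involve a join per level.

```python
from collections import deque

# Hypothetical "follows" graph stored as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": ["frank"],
    "erin": [],
    "frank": [],
}


def reachable_within(graph, start, max_depth):
    """Breadth-first traversal: {node: hops} for nodes within max_depth hops."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in graph.get(node, []):
            if neighbour not in depths:
                depths[neighbour] = depths[node] + 1
                queue.append(neighbour)
    del depths[start]  # the start node itself is not a result
    return depths


two_hops = reachable_within(graph, "alice", 2)
print(two_hops)
```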

Document databases are the easiest to use, and MongoDB is a sheer pleasure to work with. It gets up and running very easily and has more features than any other NoSQL database when it comes to querying.

So I will divide this post into some headings.

When to use NoSQL

My answer is: always. There is hardly any application today that does not have unstructured data. Everybody wants to grow, so sooner or later you are likely to generate a large amount of data, be it from social media, your own clickstream capture, web logs or whatever. You want a lot of users to come to your site, the more the merrier, so yes, you will generate a lot of data.

So having polyglot persistence built into the application right from the beginning is going to help you at a later stage.

It’s easier to define the use cases for which NoSQL is not a good fit than to find good use cases (except big data).

When you need strong ACID support (financial information specifically), such as payments or user registration, I would never think about storing the data in a NoSQL database. The risk is just too great.

Some people argue, like one gentleman at the conference, that Amazon uses Dynamo for storing user cart information. Maybe it can be used, but I do not agree with this 100%. The reason is simple. Many NoSQL databases are eventually consistent: because of replication, there is a delay in syncing the data across multiple machines.

So when you run a query, you do not know which copy of the data will be returned, whether it is the latest information or old information. If you use Mongo, there is in theory a chance that the user will see his old cart rather than the latest one, for example when the read is served by a replica that lags behind, and the next time he looks at his cart he might see the latest one. I would not want this, so I consider this use case out of scope for MongoDB.
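The stale cart scenario can be shown with a toy model. This is only a sketch of asynchronous replication in plain Python, not how MongoDB or Dynamo work internally; the ReplicatedCart class and its methods are hypothetical names.

```python
class ReplicatedCart:
    """Toy primary/replica pair with asynchronous replication (illustrative only)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []  # writes accepted by the primary, not yet replicated

    def write(self, user, items):
        self.primary[user] = list(items)
        self.pending.append((user, list(items)))

    def replicate(self):
        """Apply all pending writes to the replica, in order."""
        for user, items in self.pending:
            self.replica[user] = items
        self.pending.clear()

    def read(self, user, from_replica=False):
        store = self.replica if from_replica else self.primary
        return store.get(user, [])


cart = ReplicatedCart()
cart.write("u1", ["book"])
cart.replicate()
cart.write("u1", ["book", "lamp"])  # replica has not seen this write yet

stale = cart.read("u1", from_replica=True)      # old cart
fresh = cart.read("u1")                         # latest cart, from the primary
cart.replicate()
converged = cart.read("u1", from_replica=True)  # replica has caught up
```

The replica does converge once replication runs, which is the "eventual" part. The question is whether the application can tolerate the window in between.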

Some NoSQL databases like Riak provide vector clocks, but they have their own problems.

http://docs.basho.com/riak/latest/references/appendices/concepts/Vector-Clocks/

So one has to be very careful in such scenarios.
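For readers who have not met vector clocks, here is a minimal sketch of the idea in Python. It is not Riak's implementation; the function names and the node names n1 and n2 are assumptions made for this example. Each node keeps a counter per writer, and comparing two clocks tells you whether one version came after the other or whether they were written concurrently.

```python
def increment(clock, node):
    """Return a copy of clock with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify a relative to b: 'equal', 'after', 'before' or 'concurrent'."""
    a_sees_b, b_sees_a = descends(a, b), descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "after"
    if b_sees_a:
        return "before"
    return "concurrent"


def merge(a, b):
    """Combine two clocks by taking the highest counter seen for each node."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


base = increment({}, "n1")     # first write, handled by node n1
left = increment(base, "n1")   # n1 updates the value again
right = increment(base, "n2")  # meanwhile n2 updates the same base version

relation = compare(left, right)  # neither has seen the other's write
merged = merge(left, right)
```

When two versions are concurrent, the clock only detects the conflict. Something still has to decide how to reconcile the two values, and that is where the problems start.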

Take another use case: promotional campaigns. A lot of companies run promotional campaigns, need to store these huge volumes of email and even track their performance, so this is a definite use case for NoSQL. The data is huge, it does not have to be transactional, and if we lose some data due to a node failure we will not lose our jobs.

In the NoSQL world three principles are very prevalent.

Prefer redundancy over normalisation. Disk is cheap and theoretically infinite, and NoSQL databases, thanks to built-in horizontal scalability, have no problem handling the data. So when you have to optimize a query, do not change your schema; instead, store redundant data in a separate table suited to that query only.
Design the schema for your queries. Write down the use cases and design your data storage accordingly. Do not do it the other way round, as in the SQL world.
Design your app for consistency. Relations, rules and data quality are all handled in the application, as NoSQL does not guarantee them. There are no joins, and locks are at row level.
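The principles above can be sketched in a few lines of Python. Plain dicts stand in for two NoSQL tables here; the table names, the record_order function and the sample orders are all invented for illustration. The same order is written twice, once per query, and it is the application that keeps both copies in step.

```python
from collections import defaultdict

# Two "tables" holding the same facts, each shaped for exactly one query.
orders_by_user = defaultdict(list)     # query: "show this user's orders"
orders_by_product = defaultdict(list)  # query: "who bought this product?"


def record_order(order_id, user, product, user_email):
    """Write the order to both views; the application keeps them in sync."""
    # user_email is duplicated on purpose so that neither read needs a join.
    row = {"order_id": order_id, "user": user, "product": product, "email": user_email}
    orders_by_user[user].append(dict(row))        # independent copy per table
    orders_by_product[product].append(dict(row))


record_order(1, "asha", "kettle", "asha@example.com")
record_order(2, "ravi", "kettle", "ravi@example.com")
record_order(3, "asha", "toaster", "asha@example.com")

# Each query is a single key lookup against the table designed for it.
kettle_buyers = [row["email"] for row in orders_by_product["kettle"]]
asha_products = [row["product"] for row in orders_by_user["asha"]]
```

The price is that a change such as a new email address has to be applied to every copy by application code, which is what the third principle is about.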

Let’s look at some of the use cases for each database.

MongoDB

It should be the first choice by default, more so when you do not have much experience with the NoSQL world. It is the closest to an RDBMS, is supported by excellent client drivers, and integrates easily with any stack: PHP, Java, Python, Node.js, you name it.

It can be used as a general-purpose database. It supports secondary indexes and shards easily.

The only problem I find with MongoDB is versioning: I never know which version of the data is going to be returned to me.

Neo4J

Most suitable for social graphs, deep traversals and recommendations. I implemented a subject hierarchy using it and the traversals were damn fast. It provides excellent APIs in Java and supports REST. No other fully open source graph database comes close to this one.

Hypergraph is comparable but falls short of Neo4J in its APIs, especially for traversals.

Column Oriented

Cassandra and HBase

Both are column based, with minor differences here and there. Cassandra was developed by Facebook and later became an Apache Incubator project. HBase sits on top of Hadoop.

I see only one use case for them: when you have lots of data, hundreds of TB to PB, and you just want to do MapReduce, though Cassandra also provides CQL. You will know when you have that much data.

Key Value Pair

e.g. Memcache/Redis. They are damn good at what they do. Primarily used in the caching layer, they can serve data to your clients insanely fast. They can shard across hundreds of servers easily, and Redis even provides many useful built-in data structures.
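How a caching layer is typically used can be sketched like this. The TTLCache class below is a tiny in-process stand-in for a key-value store, not the Redis or Memcache client API, and slow_lookup is a hypothetical placeholder for a real database query.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a key-value cache with expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self.data[key]  # expired entries are dropped on read
            return None
        return value

    def set(self, key, value):
        self.data[key] = (self.clock() + self.ttl, value)


db_calls = []  # records every trip to the (pretend) database


def slow_lookup(user_id):
    """Hypothetical stand-in for an expensive database query."""
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(cache, user_id):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = slow_lookup(user_id)
    cache.set(key, value)
    return value


cache = TTLCache(ttl_seconds=60)
first = get_user(cache, 42)   # miss: goes to the database
second = get_user(cache, 42)  # hit: served from the cache
```

Only the first read reaches the database. Every later read within the expiry window is a plain key lookup, which is why this layer can be so fast.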

So here, in short, I have shared my experience with NoSQL.

Comments are welcome.
