Paulo Ortins Paulo Ortins
Menu
Search
NoSQL Databases, why we should use, and which one we should choose

Introduction

In the last years, relational databases have been the only option when we talk about data persistence. Our unique choice have been which database we should use. Should we use a SQL Server? Should we use a MySql? Oracle? Even in these cases, some choices come by default. E.g. if we are using .NET, we almost always work with Sql Servers, if we are using Java we almost always use Oracle, ruby-mysql, python-mysql/postgre and so on.

The reason is obvious, relational databases are in the field for decades, they proved to be robust for most of the applications. We can rely on them to take care of concurrency, transactions and so on. But if relational databases are reliable as I’m saying why they are losing market to NoSQL databases? Relational Databases have some problems that NoSQL Databases are resolving.



Problems with Relational Databases



Impedance Mismatch

We use to write software using Python, Ruby, Java, .NET. What they have in common? They are object-oriented languages. But we persist the data using MySQL, Postgre, Oracle and SQL Server. What they have in common? They are relational databases. Can you spot the difference? Impedance Mismatch is the name we gave to this difference. Our memory structures are object-oriented and our databases are relational, every time we need to save or retrieve data we need to make a conversion. ORM (Object Relational Mapping) Frameworkds, like Hibernate, Entity Framework, make easier to map objects and relational databases but it’s still a issue, principally when we need high performance queries.



Applications are getting bigger

Web applications are increasing in scale. We have to store more data, we have to serve more users and we need more computing capability. To handle this scenario we have to scale. We can scale in two ways. We can scale up, that is buying better machines, more disk, more memory and so on. Or we can scale out, that is buy a lot of small machines and use them in a cluster. In big applications scale up is not an option. Bigger machines are more expensive and they have a limit, we don’t have a machine that can handle the traffic from Google or Facebook. Given this context, we need new databases, since relational database are not designed to run on clusters. Yes, you have clustered relational databases, but they work sharing a disk, that isn’t the scenario we want to have when we’re building a cluster. Some of the companies who needs to handle a lot of traffic like Google, Facebook, Amazon started to develop databases that are designed to run on clusters and this was the beginning of NoSQL era.



NoSQL Era

Nowadays, there are a lot of NoSQL databases, MongoDB, Redis, Riak, HBase, Cassandra and so on. And each one has at least one of these characteristics.

  • NoSQL databases don’t use SQL, some of the has query languages like MongoDB and Cassandra
  • Usually they are open-source projects
  • They we’re built to run on clusters
  • Schemaless, you don’t have rigid schema defining the data structure



Types of NoSQL

NoSQL databases can be divided in 4 types. Key-value, Document-Oriented, Column-Family Databases and Graph-Oriented Databases. Let’s see what are each one of these types, his characteristics and where we should be using them.



Key-Value Databases

What are: A key-value store works like a simple hashtable that we are used to use in traditional languages. You can add, retrieve and delete data through keys. Since they use primary key access they tend to have a good performance and are easily scalable.

Examples: Riak, Redis, Memcached, Amazon’s Dynamo, Project Voldemort

Who’s using: GitHub (Riak), BestBuy (Riak), Twitter (Redis and Memcached), StackOverFlow (Redis), Instagram (Redis), Youtube (Memcached), Wikipedia (Memcached).

When we should use:

  • To store user information, like Session, Profiles, Preferences, Shopping Cart and so on. These info are often associated to a id(key). This case is exactly the best scenario to use a key-value database.

When we shouldn’t use:

  • If we need to query the data by value instead by keys. There is no way to query a key-value database by value.
  • If we need to save relationship between data. We can’t relate data between two or more keys in a key-value database.
  • If we need transactions. In a key-value database, we can’t roll back a operation if a failure occurs.



Document-Oriented Databases

What are: Document-Oriented databases store data as documents. Documents can be defined as a set of maps, collections and scalar values. Documents are like rows, but unlike rows that have to have the same schema, documents can be totally different between themselves. These documents can be stored using XML, JSON or JSONB.

Examples: MongoDB, CouchDB, RavenDB

Who’s using: SAP (MongoDB), Codecademy (MongoDB), Foursquare (MongoDB), NBC News (RavenDB)

When we should use:

  • Logging. In a enterprise environment, each application has different logging info. Document-oriented databases don’t have a fixed schema. So we can use them to store all these different info.
  • Analytics. Since they are schemaless, we can store different metrics and new metrics can be added without schema changes.

When we shouldn’t use:

  • If we need to have transactions between documents. Document-oriented databases don’t support transaction between documents, if we need it, we shouldn’t use a document database.



Column-Family Databases

What are: Column-Family databases store data in column families. A column family can be defined as groups of related data that are often queried together. Let me give a example. When we have a Person class we often access their name and age together but not his salary. In this case, name and age belong to one column-family and salary belongs to another one.

Examples: Cassandra, HBase

Who’s using: Ebay (Cassandra), Instagram (Cassandra), NASA (Cassandra), Twitter (Cassandra and HBase), Facebook (HBase), Yahoo!(HBase)

When we should use:

  • Logging. Since we can store data with different columns, each application can write their info with their own column families.
  • Blogging Platforms. We can store each info in different column families. For example, tags in one family, categories in another one, posts in another one and so on.

When we shouldn’t use:

  • If we need ACID transactions. Cassandra doesn’t support transactions.
  • Prototyping. If we analyze the Cassandra data structure, we can see that this structure is based in the pattern we expect to retrieve the data. When we are designing a prototype, we can’t predict how will be the query pattern and once it changes we will have to change the column families design.



Graph-Oriented Databases

What are: Graph databases allow us to store data as graphs. Entities can be represented as vertices and the relationships between these entities can be represented as edges. In a example, we could have 3 entities. Steve Jobs, Apple and Next. And two edges called “Founded by” that relate Apple to Steve Jobs and Next to Steve Jobs.

Examples: Neo4J, Infinite Graph, OrientDB

Who’s using: Adobe (Neo4J), Cisco (Neo4J), T-Mobile (Neo4J)

When we should use:

  • Connected Data. If we have data that are connected through relationship, we have a good case to use a graph database, the vertices can be people, cities, companies and edges can be “lives in”, “employed by” and so on.
  • Recommendation Engines. If we represent data in graph databases, they can be used to make recommendations like “people who bought this item also bought these items” like Amazon and Netflix.

When we shouldn’t use:

  • Data model not suitable. Most of the cases are not suitable for graph databases since operations involving the whole graph are not trivial.

arrowNo Responses Yet

Leave A Comment