Graph databases

In today’s ever-evolving world of distributed computing, we want answers to complex questions as quickly as possible, even when they span huge volumes of data.

Consider the following questions:

  • How many products offered by each supplier were ordered last Friday between 10 am and 12 pm?
  • How many products offered by each supplier were returned during the last quarter?

Traditional RDBMS systems have to deal with complex joins, and scaling them to manage huge volumes of data in a fault-tolerant way is very hard to achieve. Their fixed database schemas also make it difficult to accommodate ever-changing business needs. Available NoSQL databases, on the other hand, solve the scalability issue and offer schema flexibility, but you have to manage joins at the application level. Moreover, you have to manage the trade-offs between consistency (C), availability (A) and partition tolerance (P), and the result may not fulfil the transactional requirements of your business.

Graph databases let you treat relationships as being just as important as entities: entities are represented as nodes (vertices) and relationships as edges, each with its own set of properties. A graph database does not impose the same set of properties on every node or edge, so you have a great deal of flexibility in modeling your data. Because your data is stored as an arbitrarily connected graph, your logical data model becomes your physical data model.
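To make the node, edge and property idea concrete, here is a minimal in-memory sketch in plain Python. It is not how any particular graph database stores data. All identifiers, labels (OFFERS, CONTAINS) and values are hypothetical, chosen to mirror the supplier questions above.

```python
from collections import defaultdict

# Nodes: id -> (label, properties). Property sets may differ per node.
nodes = {
    "s1": ("Supplier", {"name": "Acme"}),
    "s2": ("Supplier", {"name": "Globex", "country": "DE"}),
    "p1": ("Product", {"title": "Widget"}),
    "p2": ("Product", {"title": "Gadget", "weight_kg": 1.2}),
    "p3": ("Product", {"title": "Gizmo"}),
    "o1": ("Order", {"channel": "web"}),
    "o2": ("Order", {}),
}

# Edges: (source id, label, target id, properties). Edges carry properties too.
edges = [
    ("s1", "OFFERS", "p1", {}),
    ("s1", "OFFERS", "p2", {}),
    ("s2", "OFFERS", "p3", {}),
    ("o1", "CONTAINS", "p1", {"quantity": 2}),
    ("o1", "CONTAINS", "p3", {"quantity": 1}),
    ("o2", "CONTAINS", "p2", {"quantity": 5}),
]


def ordered_quantity_per_supplier(nodes, edges):
    """Follow Order -CONTAINS-> Product <-OFFERS- Supplier and sum quantities."""
    # Map each product to the supplier that offers it.
    supplier_of = {dst: src for src, label, dst, _ in edges if label == "OFFERS"}
    totals = defaultdict(int)
    for _, label, product, props in edges:
        if label == "CONTAINS":
            supplier_name = nodes[supplier_of[product]][1]["name"]
            totals[supplier_name] += props["quantity"]
    return dict(totals)


result = ordered_quantity_per_supplier(nodes, edges)
```

The per-supplier question is answered here by walking edges rather than joining tables. A real graph database would typically add indexes and a query language on top of the same model.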

You can traverse such a graph to answer complex queries efficiently, even if it has hundreds of billions of vertices and edges. Graph databases are great at analysing relationships, as in social graphs or proximity analysis. Traversals that follow relationships, such as finding friends of friends or counting the hops needed to reach a specific node, can be far more performant than the equivalent multi-way joins in an RDBMS. Query latency then depends mainly on the number of steps the traversal takes, not on the total size of the graph.
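A friend-of-friend lookup and a hop count can both be sketched as a breadth-first traversal. The example below is a minimal illustration over a hypothetical adjacency list held in memory. It is not the traversal engine of any specific graph database, and the names in it are made up.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def hops_from(graph, start):
    """Breadth-first search: minimum number of hops from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def friends_of_friends(graph, person):
    """People exactly two hops away: friends of friends who are not direct friends."""
    return sorted(n for n, d in hops_from(graph, person).items() if d == 2)


hops = hops_from(friends, "alice")
fof = friends_of_friends(friends, "alice")
```

The work done grows with the number of nodes and edges the traversal actually visits, which is why a bounded-depth query such as friend-of-friend can stay cheap on a very large graph.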

Interestingly, distributed graph computing frameworks such as Titan rely on an external persistent store such as Cassandra, HBase or DynamoDB. Titan can efficiently manage hundreds of concurrent graph updates or traversals and helps you get the best of both SQL and NoSQL. Moreover, you can integrate it with an indexing engine such as Elasticsearch or Solr to support range-based queries. So, if you already have Hadoop infrastructure in place, you can run Titan on top of it and get all the benefits of graph algorithms. It provides an open data model that can evolve as your application grows.

Some use cases:

  • Supply chain management needs to track retail outlets, airports and warehouses, and the relationships between them
  • Product recommendation engines need to track customers, the products they purchased, product features and the choices made by each customer’s social circle
  • Calculating PageRank
  • Finding shortest paths
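As an illustration of the PageRank use case, here is a minimal power-iteration sketch in plain Python. The link graph, the damping factor of 0.85 and the fixed iteration count are assumptions made for the example. Dangling nodes (those without outgoing links) spread their rank evenly, which is one common convention.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simple PageRank by power iteration over a dict of node -> outgoing links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank equally among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "c" is linked to by "a", "b" and "d".
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this toy graph the node with the most incoming links, "c", ends up with the highest rank, and the ranks sum to 1.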

Apart from that, depending on your business needs, you may choose to partition your data between a graph database and either a columnar or a relational database. Used this way, graph databases can bring flexibility, scalability, agility and reliability to your big data solutions.
