Most engineers are pretty familiar with the traditional relational database model. You have a few tables, and you use joins (often through a join table) to model many-to-many relationships.
Recently, I was introduced to graph databases, specifically Neo4j.
Graph databases are great when you have a lot of unstructured data that can be related in many different ways, which is why they're commonly used in social media (Facebook has its own in-house graph). On my most recent project, I worked with a team to build a recipe recommendation app. We decided to use a graph database because of the number of relationships we were working with.
Let’s look at a couple of recipes as an example:
Scrambled eggs
- Takes 10 minutes to make
- Ingredients: Eggs, Butter, Salt, Pepper
Butter twists
- Takes 3 hours to make
- Ingredients: Butter, Sugar, Vanilla extract, Eggs, Salt, All-purpose flour
A graph database has two primitive types: nodes and relationships. The things to remember are:
- Nodes have relationships with each other.
- Nodes can have properties and/or types (Neo4j calls these labels).
- Relationships can also have properties.
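To make those three points concrete, here is a minimal sketch in plain Python (no Neo4j driver involved). The class names, the `CONTAINS` relationship type, and the amount value are all assumptions made up for illustration; they are not taken from our actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                       # the node's types, e.g. {"Recipe"}
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    type: str                         # hypothetical relationship type name
    end: Node
    properties: dict = field(default_factory=dict)


eggs = Node({"Ingredient"}, {"name": "Eggs"})
scrambled = Node({"Recipe"}, {"name": "Scrambled eggs", "minutes": 10})

# The amount is a placeholder value, not a real measurement from a recipe.
contains = Relationship(scrambled, "CONTAINS", eggs, {"amount": "a few"})

print(contains.start.properties["name"], "->", contains.end.properties["name"])
```

The only point here is the shape: both nodes and relationships can carry their own dictionary of properties.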
If you look at the (simplified) recipes above, there are a lot of different things that go into them. At a base level there are:
- Amounts of ingredients
- Time to make
- Yield (or number served)
- Recipe Names
This doesn’t even begin to go into other important details like dietary restrictions, seasonality, flavor, cuisine, or course.
The question we grappled with at the beginning was: how do we represent a recipe, its qualities, and its ingredients in a graph? Are ingredients nodes with relationships to recipes, or are they properties? What about more specific factors like dietary restrictions? We looked at three different options:
Option 1: Recipes are the only nodes; everything else is an attribute of the recipe.
This was the simplest but least efficient option, especially given the goals of our app. We needed to be able to query the DB by ingredient, and we aspired to query it by dietary restriction as well. It looks something like this:
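In code terms, Option 1 is roughly the sketch below: each recipe is one record and its ingredients are just another attribute. The field names and the `recipes_with` helper are hypothetical; the recipe data comes from the two examples above.

```python
# Option 1 sketch: recipes are the only "nodes"; ingredients are an attribute.
recipes = [
    {
        "name": "Scrambled eggs",
        "minutes": 10,
        "ingredients": ["Eggs", "Butter", "Salt", "Pepper"],
    },
    {
        "name": "Butter twists",
        "minutes": 180,
        "ingredients": ["Butter", "Sugar", "Vanilla extract", "Eggs",
                        "Salt", "All-purpose flour"],
    },
]


def recipes_with(ingredient):
    # There is no ingredient node to start from, so every recipe is inspected.
    return [r["name"] for r in recipes if ingredient in r["ingredients"]]


print(recipes_with("Butter"))  # ['Scrambled eggs', 'Butter twists']
```

Querying by ingredient means looking inside every recipe, which is the inefficiency that pushed us away from this option.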
Option 2: Everything is a node and we represent the qualities of the recipe as a relationship.
There was no good technical reason for us not to do this, but there was also no specific reason to do it. We decided that this wasn't a great representation because every recipe (or at least the vast majority of them) has a "time to make" and a "yield", which are formatted in a similar way and can each be expressed as a single property. Ingredient lists, on the other hand, vary in length depending on the recipe (scrambled eggs have 4, butter twists have 7). The relationship itself also carries a quality: a specific amount of eggs and butter goes into a recipe for scrambled eggs. If we had gone for Option 2, it would look something like this:
Option 3: Somewhere in between.
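As a rough sketch of Option 2, everything (even a duration) becomes its own node, and the graph is just a list of relationships. The relationship type names `TAKES` and `CONTAINS` and the `neighbors` helper are invented for this example, and only part of the butter twists recipe is listed.

```python
# Option 2 sketch: every value, even "10 minutes", is its own node.
# Each triple is (start node, relationship type, end node).
edges = [
    ("Scrambled eggs", "TAKES", "10 minutes"),
    ("Scrambled eggs", "CONTAINS", "Eggs"),
    ("Scrambled eggs", "CONTAINS", "Butter"),
    ("Scrambled eggs", "CONTAINS", "Salt"),
    ("Scrambled eggs", "CONTAINS", "Pepper"),
    ("Butter twists", "TAKES", "3 hours"),
    ("Butter twists", "CONTAINS", "Butter"),
]


def neighbors(node, rel_type):
    """Follow relationships of one type out of a node."""
    return [end for start, rel, end in edges
            if start == node and rel == rel_type]


print(neighbors("Scrambled eggs", "TAKES"))  # ['10 minutes']
```

Reading the time to make now takes a traversal, even though every recipe has exactly one such value.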
This was the option we decided to go with, for many of the reasons listed under Option 2. We decided to apply a rule: if the relationship between two things would have a property of its own, or there wasn't a consistent format for how the property could be reflected on a node (for example, a recipe has a certain amount of an ingredient, and each recipe has a different number of ingredients), we would create two nodes and a relationship between them to show the connection. So we created a system like the one below for recipes to ingredients. (Note: to keep these from being a jumbled mess of arrows, I separated the ingredient-to-ingredient relationships from the recipe-to-ingredient relationships, but think of the ingredient nodes as being the same nodes with multiple relationships.)
And a system like this for ingredient-to-ingredient relationships (these relationships also have a property that indicates the number of times the two connected ingredients co-occur):
Using this structure allowed us to create relationships between ingredients, so we could calculate how often they co-occur, which would be a critical component of how we recommended recipes.
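Applied to the two example recipes, the rule might be sketched as follows. This is plain Python rather than our actual Neo4j schema; the `CONTAINS` name, the helper functions, and every amount are invented placeholders, and only a few of the relationships are listed.

```python
# Option 3 sketch: uniform, single-valued qualities stay on the recipe node.
recipes = {
    "Scrambled eggs": {"minutes": 10},
    "Butter twists": {"minutes": 180},
}

# CONTAINS relationships keyed by (recipe, ingredient). Each relationship has
# its own "amount" property. The amounts are made-up placeholders.
contains = {
    ("Scrambled eggs", "Eggs"): {"amount": "4"},
    ("Scrambled eggs", "Butter"): {"amount": "1 tbsp"},
    ("Butter twists", "Butter"): {"amount": "1 cup"},
    ("Butter twists", "Eggs"): {"amount": "2"},
}


def recipes_using(ingredient):
    """Recipes connected to an ingredient node."""
    return sorted(r for (r, i) in contains if i == ingredient)


def amount_of(recipe, ingredient):
    """Read the property stored on the relationship itself."""
    return contains[(recipe, ingredient)]["amount"]


print(recipes_using("Eggs"))                   # ['Butter twists', 'Scrambled eggs']
print(amount_of("Scrambled eggs", "Butter"))   # 1 tbsp
```

The amount lives on the relationship because it belongs to neither the recipe nor the ingredient alone, while the time to make stays a plain property of the recipe.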
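One way to compute those co-occurrence counts is sketched below, using the two example recipes. This is an illustration of the idea, not the code we ran against Neo4j; the variable names are hypothetical.

```python
from collections import Counter
from itertools import combinations

recipe_ingredients = {
    "Scrambled eggs": ["Eggs", "Butter", "Salt", "Pepper"],
    "Butter twists": ["Butter", "Sugar", "Vanilla extract", "Eggs",
                      "Salt", "All-purpose flour"],
}

# Count how many recipes each unordered pair of ingredients shares.
# Sorting first gives every pair one canonical (alphabetical) key.
co_occurrence = Counter()
for items in recipe_ingredients.values():
    for a, b in combinations(sorted(items), 2):
        co_occurrence[(a, b)] += 1

print(co_occurrence[("Butter", "Eggs")])   # 2: both recipes use both
print(co_occurrence[("Eggs", "Pepper")])   # 1: only scrambled eggs
```

Each count would then be stored as a property on the relationship between the two ingredient nodes.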