Delicious, delicious graph databases.

Most engineers are pretty familiar with the traditional relational database model. You have a few tables, and you use joins (often through a join table) to handle many-to-many relationships.

Recently, I was introduced to graph databases, specifically Neo4j.

Graph databases are great when you have a lot of unstructured data that can be related in many different ways, which is why they’re commonly used in social media (Facebook has its own in-house graph). On my most recent project, I worked with a team to build a recipe recommendation app. We decided to use a graph database because of the number of relationships we were working with.

Let’s look at a couple of recipes as an example:

Scrambled Eggs:
serves: 4
takes 10 minutes to make
Eggs, Butter, Salt, Pepper

Butter Twists:
serves: 8
takes 3 hours to make
Butter, Sugar, Vanilla extract, Eggs, Salt, All-purpose flour

A graph database has two primitive types, nodes and relationships. The things to remember are:

  • Nodes have relationships with each other.
  • Nodes can have properties and/or types.
  • Relationships can also have properties.
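To make those three points concrete, here is a minimal sketch of the two primitives in plain Python. It is not Neo4j's API: the class names `Node` and `Relationship`, the `CONTAINS` relationship type, and the butter amount are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # the node's type, e.g. "Recipe" or "Ingredient"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    rel_type: str  # hypothetical relationship name
    end: Node
    properties: dict = field(default_factory=dict)


# Recipe values come from the Scrambled Eggs example above.
scrambled_eggs = Node("Recipe", {"name": "Scrambled Eggs", "serves": 4, "minutes": 10})
butter = Node("Ingredient", {"name": "Butter"})

# The amount is a made-up placeholder; the simplified recipe lists no amounts.
uses_butter = Relationship(scrambled_eggs, "CONTAINS", butter, {"amount": "1 tbsp"})

print(uses_butter.start.properties["name"], "->", uses_butter.end.properties["name"])
```

The point is only that a relationship is a first-class thing that connects two nodes and can carry its own properties.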

If you look at the (simplified) recipes above, a lot of different things go into them. At a base level there are:

  • Ingredients
  • Amounts of ingredients
  • Time to make
  • Yield (or number served)
  • Recipe Names
  • Instructions

This doesn’t even begin to go into other important details like dietary restrictions, seasonality, flavor, cuisine, or course.

The thing we grappled with at the beginning was – how do we represent a recipe, its qualities, and its ingredients in a graph? Are ingredients nodes with relationships to recipes, or are they properties? What about more specific factors like dietary restrictions? We looked at three different options:

Option 1: Recipes are the only nodes, everything is an attribute of the recipe.
This was the simplest but least efficient option, especially given the goals of our app. We needed to be able to query the DB by ingredient, and we hoped to eventually query by dietary restriction as well. It looks something like this:

[Figure: only recipes are nodes]
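A rough sketch of what option 1 implies, using plain Python dictionaries as stand-ins for recipe nodes (the field names and the `recipes_with` helper are assumptions for illustration). Because ingredients are buried inside each recipe, answering "which recipes use this ingredient?" means looking inside every recipe:

```python
# Option 1: each recipe is a single node and everything else is an attribute.
recipes = [
    {"name": "Scrambled Eggs", "serves": 4, "minutes": 10,
     "ingredients": ["Eggs", "Butter", "Salt", "Pepper"]},
    {"name": "Butter Twists", "serves": 8, "minutes": 180,
     "ingredients": ["Butter", "Sugar", "Vanilla extract", "Eggs", "Salt",
                     "All-purpose flour"]},
]


def recipes_with(ingredient):
    """Return the names of recipes using `ingredient` by checking every recipe."""
    return [r["name"] for r in recipes if ingredient in r["ingredients"]]


print(recipes_with("Pepper"))
```

With two recipes this is trivial, but there is nothing to traverse: the ingredient is not a thing in the graph, just a value on each recipe.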

Option 2: Everything is a node and we represent the qualities of the recipe as a relationship.

There isn’t a good technical reason for us not to do this, but there was also no specific reason to do it. We decided that this wasn’t a great representation because every recipe (or at least the vast majority of them) has a “time to make” and a “yield”, which are formatted in a similar way and can each be expressed as a single property. Ingredients, on the other hand, vary in number depending on the recipe (scrambled eggs have 4, butter twists have 6). The relationship itself also has a quality: there is a specific amount of eggs and butter that goes into a recipe for scrambled eggs. If we had gone for option 2, it would look something like this:

[Figure: everything's a node!]

Option 3: Somewhere in-between.

This was the option we decided to go with, for many of the reasons listed under option 2. We decided to apply a rule: if the relationship between two nodes would have a unique property, or there wasn’t a consistent format for how the property could be reflected on a node (for example, a recipe has a certain amount of an ingredient, and each recipe has a different number of ingredients), we would create two nodes and a relationship between them to show the connection. So we created a system like the one below for recipes to ingredients. (Note: to keep these from being a jumbled mess of arrows, I separated the ingredient-to-ingredient relationships from the recipe-to-ingredient relationships, but think of the ingredient nodes as being the same nodes with multiple relationships.)

[Figure: Neo4j graph of recipe-to-ingredient relationships]

And this for ingredients (the ingredient relationships also have properties that indicate the number of times each ingredient co-occurs with the ingredient it is connected to):

[Figure: Neo4j graph visualization of ingredient-to-ingredient relationships]
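The recipe-to-ingredient half of option 3 can be sketched in plain Python as follows. This is an illustration of the rule, not our Neo4j schema: the dictionary layout and variable names are assumptions, and the amounts are left as `None` because the simplified recipes above don't list them.

```python
from collections import defaultdict

# Recipe nodes keep the uniformly shaped facts (yield, time) as properties.
recipes = {
    "Scrambled Eggs": {"serves": 4, "minutes": 10},
    "Butter Twists": {"serves": 8, "minutes": 180},
}

ingredient_lists = {
    "Scrambled Eggs": ["Eggs", "Butter", "Salt", "Pepper"],
    "Butter Twists": ["Butter", "Sugar", "Vanilla extract", "Eggs", "Salt",
                      "All-purpose flour"],
}

# One relationship per (recipe, ingredient) pair; the amount lives on the
# relationship. None is a placeholder, not real data.
relationships = [
    {"recipe": recipe, "ingredient": ingredient, "amount": None}
    for recipe, items in ingredient_lists.items()
    for ingredient in items
]

# Ingredients are shared nodes, so we can walk from an ingredient to its recipes.
recipes_by_ingredient = defaultdict(list)
for rel in relationships:
    recipes_by_ingredient[rel["ingredient"]].append(rel["recipe"])

print(recipes_by_ingredient["Eggs"])
```

Here "Eggs" exists once and both recipes point at it, which is what makes querying by ingredient a traversal instead of a scan.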

Using this structure allowed us to create relationships between ingredients, so we could calculate how often they co-occur, which would be a critical component of how we recommended recipes.
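Here is a minimal sketch of the counting idea, in plain Python rather than Neo4j, using the two example recipes. The `co_occurrence` name and the choice to key each count by an alphabetically sorted pair are assumptions for illustration:

```python
from collections import Counter
from itertools import combinations

ingredient_lists = {
    "Scrambled Eggs": ["Eggs", "Butter", "Salt", "Pepper"],
    "Butter Twists": ["Butter", "Sugar", "Vanilla extract", "Eggs", "Salt",
                      "All-purpose flour"],
}

co_occurrence = Counter()
for items in ingredient_lists.values():
    # Sort first so each unordered pair is always counted under the same key.
    for pair in combinations(sorted(items), 2):
        co_occurrence[pair] += 1

print(co_occurrence[("Butter", "Eggs")])   # 2: both recipes use butter and eggs
print(co_occurrence[("Eggs", "Pepper")])   # 1: only scrambled eggs
```

Each count would become a property on the relationship between the two ingredient nodes.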

Predicting the NCAA Tournament Results

There are a ton of people who are more knowledgeable about college basketball and statistics than I am who have tried to predict the outcome of the NCAA tournament and weren’t that successful.

I’m going to try anyway. At the very least, I’ll get practice with Node, Firebase, and writing clean, readable code. If, by some chance, I get a formula that works, I might be able to get $1 billion out of it.

This was sparked, primarily, by me being sick of how unpredictable my Tar Heels have been this year. Surely there is something that’s “causing” them to lose to borderline-division-2 schools, and beat three top-ten schools all in the span of a month?

So, DATA. So far in this project I’m using:
Firebase – cloud-based DB and app hosting (I might switch to a Neo4j graph database later if it turns out there are too many joins to want to deal with)
Node.js – server-side JavaScript platform
CasperJS – ‘dom-liberator’ (but also used for testing)

Step one: Set up Node server and Firebase DB.
Firebase makes it really easy to get your DB set up in a Node server. Literally two lines:

Step two: Get data from a certain sports property’s API.

This sports property makes some not-very-useful information available through its free API. What it does offer for free is a URL to each player’s page, which can be easily parsed to get their game logs. This took a simple GET request, after which I saved the URLs with player names directly to my DB.

That’s where I am so far. Next, I’ll be using Casper to get data from the game logs I saved earlier.

Hi

My name is Kamla, which you probably found out if you came from my website. This is my newest blog, where I’ll write about coding, code, what it’s like to be a marketer-turned-software developer, and other musings about the Bay Area and cooking (really more cooking experiments).