How graph databases work
Relationships
Typical relational databases map relations indirectly by means of joins, whereas graph databases map the same relations directly. One example of the use of a graph database involves the so-called Panama Papers. Tax fraud, money laundering, breaches of sanctions, and a number of other criminal offenses were levied against clients of the Panamanian offshore service provider Mossack Fonseca. In 2016, journalists associated with the International Consortium of Investigative Journalists (ICIJ) from 107 newspapers, television stations, and online media in 80 countries uncovered the crimes by jointly analyzing some 11.5 million leaked email messages, database entries, letters, fax documents, deeds of incorporation, loan agreements, bills, and bank statements with the help of a graph database.
Relations
Alice likes Bob who follows Carol and Dave on Twitter. Streetcar line 15 runs from Max-Weber-Platz via Giesing to Gr¸nwald. Mr. Huber is the superior of Mr. Meyer and Mrs. Schulze. Theresa liked the video that Mike recommended. Vehicle is the generic term for car, motorcycle, and truck. Modern man is related to early hominids and other species of great apes.
All these statements have one thing in common: They are about relations – about connections between people or things for which you could intuitively sketch something tree-like on a piece of paper. Such relationships are becoming more important in a world that is becoming increasingly networked, where everything is connected with everyone.
It is certainly no coincidence that relations are at the heart of the business ideas of corporations that are now among the most valuable and powerful in the world – Facebook and Google, for example. In the first case, it's about relationships between people who know each other; in the second, it's about relationships between websites that link to each other. The often-quoted page ranking is the measure of these connections. Networking is the common denominator between these models, and relations are its underpinnings. This is also true for countless other applications where relations play the main role, be it in purchase recommendations, dating portals, fraudulent transaction detection, genetics, logistics, and so on and so forth.
Of all data structures, one is particularly well suited to representing relationships – the graph. Graph databases, in turn, can store graphs, expand or shrink them, update and rearrange them, and answer questions about their content. Other developments have certainly attracted more public attention, but, out of the spotlight somewhat, the trend curve of the "graph database" search term on Google has been rising steadily for the past 10 years.
In parallel, a great variety of graph databases have emerged, including the originals like Neo4j [1], distributed databases like Titan [2], free databases like JanusGraph [3], proprietary databases like Amazon Neptune [4], and established relational or NoSQL databases that have added the ability to handle graphs. An example of the latter is the DataStax Enterprise Graph [5] database, which combines the open source Cassandra [6] database with the also free Apache TinkerPop [7] graph computing framework to create a successful product.
Today, graph databases are in production use in almost all industries in leading global corporations, including – to quote just a few names from customer lists – NASA, Walmart, Siemens, Google, Samsung, eBay, IBM, Airbnb, Novartis, Volvo, and many others. If you want to look into graph databases, a short crash course in graph theory will certainly do no harm (see the "Graph Theory" box).
Graph Theory
This branch of mathematics was founded almost 300 years ago by the Swiss mathematician Leonhard Euler. He found a solution to the Königsberg (now Kaliningrad) bridge problem, which considered the question as to whether one can tour the city by crossing each of the seven bridges over the Pregel River exactly once. Euler proved that this was impossible. The potential routes through the city form a graph – even if Euler did not call it that at that time (Figure 1).
Generally speaking, a graph is a structure of vertices and edges that connect nodes. Applied to information processing, the vertices always represent an element (person, thing, place, date, organization, category, etc.), and the edges represent the relationships between two vertices. The edges can be directional, in which case they are referred to as arcs, and the graph with these directed edges is called a digraph. The edges also can have assigned attributes (e.g., path lengths or costs), in which case it is referred to as a weighted graph. The same graph can often be represented in different ways, and the equivalent representations are referred to as being isomorphic (Figure 2).
If a representation has edges that only cross the nodes, the graph is planar. An application in architecture states that if the paths between rooms form a planar graph, the rooms can be on one floor; otherwise, they would have to be spread over several floors, and you would need stairs.
The number of edges that meet in a node is known as the degree of the node. Euler found that you can find a way through all nodes in which no edge repeats (an Euler circuit) if, and only if, the degree of each node is an even number, which is not the case in the Königsberg bridge graph. The mean value of all degrees of the nodes of a graph is the branching factor. If all legal moves that occur in the course of a chess game are shown as a graph, the branching factor is between 31 and 35. The chess player can therefore choose between, on average, 31 and 35 possible rule-based moves at any point in the game. The branching factor in Go is 250, which is why, for example, it took much longer for computers to outperform humans in Go.
A common optimization problem is to modify a graph with the least amount of intervention possible in such a way as to create an Euler circuit (Eulerization), wherein the degree of each node must be even. A circuit that visits every vertex exactly once and starts and ends at the same vertex is known as a Hamiltonian circuit, whereas a Hamiltonian path visits every vertex exactly once, but it does not have to start and end at the same place. Finding a Hamiltonian path is relevant for all pick-up and delivery routes, for example.
In short, a finite sequence of the type node-edge-node-… is a path. If the starting point is equal to the end point, the path is closed; otherwise, it is open. A closed path with different vertices and edges is known as a circuit. A simple graph without circuits is a tree. Visiting each vertex and edge in a graph is a traversal; a tree traversal is a special case of the graph traversal.
A sample application of the paths concept can be found in project management, whose processes can be represented as a directed graph, in which the edges are usually weighted with a time duration. A route that contains the longest chain of interdependent steps determines the minimum project duration and is known as the critical path; the corresponding scheduling method is the critical path method (CPM). Any delays on this path have a direct effect on the overall project duration.
RDF and LPG
Graphs can represent many situations: computer networks, subway maps, organizational charts, flowcharts, molecular structures, decision paths, circuit paths, social networks, family trees, knowledge content, supply chains, and much more. Computer science uses several coding models, including the resource description framework (RDF) model [8] and the labeled property graph (LPG) model introduced by the Neo4j database.
The RDF model comes from the semantic web environment, which aims to enrich data with contextual information. RDF describes graphs as triples with the form subject-predicate-object. The first part of each triple is the subject node, followed by the predicate, which represents the edge and describes the nature of the relationship in more detail. At the end is the object, which can be a simple literal value. An example that could show a simplified section of a relationship network on Twitter might look like this:
Hans – follows – Werner
Werner – follows – Hans
Thomas – follows – Hans
In reality, however, the subject, predicate, and object in the RDF model are unique Unified Resource Identifiers (URIs), which are needed to mix information from different sources. (The universally known URLs are a special form of URI for the designation of Internet pages). The elements of an RDF triple, the URIs, are plain references – or simple literals in the case of some objects – that have no internal structure. One advantage is their uniqueness.
A bibliographic reference to the website http://www.example.org/index.html stating the author, date of creation, and language could look like Listing 1 in the RDF model. In the example, the URI http://purl.org/dc… conforms to the Dublin Core metadata standard.
Listing 1
RDF Triple Examples
<http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/creator> <http://www.example.org/staffid/94640> . <http://www.example.org/index.html> <http://www.example.org/terms/creation-date> "May 10, 2019" . <http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/language> "de" .
LPG is slightly different, in that the nodes and edges have an internal structure. Each node and edge is its own small key-value store that stores its attributes. Different types of nodes or edges can contain a different number of these key-value pairs. Because the attributes are not extra nodes, unlike the RDF model, the representation is more compact. Additionally, individual properties of relationships and multiple relationships of the same type between subject and object can be more easily mapped in LPG format. However, the meaning of the freely definable key-value pairs attached to the nodes and edges is not defined uniquely in the same way as in the RDF model with URIs. A sample of a schema for storing Twitter tweets in LPG would look something like Figure 3.
Graph Databases
Some graph databases use the RDF model and others use the LPG model. Additionally, databases can be distinguished according to whether they are native graph databases that use a storage engine designed specifically for graphs or non-native graph databases that serialize the graphs in some form and then store them with a proven relational or NoSQL storage engine. Both have advantages and disadvantages. For example, native databases are usually faster, whereas non-native databases rely on a long known and well understood and optimized technique.
Query languages are another distinguishing feature. Unlike the relational world, no single standard such as SQL has been established; rather, a number of different query languages have emerged, including the widely used Cypher, Gremlin (as used in this article), SPARQL, GraphQL, and others. Although graph databases are mainly used for online transaction processing (OLTP), graph compute engines process large batch jobs with graph data (online analytical processing, or OLAP).
Both techniques are best suited for scenarios in which the object is less about retrieving discrete facts than related things, or relationships. Anything that can be represented as a network or tree is a meaningful task for graph databases, which includes anything that has to do with finding patterns or hotspots in a network. Such cases can only be represented inefficiently and with poor performance in the relational model because the relationships have to be represented by joins and special tables and cannot have any special properties.
Buy this article as PDF
(incl. VAT)