Advantages of data analysis with graph databases
Relationship Status
The volume and variety of collected data are constantly growing, which prompts enterprises, for certain use cases, to turn to a new generation of database technology. The structure and query language of graph databases allow current data to be correlated and recognized in real time. Furthermore, graph databases offer massive speed advantages when it comes to evaluating special datasets.
Limits of Relational Databases
Most enterprises, through websites and other methods, build huge databases. Analyzing these databases can reveal important information about customers, helping to predict user behavior, determine optimal pricing, and overcome operational challenges. For some time now, the hurdles for the use of graphs have been falling as a variety of functions from modern query languages have become available. On top of this, most cloud vendors now offer graph technology as a service in addition to the proven relational database management system (RDBMS) options.
RDBMSs were created in the 1970s for storing, processing, and retrieving structured data, such as records of financial transactions that are stored in a tabular format, and they are still widely used in organizations. However, data volume is now a genuine problem, and developers are increasingly looking for other approaches. NoSQL databases and key/value stores can solve part of this challenge, but they do not provide the analytical capabilities that organizations need to turn data into actionable insights.
Additionally, other fields of application centering on Big Data (e.g., fraud detection, supply chain management, risk analysis, or recommendation generation) also cause headaches because individual data structures have to be created in relation to each other. The graph database has been around for several decades as a concept, but recent innovations in storage and processing performance and the evolution toward Turing completeness [1] have meant that the graph is often the best (and sometimes only) approach to addressing these challenges.
An RDBMS is useful in its own right, but if you want to implement graph-style solutions (e.g., in SQL), you will encounter difficulties. SQL databases lack native support for edges, which means that additional work is required to create one-to-many and many-to-many connections. As a result, analyzing relationships across more than three or four connections becomes computationally expensive. The integration of new or changed data classes means creating new table structures and is therefore a cost factor that should not be underestimated.
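The cost of multi-hop analysis in SQL can be sketched with Python's built-in sqlite3 module. The table name follows, its columns, the sample rows, and the helper reachable_in_hops are all hypothetical and chosen only for illustration. The point is that each additional hop adds another self-join to the query.

```python
import sqlite3

# Hypothetical edge table: each row is one "follows" relationship.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE follows (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "eve")],
)


def reachable_in_hops(conn, start, hops):
    """Return the nodes exactly `hops` edges away from `start`.

    One self-join is generated per additional hop, so the query
    grows with the length of the path being analyzed.
    """
    joins = "".join(
        f" JOIN follows f{i} ON f{i}.src = f{i - 1}.dst"
        for i in range(1, hops)
    )
    sql = (
        f"SELECT DISTINCT f{hops - 1}.dst "
        f"FROM follows f0{joins} WHERE f0.src = ?"
    )
    return sorted(row[0] for row in conn.execute(sql, (start,)))


one_hop = reachable_in_hops(conn, "ann", 1)
three_hops = reachable_in_hops(conn, "ann", 3)
```

With this sample data, a three-hop query already needs two self-joins on the same table, and the database has to match rows again at every step.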
Moreover, creating data connections, correcting errors, and maintaining queries for data connections can often be extremely difficult. With a graph query language, a developer can write complex queries that are easier to build and debug because the relationships modeled by the data faithfully reflect real-world situations. Some of these languages are specifically designed for analysis, so it is not necessary to leave the query language repeatedly for compute tasks.
Although NoSQL databases are well suited to storing and retrieving unstructured data, they suffer from the same query language limitations as SQL. Because of their structure, relationship analyses of more than two or three hops also require multiple table scans, which means that processing time rises rapidly as queries become more complex.
Advantages of Graph Databases
Graph databases can do many things that RDBMSs cannot do, or have difficulty doing. Conceptually, graphs model data more naturally than do RDBMSs. In a graph database, objects can be set in relation to each other and linked accordingly, rather than being forced into a standardized table format.
Many of these challenges in modeling complex information networks are solved by graph databases that use database schemata and a graph query language. The basic structure of a graph database is node-edge-node, which makes it possible in practice to represent a relationship such as "object A is connected to object C by edge B." With a well-designed graph solution, developers can add properties to these three elements, creating an environment with context information for all data.
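The node-edge-node structure with properties on all three elements can be sketched in a few lines of Python. The class names Node and Edge, the property keys, and the describe helper are assumptions made for this illustration and do not correspond to any particular graph database product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: Node
    target: Node
    label: str
    properties: dict = field(default_factory=dict)


def describe(edge):
    """Render an edge in node-edge-node form."""
    return f"{edge.source.node_id} -[{edge.label}]-> {edge.target.node_id}"


# "Object A is connected to object C by edge B", with hypothetical
# context information attached to each of the three elements.
a = Node("A", {"kind": "object"})
c = Node("C", {"kind": "object"})
b = Edge(a, c, "connected_to", {"since": 2020})

summary = describe(b)
```

Because the edge is a first-class element with its own properties, context such as a start date or a weight sits on the relationship itself rather than in a separate join table.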
In this structure, the hop (i.e., the transition from one node to the next) is the basic unit for calculations. From this hop, a relationship is calculated that is then returned as a value. Acquiring and processing such hop values are the basic components of a graph query, and given a graph query language with Turing completeness, the calculations required for complex analyses can be grouped with the hop values.
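As a rough sketch of how calculations can be grouped with hops, the following Python function walks a small weighted graph one hop at a time and accumulates a value along the way. The graph, its weights, and the function name accumulate_over_hops are hypothetical, and a plain dictionary stands in for the database. A graph query language would express the same idea declaratively.

```python
# Hypothetical adjacency list: node -> list of (neighbor, edge weight).
graph = {
    "A": [("B", 2.0), ("C", 5.0)],
    "B": [("D", 1.0)],
    "C": [("D", 1.5)],
    "D": [],
}


def accumulate_over_hops(graph, start, max_hops):
    """Return the lowest accumulated weight per node within max_hops."""
    best = {start: 0.0}
    frontier = {start: 0.0}
    for _ in range(max_hops):
        next_frontier = {}
        for node, total in frontier.items():
            # One hop: move along each outgoing edge and add its weight.
            for neighbor, weight in graph[node]:
                candidate = total + weight
                if candidate < next_frontier.get(neighbor, float("inf")):
                    next_frontier[neighbor] = candidate
        for node, total in next_frontier.items():
            if total < best.get(node, float("inf")):
                best[node] = total
        frontier = next_frontier
    return best


within_one_hop = accumulate_over_hops(graph, "A", 1)
within_two_hops = accumulate_over_hops(graph, "A", 2)
```

Each pass of the outer loop corresponds to one hop, and the accumulated value travels with the traversal instead of being computed in a separate step afterward.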
Because a graph database stores connection information directly when the data is loaded, queries avoid the index searches for join operators that cause the performance bottlenecks typical of SQL. Therefore, no further calculations need to be performed to resolve relationships when a graph query runs against the data.
This property, referred to as index-free adjacency, is found only in native graph databases. It allows a traversal rate of several million nodes per second, which is why response times can be several orders of magnitude faster than for join-based queries in relational databases. A good example is computing the shortest path in a route calculation. This function is also increasingly being used in machine learning and artificial intelligence (AI) scenarios.
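The idea behind index-free adjacency can be illustrated with an in-memory analogy in Python. Each vertex holds direct references to its neighbors, so following an edge needs no lookup in a separate index. The Vertex class, the sample route network, and the shortest_path function are assumptions for this sketch. A native graph database applies the same principle at the storage layer, which this example does not attempt to model.

```python
from collections import deque


class Vertex:
    def __init__(self, name):
        self.name = name
        self.neighbors = []  # direct references to adjacent vertices

    def link(self, other):
        """Create an undirected connection between two vertices."""
        self.neighbors.append(other)
        other.neighbors.append(self)


def shortest_path(start, goal):
    """Breadth-first search: fewest hops between two vertices."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex is goal:
            path = []
            while vertex is not None:
                path.append(vertex.name)
                vertex = previous[vertex]
            return path[::-1]
        for neighbor in vertex.neighbors:
            if neighbor not in previous:
                previous[neighbor] = vertex
                queue.append(neighbor)
    return None  # no route exists


# Hypothetical route network.
depot, north, south, east, customer = (
    Vertex(n) for n in ("depot", "north", "south", "east", "customer")
)
depot.link(north)
north.link(customer)
depot.link(south)
south.link(east)
east.link(customer)

route = shortest_path(depot, customer)
```

Breadth-first search finds the path with the fewest hops. For weighted routes, an algorithm such as Dijkstra's would take its place, but the traversal would still follow the neighbor references directly.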
Native graph databases with massively parallel processing capabilities that enable rapid data compression and decompression can deliver results in seconds that would take several hours to compute with traditional database technology.
Graph databases support a number of algorithm classes that are difficult or impractical to run in an RDBMS:
- Path algorithms, which find the shortest path between nodes and evaluate paths (e.g., shortest path, cycle detection, and minimum spanning tree).
- Centrality algorithms, which rank nodes according to the degree of their connection or the central position of a node by edge weighting (PageRank and closeness centrality).
- Community algorithms, which can determine how a group is clustered or divided (connected components, label propagation, triangle counting, and Louvain modularity).
- Similarity algorithms, which determine how similar a node is to its neighbors (cosine similarity, Jaccard similarity).
- Classification algorithms, which predict the classification of a given node according to previously classified nodes (k-nearest neighbor, cosine similarity).
Such algorithms can be used to optimize existing applications and develop entirely new solutions for companies that analyze huge pools of data.
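Two of the algorithms named in the list above, Jaccard similarity and connected components, can be sketched in plain Python. The sample graph and the function names are hypothetical, and the code is meant only to show the principle, not how a graph database implements these algorithms internally.

```python
# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "x": {"y"},
    "y": {"x"},
}


def jaccard_similarity(graph, first, second):
    """Shared neighbors divided by all neighbors of the two nodes."""
    union = graph[first] | graph[second]
    if not union:
        return 0.0
    return len(graph[first] & graph[second]) / len(union)


def connected_components(graph):
    """Group nodes that can reach each other into components."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


similarity = jaccard_similarity(graph, "a", "b")
components = connected_components(graph)
```

In this sample, nodes a and b share one of three distinct neighbors, and the graph splits into two components. Both functions work directly on neighbor sets, which is the information a graph model keeps at hand for every node.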
Like neural networks, and thus much of machine learning, graph databases are characterized by the interconnection of properties, which is why graphs are a valuable tool for AI algorithms.
Applications in Logistics
Graph databases allow organizations to analyze data in a format that better reflects the relationships between the objects underlying the data, allowing developers to use meaningful approaches that address issues such as page ranking, social media links, customer analysis, fraud detection, real-time product recommendations, and risk assessment.
For example, many supply chain management solutions facilitate work in specific areas, such as storage and transportation, but approaches that cover all aspects are rare. The datasets required for supply chain management are inherently large, stored in isolation, and distributed across different systems on both the material and production side.
Graphs allow developers to do justice to the essence of supply chain management. Their holistic data model is based on the actual relationships between the individual elements of a supply chain, such as "plant A buys component B from supplier C" or "component B is delivered by carrier D to plant A." If thousands of such relationships are linked together, a typical use case for a graph is created.
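The supply chain relationships quoted above can be sketched as labeled edges in Python. The relation names (supplies, delivered_by, and so on), the way the two example statements are split into pairwise edges, and the downstream helper are assumptions for this illustration. A production model would carry far more detail.

```python
from collections import defaultdict

# Hypothetical decomposition of "plant A buys component B from supplier C"
# and "component B is delivered by carrier D to plant A" into labeled edges.
edges = [
    ("supplier C", "supplies", "component B"),
    ("component B", "delivered_by", "carrier D"),
    ("component B", "used_in", "plant A"),
    ("carrier D", "delivers_to", "plant A"),
]

outgoing = defaultdict(list)
for source, relation, target in edges:
    outgoing[source].append((relation, target))


def downstream(start):
    """Return every element reachable from `start` along the edges."""
    reached = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _relation, target in outgoing.get(node, []):
            if target not in reached:
                reached.add(target)
                stack.append(target)
    return reached


affected = downstream("supplier C")
```

A question such as "what is affected if supplier C fails?" then becomes a simple traversal, and linking thousands of such edges extends the same query without changing its shape.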
Most companies that use graph solutions for supply chain management use them in combination with their existing enterprise resource planning systems to gain real-time insights into planning, pricing, and resource allocation.