Graph database Neo4j discovers fake reviews on Amazon
Digital Detective
Graph databases do not use the relational tables and join commands of traditional relational databases. Instead, they look for relations between nodes and support queries that would be slow or even impossible to process in their relational counterparts. In this article, I take advantage of the graph database structure to create an algorithm that detects fake product reviews on Amazon with a Neo4j instance in a Docker container.
Fraudulent Reviews
On closer inspection of a product on Amazon that has consistently earned five-star ratings, it often turns out that many of the reviewers are professional lackeys. The text obviously betrays that the author did not even use the product ("Great product, fast delivery!"). If you then search for further reviews from the same customer, you will often find other five-star reviews that look very similar. The problem is so evident on Amazon that customers rub their eyes in amazement wondering why the online giant doesn't intervene.
Graph databases can help identify such shenanigans. Several criteria can help detect patterns in the typical behavior of fraudsters and expose them. Does a single customer write hundreds of five-star ratings? Suspicious. Does a product have many of these boilerplate reviews? There could be something wrong with that. Do the members of a gang of fraudsters all review the same products?
If the alarm bells go off for only one of these criteria, you might not necessarily suspect misuse, but two or more increase the likelihood of fraud. Further investigation would then be worthwhile to see whether the intent is to rip off customers.
Detection Algorithm
The last of the previously mentioned criteria seems interesting from a programming point of view. How does an algorithm find groups of users who all rate the same products without having any clues as to which users they are?
Listing 1 shows a fictitious YAML list of products with the names of evaluators. A similar list could be obtained with real data from the Amazon website with the official API or a scraper.
Listing 1
reviews.yaml
reviews:
  product1:
    - reviewer1
    - reviewer2
    - reviewer3
    - reviewer7
  product2:
    - reviewer1
    - reviewer2
    - reviewer4
    - reviewer8
  product3:
    - reviewer3
  product4:
    - reviewer4
    - reviewer7
  product5:
    - reviewer5
    - reviewer8
  product6:
    - reviewer6
The human eye immediately recognizes that a dubious duo consisting of reviewer1 and reviewer2 reviewed both product1 and product2. If the data were only available in a relational data model, discovering this connection in a very large database would be very time consuming.
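To make the criterion concrete, the following is a minimal in-memory sketch in Go that counts how many products each pair of reviewers has in common. It is a brute-force illustration on the data from Listing 1, not the Neo4j approach developed in this article; the names sharedProducts and suspiciousPairs, the hard-coded map, and the threshold of two shared products are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// reviews mirrors Listing 1: product name -> reviewers (hard-coded for the sketch).
var reviews = map[string][]string{
	"product1": {"reviewer1", "reviewer2", "reviewer3", "reviewer7"},
	"product2": {"reviewer1", "reviewer2", "reviewer4", "reviewer8"},
	"product3": {"reviewer3"},
	"product4": {"reviewer4", "reviewer7"},
	"product5": {"reviewer5", "reviewer8"},
	"product6": {"reviewer6"},
}

// pair is an unordered pair of reviewers, stored so that a < b.
type pair struct{ a, b string }

// sharedProducts counts, for every pair of reviewers, how many products
// both of them reviewed. It assumes each reviewer appears at most once
// per product.
func sharedProducts(reviews map[string][]string) map[pair]int {
	counts := make(map[pair]int)
	for _, reviewers := range reviews {
		for i := 0; i < len(reviewers); i++ {
			for j := i + 1; j < len(reviewers); j++ {
				a, b := reviewers[i], reviewers[j]
				if a > b {
					a, b = b, a
				}
				counts[pair{a, b}]++
			}
		}
	}
	return counts
}

// suspiciousPairs returns the pairs that share at least minShared products,
// sorted by name so the output is stable.
func suspiciousPairs(reviews map[string][]string, minShared int) []pair {
	var out []pair
	for p, n := range sharedProducts(reviews) {
		if n >= minShared {
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].a != out[j].a {
			return out[i].a < out[j].a
		}
		return out[i].b < out[j].b
	})
	return out
}

func main() {
	for _, p := range suspiciousPairs(reviews, 2) {
		fmt.Printf("%s and %s share at least 2 products\n", p.a, p.b)
	}
}
```

On the sample data, only reviewer1 and reviewer2 reach the threshold. The nested loops compare every pair of reviewers per product, which is fine for six products but hints at why this kind of question becomes expensive on large data sets.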
With graph databases, which simply traverse the relations between nodes instead of juggling relational tables and computationally expensive join commands, it is relatively easy to program smart algorithms. I discovered graph databases six years ago and featured them in an article [1]; however, the genre has not stood still, which calls for a new look.
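The idea of traversing relations can be imitated in plain Go with two adjacency maps. The sketch below is only an analogy and says nothing about how Neo4j stores or walks its graph internally; the graph type and the newGraph and coReviewers names are hypothetical. Starting at one reviewer, it hops to each product that reviewer rated and from there to every other reviewer of that product.

```go
package main

import "fmt"

// reviews mirrors Listing 1: product name -> reviewers.
var reviews = map[string][]string{
	"product1": {"reviewer1", "reviewer2", "reviewer3", "reviewer7"},
	"product2": {"reviewer1", "reviewer2", "reviewer4", "reviewer8"},
	"product3": {"reviewer3"},
	"product4": {"reviewer4", "reviewer7"},
	"product5": {"reviewer5", "reviewer8"},
	"product6": {"reviewer6"},
}

// graph keeps both directions of the reviewer/product relation so that
// each hop is a simple map lookup.
type graph struct {
	reviewersOf map[string][]string // product -> reviewers
	productsOf  map[string][]string // reviewer -> products
}

// newGraph builds the reverse index (reviewer -> products) from the list.
func newGraph(reviews map[string][]string) *graph {
	g := &graph{reviewersOf: reviews, productsOf: make(map[string][]string)}
	for product, reviewers := range reviews {
		for _, r := range reviewers {
			g.productsOf[r] = append(g.productsOf[r], product)
		}
	}
	return g
}

// coReviewers walks reviewer -> product -> reviewer from start and counts
// how many times each other reviewer is reached, that is, how many
// products they have in common with start.
func (g *graph) coReviewers(start string) map[string]int {
	reached := make(map[string]int)
	for _, product := range g.productsOf[start] {
		for _, other := range g.reviewersOf[product] {
			if other != start {
				reached[other]++
			}
		}
	}
	return reached
}

func main() {
	g := newGraph(reviews)
	for other, n := range g.coReviewers("reviewer1") {
		fmt.Printf("reviewer1 shares %d product(s) with %s\n", n, other)
	}
}
```

The work done by coReviewers depends only on the neighborhood of the starting node, not on the total size of the data set, which is the property that makes traversal attractive for this kind of question.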
Prettified
The Go program presented in this issue converts the YAML list from Listing 1 into a graph that shows which products were evaluated by which persons.
To do this, it sends commands to a locally installed Neo4j database, which, when the program has run, displays the graph shown in Figure 1 with the relations between products and reviewers. The screenshot is taken from a web browser window pointed at http://localhost:7474. The Neo4j installation conveniently provides not only the server in a container, but also a web interface for presenting the data graphically.
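As a rough idea of what such commands could look like, the sketch below turns the product list into Cypher MERGE statements, one per review. It is not the program from this article: the Product and Reviewer labels, the REVIEWED relationship type, the name property, and the cypherStatements function are assumptions, the data is hard-coded instead of being parsed from reviews.yaml, and the statements are only printed rather than sent to the database.

```go
package main

import (
	"fmt"
	"sort"
)

// reviews mirrors Listing 1: product name -> reviewers.
var reviews = map[string][]string{
	"product1": {"reviewer1", "reviewer2", "reviewer3", "reviewer7"},
	"product2": {"reviewer1", "reviewer2", "reviewer4", "reviewer8"},
	"product3": {"reviewer3"},
	"product4": {"reviewer4", "reviewer7"},
	"product5": {"reviewer5", "reviewer8"},
	"product6": {"reviewer6"},
}

// cypherStatements returns one Cypher statement per review. MERGE creates
// a node or relationship only if it does not exist yet, so running the
// statements twice does not duplicate anything.
//
// Names are interpolated directly into the query text, which is acceptable
// for this fixed sample data only; real input should be passed as query
// parameters instead.
func cypherStatements(reviews map[string][]string) []string {
	// Sort the product names so the output order is deterministic.
	products := make([]string, 0, len(reviews))
	for p := range reviews {
		products = append(products, p)
	}
	sort.Strings(products)

	var stmts []string
	for _, p := range products {
		for _, r := range reviews[p] {
			stmts = append(stmts, fmt.Sprintf(
				"MERGE (p:Product {name: '%s'}) MERGE (r:Reviewer {name: '%s'}) MERGE (r)-[:REVIEWED]->(p)",
				p, r))
		}
	}
	return stmts
}

func main() {
	for _, s := range cypherStatements(reviews) {
		fmt.Println(s + ";")
	}
}
```

The printed statements could be pasted into the query field of the web interface at http://localhost:7474 to try them out; sending them from Go would additionally require a Neo4j driver, which this sketch leaves out.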