« Previous 1 2 3 Next »
Distributed Linear Algebra with Apache Mahout
Matrix Math
Engine Agnosticism (and Why It Matters)
Engine agnosticism is important to many organizations, many of whom don't even realize it. In just the 2000s, clusters have moved from Hadoop to Spark, and from Spark to Kubernetes. Teams and organizations often find themselves in an uncomfortable position of being pinned to outmoded technology, rather than bringing in costly consultants to port business-critical algorithms and methods from an old system to a new one.
The Mahout project lived through the migration from Hadoop to Spark and incorporated the lessons learned into its very fabric, making it simple to port algorithms from any arbitrary platform to any other (Figures 3 and 4).
Note that the code after the import statements requires no changes. Therefore, a team that uses one back-end engine is able to migrate code onto another engine, without having to change the code performing the mathematical operations and avoiding the more error-prone part of porting code from one platform to another.
Mahout Use Cases
Mahout is for organizations who have statistical methods to run on distributed datasets but want to minimize their exposure to the technical debt that arises from writing algorithms against a specific engine that may or may not have a successful future. Mahout allows organizations to switch their systems of record while having a minimal effect on their data outputs.
Mahout also lets users add linear algebra concepts to data stores that have either weak or nonexistent implementations for linear algebra concepts, such as Apache Spark. (Spark's linear algebra works fine in single-node deployments but has issues scaling to larger distributed datasets.)
Mahout in the Wild
A major used car marketplace in North America used the Mahout codebase when creating their car recommendation system. This recommendation engine is based on Mahout's Correlated Cross-Occurrence (CCO) analysis. The CCO algorithm is very similar to the more popular co-occurence (CO) algorithm, but it also incorporates other attributes of the user into its recommendations; in more technical parlance, it is multimodal.
Mahout has been used in many situations where customer privacy and intellectual property concerns keep them from being published, but many researchers and practitioners have built recommenders, similarity engines, and other predictive models at scale with the use of its tools.
A paper written by T. Grant [2] illustrates another Mahout use case. During the outset of the COVID-19 pandemic, CT scans were shown to be as good as, or in some cases superior to, RT-PCR tests. A major issue however was the high dose of radiation they delivered. With Mahout, Grant showed how "noisier," "low-dose" CT scans could be quickly and easily denoised, with about five lines of Mahout code.
« Previous 1 2 3 Next »
Buy this article as PDF
(incl. VAT)