Workflow-based data analysis with KNIME
Analyze This!
Recommended Reading
The point of the exercise is to recommend the articles that are closest to the reader's preferences. For example, if a reader is particularly interested in hardware and security, it would be a good idea to suggest articles from these categories – or even articles that belong to both categories at the same time. The workflow so far has already determined the extent to which a given reader prefers each category, and the result is available in the form of a vector (Figure 5).
You can create a very similar vector for each article: The columns for the categories to which the article is assigned contain a 1, and category columns with no connection to the article contain a 0. The One to Many node makes this possible, transforming the article-category assignments (Table 2) into a binary vector per article (Figure 10); the sketch after the table shows roughly what this transformation does.
Table 2: Categories Table

| Article ID | Category |
|---|---|
| Article 11 | Hardware |
| Article 11 | Software |
| Article 31 | Development |
| … | … |
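Outside of KNIME, the effect of the One to Many node can be imitated in a few lines of pandas. This is a minimal sketch, assuming sample data shaped like Table 2; the column names and article IDs are illustrative only:

```python
import pandas as pd

# Article-category assignments as in Table 2 (hypothetical sample data)
categories = pd.DataFrame({
    "Article ID": ["Article 11", "Article 11", "Article 31"],
    "Category":   ["Hardware",   "Software",   "Development"],
})

# One-hot encode the Category column and aggregate per article,
# so each article ends up with a single binary vector
article_vectors = (
    pd.get_dummies(categories, columns=["Category"], prefix="", prefix_sep="")
      .groupby("Article ID")
      .max()
      .astype(int)
)

print(article_vectors)
# One row per article, one 0/1 column per category
```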
Once the two vector types have been created, the Similarity Search node can simply determine the distance between a reader's vector (which represents that reader's preferences) and an article's vector (which indicates the categories to which it is assigned). The smaller the distance, the better the reader's preferences match the categories in which the article is classified.
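To make the distance idea concrete, here is a small sketch (not the workflow itself) that compares a hypothetical reader preference vector with two article vectors using the Euclidean distance. The Similarity Search node supports several distance measures, so the exact measure is an assumption here:

```python
import numpy as np

# Hypothetical preference vector for one reader (one value per category,
# e.g. derived from article ratings as in Figure 5) and two article vectors
reader = np.array([0.8, 0.1, 0.6])          # Hardware, Software, Development
articles = {
    "Article 11": np.array([1, 1, 0]),      # Hardware + Software
    "Article 31": np.array([0, 0, 1]),      # Development
}

# Euclidean distance: the smaller the value, the better the match
for article_id, vector in articles.items():
    distance = np.linalg.norm(reader - vector)
    print(f"{article_id}: distance = {distance:.3f}")
```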
From all the articles that a given reader has not yet read (see the Row Reference Filter node), it is now easy to determine the article with the shortest distance to the reader's preferences. This article is then recommended for reading.
A loop (consisting of the Chunk Loop Start and Loop End nodes) corresponds to a for loop across all rows of the table; it determines the smallest vector distance for each reader, roughly as sketched below. The overall result, with one article recommendation per reader (Figure 11), could then be written back to a database with the help of the Database Writer node.
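The per-reader loop can be pictured as follows. This is a hedged sketch with invented reader and article data, not the KNIME workflow itself; it drops each reader's already-read articles and keeps the nearest remaining one:

```python
import numpy as np

# Hypothetical input data: preference vectors per reader, binary category
# vectors per article, and the set of articles each reader has already read
reader_prefs = {
    "Reader A": np.array([0.8, 0.1, 0.6]),
    "Reader B": np.array([0.2, 0.9, 0.1]),
}
article_vectors = {
    "Article 11": np.array([1, 1, 0]),
    "Article 31": np.array([0, 0, 1]),
    "Article 42": np.array([0, 1, 0]),
}
already_read = {
    "Reader A": {"Article 11"},
    "Reader B": set(),
}

# One recommendation per reader: the unread article with the smallest distance
recommendations = {}
for reader, prefs in reader_prefs.items():
    candidates = {
        article: np.linalg.norm(prefs - vector)
        for article, vector in article_vectors.items()
        if article not in already_read[reader]
    }
    recommendations[reader] = min(candidates, key=candidates.get)

print(recommendations)
```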
Another Example: Machine Learning
Automatic detection of patterns in large datasets is one of KNIME's prime objectives. Machine learning has hit the headlines mainly in connection with autonomous cars and the dubious machinations of large corporations that are eager to collect data. However, artificial intelligence and machine learning can generate added value in many areas, because their algorithms identify structure in apparently random data.
From the very beginning, the developers of KNIME attached great importance to the integration of the latest machine learning algorithms, starting with support vector machines and simple neural networks. Today, KNIME has also mastered newer methods, such as random forests and deep learning.
The many native KNIME machine learning nodes can also be combined easily with numerous other machine learning tools integrated with KNIME, including scikit-learn [6], R algorithms [7], H2O [8], Weka [9], LIBSVM [10], and the deep learning frameworks Keras [11], TensorFlow [12], and DL4J [13].
Supported by the various possibilities to import, export, visualize, and manipulate data, KNIME offers a platform that maps the entire analysis process as a graphical workflow.
Extended Scenario
The fictional online publisher also offers newsletters on five different topics, and the aim is to make these newsletters appealing to registered readers who are potentially interested in them. Instead of flooding every reader with the newsletters on all five subjects, the publisher wants to proceed in a slightly more intelligent way.
On the basis of known reader preferences and experience with previous subscribers, newsletter recommendations are only to be sent to readers who might genuinely be interested in one of the five topics. Not only does this plan reduce the volume of mail to be sent, it also avoids annoying readers who are completely uninterested in some or all of the topics on offer and who might even reject all of the publisher's offerings as a result.
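The article does not spell out which model would make this decision; purely as an illustration, the following sketch trains a random forest (one of the methods mentioned above, here via scikit-learn) on hypothetical preference vectors of previous readers and estimates how likely a new reader is to be interested in one newsletter:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: preference vectors of previous readers and
# whether they subscribed to a given newsletter (1) or not (0)
X_train = np.array([
    [0.9, 0.1, 0.2],   # Hardware, Software, Development preferences
    [0.1, 0.8, 0.7],
    [0.8, 0.2, 0.1],
    [0.2, 0.9, 0.6],
])
y_train = np.array([1, 0, 1, 0])   # subscribed to the hardware newsletter?

# Train a classifier and score a new reader's preference vector
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

new_reader = np.array([[0.7, 0.3, 0.2]])
probability = model.predict_proba(new_reader)[0, 1]
print(f"Estimated interest: {probability:.2f}")
```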
Known preferences for certain topics, as determined from reader evaluations of journal articles, serve as the data basis (refer to Figure 5). Additionally, information about who has signed up for a newsletter subscription is available as a new data source. This information can be retrieved through a REST interface, which provides the data in JSON format.
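Inside KNIME, the REST call and the JSON parsing are handled by dedicated nodes; outside the workflow, the same retrieval could look roughly like the following Python sketch. The endpoint URL and the JSON field names are assumptions for illustration only:

```python
import requests

# Hypothetical REST endpoint that returns newsletter subscriptions as JSON;
# the URL and field names are placeholders, not the publisher's real API
response = requests.get(
    "https://example.com/api/newsletter-subscriptions", timeout=10
)
response.raise_for_status()

# Expected form: [{"reader": "Reader A", "topic": "Hardware"}, ...]
subscriptions = response.json()
for entry in subscriptions:
    print(entry["reader"], "->", entry["topic"])
```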