Workflow-based data analysis with KNIME

Analyze This!

Controlling and Configuring the Flow

The column loop still has a problem: The learner needs to know which column is to be predicted; however, it cannot be hardwired into the model, because the column name changes with every iteration (sub0, sub1, etc.). Here is where flow variables come into play.

Flow variables are basically key-value pairs that the user sends through the workflow with the data. These variables can be seen, for example, in the window for the output data of a node in the Flow Variables tab. Within the loop, the name of the column currently being processed is available as the currentColumnName flow variable.

Now you need to configure RProp MLP Learner so that it uses the value of this variable to determine the column with the class information (i.e., sub0, sub1, etc.). To do this, you open the configuration dialog for the node and click on the Flow Variables tab. You will see a list with settings.

RProp MLP Learner has two entries per setting: one for internal use and one for normal use. They can be identified by the selection box to the right. In this box, you can specify a flow variable to be used instead of the currently selected value for this setting. If you want to use currentColumnName as the class column, select this entry under classcol (Figure 17).

Figure 17: The class column selection box that assigns the value of a flow variable.
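The idea of overriding a setting with a flow variable can be illustrated outside of KNIME. The following is a minimal Python sketch, not KNIME code: the functions column_loop and learner_settings are hypothetical stand-ins for the loop start node and the learner's configuration, and only the names currentColumnName and classcol come from the workflow described above.

```python
from typing import Dict, Iterator, List


def column_loop(columns: List[str]) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for the column loop: one set of flow variables per column."""
    for name in columns:
        # The loop exposes the column being processed as a key-value pair.
        yield {"currentColumnName": name}


def learner_settings(flow_variables: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical stand-in for the learner's settings."""
    settings = {"classcol": "sub0"}  # value hardwired in the dialog
    # The flow variable replaces the hardwired value for this iteration.
    settings["classcol"] = flow_variables["currentColumnName"]
    return settings


subscription_columns = [f"sub{i}" for i in range(5)]
class_columns = [
    learner_settings(variables)["classcol"]
    for variables in column_loop(subscription_columns)
]
print(class_columns)
```

Each pass through the loop sees a different class column, although the learner's configuration was written only once.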

Saving and Sharing Models

The next step is to save the newly learned model so that you can reuse it later. The output of the RProp MLP Learner (i.e., the neural network model) is formatted in the Predictive Model Markup Language (PMML) [14]. This standardized format for predictive models is widely used in machine learning and is therefore well suited for exchanging models between different platforms and tools.
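Because PMML is an XML dialect, other tools can inspect a saved model with an ordinary XML parser. The following Python sketch assumes a hand-written, deliberately incomplete PMML fragment; the version and namespace shown are an assumption and not necessarily what KNIME writes, and model_elements is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative skeleton only: a real model file contains far more detail.
PMML_SNIPPET = """<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header/>
  <DataDictionary numberOfFields="0"/>
  <NeuralNetwork functionName="classification"/>
</PMML>"""


def model_elements(pmml_text: str) -> List[str]:
    """Return the names of the top-level elements that describe models."""
    root = ET.fromstring(pmml_text)
    bookkeeping = {"Header", "DataDictionary"}
    # Element tags look like "{namespace}Name"; keep only the local name.
    names = [child.tag.split("}", 1)[-1] for child in root]
    return [name for name in names if name not in bookkeeping]


print(model_elements(PMML_SNIPPET))
```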

The PMML Writer node is required to write the model to disk. However, because it runs within the loop, the file name in the configuration dialog depends on the current iteration, that is, on the column to be learned. The workflow must therefore combine a folder and a dynamic file name to form a path.

This chore is best accomplished with the Create File Name node. Its input is graphically marked with an unfilled red circle, indicating that it is an optional input for flow variables. You need to connect it to Column List Loop Start, but at first glance, this node does not have a compatible output port. However, appearances are deceptive. The flow variable outputs and inputs of most nodes are simply hidden. They can be made visible by right-clicking a node and selecting the Show Flow Variable Ports menu item.

In the Create File Name configuration dialog, you can now specify any folder and set .pmml as the file extension. The Base file name, on the other hand, is defined by the current column (sub0, sub1, etc.), which can be found in the currentColumnName flow variable. You can assign it to the configuration field by clicking the small button next to the text field, checking the Use variable checkbox, and selecting the flow variable in the list next to it.

The output from the node must be connected to the corresponding input of the PMML Writer node. To do this, the flow variable input does not necessarily have to be visible; you just need to drag a connection to the upper left corner of the node. In the configuration dialog for the PMML Writer node, you can then use the flow variable button to select the filePath variable again under Output location to save the model at this location.
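What the Create File Name node contributes can be pictured as simple path assembly. The following Python sketch is an analogy rather than the node's implementation: create_file_name is a hypothetical function, and the folder /tmp/models is an arbitrary example. Only the filePath and currentColumnName variable names are taken from the workflow.

```python
from pathlib import PurePosixPath
from typing import Dict


def create_file_name(folder: str, base_name: str, extension: str) -> Dict[str, str]:
    """Combine folder, base name, and extension into a filePath variable."""
    # Accept the extension with or without a leading dot.
    suffix = extension if extension.startswith(".") else "." + extension
    path = PurePosixPath(folder) / (base_name + suffix)
    return {"filePath": str(path)}


# One path per loop iteration, driven by the current column name.
for column in ("sub0", "sub1"):
    flow_variables = {"currentColumnName": column}
    flow_variables.update(
        create_file_name("/tmp/models", flow_variables["currentColumnName"], ".pmml")
    )
    print(flow_variables["filePath"])
```

Because the base name comes from the loop variable, each iteration writes to its own file instead of overwriting the previous model.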

The loop ends with a Variable Loop End node connected to the flow variable output of PMML Writer. Now KNIME iterates through the loop once for each of the five subscription columns, and five prediction models end up in the folder specified in Create File Name. Figure 16 shows the workflow for training and saving models.

Prediction

The five trained models are only truly useful if you use them to predict whether readers without a newsletter subscription might be interested in one and, if so, in which.

The best way to accomplish this is to create a new workflow that first locates the previously saved models with the List Files node. Set the filter to .pmml and the filter type to file extension(s). The folder to be searched is, of course, the one to which the prediction models were written earlier.

The result is a table in which each row corresponds to a matching file in the folder. If you connect this table to a Chunk Loop Start node, the subsequent part of the workflow runs separately for each file. With the use of a Table Row to Variable node, the individual rows can be converted into flow variables, in which each column becomes a variable. The URL variable can then be used in a PMML Reader node, the counterpart to PMML Writer, as a value for the input file.
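The combination of listing files, looping over the rows, and turning each row into variables can be sketched in plain Python. This is an analogy under stated assumptions, not KNIME code: list_files is a hypothetical function, the temporary folder and its placeholder files exist only for the demonstration, and the table is reduced to the single URL column mentioned above.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def list_files(folder: Path, extension: str = ".pmml") -> List[Dict[str, str]]:
    """Return one row per file with the given extension; the column is named URL."""
    return [
        {"URL": path.resolve().as_uri()}
        for path in sorted(folder.iterdir())
        if path.is_file() and path.suffix.lower() == extension
    ]


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Placeholder files standing in for saved models and unrelated content.
    for name in ("sub0.pmml", "sub1.pmml", "notes.txt"):
        (folder / name).write_text("placeholder")

    rows = list_files(folder)
    for row in rows:
        # Each row becomes a set of variables for one loop iteration.
        flow_variables = dict(row)
        print("would read model from", flow_variables["URL"])
```

The file notes.txt is ignored by the extension filter, so the loop body runs only for the two model files.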

The model at the output of the node can now be passed to a MultiLayer Perceptron Predictor along with the data to be predicted (i.e., the preferences of readers without subscriptions). The predictor applies the model to the new data and predicts whether the reader might be interested in the subscription represented by the model.

You will only want to keep the column with the prediction from the output table of the predictor node. For this you need the Column Filter node: In its configuration dialog, select the Wildcard/Regex Selection option and enter Prediction (sub*) as the pattern. The output from the Column Filter node is linked to a Loop End (Column Append) node to obtain a table (Figure 18) that predicts for each newsletter subscription whether a reader is likely to be interested. Figure 19 shows the workflow for this process.

Figure 18: The table of prediction results.
Figure 19: The workflow for predicting readers' interests in newsletter subscriptions.
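The wildcard filter and the column-wise loop end can also be mimicked in a few lines of Python. The sketch below uses the standard fnmatch module, whose wildcard rules are assumed here to be close enough to the Column Filter's wildcard mode for this pattern. The function column_filter, the reader IDs, and the "n/a" cell values are hypothetical placeholders, not real prediction results.

```python
from fnmatch import fnmatchcase
from typing import Dict, List

Table = Dict[str, List[str]]


def column_filter(table: Table, pattern: str) -> Table:
    """Keep only the columns whose names match the wildcard pattern."""
    return {
        name: values for name, values in table.items() if fnmatchcase(name, pattern)
    }


result: Table = {}
for i in range(5):
    # Placeholder for the predictor output of one loop iteration.
    predictor_output: Table = {
        "reader": ["r1", "r2"],
        f"Prediction (sub{i})": ["n/a", "n/a"],
    }
    # Appending the filtered columns mirrors the column-wise loop end.
    result.update(column_filter(predictor_output, "Prediction (sub*)"))

print(list(result))
```

The parentheses in the pattern are matched literally; only the asterisk acts as a wildcard.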
