An introduction to a special tool for transforming data formats
Data Lego
The process of preparing data is called "extract, transform, and load" (ETL), and specialty tools like Jaspersoft ETL help you carry out these tasks. In this article, I introduce the community version 6.0.1 of Jaspersoft ETL, which is available as a free download [1].
Data Gold
Panning for gold is tedious. Nuggets of precious metal don't simply sit around on a riverbed. Instead, the prospector has to sift through multiple pans of sand and stones, often retrieving just a few flakes for all of the trouble expended. Data is the equivalent of gold in today's world. Given this reality, you must tediously filter huge volumes of data to extract tiny particles of information that have real value.
When looking for gold, a prospector first has to get a fix on the location of a deposit and then get access to it. The same can be said for data: Locating and tapping a promising source involves procedures like harmonizing character encodings, converting values and data formats, and importing the results into databases. These preparatory steps need to be performed before you can effectively pan for gold in the form of data.
Converting Files
One of the easiest exercises for learning the ETL tool is converting an input file to a different format. A simple example of a text-based book list suffices for the first practical steps (Listing 1). The columns are separated by semicolons and padded to a fixed width; the lines are in no particular order.
Listing 1
Book List (Excerpt)
Year;Number;Author ;Title ;Publisher ;Pp ;Price
2015;06; Stepanov, Rose ;From Mathematics to Generic Programming ;Edison Wesley ;0320;28
2015;09; Whithurst, Jim ;The Open Organization ;Harvard Business Review ;0227;16
2015;03; Kofler, Kühnast, Scherbeck ;Raspberry Pi: Das umfassende Handbuch ;Galileo ;1064;40
2016;04; Heinle, Stefan ;Heimautomatisierung mit KNX, DALI, 1-Wire und Co. ;Rheinwerk ;1260;50
2016;09; Blasband, Darius ;The rise and fall of software recipes ;Reality Bites Publishing ;0360;25
2016;11; Luckhaus, Stefan ;Produktivitäts- und Leistungsmessung ;Pass Consulting ;0125;20
2016;01; Troche, Christoph ;FRITZ!Box ;Mitp ;0240;20
2015;02; Will, Torsten T. ;Einführung in C++ ;Galileo ;0520;25
2015;10; Keller, Müller ;Pentaho und Jedox ;Hanser ;0435;60
2015;03; Ristic, Ivan ;Bulletproof SSL and TLS ;Feisty Duck ;0507;46
...
To convert to a different format, you must first set up a "job," which is accomplished by starting the TIBCO Jaspersoft ETL tool and selecting the Create Job option from the context menu for Job Designs in the left pane. After entering a name for the job in the dialog box, a four-pane window opens (Figure 1):
- Top left is a file structure for the jobs and their components.
- The center section has an area with a gridded background in which to assemble jobs from components.
- Top right is a menu of components.
- The bottom window has several tabs used to configure the components and start the jobs.
Now it is time to acquire the data to be processed. In this example, the data source is the text file in Listing 1. You need to use a component of type tFileInputDelimited, which opens the delimited text file, reads it in line by line, separates the fields according to a defined separator and schema, and passes the fields on to the next component for further processing.
Drag the tFileInputDelimited input component from the menu on the right to the center of the gridded assembly area. Double-clicking on the component opens a new dialog box in which you define the properties of the component (Figure 2), including the path to the input file and the separators. With a single click on the Edit schema button, you come to a window in which you can set up the data type for each field of the input file.
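The Edit schema dialog essentially maps each field to a name and a data type. As an illustration only (not Jaspersoft's generated code), the same idea can be sketched in Python with a hypothetical schema list and the standard csv module:

```python
import csv
import io

# Hypothetical schema mirroring the Edit schema dialog: (field name, type)
SCHEMA = [("Year", int), ("Number", int), ("Author", str),
          ("Title", str), ("Publisher", str), ("Pp", int), ("Price", int)]

raw = "2015;06; Stepanov, Rose ;From Mathematics to Generic Programming ;Edison Wesley ;0320;28"

# Split the line on the separator and convert each field to its schema type
row = next(csv.reader(io.StringIO(raw), delimiter=";"))
record = {name: conv(value.strip()) for (name, conv), value in zip(SCHEMA, row)}
print(record["Year"], record["Title"])
```

In the real tool, this per-field typing happens once in the schema editor and is then applied to every line the component reads.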
From the menu on the right, select the component to be used to present the result and drag it onto the assembly area. This example uses a component of type tFileOutputExcel. After double-clicking on this component, you can define the name and path of the output file as above, before connecting the components from left to right by clicking on the handle of one icon and pulling a line to the next icon (Figure 3). Now the only thing left to do is start the job, and voilà, an Excel spreadsheet containing the formerly semicolon-separated data appears as specified.
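Under the hood, Jaspersoft ETL generates Java code for such a job. Purely as an illustration, the same read-transform-write pipeline can be sketched in Python; writing real Excel files would require a third-party library such as openpyxl, so this standard-library sketch writes a comma-separated file instead, with the hypothetical convert() helper standing in for the tFileInputDelimited/tFileOutputExcel pair:

```python
import csv

def convert(src_path, dst_path):
    """Read a semicolon-delimited file and write it out comma-delimited,
    mimicking an input component wired to an output component."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=";")   # the input component
        writer = csv.writer(dst)                  # the output component
        for row in reader:                        # line-by-line data flow
            writer.writerow(cell.strip() for cell in row)

# usage (assumed paths): convert("books.csv", "books_converted.csv")
```

The point of the GUI tool, of course, is that you assemble exactly this data flow from prefabricated blocks instead of writing it yourself.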
Sorting Files
The next level of complexity for a job like this is to insert between the input and output components a component that will process the input data before it is output. For example, sorting would be a very simple method to insert here.
In the example at hand, try sorting the unordered list of books according to year and issue number. To do this, place the tSortRow component between the input and output modules. The component is found in the menu on the right under the Processing option.
Afterward, you should connect the components as before (Figure 4). This sets up the path for the data from the input CSV file to the sorter and from there to the output file. Now you can double-click to configure the sorting criteria in the columns of the schema.
Once the sorting criteria have been defined, you can start the job. After a few seconds, you will see a short notification in the assembly area about the number of processed lines and the elapsed processing time.
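For comparison only, here is a hedged Python sketch of what the tSortRow step does with these two sort criteria; the three-line data excerpt is abbreviated from Listing 1:

```python
import csv
import io

DATA = """2015;06; Stepanov, Rose ;From Mathematics to Generic Programming
2016;04; Heinle, Stefan ;Heimautomatisierung mit KNX, DALI, 1-Wire und Co.
2015;02; Will, Torsten T. ;Einführung in C++"""

rows = list(csv.reader(io.StringIO(DATA), delimiter=";"))

# tSortRow equivalent: first criterion year, second criterion issue number,
# both ascending and compared numerically
rows.sort(key=lambda r: (int(r[0]), int(r[1])))

for r in rows:
    print(r[0], r[1], r[3].strip())
```

In the tool, the same two criteria are simply rows in the component's configuration table rather than a sort key in code.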