Lead Image © J.R. Bale, 123RF.com

Data virtualization using JBoss and Teiid

Uniformity Drive

Article from ADMIN 26/2015
With the JBoss Data Virtualization software, applications receive a uniform interface to all data – regardless of its source.

Agile methods are increasingly used in today's IT environments, particularly for provisioning large amounts of data. Legacy access to data sources is a thing of the past. In this article, I outline the advantages of the underlying methods of JBoss Data Virtualization software, explain how to install it, and conclude with an initial data integration project.

Business intelligence (BI) applications are responsible for analyzing, evaluating, and presenting data sets. The requirements for these applications have become more varied and complex over time. For example, users need to be able to view and analyze data in real time rather than relying on historical data, which in many scenarios is already outdated and no longer of any value. Additionally, users of BI applications have varying requirements and therefore sometimes need to create completely different reports on the basis of the available data. The data itself is, of course, distributed across several sources, and each of these sources is a kind of isolated repository. Applications therefore need to be able to query several of these sources, each through its own interface. Employee data might exist in a SQL database and the employees' expense reports in Excel files; to process an employee expense report and store the results in a database, an application must combine data from both sources. Depending on the desired report, this could involve a large amount of manual work.

The Limits of Proven BI Methods

In the past, such problems were solved by the extract, transform, and load (ETL) process: The relevant data were extracted from different sources, adapted, and transformed before being imported into a target database. The process, however, is quite complex, because all the data must be replicated, and the transformation process is prone to error. Changes to the data models in the source systems can necessitate major adjustments to the transformation mechanisms. Moreover, real-time access is impossible with this kind of data transformation.

The big data trend undoubtedly poses a huge challenge for today's applications. The task in these cases is processing large volumes of data that can be structured, unstructured, semi-structured, or multi-structured. Of course, you must accomplish all this virtually in real time. Such data sets are typically provided by NoSQL databases such as MongoDB. The Apache Hadoop framework plays an important role in processing the data.

Additionally, large amounts of data can come from various sources that are not always local. Much of this data, especially in the area of social media, resides on cloud-based systems. Access to this type of data involves very different requirements than for locally available data, such as how the data can be transferred in a secure manner; encryption plays a major role here. Transfer times also often pose a problem with large data sets because of potentially high latencies.

Classic BI systems have a hard time implementing all the requirements just mentioned. The architecture of such systems usually comprises a variety of databases; ETL jobs move the data from one repository to the next, passing it through a wide variety of staging areas. In the past, these batch-based transformation processes worked well. Growing requirements are slowly putting an end to this method, however, and new procedures and processes are needed to handle large amounts of data securely, easily, and with low latency.

More Agile Applications Through Data Virtualization

The problems just described can be solved using new methods and processes. The key here is "real-time data integration" based on data virtualization. With this technology, a kind of virtual data hub is slotted between the various data sources and the applications that need to access the data. This hub provides transparent access to the data – no matter where it comes from or what interface provides it. The applications themselves also access the data via a uniform interface, such as Java Database Connectivity (JDBC) or a web service. The data can come from a database, a Hadoop cluster, an XML file, or virtually any other source (Figure 1). It no longer matters where the data comes from; data access is abstracted by the virtual data hub.

Figure 1: A data virtualization server abstracts access to different data sources.

The principle is similar to the use of metadirectories for authenticating users. Metadirectories also provide a unified view of different authentication sources, such as an LDAP server, Active Directory, or the like, thus ensuring that the user only sees a single interface for logging in to a system or an application. In data virtualization, the virtual data hub assumes the role of the metadirectory.

Data virtualization provides the great advantage that data is integrated when it is needed, without having to first copy it to a target database. This is a completely different approach from the previously described ETL process. A virtual database (VDB) that links the physical data sources with a specific "view" of the data they contain makes this possible. If an application needs a specific set of data, a corresponding view of the original data source is used. The data remains in situ and does not need to be copied first, which is a huge advantage over the ETL process, especially with big data.

Furthermore, with the approach presented here, developers only need to take care of the integration logic; they do not need to provision the transformed data in an additional database. This method not only saves time, it also means that fewer infrastructure services need to be provided. The data itself can, of course, still be modified, because access is to the actual sources and not to the results of a transformation. When developing the integration logic, data from similar sources can be collated in different ways.

For example, views can be produced that combine multiple, complexly interrelated database tables into a single table and make it available to the application. In this context, it is also important to mention that the integration logic can help unify different data formats. For example, data sets can contain phone numbers in different formats; using an appropriate model, these can be converted to a uniform format and are then available as part of the virtual database.
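To make this more concrete, consider what such integration logic might look like. In Teiid, view transformations are defined in SQL; the following sketch holds the transformation in a Java string purely for illustration. The source tables hr.Employees and fin.Expenses and their columns are hypothetical names, not part of any real model:

// Sketch of the transformation SQL a view model might contain.
// hr.Employees (SQL database) and fin.Expenses (Excel sheet) are
// hypothetical source tables used for illustration only.
public final class EmployeeExpenseView {

    // Joins two physical sources into one virtual table and
    // normalizes the phone number format on the fly; the data
    // stays in the sources, and only the view is computed.
    static final String TRANSFORMATION =
        "SELECT e.id, e.name, " +
        // Strip common separators so all numbers share one format
        "REPLACE(REPLACE(REPLACE(e.phone, '-', ''), ' ', ''), '/', '') AS phone, " +
        "x.amount, x.submitted " +
        "FROM hr.Employees AS e " +
        "JOIN fin.Expenses AS x ON x.employee_id = e.id";

    private EmployeeExpenseView() { }
}

An application that queries this view receives employees, their normalized phone numbers, and their expenses as a single table, even though the underlying data lives in two different systems.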

Finally, it should be noted that a virtualization hub can control access to the source data in a very granular way. The hub can be viewed in this context as a kind of data firewall and can thus help implement compliance requirements.

Data Integration Using JBoss Data Virtualization

The JBoss Data Virtualization software, formerly known as JBoss Enterprise Data Services (EDS), is currently the only open source software on the market that offers this form of data integration. In practice, it works as a virtualization hub upstream of the different data sources and provides applications with a uniform view of the data. To the applications, it looks as if the data comes from a single source. Figure 2 shows the different software components.

Figure 2: The JBoss Data Virtualization software consists of various components.

The JBoss Data Virtualization server runs as a process within the JBoss Enterprise Application Platform (EAP) and has several tasks: For one, it is responsible for managing the virtual databases (VDBs), which provide a uniform view of data from different sources. A VDB consists of source models, which describe the structure and properties of the actual source data, and view models, which define what the data made available to the applications looks like.

The server contains an access layer that determines how the VDBs can be accessed. The software provides JDBC, ODBC, or web services (SOAP/REST) interfaces. A query engine ensures optimal access to the individual data sources based on the existing source and view models. This happens through "Translators" and "Resource Adapters."
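From an application's perspective, a deployed VDB therefore behaves like an ordinary relational database. The following minimal sketch shows JDBC access to a VDB via the Teiid driver; the VDB name ExpenseVDB, the view name, and the credentials are placeholders, and port 31000 is the default Teiid JDBC port:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal JDBC client for a Teiid VDB. "ExpenseVDB", the view name,
// and the credentials are hypothetical placeholders.
public class VdbQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.teiid.jdbc.TeiidDriver");
        // Teiid JDBC URL format: jdbc:teiid:<vdb-name>@mm://<host>:<port>
        String url = "jdbc:teiid:ExpenseVDB@mm://localhost:31000";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement();
             // The virtual view is queried like an ordinary table,
             // regardless of where the underlying data actually lives.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, phone, amount FROM EmployeeExpenseView")) {
            while (rs.next()) {
                System.out.printf("%s %s %s%n",
                        rs.getString(1), rs.getString(2), rs.getBigDecimal(3));
            }
        }
    }
}

If a data source is later replaced – say, the Excel files are migrated to a database – this client code does not change; only the source model behind the view is adjusted.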

The Teiid Designer visual tool [1] allows users to create VDBs. The tool is available as a plugin for the JBoss Developer Studio graphical development environment. Furthermore, a Java API in the form of the Connector Development Kit can be used to adapt the Translators and Resource Adapters to the existing data sources.
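To give an idea of what such an adaptation involves, the following skeleton sketches a custom translator built on the Teiid translator API. The translator name myformat is made up, and a real translator would also override the execution methods for the query types it supports; treat this as an assumption-laden outline rather than a complete implementation:

import org.teiid.translator.ExecutionFactory;
import org.teiid.translator.Translator;
import org.teiid.translator.TranslatorException;

// Skeleton of a custom translator; "myformat" is a made-up name.
@Translator(name = "myformat", description = "Example custom translator")
public class MyFormatExecutionFactory extends ExecutionFactory<Object, Object> {

    @Override
    public void start() throws TranslatorException {
        super.start();
        // One-time initialization of the translator goes here.
    }

    @Override
    public boolean supportsCompareCriteriaEquals() {
        // Capability methods tell the query engine which SQL constructs
        // the source can handle; Teiid evaluates everything else itself.
        return true;
    }
}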

As well as these two core components, various administrative tools help manage the environment. For example, AdminShell provides command-line access to the JBoss Data Virtualization framework. Users of the JBoss Enterprise Application Platform will already be familiar with the management console for managing the application server.
