Legally compliant blockchain archiving

All Together Now

Weaknesses of Conventional Archiving

For legal reasons, a compliance archive system should maintain a complete and unchangeable inventory of all relevant raw data. As already mentioned, the archive represents the most secure and highest quality data store within a company. Because of technological limitations, however, this information has so far been of very limited use for evaluations or analyses. Classical archive systems are limited to simple search functions based on header data, because data analysis is not their technical focus, which explains the existence of analytics platforms that enable evaluations of subject-specific data.

However, the use of these systems is always caught between cost, speed, and data quality. As a rule, expensive in-memory approaches are applied to only a small part of the total data and for a very limited time window. Historical data is often aggregated or outsourced for cost and performance reasons, which reduces the accuracy of analyses over time because information is lost through compression techniques. Common NoSQL stores, on the other hand, provide inexpensive models for high-volume processing but do not guarantee ACID (atomicity, consistency, isolation, and durability) compliance when processing individual records. For analytical use cases, data is usually transferred in a batch-oriented "best-effort" mode, which affects both real-time processing capability and the quality of the data itself.

Data Processing in the Blockchain Archive

Because historical data is often aggregated or outsourced for cost and performance reasons, which reduces the accuracy of the information over time, one goal in the design of the Deepshore solution was to make the entire data depth of previously stored information available for analytical use cases. Cost-optimized NoSQL and MapReduce technologies are used without sacrificing the depth of information of the original archive-quality documents.

In the course of this process, the weaknesses outlined above are addressed: Current approaches manage large amounts of data either relationally, in memory, or in NoSQL stores and are thus always caught between quantity, quality, performance, and cost. The blockchain-based tool resolves this dilemma by methodically combining various services and using the maximum data quality of an archive for analytical purposes, without relying on reduced and inaccurate databases or expensive infrastructures.

By extending the upstream processing logic, the system can also parse the content of structured data when it is received and write the results to an Indexing Service database. This processing has nothing to do with establishing the archive status of the stored information, which means that the index layer can also be built at a later point in time, independent of the revision-proof storage of the raw data.
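
To illustrate, a minimal sketch of such upstream parsing might look as follows; the JSON format, the field names, and the shape of the index entry are assumptions made for illustration and are not part of the Deepshore product:

    # Hypothetical sketch: extract index fields from a structured (JSON) raw
    # document on receipt; the field names below are assumptions.
    import json

    def build_index_entry(raw_document: bytes) -> dict:
        """Parse a structured raw document and keep only the fields to be indexed."""
        record = json.loads(raw_document)
        return {
            "invoice_no": record.get("invoice_no"),
            "customer": record.get("customer"),
            "amount": record.get("amount"),
        }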

Most important, the Indexing Service is built from the raw data in its archived state and not beforehand, because only then is the extracted information guaranteed to rest on an unchangeable database. To achieve this state, the solution differs from traditional data lake or data warehouse (DWH) applications in two major respects. These paradigms are summarized in the common data environment (CDE) model and are explained below.

On the one hand, the Deepshore solution makes it obligatory to transfer data to the Indexing Service in a read-after-write procedure, ensuring that every record is read back after writing before the responsible process can assume that the write operation is correct and complete. The procedure could therefore look as follows (a minimal code sketch appears after the list):

  1. Parse the data.
  2. Set up the indexing structure.
  3. Write the structure to the Indexing Service (database).
  4. Read the indexing structure after completion of the write process.
  5. Compare the result of the read operation with the input supplied.
  6. Finish the transaction successfully, if equal; repeat the transaction, if not equal.
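
A minimal sketch of this read-after-write loop, assuming a simple put/get interface to the Indexing Service (the interface itself is an assumption, not the actual Deepshore API):

    # Sketch of the read-after-write procedure (steps 3-6 above); the
    # indexing_service object with put()/get() methods is an assumed interface.
    def index_with_read_after_write(indexing_service, key, index_structure,
                                    max_retries=3):
        for _ in range(max_retries):
            indexing_service.put(key, index_structure)   # step 3: write the structure
            stored = indexing_service.get(key)           # step 4: read it back
            if stored == index_structure:                # step 5: compare with input
                return                                   # step 6: transaction successful
        raise RuntimeError(f"read-after-write failed for key {key}")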

The read-after-write process also establishes a logical link between the raw data in the Storage Service/Verification Service and the Indexing Service. Each archive data record has a primary key that comprises a combination of a raw data hash (SHA256) and a UUID (per document) and is valid for all processes in the system (Figure 2). Thus, it is possible at any time to restore individual data records or even the entire database of the NoSQL store from the existing raw data in the archive store. Accompanying procedures, which cannot be outlined here, ensure that the database is always in a complete and unchangeable state.
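Forming such a primary key could be sketched as follows; the separator and encoding are assumptions made for illustration:

    # Sketch of the system-wide primary key: SHA256 hash of the raw data
    # combined with a per-document UUID; the exact layout is an assumption.
    import hashlib
    import uuid

    def make_primary_key(raw_document: bytes) -> str:
        doc_hash = hashlib.sha256(raw_document).hexdigest()
        return f"{doc_hash}:{uuid.uuid4()}"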

Figure 2: Determining the database hashes at two points in time protects the blockchain from unauthorized changes.

Manipulation Excluded

Once the desired index data has been stored completely and correctly in the Indexing Service, the only question that remains is how to prevent the database from being manipulated, whether deliberately or inadvertently. The second component of the CDE model serves this purpose. Consider the database at a time t, when n records have already been written. All n data records at time t are selected, and the result is converted into a hash value: the hash of all aggregated database entries in data slice 1 at time t. This hash value is stored in a block of the blockchain as a separate transaction and is therefore protected against change.
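
Computing such a data slice hash could look roughly as follows; the canonical serialization shown here is an assumption, since the article does not specify how the selected records are aggregated, and the verification_service call is an assumed interface:

    # Sketch of sealing data slice 1: select all n records present at time t,
    # serialize them canonically, and hash the result.
    import hashlib
    import json

    def hash_data_slice(records):
        # Canonical, order-independent serialization so the hash is reproducible
        canonical = json.dumps(sorted(json.dumps(r, sort_keys=True) for r in records))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # slice_hash = hash_data_slice(all_records_at_time_t)
    # verification_service.store_transaction(slice_hash)   # assumed interface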

In the next step, the procedure is repeated, whereby the run-time environment uses a random generator over a defined interval to determine exactly when time t+1 occurs; thus, t+1 cannot be predicted by a human being. Into the next data slice 2 (i.e., from t to t+1) fall the first new transaction n+1 and x further transactions, none of which are part of data slice 1 at time t. All new transaction data is then selected, and the result is converted into a second hash value, which is also stored in the blockchain (Verification Service).
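
A sketch of this randomized scheduling follows; the interval bounds and the select_between() query interface are chosen purely for illustration:

    # Sketch of determining the next cut-off t+1 at a random point within a
    # defined interval and sealing data slice 2.
    import random
    import time

    def wait_for_next_cutoff(min_seconds=300, max_seconds=3600):
        """Sleep a random amount of time so t+1 cannot be predicted in advance."""
        time.sleep(random.uniform(min_seconds, max_seconds))
        return time.time()   # this is t+1

    # t_plus_1 = wait_for_next_cutoff()
    # new_records = indexing_service.select_between(t, t_plus_1)   # assumed query
    # verification_service.store_transaction(hash_data_slice(new_records))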

Both hash values now reflect the exact state of the data at time t and of all data newly added after time t up to time t+1. The raw data in the database can therefore be compared at any time against the truth of the blockchain, so the Verification Service provides a secure view of the database. Such a check can also be carried out at random, one data slice at a time, in the background. If the comparison reveals a discrepancy, the affected data in the Indexing Service can be restored from the raw data in the Storage Service.
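
A background verification of a single data slice might then be sketched as follows, again with all service interfaces assumed for illustration:

    # Sketch of the background check: recompute the hash of one data slice,
    # compare it with the value anchored in the blockchain, and rebuild the
    # slice from the archived raw data on a mismatch; all interfaces assumed.
    def verify_slice(indexing_service, verification_service, storage_service, slice_id):
        records = indexing_service.select_slice(slice_id)
        current_hash = hash_data_slice(records)
        anchored_hash = verification_service.get_slice_hash(slice_id)
        if current_hash == anchored_hash:
            return True
        indexing_service.rebuild_slice(slice_id,
                                       storage_service.read_raw_data(slice_id))
        return False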
