Let the hardware do the heavy lifting
Iron Power
Ethernet has been scaling faster recently. Whereas each successive Ethernet speed bump used to take seven to 10 years, it now happens in two to three years. Furthermore, the migration to flash and persistent memory is proving to be the killer app for this increase in Ethernet speed. Today, it takes only three NVMe SSDs to saturate a 100Gbps link – something that was unthinkable just three years ago. In the midst of all of this, processors have had a difficult time keeping pace.
As a result, system performance and cost can benefit from offloading some repetitive, well-defined, CPU-intensive data path tasks into dedicated hardware. This trend will only intensify with the arrival of 200/400-gigabit Ethernet (GbE) in 2019, and 800GbE shortly after that; hence, the window for CPU offload will remain open for the foreseeable future. Hardware assistance has the additional benefit of lowering system costs considerably (every 10% increase in CPU utilization adds roughly $500 in acquisition cost) and allows the user to trade CPU cost against other components or against the scale of an installation with no net loss in performance.
Network adapters these days do much more than Ethernet processing and have become more feature rich with each passing generation. What started as transport layer checksum offload, large send offload (LSO), and large receive offload (LRO) has evolved into the integration of well-known protocols, such as remote direct memory access (RDMA), Internet small computer systems interface (iSCSI), transport layer security (TLS), and so on. As the roadmaps for network processing units (NPUs) and network adapters converge (you can now find a number of ARM cores in most network adapters), the opportunity for hardware assist only grows. Dedicated hardware also reduces the system's overall power consumption by making it possible to use smaller CPUs.
It is important to note that hardware does not substitute for a software implementation; a software implementation is always required, because it provides a second-source option to the hardware-assist silicon for enterprise customers – a requirement before any new technology is adopted. Hardware assist is therefore a performance option for the segment of the market that wants extreme performance or lower cost, and it must coexist with a software implementation.
Why TLS?
TLS has gained enormous importance in Internet security and is one of the most commonly used security protocols for HTTPS, web browsers, FTP, SMTP, and content delivery networks. TLS combines symmetric and asymmetric cryptographic procedures, as well as mechanisms for verifying the authenticity and integrity of data streams and messages. A secure and reliable session is established between client and server over TCP (Figure 1a).
The TLS session begins with a handshake to negotiate the cipher suite and generate a master secret. The cipher suite is the negotiated combination of authentication, encryption, and message authentication algorithms that defines the security settings. Secure transmission is achieved by encryption algorithms, such as the Advanced Encryption Standard (AES). Symmetric encryption is used to encrypt the transmitted data with a key exchanged between sender and receiver. Each key is valid for only one connection, and only the communicating parties who hold the key can access the data.
Data integrity is maintained by a message authentication code (MAC). The MAC can be computed and verified only by senders and recipients who hold the key, ensuring that data arriving from a source that holds the key has not been manipulated in transit. Thus, TLS involves the CPU-intensive operations of data encryption, decryption, authentication, and network layer processing.
A hardware cryptographic accelerator, trusted platform module (TPM), or network adapter with a built-in crypto engine is commonly used these days. These accelerators have the required crypto algorithms implemented in hardware, and their respective drivers register with the operating system (OS). Applications calling crypto APIs use the appropriate transform object to dispatch the operation to hardware, where data is encrypted, decrypted, or authenticated and then returned to the host. The server CPU is thus relieved of compute-intensive operations.
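On Linux, for example, one path to such an accelerator is the kernel's AF_ALG socket interface: the kernel instantiates the requested transform and routes it to a hardware driver when one has registered with a higher priority than the software implementation. A minimal sketch, with demo key and IV values, encrypts a single AES-CBC block this way:

#include <linux/if_alg.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    /* Ask the kernel crypto API for an AES-CBC transform. */
    struct sockaddr_alg sa = {
        .salg_family = AF_ALG,
        .salg_type   = "skcipher",
        .salg_name   = "cbc(aes)",
    };
    unsigned char key[16] = "0123456789abcdef";  /* demo key only */
    unsigned char pt[16]  = "plaintext block!";
    unsigned char ct[16];

    int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0); /* transform handle */
    bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
    setsockopt(tfm, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
    int op = accept(tfm, NULL, 0);               /* operation handle */

    /* Direction and IV travel as ancillary data with the plain text. */
    union {
        char buf[CMSG_SPACE(4) + CMSG_SPACE(sizeof(struct af_alg_iv) + 16)];
        struct cmsghdr align;
    } u = { 0 };
    struct iovec iov = { .iov_base = pt, .iov_len = sizeof(pt) };
    struct msghdr msg = { .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
                          .msg_iov = &iov, .msg_iovlen = 1 };

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_ALG;
    cm->cmsg_type  = ALG_SET_OP;
    cm->cmsg_len   = CMSG_LEN(4);
    *(unsigned int *)CMSG_DATA(cm) = ALG_OP_ENCRYPT;

    cm = CMSG_NXTHDR(&msg, cm);
    cm->cmsg_level = SOL_ALG;
    cm->cmsg_type  = ALG_SET_IV;
    cm->cmsg_len   = CMSG_LEN(sizeof(struct af_alg_iv) + 16);
    struct af_alg_iv *ivp = (struct af_alg_iv *)CMSG_DATA(cm);
    ivp->ivlen = 16;
    memset(ivp->iv, 0, 16);                      /* demo IV only */

    sendmsg(op, &msg, 0);                        /* plain text in ... */
    read(op, ct, sizeof(ct));                    /* ... cipher text out */
    close(op);
    close(tfm);
    return 0;
}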
The effect of this data loopback over the network adapter, however, is a longer data path. Figure 1b shows the data path in coprocessor mode as used by the crypto accelerator. An extra pass over the PCIe bus adds latency to packet processing and reduces the total throughput capability. Experiments in our lab and observations of TLS using a coprocessor and crypto accelerator show that, although CPU cycles are saved, application throughput is limited by packet loopback time and PCIe bandwidth. This analysis provides the motivation to avoid loopback and implement the entire TLS processing in hardware. The coprocessor mode of encryption acceleration is thus better suited to traffic that is not intended to go on the wire (e.g., data-at-rest encryption).
Inline TLS
Inline TLS/SSL is a method of offloading intensive encryption processing from a system running an application onto a network interface controller (NIC) [1]. This NIC is equipped with TLS/SSL cryptographic capability and the ability to encrypt and/or compute MACs inline, as well as send the payload. This capability increases the efficiency of the TLS/SSL encryption process by performing the encryption in line with the sending of data, thereby cutting the memory bandwidth required to perform the TLS/SSL send operation by one third, significantly lowering the associated latency [2], and reducing PCIe bus traversals to the minimum required. Two thirds of all Internet traffic is estimated to involve media streaming (e.g., Netflix, Amazon Prime, YouTube), and these streams are increasingly being encrypted.
Applications using TLS start by setting up a TCP connection, and the OS-provided socket APIs push the state to the NIC. The hardware-assisted TLS socket descriptor has a one-to-one association with the TLS session. Application packets pass through the inline driver unencrypted and are processed by the TLS engine. Similarly, received packets decrypted by the TLS engine are handed over to the driver. For ease of implementation, the host can use any standard TLS implementation, such as OpenSSL, to handle the TLS handshake. Handshake messages are exchanged over the connection, and symmetric keys are calculated at either end of the TLS session for crypto operations.
All data within a TLS session is framed by the well-defined TLS record protocol. Encryption is done on a per-record basis; however, several messages of the same type can be placed together in the same record. A change cipher spec (CCS) protocol message modifies the encryption settings by activating the newly negotiated cipher keys; hence, a new record must begin immediately afterward, so the new settings are used. The encryption settings program the cipher keys on hardware, with unique keys for the transmit and receive paths. The Finished message is a new record sent immediately after the CCS message, and the new key can be used for its encryption and authentication. Once the key is programmed and the offload engine is enabled, data and future handshake records over the same session are encrypted and decrypted by the TLS engine (Figure 2).
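On Linux, this key-programming step is exposed through the kernel TLS (kTLS) socket interface, which inline-TLS-capable adapters can hook into. The following minimal sketch assumes a TLS 1.2 AES-128-GCM session whose handshake has already completed in user space (e.g., in OpenSSL); the key material arguments come from that completed handshake:

#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_TLS
#define SOL_TLS 282   /* not yet defined by all libc headers */
#endif

/* Hand the negotiated transmit state to the kernel (and, through it,
 * to an inline-TLS-capable NIC). Returns 0 on success. */
int program_tx_key(int sock, const unsigned char key[16],
                   const unsigned char salt[4], const unsigned char iv[8],
                   const unsigned char seq[8])
{
    struct tls12_crypto_info_aes_gcm_128 ci = { 0 };

    ci.info.version     = TLS_1_2_VERSION;
    ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci.key,     key,  TLS_CIPHER_AES_GCM_128_KEY_SIZE);
    memcpy(ci.salt,    salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
    memcpy(ci.iv,      iv,   TLS_CIPHER_AES_GCM_128_IV_SIZE);
    memcpy(ci.rec_seq, seq,  TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

    /* Attach the TLS upper-layer protocol to the TCP socket, then
     * push the transmit key; the receive key is programmed the same
     * way with TLS_RX. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;
    return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
}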
Packet Flow
In this section, I delve into packet processing on the transmit side in more detail. An application processor initiates and terminates the TCP connection, and a TLS session is established, with the handshake handled by the application layer. The TLS implementation in the host maintains the complete protocol state machine. Session keys are calculated according to Diffie-Hellman (DH)/elliptic-curve DH (ECDH)/RSA and are programmed into memory accessible to the TLS engine for record encryption and decryption.
The TLS implementation in the host breaks the application data, or plain text, into records no larger than the maximum fragment size (MFS), each categorized as a standard-defined record type (handshake, alert, CCS, or application data) (Figure 3). After the symmetric keys are programmed, inline TLS functionality is enabled; henceforth, the TLS layer bypasses all crypto operations, and plain text is sent to the TLS driver. Each record is protected by a 64-bit sequence number to guard against replay attacks and out-of-order packets and to provide data origin authentication and integrity protection with a hash-based MAC (HMAC) algorithm.
The sequence number is initialized at the start of a new connection and incremented for each subsequent protocol data unit (PDU). An initialization vector is associated with each record. The driver uses the record length to determine the number of records it can buffer to create one segment. A segment is typically larger than a record (whose length defaults to the 16KB maximum) and lets multiple records be sent in one crypto request to the hardware, saving calls over the PCIe interface.
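For an AEAD cipher such as AES-GCM (described below), the per-record IV, or nonce, is typically formed from the connection's implicit 4-byte salt plus an 8-byte explicit value, for which the record sequence number is a common choice. A minimal sketch:

#include <stdint.h>
#include <string.h>

/* Build the 12-byte AES-GCM nonce for one TLS 1.2 record from the
 * 4-byte implicit salt and the 64-bit record sequence number used as
 * the explicit IV (a common choice; any unique value works). */
static void make_record_nonce(uint8_t nonce[12], const uint8_t salt[4],
                              uint64_t seq)
{
    memcpy(nonce, salt, 4);
    for (int i = 0; i < 8; i++)             /* big-endian sequence */
        nonce[4 + i] = (uint8_t)(seq >> (56 - 8 * i));
}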
The crypto request, comprising the record addresses in a scatter-gather list (SGL), the record length, and the per-record initialization vector (IV), is wrapped in a format understood by the hardware. The TLS request also carries the negotiated cipher key length and the maximum fragment size. The TLS engine supports multiple cipher suites, such as AES-CBC, AES-GCM, and AES-CCM, with multiple key sizes (128, 192, and 256 bits). See the "Choosing GCM and Performance Run" box for the reasons GCM was chosen.
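The exact request layout is specific to the adapter's firmware interface, but a transmit request carrying the fields just described might look roughly like this hypothetical sketch (field names are illustrative, not an actual adapter interface):

#include <stdint.h>

struct tls_sgl_entry {
    uint64_t dma_addr;            /* DMA address of one record fragment */
    uint32_t len;                 /* fragment length in bytes */
};

struct tls_tx_request {
    struct tls_sgl_entry sgl[8];  /* scatter-gather list of records */
    uint8_t  num_sgl;             /* valid SGL entries */
    uint16_t record_len;          /* plain text record length */
    uint8_t  iv[8];               /* per-record explicit IV */
    uint8_t  key_len;             /* 16, 24, or 32 bytes (AES-128/192/256) */
    uint16_t max_frag_size;       /* negotiated maximum fragment size */
};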
Choosing GCM and Performance Run
Galois/counter mode (GCM) is a symmetric-key block cipher mode of operation widely adopted for its efficiency and performance. Hence, using Advanced Encryption Standard (AES)-GCM as a transport layer security cipher is highly recommended. GCM provides both confidentiality and data origin authentication. The encryption and decryption operations are parallelizable and take fewer cycles. Message authentication occurs on the cipher text, so GCM is preferred over other authenticated encryption with associated data (AEAD) modes, such as CCM. GCM is implemented efficiently in hardware and integrated into the TLS engine.
In a few experiments with two machines connected back-to-back with 100Gbps (100G) network adapters and OpenSSL, our lab ran s_time on the client and s_server on the server, with and without inline TLS enabled, using the AES256-GCM-SHA384 cipher suite. The numbers we observed were astounding: The 100G network adapter with inline TLS delivered 98Gbps of Tx and Rx at 20% CPU usage, whereas the same test without inline TLS delivered 70Gbps at greater than 90% CPU usage. Inline encryption and decryption on both ends of the session reduced overall packet latency and saved valuable CPU cycles, thereby increasing application throughput. These findings validated the benefits of inline TLS for general use and demonstrated significant improvements over current implementations.
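A run along these lines can be reproduced with the stock OpenSSL tools; the address, port, key, certificate, and served file below are placeholders:

openssl s_server -accept 443 -key server.key -cert server.crt -WWW -cipher AES256-GCM-SHA384

openssl s_time -connect 192.168.1.1:443 -www /testfile -time 30 -cipher AES256-GCM-SHA384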
The TLS engine obtains the record data from the host by direct memory access and processes the data record-by-record. The transmit key associated with the TLS session is read from the key store, and the packet is encrypted using GCM. For GCM mode, GHASH is the function used for MAC calculation and uses the sequence number as additional authenticated data for replay protection. The calculated authentication tag is appended to the payload. A 5-byte TLS header, comprising the record type, protocol version, and record length, is added to the resulting encrypted text, making the PDU ready for transmission.
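In software terms, the per-record operation just described is equivalent to the following sketch built on OpenSSL's EVP interface (error checks and buffer sizing omitted for brevity):

#include <openssl/evp.h>

/* Software model of one record: AES-256-GCM with the sequence number
 * and 5-byte record header as additional authenticated data (AAD) and
 * the 16-byte authentication tag appended to the cipher text. Returns
 * the number of bytes written to out (cipher text plus tag). */
int encrypt_record(const unsigned char key[32], const unsigned char iv[12],
                   const unsigned char *aad, int aad_len,
                   const unsigned char *pt, int pt_len,
                   unsigned char *out)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len, ct_len;

    /* EVP's default GCM nonce length is 12 bytes, matching the
     * 4-byte salt + 8-byte explicit IV of TLS 1.2. */
    EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);

    /* AAD is authenticated but not encrypted. */
    EVP_EncryptUpdate(ctx, NULL, &len, aad, aad_len);

    EVP_EncryptUpdate(ctx, out, &len, pt, pt_len);   /* cipher text */
    ct_len = len;
    EVP_EncryptFinal_ex(ctx, out + ct_len, &len);
    ct_len += len;

    /* Append the authentication tag to the payload. */
    EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, out + ct_len);
    EVP_CIPHER_CTX_free(ctx);
    return ct_len + 16;   /* ready to be framed with the TLS header */
}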
The TLS engine forwards the PDU to the lower layer hardware-assisted TCP (Figure 4), which is not aware of any TLS processing and treats the encrypted data like any other protocol data. TCP provides reliable data transmission and operates in store-and-forward mode. The PDU is stored for retransmission and removed from memory when an ACK is received.
On the receive side, TCP parses the incoming data stream for a 5-byte TLS header to determine the record length and record type. One complete record is extracted and forwarded to the TLS engine. Handshake records are sent to the application processor for state machine processing, the receive key is programmed at the end of CCS exchange, and the TLS engine is enabled. The receive key associated with the incoming data stream connection is used for decryption and HMAC integrity verification. A plain text record, along with the TLS header, is copied to a user buffer and consumed by the application (Figure 5).
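The record framing itself is simple; this sketch mirrors what the receive logic does before a complete record can be handed to the TLS engine:

#include <stddef.h>
#include <stdint.h>

struct tls_hdr {
    uint8_t  type;       /* handshake, alert, CCS, or application data */
    uint16_t version;    /* e.g., 0x0303 for TLS 1.2 */
    uint16_t length;     /* record payload length */
};

/* Parse the 5-byte TLS header at the front of the buffered byte
 * stream. Returns the total bytes of one complete record (header plus
 * payload), or 0 if more data must be buffered first. */
size_t extract_record(const uint8_t *buf, size_t avail, struct tls_hdr *h)
{
    if (avail < 5)
        return 0;                        /* header incomplete */
    h->type    = buf[0];
    h->version = (uint16_t)(buf[1] << 8 | buf[2]);
    h->length  = (uint16_t)(buf[3] << 8 | buf[4]);
    if (avail < 5u + h->length)
        return 0;                        /* record body incomplete */
    return 5u + h->length;               /* one complete record */
}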
If the receive keys are not yet programmed when application data arrives, the encrypted PDU is forwarded, untouched, to the application processor and decrypted by the host. The interval between receipt of the CCS and Finished messages is the key programming phase, and incoming packets may be buffered until the TLS engine is enabled. MAC or decryption errors are sent to the application processor, where the TLS state machine raises an alert and resets the connection.