Hindawi, Security and Communication Networks, Volume 2018, Article ID 7631342, 24 pages, https://doi.org/10.1155/2018/7631342

Research Article

Hardware/Software Adaptive Cryptographic Acceleration for Big Data Processing

Chunhua Xiao (1,2), Lei Zhang (1), Yuhua Xie (1), Weichen Liu (1,2), and Duo Liu (1,2)

(1) Department of Computer Science, Chongqing University, Chongqing 400044, China
(2) Key Laboratory of Dependable Service Computing in Cyber Physical Society of Ministry of Education, Chongqing 400044, China

Correspondence should be addressed to Chunhua Xiao; xiaochunhua@cqu.edu.cn

Received 9 February 2018; Revised 23 July 2018; Accepted 6 August 2018; Published 27 August 2018

Academic Editor: Jun Zhou

Copyright © 2018 Chunhua Xiao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Along with the explosive growth of network data, security is becoming increasingly important for web transactions. The SSL/TLS protocol has been widely adopted as one of the effective solutions for sensitive access. Although OpenSSL provides a freely available implementation of the SSL/TLS protocol, its crypto functions, such as symmetric key ciphers, are extremely compute-intensive operations. These expensive computations, when implemented in software, may not be able to keep up with the increasing need for speed and secure connections. Although there are many excellent works with the objective of SSL/TLS hardware acceleration, they focus on the dedicated hardware design of accelerators. Hardly any of them present how to utilize the accelerators efficiently. In fact, for some application scenarios, the performance improvement may not be comparable with that of AES-NI, due to the invocation cost induced by hardware engines. Therefore, we propose to take full advantage of both accelerators and CPUs for secure HTTP access in big data. We not only propose optimization strategies such as data aggregation to increase the contribution of hardware crypto engines, but also present an Adaptive Crypto System based on Accelerators (ACSA) with software and hardware codesign. ACSA is able to adopt the crypto mode adaptively and dynamically according to the request characteristics and system load. By establishing 40 Gbps networking on a TAISHAN Web Server, we evaluated the system performance in real applications with a high workload. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtain about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtain a 52.39% bandwidth improvement compared with hardware-only encryption and a 20.07% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, to free CPUs according to the working requirements.

1. Introduction

As we enter the big data era, network data demonstrates explosive growth [1, 2]. More and more transactions, such as e-commerce and net-banking, require the transfer of sensitive information via the Internet, and security is becoming more and more important for web applications [3, 4]. The Secure Sockets Layer (SSL) protocol and its successor, Transport Layer Security (TLS), can be used to secure applications that communicate over a network and are currently the predominant protocols for sensitive accesses. The most widely deployed, freely available implementation of the SSL/TLS protocol is the OpenSSL library [5]. The core library of OpenSSL implements basic cryptographic functions and provides various utility functions. However, its crypto functions, such as symmetric key ciphers and hash algorithms, are extremely compute-intensive operations [6, 7]. OpenSSL performs these expensive computations in software, and it may not be able to keep up with the increasing need for speed and secure connections for web services [8]. Among the many solutions pursued to overcome this problem, one is to apply hardware accelerators to perform the cryptographic (crypto) operations. Implemented in hardware, these crypto accelerators are tamper-proof and difficult to clone, thus providing an added security bonus [9].

In the literature, many works have been published with the objective of SSL/TLS hardware acceleration [10–16]. Some of them focus on the dedicated hardware design of accelerators.


For example, work in [10] implemented AES acceleration for embedded systems. Work in [11] presented a crypto hash SHA-2 logic core with reconfigurable hardware. Research in [12] accelerated elliptic curve cryptography through a very cheap FPGA. In [13], the cipher functions used in the SSL-driven connection, including Scalable Encryption Algorithm (SEA), Message Digest Algorithm (MD5), and Secure Hash Algorithm (SHA2), are accelerated in the VLSI Cryptosystem through FPGA. Others mount all processes for SSL/TLS ciphered communication into a single FPGA or ASIC, such as the works in [14–16]. These works showed a great performance improvement compared with software encryption using a crypto library. Nonetheless, these studies concentrated on the implementation of the hardware itself and hardly addressed how to utilize crypto accelerators efficiently with the least cost, let alone a design methodology for taking full advantage of both accelerators and CPUs. Furthermore, most hardware accelerations for SSL/TLS are designed for embedded systems [10–13], which cannot satisfy the high-volume and high-concurrency access requirements of the big data age.

Although a lot of work has been done on hardware acceleration, how to utilize accelerators efficiently in the new application scenario of big data is still an open problem. Besides, with AES-NI (the Intel Advanced Encryption Standard New Instructions), software crypto computations can also be accelerated, providing performance comparable with accelerators for web applications with small data requirements [17]. Therefore, we propose to take full advantage of both accelerators and CPUs to ensure secure HTTP access in big data. First, we implemented a prototype system with multiple CPUs and encryption accelerators, on the basis of which the software and hardware computing overheads are analyzed in detail. Then, we proposed interrupt aggregation and data aggregation to reduce the cost of hardware invocation. Finally, we put forward adaptive scheduling and MM to optimize the utilization of both software and hardware crypto engines. The main contributions of this work are as follows:

(i) To the best of our knowledge, this is the first Adaptive Crypto System based on Accelerators (ACSA) with software and hardware codesign, which is able to adopt the crypto mode adaptively and dynamically according to the request characteristics and system load. The problems of interrupt optimization, data aggregation, and adaptive scheduling for rational resource allocation are also carefully considered.

(ii) We improved the kernel for ACSA to address the additional overhead caused by hardware acceleration. A dynamic interrupt aggregation mechanism is added to the kernel to reduce the system cost induced by the high frequency of hardware interrupts.

(iii) We proposed a resource allocation strategy with the MM algorithm (maximize utilization with minimal overhead) and adaptive scheduling to maximize the overall encryption bandwidth. The designer/user can find the most reasonable process scheduling strategy through MM for heterogeneous encryption platforms, while adaptive scheduling lets hardware and software encryption each play to its respective advantages.

(iv) Through the establishment of 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtain about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtain a 52.39% bandwidth improvement compared with hardware-only encryption and a 20.07% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

(v) The proposed design methodology is, to some extent, general. The design flow of ACSA and the MM algorithm is applicable to other similar designs with hardware acceleration for secure web access. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration, providing emerging solutions for data centers beyond X86-based architectures.

The remainder of this paper is organized as follows. Section 2 discusses the related works. Section 3 shows the system architecture and design methodology of ACSA; crypto algorithm adaption and interrupt aggregation are also introduced in this section. Sections 4 and 5 discuss the methodology we proposed for adaptive scheduling and optimal resource allocation. We present the evaluation method and experimental results in Section 6. Finally, we conclude our discussion in Section 7.

2. Related Work

Many effective works have been proposed for accelerating SSL/TLS processing. We divide them into three categories according to the hardware implementation.

The first one focuses on the acceleration of a specific algorithm in the SSL/TLS process. Notable among them are [10–12]. In [10], Praveen Kumar B et al. designed and verified a working AES crypto Verilog core, which runs on top of an embedded Linux distribution, uClinux. The OpenSSL library has been ported and cross-compiled to work with only AES, which processes data blocks of 128 bits using a cipher key of length 128, 192, or 256 bits. Work in [11] presented the implementation of a crypto hash SHA-2 logic core in reconfigurable hardware, and a throughput of 644 Mbits/sec could be achieved by this design. In another work [12], the researchers demonstrated a reconfigurable hardware accelerator for OpenSSL's implementation of ECC and showed how a low-cost hardware platform is sufficient to double performance.

The second kind of acceleration research further integrates typical cipher algorithms in a single microchip. For example, Mohamed Khalil-Hani in [18] integrates the AES-256, SHA-1, SHA-2, RNG, and RSA-2048 cryptographic hardware cores into one FPGA microchip for an embedded system. The work in [19] developed a hardware prototype of a cryptographic processor in FPGA technology. In [13], the cipher functions used in SSL-driven connections, including Scalable Encryption Algorithm (SEA), Message Digest Algorithm (MD5), and Secure Hash Algorithm (SHA2), are accelerated in the VLSI Cryptosystem through FPGA.

The third type mounts all processes for SSL/TLS ciphered communication into one ASIC, which is usually denoted as a Network Security Processor (NSP). An NSP performs various cryptographic operations specified by network security protocols and helps to offload the computation-intensive burdens from Network Processors (NPs). The work in [14] presents a security processor to accelerate cryptographic processing in modern security applications, which is capable of popular cryptographic functions such as RSA, AES, hashing, and random number generation. Research in [15] shows the 10 Gbps implementation of a low-power SSL/TLS accelerator on a 65 nm FPGA. The usage of FPGA/ASIC enables highly efficient processing and low power consumption by using parallel optimization and pipelined processing. Work in [16] proposes a high-performance NSP, Zodiac, intended for both IPsec and SSL protocol acceleration, which is synthesized with a 0.18 μm CMOS technology.

There is no doubt that these works make effective efforts toward accelerating secure HTTP access. However, they concentrated more on the hardware implementation itself and hardly addressed how to utilize crypto accelerators efficiently with the least cost. To the best of our knowledge, there is no clear performance comparison between hardware accelerators and AES-NI, let alone a design methodology for taking full advantage of both accelerators and CPUs. Furthermore, most works, especially types 1 and 2 mentioned above, focus on embedded applications, which cannot satisfy the high-volume and high-concurrency access requirements of the big data age. Although type 3 can achieve higher throughput, on the order of gigabits per second, through application-specific design, it lacks generality and flexibility compared with other solutions. Moreover, it takes extensive design effort for the ASIC or prototype board.

Therefore, we first surveyed and analyzed different working flows with SSL/TLS and clarified the differences in working mode and performance between hardware accelerators and AES-NI. Based on the exploration of existing works, we proposed the Adaptive Crypto System based on Accelerators (ACSA) with software and hardware codesign, which is able to adopt the crypto mode adaptively and dynamically according to the request characteristics and system load. The problems of interrupt optimization, data aggregation, and adaptive scheduling for rational resource allocation are also carefully considered. Through evaluation with a high workload on various benchmarks and system configurations, we obtain an overall performance improvement compared with AES-NI-only or hardware-accelerator-only solutions. Furthermore, the proposed design methodology is applicable to other similar designs to build a highly resource-efficient system for secure HTTP access.

3. Adaptive Crypto System with Accelerators

In the big data age, Web Server applications exhibit the remarkable characteristics of high concurrency and mass data. To accelerate the encryption/decryption process for these applications, we proposed the Adaptive Crypto System based on Accelerators for secure HTTPS access. For clarity, we denote this system as ACSA. ACSA integrates multiple processing cores and crypto accelerators. The processing cores are ARM cores in this research. The reason for using ARM cores is to enable further exploration, such as energy efficiency, beyond the X86 architecture [20–22]. Although the server architecture is different, the proposed design methodology is applicable to other similar acceleration systems.

Different from traditional cryptosystems, the encryption process in ACSA is dynamically adjustable between hardware engines and CPUs. An Adaptive Scheduler is responsible for working mode selection. This scheduler chooses a reasonable encryption path according to the request characteristics and system resources. Data request aggregation and the MM strategy are considered carefully in our design to make full use of both software and hardware computing engines. To ensure versatility, the Adaptive Scheduler also allows the user to choose the working mode flexibly.

We also considered quality of service. We implemented a fault detection module to guarantee the reliability of hardware crypto. This detection module works whenever hardware crypto computing is adopted. If the hardware encryption engine breaks down and cannot provide the crypto function properly, the fault detection module sends a fault signal to the mode-switching module. At this point, the encryption tasks in progress are interrupted, and the working mode changes seamlessly from hardware encryption to software computing. The interrupted jobs are reexecuted by the software encryption units, thereby ensuring that clients obtain the requested data correctly.
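The fallback behaviour described above can be sketched as follows. This is only an illustrative model, not the ACSA implementation: `HardwareFault`, `FlakyEngine`, `toy_cipher`, and `encrypt_with_fallback` are hypothetical names, and the XOR transform merely stands in for a real cipher.

```python
class HardwareFault(Exception):
    """Hypothetical fault signal raised when the hardware engine breaks down."""


def toy_cipher(data: bytes) -> bytes:
    # Stand-in for a real cipher so the sketch is self-contained (NOT secure).
    return bytes(b ^ 0x5A for b in data)


class FlakyEngine:
    """Hypothetical hardware engine that fails from a given call onwards."""

    def __init__(self, fail_on_call: int):
        self.calls = 0
        self.fail_on_call = fail_on_call

    def __call__(self, data: bytes) -> bytes:
        self.calls += 1
        if self.calls >= self.fail_on_call:
            raise HardwareFault("engine down")
        return toy_cipher(data)


def encrypt_with_fallback(tasks, hw_encrypt, sw_encrypt):
    """Encrypt on hardware; after a fault, redo the failed task and all
    remaining tasks in software so every request still gets a result."""
    results = {}
    use_hardware = True
    for task_id, data in tasks:
        if use_hardware:
            try:
                results[task_id] = hw_encrypt(data)
                continue
            except HardwareFault:
                # Fault signal received: switch mode once, stay on software.
                use_hardware = False
        # Software path: also re-executes the task interrupted by the fault.
        results[task_id] = sw_encrypt(data)
    return results
```

In this model the caller never observes the fault: the task that was in progress when the engine failed is simply re-executed by the software path.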

To make full use of hardware accelerators, ACSA works in half-sync/half-async mode [23]. As shown in Figure 1, the whole system is divided into a synchronous working level (the upper part) and an asynchronous level (the lower part). The two parts exchange information through a synchronous/asynchronous communication layer. In the upper part, the Web Server accepts requests from different clients, generates multiple crypto tasks simultaneously, and then sends these tasks to the synchronous/asynchronous communication layer. Through this synchronous process, ACSA calls the SSL/TLS library to establish secure communications. In the lower part, the hardware driver receives crypto tasks from the communication layer and sends them to the hardware engines for encryption. When the crypto computing finishes, the hardware engine notifies the driver through an interrupt and then sends the ciphertext back to the sync/async layer via callback functions, so as to awaken the waiting process.
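A minimal user-space sketch of the half-sync/half-async pattern may help make the division of labour concrete. It is an analogy built on Python threads, not the kernel-level mechanism of ACSA: the queue plays the role of the communication layer, `engine_loop` stands in for the driver plus hardware engines, and all names are hypothetical.

```python
import queue
import threading


class Task:
    def __init__(self, data: bytes):
        self.data = data
        self.result = None
        self.done = threading.Event()  # the submitting process sleeps on this


def engine_loop(task_queue: queue.Queue) -> None:
    """Asynchronous level: take tasks, 'encrypt' them, fire the callback."""
    while True:
        task = task_queue.get()
        if task is None:  # sentinel used to shut the engine down
            break
        task.result = bytes(b ^ 0x5A for b in task.data)  # toy transform
        task.done.set()  # callback: awake the waiting process


def crypto_process(task_queue: queue.Queue, data: bytes) -> bytes:
    """Synchronous level: send a task, sleep, then return the result."""
    task = Task(data)
    task_queue.put(task)
    task.done.wait()
    return task.result


def run_demo(payloads):
    q = queue.Queue()  # stands in for the sync/async communication layer
    engine = threading.Thread(target=engine_loop, args=(q,))
    engine.start()
    results = [None] * len(payloads)

    def worker(i: int) -> None:
        results[i] = crypto_process(q, payloads[i])

    workers = [threading.Thread(target=worker, args=(i,))
               for i in range(len(payloads))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    q.put(None)
    engine.join()
    return results
```

Each synchronous worker keeps simple blocking semantics, while the engine side is free to process tasks in whatever order they arrive.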


Figure 1: Logical architecture of ACSA. (Diagram labels: Web Server, generate multiple crypto tasks; Crypto Processes, send tasks and sleep; Synchronous Level; Synchronous/Asynchronous Communication Layer, send tasks, callback functions, awake the waiting process; Asynchronous Level; HAC Driver; Hardware Crypto Engines; interrupt.)

Figure 2: System architecture of ACSA. (Diagram labels: CPU0 to CPU15; processing queues PQ0 to PQ15; Register; Controller; HAC0 to HAC9.)

3.1. System Architecture of ACSA. We chose a lightweight Web Server, TAISHAN, as our hardware platform. We enabled 16 ARM cores and 10 HACs (Hardware Acceleration Components) for our research. The ARM cores are Cortex-series CPUs. The HACs are configured as security accelerators in ACSA and support typical algorithms such as AES-CBC [24], AES-GCM [25], 3DES [26], and SHA1 [26, 27]. Sixteen processing queues (PQs) are responsible for software and hardware interaction, and each queue can receive multiple requests issued by the engine driver. The software (hardware driver) pushes tasks to the crypto engines through the PQs to carry out the required acceleration tasks. Each PQ can be independently enabled, and its starting address and length can be configured individually through registers. The designer/user is able to configure the correspondence between PQs and CPUs flexibly. As shown in Figure 2, the CPU uses a write pointer to signal pending tasks in the processing queue. The controller fetches an entry from a processing queue according to the result of WRR (Weighted Round Robin) arbitration, then parses the task and invokes HACs for execution. Once the computing finishes, an interrupt is generated to inform the upper layer that results can be returned.
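The text does not spell out how the controller's WRR arbitration is configured, so the following is only a sketch of one simple weighted round robin variant over several processing queues. The weights and the function name `wrr_dispatch` are assumptions made for illustration.

```python
from collections import deque


def wrr_dispatch(queues, weights):
    """Yield (queue_index, task) pairs in weighted round-robin order.

    In each round, queue i may hand over at most weights[i] entries
    before the arbiter moves on to the next queue.
    """
    while any(queues):  # a deque is truthy while it still holds entries
        for i, q in enumerate(queues):
            for _ in range(weights[i]):
                if not q:
                    break
                yield i, q.popleft()


if __name__ == "__main__":
    pqs = [deque("abc"), deque("xy")]
    for index, task in wrr_dispatch(pqs, weights=[2, 1]):
        print(f"PQ{index} -> {task}")
```

With weights of 2 and 1, the first queue is served twice for every entry taken from the second queue, as long as both still hold pending tasks.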

3.2. Workflow of ACSA. As shown in Figure 3, the working flow of ACSA is divided into 5 logic levels: the application layer for HTTPS access, the OpenSSL layer for SSL/TLS processing, the CryptoDev layer, the HAC driver, and the hardware layer with accelerators. Among these, the application layer and the OpenSSL layer work in user space, while the CryptoDev layer and the HAC driver work in kernel space.

Figure 3: Working flow of ACSA. (Diagram labels: Clients, HTTPS requests, HTTPS responses; Application layer: Web Server; OpenSSL layer: Handshake API, SSL/TLS lib, Crypto API, Adaptive Scheduler, Fault detection, Software crypto engine, fault signal; CryptoDev layer: CryptoDev, Construct request, HAC Driver; Hardware layer: Hardware Crypto Engines; User Space and Kernel Space.)

For the application layer, we adopted the typical Web Server Nginx for HTTPS service. The Web Server monitors client requests, responds to them, and is responsible for load balancing with multiple threads. During this process, the Web Server authenticates clients' identities and performs security checks through the crypto functions provided by OpenSSL [28]. If an HTTPS connection is enabled, the Web Server transfers the required data to the lower logic layer for further processing.

The OpenSSL subsystem responds to crypto requests from the Web Server and interacts with CryptoDev for HAC invocation. The roles of the OpenSSL layer in ACSA include the following:

(1) Utilize the provided toolkit for secure connections with the SSL/TLS protocols.

(2) Integrate the Adaptive Scheduler for encryption mode switching, which enables reasonable task scheduling between software computing and hardware encryption.

(3) Extend the feature of the resource allocation strategy MM.

(4) If software computing is adopted, invoke the cryptography library to perform the encryption requests.

(5) Interact with the lower layer, CryptoDev. If the requested data needs hardware encryption, encapsulate the client requests as EVP blocks and then transmit the EVP blocks to CryptoDev through the Crypto API. After the hardware encryption completes, return the generated ciphertext to the application layer.

The CryptoDev subsystem exists as a module in the Linux kernel. Upwards, CryptoDev interacts with the OpenSSL layer, receives the delivered request data, and returns the completed ciphertext. Downwards, CryptoDev transforms received EVP blocks into the data structure that can be recognized by the HAC driver. CryptoDev acts as the middle layer between synchronous and asynchronous communications.

Memory copy consumes a lot of system resources in hardware acceleration systems, especially when mass data interaction occurs between hardware and software. To solve this problem, we used an efficient transmission scheme, "zero-copy", in the CryptoDev layer. We extended the CryptoDev subsystem to support both AEAD (Authenticated Encryption with Associated Data) and non-AEAD encryption requests. For all requests, the original data and encrypted results are organized as scatterlists. The scattered virtual addresses of the DMA buffer are arranged in a list in our design. In this way, data transmission between memory and HACs can be achieved through one DMA operation.

The HAC driver acts as a loadable module in the Linux kernel and plays an important role in accelerator invocation. The driver performs four major functions: initialization of hardware engines, device loading and unloading, registration of encryption algorithms, and interrupt processing. The driver provides an API for the kernel to receive data delivered from CryptoDev and constructs it into data blocks that HACs can parse. Then the driver delivers the prepared encryption data to the hardware engines through PQs. Once an interrupt signal from the HACs is detected, the driver gets the computing results and returns them to the upper layer for further transmission.

Figure 4: Interrupt processing flow without interrupt integration. (Diagram labels: Requests; Adaptive Scheduler; CPU; HAC1 to HACn, each with Init, Computing, and Interrupt stages; irq 1 to irq n.)

Hardware accelerators are the computing units for encryption/decryption. These components get tasks from the processing queues and can compute in parallel. HACs raise an interrupt when the encryption computing completes, informing the upper layer that results can be returned.

3.3. Dynamic Interrupt Aggregation. Interrupts are useful for returning results in acceleration systems. However, we find that interrupt processing also induces a lot of additional overhead in Web Server applications if the number of concurrent requests is high. As illustrated in Figure 4, computing tasks are assigned to HACs through the Adaptive Scheduler (please refer to Section 4 for the details of the Adaptive Scheduler). Each accelerator works independently, and one interrupt occurs for each encryption task to signal that the results are ready once computing is complete. We analyzed the overhead induced by each interrupt, which includes the following:

(1) The CPU invokes interrupt processing and needs one context switch when an interrupt is generated. This overhead is denoted as $C_{cs}$.

(2) The interrupt handling routine reads the encrypted results and responds to the peripherals. The cost induced by these operations is expressed as $C_{irq}$.

(3) Once the interrupt handling routine is complete, the upper application returns and continues the following processing, which generates one more context switch. This cost is again $C_{cs}$.

Based on the analysis above, the expense induced by one interrupt can be calculated as $C_{interrupt} = 2C_{cs} + C_{irq}$. If the requests for hardware encryption are concentrated, the interrupt frequency will be very high. A high interrupt frequency incurs considerable system cost for the whole ACSA system. Therefore, to reduce the overall system cost, we proposed adaptive interrupt aggregation in this work.

As shown in Figure 5, instead of raising one interrupt per encryption, the hardware accelerators raise an interrupt after multiple encryption tasks have been executed. Similarly, all the waiting processes in the upper layer are woken up by the aggregated interrupt to return the ciphertext. For scenarios with fewer requests, we configure a timeout threshold to ensure that interrupts are still processed correctly.

Figure 5: Working flow of adaptive interrupt aggregation. (Flowchart: set the interrupt aggregation threshold N; set the timeout threshold T; wait for the hardware interrupt; if the interrupt number ⩾ N, or otherwise if the timeout has expired, generate the interrupt and wake up the upper layer; if neither condition holds, keep waiting.)

An aggregation query is added to the improved interrupt handling to check the completion of hardware encryption. For all the detected tasks, only one interrupt is raised to inform the HAC driver that N encryptions have been finished by the accelerators. Here N is the number of finished encryption tasks during the query period. For a better trade-off between overhead and response delay, the threshold N is adjustable according to the system load. If the working load is high, the aggregation threshold N is increased automatically to decrease the interrupt frequency. Otherwise, N is decreased to raise the interrupt processing frequency, thus decreasing the response delay.
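The aggregation rule (raise one interrupt once N tasks have completed, or when the timeout T expires) and the load-dependent adjustment of N can be modelled with a small simulation. This is a sketch under stated assumptions: the timeout is measured from the oldest unreported completion, and the doubling/halving policy with its load thresholds is invented for illustration, since the text does not give the exact adjustment rule.

```python
def aggregate_interrupts(completion_times, threshold_n, timeout_t):
    """Return (interrupt_time, tasks_reported) pairs for a list of task
    completion times, using threshold N and timeout T."""
    interrupts = []
    pending = 0
    first_pending_time = None
    for t in sorted(completion_times):
        # Timeout path: the oldest unreported task has waited T already.
        if pending and t - first_pending_time >= timeout_t:
            interrupts.append((first_pending_time + timeout_t, pending))
            pending = 0
        if pending == 0:
            first_pending_time = t
        pending += 1
        # Threshold path: N completions are reported with one interrupt.
        if pending >= threshold_n:
            interrupts.append((t, pending))
            pending = 0
    if pending:
        # Whatever is left over is flushed when its timeout expires.
        interrupts.append((first_pending_time + timeout_t, pending))
    return interrupts


def adjust_threshold(n, load, n_min=1, n_max=64, high=0.8, low=0.3):
    """Hypothetical policy: double N under high load, halve it under low load."""
    if load > high:
        return min(n * 2, n_max)
    if load < low:
        return max(n // 2, n_min)
    return n
```

For example, five completions at times 0, 1, 2, 3, and 10 with N = 4 and T = 5 produce two interrupts instead of five: one at time 3 reporting four tasks and one at time 15 reporting the straggler.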


Before interrupt aggregation, the system overhead for n interrupts is

$$C_1 = n \times (2C_{cs} + C_{irq}). \qquad (1)$$

With the proposed interrupt aggregation, the overhead is reduced to

$$C_2 = 2C_{cs} + n \times C_{irq}. \qquad (2)$$

From these formulas, we can see an obvious reduction in context switch overhead with interrupt aggregation. The saving is $C_1 - C_2 = 2(n-1)C_{cs}$, so it grows with the number of interrupts n for a given application scenario.
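As a quick numerical check of formulas (1) and (2), the two overheads can be evaluated directly. The cost values used in the usage note are arbitrary illustrative units, not measurements from the paper.

```python
def overhead_without_aggregation(n, c_cs, c_irq):
    """Formula (1): every interrupt pays two context switches plus handling."""
    return n * (2 * c_cs + c_irq)


def overhead_with_aggregation(n, c_cs, c_irq):
    """Formula (2): one pair of context switches, n result-handling costs."""
    return 2 * c_cs + n * c_irq


def saving(n, c_cs, c_irq):
    """Difference between (1) and (2); algebraically 2 * (n - 1) * c_cs."""
    return (overhead_without_aggregation(n, c_cs, c_irq)
            - overhead_with_aggregation(n, c_cs, c_irq))
```

With the made-up values n = 10, C_cs = 5, and C_irq = 2, formula (1) gives 120 and formula (2) gives 30, a saving of 90 units, which matches 2(n - 1)C_cs.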

4. Adaptive Scheduler Based on HW-SW Codesign

As a heterogeneous computing platform, ACSA includes multiple processing cores and hardware accelerators. To take full advantage of both accelerators and CPUs, the Adaptive Scheduler is designed to realize optimal resource allocation. The scheduler calls the OpenSSL library or invokes accelerators for crypto computing according to the different features of application requests. However, the invocation of hardware accelerators also induces extra management overhead for a specific system. Furthermore, in both ARM and X86 based architectures, most systems support cryptographic instructions such as AES-NI. Therefore, we first analyze the working flow in detail to explore the difference in cost and performance between the AES-NI and accelerator encryption modes. In this way, we are able to gather statistics on the time spent in each segment of the working flow and figure out the hardware offloading conditions for SSL/TLS based connections.

41 Overhead Analysis forWorkloadOffloading To guaranteethe universality we adopt the standard Crypto API frame-work in kernel for the invocation of hardware acceleratorswhich includes the overhead for Mode Switch and ContextSwitch The detailed processing steps and overhead areconcluded as follows

(1) Passing the key through ioctl and creating a crypto session: ioctl(ctx->cfd, CIOCGSESSION, &ctx->sess). This process induces two Mode Switches (entering the kernel and returning from the kernel to user space).

(2) Passing the request data (original data) through ioctl to the driver in the kernel: ioctl(ctx->cfd, CIOCCRYPT, &cryp). This step generates another Mode Switch.

(3) After the kernel submits the request to the driver, it calls the function waitfor() and waits for the completion of the hardware computation. During this period the current user program enters a sleep state and the kernel schedules another process. This stage causes one context switch.

(4) The hardware accelerator performs the crypto computation asynchronously and raises an interrupt after the execution is complete. The interrupt processing results in a context switch.

(5) Once the processing of the interrupt generated in step (4) is completed, the function complete() is invoked to notify the user program that the request has been executed. The kernel then schedules the submitting user-space process in the next scheduling cycle, which produces a context switch.

(6) The process that submitted the request gets the encrypted results and returns to user space; that is, the second ioctl returns, generating a mode switch between kernel space and user space.
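Tallying the switches listed in steps (1) to (6) gives the per-request cost at a glance. The sketch below only transcribes the steps; the labels are ours, and treating the session creation of step (1) as a one-time setup rather than a per-request cost is an assumption.

```python
from collections import Counter

# Step (1): creating the session enters and leaves the kernel once.
SESSION_SETUP = [
    ("1: enter kernel for CIOCGSESSION", "mode"),
    ("1: return to user space", "mode"),
]

# Steps (2)-(6): events charged to every encryption request.
PER_REQUEST = [
    ("2: submit data via CIOCCRYPT", "mode"),
    ("3: caller sleeps, another process runs", "context"),
    ("4: completion interrupt is handled", "context"),
    ("5: caller is rescheduled", "context"),
    ("6: second ioctl returns to user space", "mode"),
]


def tally(events):
    """Count mode switches and context switches in a list of events."""
    return Counter(kind for _, kind in events)


print(dict(tally(SESSION_SETUP)))
print(dict(tally(PER_REQUEST)))
```

Under this reading each request costs two mode switches and three context switches, which is the part of the overhead that grows with the number of invocations.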

Based on the above analysis, we decompose the overhead of HAC invocation into three parts: the hardware initialization time $t_{init}$, the accelerator execution time $t_{exec}$, and the time for interrupt processing and mode switches $t_{irq}$. Therefore, the overhead of hardware acceleration can be denoted as $t_{hw} = t_{init} + t_{exec} + t_{irq}$.

To figure out the hardware offloading conditions for SSL/TLS based applications, we measured the execution time of three working flows for different block sizes: $t_{sw}$, the time needed for software computing with AES-NI (denoted as SW-NI); $t_{hw}$, the time needed for hardware encryption with accelerators (denoted as HW); and the time needed with the NULL CIPHER. In SW-NI mode, all SSL/TLS connection and encryption operations are performed on the CPUs with the available OpenSSL library. In HW mode, the crypto computing is offloaded to hardware accelerators, and software accesses the hardware engines through the CryptoDev methodology (please see Section 3 for details). The NULL CIPHER mode only performs the scatter walk over the incoming scatter list, without any crypto operation. The time for the scatter walk covers the cost of the mode switch between user space and the kernel and the scan of the scatter list. This overhead is unavoidable when invoking hardware accelerators. The NULL CIPHER mode is used to measure the general cost of the Crypto API framework and to calculate the corresponding overheads $t_{init}$ and $t_{irq}$.

We use the standard benchmark speed to obtain the overhead time in the different working modes. For each block size, we record the total execution time of 1000 encryptions and calculate the average value.
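The averaging procedure can be sketched as follows. This is not the OpenSSL speed benchmark itself: average_time_us is a hypothetical helper, and SHA-256 from the Python standard library stands in for the cipher only so that the sketch runs without extra dependencies.

```python
import hashlib
import time


def average_time_us(operation, block, runs=1000):
    """Run operation(block) `runs` times and return the mean time in microseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        operation(block)
    elapsed = time.perf_counter() - start
    return elapsed / runs * 1e6


def stand_in_workload(block):
    """Placeholder for one encryption call (an assumption, not AES)."""
    hashlib.sha256(block).digest()


for size in (64, 1024, 16384):
    print(size, average_time_us(stand_in_workload, bytes(size)))
```

Timing the whole loop and dividing by the run count, rather than timing each call, keeps the timer overhead out of the per-operation figure.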

4.2. Request Filtering and Data Aggregation. As analyzed above, using the accelerators induces additional overhead. To take full advantage of the hardware acceleration engines, request filtering is applied first in the Adaptive Scheduler. As illustrated in Figure 6, if the data size is small, the scheduler chooses software encryption with AES-NI to avoid the additional cost. Otherwise, hardware accelerators are invoked through CryptoDev. To further reduce the invocation overhead, request aggregation can follow if hardware acceleration is adopted.

The block size threshold is defined by a cost comparison. Offloading is practical only if the additional overhead of invoking the hardware is less than the SW-NI execution time. That is to say, the following criterion should be satisfied:

$$t_{sw} > c_{hw} = t_{init} + t_{irq}, \quad \text{i.e.,} \quad T_{diff} = t_{sw} - (t_{init} + t_{irq}) > 0 \quad (3)$$


Table 1: The running time for accelerator invocation and AES-NI. All times are in us. The last three columns break down the execution time with hardware acceleration, $t_{hw}$.

Block size (bytes)   AES-NI $t_{sw}$   Initialization $t_{init}$   Encryption $t_{exec}$   Invocation cost $t_{irq}$
64                   0.098             2.188                       1.100                   7.552
128                  0.156             1.816                       1.180                   7.814
256                  0.282             1.546                       1.360                   7.854
512                  0.517             1.460                       1.660                   7.680
1024                 0.985             1.918                       2.300                   7.282
2048                 1.936             1.614                       3.600                   7.546
4096                 3.838             1.874                       6.140                   7.506
8192                 7.653             1.774                       11.280                  7.806
16384                15.259            1.866                       21.500                  7.984
32768                30.479            2.034                       42.000                  8.636
65536                60.902            2.128                       82.940                  10.212

Figure 6: Working flow of Adaptive Scheduler. (Diagram labels: Web Server, OpenSSL, request filtering, small block size, software encryption with AES-NI, large block size, request aggregation, n × grain_size, CryptoDev, hardware accelerators, TCP package.)

Therefore, we need to measure the execution time of software computing $t_{sw}$ and the accelerator invocation cost $c_{hw}$ for various block sizes and set the threshold according to the difference between the two.

Here we take AES-128-CBC as an example to introduce the threshold determination for request filtering.

From the measured data in Table 1, we can plot the difference between $t_{sw}$ and $c_{hw}$ for varying block sizes (as shown in Figure 7).

As the trend in Figure 7 shows, when the block size is smaller than 16 KB, $t_{sw} - c_{hw} < 0$ and the running time with AES-NI is smaller. When the block size is 16 KB or larger, $t_{sw} - c_{hw} > 0$ and the cost of software computing exceeds the accelerator invocation cost.

Therefore, we set the threshold for workload offloading in this example to 16 KB. If the request block size is 16 KB or larger, the scheduler invokes the hardware accelerators; otherwise only the software working mode is adopted.
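The threshold determination can be reproduced with a minimal sketch of criterion (3). The rows below are read from Table 1 under the assumption that every value has three decimal places (the decimal points are missing in the extracted text); the function names and the engine labels returned by choose_engine are ours.

```python
# (block size in bytes, t_sw, t_init, t_irq), times in microseconds.
# Decimal placement is an assumption about the values in Table 1.
TABLE_1 = [
    (64, 0.098, 2.188, 7.552),
    (128, 0.156, 1.816, 7.814),
    (256, 0.282, 1.546, 7.854),
    (512, 0.517, 1.460, 7.680),
    (1024, 0.985, 1.918, 7.282),
    (2048, 1.936, 1.614, 7.546),
    (4096, 3.838, 1.874, 7.506),
    (8192, 7.653, 1.774, 7.806),
    (16384, 15.259, 1.866, 7.984),
    (32768, 30.479, 2.034, 8.636),
    (65536, 60.902, 2.128, 10.212),
]


def t_diff(t_sw, t_init, t_irq):
    """Criterion (3): T_diff = t_sw - (t_init + t_irq)."""
    return t_sw - (t_init + t_irq)


def offload_threshold(rows):
    """Return the smallest measured block size with T_diff > 0, or None."""
    for size, t_sw, t_init, t_irq in rows:
        if t_diff(t_sw, t_init, t_irq) > 0:
            return size
    return None


def choose_engine(size, threshold):
    """Request filtering: hardware at or above the threshold, AES-NI below."""
    if threshold is not None and size >= threshold:
        return "HW"
    return "SW-NI"


threshold = offload_threshold(TABLE_1)
print(threshold, choose_engine(8192, threshold), choose_engine(65536, threshold))
```

Under the stated assumption the sign of T_diff flips between 8192 and 16384 bytes, which agrees with the 16 KB threshold used in the text.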

If hardware acceleration is adopted, OpenSSL performs the SSL processing on the original data and then sends the request to the accelerators through CryptoDev and the hardware driver. For each request, OpenSSL segments the data if the request is larger than the grain size. Here the grain size is defined as the unit of an OpenSSL processing block. For each grain block, OpenSSL first performs compression, MAC computation, explicit IV addition, and padding, and then delivers the encapsulated grain blocks to the hardware driver through CryptoDev one by one. Therefore, one invocation is needed per block encryption, and the next data block is processed only after the former one is complete. Figure 8 demonstrates an example with a grain size of 16 KB, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, this request is segmented into 16 grain blocks according to the processing unit restriction in OpenSSL, so 16 invocations are executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request is $16 \times c_{hw}$. This processing flow produces a lot of additional overhead, resulting in low utilization of the accelerators.

Figure 7: The difference between $t_{sw}$ and $c_{hw}$ for varying block sizes.

To further reduce the invocation cost of hardware accelerators, request aggregation is proposed in this paper to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted with a single accelerator invocation. The design flow for data aggregation is detailed as follows.

(1) Modify the configuration file nginx.cfg for Nginx so that the function ngx_ssl_write() can get a data block with a length of n × grain_size.

(2) Extend the function ssl3_write_byte() to support a processing length of n × grain_size.

(3) Strengthen the function do_ssl3_write() with a data aggregation operation. In this function, the requested data of n × grain_size follows the SSL processing flow (compression, MAC, padding, etc.) and is then aggregated into a single package for lower-level processing.

(4) Increase the buffer size for data writes to support storage of the aggregated data.

(5) Revise the encryption function evp_cipher() and issue the aggregated data block to the hardware for crypto computing.

(6) Decompose the encrypted data into n segments after the hardware computation has completed and add a TCP header to each segment.

(7) Send the n TCP packages one by one by calling ssl3_write_pending().
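The buffer bookkeeping behind steps (3), (5), and (6) can be sketched as below. This is an illustration only, not the modified OpenSSL code: send_aggregated and toy_encrypt are hypothetical names, the XOR function is a placeholder for the accelerator call and provides no security, and how the real accelerator treats per-record IVs inside one aggregated buffer is not covered here.

```python
def toy_encrypt(buffer):
    """Placeholder for ONE accelerator invocation (XOR is not encryption)."""
    return bytes(b ^ 0x5A for b in buffer)


def send_aggregated(records, encrypt=toy_encrypt):
    """Encrypt n prepared records with a single call, then split them again.

    `records` are assumed to be already MACed and padded.
    """
    lengths = [len(record) for record in records]
    aggregated = b"".join(records)        # step (3): one aggregated package
    ciphertext = encrypt(aggregated)      # step (5): one hardware invocation
    segments = []
    offset = 0
    for length in lengths:                # step (6): back to n segments
        segments.append(ciphertext[offset:offset + length])
        offset += length
    return segments


records = [b"a" * 5, b"b" * 3, b"c" * 7]
print([len(segment) for segment in send_aggregated(records)])
```

Remembering the original record lengths is what allows the ciphertext to be cut back into n segments before the TCP headers are added.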

As illustrated in Figure 9, for a request of n blocks (each of grain size), the proposed methodology first performs the data preprocessing (MAC, padding, etc.), then encapsulates the n segments together and issues them to the hardware driver, and finally performs the data encryption through one accelerator invocation. In this way, the overhead of Mode Switches and Context Switches is reduced to 1/n of that of the original solution, which greatly improves the utilization of the hardware engines.
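The effect of aggregation on the number of accelerator invocations can be checked with a few lines. The sketch uses the 64 KB request and 16 KB grain size of Figures 8 and 9; the function name invocations is ours.

```python
KB = 1024


def invocations(data_size, grain_size, n=1):
    """Accelerator calls needed when n grain blocks share one invocation."""
    blocks = -(-data_size // grain_size)   # ceiling division
    return -(-blocks // n)


without_aggregation = invocations(64 * KB, 16 * KB)
with_aggregation = invocations(64 * KB, 16 * KB, n=4)
print(without_aggregation, with_aggregation)  # 4 1
```

Since every invocation carries the same switch overhead, going from four invocations to one corresponds to the 1/n reduction stated above with n = 4.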

5. Maximize Resource Utilization with Minimal Management Cost

For most Web Server applications, highly concurrent requests need to be processed in time, which is why we integrate multiple CPUs and accelerators in a single server. On a heterogeneous platform with both CPUs and hardware engines, it is important to exploit the computing strengths of each. However, how can the hardware engines be fully utilized at the least system cost? How should the concurrent processes be scheduled to get the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy, which allocates resources from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and to obtain the best system performance through hardware and software codesign.

For a clearer description, we first define the parameters used, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume there are N CPUs available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. Assume further that there are P available processes in total, of which $P_{hw}$ processes are activated for the hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is recorded as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw} \quad (4)$$

The core of the MM algorithm is to find an allocation strategy, with parameters $N_{hw}$ and $P_{hw}$, that gets the best system performance from the available resources through hardware and software codesign. The $N_{hw}$ found by the MM algorithm should be the least number of CPUs needed for accelerator management, while the $P_{hw}$ found by MM should be the least number of processes needed. Through the MM strategy we can make full use of the hardware accelerators while occupying the least system resources. The remaining CPUs are thus free for software computing with AES-NI or for other operations as needed [29, 30].
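Equation (4) can be evaluated directly, as in the sketch below. It reads $B_{sw}$ as the software bandwidth contributed by each remaining CPU, which is our interpretation of the multiplication by $N - N_{hw}$; the numbers are hypothetical round values in GB/s, not measurements.

```python
def max_bandwidth(b_hw, b_sw, n, n_hw):
    """Equation (4): B_max = B_hw + (N - N_hw) * B_sw."""
    if not 0 <= n_hw <= n:
        raise ValueError("n_hw must be between 0 and n")
    return b_hw + (n - n_hw) * b_sw


# Hypothetical example: 16 CPUs, one of them managing the accelerators.
print(max_bandwidth(b_hw=7.5, b_sw=1.0, n=16, n_hw=1))  # 22.5
```

The formula makes the design goal explicit: for a fixed $B_{hw}$, every CPU not spent on accelerator management adds one more share of software bandwidth.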

The major steps of the MM strategy are as follows.

(S1) Activate i CPUs (i = 1, 2, ..., N) and i processes for software encryption, and obtain the maximum encryption bandwidth $SP_i$ with the CPUs completely loaded.


Figure 8: Existing processing flow for 64 KB of data without data aggregation. (Diagram labels: original data, segmentation, hash, record protocol unit, add explicit IV, padding, crypto unit, encryption, add header, TCP data package; the four 16384-byte blocks are sent as plaintexts to the hardware one by one.)

Figure 9: Proposed processing flow for 64 KB of data with data aggregation. (Diagram labels: original data, segmentation, hash, record protocol unit, add explicit IV, padding, hardware crypto unit, add header, TCP data package, plaintext, ciphertext; all plaintexts are sent to the hardware at once.)


Table 2: Parameters for MM algorithm.

Parameter      Description
N              The number of available CPUs in the system
$N_{sw}$       The number of CPUs used for software encryption with AES-NI
$N_{hw}$       The number of CPUs responsible for the processes of hardware invocation
$P_{sw}$       The processes invoked for software encryption with AES-NI
$P_{hw}$       The processes invoked for hardware encryption
$P_{total}$    The total number of active processes
$HP_{i,j}$     The maximum bandwidth with hardware encryption with i CPUs and j processes, completely loaded
$SP_i$         The maximum bandwidth with software encryption with i CPUs and i processes, completely loaded
$P_{i,j}$      The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$)

Figure 10: CPUs and processes allocation with MM strategy. (Diagram labels: Web Server; CPU allocation; $N_{sw}$ CPUs with $P_{sw}$ processes invoked for software encryption (AES-NI); $N_{hw}$ CPUs with $P_{hw}$ processes invoked for hardware encryption (HAC).)

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ with the CPUs completely loaded.

(S3) Calculate the performance difference between hardware encryption and software computing: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) one by one and obtain the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum of the $P_{i,j}$ values obtained in (S4), $\max(P_{i,j})$ (i = 1, 2, ..., N). The i and j at which this maximum is attained determine the parameters: the number of CPUs used for accelerator invocation $N_{hw} = i$, the number of processes for hardware encryption $P_{hw} = j$, the number of CPUs for software encryption $N_{sw} = N - N_{hw}$, and the corresponding number of processes $P_{sw} = N_{sw}$. The total number of processes to activate is $P_{total} = P_{sw} + P_{hw}$.
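Steps (S3) to (S5) amount to an arg-max search over measured bandwidths, which the following sketch makes concrete. The function name mm_allocate, the dictionary layout, and the small data set are all hypothetical; in practice sp and hp would come from the measurements of (S1) and (S2).

```python
def mm_allocate(n, sp, hp):
    """Pick (N_hw, P_hw) maximizing P_ij = HP_ij - SP_i.

    sp maps a CPU count i to the software bandwidth SP_i.
    hp maps a pair (i, j) to the hardware bandwidth HP_ij.
    Ties keep the candidate with fewer CPUs, then fewer processes.
    """
    if not hp:
        raise ValueError("hp must contain at least one measurement")
    best = None
    for (i, j), hw_bandwidth in sorted(hp.items()):
        diff = hw_bandwidth - sp[i]
        if best is None or diff > best[0]:
            best = (diff, i, j)
    diff, n_hw, p_hw = best
    n_sw = n - n_hw
    p_sw = n_sw                      # one software process per CPU
    return {"N_hw": n_hw, "P_hw": p_hw, "N_sw": n_sw,
            "P_sw": p_sw, "P_total": p_sw + p_hw, "max_diff": diff}


# Hypothetical measurements for a 4-CPU system (arbitrary bandwidth units).
sp = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
hp = {(1, 1): 3.0, (1, 2): 5.0, (1, 3): 5.0,
      (2, 1): 3.5, (2, 2): 5.5, (2, 3): 5.5}
print(mm_allocate(4, sp, hp))
```

Scanning the candidates in increasing order of CPUs and processes and updating only on a strict improvement keeps the smallest allocation among equally good ones, in line with the goal of minimal management cost.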

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example for exploring the parameters i and j, assuming N is 16. The detailed process followed by MM is as follows.

(S1) Adopt software encryption as the working mode through the Adaptive Scheduler in ACSA. All encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, augment the workload until the CPU is completely loaded, and obtain the maximum encryption bandwidth, 1.01 GB/s.

(S3) Activate 2∼16 CPUs in order, with 2∼16 processes respectively. As in (S2), the maximum encryption bandwidth can be found for each number of CPUs and processes. We summarize $SP_i$ (i = 1, 2, ..., 16) in Figure 12.

(S4) Adopt hardware encryption as the working mode through the Adaptive Scheduler in ACSA. All encryption requests are processed through the hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, augment the workload until the CPU is completely loaded, and obtain the maximum encryption bandwidth, 7.52 GB/s, in hardware working mode.

(S6) Keeping one CPU activated, invoke 2∼n processes in order. Record the maximum encryption bandwidth with the CPU completely loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n-1 processes, the parameter j is fixed as n-1 and the test can stop.

(S7) Activate 2∼16 CPUs in turn and follow steps (S5) and (S6) to determine the number of processes at which the maximum encryption bandwidth is achieved. The number of processes for each number of CPUs is recorded in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between hardware and software encryption; the results are shown in Figure 13.

Input: N, the number of available CPUs in the system
Output:
    $N_{sw}$: the number of CPUs used for software encryption with AES-NI
    $N_{hw}$: the number of CPUs responsible for the processes of hardware invocation
    $P_{sw}$: the processes invoked for software encryption with AES-NI
    $P_{hw}$: the processes invoked for hardware encryption
    $P_{total}$: the total number of active processes

/* Test the maximum bandwidth with AES-NI for different CPU numbers; in general the process number is equal to the CPU number */
for CPU number i from 1 to N do
    calculate $SP_i$    /* $SP_i$ is the maximum bandwidth when the CPU number is i */
end for
/* Test the maximum bandwidth with hardware for different CPU numbers and process numbers */
for CPU number i from 1 to N do
    for process number j from 1 to 32 do
        calculate $HP_{i,j}$    /* $HP_{i,j}$ is the maximum bandwidth when the CPU number is i and the process number is j */
    end for
end for
/* Calculate the maximum performance difference between AES-NI computing and hardware encryption */
let maxdiff ← $P_{1,1}$
for CPU number i from 1 to N do
    for process number j from 1 to 32 do
        $P_{i,j}$ ← ($HP_{i,j} - SP_i$)
        if (maxdiff < $P_{i,j}$) then
            maxdiff ← $P_{i,j}$
            $N_{hw}$ ← i
            $P_{hw}$ ← j
            $N_{sw}$ ← (N - i)
            $P_{sw}$ ← $N_{sw}$
            $P_{total}$ ← ($P_{sw}$ + $P_{hw}$)
        end if
    end for
end for

Algorithm 1: The strategy for MM algorithm.

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

N          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
$P_{hw}$   12  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16

From Figure 13 we can find that the maximum difference occurs at $N_{hw} = 1$, $P_{hw} = 12$, i.e., one CPU with 12 corresponding processes. Based on the test results, 27 processes should be invoked in the system, of which 12 are scheduled for hardware management and the remaining 15 can be allocated to software encryption with AES-NI if no other work is needed.
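The allocation of this example can be re-derived from the plotted values. The lists below are the data labels of Figure 12 and the process counts of Table 3, read with two decimal places (an assumption about the lost decimal points); the variable names are ours.

```python
N = 16

# Software bandwidth SP_i and best hardware bandwidth per CPU count, in GB/s.
SP = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
HP = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
      7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
# Table 3: hardware process count giving the best bandwidth for each CPU count.
P_HW_BEST = [12] + [16] * 15

# Step (S8): performance difference for each number of CPUs.
diffs = [round(hw - sw, 2) for hw, sw in zip(HP, SP)]

best_index = max(range(N), key=lambda k: diffs[k])
n_hw = best_index + 1
p_hw = P_HW_BEST[best_index]
n_sw = N - n_hw
p_sw = n_sw
p_total = p_sw + p_hw
print(n_hw, p_hw, n_sw, p_total)  # 1 12 15 27
```

The largest difference is obtained with a single CPU because the hardware bandwidth saturates almost immediately, whereas the software bandwidth keeps growing with every additional CPU.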


Figure 11: The working flow of (a) software encryption and (b) hardware encryption. (Diagram labels: (a) user space: Web Server, OpenSSL, software crypto lib; (b) user space and kernel space: Web Server, OpenSSL, CryptoDev, Crypto API, hardware driver, hardware engine.)

Figure 12: The encryption performance with hardware/software working mode at different CPU numbers. (Chart title: AES-256-CBC Crypto Performance; x-axis: the number of CPUs; y-axis: encryption bandwidth (GB/s), 0 to 18.) Data labels:

Number of CPUs   Software (GB/s)   Hardware (GB/s)
1                1.01              7.52
2                2.04              7.61
3                3.06              7.62
4                4.08              7.62
5                5.05              7.62
6                6.13              7.58
7                7.16              7.61
8                8.17              7.62
9                9.19              7.62
10               10.22             7.62
11               11.22             7.62
12               12.26             7.62
13               13.28             7.62
14               14.31             7.62
15               15.33             7.62
16               15.97             7.62

Figure 13: The performance differences between hardware and software encryption. (Chart title: the performance differences between hardware and software encryption at different CPU numbers; x-axis: the number of CPUs; y-axis: performance difference (GB/s), -10 to 8.) Data labels:

Number of CPUs   Difference (GB/s)
1                6.51
2                5.57
3                4.56
4                3.54
5                2.57
6                1.45
7                0.45
8                -0.55
9                -1.57
10               -2.6
11               -3.6
12               -4.64
13               -5.66
14               -6.69
15               -7.71
16               -8.35


Table 4: Hardware environment of testing.

Hardware       Number   Configurations
ARM Server     2        CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB
Network Card   4        10 Gbps each
Net cable      4        2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables

Table 5: Software environment of testing.

Software Configuration   Software Version
Operating System         Linux-4127
OpenSSL                  OpenSSL-1.0.2j
Nginx                    Nginx-1116
Benchmark                ab (apache benchmark)

Figure 14: Testing network (ACSA acted as Server, the Testing machine as Clients). (Diagram labels: Server and Testing machine (Client), connected by four 10 Gbps links through 10 Gbps Ethernet ports.)

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we established a test platform for ACSA with 40 Gbps of network bandwidth. As shown in Table 4, we deployed 2 Web Servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate the testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the high-concurrency requirements of massive numbers of clients, we set up the network environment with 4 Network Cards, each contributing 10 Gbps of bandwidth.

As illustrated in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4127, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (apache benchmark) [32] is a widely used testing tool for website performance. We chose this benchmark for end-to-end testing over a network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed with the standard benchmark speed in OpenSSL. With speed we can determine the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms include AES-256-CBC and 3DES-CBC, of which AES-CBC can be accelerated through AES-NI while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM CPUs. The performance is measured in MB/s in this testing, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service that ACSA can provide. We use the testing machine to generate HTTPS requests from different clients. Over the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard tool ab is adopted for workload testing, and the performance indicator is RPS (Requests per Second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to see the performance for different data sizes. We take full advantage of the 4 network cards to run 32 ab processes so as to apply a high testing workload.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (Megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with the CPU, where the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, where the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy; this testing is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16, in which the encryption flow with AES-NI is the best case for software encryption.


Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC. (Chart title: the encryption bandwidths with different working flows (AES-256-CBC); x-axis: block size; y-axis: encryption bandwidth (MB/s), 0 to 18000.) Data table:

Block size   SW        SW-NI      HW        HW-SW codesign
4 KB         1272.68   11772.10   1650.75   11446.83
8 KB         1272.31   11804.24   3188.08   11635.55
16 KB        1269.26   11817.64   5891.48   12576.91
32 KB        1263.54   11809.72   5932.79   13948.63
64 KB        1263.99   11802.95   5954.03   15952.37
128 KB       1258.23   11810.72   5964.93   16989.31
256 KB       1247.49   11733.47   5970.58   16956.30
512 KB       1251.86   11658.08   5966.48   16855.39

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW. (Chart title: 16 CPUs, 32 processes, AES-256-CBC HW-SW co-design encryption bandwidths improvement; x-axis: block size; y-axis: encryption bandwidth improvement (%), -200 to 1400.) Data table:

Block size   vs SW-NI (%)   vs HW (%)   vs SW (%)
4 KB         -2.76          593.43      799.43
8 KB         -1.43          264.97      814.52
16 KB        6.42           113.48      890.89
32 KB        18.11          135.11      1003.93
64 KB        35.16          167.93      1162.06
128 KB       43.85          184.82      1250.25
256 KB       44.51          184.00      1259.23
512 KB       44.58          182.50      1246.43


Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs. (Chart title: the encryption bandwidths of HW with different numbers of CPUs (3DES-CBC); x-axis: block size, 4 KB to 512 KB; y-axis: encryption bandwidth (MB/s), 0 to 4000; one series per CPU count from 1 to 16.)

As we can see from Figures 15 and 16, encryption with hardware engines performs better than the crypto lib but worse than AES-NI. The reason is the invocation cost of the hardware engines, with its context switches and mode switches, as analyzed in Section 4.1. Besides, with AES-NI the encryption function is accelerated directly through instructions and no additional operations are needed. The proposed HW-SW codesign is the best option in almost all situations. We obtain a 799%∼1246% performance improvement over software encryption, and an improvement over hardware-only encryption that ranges from 593% at 4 KB to 182% at 512 KB. Even compared with AES-NI, the proposed working flow achieves an 18%∼44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for checking the data size and making the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI alone. However, this effect diminishes as the data size increases, because the scheduling check is performed less frequently. If the data size is 16 KB or larger, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes to hardware encryption and 19 to software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
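The improvement figures quoted above follow from the bandwidths of Figure 15 by the usual relative-gain formula, as the sketch below shows for the 4 KB and 512 KB columns. The bandwidths are read with two decimal places, in MB/s, which is an assumption about the lost decimal points; the function name is ours.

```python
def improvement_percent(new, base):
    """Relative gain of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# Bandwidths in MB/s read from Figure 15 (decimal placement assumed).
FIG_15 = {
    "4 KB": {"SW": 1272.68, "SW-NI": 11772.10, "HW": 1650.75, "codesign": 11446.83},
    "512 KB": {"SW": 1251.86, "SW-NI": 11658.08, "HW": 5966.48, "codesign": 16855.39},
}

for block, row in FIG_15.items():
    gains = {name: round(improvement_percent(row["codesign"], row[name]), 2)
             for name in ("SW", "SW-NI", "HW")}
    print(block, gains)
```

The negative gain over SW-NI at 4 KB reflects the scheduling check discussed above, while the positive gain at 512 KB comes from the accelerators and AES-NI working in parallel.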

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, and data filtering and dynamic scheduling are not worthwhile. Therefore, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by the many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can get almost the best encryption bandwidth with only four CPUs.

If we want maximum performance from both the hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. For most block sizes, it is reasonable to use only four CPUs for hardware encryption, so the other 12 available CPUs can take on more encryption tasks. For a clearer presentation, we tested three working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines using only four CPUs (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As we can see from Figure 18 even though there are 16CPUs responsible for the 3DES encryption the encryptionbandwidth is still very small (only 271 MBs) Since itis computation intensive this algorithm occupies lots ofsystem resource if there is no instruction acceleration Theencryption computation is accelerated greatly by hardwarecrypto engines and the improvement is more obvious for

Security and Communication Networks 17

Figure 18: Encryption bandwidth (MB/s) for different working flows (DES3-CBC).

Block size        4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW                271.69   271.68   271.20   271.78   271.66   271.39   271.74   271.69
HW with 4 CPUs    1270.39  2429.30  3453.60  3489.52  3495.98  3499.26  3500.83  3499.89
HW-SW co-design   1921.24  3536.86  3667.12  3686.88  3696.00  3701.05  3703.26  3701.54

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW (16 CPUs, 32 processes, DES3-CBC; encryption bandwidth improvement in %).

Block size                          4 KB    8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW-SW co-design vs SW               607.14  1201.85  1252.18  1256.57  1260.52  1263.74  1262.80  1262.41
HW-SW co-design vs HW with 4 CPUs   51.23   45.59    6.18     5.66     5.72     5.77     5.78     5.76

larger block sizes. The reason is that smaller data blocks incur more invocation cost. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, HW-SW codesign delivers a bandwidth improvement of nearly 12 times compared with software encryption. The improvement is only about 6 times for small data blocks (4 KB) because the hardware engines are utilized less efficiently. Compared with hardware-only encryption, the improvement from HW-SW codesign is less pronounced for large data blocks, because software encryption without AES-NI contributes little. Still, we get a 45%~51% improvement for small data blocks with HW-SW codesign and an additional 200~

Figure 20: Network bandwidth (MB/s) with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

Page size         4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW                258.41   419.18   504.74   608.54   679.39   720.47   741.12   756.26
SW-NI             342.91   613.02   857.61   1182.25  1451.02  1653.04  1761.68  1831.19
HW                228.04   409.20   573.66   814.16   980.63   1179.44  1276.61  1336.44
HW-SW co-design   338.96   608.02   851.59   1173.38  1416.41  1705.11  1818.99  1884.28

300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU capacity be used for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hour the user could take full advantage of both hardware and software for maximum overall performance; during normal working time, with fewer encryption requests, the user could allocate only the management resources needed for the HACs and aim for the best offloading.
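The improvement figures quoted for the DES3-CBC test can be reproduced with a short calculation. The sketch below assumes that the Figure 18 values are bandwidths in MB/s with two decimal places (an interpretation consistent with the 271 MB/s and 3703 MB/s quoted in the text); the function name is our own.

```python
# Figure 18 bandwidths in MB/s, read with two decimal places (assumption).
BLOCK_SIZES_KB = [4, 8, 16, 32, 64, 128, 256, 512]
SW = [271.69, 271.68, 271.20, 271.78, 271.66, 271.39, 271.74, 271.69]
HW_4CPU = [1270.39, 2429.30, 3453.60, 3489.52, 3495.98, 3499.26, 3500.83, 3499.89]
CODESIGN = [1921.24, 3536.86, 3667.12, 3686.88, 3696.00, 3701.05, 3703.26, 3701.54]


def improvement_pct(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


for size, sw, hw, co in zip(BLOCK_SIZES_KB, SW, HW_4CPU, CODESIGN):
    print(f"{size:>3} KB  vs SW: {improvement_pct(co, sw):8.2f}%  "
          f"vs HW: {improvement_pct(co, hw):6.2f}%  "
          f"extra over HW: {co - hw:7.2f} MB/s")
```

Under this reading, an improvement of about 600% at 4 KB corresponds to the "6 times" in the text, and the values above 1200% for larger blocks correspond to "nearly 12 times".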

6.3. End-to-End Testing. We adopt a stress test to evaluate the end-to-end performance. By pushing different workloads to the web server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested a range of page sizes to see how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (megabytes of encrypted data per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
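The conversion from RPS to network bandwidth can be sketched as below. It assumes that each request returns one page of the given size and that 1 MB = 1024 KB; the sample RPS values are taken from the 16-process table, read with two decimal places. The function name is our own.

```python
def rps_to_mbps(rps, page_kb):
    """Convert requests per second to MB/s for pages of `page_kb` kilobytes."""
    return rps * page_kb / 1024.0


# Software-only (SW) samples from the 16-process RPS table (assumed two decimals).
print(round(rps_to_mbps(66153.10, 4), 2))  # 4 KB pages
print(round(rps_to_mbps(53654.50, 8), 2))  # 8 KB pages
```

The two results agree with the SW bandwidths plotted for 4 KB and 8 KB pages (about 258 MB/s and 419 MB/s), which supports this reading of the tables.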

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. The effect is more obvious for small page sizes; for example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance exceeds that of both software-only and hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, AES-NI leaves no CPU idle at all. With HW-SW codesign, however, there is 12.51%~18.51% CPU idle relative to AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%~23.83%, and there is also a 2.85%~3.38% performance improvement.

Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)    4         8         16        32        64        128       256      512
SW                66153.10  53654.50  32303.20  19473.40  10870.30  5763.76   2964.48  1512.51
SW-NI             87784.10  78466.20  54887.00  37831.90  23216.30  13224.30  7046.71  3662.37
HW                58378.50  52377.60  36714.30  26053.10  15690.00  9435.54   5106.42  2672.87
HW-SW co-design   86679.40  77827.00  45499.50  34357.70  22623.60  13670.90  7275.66  3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)    4         8         16        32        64        128       256      512
SW                65885.90  52987.80  32030.00  19391.20  10803.80  5744.14   2964.18  1503.48
SW-NI             86302.50  77827.15  53017.20  37116.10  22920.40  13092.50  7020.85  3657.35
HW                58499.10  54891.60  37946.40  26972.10  16401.90  10041.90  5486.14  2881.66
HW-SW co-design   85177.20  77211.90  46404.00  36166.60  24875.10  15347.70  8325.19  4391.45

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle in %).

Page size                  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI   -1.15   -0.81   -0.70   -0.75   -2.38   3.15    3.25    2.90
HW-SW co-design vs HW      48.64   48.59   48.45   44.12   44.44   44.57   42.49   40.99
CPU idle                   1.50    1.60    1.70    1.80    15.50   18.00   20.00   20.50

If we want higher network bandwidth for better encryption performance, we can utilize the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with HW-SW codesign, we increase the number of processes to make use of the CPU idle. We ran the tests with 32 processes; the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth in MB/s and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign achieves a 49%~52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. Unlike the case with 16 processes, however, the effect is more obvious for large page sizes, since more CPU idle is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling

Figure 22: Network bandwidth (MB/s) with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

Page size         4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW                257.37   413.97   500.47   605.98   675.24   718.02   741.05   751.74
SW-NI             337.12   608.26   828.39   1159.88  1432.53  1636.56  1755.21  1828.68
HW                228.51   421.03   592.91   842.88   1025.12  1255.24  1371.54  1440.83
HW-SW co-design   334.03   603.22   820.70   1159.42  1562.83  1923.08  2092.16  2196.43

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle in %).

Page size                  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI   -0.92   -0.83   -0.93   -0.04   9.10    17.51   19.20   20.11
HW-SW co-design vs HW      46.17   43.27   38.42   37.55   52.45   53.20   52.54   52.44
CPU idle                   2.50    2.00    2.00    1.80    1.50    1.80    1.50    1.60

decision. If the page size is equal to or larger than 64 KB, the overall performance exceeds that of both software-only and hardware-only encryption.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As we can see from the figure, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption method. Hardware engines do not perform as well as AES-NI for small pages because of the context-switch cost; therefore, ACSA automatically adopts software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for a page size of 512 KB.
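The behavior of the request filter can be illustrated with a size-based dispatch rule. This is a hedged sketch and not the ACSA implementation: the 64 KB threshold is a hypothetical value suggested by the 32-process results, the names are our own, and routing ciphers without instruction acceleration to the hardware engines is an assumption based on the 3DES results.

```python
from enum import Enum


class Engine(Enum):
    SW_AESNI = "software (AES-NI)"
    HW_ACCEL = "hardware accelerator"


# Hypothetical crossover point; a real system would calibrate or adapt it at run time.
SMALL_PAGE_LIMIT_KB = 64


def choose_engine(page_kb, aesni_supported=True, limit_kb=SMALL_PAGE_LIMIT_KB):
    """Pick an encryption path from the request size and the cipher's AES-NI support."""
    if not aesni_supported:
        # Assumption: without instruction acceleration, hardware offload is preferred.
        return Engine.HW_ACCEL
    # Small pages: invocation and context-switch cost dominates, so stay on the CPU.
    return Engine.SW_AESNI if page_kb < limit_kb else Engine.HW_ACCEL


for size in (4, 32, 64, 512):
    print(size, "KB ->", choose_engine(size).value)
```

A load-aware variant would also consult the current CPU idle before choosing, which is the part of adaptive scheduling that this sketch leaves out.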

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs; values in %).

Page size                                4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI, 16 processes   -1.15   -0.81   -0.70   -0.75   -2.38   3.15    3.25    2.90
HW-SW co-design vs SW-NI, 32 processes   -0.92   -0.83   -0.93   -0.04   9.10    17.51   19.20   20.11
CPU idle, 16 processes                   1.50    1.60    1.70    1.80    15.50   18.00   20.00   20.50
CPU idle, 32 processes                   2.50    2.00    2.00    1.80    1.50    1.80    1.50    1.60

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)   4         8         16        32        64       128      256      512
SW               36334.70  25207.40  12791.60  6900.93   3589.73  1833.76  924.40   466.12
HW               58275.70  50717.80  25772.70  14525.70  7645.65  3902.54  1968.62  990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology enables a better trade-off between performance and CPU idle. It is possible to get the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user could use the saved CPU for other important tasks or take full advantage of it for further performance improvement.
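The trade-off in conclusion (2) can be expressed as a simple configuration choice. The sketch below uses the Figure 24 values for 512 KB pages, read as percentages with two decimal places (an assumption); the selection rule and all names are hypothetical and only illustrate the rush-hour versus normal-time policy discussed earlier.

```python
# Figure 24 values for 512 KB pages (assumed two decimals), keyed by process count.
CONFIGS = {
    16: {"gain_vs_aesni_pct": 2.90, "cpu_idle_pct": 20.50},
    32: {"gain_vs_aesni_pct": 20.11, "cpu_idle_pct": 1.60},
}


def pick_process_count(configs, rush_hour):
    """Maximize bandwidth gain during rush hour, otherwise maximize idle CPU."""
    key = "gain_vs_aesni_pct" if rush_hour else "cpu_idle_pct"
    return max(configs, key=lambda n: configs[n][key])


print(pick_process_count(CONFIGS, rush_hour=True))   # 32
print(pick_process_count(CONFIGS, rush_hour=False))  # 16
```

In practice the user might combine both metrics, for example by requiring a minimum bandwidth and then maximizing CPU idle among the configurations that meet it.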

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance for two working flows in this section. The first is software encryption without AES-NI; the other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6~2.1 times the performance of software encryption.

We summarize the network bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large pages. Under the same testing conditions, this is a 60%~112% improvement compared with software encryption. Besides the performance improvement, 13.62%~89.54% CPU idle remains available with hardware encryption.

According to the OpenSSL test, the maximum performance improvement is 12 times. However, the RPS improvement in end-to-end testing is only 2 times. The cause is a constraint of the testing platform: we found that the software decryption speed on the client has become a bottleneck, and all the client CPUs are completely loaded.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which is able to select the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even though there is advanced instruction acceleration for AES, the CPU

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth (MB/s).

Page size   4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW          141.93   196.93   199.87   215.65   224.36   229.22   231.10   233.06
HW          227.64   396.23   402.70   453.93   477.85   487.82   492.16   495.44

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle (%).

Page size               4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW vs SW improvement    60.39   101.20  101.48  110.49  112.99  112.82  112.96  112.58
CPU idle                13.62   15.40   53.15   70.40   75.07   86.04   88.17   89.54

occupation is considerable compared with hardware crypto. Although HACs perform well for computation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of the hardware crypto engines but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we get a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

The proposed design methodology possesses a degree of generality. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. In addition, this work is based on an ARM server, which can be extended for energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we intend to study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), the Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., “Big data: survey, technologies, opportunities, and challenges,” The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, “Security and privacy protection of social networks in big data era,” Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, “A survey on security issues in service delivery models of cloud computing,” Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, “Trust but verify: auditing the secure internet of things,” in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, “Hardware-software co-design of AES on FPGA,” in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, “Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip,” ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, “Security methods for web-based applications on embedded system,” in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, “Article David Brumley and Dan Boneh, Remote Timing Attacks are Practical,” in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, “Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA,” in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, “Accelerating OpenSSL’s ECC with low cost reconfigurable hardware,” in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, “An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions,” in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT’11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, “Design and test of a scalable security processor,” in Proceedings of the ASP-DAC 2005: Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, “10 Gbps implementation of TLS/SSL accelerator on FPGA,” in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, “System architecture implementation for a high-performance Network Security Processor,” in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, “Intel advanced encryption standard instructions (aes-ni),” Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, “Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security,” in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, “Improving the performance of a scalable encryption algorithm (SEA) using FPGA,” International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., “A 3 GHz dual core processor ARM cortex TM -A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization,” IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, “Energy- and cost-efficiency analysis of ARM-based clusters,” in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, “Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors,” Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.


[24] X. Zhang and K. K. Parhi, “High-speed VLSI architectures for the AES algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, “Very compact FPGA implementation of the AES algorithm,” in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, “A study of encryption algorithms (RSA, DES, 3DES and AES) for information security,” International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, “A comparative analysis of SHA and MD5 algorithm,” Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, “Improving I/O forwarding throughput with data compression,” in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, “Using underutilized CPU resources to enhance its reliability,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/

[32] http://httpd.apache.org/docs/current/programs/ab.html

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

[34] D. Kline, N. Parshook, X. Ge et al., “Holistically evaluating the environmental impacts in modern computing systems,” in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, “Growth in data center electricity use 2005 to 2010,” A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, “Green computing: new challenges and opportunities,” in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Page 2: Hardware/Software Adaptive Cryptographic …downloads.hindawi.com/journals/scn/2018/7631342.pdf

For example, the work in [10] implemented AES acceleration for embedded systems. The work in [11] presented a crypto hash SHA-2 logic core with reconfigurable hardware. The research in [12] accelerated elliptic curve cryptography with a very cheap FPGA. In [13], the cipher functions used in the SSL-driven connection, including the Scalable Encryption Algorithm (SEA), Message Digest Algorithm (MD5), and Secure Hash Algorithm (SHA2), are accelerated in the VLSI cryptosystem through FPGA. Others mount all processes for SSL/TLS ciphered communication into a single FPGA or ASIC, such as the works in [14–16]. These works showed a great performance improvement compared with software encryption with a crypto library. Nonetheless, these studies concentrated on the implementation of the hardware itself and hardly addressed how to utilize crypto accelerators efficiently at the least cost, let alone a design methodology for taking full advantage of both accelerators and CPUs. Furthermore, most hardware accelerations for SSL/TLS are designed for embedded systems [10–13], which cannot satisfy the high-volume and highly concurrent access requirements of the big data age.

Although a lot of work has been done on hardware acceleration, how to utilize accelerators efficiently in the new application scenario of big data is still an open problem. Besides, with AES-NI (the Intel Advanced Encryption Standard New Instructions), software crypto computations can also be accelerated, providing performance comparable to accelerators for web applications with small data requests [17]. Therefore, we propose to take full advantage of both accelerators and CPUs to secure HTTP access in big data. First, we implement a prototype system with multiple CPUs and encryption accelerators, on the basis of which the software and hardware computing overheads are analyzed in detail. Then, we propose interrupt aggregation and data aggregation to reduce the cost of hardware invocation. Finally, we put forward adaptive scheduling and MM to optimize the utilization of both software and hardware crypto engines. The main contributions of this work are as follows:

(i) To the best of our knowledge, this is the first Adaptive Crypto System based on Accelerators (ACSA) with software and hardware codesign, which is able to select the crypto mode adaptively and dynamically according to the request characteristics and system load. The problems of interrupt optimization, data aggregation, and adaptive scheduling for rational resource allocation are also carefully considered.

(ii) We improved the kernel for ACSA to address the additional overhead caused by hardware acceleration. A dynamic interrupt aggregation is added to the kernel to reduce the system cost induced by the high frequency of hardware interrupts.

(iii) We proposed a resource allocation strategy with the MM algorithm (maximize utilization with minimal overhead) and adaptive scheduling to maximize the overall encryption bandwidth. The designer/user can find the most reasonable process scheduling strategy for heterogeneous encryption platforms through MM, while adaptive scheduling exploits the respective advantages of hardware and software encryption.

(iv) Through the establishment of 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we get a 52.39% bandwidth improvement compared with hardware-only encryption and a 20.07% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

(v) The proposed design methodology possesses a degree of generality. The design flow of ACSA and the MM algorithm is applicable to other similar designs with hardware acceleration for secure web access. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. In addition, this work is based on an ARM server, which can be extended for energy efficiency exploration, providing emerging solutions for data centers besides x86-based architectures.
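To see why aggregation reduces invocation cost, consider a simple cost model in which every accelerator invocation pays a fixed overhead plus a per-kilobyte processing time. The sketch below is illustrative only: the greedy batching rule, the 32 KB batch limit, and the cost constants are hypothetical and are not taken from the ACSA implementation.

```python
def aggregate(sizes_kb, batch_limit_kb):
    """Greedily pack consecutive requests into batches of at most `batch_limit_kb`."""
    batches, current, used = [], [], 0
    for size in sizes_kb:
        # Flush the current batch when the next request would exceed the limit.
        if current and used + size > batch_limit_kb:
            batches.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        batches.append(current)
    return batches


def offload_time_us(batches, invoke_cost_us, per_kb_us):
    """Each batch pays one fixed invocation cost plus time proportional to its size."""
    return sum(invoke_cost_us + per_kb_us * sum(batch) for batch in batches)


requests = [4] * 16                      # sixteen 4 KB requests
unbatched = [[size] for size in requests]
batched = aggregate(requests, batch_limit_kb=32)

# Hypothetical costs: 20 us per invocation, 1 us per KB processed.
print(offload_time_us(unbatched, 20.0, 1.0))  # 384.0
print(offload_time_us(batched, 20.0, 1.0))    # 104.0
```

The same reasoning applies to interrupt aggregation: signaling completion once per batch rather than once per request amortizes the fixed interrupt-handling cost, at the price of some added latency for the requests that wait.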

The remainder of this paper is organized as follows. Section 2 discusses the related works. Section 3 shows the system architecture and design methodology of ACSA; crypto algorithm adaption and interrupt aggregation are also introduced in that section. Sections 4 and 5 discuss the methodology we propose for adaptive scheduling and optimal resource allocation. We present the evaluation methods and experimental results in Section 6. Finally, we conclude our discussion in Section 7.

2. Related Work

Many effective works have been proposed for accelerating SSL/TLS processing. We divide them into three categories according to the hardware implementation.

The first category focuses on the acceleration of a specific algorithm in the SSL/TLS process. Notable among these works are [10–12]. In [10], Praveen Kumar B. et al. designed and verified a working AES crypto Verilog core, which runs on top of an embedded Linux distribution, uClinux. The OpenSSL library was ported and cross-compiled to work with only AES, which processes data blocks of 128 bits using a cipher key of length 128, 192, or 256 bits. The work in [11] presented the implementation of a crypto hash SHA-2 logic core in reconfigurable hardware; this design achieves a throughput of 644 Mbits/sec. In another work [12], the researchers demonstrated a reconfigurable hardware accelerator for OpenSSL’s implementation of ECC and showed how a low-cost hardware platform is sufficient to double performance.

The second category further integrates typical cipher algorithms in a single microchip. For example, Mohamed Khalil-Hani in [18] integrates the AES-256, SHA-1, SHA-2, RNG, and RSA-2048 cryptographic hardware cores into one FPGA microchip for an embedded system. The work in [19] developed a hardware prototype of the cryptographic processor in FPGA technology. In [13], the cipher functions used in the SSL-driven connection, including the Scalable Encryption Algorithm (SEA), Message Digest Algorithm (MD5), and Secure Hash Algorithm (SHA2), are accelerated in the VLSI cryptosystem through FPGA.

The third category places all processing for SSL/TLS ciphered communication into one ASIC, usually referred to as a Network Security Processor (NSP). An NSP performs the various cryptographic operations specified by network security protocols and helps to offload the computation-intensive burden from Network Processors (NPs). The work in [14] presents a security processor that accelerates cryptographic processing in modern security applications and supports popular cryptographic functions such as RSA, AES, hashing, and random number generation. The research in [15] shows a 10 Gbps implementation of a low-power SSL/TLS accelerator on a 65 nm FPGA. The use of FPGA/ASIC enables highly efficient processing and low power consumption through parallel optimization and pipelined processing. The work in [16] proposes a high-performance NSP, Zodiac, intended to accelerate both the IPsec and SSL protocols, which is synthesized with a 0.18 μm CMOS technology.

There is no doubt that these works contribute effectively to accelerating secure HTTP access. However, they concentrate on the hardware implementation itself and rarely address how to use crypto accelerators efficiently at the least cost. To the best of our knowledge, no clear performance comparison between hardware accelerators and AES-NI has been reported, let alone a design methodology for taking full advantage of both accelerators and CPUs. Furthermore, most works, especially those in the first and second categories, target embedded applications, which cannot satisfy the high-volume and highly concurrent access requirements of the big data age. Although the third category can achieve throughput on the order of gigabits per second through application-specific design, it lacks generality and flexibility compared with other solutions. Moreover, designing an ASIC or a prototype board takes extensive effort.

Therefore, we first surveyed and analyzed the different working flows of SSL/TLS and clarified the differences in working mode and performance between hardware accelerators and AES-NI. Based on this exploration of existing works, we propose the Adaptive Crypto System based on Accelerators (ACSA) with software and hardware codesign, which selects the crypto mode adaptively and dynamically according to the request characteristics and the system load. The problems of interrupt optimization, data aggregation, and adaptive scheduling for rational resource allocation are also carefully considered. In an evaluation under high workload on various benchmarks and system configurations, ACSA achieves an overall performance improvement compared with AES-NI-only or accelerator-only solutions. Furthermore, the proposed design methodology is applicable to other similar designs to build a highly resource-efficient system for secure HTTP access.

3. Adaptive Crypto System with Accelerators

In the big data age, Web Server applications are characterized by high concurrency and massive data. To accelerate the encryption/decryption process for these applications, we propose the Adaptive Crypto System based on Accelerators for secure HTTPS access. For clarity, we denote this system as ACSA. ACSA integrates multiple processing cores and crypto accelerators. The processing cores in this research are ARM cores. We use ARM cores, rather than the x86 architecture, to allow further exploration of aspects such as energy efficiency [20–22]. Although the server architecture is different, the proposed design methodology is applicable to other similar acceleration systems.

Unlike traditional cryptosystems, the encryption process in ACSA can be switched dynamically between hardware engines and CPUs. An Adaptive Scheduler is responsible for selecting the working mode. This scheduler chooses a suitable encryption method according to the request characteristics and the available system resources. Request data aggregation and the MM strategy are carefully considered in our design to make full use of both software and hardware computing engines. To ensure versatility, the Adaptive Scheduler also allows the user to choose the working mode manually.

We also considered the quality of service. We implemented a fault detection module to guarantee the reliability of hardware crypto. This detection module is active whenever hardware crypto computing is adopted. If the hardware encryption engine breaks down and cannot provide the crypto function properly, the detection module sends a fault signal to the mode-switching module. At this point, the encryption tasks in progress are interrupted and the working mode changes seamlessly from hardware encryption to software computing. The interrupted jobs are reexecuted by the software encryption units, thereby ensuring that clients obtain the requested data correctly.
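The fallback behavior described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the names `HardwareFault`, `make_hardware_encrypt`, `software_encrypt`, and `run_with_fallback` are hypothetical, the "cipher" is a placeholder XOR rather than real cryptography, and the fault is simulated rather than detected from a device.

```python
class HardwareFault(Exception):
    """Hypothetical fault signal raised when the hardware engine breaks down."""


def software_encrypt(data: bytes) -> bytes:
    # Placeholder transform standing in for the software cipher (NOT real crypto).
    return bytes(b ^ 0x5A for b in data)


def make_hardware_encrypt(fail_after: int):
    """Return a simulated hardware engine that fails after `fail_after` tasks."""
    state = {"done": 0}

    def hardware_encrypt(data: bytes) -> bytes:
        if state["done"] >= fail_after:
            raise HardwareFault("engine not responding")
        state["done"] += 1
        return bytes(b ^ 0x5A for b in data)

    return hardware_encrypt


def run_with_fallback(tasks, hardware_encrypt):
    """Encrypt tasks in hardware mode; on a fault, switch to software mode
    and reexecute the interrupted task in software."""
    mode = "hw"
    results, modes = [], []
    for data in tasks:
        if mode == "hw":
            try:
                results.append(hardware_encrypt(data))
                modes.append("hw")
                continue
            except HardwareFault:
                mode = "sw"  # fault signal received: stay in software mode
        results.append(software_encrypt(data))
        modes.append("sw")
    return results, modes
```

In this sketch the task that was in progress when the fault occurred is simply rerun by the software path, so the caller still receives one result per task.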

To make full use of the hardware accelerators, ACSA works in half-sync/half-async mode [23]. As shown in Figure 1, the whole system is divided into a synchronous level (the upper part) and an asynchronous level (the lower part). The two parts exchange information through a synchronous/asynchronous communication layer. In the upper part, the Web Server accepts requests from different clients, generates multiple crypto tasks simultaneously, and then sends these tasks to the synchronous/asynchronous communication layer. During this synchronous process, ACSA calls the SSL/TLS library to establish secure communications. In the lower part, the hardware driver receives crypto tasks from the communication layer and sends them to the hardware engines for encryption. When the crypto computing finishes, the hardware engine notifies the driver through an interrupt, and the driver sends the ciphertext back to the sync/async layer via callback functions so as to awaken the waiting process.
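A minimal user-space analogy of the half-sync/half-async pattern is sketched below. It assumes Python threads in place of kernel processes and interrupts, a `queue.Queue` in place of the communication layer, and a placeholder XOR transform in place of the crypto engine; all function names are hypothetical.

```python
import queue
import threading


def placeholder_cipher(data: bytes) -> bytes:
    # Stand-in transform (XOR with a constant); NOT real cryptography.
    return bytes(b ^ 0x5A for b in data)


def async_engine(task_queue: queue.Queue) -> None:
    """Asynchronous level: take tasks, process them, fire the callback."""
    while True:
        item = task_queue.get()
        if item is None:  # shutdown marker
            break
        data, callback = item
        callback(placeholder_cipher(data))


def crypto_process(task_queue: queue.Queue, data: bytes) -> bytes:
    """Synchronous level: send one task, then sleep until the callback fires."""
    done = threading.Event()
    box = {}

    def callback(ciphertext: bytes) -> None:
        box["result"] = ciphertext
        done.set()  # awaken the waiting process

    task_queue.put((data, callback))
    done.wait()
    return box["result"]


def serve(requests):
    """Run one crypto process per request against a single async engine."""
    task_queue = queue.Queue()
    engine = threading.Thread(target=async_engine, args=(task_queue,))
    engine.start()
    results = [None] * len(requests)

    def worker(index: int) -> None:
        results[index] = crypto_process(task_queue, requests[index])

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(len(requests))]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    task_queue.put(None)
    engine.join()
    return results
```

Each synchronous worker blocks only on its own event, so the asynchronous side can complete tasks in any order without the workers polling.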


Figure 1: Logical architecture of ACSA (synchronous level: Web Server and crypto processes; synchronous/asynchronous communication layer; asynchronous level: HAC driver and hardware crypto engines).

Figure 2: System architecture of ACSA (CPU0 to CPU15, processing queues PQ0 to PQ15, registers, controller, and HAC0 to HAC9).

3.1. System Architecture of ACSA. We choose a lightweight Web Server, TAISHAN, as our hardware platform. We enabled 16 ARM cores and 10 HACs (Hardware Acceleration Components) for our research. The ARM cores are Cortex series CPUs. The HACs are configured as security accelerators in ACSA and support typical algorithms such as AES-CBC [24], AES-GCM [25], 3DES [26], and SHA1 [26, 27]. Sixteen processing queues (PQs) are responsible for software/hardware interaction, and each queue can receive multiple requests issued by the engine driver. The software (the hardware driver) pushes tasks to the crypto engines through the PQs to carry out the required acceleration tasks. Each PQ can be enabled independently, and its starting address and length can be configured individually through registers. The designer or user can flexibly configure the mapping between PQs and CPUs. As shown in Figure 2, a CPU uses a write pointer to signal the pending tasks in its processing queue. The controller fetches an entry from a processing queue according to the result of WRR (Weighted Round Robin) arbitration, then parses the task and invokes the HACs for execution. Once the computation finishes, an interrupt is generated to inform the upper layer that the results can be returned.
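To illustrate the arbitration step, the following sketch shows one simple form of weighted round robin over several queues. The paper does not specify the controller's exact WRR variant or weights, so the per-round quota scheme and the function name `wrr_dispatch` are assumptions made for illustration.

```python
from collections import deque


def wrr_dispatch(queues, weights):
    """Yield (queue_index, task) pairs in weighted round-robin order.

    In each round, queue k may hand over at most weights[k] tasks;
    rounds repeat until every queue is empty.
    """
    while any(queues):  # a deque is truthy while it still holds tasks
        for index, (pending, weight) in enumerate(zip(queues, weights)):
            for _ in range(weight):
                if not pending:
                    break
                yield index, pending.popleft()
```

For example, with two queues weighted 2 and 1, the first queue is served twice per round and the second once, which bounds how long a lightly weighted queue can be starved.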

3.2. Workflow of ACSA. As shown in Figure 3, the working flow of ACSA is divided into five logical levels: the application layer for HTTPS access, the OpenSSL layer for SSL/TLS processing, the CryptoDev layer, the HAC driver, and the hardware layer with accelerators. The application layer and the OpenSSL layer work in the user space, while the CryptoDev layer and the HAC driver work in the kernel space.

Figure 3: Working flow of ACSA (application layer, OpenSSL layer with the Adaptive Scheduler and fault detection, CryptoDev layer, HAC driver, and hardware layer with crypto engines).

For the application layer, we adopted the typical Web Server Nginx for the HTTPS service. The Web Server monitors client requests, responds to them, and is responsible for load balancing with multiple threads. During this process, the Web Server authenticates the clients' identities and performs security checks through the crypto functions provided by OpenSSL [28]. If an HTTPS connection is enabled, the Web Server transfers the required data to the lower logic layer for further processing.

The OpenSSL subsystem responds to the crypto requests from the Web Server and interacts with CryptoDev to invoke the HACs. The roles of the OpenSSL layer in ACSA include the following:

(1) Use the provided toolkit to establish secure connections with the SSL/TLS protocols.

(2) Integrate the Adaptive Scheduler for encryption mode switching, which enables reasonable task scheduling between software computing and hardware encryption.

(3) Extend the toolkit with the resource allocation strategy MM.

(4) If software computing is adopted, invoke the cryptography library to perform the encryption.

(5) Interact with the lower CryptoDev layer. If the requested data require hardware encryption, encapsulate the client requests as EVP blocks and transmit them to CryptoDev through the Crypto API. After the hardware computation completes, return the generated ciphertext to the application layer.

The CryptoDev subsystem exists as a module in the Linux kernel. Upwards, CryptoDev interacts with the OpenSSL layer, receives the delivered request data, and returns the completed ciphertext. Downwards, CryptoDev transforms the received EVP blocks into a data structure that the HAC driver can recognize. CryptoDev acts as the middle layer between synchronous and asynchronous communications.

Memory copying consumes a large amount of system resources in hardware acceleration systems, especially when massive data are exchanged between hardware and software. To solve this problem, we used an efficient "zero-copy" transmission scheme in the CryptoDev layer. We extended the CryptoDev subsystem to support both AEAD (Authenticated Encryption with Associated Data) and non-AEAD encryption requests. For all requests, the original data and the encrypted results are organized as scatterlists. The scattered virtual addresses of the DMA buffer are arranged in a list in our design. In this way, data transmission between memory and the HACs can be achieved through one DMA operation.

The HAC driver is a loadable module in the Linux kernel and plays an important role in accelerator invocation. The driver performs four major tasks: initialization of the hardware engines, device loading and unloading, registration of encryption algorithms, and interrupt processing. The driver provides an API for the kernel to receive the data delivered from CryptoDev and constructs them into data blocks that the HACs can parse. The driver then delivers the prepared data to the hardware engines through the PQs. Once an interrupt signal from the HACs is detected, the driver fetches the computing results and returns them to the upper layer for further transmission.

Figure 4: Interrupt processing flow without interrupt integration (the Adaptive Scheduler dispatches requests to HAC1 to HACn; each HAC runs init and computing and then raises its own interrupt, irq 1 to irq n, to the CPU).

The hardware accelerators are the computing units for encryption/decryption. These components fetch tasks from the processing queues and can compute in parallel. When an encryption computation completes, the HAC raises an interrupt to inform the upper layer that the results can be returned.

3.3. Dynamic Interrupt Aggregation. Interrupts are useful for returning results in acceleration systems. However, we find that interrupt processing also induces considerable additional overhead in Web Server applications when the number of concurrent requests is high. As illustrated in Figure 4, computing tasks are assigned to the HACs by the Adaptive Scheduler (please refer to Section 4 for details of the Adaptive Scheduler). Each accelerator works independently, and one interrupt occurs for each encryption task to signal that the results are ready once computing is complete. We analyzed the overhead induced by each interrupt, which includes the following:

(1) When an interrupt is generated, the CPU invokes interrupt processing, which requires one context switch. This overhead is denoted as $C_{cs}$.

(2) The interrupt handling routine reads the encrypted results and responds to the peripherals. The cost induced by these operations is denoted as $C_{irq}$.

(3) Once the interrupt handling routine is complete, the upper application resumes and continues its processing, which causes one more context switch. This cost is again $C_{cs}$.

Based on the analysis above, the expense induced by one interrupt can be calculated as $C_{interrupt} = 2C_{cs} + C_{irq}$. If the requests for hardware encryption are concentrated, the interrupt frequency will be very high. A high interrupt frequency incurs considerable cost for the whole ACSA system. Therefore, to reduce the overall system cost, we propose adaptive interrupt aggregation in this work.

As shown in Figure 5, instead of raising one interrupt per encryption, the hardware accelerators raise an interrupt after multiple encryption tasks have been executed. Accordingly, all the waiting processes in the upper layer are woken up by the aggregated interrupt to return the ciphertext. For scenarios with fewer requests, we configure a timeout threshold to ensure that completed tasks are still handled in time.

Figure 5: Working flow of adaptive interrupt aggregation (set the interrupt aggregation threshold N and the timeout threshold T; wait for hardware interrupts; when the number of interrupts reaches N or the timeout expires, generate the interrupt and wake up the upper layer).

An aggregation query is added to the improved interrupt handling to check the completion of hardware encryption. For all the detected tasks, only one interrupt is raised to inform the HAC driver that N encryptions have been finished by the accelerators. Here, N is the number of encryption tasks finished during the query period. For a better trade-off between overhead and response delay, the threshold N is adjusted according to the system load. If the workload is high, the aggregation threshold N is increased automatically to decrease the interrupt frequency. Otherwise, N is decreased to raise the interrupt frequency and thus reduce the response delay.
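The threshold-plus-timeout logic can be illustrated with a small simulation over completion timestamps. This is a sketch under assumptions, not the driver's actual implementation: the function names, the use of abstract time units, and the doubling/halving adjustment policy with load bounds 0.3 and 0.7 are all hypothetical choices made for illustration.

```python
def aggregate_interrupts(completion_times, n_threshold, timeout):
    """Group task completions into aggregated interrupts.

    completion_times: ascending timestamps at which tasks finish.
    Returns a list of (interrupt_time, tasks_reported) pairs. An interrupt
    fires when n_threshold completions are pending, or when the oldest
    pending completion has waited for `timeout`.
    """
    interrupts = []
    pending = 0
    first_pending = None
    for t in completion_times:
        # Timeout expired before this completion arrived: flush what is pending.
        if pending and t - first_pending >= timeout:
            interrupts.append((first_pending + timeout, pending))
            pending, first_pending = 0, None
        if pending == 0:
            first_pending = t
        pending += 1
        if pending >= n_threshold:
            interrupts.append((t, pending))
            pending, first_pending = 0, None
    if pending:
        interrupts.append((first_pending + timeout, pending))
    return interrupts


def adjust_threshold(n, load, low=0.3, high=0.7, n_min=1, n_max=64):
    """Hypothetical policy: raise N under high load, lower it under low load."""
    if load > high:
        return min(n * 2, n_max)
    if load < low:
        return max(n // 2, n_min)
    return n
```

With a burst of completions the threshold path dominates and few interrupts are raised; with sparse completions the timeout path bounds the response delay.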


Before interrupt aggregation, the system overhead for n interrupts is

$$C_1 = n \times (2C_{cs} + C_{irq}). \tag{1}$$

With the proposed interrupt aggregation, the overhead is reduced to

$$C_2 = 2C_{cs} + n \times C_{irq}. \tag{2}$$

These formulas show a clear reduction in context switch overhead with interrupt aggregation: $C_1 - C_2 = 2(n - 1)C_{cs}$. The saving grows with the number of interrupts n for a given application scenario.
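As a quick numerical check of the two overhead formulas, the sketch below evaluates them for arbitrary cost values. The function names are hypothetical, and the costs are unitless illustrative numbers rather than measurements from the paper.

```python
def overhead_without_aggregation(n, c_cs, c_irq):
    """Overhead of n separate interrupts: n * (2*C_cs + C_irq)."""
    return n * (2 * c_cs + c_irq)


def overhead_with_aggregation(n, c_cs, c_irq):
    """Overhead of one aggregated interrupt covering n tasks: 2*C_cs + n*C_irq."""
    return 2 * c_cs + n * c_irq


def context_switch_saving(n, c_cs, c_irq):
    """Difference between the two overheads; algebraically 2*(n - 1)*C_cs."""
    return overhead_without_aggregation(n, c_cs, c_irq) - overhead_with_aggregation(n, c_cs, c_irq)
```

The saving depends only on n and the context switch cost, since the per-task handling cost appears n times in both formulas.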

4. Adaptive Scheduler Based on HW-SW Codesign

As a heterogeneous computing platform, ACSA includes multiple processing cores and hardware accelerators. To take full advantage of both accelerators and CPUs, the Adaptive Scheduler is designed to achieve optimal resource allocation. The scheduler either calls the OpenSSL library or invokes the accelerators for crypto computing according to the features of the application requests. However, invoking hardware accelerators also induces extra management overhead in a given system. Furthermore, in both ARM and x86 based architectures, most systems support cryptographic instructions such as AES-NI. Therefore, we first analyze the working flow in detail to explore the differences in cost and performance between the AES-NI and accelerator encryption modes. In this way, we can measure the time spent in each segment of the working flow and determine the hardware offloading conditions for SSL/TLS based connections.

4.1. Overhead Analysis for Workload Offloading. To guarantee generality, we adopt the standard Crypto API framework in the kernel for the invocation of hardware accelerators, which incurs overhead for Mode Switches and Context Switches. The detailed processing steps and their overhead are summarized as follows:

(1) Pass the key through ioctl and create a crypto session: ioctl(ctx->cfd, CIOCGSESSION, &ctx->sess). This step induces two Mode Switches (entering the kernel and returning from the kernel to the user space).

(2) Pass the request data (original data) through ioctl to the driver in the kernel: ioctl(ctx->cfd, CIOCCRYPT, &cryp). This step generates another Mode Switch.

(3) After the kernel submits the request to the driver, it calls the function waitfor() and waits for the completion of the hardware computing. During this period, the current user program enters a sleep state and the kernel schedules another process. This stage causes one Context Switch.

(4) The hardware accelerator performs the crypto computing asynchronously and raises an interrupt after the execution is complete. The interrupt processing results in a Context Switch.

(5) Once the interrupt processing of step (4) is completed, the function complete() is invoked to notify the user program that the request has been executed. The kernel then schedules the submitting user-space process in the next scheduling cycle, which produces a Context Switch.

(6) The process that submitted the request obtains the encrypted results and returns to the user space; that is, the second ioctl returns, generating a Mode Switch between the kernel space and the user space.

Based on the above analysis, we decompose the overhead of HAC invocation into three parts: the hardware initialization time $t_{init}$, the accelerator execution time $t_{exec}$, and the time for interrupt processing and mode switching $t_{irq}$. Therefore, the time for hardware acceleration can be written as $t_{hw} = t_{init} + t_{exec} + t_{irq}$.

To determine the hardware offloading conditions for SSL/TLS based applications, we measured the execution time of three working flows for different block sizes: $t_{sw}$, the time needed for software computing with AES-NI (denoted as SW-NI); $t_{hw}$, the time needed for hardware encryption with accelerators (denoted as HW); and the time needed with the NULL CIPHER. In SW-NI mode, all SSL/TLS connection and encryption operations are performed on the CPUs with the available OpenSSL library. In HW mode, the crypto computing is offloaded to the hardware accelerators, and software accesses the hardware engines through CryptoDev (please see Section 3 for details). The NULL CIPHER mode only performs a scatter walk over the incoming scatterlist, without any crypto operation. The scatter walk time captures the cost of the mode switch between the user space and the kernel and of scanning the scatterlist. This overhead is unavoidable when invoking hardware accelerators. The NULL CIPHER mode is used to measure the general cost of the Crypto API framework and to calculate the corresponding overheads $t_{init}$ and $t_{irq}$.

We use the standard benchmark speed to obtain the time overhead in the different working modes. For each block size, we record the total execution time of 1000 encryptions and calculate the average value.

4.2. Request Filtering and Data Aggregation. As analyzed above, using the accelerators induces additional overhead. To take full advantage of the hardware acceleration engines, request filtering is applied first in the Adaptive Scheduler. As illustrated in Figure 6, if the data size is small, the scheduler chooses software encryption with AES-NI to avoid the additional cost. Otherwise, the hardware accelerators are invoked through CryptoDev. To further reduce the invocation overhead, request aggregation can follow when hardware acceleration is adopted.

The block size threshold is determined by a cost comparison. Offloading is practical only if the additional overhead of invoking the hardware is less than the SW-NI execution time. That is, the following criterion should be satisfied:

$$t_{sw} > c_{hw} = t_{init} + t_{irq}, \quad \text{i.e.,} \quad T_{diff} = t_{sw} - (t_{init} + t_{irq}) > 0. \tag{3}$$


Table 1: The running time for accelerator invocation and AES-NI. t_sw is the execution time with AES-NI; the execution time with hardware acceleration t_hw consists of initialization t_init, encryption t_exec, and invocation cost t_irq. All times are in us.

Block size (Bytes)    t_sw      t_init    t_exec    t_irq
64                    0.098     2.188     1.100     7.552
128                   0.156     1.816     1.180     7.814
256                   0.282     1.546     1.360     7.854
512                   0.517     1.460     1.660     7.680
1024                  0.985     1.918     2.300     7.282
2048                  1.936     1.614     3.600     7.546
4096                  3.838     1.874     6.140     7.506
8192                  7.653     1.774     11.280    7.806
16384                 15.259    1.866     21.500    7.984
32768                 30.479    2.034     42.000    8.636
65536                 60.902    2.128     82.940    10.212

Figure 6: Working flow of Adaptive Scheduler (request filtering in OpenSSL sends small blocks to software encryption with AES-NI; large blocks pass through request aggregation into units of n × grain_size and are delivered via CryptoDev to the hardware accelerators).

Therefore, we need to measure the execution time of software computing $t_{sw}$ and the accelerator invocation cost $c_{hw}$ for various block sizes and determine the threshold according to the difference between the two.

Here, we take AES-128-CBC as an example to illustrate how the threshold for request filtering is determined.

From the measured data in Table 1, we can plot the difference between $t_{sw}$ and $c_{hw}$ as a function of block size (as shown in Figure 7).

As the trend in Figure 7 shows, when the block size is below 16 KB, $t_{sw} - c_{hw} < 0$ and the running time with AES-NI is smaller. When the block size is at least 16 KB, $t_{sw} - c_{hw} > 0$ and the cost of software computing exceeds the cost of invoking the accelerator.

Therefore, in this example we set the threshold for workload offloading to 16 KB. If the request block size is at least 16 KB, the scheduler invokes the hardware for acceleration; otherwise, only the software working mode is adopted.
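The threshold determination can be reproduced with a short calculation. The sketch below assumes the Table 1 measurements with three decimal places (in microseconds) and applies criterion (3); the names `TABLE_1`, `offload_threshold`, and `choose_mode` are hypothetical, and treating a request exactly at the threshold as a hardware request is an assumption.

```python
# (block size in bytes, t_sw, t_init, t_exec, t_irq); times in microseconds.
# Values transcribed from Table 1, assuming three decimal places.
TABLE_1 = [
    (64, 0.098, 2.188, 1.100, 7.552),
    (128, 0.156, 1.816, 1.180, 7.814),
    (256, 0.282, 1.546, 1.360, 7.854),
    (512, 0.517, 1.460, 1.660, 7.680),
    (1024, 0.985, 1.918, 2.300, 7.282),
    (2048, 1.936, 1.614, 3.600, 7.546),
    (4096, 3.838, 1.874, 6.140, 7.506),
    (8192, 7.653, 1.774, 11.280, 7.806),
    (16384, 15.259, 1.866, 21.500, 7.984),
    (32768, 30.479, 2.034, 42.000, 8.636),
    (65536, 60.902, 2.128, 82.940, 10.212),
]


def t_diff(t_sw, t_init, t_irq):
    """T_diff = t_sw - (t_init + t_irq), as in criterion (3)."""
    return t_sw - (t_init + t_irq)


def offload_threshold(rows):
    """Return the first measured block size for which T_diff > 0, else None."""
    for size, t_sw, t_init, _t_exec, t_irq in rows:
        if t_diff(t_sw, t_init, t_irq) > 0:
            return size
    return None


def choose_mode(block_size, threshold):
    """Request filtering: hardware at or above the threshold, software below it."""
    return "hw" if block_size >= threshold else "sw"
```

Because the measurements are taken only at power-of-two block sizes, the result is the first measured size that satisfies the criterion rather than an exact crossover point.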

Figure 7: The difference between $t_{sw}$ and $c_{hw}$ under variable block size.

If hardware acceleration is adopted, OpenSSL performs SSL processing on the original data and then sends the request to the accelerators through CryptoDev and the hardware driver. For each request, OpenSSL segments the data if the request is larger than the grain size. Here, the grain size is defined as the unit of an OpenSSL processing block. For each grain block, OpenSSL first performs compression, MAC computation, explicit IV insertion, and padding and then delivers the encapsulated grain blocks to the hardware driver through CryptoDev one by one. Therefore, one invocation is needed per block encryption, and the next data block is processed only after the previous one is complete. Figure 8 illustrates an example with a grain size of 16 KB, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, the request is segmented into 16 grain blocks according to the processing unit restriction in OpenSSL, so 16 invocations are executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request is $16 \times c_{hw}$. This processing flow produces considerable additional overhead, resulting in low utilization of the accelerators.

To further reduce the invocation cost of the hardware accelerators, this work proposes request aggregation to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted with a single accelerator invocation. The design flow for data aggregation is as follows:

(1) Modify the Nginx configuration file nginx.cfg so that the function ngx_ssl_write() can obtain a data block with a length of n × grain_size.

(2) Extend the function ssl3_write_bytes() to support a processing length of n × grain_size.

(3) Extend the function do_ssl3_write() with the data aggregation operation. In this function, the requested data of n × grain_size follow the SSL processing flow (compression, MAC, padding, etc.) and are then aggregated into a single package for lower-level processing.

(4) Increase the size of the write buffer to hold the aggregated data.

(5) Revise the encryption function evp_cipher() to issue the aggregated data block to the hardware for crypto computing.

(6) Decompose the encrypted data into n segments after the hardware computing completes and add a TCP header to each segment.

(7) Send the n TCP packages one by one by calling ssl3_write_pending().

As illustrated in Figure 9, for a request of n blocks (each of grain size), the proposed methodology first performs data preprocessing (MAC, padding, etc.), then encapsulates the n segments together and issues them to the hardware driver, and finally encrypts the data through a single accelerator invocation. In this way, the overhead of Mode Switches and Context Switches is reduced to 1/n of that of the original solution, which greatly improves the utilization of the hardware engines.
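The effect of aggregation on the number of accelerator invocations can be sketched as follows. The sketch assumes a constant per-invocation cost c_hw regardless of how many grain blocks are aggregated, which is a simplification (Table 1 suggests the cost grows slowly with block size); the function names and the example cost of 9.85 us are illustrative.

```python
import math


def invocations(data_size, grain_size, blocks_per_call=1):
    """Number of accelerator invocations needed for one request.

    blocks_per_call = 1 models the original flow (one grain block per call);
    blocks_per_call = n models aggregation of n grain blocks per call.
    """
    blocks = math.ceil(data_size / grain_size)
    return math.ceil(blocks / blocks_per_call)


def invocation_overhead(data_size, grain_size, c_hw, blocks_per_call=1):
    """Total invocation overhead, assuming a constant cost c_hw per call."""
    return invocations(data_size, grain_size, blocks_per_call) * c_hw
```

For the 64 KB request of Figures 8 and 9 with a 16 KB grain size, the original flow needs four invocations, while aggregating all four blocks needs one, so the invocation overhead in this model drops to one quarter.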

5. Maximize Resource Utilization with Minimal Management Cost

For most Web Server applications, highly concurrent requests need to be processed in time, which is why we integrate multiple CPUs and accelerators in a single server. On a heterogeneous platform with both CPUs and hardware engines, it is important to exploit the computing strengths of each. However, how can the hardware engines be fully utilized at the least system cost? How should concurrent processes be scheduled to obtain the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy for resource allocation from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and to obtain the best system performance through hardware and software codesign.

For clarity, we first define the parameters used, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume that N CPUs are available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. Of the P available processes in the system, $P_{hw}$ processes are activated for the hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is denoted as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw}. \tag{4}$$

The core of the MM algorithm is to find an allocation strategy, given by the parameters $N_{hw}$ and $P_{hw}$, that obtains the best system performance from the available resources through hardware and software codesign. The $N_{hw}$ found by the MM algorithm should be the least number of CPUs needed for accelerator management, and the $P_{hw}$ found should be the least number of processes needed. With the MM strategy, we can make full use of the hardware accelerators while occupying the least system resources. The remaining CPUs are thus free for software computing with AES-NI or for other operations as needed [29, 30].

The major steps of the MM strategy are as follows:

(S1) Activate i CPUs (i = 1, 2, ..., N) and i processes for software encryption and obtain the maximum encryption bandwidth $SP_i$ with the CPUs fully loaded.


Figure 8: Existing processing flow for a 64 KB data without data aggregation (each 16384-byte segment of the original data is hashed, given an explicit IV, padded, encrypted, and given a header separately; the plaintexts are sent to the hardware one by one).

Figure 9: Proposed processing flow for a 64 KB data with data aggregation (the 16384-byte segments are hashed, given an explicit IV, and padded; all plaintexts are then sent to the hardware crypto unit at once, and a header is added to each resulting ciphertext segment).


Table 2: Parameters for MM algorithm.

Parameter      Description
N              The number of available CPUs in the system
$N_{sw}$       The number of CPUs used for software encryption with AES-NI
$N_{hw}$       The number of CPUs responsible for the processes of hardware invocation
$P_{sw}$       The processes invoked for software encryption with AES-NI
$P_{hw}$       The processes invoked for hardware encryption
$P_{total}$    The total number of active processes
$HP_{i,j}$     The maximum bandwidth with hardware encryption with i CPUs and j processes, fully loaded
$SP_i$         The maximum bandwidth with software encryption with i CPUs and i processes, fully loaded
$P_{i,j}$      The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$)

Figure 10: CPUs and processes allocation with MM strategy (the Web Server allocates $P_{sw}$ processes on $N_{sw}$ CPUs to software encryption with AES-NI and $P_{hw}$ processes on $N_{hw}$ CPUs to hardware encryption with the HACs).

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ with the CPUs fully loaded.

(S3) Calculate the performance difference between hardware encryption and software computing: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) one by one to obtain the values $P_{i,j}$ for the different numbers of CPUs.

(S5) Find the maximum $P_{i,j}$ from (S4): $\max(P_{i,j})$, i = 1, 2, ..., N. The parameters are determined by the indices (i, j) of this maximum: the number of CPUs used for accelerator invocation is $N_{hw} = i$, the number of processes for hardware encryption is $P_{hw} = j$, the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$, and the corresponding number of processes is $P_{sw} = N_{sw}$. The total number of processes to be activated is $P_{total} = P_{sw} + P_{hw}$.
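Steps (S1) to (S5) amount to a search over measured bandwidth tables, which can be sketched as below. The dictionaries `sp` and `hp` stand for measured values of $SP_i$ and $HP_{i,j}$ and must be supplied by the caller; the function name `mm_allocate` and the tie-breaking rule (prefer fewer CPUs, then fewer processes, in the spirit of using the least resources) are assumptions of this sketch.

```python
def mm_allocate(n_cpus, sp, hp):
    """Pick the MM allocation from measured bandwidths.

    sp: dict mapping i -> SP_i (software bandwidth with i CPUs, i processes).
    hp: dict mapping (i, j) -> HP_{i,j} (hardware bandwidth with i CPUs, j processes).
    Returns a dict with N_hw, P_hw, N_sw, P_sw, P_total and the best difference.
    """
    best_key = None
    best_ij = None
    for (i, j), hp_ij in hp.items():
        diff = hp_ij - sp[i]  # P_{i,j} = HP_{i,j} - SP_i
        # Larger difference wins; ties go to fewer CPUs, then fewer processes.
        key = (diff, -i, -j)
        if best_key is None or key > best_key:
            best_key, best_ij = key, (i, j)
    if best_ij is None:
        raise ValueError("no hardware measurements supplied")
    n_hw, p_hw = best_ij
    n_sw = n_cpus - n_hw
    p_sw = n_sw  # one software process per remaining CPU
    return {
        "N_hw": n_hw,
        "P_hw": p_hw,
        "N_sw": n_sw,
        "P_sw": p_sw,
        "P_total": p_sw + p_hw,
        "max_diff": best_key[0],
    }
```

A call such as `mm_allocate(4, {1: 1.0, 2: 2.0}, {(1, 1): 1.5, (1, 2): 2.5, (2, 2): 3.0, (2, 4): 3.5})` uses made-up bandwidth numbers purely to exercise the search; real inputs would come from the measurements in steps (S1) and (S2).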

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example to explore the parameters i and j, assuming N = 16. The detailed process followed by MM is as follows:

(S1) Set the working mode to software encryption through the Adaptive Scheduler in ACSA. All the encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, and increase the workload until the CPU is completely loaded; the maximum encryption bandwidth obtained is 1.01 GB/s.

(S3) Activate 2~16 CPUs in order, with 2~16 processes respectively. As in (S2), the maximum encryption bandwidth can be explored at the different


Input:
    N: the number of available CPUs in the system
Output:
    $N_{sw}$: the number of CPUs used for software encryption with AES-NI
    $N_{hw}$: the number of CPUs responsible for the processes of hardware invocation
    $P_{sw}$: the processes invoked for software encryption with AES-NI
    $P_{hw}$: the processes invoked for hardware encryption
    $P_{total}$: the total number of active processes

/* Test the maximum bandwidth with AES-NI for different CPU numbers; in general the process number is equal to the CPU number */
for CPU number i from 1 to N do
    calculate $SP_i$    /* $SP_i$ is the maximum bandwidth when the CPU number is i */
end for
/* Test the maximum bandwidth with hardware for different CPU numbers and process numbers */
for CPU number i from 1 to N do
    for process number j from 1 to 32 do
        calculate $HP_{i,j}$    /* $HP_{i,j}$ is the maximum bandwidth when the CPU number is i and the process number is j */
    end for
end for
/* Calculate the maximum performance difference between AES-NI computing and hardware encryption */
let maxdiff ← −∞
for CPU number i from 1 to N do
    for process number j from 1 to 32 do
        $P_{i,j}$ ← ($HP_{i,j} - SP_i$)
        if (maxdiff < $P_{i,j}$) then
            maxdiff ← $P_{i,j}$
            $N_{hw}$ ← i
            $P_{hw}$ ← j
            $N_{sw}$ ← (N − i)
            $P_{sw}$ ← $N_{sw}$
            $P_{total}$ ← ($P_{sw}$ + $P_{hw}$)
        end if
    end for
end for

Algorithm 1: The strategy for the MM algorithm.
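The selection step of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `mm_allocate`, the `Allocation` record, and the dictionary layout of the measurements are assumptions made for this example, and the sample bandwidths are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class Allocation(NamedTuple):
    n_hw: int        # CPUs that drive the hardware accelerators
    p_hw: int        # processes invoking hardware encryption
    n_sw: int        # CPUs left for software (AES-NI) encryption
    p_sw: int        # software processes, one per software CPU
    p_total: int     # all active processes
    max_diff: float  # largest HP[i, j] - SP[i] found


def mm_allocate(n_cpus: int,
                sp: Dict[int, float],
                hp: Dict[Tuple[int, int], float]) -> Allocation:
    """Pick (i, j) maximizing HP[i, j] - SP[i], then derive the allocation."""
    best = None
    for (i, j), hw_bw in sorted(hp.items()):
        diff = hw_bw - sp[i]
        if best is None or diff > best[0]:
            best = (diff, i, j)
    if best is None:
        raise ValueError("no hardware measurements supplied")
    diff, i, j = best
    n_sw = n_cpus - i
    return Allocation(i, j, n_sw, n_sw, n_sw + j, diff)


# Hypothetical measurements in GB/s for a 2-CPU system.
sp = {1: 1.0, 2: 2.0}
hp = {(1, 1): 3.0, (1, 2): 5.0, (2, 1): 3.5, (2, 2): 5.5}
result = mm_allocate(2, sp, hp)
```

With these made-up numbers the largest difference is found at one CPU and two processes, so one CPU remains for software encryption and three processes are active in total.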

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

N         1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
$P_{hw}$  12  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16

numbers of CPUs and processes. We summarize $SP_i$ (i = 1, 2, ..., 16) in Figure 12.

(S4) Set the working mode to hardware encryption through the Adaptive Scheduler in ACSA. All the encryption requests are processed through the hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, and increase the workload until the CPU is completely loaded; the maximum encryption bandwidth obtained in hardware working mode is 7.52 GB/s.

(S6) Keep one CPU activated and invoke 2~n processes in order. Record the maximum encryption bandwidth when the CPU is completely loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n-1 processes, the parameter j is fixed as n-1 and the test can be stopped.

(S7) Activate 2~16 CPUs in turn and follow steps (S5) and (S6) to determine the number of processes at which the maximum encryption bandwidth is achieved. The number of processes at each number of CPUs is recorded in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption; the results are recorded in Figure 13.
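The stopping rule of step (S6) can be sketched as a small search loop. This is only an illustration under stated assumptions: `measure` stands in for whatever benchmark returns the fully loaded bandwidth for a given process count, `fake_measure` is a hypothetical saturating curve, and the upper bound of 32 processes is taken from Algorithm 1.

```python
from typing import Callable, Tuple


def find_best_process_count(measure: Callable[[int], float],
                            max_processes: int = 32) -> Tuple[int, float]:
    """Increase the process count until the bandwidth stops improving."""
    best_j, best_bw = 1, measure(1)
    for n in range(2, max_processes + 1):
        bw = measure(n)
        if bw <= best_bw:
            # n processes are no better than n-1: stop and keep j = n-1.
            break
        best_j, best_bw = n, bw
    return best_j, best_bw


# Hypothetical curve: bandwidth grows by 0.5 per process and saturates at 12.
def fake_measure(n: int) -> float:
    return min(n, 12) * 0.5


best_j, best_bw = find_best_process_count(fake_measure)
```

Because the search stops at the first non-improving step, it assumes the bandwidth curve rises monotonically before it saturates; a noisy measurement could end the search early.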

From Figure 13, we can find that the maximum difference occurs in the case of $N_{hw}$ = 1 and $P_{hw}$ = 12, i.e., one CPU with 12 corresponding processes. Based on the test results, 27 processes should be invoked for the system, of which 12 are scheduled for hardware management and the remaining 15 processes can be allocated for software encryption with AES-NI if no other work is needed.
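The arithmetic behind this allocation can be checked with a short script. It assumes that the data labels of Figure 12 are bandwidths in GB/s with two decimals (the decimal points are not preserved in the extracted figure) and uses the process counts of Table 3; the variable names are chosen for this example only.

```python
# Data labels of Figure 12 in GB/s (assumed two decimals); index 0 is one CPU.
sp = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
hp = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
      7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
p_hw_by_cpus = [12] + [16] * 15   # Table 3

n = 16
# Hardware minus software bandwidth for each CPU count.
diffs = [round(h - s, 2) for h, s in zip(hp, sp)]
best = max(range(n), key=lambda k: diffs[k])

n_hw = best + 1                 # CPUs for accelerator invocation
p_hw = p_hw_by_cpus[best]       # processes for hardware encryption
n_sw = n - n_hw                 # CPUs left for AES-NI
p_sw = n_sw                     # one software process per CPU
p_total = p_sw + p_hw
```

Under these assumptions the differences run from 6.51 GB/s at one CPU down to -8.35 GB/s at 16 CPUs, and the derived allocation is 1 CPU with 12 hardware processes plus 15 software processes, 27 in total.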


[Diagram: (a) in user space: Web Server, OpenSSL, software crypto lib. (b) in user space: Web Server, OpenSSL; in kernel space: CryptoDev, Crypto API, hardware driver, hardware engine.]

Figure 11: The working flow of (a) software encryption and (b) hardware encryption.

[Chart: AES-256-CBC Crypto Performance. x-axis: the number of CPUs (1 to 16); y-axis: encryption bandwidth (GB/s).]

Software (GB/s): 1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17, 9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97
Hardware (GB/s): 7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62

Figure 12: The encryption performance with hardware/software working mode at different CPU numbers.

[Chart: The performance differences between hardware and software encryption at different CPU numbers. x-axis: the number of CPUs (1 to 16); y-axis: performance differences (GB/s).]

Difference (GB/s): 6.51, 5.57, 4.56, 3.54, 2.57, 1.45, 0.45, -0.55, -1.57, -2.6, -3.6, -4.64, -5.66, -6.69, -7.71, -8.35

Figure 13: The performance differences between hardware and software encryption.


Table 4: Hardware environment of testing.

Hardware      Number  Configurations
ARM Server    2       CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB
Network Card  4       10 Gbps each
Net cable     4       2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables

Table 5: Software environment of testing.

Software Configuration  Software Version
Operating System        Linux-4.1.27
OpenSSL                 OpenSSL-1.0.2j
Nginx                   Nginx-1.11.6
Benchmark               ab (apache benchmark)

[Diagram: the Server and the Testing machine (Client) are connected through four 10 Gbps Ethernet ports.]

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients).

6. System Test and Analysis

To reflect real working environments, we establish a test platform for ACSA with 40 Gbps network bandwidth, as shown in Figure 14. As shown in Table 4, we deployed two servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is utilized as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units, whose main frequency is 2.1 GHz, and 128 GB of memory. To satisfy the high-concurrency requirements of mass clients, we established the networking environment with 4 network cards, each contributing 10 Gbps of bandwidth.

As illustrated in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4.1.27, extended to support HAC access. OpenSSL-1.0.2j is utilized to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (Apache benchmark) [32] is a widely used testing toolset for website performance. We choose this benchmark for end-to-end testing in a network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed through the standard benchmark speed in OpenSSL. With speed, we can measure the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms include AES-256-CBC and 3DES-CBC; AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM CPUs. The performance is evaluated in MB/s in this testing, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is utilized to evaluate the performance of the HTTPS service that can be provided by ACSA. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (Requests per Second). For further analysis, we choose the cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Different from the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to see the performance for different data sizes. We take full advantage of the 4 network cards to provide 32 ab processes, so as to guarantee the testing pressure with a high workload.
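As a rough illustration of the two testing levels, the command lines for `openssl speed` and `ab` can be assembled as below. The exact invocations used in the experiments are not given in the text, so the request count, the concurrency level, and the URL are hypothetical placeholders; only the options `-elapsed`, `-evp`, and `-multi` of `openssl speed` and `-n`, `-c`, and `-Z` of `ab` are relied on, and block-size control is not shown.

```python
from typing import List


def openssl_speed_cmd(cipher: str, processes: int) -> List[str]:
    """Build an `openssl speed` invocation for one EVP cipher."""
    return ["openssl", "speed", "-elapsed", "-evp", cipher,
            "-multi", str(processes)]


def ab_cmd(url: str, cipher_suite: str,
           requests: int, concurrency: int) -> List[str]:
    """Build an ApacheBench invocation pinned to one TLS cipher suite."""
    return ["ab", "-n", str(requests), "-c", str(concurrency),
            "-Z", cipher_suite, url]


# Server-side test: AES-256-CBC with 32 parallel processes.
speed = openssl_speed_cmd("aes-256-cbc", 32)

# End-to-end test: hypothetical URL, request count, and concurrency.
bench = ab_cmd("https://server.example/page_4k.html",
               "ECDHE-RSA-AES256-SHA384", 100000, 100)
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the corresponding tool is installed.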

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with the CPU, in which the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW. (2) Software encryption with AES-NI, in which the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity. (3) Hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW. (4) Hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy; this testing is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16, in which the encryption flow with AES-NI is the best case for software encryption.


[Chart: The encryption bandwidths with different working flows (AES-256-CBC). x-axis: block size; y-axis: encryption bandwidth (MB/s).]

Block size       4 KB      8 KB      16 KB     32 KB     64 KB     128 KB    256 KB    512 KB
SW               1272.68   1272.31   1269.26   1263.54   1263.99   1258.23   1247.49   1251.86
SW-NI            11772.10  11804.24  11817.64  11809.72  11802.95  11810.72  11733.47  11658.08
HW               1650.75   3188.08   5891.48   5932.79   5954.03   5964.93   5970.58   5966.48
HW-SW codesign   11446.83  11635.55  12576.91  13948.63  15952.37  16989.31  16956.30  16855.39

Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC.

[Chart: 16 CPUs, 32 processes, AES-256-CBC: HW-SW co-design encryption bandwidth improvement. x-axis: block size; y-axis: encryption bandwidth improvement (%).]

Block size                                   4 KB    8 KB    16 KB   32 KB    64 KB    128 KB   256 KB   512 KB
HW-SW co-design vs SW-NI improvement (%)     -2.76   -1.43   6.42    18.11    35.16    43.85    44.51    44.58
HW-SW co-design vs HW improvement (%)        593.43  264.97  113.48  135.11   167.93   184.82   184.00   182.50
HW-SW co-design vs SW improvement (%)        799.43  814.52  890.89  1003.93  1162.06  1250.25  1259.23  1246.43

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW.


[Chart: The encryption bandwidths of HW with different numbers of CPUs (3DES-CBC). x-axis: block size (4 KB to 512 KB); y-axis: encryption bandwidth (MB/s); one series per CPU count, 1 to 16.]

Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs.

As we can see from Figures 15 and 16, hardware encryption with hardware engines achieves better performance than the crypto lib but worse than AES-NI. The reason is the invocation cost for hardware engines, with context switches and mode switches, as we analyzed in Section 4.1. Besides, for AES-NI the encryption function is accelerated directly through instructions and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We obtain a 799%~1246% performance improvement compared with software encryption and a 593%~182% improvement compared with only hardware encryption. Even compared with AES-NI, the proposed working flow achieves an 18%~44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is a little lower than that of AES-NI. The reason is that the threshold for the data filter is configured as 16 KB in this case: ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for checking the data size and making the scheduling decision, the maximum encryption bandwidth is somewhat lower than that of AES-NI. However, this influence decreases as the data size increases, on account of the reduced frequency of scheduling checks. If the data size is 16 KB or larger, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As we can see from the recorded results, through hardware and software codesign we obtain the best encryption bandwidth compared with only software or only hardware encryption.
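The data filter described above can be pictured as a simple size check. The sketch below is not the ACSA scheduler: the function name, the string labels, and the `hw_available` flag (a stand-in for the scheduler's knowledge of accelerator load) are assumptions for illustration; only the 16 KB threshold comes from the text.

```python
THRESHOLD_BYTES = 16 * 1024   # data filter threshold reported for this test


def choose_engine(data_size: int, hw_available: bool = True,
                  threshold: int = THRESHOLD_BYTES) -> str:
    """Return "aesni" for small requests and "hw" for large ones."""
    if data_size < threshold:
        # Invocation cost would dominate for small blocks: stay in software.
        return "aesni"
    # Large blocks go to the accelerator when one can take the request.
    return "hw" if hw_available else "aesni"


routes = [choose_engine(kb * 1024) for kb in (4, 8, 16, 32)]
```

In this sketch a busy accelerator simply falls back to AES-NI, which is one plausible reading of how both engines cooperate for large blocks.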

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, so applying the data filter and dynamic scheduling is not reasonable here. Therefore, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform so well when the data size is small; the reason is the cost induced by many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As we can see from the results, we can get almost the best encryption bandwidth with only four CPUs.

If we want to get the maximum performance with both hardware and software encryption engines, we can allocate the remaining CPUs for software encryption. Here, for most block sizes, it is reasonable to utilize only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines, in which only four CPUs are utilized (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As we can see from Figure 18, even though there are 16 CPUs responsible for the 3DES encryption, the encryption bandwidth is still very small (only 271 MB/s). Since it is computation intensive, this algorithm occupies lots of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by hardware crypto engines, and the improvement is more obvious for


[Chart: The encryption bandwidths with different working flows (DES3-CBC). x-axis: block size; y-axis: encryption bandwidth (MB/s).]

Block size        4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW                271.69   271.68   271.20   271.78   271.66   271.39   271.74   271.69
HW with 4 CPUs    1270.39  2429.30  3453.60  3489.52  3495.98  3499.26  3500.83  3499.89
HW-SW co-design   1921.24  3536.86  3667.12  3686.88  3696.00  3701.05  3703.26  3701.54

Figure 18: Encryption bandwidth for different working flows.

[Chart: 16 CPUs, 32 processes, DES3-CBC: HW-SW co-design encryption bandwidth improvement. x-axis: block size; y-axis: encryption bandwidth improvement (%).]

Block size                                           4 KB    8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW-SW co-design vs SW improvement (%)                607.14  1201.85  1252.18  1256.57  1260.52  1263.74  1262.80  1262.41
HW-SW co-design vs HW with 4 CPUs improvement (%)    51.23   45.59    6.18     5.66     5.72     5.77     5.78     5.76

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW.

larger block sizes. The reason is that more invocation cost is induced for smaller data blocks. The remaining idle CPU can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

To give a clear comparison, we summarize the performance improvement in Figure 19. As we can see, there is a nearly 12 times bandwidth improvement through HW-SW codesign compared with software encryption. There is only a 6 times improvement for small data blocks (4 KB), due to the worse utilization of the hardware engines. Compared with only hardware encryption, the improvement with HW-SW codesign is not so obvious for large data blocks. The reason is the poor contribution of software encryption without AES-NI. Still, we can get a 45%~51% improvement for small data blocks with HW-SW codesign and an additional 200~


[Chart: 16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384 network bandwidth. x-axis: page size; y-axis: network bandwidth (MB/s).]

Page size         4 KB    8 KB    16 KB   32 KB    64 KB    128 KB   256 KB   512 KB
SW                258.41  419.18  504.74  608.54   679.39   720.47   741.12   756.26
SW-NI             342.91  613.02  857.61  1182.25  1451.02  1653.04  1761.68  1831.19
HW                228.04  409.20  573.66  814.16   980.63   1179.44  1276.61  1336.44
HW-SW co-design   338.96  608.02  851.59  1173.38  1416.41  1705.11  1818.99  1884.28

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384.

300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU be utilized for other important tasks, such as database management. Of course, it is up to the user to decide on the trade-off between CPU idle and encryption performance. For example, during rush hour the user could take full advantage of both hardware and software for maximum overall performance; during normal working time with fewer encryption requests, the user could allocate only the management resources needed for the HACs and offload as much as possible.
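The size of this software contribution can be estimated with a back-of-the-envelope calculation. The sketch below reads the 512 KB data labels of Figure 18 as MB/s with two decimals and assumes that software 3DES throughput scales linearly with the number of CPUs; both are assumptions of this example, not statements from the measurements.

```python
# Figure 18 data labels at 512 KB blocks, in MB/s (two decimals assumed).
sw_16_cpus = 271.69          # software 3DES on all 16 CPUs
hw_4_cpus = 3499.89          # hardware engines driven by 4 CPUs
measured_codesign = 3701.54  # HW-SW codesign on 16 CPUs

# Assumption: software 3DES scales linearly with the CPU count,
# so the 12 CPUs left over deliver 12/16 of the 16-CPU bandwidth.
sw_12_cpus = sw_16_cpus * 12 / 16

estimated_codesign = hw_4_cpus + sw_12_cpus
measured_gain = measured_codesign - hw_4_cpus
```

Under these assumptions the 12 remaining CPUs add roughly 204 MB/s, and the estimate lands within a few MB/s of the measured codesign bandwidth.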

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we can get the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how the performance varies with different request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. Similar to the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we show three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. We show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth (MB/s) and present the results in Figure 20.
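The conversion from RPS to network bandwidth is not spelled out in the text. A plausible reading, which reproduces the SW values of Figure 20 from the SW row of Table 6 when both are read with two decimals, multiplies the request rate by the page size; the function name below is an assumption for this sketch.

```python
def rps_to_mb_per_s(rps: float, page_kb: float) -> float:
    """Convert requests per second for pages of page_kb KB into MB/s."""
    return rps * page_kb / 1024.0


# SW row of Table 6 (16 processes), assuming two decimals in the RPS values.
bw_4k = rps_to_mb_per_s(66153.10, 4)      # 4 KB pages
bw_512k = rps_to_mb_per_s(1512.51, 512)   # 512 KB pages
```

These two results agree with the corresponding SW data labels in Figure 20 (258.41 and 756.26 MB/s) to within rounding. The conversion counts page payload only, so protocol overhead is ignored.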

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than only hardware encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small page sizes. For example, the maximum improvement reaches 48.64% for the 4 KB page. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is a little lower than that of AES-NI. The reason is the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance is better than that of only software encryption or only hardware encryption.
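The improvement figures quoted here are consistent with the usual relative-gain formula. The sketch below applies it to the 4 KB data labels of Figure 20, read as MB/s with two decimals (an assumption about the extracted figure); the function name is chosen for this example.

```python
def improvement_pct(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# 4 KB page, 16 processes: HW-SW codesign against HW and against SW-NI.
vs_hw = improvement_pct(338.96, 228.04)
vs_aesni = improvement_pct(338.96, 342.91)
```

Rounded to two decimals, these give 48.64% over hardware-only encryption and -1.15% relative to AES-NI, matching the values reported for the 4 KB page.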

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is a little better with AES-NI, there is no CPU idle at all. With HW-SW codesign, however, there is 12.51%~18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%~23.83%, and there is also a 2.85%~3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page Size (KB)   4         8         16        32        64        128       256      512
SW               66153.10  53654.50  32303.20  19473.40  10870.30  5763.76   2964.48  1512.51
SW-NI            87784.10  78466.20  54887.00  37831.90  23216.30  13224.30  7046.71  3662.37
HW               58378.50  52377.60  36714.30  26053.10  15690.00  9435.54   5106.42  2672.87
HW-SW co-design  86679.40  77827.00  45499.50  34357.70  22623.60  13670.90  7275.66  3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page Size (KB)   4         8         16        32        64        128       256      512
SW               65885.90  52987.80  32030.00  19391.20  10803.80  5744.14   2964.18  1503.48
SW-NI            86302.50  77827.15  53017.20  37116.10  22920.40  13092.50  7020.85  3657.35
HW               58499.10  54891.60  37946.40  26972.10  16401.90  10041.90  5486.14  2881.66
HW-SW co-design  85177.20  77211.90  46404.00  36166.60  24875.10  15347.70  8325.19  4391.45

[Chart: 16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle. x-axis: page size; left y-axis: network bandwidth improvement (%); right y-axis: CPU idle (%).]

Page size                                   4 KB   8 KB   16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)    -1.15  -0.81  -0.70  -0.75  -2.38  3.15    3.25    2.90
HW-SW co-design vs HW improvement (%)       48.64  48.59  48.45  44.12  44.44  44.57   42.49   40.99
CPU idle (%)                                1.50   1.60   1.70   1.80   15.50  18.00   20.00   20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384.

If we expect a higher network bandwidth for better encryption performance, we can utilize the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to further explore better performance with HW-SW codesign, we increase the number of processes to utilize the CPU idle. We tested with 32 processes and show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth (MB/s) and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign achieves a 49%~52% bandwidth improvement compared with only hardware encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, different from the case with 16 processes, the influence is more obvious for large page sizes, since there is more CPU idle available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is a little lower than that of AES-NI. The reason is the cost of data size detection and the scheduling


[Chart: 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth. x-axis: page size; y-axis: network bandwidth (MB/s).]

Page size         4 KB    8 KB    16 KB   32 KB    64 KB    128 KB   256 KB   512 KB
SW                257.37  413.97  500.47  605.98   675.24   718.02   741.05   751.74
SW-NI             337.12  608.26  828.39  1159.88  1432.53  1636.56  1755.21  1828.68
HW                228.51  421.03  592.91  842.88   1025.12  1255.24  1371.54  1440.83
HW-SW co-design   334.03  603.22  820.70  1159.42  1562.83  1923.08  2092.16  2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384.

[Chart: 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle. x-axis: page size; left y-axis: network bandwidth improvement (%); right y-axis: CPU idle (%).]

Page size                                   4 KB   8 KB   16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)    -0.92  -0.83  -0.93  -0.04  9.10   17.51   19.20   20.11
HW-SW co-design vs HW improvement (%)       46.17  43.27  38.42  37.55  52.45  53.20   52.54   52.44
CPU idle (%)                                2.50   2.00   2.00   1.80   1.50   1.80    1.50    1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384.

decision. If the page size is equal to or larger than 64 KB, the overall performance is better than that of only software encryption or only hardware encryption.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As we can see from the figure, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption method. Hardware engines do not perform as well as AES-NI for small pages, due to the context switch cost. Therefore, ACSA automatically adopted software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be utilized to further improve the encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for the 512 KB page size.


[Chart: 16 CPUs, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle. x-axis: page size; left y-axis: network bandwidth improvement (%); right y-axis: CPU idle (%).]

Page size                                                 4 KB   8 KB   16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement, 16 processes (%)    -1.15  -0.81  -0.70  -0.75  -2.38  3.15    3.25    2.90
HW-SW co-design vs SW-NI improvement, 32 processes (%)    -0.92  -0.83  -0.93  -0.04  9.10   17.51   19.20   20.11
CPU idle, 16 processes (%)                                1.50   1.60   1.70   1.80   15.50  18.00   20.00   20.50
CPU idle, 32 processes (%)                                2.50   2.00   2.00   1.80   1.50   1.80    1.50    1.60

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384.

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page Size (KB)  4         8         16        32        64       128      256      512
SW              36334.70  25207.40  12791.60  6900.93   3589.73  1833.76  924.40   466.12
HW              58275.70  50717.80  25772.70  14525.70  7645.65  3902.54  1968.62  990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance of cipher suite ECDHE-RSA-AES256-SHA384 for diverse page sizes with different numbers of processes. From the experimental results, we can draw the following conclusions.

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with only software or only hardware encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement compared with only hardware encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology makes a better trade-off between performance and CPU idle possible. It is possible to get the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user can utilize the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and the contribution of software encryption to HW-SW codesign is small, we tested the end-to-end performance of 2 working flows in this section. The first one is software encryption without AES-NI. The other one is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We get 1.6~2.1 times the performance when we adopt hardware encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60%~112% improvement compared with software encryption. Besides the improved performance, there is still 13.62%~89.54% CPU idle available with hardware encryption.

According to the test in OpenSSL, there is a maximum 12 times performance improvement. However, there is only a 2 times RPS improvement in end-to-end testing. The problem is the constraint of the testing platform. We found that the software decryption speed in the client has become a bottleneck: all the CPUs of the client are completely loaded.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which is able to adopt the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even though there is advanced instruction acceleration for AES, the CPU


[Chart: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth. x-axis: page size; y-axis: network bandwidth (MB/s).]

Page size  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
SW         141.93  196.93  199.87  215.65  224.36  229.22  231.10  233.06
HW         227.64  396.23  402.70  453.93  477.85  487.82  492.16  495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

[Chart: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle. x-axis: page size; left y-axis: network bandwidth improvement (%); right y-axis: CPU idle (%).]

Page size                   4 KB   8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW vs SW improvement (%)    60.39  101.20  101.48  110.49  112.99  112.82  112.96  112.58
CPU idle (%)                13.62  15.40   53.15   70.40   75.07   86.04   88.17   89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to advance the contribution of hardware crypto engines but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of a 40 Gbps network, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we get a 53.20% bandwidth improvement compared with only hardware encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides X86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone who would like to obtain access could send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article: David Brumley and Dan Boneh. Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM Cortex TM-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


For example, work in [] implemented AES acceleration for embedded

and showed how a low-cost hardware platform is sufficient to double performance.

The second kind of acceleration research further integrates typical cipher algorithms in a single microchip. For example, Mohamed Khalil-Hani et al. in [18] integrate the AES-256, SHA-1, SHA-2, RNG, and RSA-2048 cryptographic hardware cores into one FPGA microchip for an embedded system. The work in [19] developed a hardware prototype of the cryptographic processor in FPGA technology. In [13], the cipher functions used in SSL-driven connections, including the Scalable Encryption Algorithm (SEA), Message Digest Algorithm (MD5), and Secure Hash Algorithm (SHA2), are accelerated in a VLSI cryptosystem through FPGA.

The third type mounts all processes for SSL/TLS ciphered communication into one ASIC, usually denoted as a Network Security Processor (NSP). An NSP performs various cryptographic operations specified by network security protocols and helps to offload the computation-intensive burdens from Network Processors (NPs). The work in [14] presents a security processor to accelerate cryptographic processing in modern security applications, which is capable of popular cryptographic functions such as RSA, AES, hashing, and random number generation. Research in [15] shows the 10 Gbps implementation of a low-power SSL/TLS accelerator on a 65 nm FPGA. The usage of FPGA/ASIC enables highly efficient processing and low power consumption by using parallel optimization and pipelined processing. The work in [16] proposes a high-performance NSP, Zodiac, intended for both IPsec and SSL protocol acceleration, which is synthesized with a 0.18 μm CMOS technology.

There is no doubt that these works make effective efforts toward the acceleration of secure HTTP accesses. However, they concentrated more on the hardware implementation itself and hardly addressed how to utilize crypto accelerators efficiently with the least cost. To the best of our knowledge, there is no clear performance comparison between hardware accelerators and AES-NI, let alone a design methodology for taking full advantage of both accelerators and CPUs. Furthermore, most works, especially types 1 and 2 mentioned above, focus on embedded applications, which cannot satisfy the high-volume and highly concurrent access requirements of the big data age. Although type 3 could achieve higher throughput, on the order of gigabits per second, through application-specific design, it lacks generality and flexibility compared with other solutions. Moreover, it takes extensive design effort for the ASIC or prototype board.

Therefore, we first surveyed and analyzed different working flows with SSL/TLS and clarified the working mode and performance differences between hardware accelerators and AES-NI. Based on the exploration of existing works, we proposed the Adaptive Crypto System based on Accelerators (ACSA) with software and hardware codesign, which is able to adopt the crypto mode adaptively and dynamically according to the request characteristics and system load. The problems of interrupt optimization, data aggregation, and adaptive scheduling for rational resource allocation are also carefully considered. Through evaluation with a high workload on various benchmarks and system configurations, we obtain an overall performance improvement compared with AES-NI-only or hardware-accelerator-only solutions. Furthermore, the proposed design methodology is applicable to other similar designs to build a highly resource-efficient system for secure HTTP access.

3. Adaptive Crypto System with Accelerators

In the big data age, Web Server applications exhibit the remarkable characteristics of high concurrency and mass data. To accelerate the encryption/decryption process for these applications, we proposed the Adaptive Crypto System based on Accelerators for secure HTTPS access. For clarity, we denote this system as ACSA. ACSA integrates multiple processing cores and crypto accelerators. The processing cores are ARMs in this research. The ARM cores were chosen to enable further exploration, such as energy efficiency, beyond the X86 architecture [20–22]. Although the server architecture is different, the proposed design methodology is applicable to other similar acceleration systems.

Different from traditional cryptosystems, the encryption process in ACSA is dynamically adjustable between hardware engines and CPUs. An Adaptive Scheduler is responsible for working mode adoption. This scheduler chooses a reasonable encryption method according to the request characteristics and system resources. Data request aggregation and the MM strategy are considered carefully in our design to make full use of both software and hardware computing engines. To ensure versatility, the Adaptive Scheduler also allows the user to choose the working mode flexibly.

We also considered the quality of service. We implemented a fault detection module to guarantee the reliability of hardware crypto. This detection module works whenever hardware crypto computing is adopted. If the hardware encryption engine breaks down and cannot provide the crypto function properly, the hardware detection module sends a fault signal to the mode-switching module. At this point, the encryption tasks in progress are interrupted, and the working mode changes seamlessly from hardware encryption to software computing. The interrupted jobs are reexecuted by the software encryption units, thereby ensuring that clients obtain the requested data correctly.
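The fallback path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ACSA implementation: `HardwareFault`, `encrypt_with_fallback`, the `state` dictionary, and the XOR stand-in for software encryption are hypothetical names introduced only for this example.

```python
class HardwareFault(Exception):
    """Hypothetical fault signal raised when the hardware engine fails."""


def encrypt_with_fallback(task, hw_encrypt, sw_encrypt, state):
    """Try the hardware path; on a fault, switch mode and re-execute in software."""
    if state["mode"] == "hw":
        try:
            return hw_encrypt(task)
        except HardwareFault:
            # Fault signal received: switch the working mode to software.
            state["mode"] = "sw"
    # Software path, also used to re-execute the interrupted task.
    return sw_encrypt(task)


def broken_hw(task):
    """Stand-in for a hardware engine that has broken down."""
    raise HardwareFault("engine down")


def toy_sw(task):
    """Placeholder transform, NOT real encryption."""
    return bytes(b ^ 0x5A for b in task)


state = {"mode": "hw"}
result = encrypt_with_fallback(b"abc", broken_hw, toy_sw, state)
```

After the call, `state["mode"]` is `"sw"`, so later tasks skip the failed engine and go directly to the software path.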

To make full use of hardware accelerators, ACSA works in half-sync/half-async mode [23]. As shown in Figure 1, the whole system is divided into a synchronous working level (the upper part) and an asynchronous level (the lower part). The two parts exchange information through a synchronous/asynchronous communication layer. In the upper part, the Web Server accepts requests from different clients, generates multiple crypto tasks simultaneously, and then sends these tasks to the synchronous/asynchronous communication layer. Through this synchronous process, ACSA calls the SSL/TLS library to establish secure communications. In the lower part, the hardware driver receives crypto tasks from the communication layer and sends them to the hardware engines for encryption. When the crypto computing finishes, the hardware engine notifies the driver through an interrupt, and the driver then sends the ciphertext back to the sync/async layer via callback functions so as to awaken the waiting process.
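The half-sync/half-async interaction can be illustrated with a small user-space sketch. It assumes a single worker thread standing in for the driver and hardware engines, a `queue.Queue` standing in for the communication layer, and a byte-reversal placeholder instead of encryption; all names are hypothetical and chosen for this example only.

```python
import queue
import threading

# Stands in for the synchronous/asynchronous communication layer.
task_queue = queue.Queue()


def async_engine():
    """Asynchronous level: take tasks, 'encrypt' them, run the completion callback."""
    while True:
        item = task_queue.get()
        if item is None:  # shutdown marker
            break
        data, callback = item
        callback(bytes(reversed(data)))  # placeholder transform, not encryption


def sync_encrypt(data):
    """Synchronous level: submit a task, then sleep until the callback wakes us."""
    done = threading.Event()
    box = {}

    def on_complete(result):
        box["result"] = result
        done.set()  # awaken the waiting caller

    task_queue.put((data, on_complete))
    done.wait()
    return box["result"]


worker = threading.Thread(target=async_engine)
worker.start()
results = [sync_encrypt(b"abc"), sync_encrypt(b"hello")]
task_queue.put(None)
worker.join()
```

The caller sees an ordinary blocking call, while the work itself completes asynchronously and is signaled back through a callback, which mirrors the "send tasks and sleep" and "awake the waiting process" arrows of the architecture.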

Figure 1: Logical architecture of ACSA.

Figure 2: System architecture of ACSA.

3.1. System Architecture of ACSA. We choose a lightweight Web Server, TAISHAN, as our hardware platform. We enabled 16 ARMs and 10 HACs (Hardware Acceleration Components) for our research. The ARMs are Cortex series CPUs. The HACs are configured as security accelerators in our ACSA and support typical encryption algorithms such as AES-CBC [24], AES-GCM [25], 3DES [26], and SHA1 [26, 27]. Sixteen processing queues (PQs) are responsible for software and hardware interaction, and each queue can receive multiple requests issued by the engine driver. The software (hardware driver) pushes tasks to the crypto engines through the PQs to implement the required acceleration tasks. Each PQ can be independently enabled, and its starting address and length can be configured individually through registers. The designer/user is able to configure the correspondence between the PQs and CPUs flexibly. As shown in Figure 2, the CPU uses a write pointer to signal the pending tasks in the processing queue. The controller fetches an entry from a processing queue according to the result of WRR (Weighted Round Robin) arbitration, then resolves the task and invokes HACs for execution. Once the computing is finished, an interrupt is generated to inform the upper layer so that results can be returned.
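As an illustration of weighted round robin arbitration over several processing queues, the following sketch serves up to `weights[i]` entries from queue `i` in each round. The exact arbitration policy and weights of the controller are not given in the text, so the function name, the weights, and the task labels are assumptions made for this example.

```python
from collections import deque


def wrr_drain(queues, weights):
    """Yield (queue_index, task); each round serves up to weights[i] tasks from queue i."""
    while any(queues):  # an empty deque is falsy
        for i, q in enumerate(queues):
            for _ in range(weights[i]):
                if not q:
                    break
                yield i, q.popleft()


# Three hypothetical processing queues with pending tasks.
pqs = [deque(["a0", "a1", "a2"]), deque(["b0"]), deque(["c0", "c1"])]
order = list(wrr_drain(pqs, weights=[2, 1, 1]))
```

With these weights, queue 0 is served twice per round and the others once, so no queue is starved while heavier queues still receive a larger share.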

3.2. Workflow of ACSA. As shown in Figure 3, the working flow of ACSA is divided into 5 logic levels: the application layer for HTTPS access, the OpenSSL layer for SSL/TLS processing, the CryptoDev layer, the HAC driver, and the hardware layer with accelerators. Among these, the application layer and the OpenSSL layer work in user space, while the CryptoDev layer and the HAC driver work in kernel space.

Figure 3: Working flow of ACSA.

For the application layer, we adopted the typical Web Server Nginx for HTTPS service. The Web Server monitors client requests, responds to them, and is responsible for load balancing with multiple threads. During this process, the Web Server authenticates clients' identities and checks security through the crypto functions provided by OpenSSL [28]. If an HTTPS connection is enabled, the Web Server transfers the required data to the lower logic layer for further processing.

The OpenSSL subsystem responds to the crypto requests from the Web Server and interacts with CryptoDev for HAC invocation. The roles of the OpenSSL layer in ACSA include the following:

(1) Utilize the provided toolkit for secure connection with the SSL/TLS protocols.

(2) Integrate the Adaptive Scheduler for encryption mode switching, which enables reasonable task scheduling between software computing and hardware encryption.

(3) Extend the feature of resource allocation strategy MM.

(4) If software computing is adopted, invoke the cryptography library to perform the encryption requests.

(5) Interact with the lower layer, CryptoDev. If the requested data needs hardware encryption, encapsulate the client requests as EVP and then transmit the EVP blocks to CryptoDev through CryptoAPI. After the hardware computation completes, return the generated ciphertext to the application layer.

The CryptoDev subsystem exists as a module in the Linux kernel. Upwards, CryptoDev interacts with the OpenSSL layer, receives the delivered request data, and returns the completed ciphertext. Downwards, CryptoDev transforms the received EVPs into a data structure that can be identified by the HAC driver. CryptoDev acts as the middle layer between synchronous and asynchronous communications.

Memory copy consumes a large share of system resources in hardware acceleration systems, especially when mass data interaction occurs between hardware and software. To solve this problem, we used an efficient transmission scheme, "zero-copy", in the CryptoDev layer. We extended the CryptoDev subsystem to support both AEAD (Authenticated Encryption with Associated Data) and non-AEAD encryption requests. For all requests, original data and encrypted results are organized as a scatterlist. The scattered virtual addresses of the DMA buffer are arranged in a list in our design. In this way, data transmission between memory and HACs can be achieved through one DMA operation.

The HAC driver acts as a loadable module in the Linux kernel and plays an important role in accelerator invocation. The driver performs four major jobs: initialization of hardware engines, device loading and unloading, registration of encryption algorithms, and interrupt processing. The driver provides an API for the kernel to receive the data delivered from CryptoDev and constructs it into data blocks that the HACs can resolve. Then the driver delivers the prepared encryption data to the hardware engines through the PQs. Once an interrupt signal from the HACs is detected, the driver gets the computing results and returns them to the upper layer for further transmission.

Figure 4: Interrupt processing flow without interrupt integration.

Hardware accelerators are the computing units for encryption/decryption. These components get tasks from the processing queues and can compute in parallel. A HAC raises an interrupt when the encryption computing is completed, informing the upper layer so that results can be returned.

3.3. Dynamic Interrupt Aggregation. The interrupt is useful for returning results in acceleration systems. However, we find that interrupt processing also induces considerable additional overhead in Web Server applications when the number of concurrent requests is high. As illustrated in Figure 4, computing tasks are assigned to HACs through the Adaptive Scheduler (please refer to Section 4 for the details of the Adaptive Scheduler). Each accelerator works independently, and one interrupt occurs for each encryption task to signal that the results are ready once computing is complete. We analyzed the overhead induced by each interrupt, which includes the following:

(1) The CPU invokes interrupt processing and needs one context switch when an interrupt is generated. This overhead is denoted as $C_{cs}$.

(2) The interrupt handling routine reads the encrypted results and responds to the peripherals. The cost induced by these operations is expressed as $C_{irq}$.

(3) Once the interrupt handling routine is complete, the upper application returns and continues the following processing, which generates one more context switch. This cost is again $C_{cs}$.

Based on the analysis above, the expense induced by one interrupt can be calculated as $C_{interrupt} = 2C_{cs} + C_{irq}$. If the requests for hardware encryption are concentrated, the interrupt frequency will be very high. A high interrupt frequency incurs considerable system cost for the whole ACSA system. Therefore, to reduce the overall system cost, we proposed adaptive interrupt aggregation in this work.

As shown in Figure 5, instead of one interrupt per encryption, the hardware accelerators raise an interrupt only after multiple encryption tasks have been executed. Similarly, all the waiting processes in the upper layer are woken up by the aggregated interrupt to return the ciphertext. For scenarios with fewer requests, we configure a timeout threshold to ensure that interrupts are still processed correctly.

Figure 5: Working flow of adaptive interrupt aggregation.

An aggregation query is added to the improved interrupt handling to check the completion of hardware encryption. For all the detected tasks, only one interrupt is invoked to inform the HAC driver that N encryptions have been finished by the accelerators. Here N is the number of finished encryption tasks during the query period. For a better trade-off between overhead and response delay, the threshold N is adjustable according to the system load. If the working load is high, the aggregation threshold N is increased automatically to decrease the interrupt frequency. Otherwise, N is decreased to raise the interrupt processing frequency, thus decreasing the response delay.
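A minimal sketch of this aggregation logic is given below, assuming a threshold N, a timeout T, and a simple load-driven adjustment of N. The class name, the doubling and halving rule, the load cut-offs 0.75 and 0.25, and the bounds on N are all hypothetical choices for illustration; the text does not specify how N is adapted.

```python
class InterruptAggregator:
    """Deliver one interrupt per N completions, or earlier on timeout."""

    def __init__(self, threshold, timeout, n_min=1, n_max=64):
        self.threshold = threshold  # aggregation threshold N
        self.timeout = timeout      # timeout threshold T
        self.n_min, self.n_max = n_min, n_max
        self.pending = 0
        self.first_pending_at = None

    def task_done(self, now):
        """Record one finished task; return how many completions are delivered (0 if none)."""
        if self.pending == 0:
            self.first_pending_at = now
        self.pending += 1
        return self._maybe_fire(now)

    def tick(self, now):
        """Periodic check so that a light workload still gets its interrupt."""
        return self._maybe_fire(now)

    def _maybe_fire(self, now):
        if self.pending == 0:
            return 0
        timed_out = now - self.first_pending_at >= self.timeout
        if self.pending >= self.threshold or timed_out:
            fired, self.pending = self.pending, 0
            return fired
        return 0

    def adapt(self, load):
        """Assumed policy: raise N under heavy load, lower it under light load."""
        if load > 0.75:
            self.threshold = min(self.threshold * 2, self.n_max)
        elif load < 0.25:
            self.threshold = max(self.threshold // 2, self.n_min)


agg = InterruptAggregator(threshold=4, timeout=10)
burst = [agg.task_done(now=t) for t in range(4)]  # fourth completion fires
agg.task_done(now=100)                            # a lone task, below N
late = agg.tick(now=110)                          # delivered by the timeout
agg.adapt(load=0.9)
high = agg.threshold
agg.adapt(load=0.1)
low = agg.threshold
```

The timeout bounds the response delay of a lone request, while the threshold bounds the interrupt rate under bursts.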


Before interrupt aggregation, the system overhead for n interrupts is

$$C_1 = n \times (2C_{cs} + C_{irq}). \quad (1)$$

With the proposed interrupt aggregation, the overhead is reduced to

$$C_2 = 2C_{cs} + n \times C_{irq}. \quad (2)$$

From these formulas, we can see an obvious reduction in context switch overhead with interrupt aggregation: $C_1 - C_2 = 2(n-1)C_{cs}$. The reduction grows with the number of interrupts n for a given application scenario.
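Equations (1) and (2) can be checked numerically with a short sketch. The cost values used here ($C_{cs} = 2.0$, $C_{irq} = 1.0$, in arbitrary units) are made-up numbers for illustration, not measurements from this work.

```python
def overhead_without_aggregation(n, c_cs, c_irq):
    """Equation (1): every interrupt pays two context switches plus handling."""
    return n * (2 * c_cs + c_irq)


def overhead_with_aggregation(n, c_cs, c_irq):
    """Equation (2): one pair of context switches, n handling costs."""
    return 2 * c_cs + n * c_irq


# Hypothetical costs in arbitrary units.
n, c_cs, c_irq = 8, 2.0, 1.0
c1 = overhead_without_aggregation(n, c_cs, c_irq)
c2 = overhead_with_aggregation(n, c_cs, c_irq)
saved = c1 - c2  # equals 2 * (n - 1) * c_cs
```

For these values the difference equals $2(n-1)C_{cs}$, so the saving grows linearly with the number of aggregated interrupts.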

4. Adaptive Scheduler Based on HW-SW Codesign

As a heterogeneous computing platform, ACSA includes multiple processing cores and hardware accelerators. To take full advantage of both accelerators and CPUs, the Adaptive Scheduler is designed to realize optimal resource allocation. The scheduler calls the OpenSSL library or invokes accelerators for crypto computing according to the features of the application requests. However, the invocation of hardware accelerators also induces additional management overhead for a specific system. Furthermore, in both ARM and X86 based architectures, most systems support cryptographic instructions such as AES-NI. Therefore, we first analyze the working flow in detail to explore the differences in cost and performance between the AES-NI and accelerator encryption modes. In this way, we are able to gather statistics on the time spent in each segment of the working flow and figure out the hardware offloading conditions for SSL/TLS based connections.

4.1. Overhead Analysis for Workload Offloading. To guarantee universality, we adopt the standard Crypto API framework in the kernel for the invocation of hardware accelerators, which includes the overhead of Mode Switches and Context Switches. The detailed processing steps and overhead are summarized as follows:

(1) Passing the key through ioctl and creating the crypto session: ioctl(ctx->cfd, CIOCGSESSION, &ctx->sess). This process induces two Mode Switches (entering the kernel and returning from the kernel to user space).

(2) Passing the request data (original data) through ioctl to the driver in the kernel: ioctl(ctx->cfd, CIOCCRYPT, &cryp). This step generates another Mode Switch.

(3) After the kernel submits the request to the driver, it calls function waitfor() and waits for the completion of hardware computing. During this period, the current user program enters a sleep state and the kernel schedules another process. This stage causes one context switch.

(4) The hardware accelerator performs crypto computing asynchronously and raises an interrupt after the execution is complete. The interrupt processing results in a context switch.

(5) Once the interrupt processing (generated in step (4)) is completed, function complete() is invoked to notify the user program that the request has been executed. Then the kernel schedules the submitting user-space process for the next scheduling cycle; this produces a context switch.

(6) The process that submitted the request gets the encrypted results and returns to user space; i.e., the second ioctl returns, generating a mode switch between user space and kernel space.
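The switches listed in steps (1) to (6) can be tallied with a small sketch. The split into per-session and per-request costs is our reading of the steps (session creation in step (1) is assumed to happen once per session, steps (2) to (6) once per request); the step labels and the `tally` helper are names introduced only for this example.

```python
# (description, mode_switches, context_switches, scope)
STEPS = [
    ("create session (ioctl CIOCGSESSION)", 2, 0, "per_session"),
    ("submit request (ioctl CIOCCRYPT)", 1, 0, "per_request"),
    ("sleep while waiting for the hardware", 0, 1, "per_request"),
    ("hardware interrupt processing", 0, 1, "per_request"),
    ("wake-up of the submitting process", 0, 1, "per_request"),
    ("second ioctl returns to user space", 1, 0, "per_request"),
]


def tally(scope):
    """Sum mode switches and context switches for one scope."""
    mode = sum(m for _, m, _, s in STEPS if s == scope)
    ctx = sum(c for _, _, c, s in STEPS if s == scope)
    return mode, ctx


per_request = tally("per_request")
per_session = tally("per_session")
```

Under this reading, each hardware invocation costs two mode switches and three context switches, plus two further mode switches when the session is first created.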

Based on the above analysis, we decomposed the overhead of HAC invocation into 3 parts: hardware initialization time $t_{init}$, accelerator execution time $t_{exec}$, and the time for interrupt processing and mode switch $t_{irq}$. Therefore, the overhead of hardware acceleration can be denoted as $t_{hw} = t_{init} + t_{exec} + t_{irq}$.

To figure out the hardware offloading conditions for SSL/TLS based applications, we tested the execution time of three working flows for different block sizes: $t_{sw}$, the time needed for software computing with AES-NI (denoted as SW-NI); $t_{hw}$, the time needed for hardware encryption with accelerators (denoted as HW); and the time needed with NULL CIPHER. In SW-NI mode, all the SSL/TLS connection and encryption operations are performed by CPUs with the available OpenSSL library. In HW working mode, the crypto computing is offloaded to hardware accelerators, and software accesses the hardware engines through the CryptoDev methodology (please see Section 3 for details). The NULL CIPHER mode only performs the scatter walk over the incoming scatter list, without any crypto operation. The time for the scatter walk captures the cost of the mode switch between user space and the kernel and of scanning the scatter list. This overhead is unavoidable when invoking hardware accelerators. NULL CIPHER mode is used to measure the general cost of the Crypto API framework and calculate the corresponding overhead $t_{init}$ and $t_{irq}$.

We use the standard benchmark speed to obtain the overhead time in the different working modes. For a given block size, we record the total execution time of 1000 encryptions and calculate the average value.

4.2. Request Filtering and Data Aggregation. As analyzed above, the use of acceleration induces additional overhead. To take full advantage of the hardware acceleration engines, request filtering is applied first in the Adaptive Scheduler. As illustrated in Figure 6, if the data size is small, the scheduler chooses software encryption with AES-NI to avoid the additional cost. Otherwise, hardware accelerators are invoked through CryptoDev. To further reduce the invocation overhead, request aggregation can follow if hardware acceleration is adopted.

The block size threshold depends on a cost comparison. Offloading is practical only if the additional overhead of invoking hardware is less than the SW-NI execution time. That is to say, the following criterion should be satisfied:

$$t_{sw} > c_{hw} = t_{init} + t_{irq}, \quad \text{i.e.,} \quad T_{diff} = t_{sw} - (t_{init} + t_{irq}) > 0. \quad (3)$$

Table 1: The running time for accelerator invocation and AES-NI. $t_{sw}$ is the execution time with AES-NI; the execution time with hardware acceleration $t_{hw}$ is split into initialization $t_{init}$, encryption $t_{exec}$, and invocation cost $t_{irq}$. All times are in us.

Block size (bytes)   t_sw     t_init   t_exec   t_irq
64                   0.098    2.188    1.100    7.552
128                  0.156    1.816    1.180    7.814
256                  0.282    1.546    1.360    7.854
512                  0.517    1.460    1.660    7.680
1024                 0.985    1.918    2.300    7.282
2048                 1.936    1.614    3.600    7.546
4096                 3.838    1.874    6.140    7.506
8192                 7.653    1.774    11.280   7.806
16384                15.259   1.866    21.500   7.984
32768                30.479   2.034    42.000   8.636
65536                60.902   2.128    82.940   10.212

Figure 6: Working flow of Adaptive Scheduler.

Therefore, we need to measure the execution time of software computing, $t_{sw}$, and the accelerator invocation cost, $c_{hw}$, for various block sizes, and determine the threshold from the difference between the two.

Here we take AES-128-CBC as an example to introduce the threshold determination for request filtering.

From the measured data in Table 1, we can plot the difference between $t_{sw}$ and $c_{hw}$ for variable block sizes (as shown in Figure 7).

As the trend in Figure 7 shows, when the block size is smaller than 16 KB, $t_{sw} - c_{hw} < 0$, so the running time with AES-NI is smaller. When the block size is at least 16 KB, $t_{sw} - c_{hw} > 0$, so the overhead of software computing is larger than the accelerator invocation cost.

Therefore, we set the threshold for workload offloading in this example to 16 KB. If the request block size is 16 KB or larger, the scheduler invokes the hardware for acceleration; otherwise only the software working mode is adopted.
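The threshold determination can be reproduced with a short sketch. It assumes the Table 1 values read in microseconds with three decimal places, and it routes requests at or above the threshold to hardware, following the observation for block sizes ⩾ 16 KB. The function names are illustrative and not part of ACSA.

```python
# Rows of Table 1 for AES-128-CBC, all times in microseconds:
# (block size in bytes, t_sw, t_init, t_irq)
TABLE_1 = [
    (64, 0.098, 2.188, 7.552),
    (128, 0.156, 1.816, 7.814),
    (256, 0.282, 1.546, 7.854),
    (512, 0.517, 1.460, 7.680),
    (1024, 0.985, 1.918, 7.282),
    (2048, 1.936, 1.614, 7.546),
    (4096, 3.838, 1.874, 7.506),
    (8192, 7.653, 1.774, 7.806),
    (16384, 15.259, 1.866, 7.984),
    (32768, 30.479, 2.034, 8.636),
    (65536, 60.902, 2.128, 10.212),
]


def t_diff(t_sw, t_init, t_irq):
    """Equation (3): T_diff = t_sw - (t_init + t_irq)."""
    return t_sw - (t_init + t_irq)


def offload_threshold(rows):
    """Smallest block size from which T_diff stays positive, or None."""
    threshold = None
    for size, t_sw, t_init, t_irq in sorted(rows):
        if t_diff(t_sw, t_init, t_irq) > 0:
            if threshold is None:
                threshold = size
        else:
            threshold = None  # a later non-positive row invalidates earlier candidates
    return threshold


def choose_path(request_size, threshold):
    """Request filtering: hardware at or above the threshold, AES-NI below it."""
    if threshold is not None and request_size >= threshold:
        return "hardware"
    return "aes-ni"


if __name__ == "__main__":
    threshold = offload_threshold(TABLE_1)
    print(threshold, choose_path(8192, threshold), choose_path(65536, threshold))
```

With these rows, $T_{diff}$ first becomes positive at 16384 bytes, which agrees with the 16 KB threshold chosen in the text.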

If hardware acceleration is adopted, OpenSSL performs SSL processing on the original data and then sends the request to the accelerators through CryptoDev and the hardware driver. For each request, OpenSSL segments the data if the request is larger than the grain size. Here the grain size is defined as the unit block that OpenSSL processes. For each grain block, OpenSSL first performs compression, adds the MAC and the explicit IV, and applies padding, and then delivers the encapsulated grain block to the hardware driver through CryptoDev, one by one. Therefore, one invocation is needed for each block encryption, and the next data block is processed only after the former one is complete. Figure 8 demonstrates an example with a grain size of 16 KB, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, this request is segmented into 16 grain blocks according to the processing unit restriction in OpenSSL, so 16 invocations are executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request is $16 \times c_{hw}$. This processing flow produces a large amount of additional overhead, resulting in low utilization of the accelerators.

Figure 7: The difference between $t_{sw}$ and $c_{hw}$ for variable block sizes.

To further reduce the invocation cost of the hardware accelerators, request aggregation is proposed in this paper to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted with a single accelerator invocation. The design flow for data aggregation is detailed as follows:

(1) Modify the configuration file nginx.cfg for Nginx so that the function ngx_ssl_write() can obtain a data block with a length of n × grain_size.

(2) Extend the function ssl3_write_byte() to support a processing length of n × grain_size.

(3) Strengthen the function do_ssl3_write() with a data aggregation operation. In this function, the requested data of n × grain_size follows the SSL processing flow (compression, MAC, padding, etc.) and is then aggregated into a single package for lower-level processing.

(4) Increase the buffer size for data writes to support storage of the aggregated data.

(5) Revise the encryption function evp_cipher() and issue the aggregated data block to the hardware for crypto computing.

(6) Decompose the encrypted data into n segments after the hardware computing completes, and add a TCP header to each segment.

(7) Send the n TCP packages one by one by calling ssl3_write_pending().

As illustrated in Figure 9, for a request of n blocks (each of grain size), the proposed methodology first performs data preprocessing (MAC, padding, etc.), then encapsulates the n segments together and issues them to the hardware driver, and finally performs data encryption through one accelerator invocation. In this way, the overhead of mode switches and context switches is reduced to 1/n of that in the original solution, which greatly improves the utilization of the hardware engines.
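The effect of aggregation on invocation overhead can be illustrated with a small cost model. This is a sketch under stated assumptions: every accelerator call is charged the same $c_{hw}$ (taken here as $t_{init} + t_{irq}$ at 16384 bytes from Table 1), although in practice the cost of one call over an aggregated buffer may be somewhat higher. The function names are hypothetical.

```python
import math

GRAIN_SIZE = 16 * 1024   # bytes; the 16 KB grain size used in the text's example
C_HW_US = 1.866 + 7.984  # t_init + t_irq at 16384 bytes, as read from Table 1


def grain_blocks(request_size, grain_size=GRAIN_SIZE):
    """Number of grain blocks a request is segmented into."""
    return math.ceil(request_size / grain_size)


def invocations(request_size, blocks_per_call=1, grain_size=GRAIN_SIZE):
    """Accelerator invocations when blocks_per_call grain blocks share one call."""
    return math.ceil(grain_blocks(request_size, grain_size) / blocks_per_call)


def invocation_overhead_us(request_size, blocks_per_call=1, c_hw_us=C_HW_US):
    """Total invocation overhead, assuming each call costs the same c_hw."""
    return invocations(request_size, blocks_per_call) * c_hw_us


if __name__ == "__main__":
    request = 64 * 1024  # the 64 KB request drawn in Figures 8 and 9
    n = grain_blocks(request)
    print(n, invocations(request), invocations(request, blocks_per_call=n))
    print(invocation_overhead_us(request), invocation_overhead_us(request, n))
```

For the 64 KB request of Figures 8 and 9, the model gives four invocations without aggregation and one with all four grain blocks aggregated, so the invocation overhead drops to 1/n of its original value.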

5. Maximize Resource Utilization with Minimal Management Cost

Most Web Server applications must process highly concurrent requests in time, which is why we integrate multiple CPUs and accelerators in a single server. On a heterogeneous platform with both CPUs and hardware engines, it is important to exploit the respective computing strengths of each. However, how can the hardware engines be fully utilized at the least system cost? How should the concurrent processes be scheduled to obtain the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy for resource allocation from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and to obtain the best system performance through hardware and software codesign.

For a clearer description, we first define the parameters involved, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume there are $N$ CPUs available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. Let $P$ be the total number of available processes in the system, of which $P_{hw}$ processes are activated for the hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is denoted as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw}. \tag{4}$$
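As a rough numerical illustration of equation (4), the sketch below treats $B_{sw}$ as the AES-NI bandwidth of a single CPU, which is an interpretation suggested by the multiplication with the CPU count rather than something the text states. The sample inputs are the single-CPU figures reported later for the AES-128-CBC example (about 7.52 GB/s for hardware and 1.01 GB/s for one CPU with AES-NI).

```python
def theoretical_max_bandwidth(b_hw, b_sw_per_cpu, n_total, n_hw):
    """Equation (4): B_max = B_hw + (N - N_hw) * B_sw."""
    if not 0 <= n_hw <= n_total:
        raise ValueError("n_hw must lie between 0 and n_total")
    return b_hw + (n_total - n_hw) * b_sw_per_cpu


if __name__ == "__main__":
    # 16 CPUs in total, 1 of them reserved for accelerator invocation.
    print(round(theoretical_max_bandwidth(7.52, 1.01, 16, 1), 2))
```

The result is an upper bound only: it assumes software bandwidth scales linearly with the remaining CPUs and that the two paths do not contend for shared resources.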

The core of the MM algorithm is to find an allocation strategy, with parameters $N_{hw}$ and $P_{hw}$, that obtains the best system performance from the available resources through hardware and software codesign. The parameter $N_{hw}$ found by the MM algorithm should be the least number of CPUs needed for accelerator management, while the parameter $P_{hw}$ should be the least number of processes needed. Through the MM strategy, we can fully utilize the hardware accelerators while occupying the least system resources. The remaining CPUs are thus free for software computing with AES-NI or for other operations as needed [29, 30].

The major steps of the MM strategy are as follows:

(S1) Activate $i$ CPUs ($i = 1, 2, \ldots, N$) with $i$ processes for software encryption, and obtain the maximum encryption bandwidth $SP_i$ when the CPUs are completely loaded.


Figure 8: Existing processing flow for 64 KB of data without data aggregation. [Diagram: in the Record Protocol Unit, the original data is segmented into four 16384-byte blocks; each block gains a 48-byte hash, a 16-byte explicit IV, and 16 bytes of padding (16464 bytes in total), is encrypted in the crypto unit, and receives a 5-byte header to form a TCP data package. The plaintexts are sent to the hardware one by one.]

Figure 9: Proposed processing flow for 64 KB of data with data aggregation. [Diagram: in the Record Protocol Unit, the original data is segmented into four 16384-byte blocks; each block gains a 48-byte hash, a 16-byte explicit IV, and 16 bytes of padding (16464 bytes in total). All plaintexts are sent to the Hardware Crypto Unit at once, and each resulting ciphertext receives a 5-byte header to form a TCP data package.]


Table 2: Parameters for MM algorithm.

Parameter     Description
$N$           The number of available CPUs in the system
$N_{sw}$      The number of CPUs used for software encryption with AES-NI
$N_{hw}$      The number of CPUs responsible for the processes of hardware invocation
$P_{sw}$      The processes invoked for software encryption with AES-NI
$P_{hw}$      The processes invoked for hardware encryption
$P_{total}$   The total number of active processes
$HP_{i,j}$    The maximum bandwidth with hardware encryption with $i$ CPUs and $j$ processes, completely loaded
$SP_i$        The maximum bandwidth with software encryption with $i$ CPUs and $i$ processes, completely loaded
$P_{i,j}$     The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$)

Figure 10: CPUs and processes allocation with MM strategy. [Diagram: the Web Server allocates the CPUs into two groups: $N_{sw}$ CPUs running the $P_{sw}$ processes invoked for software encryption with AES-NI, and $N_{hw}$ CPUs running the $P_{hw}$ processes invoked for hardware encryption on the HACs.]

(S2) Activate $i$ CPUs ($i = 1, 2, \ldots, N$), increase the number of processes $j$ for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ when the CPUs are completely loaded.

(S3) Calculate the performance difference between hardware encryption and software computing: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases $i = 1, 2, \ldots, N$, follow steps (S1), (S2), and (S3) one by one and obtain the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum $P_{i,j}$ from (S4), $\max(P_{i,j})$ over $i = 1, 2, \ldots, N$. From $\max(P_{i,j})$ the parameters can be determined: the number of CPUs used for accelerator invocation is $N_{hw} = i$; the number of processes for hardware encryption is $P_{hw} = j$; the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$; and the corresponding number of processes is $P_{sw} = N_{sw}$. The total number of processes to be activated is $P_{total} = P_{sw} + P_{hw}$.

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example for exploring the parameters $i$ and $j$, assuming $N = 16$. The detailed process followed by MM is as follows:

(S1) Set the working mode to software encryption through the Adaptive Scheduler in ACSA. All encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, increase the workload until the CPU is completely loaded, and obtain the maximum encryption bandwidth, 1.01 GB/s.

(S3) Activate 2∼16 CPUs in turn, with 2∼16 processes respectively. As in (S2), the maximum encryption bandwidth is obtained for each number of CPUs and processes. The resulting $SP_i$ ($i = 1, 2, \ldots, 16$) are summarized in Figure 12.

(S4) Set the working mode to hardware encryption through the Adaptive Scheduler in ACSA. All encryption requests are processed through the hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, increase the workload until the CPU is completely loaded, and obtain the maximum encryption bandwidth in the hardware working mode, 7.52 GB/s.

(S6) Keep one CPU activated and invoke 2∼n processes in turn. Record the maximum encryption bandwidth when the CPU is completely loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n − 1 processes, the parameter $j$ is fixed at n − 1 and the test stops.

(S7) Activate 2∼16 CPUs in turn and follow steps (S5) and (S6) to determine the number of processes at which the maximum encryption bandwidth is achieved. The number of processes for each number of CPUs is recorded in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between hardware and software encryption. The results are shown in Figure 13.

Algorithm 1: The strategy for MM algorithm.

Input:
    $N$: the number of available CPUs in the system
Output:
    $N_{sw}$: the number of CPUs used for software encryption with AES-NI
    $N_{hw}$: the number of CPUs responsible for the processes of hardware invocation
    $P_{sw}$: the processes invoked for software encryption with AES-NI
    $P_{hw}$: the processes invoked for hardware encryption
    $P_{total}$: the total number of active processes

/* Test the maximum bandwidth with AES-NI for different CPU numbers; in general, the process number is equal to the CPU number */
(1)  for CPU number $i$ from 1 to $N$ do
(2)      calculate $SP_i$    /* $SP_i$ is the maximum bandwidth when the CPU number is $i$ */
(3)  end for
/* Test the maximum bandwidth with hardware for different CPU numbers and process numbers */
(4)  for CPU number $i$ from 1 to $N$ do
(5)      for process number $j$ from 1 to 32 do
(6)          calculate $HP_{i,j}$    /* $HP_{i,j}$ is the maximum bandwidth when the CPU number and process number are $i$ and $j$ */
(7)      end for
(8)  end for
/* Calculate the maximum performance difference between AES-NI computing and hardware encryption */
(9)  let maxdiff ← $P_{1,1}$
(10) for CPU number $i$ from 1 to $N$ do
(11)     for process number $j$ from 1 to 32 do
(12)         $P_{i,j}$ ← ($HP_{i,j} - SP_i$)
(13)         if (maxdiff < $P_{i,j}$) then
(14)             maxdiff ← $P_{i,j}$
(15)             $N_{hw}$ ← $i$
(16)             $P_{hw}$ ← $j$
(17)             $N_{sw}$ ← ($N - i$)
(18)             $P_{sw}$ ← $N_{sw}$
(19)             $P_{total}$ ← ($P_{sw} + P_{hw}$)
(20)         end if
(21)     end for
(22) end for

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

N      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
P_hw   12  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16
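The stopping rule of step (S6) can be sketched as a simple search. The function below is an illustration only: `measure_bandwidth` is a hypothetical caller-supplied measurement hook, the upper bound of 32 processes is borrowed from the loop bound in Algorithm 1, and the bandwidth curve in the usage example is made up for demonstration rather than measured.

```python
def saturating_process_count(measure_bandwidth, max_processes=32):
    """Return (j, bandwidth) at the point where adding a process stops helping.

    measure_bandwidth(j) is a caller-supplied (hypothetical) function that
    returns the fully loaded bandwidth measured with j processes.
    """
    best_j = 1
    best_bw = measure_bandwidth(1)
    for j in range(2, max_processes + 1):
        bw = measure_bandwidth(j)
        if bw <= best_bw:  # no gain over j - 1 processes: stop, as in (S6)
            break
        best_j, best_bw = j, bw
    return best_j, best_bw


if __name__ == "__main__":
    # Made-up bandwidth curve (GB/s) for illustration only; not measured data.
    synthetic = [3.0, 5.0, 6.5, 7.2, 7.5, 7.5, 7.4]
    result = saturating_process_count(lambda j: synthetic[j - 1],
                                      max_processes=len(synthetic))
    print(result)
```

Stopping at the first non-improving step keeps the number of measurements small, at the price of assuming that bandwidth grows monotonically up to the saturation point.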

From Figure 13, we find the maximum difference at $N_{hw} = 1$, $P_{hw} = 12$; that is, the number of CPUs is 1 and the corresponding number of processes is 12. Based on the test results, 27 processes should be invoked for the system, of which 12 are scheduled for hardware management and the remaining 15 can be allocated to software encryption with AES-NI if no other work is needed.
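The selection step of the MM strategy can be sketched in a few lines. This is an illustrative reading of Algorithm 1, not the authors' implementation. The bandwidths are the values read from Figure 12 in GB/s, and because only the best hardware bandwidth per CPU count is reported, the sketch supplies a single $(i, j)$ pair per CPU count, with $j$ taken from Table 3, instead of the full $HP_{i,j}$ grid.

```python
# Software bandwidth SP_i (GB/s) for i = 1..16 CPUs, as read from Figure 12.
SP = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
# Best hardware bandwidth (GB/s) per CPU count, as read from Figure 12.
HP_BEST = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
           7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
# Process count at which that bandwidth was reached (Table 3).
P_HW_BEST = [12] + [16] * 15


def mm_allocate(sp, hp):
    """Pick (i, j) maximizing HP[i, j] - SP[i].

    sp[i - 1] is the software bandwidth with i CPUs; hp maps
    (cpus, processes) to the measured hardware bandwidth.
    """
    n_total = len(sp)
    best = None
    for (i, j), bandwidth in sorted(hp.items()):
        diff = bandwidth - sp[i - 1]
        if best is None or diff > best[0]:  # strict '>' keeps the smallest (i, j) on ties
            best = (diff, i, j)
    _, n_hw, p_hw = best
    n_sw = n_total - n_hw
    p_sw = n_sw  # one software process per remaining CPU
    return {"N_hw": n_hw, "P_hw": p_hw, "N_sw": n_sw,
            "P_sw": p_sw, "P_total": p_sw + p_hw}


if __name__ == "__main__":
    hp = {(i + 1, P_HW_BEST[i]): HP_BEST[i] for i in range(16)}
    print(mm_allocate(SP, hp))
```

With these inputs the largest difference occurs at one CPU with 12 processes, which leaves 15 CPUs and 15 processes for software encryption and 27 processes in total, in line with the allocation described above.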


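Figure 11: The working flow of (a) software encryption and (b) hardware encryption. [Diagram: (a) in user space, the Web Server calls OpenSSL, which uses the software crypto lib; (b) the Web Server and OpenSSL run in user space, and the request passes to kernel space, which contains CryptoDev, the Crypto API, the hardware driver, and the hardware engine.]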

Figure 12: The encryption performance with hardware/software working mode at different CPU numbers. [Chart titled "AES-256-CBC Crypto Performance"; x-axis: the number of CPUs; y-axis: encryption bandwidth (GB/s). Data labels:]

CPUs   Software (GB/s)   Hardware (GB/s)
1      1.01              7.52
2      2.04              7.61
3      3.06              7.62
4      4.08              7.62
5      5.05              7.62
6      6.13              7.58
7      7.16              7.61
8      8.17              7.62
9      9.19              7.62
10     10.22             7.62
11     11.22             7.62
12     12.26             7.62
13     13.28             7.62
14     14.31             7.62
15     15.33             7.62
16     15.97             7.62

Figure 13: The performance differences between hardware and software encryption. [Chart; x-axis: the number of CPUs; y-axis: performance differences (GB/s). Data labels:]

CPUs   Difference (GB/s)
1      6.51
2      5.57
3      4.56
4      3.54
5      2.57
6      1.45
7      0.45
8      -0.55
9      -1.57
10     -2.60
11     -3.60
12     -4.64
13     -5.66
14     -6.69
15     -7.71
16     -8.35


Table 4: Hardware environment of testing.

Hardware       Number   Configurations
ARM Server     2        CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB
Network Card   4        10 Gbps each
Net cable      4        2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables

Table 5: Software environment of testing.

Software Configuration   Software Version
Operating System         Linux-4127
OpenSSL                  OpenSSL-1.0.2j
Nginx                    Nginx-1116
Benchmark                ab (apache benchmark)

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients). [Diagram: the server and the testing machine (client) are connected by four 10 Gbps links through 10 Gbps Ethernet ports.]

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we establish a test platform for ACSA with 40 Gbps of network bandwidth. As shown in Table 4, we deployed 2 Web Servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate the testing workloads. Both machines are equipped with 16 ARM processing units, with a main frequency of 2.1 GHz and a memory size of 128 GB. To satisfy the highly concurrent requests from a large number of clients, we established the networking environment with 4 network cards, each contributing 10 Gbps of bandwidth.

As listed in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4127, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (apache benchmark) [32] is a widely used testing toolset for website performance. We choose this benchmark for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed with the standard benchmark speed in OpenSSL, with which we can measure the maximum encryption bandwidth of a single algorithm. The tested encryption algorithms are AES-256-CBC and 3DES-CBC; AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM CPUs. The performance is measured in MB/s, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service that ACSA can provide. We use the testing machine to generate HTTPS requests from different clients. Over the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (requests per second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to observe the performance for different data sizes. We take full advantage of the 4 network cards to run 32 ab processes, so as to maintain testing pressure with a high workload.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with the CPU, in which the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, in which the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy; this working flow is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16, in which the encryption flow with AES-NI is the best case for software encryption.


Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC. [Chart titled "The encryption bandwidths with different working flows (AES-256-CBC)"; x-axis: block size; y-axis: encryption bandwidth (MB/s). Data labels:]

Block size   SW        SW-NI      HW        HW-SW codesign
4 KB         1272.68   11772.10   1650.75   11446.83
8 KB         1272.31   11804.24   3188.08   11635.55
16 KB        1269.26   11817.64   5891.48   12576.91
32 KB        1263.54   11809.72   5932.79   13948.63
64 KB        1263.99   11802.95   5954.03   15952.37
128 KB       1258.23   11810.72   5964.93   16989.31
256 KB       1247.49   11733.47   5970.58   16956.30
512 KB       1251.86   11658.08   5966.48   16855.39

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW. [Chart titled "16 CPUs, 32 processes, AES-256-CBC HW-SW co-design encryption bandwidths improvement"; x-axis: block size; y-axis: encryption bandwidth improvement (%). Data labels:]

Block size   vs SW-NI   vs HW     vs SW
4 KB         -2.76      593.43    799.43
8 KB         -1.43      264.97    814.52
16 KB        6.42       113.48    890.89
32 KB        18.11      135.11    1003.93
64 KB        35.16      167.93    1162.06
128 KB       43.85      184.82    1250.25
256 KB       44.51      184.00    1259.23
512 KB       44.58      182.50    1246.43


Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs. [Chart titled "The encryption bandwidths of HW with different number of CPU (3DES-CBC)"; x-axis: block size (4 KB to 512 KB); y-axis: encryption bandwidth (MB/s); one series per CPU count from 1 to 16.]

As we can see from Figures 15 and 16, encryption with hardware engines achieves better performance than the crypto lib but worse than AES-NI. The reason is the invocation cost of the hardware engines, with its context switches and mode switches, as analyzed in Section 4.1. Besides, with AES-NI the encryption function is accelerated directly through instructions, and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We obtain a 799%∼1246% performance improvement compared with software encryption and a 593%∼182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18%∼44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for checking the data size and making the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI alone. However, this influence decreases as the data size increases, because the scheduling check occurs less frequently. If the data size is 16 KB or larger, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign yields the best encryption bandwidth compared with software-only or hardware-only encryption.
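The improvement figures quoted above appear to be relative gains over each baseline. The sketch below assumes the formula (new / baseline − 1) × 100, which is not stated in the text, and applies it to the 4 KB bandwidths read from Figure 15 in MB/s.

```python
def improvement_percent(new_bandwidth, baseline_bandwidth):
    """Relative gain of new_bandwidth over baseline_bandwidth, in percent."""
    return (new_bandwidth / baseline_bandwidth - 1.0) * 100.0


# Bandwidths in MB/s at a 4 KB block size, as read from Figure 15.
SW, SW_NI, HW, CODESIGN = 1272.68, 11772.10, 1650.75, 11446.83

if __name__ == "__main__":
    for name, base in (("SW-NI", SW_NI), ("HW", HW), ("SW", SW)):
        print(name, round(improvement_percent(CODESIGN, base), 2))
```

Under this assumption the three results agree, to within rounding, with the 4 KB entries of Figure 16 (−2.76, 593.43, and 799.43), which supports reading the figure's values as percentages.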

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, and there is no point in applying the data filter and dynamic scheduling. Therefore, we applied only the MM strategy, to fully utilize the hardware engines with minimal management cost. We first evaluated the encryption bandwidth with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators perform poorly when the data size is small, because of the cost induced by the many context and mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can obtain almost the best encryption bandwidth with only four CPUs.

To obtain maximum performance from both the hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a clearer presentation, we tested three working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines, using only four CPUs (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As we can see from Figure 18, even though 16 CPUs are responsible for the 3DES encryption, the encryption bandwidth is still very small (only 271 MB/s). Since it is computation intensive, this algorithm occupies a large share of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for larger block sizes, because smaller data blocks induce more invocation cost. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

Figure 18: Encryption bandwidth for different working flows. [Chart titled "The encryption bandwidths with different working flows (DES3-CBC)"; x-axis: block size; y-axis: encryption bandwidth (MB/s). Data labels:]

Block size   SW       HW with 4 CPUs   HW-SW codesign
4 KB         271.69   1270.39          1921.24
8 KB         271.68   2429.30          3536.86
16 KB        271.20   3453.60          3667.12
32 KB        271.78   3489.52          3686.88
64 KB        271.66   3495.98          3696.00
128 KB       271.39   3499.26          3701.05
256 KB       271.74   3500.83          3703.26
512 KB       271.69   3499.89          3701.54

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW. [Chart titled "16 CPUs, 32 processes, DES3-CBC HW-SW co-design encryption bandwidths improvement"; x-axis: block size; y-axis: encryption bandwidth improvement (%). Data labels:]

Block size   vs SW     vs HW with 4 CPUs
4 KB         607.14    51.23
8 KB         1201.85   45.59
16 KB        1252.18   6.18
32 KB        1256.57   5.66
64 KB        1260.52   5.72
128 KB       1263.74   5.77
256 KB       1262.80   5.78
512 KB       1262.41   5.76

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, HW-SW codesign improves the bandwidth by nearly 12 times compared with software encryption. The improvement is only 6 times for small data blocks (4 K), due to the poorer utilization of the hardware engines. Compared with hardware-only encryption, the improvement from HW-SW codesign is not so obvious for large data blocks, because software encryption without AES-NI contributes little. Still, we obtain a 45%∼51% improvement for small data blocks with HW-SW codesign, and an additional 200∼300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU capacity be used for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hour the user can take full advantage of both hardware and software for maximum overall performance; during normal working time, with fewer encryption requests, the user can allocate only the management resources needed for the HACs and offload as much as possible.

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384. [Chart titled "16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384 network bandwidth"; x-axis: page size; y-axis: network bandwidth (MB/s). Data labels:]

Page size   SW       SW-NI     HW        HW-SW codesign
4 KB        258.41   342.91    228.04    338.96
8 KB        419.18   613.02    409.20    608.02
16 KB       504.74   857.61    573.66    851.59
32 KB       608.54   1182.25   814.16    1173.38
64 KB       679.39   1451.02   980.63    1416.41
128 KB      720.47   1653.04   1179.44   1705.11
256 KB      741.12   1761.68   1276.61   1818.99
512 KB      756.26   1831.19   1336.44   1884.28

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we obtain the encryption bandwidth and the CPU idle in different working situations. We tested diverse page sizes to observe how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (megabytes of encrypted data per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is given in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
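The conversion from RPS to network bandwidth is not spelled out in the text. A plausible reading, sketched below, multiplies the request rate by the page size and takes 1 MB as 1024 KB, counting page payload only and ignoring protocol overhead. The two sample inputs are the SW entries of the 16-process RPS table for 4 KB and 8 KB pages, read as 66153.10 and 53654.50 requests per second.

```python
def rps_to_mb_per_s(requests_per_second, page_size_kb):
    """Convert requests per second to payload bandwidth in MB/s (1 MB = 1024 KB)."""
    return requests_per_second * page_size_kb / 1024.0


if __name__ == "__main__":
    # SW row of the 16-process RPS table: (page size in KB, RPS).
    for page_kb, rps in ((4, 66153.10), (8, 53654.50)):
        print(page_kb, round(rps_to_mb_per_s(rps, page_kb), 2))
```

Under this assumption the two results agree, to within rounding, with the SW bandwidths of 258.41 MB/s and 419.18 MB/s shown for 4 KB and 8 KB pages in Figure 20.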

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small block sizes; for example, the maximum improvement reaches 48.64% for the 4 KB page. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of size detection and scheduling decisions. If the page size is 128 KB or larger, the overall performance exceeds that of software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. With HW-SW codesign, however, there is 12.51%∼18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%∼23.83%, and there is also a 2.85%∼3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)   4          8          16         32         64         128        256       512
SW               66153.10   53654.50   32303.20   19473.40   10870.30   5763.76    2964.48   1512.51
SW-NI            87784.10   78466.20   54887.00   37831.90   23216.30   13224.30   7046.71   3662.37
HW               58378.50   52377.60   36714.30   26053.10   15690.00   9435.54    5106.42   2672.87
HW-SW codesign   86679.40   77827.00   45499.50   34357.70   22623.60   13670.90   7275.66   3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)   4          8          16         32         64         128        256       512
SW               65885.90   52987.80   32030.00   19391.20   10803.80   5744.14    2964.18   1503.48
SW-NI            86302.50   77827.15   53017.20   37116.10   22920.40   13092.50   7020.85   3657.35
HW               58499.10   54891.60   37946.40   26972.10   16401.90   10041.90   5486.14   2881.66
HW-SW codesign   85177.20   77211.90   46404.00   36166.60   24875.10   15347.70   8325.19   4391.45

[Figure 21 (chart): 16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle versus page size. Axes: network bandwidth improvement (%) and CPU idle (%). Plotted values:]

Page size                                  4 KB   8 KB   16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)   -1.15  -0.81  -0.70  -0.75  -2.38  3.15    3.25    2.90
HW-SW co-design vs HW improvement (%)      48.64  48.59  48.45  44.12  44.44  44.57   42.49   40.99
CPU idle (%)                               1.50   1.60   1.70   1.80   15.50  18.00   20.00   20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384.

If a higher network bandwidth and better encryption performance are desired, the remaining CPU resource can be put to use. As an example of the trade-off between CPU idle and performance, we present test results with more processes below.

(2) 16 CPUs, 32 Processes. As Figure 21 shows, the working flow with the HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with the HW-SW codesign, we increased the number of processes to use the idle CPU. We ran the tests with 32 processes; the RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 7.

For a clear comparison and analysis, we converted the RPS to network bandwidth in MB/s and present the results in Figure 22.
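The conversion from RPS to bandwidth can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes each request returns one page of the stated size, that 1 MB = 1024 KB, and that the Table 7 entries carry two decimal places. Under these assumptions the SW row of Table 7 reproduces the SW values plotted in Figure 22.

```python
# Sketch: convert requests per second (RPS) to network bandwidth in MB/s.
# Assumptions (not stated in the paper): one page per request, 1 MB = 1024 KB.

def rps_to_mbps(rps: float, page_kb: int) -> float:
    """Return the bandwidth in MB/s for a given RPS and page size in KB."""
    return rps * page_kb / 1024.0

# SW working flow with 32 processes, read from Table 7 (page size in KB -> RPS).
sw_rps = {4: 65885.90, 512: 1503.48}

sw_bandwidth = {kb: round(rps_to_mbps(rps, kb), 2) for kb, rps in sw_rps.items()}
print(sw_bandwidth)
```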

As Figure 22 shows, the HW-SW codesign achieves a 49% to 52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. Unlike the 16-process case, however, the effect is more pronounced for large block sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with the HW-SW codesign is slightly lower than with AES-NI, because of the cost of data size detection and the scheduling decision. If the page size is 64 KB or larger, the overall performance exceeds that of both software-only and hardware-only encryption.

[Figure 22 (chart): 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth (MB/s) versus page size. Plotted values:]

Page size         4 KB    8 KB    16 KB   32 KB    64 KB    128 KB   256 KB   512 KB
SW                257.37  413.97  500.47  605.98   675.24   718.02   741.05   751.74
SW-NI             337.12  608.26  828.39  1159.88  1432.53  1636.56  1755.21  1828.68
HW                228.51  421.03  592.91  842.88   1025.12  1255.24  1371.54  1440.83
HW-SW co-design   334.03  603.22  820.70  1159.42  1562.83  1923.08  2092.16  2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384.

[Figure 23 (chart): 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle versus page size. Axes: network bandwidth improvement (%) and CPU idle (%). Plotted values:]

Page size                                  4 KB   8 KB   16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)   -0.92  -0.83  -0.93  -0.04  9.10   17.51   19.20   20.11
HW-SW co-design vs HW improvement (%)      46.17  43.27  38.42  37.55  52.45  53.20   52.54   52.44
CPU idle (%)                               2.50   2.00   2.00   1.80   1.50   1.80    1.50    1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384.

To illustrate the advantage of the HW-SW codesign more clearly, we summarize the network improvement and CPU idle in Figure 23. As the figure shows, for small pages the HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption path. For small pages, hardware engines do not outperform AES-NI because of the context-switch cost, so ACSA automatically adopts software encryption with AES-NI. For large pages, the CPU resource saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still obtain about 20% additional network bandwidth for a page size of 512 KB.
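The improvement percentages quoted here appear to follow the usual relative-gain formula, (new - baseline) / baseline. The sketch below is an assumption-labeled check rather than the authors' script: it takes the 512 KB bandwidth values plotted in Figure 22 (read with two decimal places, in MB/s) and recomputes the gains of the HW-SW codesign over SW-NI and over HW.

```python
# Sketch: relative bandwidth improvement of one working flow over a baseline.
# The formula is an assumption inferred from the plotted numbers.

def improvement_pct(new: float, baseline: float) -> float:
    """Return the relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# Bandwidth in MB/s at 512 KB pages with 32 processes, read from Figure 22.
bandwidth_512kb = {"SW-NI": 1828.68, "HW": 1440.83, "HW-SW": 2196.43}

vs_ni = improvement_pct(bandwidth_512kb["HW-SW"], bandwidth_512kb["SW-NI"])
vs_hw = improvement_pct(bandwidth_512kb["HW-SW"], bandwidth_512kb["HW"])
print(f"vs SW-NI: {vs_ni:.2f}%, vs HW: {vs_hw:.2f}%")
```

Both results agree with the 512 KB column of Figure 23 to two decimal places.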


[Figure 24 (chart): 16 CPUs, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle versus page size, for 16 and 32 processes. Axes: network bandwidth improvement (%) and CPU idle (%). Plotted values:]

Page size                                                 4 KB   8 KB   16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement, 16 processes (%)    -1.15  -0.81  -0.70  -0.75  -2.38  3.15    3.25    2.90
HW-SW co-design vs SW-NI improvement, 32 processes (%)    -0.92  -0.83  -0.93  -0.04  9.10   17.51   19.20   20.11
CPU idle, 16 processes (%)                                1.50   1.60   1.70   1.80   15.50  18.00   20.00   20.50
CPU idle, 32 processes (%)                                2.50   2.00   2.00   1.80   1.50   1.80    1.50    1.60

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384.

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)   4         8         16        32        64       128      256      512
SW               36334.70  25207.40  12791.60  6900.93   3589.73  1833.76  924.40   466.12
HW               58275.70  50717.80  25772.70  14525.70  7645.65  3902.54  1968.62  990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, the HW-SW codesign achieves a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology makes a better trade-off between performance and CPU idle possible. It is possible to obtain the same network bandwidth and RPS as AES-NI while still having additional CPU resource available. The user can use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI, and software encryption therefore contributes little to the HW-SW codesign, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI; the other is hardware encryption with the crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6 to 2.1 times the performance of software encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same test conditions, there is a 60% to 112% improvement compared with software encryption. Besides the performance improvement, 13.62% to 89.54% of the CPU remains idle with hardware encryption.

According to the test in OpenSSL, the maximum performance improvement is 12 times. However, the RPS improvement in end-to-end testing is only about 2 times. The limitation lies in the testing platform: we found that the software decryption speed on the client had become the bottleneck, and all the client CPUs were completely loaded.

[Figure 25 (chart): 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth (MB/s) versus page size. Plotted values:]

Page size   4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
SW          141.93  196.93  199.87  215.65  224.36  229.22  231.10  233.06
HW          227.64  396.23  402.70  453.93  477.85  487.82  492.16  495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

[Figure 26 (chart): 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle versus page size. Axes: network bandwidth improvement (%) and CPU idle (%). Plotted values:]

Page size                   4 KB   8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW vs SW improvement (%)    60.39  101.20  101.48  110.49  112.99  112.82  112.96  112.58
CPU idle (%)                13.62  15.40   53.15   70.40   75.07   86.04   88.17   89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which adopts the crypto mode adaptively and dynamically according to the request characteristics and the system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance in all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for computation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. By establishing a 40 Gbps network, we were able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the 3DES encryption algorithm, which is not supported by AES-NI, we obtained about 12 times acceleration with accelerators. For AES, a typical encryption algorithm supported by instruction acceleration, we obtained a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, users can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed methodology applies regardless of the type of CPU and accelerator. Moreover, this work is based on an ARM server, so it can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), the Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article David Brumley and Dan Boneh. Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM Cortex™-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.


[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.




[Figure 1 (diagram). Components: Web Server; Crypto Processes; Synchronous/Asynchronous Communication Layer; HAC Driver; Hardware Crypto Engines; Synchronous Level; Asynchronous Level. Edge labels: generate multiple crypto tasks; send tasks and sleep; send tasks; interrupt; callback functions; awake the waiting process.]

Figure 1: Logical architecture of ACSA.

[Figure 2 (diagram). Components: HAC0, HAC1, ..., HAC9; Register; Controller; processing queues PQ0, PQ1, PQ2, ..., PQ15; CPU0, CPU1, CPU2, ..., CPU15.]

Figure 2: System architecture of ACSA.

3.1. System Architecture of ACSA. We chose a lightweight web server, TAISHAN, as our hardware platform. We enabled 16 ARM cores and 10 HACs (Hardware Acceleration Components) for our research. The ARM cores are Cortex-series CPUs. The HACs are configured as security accelerators in ACSA and support typical algorithms such as AES-CBC [24], AES-GCM [25], 3DES [26], and SHA1 [26, 27]. Sixteen processing queues (PQs) handle the interaction between software and hardware, and each queue can receive multiple requests issued by the engine driver. The software (hardware driver) pushes tasks to the crypto engines through the PQs to carry out the required acceleration tasks. Each PQ can be enabled independently, and its starting address and length can be configured individually through registers. The designer or user can flexibly configure the correspondence between PQs and CPUs. As shown in Figure 2, a CPU uses a write pointer to signal pending tasks in its processing queue. The controller fetches an entry from a processing queue according to the result of WRR (Weighted Round Robin) arbitration, then parses the task and invokes HACs for execution. Once the computation finishes, an interrupt is generated to inform the upper layer so that the results can be returned.
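The WRR arbitration over the processing queues can be illustrated with a small software model. This is only a sketch of generic weighted round robin, not the controller's actual logic: the queue names, task labels, and weights are hypothetical, and every weight is assumed to be at least 1.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

def wrr_dispatch(queues: Dict[str, Deque[str]],
                 weights: Dict[str, int]) -> List[Tuple[str, str]]:
    """Drain all queues in weighted round-robin order.

    In each round, a queue may hand over up to `weights[name]` pending tasks
    before the arbiter moves on to the next queue. Weights must be >= 1.
    """
    order: List[Tuple[str, str]] = []
    while any(queues.values()):          # an empty deque is falsy
        for name, queue in queues.items():
            for _ in range(weights[name]):
                if not queue:
                    break
                order.append((name, queue.popleft()))
    return order

# Hypothetical example: PQ0 has twice the weight of PQ1.
queues = {"PQ0": deque(["a0", "a1", "a2"]), "PQ1": deque(["b0", "b1"])}
weights = {"PQ0": 2, "PQ1": 1}
schedule = wrr_dispatch(queues, weights)
print(schedule)
```

With these weights, PQ0 is served two entries per round and PQ1 one, so a heavier queue drains faster without starving the lighter one.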

3.2. Workflow of ACSA. As shown in Figure 3, the working flow of ACSA is divided into five logic levels: the application layer for HTTPS access, the OpenSSL layer for SSL/TLS processing, the CryptoDev layer, the HAC driver, and the hardware layer with accelerators. The application layer and the OpenSSL layer work in user space, while the CryptoDev layer and the HAC driver work in kernel space.

[Figure 3 (diagram). User space: Clients (HTTPS requests and responses); Web Server (application layer); SSL/TLS lib with Handshake API, Crypto API, Adaptive Scheduler, fault detection, and software crypto engine (OpenSSL layer). Kernel space: CryptoDev (CryptoDev layer, construct request); HAC Driver. Hardware layer: Hardware Crypto Engines. Edge label: fault signal.]

Figure 3: Working flow of ACSA.

For the application layer, we adopted the typical web server Nginx for HTTPS service. The web server monitors client requirements, responds to clients' requests, and is responsible for load balancing with multiple threads. During this process, the web server authenticates the clients' identity and checks security through the crypto functions provided by OpenSSL [28]. If an HTTPS connection is enabled, the web server transfers the required data to the lower logic layer for further processing.

The OpenSSL subsystem responds to the crypto requirements from the web server and interacts with CryptoDev to invoke the HACs. The roles of the OpenSSL layer in ACSA include the following:

(1) Use the provided toolkit for secure connections with the SSL/TLS protocols.

(2) Integrate the Adaptive Scheduler for encryption mode switching, which enables reasonable task scheduling between software computing and hardware encryption.

(3) Extend the layer with the resource allocation strategy MM.

(4) If software computing is adopted, invoke the cryptography library to perform the encryption.

(5) Interact with the lower CryptoDev layer. If the data requires hardware encryption, encapsulate the client requirements as EVP and then transmit the EVP blocks to CryptoDev through the Crypto API. After the hardware completes, return the generated ciphertext to the application layer.

The CryptoDev subsystem exists as a module in the Linux kernel. Upwards, CryptoDev interacts with the OpenSSL layer: it receives the delivered request data and returns the completed ciphertext. Downwards, CryptoDev transforms the received EVPs into a data structure that the HAC driver can recognize. CryptoDev acts as the middle layer between synchronous and asynchronous communication.

Memory copies consume substantial system resources in hardware acceleration systems, especially when large amounts of data move between hardware and software. To solve this problem, we used an efficient zero-copy transmission scheme in the CryptoDev layer. We extended the CryptoDev subsystem to support both AEAD (Authenticated Encryption with Associated Data) and non-AEAD encryption requests. For all requests, the original data and the encrypted results are organized as scatterlists. In our design, the scattered virtual addresses of the DMA buffer are arranged in a list, so that data transmission between memory and the HACs can be achieved through one DMA operation.

The HAC driver is a loadable module in the Linux kernel and plays an important role in accelerator invocation. The driver performs four major tasks: initialization of the hardware engines, device loading and unloading, registration of encryption algorithms, and interrupt processing. The driver provides an API for the kernel to receive data delivered from CryptoDev and constructs it into data blocks that the HACs can parse. The driver then delivers the prepared encryption data to the hardware engines through the PQs. Once an interrupt signal from the HACs is detected, the driver fetches the computing results and returns them to the upper layer for further transmission.

[Figure 4 (diagram). Requests pass through the Adaptive Scheduler to HAC1, HAC2, ..., HACn; each HAC goes through Init, Computing, and Interrupt stages and raises its own interrupt (irq 1, irq 2, ..., irq n) to the CPU.]

Figure 4: Interrupt processing flow without interrupt integration.

Hardware accelerators are the computing units for encryption and decryption. These components fetch tasks from the processing queues and can compute in parallel. An HAC raises an interrupt when an encryption computation completes, informing the upper layer so that the results can be returned.

3.3. Dynamic Interrupt Aggregation. Interrupts are useful for returning results in acceleration systems. However, we find that interrupt processing also induces considerable additional overhead in web server applications when the number of concurrent requests is high. As illustrated in Figure 4, computing tasks are assigned to HACs through the Adaptive Scheduler (see Section 4 for details of the Adaptive Scheduler). Each accelerator works independently, and one interrupt is raised per encryption task to signal that the results are ready once the computation completes. We analyzed the overhead induced by each interrupt, which includes the following:

(1) When an interrupt is generated, the CPU invokes interrupt processing, which requires one context switch. This overhead is denoted as $C_{cs}$.

(2) The interrupt handling routine reads the encrypted results and responds to the peripherals. The cost induced by these operations is denoted as $C_{irq}$.

(3) Once the interrupt handling routine is complete, the upper application returns and continues processing, which causes one more context switch. This cost is again $C_{cs}$.

Based on the analysis above, the expense induced by one interrupt can be calculated as $C_{interrupt} = 2C_{cs} + C_{irq}$. If the requests for hardware encryption are concentrated, the interrupt frequency will be very high, and a high interrupt frequency incurs considerable cost for the whole ACSA system. Therefore, to reduce the overall system cost, we propose adaptive interrupt aggregation in this work.

As shown in Figure 5, instead of raising one interrupt per encryption, the hardware accelerators raise an interrupt after multiple encryption tasks have been executed. Similarly, all the waiting processes in the upper layer are woken up through the aggregated interrupt to return the ciphertext. For scenarios with fewer requests, we configure a timeout threshold to ensure that interrupts are still processed correctly.

[Figure 5 (flowchart). Set the threshold of interrupt aggregation N; set the timeout threshold T; wait for the hardware interrupt; if the interrupt number is at least N (Y), generate the interrupt; otherwise (N), check for timeout: on timeout (Y), generate the interrupt, else (N) keep waiting; finally, wake up the upper layer.]

Figure 5: Working flow of adaptive interrupt aggregation.

An aggregation query is added to the improved interrupt handling to check for completed hardware encryptions. For all the detected tasks, only one interrupt is raised to inform the HAC driver that N encryptions have been finished by the accelerators. Here N is the number of encryption tasks finished during the query period. For a better trade-off between overhead and response delay, the threshold N is adjustable according to the system load. If the load is high, the aggregation threshold N is increased automatically to decrease the interrupt frequency. Otherwise, N is decreased to raise the interrupt frequency and thus reduce the response delay.
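The aggregation logic described above can be modeled in a few lines. The following is a minimal software sketch under stated assumptions, not the HAC driver's implementation: the class name, the `poll` interface, the load thresholds (0.8 and 0.2), and the step of 1 for adjusting N are all hypothetical.

```python
from typing import Optional

class InterruptAggregator:
    """Toy model of interrupt aggregation with a count threshold N and timeout T."""

    def __init__(self, threshold: int, timeout: float) -> None:
        self.threshold = threshold      # N: completions needed to raise an interrupt
        self.timeout = timeout          # T: longest wait for the first pending completion
        self.pending = 0
        self.first_pending_at: Optional[float] = None

    def on_task_done(self, now: float) -> None:
        """Record one finished encryption task without raising an interrupt."""
        if self.pending == 0:
            self.first_pending_at = now
        self.pending += 1

    def poll(self, now: float) -> int:
        """Return how many completions one aggregated interrupt reports (0 = none)."""
        if self.pending == 0:
            return 0
        timed_out = now - self.first_pending_at >= self.timeout
        if self.pending >= self.threshold or timed_out:
            reported, self.pending = self.pending, 0
            return reported
        return 0

    def adapt(self, load: float) -> None:
        """Hypothetical policy: raise N under high load, lower it under low load."""
        if load > 0.8:
            self.threshold += 1
        elif load < 0.2 and self.threshold > 1:
            self.threshold -= 1

agg = InterruptAggregator(threshold=3, timeout=1.0)
agg.on_task_done(0.0)
agg.on_task_done(0.1)
early = agg.poll(0.2)        # only 2 completions, no timeout yet
agg.on_task_done(0.3)
batch = agg.poll(0.3)        # threshold reached: one interrupt covers 3 tasks
```

The timeout path keeps a lone completion from waiting indefinitely when traffic is light, which is the role the text assigns to the threshold T.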


Before interrupt aggregation, the system overhead for n interrupts is

$$C_1 = n \times (2C_{cs} + C_{irq}) \tag{1}$$

With the proposed interrupt aggregation, the overhead is reduced to

$$C_2 = 2C_{cs} + n \times C_{irq} \tag{2}$$

The formulas show a clear reduction in context-switch overhead with interrupt aggregation. For a given application scenario, the reduction grows as the number of interrupts n increases.
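Subtracting the aggregated overhead from the unaggregated one gives a saving of 2(n - 1) context-switch costs, which is why the benefit grows with n. The sketch below evaluates both formulas; the cost values are hypothetical placeholders in arbitrary time units, not measurements from the paper.

```python
# Sketch: interrupt overhead before and after aggregation.
# c_cs = cost of one context switch, c_irq = cost of one interrupt handler run.

def overhead_without_aggregation(n: int, c_cs: float, c_irq: float) -> float:
    """One interrupt per task: n * (2 * c_cs + c_irq)."""
    return n * (2 * c_cs + c_irq)

def overhead_with_aggregation(n: int, c_cs: float, c_irq: float) -> float:
    """One aggregated interrupt for n tasks: 2 * c_cs + n * c_irq."""
    return 2 * c_cs + n * c_irq

# Hypothetical costs, chosen only for illustration.
n, c_cs, c_irq = 8, 5.0, 2.0
c1 = overhead_without_aggregation(n, c_cs, c_irq)
c2 = overhead_with_aggregation(n, c_cs, c_irq)
saved = c1 - c2                      # equals 2 * (n - 1) * c_cs
print(c1, c2, saved)
```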

4. Adaptive Scheduler Based on HW-SW Codesign

As a heterogeneous computing platform, ACSA includes multiple processing cores and hardware accelerators. To take full advantage of both accelerators and CPUs, the Adaptive Scheduler is designed to realize optimal resource allocation. The scheduler calls the OpenSSL library or invokes accelerators for crypto computing according to the features of each application request. However, invoking hardware accelerators also induces additional management overhead for a specific system. Furthermore, in both ARM and x86-based architectures, most systems support cryptographic instructions such as AES-NI. Therefore, we first analyze the working flow in detail to explore the difference in cost and performance between the AES-NI and accelerator encryption modes. In this way, we can collect statistics on the time spent in each segment of the working flow and determine the hardware offloading conditions for SSL/TLS-based connections.
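A request filter of this kind can be sketched as a simple size-based dispatch. This is an illustrative assumption, not the scheduler described in the paper: the function name, the 64 KB threshold, and the rule that any cipher without instruction support goes to hardware are hypothetical, and a real scheduler would also take the system load into account.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Request:
    size_bytes: int   # payload size to encrypt
    cipher: str       # e.g. "AES256" or "3DES"

def choose_engine(req: Request,
                  threshold_bytes: int = 64 * 1024,
                  ni_ciphers: Tuple[str, ...] = ("AES",)) -> str:
    """Return "SW-NI" or "HW" for one request.

    Hypothetical policy: ciphers without instruction acceleration always go to
    the hardware engines; accelerated ciphers go to hardware only when the
    request is large enough to amortize the invocation cost.
    """
    if not req.cipher.startswith(ni_ciphers):
        return "HW"
    return "SW-NI" if req.size_bytes < threshold_bytes else "HW"

small = choose_engine(Request(4 * 1024, "AES256"))
large = choose_engine(Request(512 * 1024, "AES256"))
legacy = choose_engine(Request(4 * 1024, "3DES"))
print(small, large, legacy)
```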

4.1. Overhead Analysis for Workload Offloading. To guarantee generality, we adopt the standard Crypto API framework in the kernel for invoking hardware accelerators, which incurs overhead for mode switches and context switches. The detailed processing steps and their overhead are as follows:

(1) Pass the key through ioctl and create a crypto session: ioctl(ctx->cfd, CIOCGSESSION, &ctx->sess). This step induces two Mode Switches (entering the kernel and returning from the kernel to user space).

(2) Pass the request data (original data) through ioctl to the driver in the kernel: ioctl(ctx->cfd, CIOCCRYPT, &cryp). This step generates another Mode Switch.

(3) After the kernel submits the request to the driver, it calls the function waitfor() and waits for the completion of the hardware computation. During this period, the current user program enters a sleep state and the kernel schedules another process. This stage causes one Context Switch.

(4) The hardware accelerator performs the crypto computation asynchronously and raises an interrupt after the execution is complete. The interrupt processing results in a Context Switch.

(5) Once the processing of the interrupt generated in step (4) is completed, the function complete() is invoked to notify the user program that the request has been executed. The kernel then schedules the submitting user-space process in the next scheduling cycle, which produces a Context Switch.

(6) The process that submitted the request gets the encrypted results and returns to user space; that is, the second ioctl returns, generating a Mode Switch between kernel space and user space.
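The switch counts in steps (1) to (6) can be tallied with a short sketch. The split into a "session" phase (step 1) and a "request" phase (steps 2 to 6) is an assumption made here for illustration; the names `STEPS` and `switches` are hypothetical.

```python
from collections import Counter

# (step, phase, kind of switch, count), transcribed from steps (1)-(6).
# The "session"/"request" phase labels are an illustrative assumption.
STEPS = [
    (1, "session", "mode", 2),     # CIOCGSESSION: enter and leave the kernel
    (2, "request", "mode", 1),     # CIOCCRYPT: enter the kernel
    (3, "request", "context", 1),  # caller sleeps while waiting for hardware
    (4, "request", "context", 1),  # interrupt processing
    (5, "request", "context", 1),  # caller is rescheduled
    (6, "request", "mode", 1),     # second ioctl returns to user space
]


def switches(phase):
    """Sum the mode and context switches that belong to one phase."""
    total = Counter()
    for _, step_phase, kind, count in STEPS:
        if step_phase == phase:
            total[kind] += count
    return dict(total)


print(switches("session"))  # {'mode': 2}
print(switches("request"))  # {'mode': 2, 'context': 3}
```

Under this reading, every encryption request costs two mode switches and three context switches, on top of the one-time session setup.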

Based on the above analysis, we decompose the overhead of HAC invocation into three parts: the hardware initialization time $t_{init}$, the accelerator execution time $t_{exec}$, and the time for interrupt processing and mode switching $t_{irq}$. Therefore, the overhead of hardware acceleration can be denoted as $t_{hw} = t_{init} + t_{exec} + t_{irq}$.

To determine the hardware offloading conditions for SSL/TLS based applications, we measured the execution time of three working flows for different block sizes: $t_{sw}$, the time needed for software computing with AES-NI (denoted as SW-NI); $t_{hw}$, the time needed for hardware encryption with accelerators (denoted as HW); and the time needed with the NULL CIPHER. In SW-NI mode, all SSL/TLS connection and encryption operations are performed on the CPUs with the available OpenSSL library. In HW mode, the crypto computing is offloaded to hardware accelerators, and software accesses the hardware engines through the CryptoDev methodology (please see Section 3 for details). The NULL CIPHER mode only performs a scatter walk over the incoming scatter list, without any crypto operation. The time for the scatter walk covers the cost of the mode switch between user space and the kernel and the scan of the scatter list; this overhead is unavoidable when invoking hardware accelerators. The NULL CIPHER mode is therefore used to measure the general cost of the Crypto API framework and to calculate the corresponding overheads $t_{init}$ and $t_{irq}$.

We use the standard benchmark speed to obtain the overhead time in the different working modes. For a given block size, we record the total execution time of 1000 encryptions and calculate the average value.
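The averaging procedure can be sketched as a small timing harness. This is not the OpenSSL speed benchmark itself: `average_time_us` and `stand_in_workload` are hypothetical names, and SHA-256 stands in for the encryption call only because it ships with the Python standard library.

```python
import hashlib
import time


def average_time_us(operation, block, repetitions=1000):
    """Return the mean wall-clock time of operation(block) in microseconds."""
    start = time.perf_counter()
    for _ in range(repetitions):
        operation(block)
    elapsed = time.perf_counter() - start
    return elapsed / repetitions * 1e6


def stand_in_workload(block):
    # Placeholder for an encryption call (assumption: any per-block
    # operation can be timed the same way).
    return hashlib.sha256(block).digest()


for block_size in (64, 1024, 16384):
    block = bytes(block_size)
    print(block_size, round(average_time_us(stand_in_workload, block), 3))
```

Averaging over many repetitions smooths out timer resolution and scheduling noise, which matters for the sub-microsecond times of small blocks.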

4.2. Request Filtering and Data Aggregation. As analyzed above, using the accelerators induces additional overhead. To take full advantage of the hardware acceleration engines, request filtering is applied first in the Adaptive Scheduler. As illustrated in Figure 6, if the data size is small, the scheduler chooses software encryption with AES-NI to avoid the additional cost. Otherwise, hardware accelerators are invoked through CryptoDev. To further reduce the invocation overhead, request aggregation can follow when hardware acceleration is adopted.
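A minimal sketch of the filtering decision, assuming the threshold is supplied by the caller; the function name `choose_engine` and the return labels are hypothetical and do not come from the ACSA implementation.

```python
def choose_engine(request_size, threshold):
    """Pick the encryption path for a request of request_size bytes.

    Requests below the threshold stay on the CPU (AES-NI); all others are
    offloaded to the hardware accelerators through CryptoDev.
    """
    if request_size < threshold:
        return "sw-ni"
    return "hw"


# Example with an illustrative 16 KB threshold.
for size in (4096, 16384, 65536):
    print(size, choose_engine(size, threshold=16 * 1024))
```

Keeping the threshold as a parameter reflects that it depends on the measured costs of the platform and the cipher.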

The block size threshold is defined by a cost comparison. Offloading is practical only if the additional overhead of invoking the hardware is less than the SW-NI execution time. That is, the following criterion should be satisfied:

$$t_{sw} > c_{hw} = t_{init} + t_{irq},$$

$$\text{i.e., } T_{diff} = t_{sw} - (t_{init} + t_{irq}) > 0. \quad (3)$$


Table 1: The running time for accelerator invocation and AES-NI. The last three columns together make up the execution time with hardware acceleration, $t_{hw}$.

| Block size (bytes) | Execution time with AES-NI $t_{sw}$ (us) | Initialization $t_{init}$ (us) | Encryption $t_{exec}$ (us) | Invocation cost $t_{irq}$ (us) |
|---|---|---|---|---|
| 64 | 0.098 | 2.188 | 1.100 | 7.552 |
| 128 | 0.156 | 1.816 | 1.180 | 7.814 |
| 256 | 0.282 | 1.546 | 1.360 | 7.854 |
| 512 | 0.517 | 1.460 | 1.660 | 7.680 |
| 1024 | 0.985 | 1.918 | 2.300 | 7.282 |
| 2048 | 1.936 | 1.614 | 3.600 | 7.546 |
| 4096 | 3.838 | 1.874 | 6.140 | 7.506 |
| 8192 | 7.653 | 1.774 | 11.280 | 7.806 |
| 16384 | 15.259 | 1.866 | 21.500 | 7.984 |
| 32768 | 30.479 | 2.034 | 42.000 | 8.636 |
| 65536 | 60.902 | 2.128 | 82.940 | 10.212 |

Figure 6: Working flow of Adaptive Scheduler. (Diagram labels: Web Server; OpenSSL; request filtering; small block size: software encryption with AES-NI; large block size: request aggregation (n × grain_size), CryptoDev, hardware accelerators; TCP package.)

Therefore, we need to measure the software execution time $t_{sw}$ and the accelerator invocation cost $c_{hw}$ for various block sizes and set the threshold according to the difference between the two.

Here we take AES-128-CBC as an example to illustrate the threshold determination for request filtering.

From the measured data in Table 1, we can plot the difference between $t_{sw}$ and $c_{hw}$ for varying block sizes (as shown in Figure 7).

As the trend in Figure 7 shows, when the block size is below 16 KB, $t_{sw} - c_{hw} < 0$ and the running time with AES-NI is smaller. When the block size is at least 16 KB, $t_{sw} - c_{hw} > 0$ and the cost of software computing exceeds the accelerator invocation cost.

Therefore, we set the threshold for workload offloading in this example to 16 KB. If the request block size is at least 16 KB, the scheduler invokes the hardware for acceleration; otherwise, only the software working mode is adopted.
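The threshold determination can be reproduced from Table 1 with a short sketch of criterion (3). The times below are the Table 1 values in microseconds; the names `TABLE_1`, `t_diff`, and `offload_threshold` are hypothetical.

```python
# Rows of Table 1: (block size in bytes, t_sw, t_init, t_irq), times in us.
TABLE_1 = [
    (64, 0.098, 2.188, 7.552),
    (128, 0.156, 1.816, 7.814),
    (256, 0.282, 1.546, 7.854),
    (512, 0.517, 1.460, 7.680),
    (1024, 0.985, 1.918, 7.282),
    (2048, 1.936, 1.614, 7.546),
    (4096, 3.838, 1.874, 7.506),
    (8192, 7.653, 1.774, 7.806),
    (16384, 15.259, 1.866, 7.984),
    (32768, 30.479, 2.034, 8.636),
    (65536, 60.902, 2.128, 10.212),
]


def t_diff(t_sw, t_init, t_irq):
    """Criterion (3): T_diff = t_sw - (t_init + t_irq)."""
    return t_sw - (t_init + t_irq)


def offload_threshold(rows):
    """Return the smallest block size with T_diff > 0, or None if there is none."""
    for size, t_sw, t_init, t_irq in sorted(rows):
        if t_diff(t_sw, t_init, t_irq) > 0:
            return size
    return None


print(offload_threshold(TABLE_1))  # 16384
```

At 8192 bytes the difference is still negative (about -1.9 us), and at 16384 bytes it becomes positive (about 5.4 us), which matches the 16 KB threshold stated above.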

Figure 7: The difference between $t_{sw}$ and $c_{hw}$ for varying block sizes.

If hardware acceleration is adopted, OpenSSL performs the SSL processing on the original data and then sends the request to the accelerators through CryptoDev and the hardware driver. For each request, OpenSSL segments the data if the request is larger than the grain size. Here, the grain size is defined as the unit of an OpenSSL processing block. For each grain block, OpenSSL first performs compression, MAC computation, explicit IV insertion, and padding, and then delivers the encapsulated grain blocks to the hardware driver through CryptoDev one by one. Therefore, one invocation is needed per block encryption, and the next data block is processed only after the previous one is complete. Figure 8 shows an example with a grain size of 16 KB, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, the request is segmented into 16 grain blocks according to the processing unit restriction in OpenSSL, so 16 invocations are executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request is $16 \times c_{hw}$. This processing flow produces a large amount of additional overhead, resulting in low utilization of the accelerators.
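The cost of per-block invocation can be sketched as follows, using the 64 KB request with a 16 KB grain size depicted in Figure 8. The value of `c_hw` is taken from the 16384-byte row of Table 1 ($t_{init} + t_{irq}$); the function names are hypothetical.

```python
import math


def invocation_count(data_size, grain_size):
    """Number of grain blocks, and thus accelerator invocations, without aggregation."""
    return math.ceil(data_size / grain_size)


def unaggregated_overhead(data_size, grain_size, c_hw):
    """Total invocation overhead when every grain block is sent on its own."""
    return invocation_count(data_size, grain_size) * c_hw


KB = 1024
c_hw = 1.866 + 7.984  # us, t_init + t_irq for 16384-byte blocks (Table 1)

print(invocation_count(64 * KB, 16 * KB))  # 4
print(round(unaggregated_overhead(64 * KB, 16 * KB, c_hw), 3))
```

The overhead grows linearly with the number of grain blocks, which is the cost that request aggregation targets.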

To further reduce the invocation cost of the hardware accelerators, request aggregation is proposed in this paper to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted with a single accelerator invocation. The design flow for data aggregation is as follows:

(1) Modify the Nginx configuration file nginx.cfg so that the function ngx_ssl_write() can obtain a data block with a length of n × grain size.

(2) Extend the function ssl3_write_bytes() to support a processing length of n × grain size.

(3) Extend the function do_ssl3_write() with a data aggregation operation. In this function, the requested data of n × grain size follows the SSL processing flow (compression, MAC, padding, etc.) and is then aggregated into a single package for lower-level processing.

(4) Increase the write buffer size to hold the aggregated data.

(5) Revise the encryption function evp_cipher() to issue the aggregated data block to the hardware for crypto computing.

(6) Split the encrypted data into n segments after the hardware computation completes, and add a TCP header to each segment.

(7) Send the n TCP packages one by one by calling ssl3_write_pending().

As illustrated in Figure 9, for a request of n blocks (each of grain size), the proposed methodology first performs the data preprocessing (MAC, padding, etc.), then encapsulates the n segments together and issues them to the hardware driver, and finally performs the data encryption through a single accelerator invocation. In this way, the overhead of Mode Switches and Context Switches is reduced to 1/n of that of the original solution, which greatly improves the utilization of the hardware engines.
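The aggregation idea can be illustrated with a simplified sketch. It is not the modified OpenSSL code: `CountingEngine` is a hypothetical stand-in for the accelerator whose XOR "cipher" is a placeholder rather than real encryption, and the record format (HMAC-SHA256 plus block padding) is a simplification assumed for the example.

```python
import hashlib
import hmac

GRAIN_SIZE = 16 * 1024  # bytes, matching the 16 KB example
BLOCK = 16              # cipher block size used for padding


class CountingEngine:
    """Hypothetical accelerator stand-in that counts its invocations."""

    def __init__(self):
        self.invocations = 0

    def encrypt(self, data):
        self.invocations += 1
        return bytes(b ^ 0x5A for b in data)  # placeholder, NOT real encryption


def preprocess(grain, mac_key):
    """Append a MAC and pad the record to a multiple of BLOCK bytes."""
    record = grain + hmac.new(mac_key, grain, hashlib.sha256).digest()
    pad = BLOCK - len(record) % BLOCK
    return record + bytes([pad - 1]) * pad


def encrypt_aggregated(data, mac_key, engine):
    """Preprocess each grain block, encrypt all of them in one invocation,
    then split the ciphertext back into one segment per grain block."""
    grains = [data[i:i + GRAIN_SIZE] for i in range(0, len(data), GRAIN_SIZE)]
    records = [preprocess(g, mac_key) for g in grains]
    ciphertext = engine.encrypt(b"".join(records))
    segments, pos = [], 0
    for record in records:
        segments.append(ciphertext[pos:pos + len(record)])
        pos += len(record)
    return segments


engine = CountingEngine()
segments = encrypt_aggregated(bytes(64 * 1024), b"demo-key", engine)
print(len(segments), engine.invocations)  # 4 1
```

With four grain blocks, the engine is entered once instead of four times, which is the 1/n reduction in switch overhead described above.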

5. Maximize Resource Utilization with Minimal Management Cost

For most Web Server applications, highly concurrent requests must be processed in time, which is why we integrate multiple CPUs and accelerators in a single server. For a heterogeneous platform with both CPUs and hardware engines, it is important to exploit the computing strengths of each. However, how can the hardware engines be fully utilized at the least system cost? How should concurrent processes be scheduled to obtain the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy, which allocates resources from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and obtain the best system performance through hardware and software codesign.

For a clearer description, we first define the parameters involved, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume there are N CPUs available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. Of the P processes available in the system, $P_{hw}$ processes are activated for the hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is denoted as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw}. \quad (4)$$
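Equation (4) can be evaluated with a one-line helper. The sketch assumes that $B_{sw}$ is the AES-NI bandwidth contributed by each software CPU, which is how the multiplication by $N - N_{hw}$ is read here; the function name and the input values are hypothetical and chosen only for illustration.

```python
def max_codesign_bandwidth(b_hw, b_sw_per_cpu, n_cpus, n_hw):
    """Equation (4): B_max = B_hw + (N - N_hw) * B_sw."""
    if not 0 <= n_hw <= n_cpus:
        raise ValueError("n_hw must lie between 0 and n_cpus")
    return b_hw + (n_cpus - n_hw) * b_sw_per_cpu


# Illustrative inputs in GB/s: one CPU manages the accelerators,
# the remaining 15 CPUs encrypt in software.
print(max_codesign_bandwidth(b_hw=7.5, b_sw_per_cpu=1.0, n_cpus=16, n_hw=1))  # 22.5
```

The fewer CPUs the accelerators need for management, the more CPUs remain to add software bandwidth, which motivates minimizing $N_{hw}$.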

The core of the MM algorithm is to find an allocation strategy with parameters $N_{hw}$ and $P_{hw}$ that yields the best system performance with the available resources through hardware and software codesign. The parameter $N_{hw}$ found by the MM algorithm should be the least number of CPUs needed for accelerator management, while the parameter $P_{hw}$ should be the least number of processes needed. Through the MM strategy, we can fully utilize the hardware accelerators with the least occupation of system resources. Thus, the remaining CPUs are free for software computing with AES-NI or for other operations as needed [29, 30].

The major steps of the MM strategy are as follows:

(S1) Activate i CPUs (i = 1, 2, ..., N) and i processes for software encryption, and obtain the maximum encryption bandwidth $SP_i$ when the CPUs are fully loaded.

Figure 8: Existing processing flow for a 64 KB data without data aggregation. (Diagram: the original data is segmented into four 16384-byte Record Protocol Units. Each unit passes through Hash (48 bytes added), Add explicit IV (16 bytes), and Padding (16 bytes); the plaintexts are sent to the hardware crypto unit one by one for Encryption; a 5-byte header is then added to each 16464-byte result to form a TCP data package.)

Figure 9: Proposed processing flow for a 64 KB data with data aggregation. (Diagram: the original data is segmented into four 16384-byte Record Protocol Units. Each unit passes through Hash (48 bytes added), Add explicit IV (16 bytes), and Padding (16 bytes); all plaintexts are sent to the Hardware Crypto Unit at once; the ciphertext is split again and a 5-byte header is added to each 16464-byte segment to form a TCP data package.)

Table 2: Parameters for MM algorithm.

| Parameter | Description |
|---|---|
| N | The number of available CPUs in the system |
| $N_{sw}$ | The number of CPUs used for software encryption with AES-NI |
| $N_{hw}$ | The number of CPUs responsible for the processes of hardware invocation |
| $P_{sw}$ | The processes invoked for software encryption with AES-NI |
| $P_{hw}$ | The processes invoked for hardware encryption |
| $P_{total}$ | The total number of active processes |
| $HP_{i,j}$ | The maximum bandwidth with hardware encryption with i CPUs and j processes, completely loaded |
| $SP_i$ | The maximum bandwidth with software encryption with i CPUs and i processes, completely loaded |
| $P_{i,j}$ | The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$) |

Figure 10: CPUs and processes allocation with MM strategy. (Diagram: the Web Server allocates $P_{sw}$ processes on $N_{sw}$ CPUs for software encryption with AES-NI and $P_{hw}$ processes on $N_{hw}$ CPUs for hardware encryption on the HACs.)

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ when the CPUs are fully loaded.

(S3) Calculate the performance difference between hardware encryption and software computing: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) in turn to obtain the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum of the $P_{i,j}$ values obtained in (S4), $\max(P_{i,j})$ (i = 1, 2, ..., N). The maximum determines the parameters: the number of CPUs used for accelerator invocation is $N_{hw} = i$, the number of processes for hardware encryption is $P_{hw} = j$, the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$, and the corresponding number of processes is $P_{sw} = N_{sw}$. The total number of processes to activate is $P_{total} = P_{sw} + P_{hw}$.

To present the MM algorithm more clearly, we take AES-128-CBC as an example for exploring the parameters i and j, assuming N = 16. The detailed MM process is as follows:

(S1) Set the working mode to software encryption through the Adaptive Scheduler in ACSA. All encryption requests are processed with AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one ACSA process, increase the workload until the CPU is fully loaded, and obtain the maximum encryption bandwidth, 1.01 GB/s.

(S3) Activate 2~16 CPUs in turn, with 2~16 processes, respectively. As in (S2), the maximum encryption bandwidth is measured for each number of CPUs and processes. The resulting $SP_i$ (i = 1, 2, ..., 16) are summarized in Figure 12.

(S4) Set the working mode to hardware encryption through the Adaptive Scheduler in ACSA. All encryption requests are processed by the hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one ACSA process, increase the workload until the CPU is fully loaded, and obtain the maximum encryption bandwidth, 7.52 GB/s, in hardware working mode.

(S6) Keeping one CPU activated, invoke 2~n processes in turn. Record the maximum encryption bandwidth with the CPU fully loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n - 1 processes, the parameter j is set to n - 1 and the test stops.

(S7) Activate 2~16 CPUs in turn and follow steps (S5) and (S6) to determine the number of processes at which the maximum encryption bandwidth is achieved. The number of processes for each number of CPUs is recorded in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between hardware and software encryption; the results are shown in Figure 13.

Algorithm 1: The strategy for MM algorithm.

Input: N, the number of available CPUs in the system.
Output: $N_{sw}$, the number of CPUs used for software encryption with AES-NI; $N_{hw}$, the number of CPUs responsible for the processes of hardware invocation; $P_{sw}$, the processes invoked for software encryption with AES-NI; $P_{hw}$, the processes invoked for hardware encryption; $P_{total}$, the total number of active processes.

/* Test the maximum bandwidth with AES-NI for each CPU number; in general, the process number is equal to the CPU number. */
(1) for CPU number i from 1 to N do
(2)     calculate $SP_i$   /* $SP_i$ is the maximum bandwidth when the CPU number is i */
(3) end for
/* Test the maximum bandwidth with hardware for each CPU number and process number. */
(4) for CPU number i from 1 to N do
(5)     for process number j from 1 to 32 do
(6)         calculate $HP_{i,j}$   /* $HP_{i,j}$ is the maximum bandwidth when the CPU number is i and the process number is j */
(7)     end for
(8) end for
/* Calculate the maximum performance difference between hardware encryption and AES-NI computing. */
(9) let maxdiff ← −∞
(10) for CPU number i from 1 to N do
(11)     for process number j from 1 to 32 do
(12)         $P_{i,j}$ ← ($HP_{i,j}$ − $SP_i$)
(13)         if (maxdiff < $P_{i,j}$) then
(14)             maxdiff ← $P_{i,j}$
(15)             $N_{hw}$ ← i
(16)             $P_{hw}$ ← j
(17)             $N_{sw}$ ← (N − i)
(18)             $P_{sw}$ ← $N_{sw}$
(19)             $P_{total}$ ← ($P_{sw}$ + $P_{hw}$)
(20)         end if
(21)     end for
(22) end for

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

| N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $P_{hw}$ | 12 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
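The stopping rule of step (S6) can be sketched as a small search loop. The function name `saturating_process_count`, the `measure_bandwidth` callback, and the bandwidth curve in the example are hypothetical; in practice the callback would run a fully loaded measurement with j processes.

```python
def saturating_process_count(measure_bandwidth, max_processes=32):
    """Increase the process count until the bandwidth stops improving.

    Returns (j, bandwidth), where j is the last process count that still
    improved on its predecessor, as in step (S6).
    """
    best_j = 1
    best_bw = measure_bandwidth(1)
    for n in range(2, max_processes + 1):
        bw = measure_bandwidth(n)
        if bw <= best_bw:   # n processes are no better than n - 1: stop
            break
        best_j, best_bw = n, bw
    return best_j, best_bw


# Made-up curve for illustration: linear growth that saturates at 12 processes.
def synthetic_curve(j):
    return min(j, 12) * 0.5


print(saturating_process_count(synthetic_curve))  # (12, 6.0)
```

Stopping at the first non-improving step keeps the number of measurements small and returns the least process count that reaches the plateau.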

From Figure 13, the maximum difference occurs at $N_{hw}$ = 1 and $P_{hw}$ = 12; that is, one CPU and 12 processes are used for accelerator invocation. Based on the test results, 27 processes should be invoked in the system, of which 12 are scheduled for hardware management and the remaining 15 can be allocated to software encryption with AES-NI if no other work is required.
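The selection step of the MM algorithm can be sketched on the measured data. The bandwidths below (GB/s) are the data labels of Figure 12, and the process counts are those of Table 3; only the best j per CPU count is available from the paper, so the full $HP_{i,j}$ grid is not reproduced. The name `mm_allocate` is hypothetical.

```python
# SP_i: software (AES-NI) bandwidth in GB/s for i = 1..16 CPUs (Figure 12).
SP = dict(zip(range(1, 17), [
    1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
    9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97,
]))

# HP_{i,j}: hardware bandwidth at the best process count j (Figure 12, Table 3).
HP_BEST = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61] + [7.62] * 9
P_HW_BEST = [12] + [16] * 15
HP = {(i, P_HW_BEST[i - 1]): HP_BEST[i - 1] for i in range(1, 17)}


def mm_allocate(n_cpus, sp, hp):
    """Pick (i, j) maximizing HP_{i,j} - SP_i and derive the MM parameters."""
    best = None
    for (i, j), hp_ij in sorted(hp.items()):
        diff = hp_ij - sp[i]
        if best is None or diff > best["diff"]:  # strict: ties keep fewer CPUs
            n_sw = n_cpus - i
            best = {"diff": diff, "N_hw": i, "P_hw": j,
                    "N_sw": n_sw, "P_sw": n_sw, "P_total": n_sw + j}
    return best


result = mm_allocate(16, SP, HP)
print(result["N_hw"], result["P_hw"], result["N_sw"], result["P_total"])  # 1 12 15 27
```

The strict comparison means that, among equal differences, the allocation with the fewest CPUs and processes wins, in line with the goal of minimal management cost.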

Figure 11: The working flow of (a) software encryption and (b) hardware encryption. (Diagram labels: (a) user space: Web Server, OpenSSL, software crypto lib; (b) user space: Web Server, OpenSSL; kernel space: CryptoDev, Crypto API, hardware driver; hardware engine.)

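Figure 12: The encryption performance with hardware/software working mode at different CPU number. (Chart title: AES-256-CBC Crypto Performance; x-axis: the number of CPUs; y-axis: encryption bandwidth in GB/s.) Data labels:

| Number of CPUs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Software | 1.01 | 2.04 | 3.06 | 4.08 | 5.05 | 6.13 | 7.16 | 8.17 | 9.19 | 10.22 | 11.22 | 12.26 | 13.28 | 14.31 | 15.33 | 15.97 |
| Hardware | 7.52 | 7.61 | 7.62 | 7.62 | 7.62 | 7.58 | 7.61 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 |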

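Figure 13: The performance differences between hardware and software encryption. (Chart title: The performance differences between hardware and software encryption at different CPU number; x-axis: the number of CPUs; y-axis: performance differences in GB/s.) Data labels:

| Number of CPUs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Difference | 6.51 | 5.57 | 4.56 | 3.54 | 2.57 | 1.45 | 0.45 | -0.55 | -1.57 | -2.6 | -3.6 | -4.64 | -5.66 | -6.69 | -7.71 | -8.35 |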


Table 4: Hardware environment of testing.

| Hardware | Number | Configurations |
|---|---|---|
| ARM Server | 2 | CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB |
| Network Card | 4 | 10 Gbps each |
| Net cable | 4 | 2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables |

Table 5: Software environment of testing.

| Software Configuration | Software Version |
|---|---|
| Operating System | Linux-4127 |
| OpenSSL | OpenSSL-1.0.2j |
| Nginx | Nginx-1116 |
| Benchmark | ab (Apache benchmark) |

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients). (Diagram: the server and the testing machine (client) are connected through four 10 Gbps Ethernet ports.)

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we established a test platform for ACSA with 40 Gbps of network bandwidth. As shown in Table 4, we deployed two servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate the testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the highly concurrent requests from a large number of clients, we set up the network environment with four network cards, each contributing 10 Gbps of bandwidth.

As listed in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4127, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (Apache benchmark) [32] is a widely used testing toolset for website performance; we choose this benchmark for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed with the standard benchmark speed in OpenSSL, which measures the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms include AES-256-CBC and 3DES-CBC; AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM cores. The performance is reported in MB/s, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service that ACSA can provide. We use the testing machine to generate HTTPS requests from different clients. With the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (requests per second). For further analysis, we choose cipher suites based on the same algorithms tested with OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to observe the performance for different data sizes. We take full advantage of the four network cards to run 32 ab processes, so as to keep the testing pressure high.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared four working flows for AES-256-CBC: (1) software encryption with the CPU, in which the crypto computing is executed through OpenSSL libcrypto (denoted as SW); (2) software encryption with AES-NI, in which the software computing is accelerated through the special instruction set AES-NI (denoted as SW-NI for clarity); (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy (denoted as HW); and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy (denoted as HW-SW codesign). For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, Figure 16 summarizes the bandwidth comparison between HW-SW codesign and the other flows, among which the flow with AES-NI is the best case for software encryption.

Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC. (Chart title: The encryption bandwidths with different working flows (AES-256-CBC); x-axis: block size; y-axis: encryption bandwidth in MB/s.) Data labels:

| Block size | SW | SW-NI | HW | HW-SW codesign |
|---|---|---|---|---|
| 4 KB | 1272.68 | 11772.10 | 1650.75 | 11446.83 |
| 8 KB | 1272.31 | 11804.24 | 3188.08 | 11635.55 |
| 16 KB | 1269.26 | 11817.64 | 5891.48 | 12576.91 |
| 32 KB | 1263.54 | 11809.72 | 5932.79 | 13948.63 |
| 64 KB | 1263.99 | 11802.95 | 5954.03 | 15952.37 |
| 128 KB | 1258.23 | 11810.72 | 5964.93 | 16989.31 |
| 256 KB | 1247.49 | 11733.47 | 5970.58 | 16956.30 |
| 512 KB | 1251.86 | 11658.08 | 5966.48 | 16855.39 |

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW. (Chart title: 16 CPUs, 32 processes, AES-256-CBC HW-SW co-design encryption bandwidths improvement; x-axis: block size; y-axis: encryption bandwidth improvement in percent.) Data labels:

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW-NI improvement (%) | -2.76 | -1.43 | 6.42 | 18.11 | 35.16 | 43.85 | 44.51 | 44.58 |
| HW-SW co-design vs HW improvement (%) | 593.43 | 264.97 | 113.48 | 135.11 | 167.93 | 184.82 | 184.00 | 182.50 |
| HW-SW co-design vs SW improvement (%) | 799.43 | 814.52 | 890.89 | 1003.93 | 1162.06 | 1250.25 | 1259.23 | 1246.43 |

Figure 17: Encryption bandwidths with hardware engines at different number of CPUs. (Chart title: The encryption bandwidths of HW with different number of CPU (3DES-CBC); x-axis: block size from 4 KB to 512 KB; y-axis: encryption bandwidth in MB/s; one curve per CPU count from 1 to 16.)

As we can see from Figures 15 and 16, encryption with the hardware engines performs better than the crypto lib but worse than AES-NI. The reason is the invocation cost of the hardware engines, with the context switches and mode switches analyzed in Section 4.1. In contrast, AES-NI accelerates the encryption function directly through instructions, and no additional operations are needed. The proposed HW-SW codesign is the best option in almost all situations. We obtain a 799%~1246% performance improvement compared with software encryption and a 593%~182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18%~44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than that of AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for checking the data size and making the scheduling decision, the maximum encryption bandwidth is somewhat lower than that of AES-NI alone. However, this effect diminishes as the data size increases, because the scheduling check runs less frequently. If the data size is larger than 16 KB, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
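The improvement percentages quoted above follow from the bandwidths in Figure 15 as a relative difference. A short sketch, with the 4 KB and 16 KB bandwidths (MB/s) taken from the Figure 15 data labels and a hypothetical function name:

```python
def improvement_percent(new, baseline):
    """Relative bandwidth improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# 4 KB blocks: HW-SW codesign versus SW-NI and versus SW.
print(round(improvement_percent(11446.83, 11772.10), 2))  # -2.76
print(round(improvement_percent(11446.83, 1272.68), 2))   # 799.43
# 16 KB blocks: HW-SW codesign versus SW-NI.
print(round(improvement_percent(12576.91, 11817.64), 2))  # 6.42
```

The negative value at 4 KB reflects the scheduling check that runs in front of AES-NI for requests below the threshold.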

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, so data filtering and dynamic scheduling bring no benefit. Therefore, we only applied the MM strategy to fully utilize the hardware engines with minimal management cost. We first evaluated the encryption bandwidth of the hardware engines with different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small because of the cost induced by the many context and mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. The results show that four CPUs are enough to reach almost the best encryption bandwidth.

To obtain the maximum performance from both hardware and software encryption engines, the remaining CPUs can be allocated to software encryption. For most block sizes, it is reasonable to use only four CPUs for hardware encryption, which leaves the other 12 CPUs available for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines using only four CPUs (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As shown in Figure 18, even with 16 CPUs responsible for the 3DES encryption, the software encryption bandwidth is still very small (only 271 MB/s). Since 3DES is computation intensive, this algorithm occupies a large share of system resources when there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for larger block sizes because smaller data blocks induce more invocation cost. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

Figure 18: Encryption bandwidth for different working flows. (Chart title: The encryption bandwidths with different working flows (DES3-CBC); x-axis: block size; y-axis: encryption bandwidth in MB/s.) Data labels:

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 271.69 | 271.68 | 271.20 | 271.78 | 271.66 | 271.39 | 271.74 | 271.69 |
| HW with 4 CPU | 1270.39 | 2429.30 | 3453.60 | 3489.52 | 3495.98 | 3499.26 | 3500.83 | 3499.89 |
| HW-SW co-design | 1921.24 | 3536.86 | 3667.12 | 3686.88 | 3696.00 | 3701.05 | 3703.26 | 3701.54 |

For a clear comparison, we summarize the performance improvement in Figure 19. There is a nearly 12-fold bandwidth improvement with HW-SW codesign compared with software encryption. The improvement is only 6-fold for small data blocks (4 K) because of the poorer utilization of the hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is less obvious for large data blocks, because software encryption without AES-NI contributes little. Still, we obtain a 45%~51% improvement for small data blocks with HW-SW codesign and an additional 200~300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU capacity be used for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hours the user could take full advantage of both hardware and software for maximum overall performance; during normal working hours with fewer encryption requests, the user could allocate only the management resources needed for the HACs and offload as much as possible.

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW. (Chart title: 16 CPUs, 32 processes, DES3-CBC HW-SW co-design encryption bandwidths improvement; x-axis: block size; y-axis: encryption bandwidth improvement in percent.) Data labels:

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW improvement (%) | 607.14 | 1201.85 | 1252.18 | 1256.57 | 1260.52 | 1263.74 | 1262.80 | 1262.41 |
| HW-SW co-design vs HW with 4 CPUs improvement (%) | 51.23 | 45.59 | 6.18 | 5.66 | 5.72 | 5.77 | 5.78 | 5.76 |

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384. (Chart title: 16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384 network bandwidth; x-axis: page size; y-axis: network bandwidth in MB/s.) Data labels:

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 258.41 | 419.18 | 504.74 | 608.54 | 679.39 | 720.47 | 741.12 | 756.26 |
| SW-NI | 342.91 | 613.02 | 857.61 | 1182.25 | 1451.02 | 1653.04 | 1761.68 | 1831.19 |
| HW | 228.04 | 409.20 | 573.66 | 814.16 | 980.63 | 1179.44 | 1276.61 | 1336.44 |
| HW-SW co-design | 338.96 | 608.02 | 851.59 | 1173.38 | 1416.41 | 1705.11 | 1818.99 | 1884.28 |

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we can get the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how performance varies with different request characteristics.

6.3.1. Cipher Suite: ECDHE-RSA-AES256-SHA384. Similar to the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) software encryption with AES-NI instruction acceleration (SW-NI); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we show three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. We show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth (MB/s) and present the results in Figure 20.

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small block sizes. For example, the maximum improvement reaches 48.64% for the 4 KB page case. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of size detection and scheduling decisions. If the page size is equal to or larger than 128 KB, the overall performance exceeds that of software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly lower for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. However, for HW-SW codesign, there is 12.51%~18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%~23.83%, and there is also a 2.85%~3.38% performance improvement.
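The conversion from RPS to network bandwidth mentioned above can be illustrated with a short sketch. It assumes that bandwidth is simply requests per second multiplied by the page size, with 1 MB = 1024 KB; the paper does not state the formula, but this assumption reproduces the plotted values when the table entries are read with two decimal places (for example, 66153.10 RPS for SW at 4 KB). The function name is hypothetical.

```python
def rps_to_mb_per_s(rps: float, page_size_kb: float) -> float:
    """Convert requests per second to payload bandwidth in MB/s.

    Assumption (not stated in the paper): bandwidth = RPS * page size,
    with 1 MB = 1024 KB.
    """
    return rps * page_size_kb / 1024.0


# SW flow, 16 processes: 4 KB and 512 KB pages (RPS read with two decimals).
print(f"{rps_to_mb_per_s(66153.10, 4):.2f} MB/s")    # 258.41 MB/s
print(f"{rps_to_mb_per_s(1512.51, 512):.2f} MB/s")
```

Under this reading, the first value matches the SW entry for 4 KB pages in the 16-process bandwidth figure.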


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)   4         8         16        32        64        128       256      512
SW               66153.10  53654.50  32303.20  19473.40  10870.30  5763.76   2964.48  1512.51
SW-NI            87784.10  78466.20  54887.00  37831.90  23216.30  13224.30  7046.71  3662.37
HW               58378.50  52377.60  36714.30  26053.10  15690.00  9435.54   5106.42  2672.87
HW-SW co-design  86679.40  77827.00  45499.50  34357.70  22623.60  13670.90  7275.66  3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 32 processes).

Page size (KB)   4         8         16        32        64        128       256      512
SW               65885.90  52987.80  32030.00  19391.20  10803.80  5744.14   2964.18  1503.48
SW-NI            86302.50  77827.15  53017.20  37116.10  22920.40  13092.50  7020.85  3657.35
HW               58499.10  54891.60  37946.40  26972.10  16401.90  10041.90  5486.14  2881.66
HW-SW co-design  85177.20  77211.90  46404.00  36166.60  24875.10  15347.70  8325.19  4391.45

16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384: network bandwidth improvement (%) and CPU idle (%) by page size

Page size                  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs. SW-NI  -1.15   -0.81   -0.70   -0.75   -2.38   3.15    3.25    2.90
HW-SW co-design vs. HW     48.64   48.59   48.45   44.12   44.44   44.57   42.49   40.99
CPU idle                   1.50    1.60    1.70    1.80    15.50   18.00   20.00   20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384.

If we want a higher network bandwidth for better encryption performance, we could use the remaining CPU resource. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign has extra CPU idle compared with AES-NI. Therefore, to further explore better performance with HW-SW codesign, we increase the number of processes to use the CPU idle. We tested with 32 processes and show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth (MB/s) and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign gets a 49%~52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike with 16 processes, the influence is more obvious for large block sizes, since more CPU idle is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and scheduling decisions. If the page size is equal to or larger than 64 KB, the overall performance exceeds that of software-only or hardware-only encryption.

16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384: network bandwidth (MB/s) by page size

Page size        4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW               257.37   413.97   500.47   605.98   675.24   718.02   741.05   751.74
SW-NI            337.12   608.26   828.39   1159.88  1432.53  1636.56  1755.21  1828.68
HW               228.51   421.03   592.91   842.88   1025.12  1255.24  1371.54  1440.83
HW-SW co-design  334.03   603.22   820.70   1159.42  1562.83  1923.08  2092.16  2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384.

16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384: network bandwidth improvement (%) and CPU idle (%) by page size

Page size                  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs. SW-NI  -0.92   -0.83   -0.93   -0.04   9.10    17.51   19.20   20.11
HW-SW co-design vs. HW     46.17   43.27   38.42   37.55   52.45   53.20   52.54   52.44
CPU idle                   2.50    2.00    2.00    1.80    1.50    1.80    1.50    1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network improvement and CPU idle in Figure 23. As we can see from the figure, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption path. Hardware engines do not outperform AES-NI for small pages because of the context-switch cost. Therefore, ACSA automatically adopted software encryption with AES-NI for small pages. For large pages, the CPU resource saved through hardware and software codesign could be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for page size 512 KB.
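The request filter described above can be pictured with a minimal sketch. This is only an illustration under stated assumptions, not the filter implemented in ACSA: the class name, the path labels, and the single size threshold are hypothetical, and the 64 KB default is taken from the crossover observed in the 32-process results rather than from any documented setting.

```python
from dataclasses import dataclass

SOFTWARE_AESNI = "software (AES-NI)"
HARDWARE_ENGINE = "hardware engine"


@dataclass
class RequestFilter:
    # Hypothetical cut-off: the 32-process results cross over near 64 KB.
    threshold_bytes: int = 64 * 1024

    def choose(self, payload_bytes: int) -> str:
        """Pick an encryption path for one request from its payload size only."""
        if payload_bytes < self.threshold_bytes:
            # Small pages: invocation and context-switch cost dominates.
            return SOFTWARE_AESNI
        # Large pages: offloading frees CPU time for other work.
        return HARDWARE_ENGINE


request_filter = RequestFilter()
for size_kb in (4, 64, 512):
    print(size_kb, "KB ->", request_filter.choose(size_kb * 1024))
```

A real scheduler would presumably also consider the current system load, as the paper states that the crypto mode depends on both the request character and the load.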


16 CPUs, ECDHE-RSA-AES256-SHA384: network bandwidth improvement (%) and CPU idle (%) by page size

Page size                                4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs. SW-NI, 16 processes  -1.15   -0.81   -0.70   -0.75   -2.38   3.15    3.25    2.90
HW-SW co-design vs. SW-NI, 32 processes  -0.92   -0.83   -0.93   -0.04   9.10    17.51   19.20   20.11
CPU idle, 16 processes                   1.50    1.60    1.70    1.80    15.50   18.00   20.00   20.50
CPU idle, 32 processes                   2.50    2.00    2.00    1.80    1.50    1.80    1.50    1.60

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384.

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)  4         8         16        32        64       128      256      512
SW              36334.70  25207.40  12791.60  6900.93   3589.73  1833.76  924.40   466.12
HW              58275.70  50717.80  25772.70  14525.70  7645.65  3902.54  1968.62  990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of processes. From the experimental results, we can draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology makes a better trade-off between performance and CPU idle possible. It is possible to get the same network bandwidth/RPS as AES-NI while still having additional CPU resource available. The user could use the saved CPU for other important tasks or take full advantage of it for further performance improvement.
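The best-case percentages quoted in these conclusions can be recomputed from the 32-process bandwidth figures. The sketch below assumes that "improvement" means the relative gain (new / baseline - 1) and that the plotted bandwidths carry two decimal places (for example, 1923.08 MB/s); the function name is hypothetical.

```python
def improvement_percent(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# 32 processes, values in MB/s read from the network bandwidth figure.
hw_sw_128k, hw_128k = 1923.08, 1255.24      # HW-SW co-design vs. HW at 128 KB
hw_sw_512k, sw_ni_512k = 2196.43, 1828.68   # HW-SW co-design vs. SW-NI at 512 KB

print(f"{improvement_percent(hw_sw_128k, hw_128k):.2f}%")     # 53.20%
print(f"{improvement_percent(hw_sw_512k, sw_ni_512k):.2f}%")  # 20.11%
```

Both results agree with the best-case improvements reported above, which supports reading the figure values with two decimal places.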

6.3.2. Cipher Suite: ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance for two working flows in this section. The first is software encryption without AES-NI. The other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We get 1.6~2.1 times the performance with hardware encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60%~112% improvement compared with software encryption. Besides the improved performance, there is still 13.62%~89.54% CPU idle available with hardware encryption.

According to the OpenSSL test, there is at most a 12 times performance improvement. However, there is only a 2 times RPS improvement in end-to-end testing. The cause is a constraint of the testing platform: we found that the software decryption speed in the client has become a bottleneck. All the CPUs are completely loaded on the client.

16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA: network bandwidth (MB/s) by page size

Page size  4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW         141.93   196.93   199.87   215.65   224.36   229.22   231.10   233.06
HW         227.64   396.23   402.70   453.93   477.85   487.82   492.16   495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA: network bandwidth improvement (%) and CPU idle (%) by page size

Page size              4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW vs. SW improvement  60.39   101.20  101.48  110.49  112.99  112.82  112.96  112.58
CPU idle               13.62   15.40   53.15   70.40   75.07   86.04   88.17   89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which adopts the crypto mode adaptively and dynamically according to the request character and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about 12 times acceleration with accelerators. For typical AES encryption supported by instruction acceleration, we get a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user could adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides X86 based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone who would like to obtain access could send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for ssl/tls, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article David Brumley and Dan Boneh, Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005: Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM cortex TM -A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/

[32] http://httpd.apache.org/docs/current/programs/ab.html

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Figure 3: Working flow of ACSA. (Diagram labels: Clients, Web Server, HTTPS requests and responses, Handshake API, SSL/TLS lib, Crypto API, Adaptive Scheduler, fault detection, software crypto engine, construct request, CryptoDev, HAC driver, hardware crypto engines, fault signal. Layers: application layer, OpenSSL layer, CryptoDev layer, hardware layer. Domains: user space and kernel space.)

OpenSSL layer work in the user space, while the CryptoDev layer and the HAC driver work in the kernel space.

For the application layer, we adopted the typical Web Server Nginx for HTTPS service. The Web Server monitors client requirements, responds to clients' requests, and is responsible for load balancing with multithreading. During this process, the Web Server authenticates clients' identity and checks security through the crypto functions provided by OpenSSL [28]. If an HTTPS connection is enabled, the Web Server transfers the required data to the lower logic layer for further processing.

The OpenSSL subsystem responds to the crypto requirements from the Web Server and interacts with CryptoDev for HAC invocation. The roles of the OpenSSL layer in ACSA include the following:

(1) Use the provided toolkit for secure connection with the SSL/TLS protocols.

(2) Integrate the Adaptive Scheduler for encryption mode switching, which enables reasonable task scheduling between software computing and hardware encryption.

(3) Extend the feature of the resource allocation strategy MM.

(4) If software computing is adopted, invoke the cryptography library to perform the encryption requirements.

(5) Interact with the lower layer, CryptoDev. If the required data need hardware encryption, encapsulate the client requirements as EVP and then transmit the EVP blocks to CryptoDev through the Crypto API. After the hardware completes, return the generated ciphertext to the application layer.

The CryptoDev subsystem exists as a module in the Linux kernel. Upwards, CryptoDev interacts with the OpenSSL layer, receives the delivered request data, and returns the completed ciphertext. Downwards, CryptoDev transforms the received EVPs into the data structure that the HAC driver can identify. CryptoDev acts as the middle layer between synchronous and asynchronous communications.

Memory copy consumes a lot of system resources in hardware acceleration systems, especially when mass data interaction occurs between hardware and software. To solve this problem, we used an efficient "zero-copy" transmission scheme in the CryptoDev layer. We extended the CryptoDev subsystem to support both AEAD (Authenticated Encryption with Associated Data) and non-AEAD encryption requirements. For all requirements, the original data and encrypted results are organized as scatterlists. The scattered virtual addresses of the DMA buffer are arranged in a list in our design. In this way, data transmission between memory and HACs can be achieved through one DMA operation.

HAC driver acts as a loadable module in the Linux kernel and plays an important role in accelerator invocation. The driver performs four major tasks: initialization of hardware engines, device loading and unloading, encryption algorithm registration, and interrupt processing. The driver provides an API for the kernel to receive the data delivered from CryptoDev and constructs them into data blocks that HACs can parse. The driver then delivers the prepared encryption data to hardware engines through PQs. Once an interrupt signal from the HACs is detected, the driver gets the computing results and returns them to the upper layer for further transmission.

Hardware accelerators are the computing units for encryption/decryption. These components get tasks from processing queues and can compute in parallel. HACs raise an interrupt when encryption computing completes, informing the upper layer to return the results.

Figure 4: Interrupt processing flow without interrupt integration. (Diagram: requests pass through the Adaptive Scheduler on the CPU to HAC1, HAC2, ..., HACn; each HAC goes through Init, Computing, and Interrupt stages and raises its own irq 1, irq 2, ..., irq n.)

3.3. Dynamic Interrupt Aggregation. The interrupt is useful for returning results in acceleration systems. However, we find that interrupt processing also induces a lot of additional overhead in Web Server applications if the concurrent requirements are high. As illustrated in Figure 4, computing tasks are assigned to HACs through the Adaptive Scheduler (please refer to Section 4 for the details of the Adaptive Scheduler). Each accelerator works independently, and one interrupt occurs for each encryption task to signal that the results are ready once computing is complete. We analyzed the overhead induced by each interrupt, which includes the following:

(1) The CPU invokes interrupt processing and needs one context switch when an interrupt is generated. This overhead is denoted as $C_{cs}$.

(2) The interrupt handling routine reads the encrypted results and responds to the peripherals. The cost induced by these operations is expressed as $C_{irq}$.

(3) Once the interrupt handling routine is complete, the upper application returns and continues the following processing, which generates one more context switch. This cost is again $C_{cs}$.

Based on the analysis above, the expense induced by one interrupt can be calculated as $C_{interrupt} = 2C_{cs} + C_{irq}$. If the requirements for hardware encryption are concentrated, the interrupt frequency will be very high. A high interrupt frequency incurs considerable system cost for the whole ACSA system. Therefore, to reduce the overall system cost, we proposed adaptive interrupt aggregation in this work.

As shown in Figure 5, instead of one interrupt per encryption, hardware accelerators raise an interrupt after multiple encryption tasks are executed. Similarly, all the waiting processes in the upper layer are woken up by the aggregated interrupt to return the ciphertext. For scenarios with fewer requirements, we configure a timeout threshold to ensure that interrupts are still processed correctly.

Figure 5: Working flow of adaptive interrupt aggregation. (Flowchart: set the interrupt aggregation threshold N; set the timeout threshold T; wait for the hardware interrupt; if the interrupt number is ≥ N, or otherwise if the timeout is reached, generate the interrupt and wake up the upper layer; otherwise keep waiting.)

An aggregation query is added to the improved interrupt handling to check the completion of hardware encryption. For all the detected tasks, only one interrupt is raised to inform the HAC driver that N encryptions have finished on the accelerators. Here N is the number of encryption tasks finished during the query period. For a better trade-off between overhead and response delay, the threshold N is adjustable according to the system load. If the working load is high, the aggregation threshold N is increased automatically to decrease the interrupt frequency. Otherwise, N is decreased to increase the interrupt processing frequency, thus decreasing the response delay.
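The aggregation logic described above (fire once N completions have accumulated or a timeout T has elapsed, and adapt N to the load) can be modelled in a few lines. This is a toy user-space model under stated assumptions, not the driver code: the class and method names, the unit step used to adjust N, and the load cut-offs of 0.8 and 0.3 are all hypothetical.

```python
class InterruptAggregator:
    """Toy model: one aggregated interrupt per N completions or after T seconds."""

    def __init__(self, threshold_n: int, timeout_t: float):
        self.threshold_n = threshold_n   # aggregation threshold N
        self.timeout_t = timeout_t       # timeout threshold T (seconds)
        self.pending = 0                 # completions not yet reported
        self.first_pending_at = None     # time of the oldest unreported completion

    def on_completion(self, now: float) -> bool:
        """Record one finished encryption; return True if an interrupt fires."""
        if self.pending == 0:
            self.first_pending_at = now
        self.pending += 1
        return self._maybe_fire(now)

    def on_tick(self, now: float) -> bool:
        """Periodic check so that a few stragglers are not delayed forever."""
        return self._maybe_fire(now)

    def _maybe_fire(self, now: float) -> bool:
        if self.pending == 0:
            return False
        timed_out = now - self.first_pending_at >= self.timeout_t
        if self.pending >= self.threshold_n or timed_out:
            self.pending = 0
            self.first_pending_at = None
            return True
        return False

    def adapt(self, load: float, high: float = 0.8, low: float = 0.3) -> None:
        """Raise N under high load; lower it (never below 1) under low load."""
        if load > high:
            self.threshold_n += 1
        elif load < low and self.threshold_n > 1:
            self.threshold_n -= 1
```

With `InterruptAggregator(3, 0.5)`, the third completion triggers the interrupt immediately, while a lone completion is reported by the first tick that arrives at least 0.5 s later.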


Before interrupt aggregation, the system overhead for n interrupts is

$$C_1 = n \times (2C_{cs} + C_{irq}). \quad (1)$$

With the proposed interrupt aggregation, the overhead is reduced to

$$C_2 = 2C_{cs} + n \times C_{irq}. \quad (2)$$

Subtracting (2) from (1) gives a saving of $C_1 - C_2 = 2(n-1)C_{cs}$, so interrupt aggregation clearly reduces the context-switch overhead. The reduction grows as the interrupt number n increases for a given application scenario.
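As a worked instance of formulas (1) and (2), the following sketch evaluates both overheads for illustrative numbers. The cost values (5 and 2 time units) and n = 100 are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the paper.

```python
def overhead_without_aggregation(n: int, c_cs: float, c_irq: float) -> float:
    """Formula (1): every interrupt pays two context switches plus handling."""
    return n * (2 * c_cs + c_irq)


def overhead_with_aggregation(n: int, c_cs: float, c_irq: float) -> float:
    """Formula (2): two context switches in total, handling paid per task."""
    return 2 * c_cs + n * c_irq


# Hypothetical costs in arbitrary time units.
n, c_cs, c_irq = 100, 5, 2
c1 = overhead_without_aggregation(n, c_cs, c_irq)
c2 = overhead_with_aggregation(n, c_cs, c_irq)
print(c1, c2, c1 - c2)  # 1200 210 990
```

The difference equals 2(n - 1) times the context-switch cost, so the saving grows linearly with the number of aggregated interrupts.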

4. Adaptive Scheduler Based on HW-SW Codesign

As a heterogeneous computing platform, ACSA includes multiple processing cores and hardware accelerators. To take full advantage of both accelerators and CPUs, the Adaptive Scheduler is designed to realize optimal resource allocation. The scheduler calls the OpenSSL lib or invokes accelerators for crypto computing according to the features of application requests. However, the invocation of hardware accelerators also induces extra management overhead for a specific system. Furthermore, in both ARM and X86 based architectures, most systems support cryptographic instructions such as AES-NI. Therefore, we first analyze the working flow in detail to explore the difference in cost and performance between the AES-NI and accelerator encryption modes. In this way, we can gather statistics on the time spent in each segment of the working flow and figure out the hardware offloading conditions for SSL/TLS based connections.

4.1. Overhead Analysis for Workload Offloading. To guarantee universality, we adopt the standard Crypto API framework in the kernel for the invocation of hardware accelerators, which includes the overhead of Mode Switches and Context Switches. The detailed processing steps and their overhead are summarized as follows.

(1) Passing the key through ioctl and creating the crypto session: ioctl(ctx->cfd, CIOCGSESSION, &ctx->sess). This process induces two Mode Switches (entering the kernel and returning from the kernel to the user space).

(2) Passing the request data (original data) through ioctl to the driver in the kernel: ioctl(ctx->cfd, CIOCCRYPT, &cryp). This step generates another Mode Switch.

(3) After the kernel submits the request to the driver, it calls the function waitfor() and waits for the completion of hardware computing. During this period, the current user program enters a sleep state and the kernel schedules another process. This stage causes one context switch.

(4) The hardware accelerator performs crypto computing asynchronously and raises an interrupt after the execution is complete. The interrupt processing results in a context switch.

(5) Once the processing of the interrupt generated in step (4) is completed, the function complete() is invoked to notify the user program that the request has been executed. The kernel then schedules the submitting user-space process in the next scheduling cycle, which produces a context switch.

(6) The process that submitted the request gets the encrypted results and returns to the user space, i.e., the second ioctl returns, generating a mode switch between the kernel space and the user space.

Based on the above analysis, we decompose the overhead of HAC invocation into three parts: the hardware initialization time $t_{init}$, the accelerator execution time $t_{exec}$, and the time for interrupt processing and mode switch $t_{irq}$. Therefore, the overhead of hardware acceleration can be denoted as $t_{hw} = t_{init} + t_{exec} + t_{irq}$.

To figure out the hardware offloading conditions for SSL/TLS based applications, we tested the execution time of three working flows for different block sizes: $t_{sw}$, the time needed for software computing with AES-NI (denoted as SW-NI); $t_{hw}$, the time needed for hardware encryption with accelerators (denoted as HW); and the time needed with NULL CIPHER. SW-NI means that all SSL/TLS connection and encryption operations are performed by the CPUs with the available OpenSSL library. In the HW working mode, the crypto computing is offloaded to hardware accelerators, and software accesses the hardware engines through the CryptoDev methodology (please see Section 3 for details). The NULL CIPHER mode only performs the scatter walk over the incoming scatter list, without any crypto operation. The time for the scatter walk covers the cost of the mode switch between the user space and the kernel and the scan of the scatter list. This overhead is unavoidable when invoking hardware accelerators. The NULL CIPHER mode is used to measure the general cost of the Crypto API framework and to calculate the corresponding overhead of $t_{init}$ and $t_{irq}$.

We use the standard benchmark speed to obtain the overhead time in the different working modes. For a given block size, we record the total execution time of 1000 encryptions and calculate the average value.

4.2. Request Filtering and Data Aggregation. As analyzed above, the use of acceleration induces additional overhead. To take full advantage of the hardware acceleration engines, request filtering is applied first in the Adaptive Scheduler. As illustrated in Figure 6, if the data size is small, the scheduler chooses software encryption with AES-NI to avoid the additional cost. Otherwise, hardware accelerators are invoked through CryptoDev. To further reduce the invocation overhead, request aggregation can follow if hardware acceleration is adopted.

The block size threshold depends on a cost comparison. Offloading is practical only if the additional overhead of invoking the hardware is less than the execution time of SW-NI. That is to say, the following criterion should be satisfied:

$$t_{sw} > c_{hw} = t_{init} + t_{irq}, \quad \text{i.e.,} \quad T_{diff} = t_{sw} - (t_{init} + t_{irq}) > 0 \tag{3}$$


Table 1: The running time for accelerator invocation and AES-NI. The last three columns are the components of the execution time with hardware acceleration, $t_{hw}$.

| Block size (bytes) | Execution time with AES-NI, $t_{sw}$ (us) | Initialization, $t_{init}$ (us) | Encryption, $t_{exec}$ (us) | Invocation cost, $t_{irq}$ (us) |
|---|---|---|---|---|
| 64 | 0.098 | 2.188 | 1.100 | 7.552 |
| 128 | 0.156 | 1.816 | 1.180 | 7.814 |
| 256 | 0.282 | 1.546 | 1.360 | 7.854 |
| 512 | 0.517 | 1.460 | 1.660 | 7.680 |
| 1024 | 0.985 | 1.918 | 2.300 | 7.282 |
| 2048 | 1.936 | 1.614 | 3.600 | 7.546 |
| 4096 | 3.838 | 1.874 | 6.140 | 7.506 |
| 8192 | 7.653 | 1.774 | 11.280 | 7.806 |
| 16384 | 15.259 | 1.866 | 21.500 | 7.984 |
| 32768 | 30.479 | 2.034 | 42.000 | 8.636 |
| 65536 | 60.902 | 2.128 | 82.940 | 10.212 |

Figure 6: Working flow of Adaptive Scheduler. (Diagram labels: WebServer, OpenSSL, request filtering; small block size: software encryption with AES-NI; large block size: request aggregation of n × grain_size, CryptoDev, hardware accelerators; TCP package.)

Therefore, we need to measure the execution time of software computing $t_{sw}$ and the accelerator invocation cost $c_{hw}$ for various block sizes, and determine the threshold according to the difference between the two parameters.

Here we take AES-128-CBC as an example to introduce the threshold determination for request filtering.

From the measured data in Table 1, we can plot the difference between $t_{sw}$ and $c_{hw}$ for variable block sizes (as shown in Figure 7).

As the trend in Figure 7 shows, when the block size is < 16 KB, $t_{sw} - c_{hw} < 0$, so the running time with AES-NI is smaller. When the block size is ⩾ 16 KB, $t_{sw} - c_{hw} > 0$, so the cost of software computing is larger than the invocation cost of the accelerator.

Therefore, we set the threshold for workload offloading in this example to 16 KB. If the request block size is 16 KB or larger, the scheduler invokes the hardware for acceleration; otherwise, only the software working mode is adopted.
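The threshold determination can be reproduced with a short sketch. The rows are the $t_{sw}$, $t_{init}$, and $t_{irq}$ columns of Table 1 in microseconds; the function names `offload_threshold` and `choose_engine` are illustrative and are not part of ACSA or OpenSSL. The scan assumes that $T_{diff}$ stays positive once it first becomes positive, which holds for these measurements.

```python
# (block size in bytes, t_sw, t_init, t_irq), times in microseconds, from Table 1.
TABLE_1 = [
    (64, 0.098, 2.188, 7.552),
    (128, 0.156, 1.816, 7.814),
    (256, 0.282, 1.546, 7.854),
    (512, 0.517, 1.460, 7.680),
    (1024, 0.985, 1.918, 7.282),
    (2048, 1.936, 1.614, 7.546),
    (4096, 3.838, 1.874, 7.506),
    (8192, 7.653, 1.774, 7.806),
    (16384, 15.259, 1.866, 7.984),
    (32768, 30.479, 2.034, 8.636),
    (65536, 60.902, 2.128, 10.212),
]


def offload_threshold(rows):
    """Smallest block size with T_diff = t_sw - (t_init + t_irq) > 0, else None."""
    for size, t_sw, t_init, t_irq in rows:
        if t_sw - (t_init + t_irq) > 0:
            return size
    return None


def choose_engine(block_size, threshold):
    """Request filtering: small blocks stay on the CPU, large ones are offloaded."""
    return "hardware" if block_size >= threshold else "software (AES-NI)"


threshold = offload_threshold(TABLE_1)
print(threshold, choose_engine(4096, threshold), choose_engine(65536, threshold))
```

On a different platform the same scan would be rerun on that platform's measurements, since the crossover point depends on the accelerator and on the CPU's instruction support.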

If hardware acceleration is adopted, OpenSSL performs SSL processing on the original data and then sends the request to the accelerators through CryptoDev and the hardware driver. For each request, OpenSSL performs segmentation if the request data is larger than the grain size. Here the grain size is defined as the unit of an OpenSSL processing block. For each grain block, OpenSSL first performs compression, MAC computation, explicit IV addition, and padding, and then delivers the encapsulated grain blocks to the hardware driver through CryptoDev one by one. Therefore, one invocation is needed per block encryption, and the next data block is processed only after the former one is complete. We demonstrate an example with a grain size of 16 KB in Figure 8, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, this request is segmented into 16 grain blocks according to the processing unit restriction in OpenSSL, so 16 invocations are executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request is $16 \times c_{hw}$. This processing flow produces a large amount of additional overhead, resulting in low utilization of the accelerators.

Figure 7: The difference between $t_{sw}$ and $c_{hw}$ for variable block sizes.

To further reduce the invocation cost of hardware accelerators, request aggregation is proposed in this paper to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted with a single accelerator invocation. The design flow for data aggregation is detailed as follows.

(1) Modify the configuration file nginx.cfg for Nginx so that the function ngx_ssl_write() can get a data block with a length of n × grain_size.

(2) Extend the function ssl3_write_byte() to support a processing length of n × grain_size.

(3) Extend the function do_ssl3_write() with the data aggregation operation. In this function, the requested data of n × grain_size follows the SSL processing flow (compression, MAC, padding, etc.) and is then aggregated into a single package for lower-level processing.

(4) Increase the buffer size for data writes to support the storage of aggregated data.

(5) Revise the encryption function evp_cipher() to issue the aggregated data block to the hardware for crypto computing.

(6) Decompose the encrypted data into n segments after the hardware computing is completed, and add a TCP header to each segment.

(7) Send the n TCP packages one by one by calling ssl3_write_pending().

As illustrated in Figure 9, for a request of n blocks (each of grain size), the proposed methodology first performs data preprocessing (MAC, padding, etc.), then encapsulates the n segments together and issues them to the hardware driver, and finally performs data encryption through one accelerator invocation. In this way, the overhead of Mode Switches and Context Switches is reduced to 1/n of that of the original solution, which greatly improves the utilization of the hardware engines.
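The effect of aggregation on the number of accelerator invocations can be sketched as follows, using the 64 KB request with a 16 KB grain size from Figures 8 and 9. The function name `invocations` is illustrative; the sketch only counts invocations and leaves the per-invocation cost $c_{hw}$ as a separate factor.

```python
import math

KB = 1024


def invocations(data_size, grain_size, n=1):
    """Accelerator invocations for one request.

    The request is cut into grain blocks; with aggregation, up to n grain
    blocks share a single invocation (n=1 means no aggregation).
    """
    blocks = math.ceil(data_size / grain_size)
    return math.ceil(blocks / n)


without_aggregation = invocations(64 * KB, 16 * KB)      # one call per grain block
with_aggregation = invocations(64 * KB, 16 * KB, n=4)    # all blocks in one call
print(without_aggregation, with_aggregation)
```

Since each invocation pays the same mode and context switches, the switch overhead scales with this count, which is where the 1/n reduction comes from.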

5. Maximize Resource Utilization with Minimal Management Cost

For most Web Server applications, highly concurrent requests need to be processed in time, which is why we integrate multiple CPUs and accelerators in a single server. On a heterogeneous platform with both CPUs and hardware engines, it is important to give full play to their respective computing strengths. However, how can we make full use of the hardware engines at the least system cost? How should the concurrent processes be scheduled to get the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy for resource allocation from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and to get the best system performance through hardware and software codesign.

For a better description, we first define the relevant parameters, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume there are N CPUs available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. Let P be the total number of available processes in the system, of which $P_{hw}$ processes are activated for hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is recorded as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw} \tag{4}$$

The core of the MM algorithm is to find an allocation strategy with parameters $N_{hw}$ and $P_{hw}$ that gives the best system performance with the available resources through hardware and software codesign. The parameter $N_{hw}$ found by the MM algorithm should be the least number of CPUs needed for accelerator management, while the parameter $P_{hw}$ found by MM should be the least number of processes needed. Through the MM strategy, we can make full use of the hardware accelerators with the least occupation of system resources. The remaining CPUs are thus free for software computing with AES-NI or for other operations as needed [29, 30].

The major steps of the MM strategy are as follows.

(S1) Activate i CPUs (i = 1, 2, ..., N) with i processes for software encryption and get the maximum encryption bandwidth $SP_i$ with the CPUs fully loaded.


Figure 8: Existing processing flow for 64 KB of data without data aggregation. (Diagram labels: the original data is segmented into four 16384-byte blocks; each Record Protocol Unit passes through hash, explicit IV addition, padding, encryption in the crypto unit, and header addition to form a TCP data package; plaintexts are sent to the hardware one by one.)

Figure 9: Proposed processing flow for 64 KB of data with data aggregation. (Diagram labels: the original data is segmented into four 16384-byte blocks; each Record Protocol Unit passes through hash, explicit IV addition, and padding; all plaintexts are sent to the Hardware Crypto Unit at once; a header is added to each ciphertext to form a TCP data package.)


Table 2: Parameters for MM algorithm.

| Parameter | Description |
|---|---|
| $N$ | The number of available CPUs in the system |
| $N_{sw}$ | The number of CPUs used for software encryption with AES-NI |
| $N_{hw}$ | The number of CPUs responsible for the processes of hardware invocation |
| $P_{sw}$ | The processes invoked for software encryption with AES-NI |
| $P_{hw}$ | The processes invoked for hardware encryption |
| $P_{total}$ | The total number of active processes |
| $HP_{i,j}$ | The maximum bandwidth with hardware encryption with i CPUs and j processes, fully loaded |
| $SP_i$ | The maximum bandwidth with software encryption with i CPUs and i processes, fully loaded |
| $P_{i,j}$ | The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$) |

Figure 10: CPUs and processes allocation with MM strategy. (Diagram labels: the Web Server allocates $N_{sw}$ CPUs with $P_{sw}$ processes invoked for software encryption with AES-NI, and $N_{hw}$ CPUs with $P_{hw}$ processes invoked for hardware encryption on the HACs.)

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ with the CPUs fully loaded.

(S3) Calculate the performance difference between software computing and hardware encryption: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) one by one and get the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum $P_{i,j}$ from (S4), $\max(P_{i,j})$ over i = 1, 2, ..., N. From $\max(P_{i,j})$, the parameters can be determined: the number of CPUs used for accelerator invocation $N_{hw} = i$, the number of processes for hardware encryption $P_{hw} = j$, the number of CPUs for software encryption $N_{sw} = N - N_{hw}$, and the corresponding number of processes $P_{sw} = N_{sw}$. The total number of processes to be activated is $P_{total} = P_{sw} + P_{hw}$.

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example for the exploration of the parameters i and j, assuming N = 16 in this example. The detailed process followed by MM is as follows.

(S1) Set the working mode to software encryption through the Adaptive Scheduler in ACSA. All encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, increase the workload until the CPU is fully loaded, and get the maximum encryption bandwidth of 1.01 GB/s.

(S3) Activate 2~16 CPUs in turn, with 2~16 processes respectively. As in (S2), the maximum encryption bandwidth is measured for each number of CPUs and processes. We summarize $SP_i$ (i = 1, 2, ..., 16) in Figure 12.

(S4) Set the working mode to hardware encryption through the Adaptive Scheduler in ACSA. All encryption requests are processed through hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, increase the workload until the CPU is fully loaded, and get the maximum encryption bandwidth of 7.52 GB/s in hardware working mode.

(S6) Keeping one CPU activated, invoke 2, 3, ..., n processes in turn. Record the maximum encryption bandwidth with the CPU fully loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n-1 processes, the parameter j is confirmed as n-1 and the test can be stopped.

(S7) Activate 2~16 CPUs in turn and follow steps (S5) and (S6) to determine the number of processes at which the maximum encryption bandwidth is achieved. We record the number of processes for each number of CPUs in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption; the results are shown in Figure 13.

Algorithm 1: The strategy for MM algorithm.

Input: $N$: the number of available CPUs in the system.
Output: $N_{sw}$: the number of CPUs used for software encryption with AES-NI; $N_{hw}$: the number of CPUs responsible for the processes of hardware invocation; $P_{sw}$: the processes invoked for software encryption with AES-NI; $P_{hw}$: the processes invoked for hardware encryption; $P_{total}$: the total number of active processes.

/* Test the maximum bandwidth with AES-NI for different CPU numbers; in general, the process number is equal to the CPU number. */
(1) for CPU number i from 1 to N do
(2)     calculate $SP_i$  /* $SP_i$ is the maximum bandwidth when the CPU number is i */
(3) end for
/* Test the maximum bandwidth with hardware for different CPU numbers and process numbers. */
(4) for CPU number i from 1 to N do
(5)     for process number j from 1 to 32 do
(6)         calculate $HP_{i,j}$  /* $HP_{i,j}$ is the maximum bandwidth when the CPU number is i and the process number is j */
(7)     end for
(8) end for
/* Calculate the maximum performance difference between AES-NI computing and hardware encryption. */
(9) let maxdiff ← $-\infty$
(10) for CPU number i from 1 to N do
(11)     for process number j from 1 to 32 do
(12)         $P_{i,j}$ ← ($HP_{i,j} - SP_i$)
(13)         if (maxdiff < $P_{i,j}$) then
(14)             maxdiff ← $P_{i,j}$
(15)             $N_{hw}$ ← i
(16)             $P_{hw}$ ← j
(17)             $N_{sw}$ ← (N - i)
(18)             $P_{sw}$ ← $N_{sw}$
(19)             $P_{total}$ ← ($P_{sw} + P_{hw}$)
(20)         end if
(21)     end for
(22) end for

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

| N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $P_{hw}$ | 12 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |

From Figure 13, we can find the maximum difference in the case of $N_{hw} = 1$, $P_{hw} = 12$, i.e., the number of CPUs is 1 and the corresponding number of processes is 12. Based on the test results, 27 processes should be invoked for the system, of which 12 are scheduled for hardware management and the remaining 15 can be allocated to software encryption with AES-NI if no other work is needed.
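The search performed by Algorithm 1 can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: the function name `mm_allocate` and the dictionary layout are hypothetical, the bandwidths are the bar labels of Figure 12 in GB/s, and only the best process count per CPU number (Table 3) is available, so the hardware table holds one entry per CPU number instead of all 32 process counts.

```python
def mm_allocate(n_cpus, sp, hp):
    """Pick (N_hw, P_hw) maximizing P[i, j] = HP[i, j] - SP[i].

    sp maps i -> SP_i and hp maps (i, j) -> HP_{i,j}; hp must not be empty.
    """
    best = None  # (difference, i, j) of the best candidate seen so far
    for (i, j), hp_ij in sorted(hp.items()):
        diff = hp_ij - sp[i]
        if best is None or diff > best[0]:
            best = (diff, i, j)
    diff, n_hw, p_hw = best
    n_sw = n_cpus - n_hw   # remaining CPUs run software encryption
    p_sw = n_sw            # one software process per remaining CPU
    return {"N_hw": n_hw, "P_hw": p_hw, "N_sw": n_sw, "P_sw": p_sw,
            "P_total": p_sw + p_hw, "max_diff": diff}


# Software and hardware bandwidths (GB/s) for 1..16 CPUs, from Figure 12.
SP = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
HP_BEST = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
           7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
P_BEST = [12] + [16] * 15  # best process count per CPU number, from Table 3

sp = {i + 1: v for i, v in enumerate(SP)}
hp = {(i + 1, P_BEST[i]): v for i, v in enumerate(HP_BEST)}
result = mm_allocate(16, sp, hp)
print(result)
```

With these inputs the search selects one CPU and 12 processes for hardware invocation, leaving 15 CPUs and 15 processes for software encryption, 27 processes in total.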


Figure 11: The working flow of (a) software encryption and (b) hardware encryption. (Diagram labels: (a) user space: WebServer, OpenSSL, software crypto lib; (b) user space: WebServer, OpenSSL; kernel space: CryptoDev, Crypto API, hardware driver, hardware engine.)

Figure 12: The encryption performance with hardware/software working mode at different CPU numbers. (Chart title: AES-256-CBC Crypto Performance; x-axis: the number of CPUs, 1 to 16; y-axis: encryption bandwidth in GB/s. Software: 1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17, 9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97. Hardware: 7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62.)

Figure 13: The performance differences between hardware and software encryption. (x-axis: the number of CPUs, 1 to 16; y-axis: performance difference in GB/s. Values: 6.51, 5.57, 4.56, 3.54, 2.57, 1.45, 0.45, -0.55, -1.57, -2.6, -3.6, -4.64, -5.66, -6.69, -7.71, -8.35.)


Table 4: Hardware environment of testing.

| Hardware | Number | Configurations |
|---|---|---|
| ARM Server | 2 | CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB |
| Network card | 4 | 10 Gbps each |
| Net cable | 4 | 2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables |

Table 5: Software environment of testing.

| Software configuration | Software version |
|---|---|
| Operating system | Linux-4.12.7 |
| OpenSSL | OpenSSL-1.0.2j |
| Nginx | Nginx-1.11.6 |
| Benchmark | ab (apache benchmark) |

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients). (Diagram labels: the server and the testing machine (client) are connected through four 10 Gbps Ethernet ports.)

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we establish a test platform for ACSA with 40 Gbps network bandwidth. As shown in Table 4, we deployed 2 servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the highly concurrent requests of a large number of clients, we established the networking environment with 4 network cards, each contributing 10 Gbps of bandwidth.

As illustrated in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4.12.7, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (apache benchmark) [32] is a widely used testing toolset for website performance. We choose this benchmark for end-to-end testing over a network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed with the standard benchmark speed in OpenSSL. With speed, we can measure the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms are AES-256-CBC and 3DES-CBC, of which AES-CBC can be accelerated through AES-NI while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM cores. The performance in this testing is measured in MB/s, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service that ACSA can provide. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (Requests per Second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to see the performance for different data sizes. We use all 4 network cards to run 32 ab processes, so as to ensure a high testing workload.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with the CPU, where the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, where the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators together with our proposed optimization methods, adaptive scheduling and the MM strategy; this flow is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and the other flows in Figure 16, where the encryption flow with AES-NI is the best case for software encryption.


Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC. (x-axis: block size; y-axis: encryption bandwidth in MB/s.)

| Working flow | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 1272.68 | 1272.31 | 1269.26 | 1263.54 | 1263.99 | 1258.23 | 1247.49 | 1251.86 |
| SW-NI | 11772.10 | 11804.24 | 11817.64 | 11809.72 | 11802.95 | 11810.72 | 11733.47 | 11658.08 |
| HW | 1650.75 | 3188.08 | 5891.48 | 5932.79 | 5954.03 | 5964.93 | 5970.58 | 5966.48 |
| HW-SW codesign | 11446.83 | 11635.55 | 12576.91 | 13948.63 | 15952.37 | 16989.31 | 16956.30 | 16855.39 |

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW. (Chart title: 16 CPUs, 32 processes, AES-256-CBC, HW-SW co-design encryption bandwidths improvement; x-axis: block size; values are improvements in percent.)

| Improvement of HW-SW co-design (%) | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| vs SW-NI | -2.76 | -1.43 | 6.42 | 18.11 | 35.16 | 43.85 | 44.51 | 44.58 |
| vs HW | 593.43 | 264.97 | 113.48 | 135.11 | 167.93 | 184.82 | 184.00 | 182.50 |
| vs SW | 799.43 | 814.52 | 890.89 | 1003.93 | 1162.06 | 1250.25 | 1259.23 | 1246.43 |


Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs. (Chart title: the encryption bandwidths of HW with different numbers of CPUs, 3DES-CBC; x-axis: block size, 4 KB to 512 KB; y-axis: encryption bandwidth in MB/s; one series per CPU count, 1 to 16.)

As we can see from Figures 15 and 16, encryption with hardware engines performs better than the software crypto library but worse than AES-NI. The reason is the invocation cost of the hardware engines, with its context switches and mode switches, as analyzed in Section 4.1. In contrast, with AES-NI the encryption function is accelerated directly through instructions and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We get a 799%~1246% performance improvement compared with software encryption and a 593%~182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow gets an 18%~44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold for the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for the data size check and the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI alone. However, this influence decreases as the data size increases, because the scheduling check is performed less frequently. If the data size is 16 KB or larger, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign gives the best encryption bandwidth compared with software-only or hardware-only encryption.
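The improvement percentages quoted above follow from the measured bandwidths as (new - base) / base. The sketch below recomputes them for the 4 KB and 512 KB block sizes from the data labels of Figure 15 (in MB/s); the helper name `improvement_pct` and the dictionary layout are illustrative only.

```python
def improvement_pct(new, base):
    """Relative bandwidth improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# Encryption bandwidths in MB/s for AES-256-CBC, from Figure 15.
FIG15 = {
    "4 KB": {"SW": 1272.68, "SW-NI": 11772.10, "HW": 1650.75, "HW-SW": 11446.83},
    "512 KB": {"SW": 1251.86, "SW-NI": 11658.08, "HW": 5966.48, "HW-SW": 16855.39},
}

for block_size, bw in FIG15.items():
    for base in ("SW-NI", "HW", "SW"):
        pct = improvement_pct(bw["HW-SW"], bw[base])
        print(f"{block_size}: HW-SW codesign vs {base}: {pct:.2f}%")
```

The negative value against SW-NI at 4 KB reflects the cost of the size check and scheduling decision for blocks below the 16 KB threshold.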

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, so data filtering and dynamic scheduling are not worthwhile. Therefore, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by the many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can get almost the best encryption bandwidth with only four CPUs.

If we want to get the maximum performance with both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW), (2) hardware encryption with hardware engines, using only four CPUs (HW with four CPUs), and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As we can see from Figure 18, even though 16 CPUs are responsible for the 3DES encryption, the software encryption bandwidth is still very small (only 271 MB/s). Since it is computation intensive, this algorithm occupies a large share of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for larger block sizes. The reason is that more invocation cost is induced for smaller data blocks. The remaining idle CPUs can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

Figure 18: Encryption bandwidth for different working flows. (Chart title: the encryption bandwidths with different working flows, DES3-CBC; x-axis: block size; y-axis: encryption bandwidth in MB/s.)

| Working flow | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 271.69 | 271.68 | 271.20 | 271.78 | 271.66 | 271.39 | 271.74 | 271.69 |
| HW with 4 CPUs | 1270.39 | 2429.30 | 3453.60 | 3489.52 | 3495.98 | 3499.26 | 3500.83 | 3499.89 |
| HW-SW co-design | 1921.24 | 3536.86 | 3667.12 | 3686.88 | 3696.00 | 3701.05 | 3703.26 | 3701.54 |

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW. (Chart title: 16 CPUs, 32 processes, DES3-CBC, HW-SW co-design encryption bandwidths improvement; x-axis: block size; values are improvements in percent.)

| Improvement of HW-SW co-design (%) | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| vs SW | 607.14 | 1201.85 | 1252.18 | 1256.57 | 1260.52 | 1263.74 | 1262.80 | 1262.41 |
| vs HW with 4 CPUs | 51.23 | 45.59 | 6.18 | 5.66 | 5.72 | 5.77 | 5.78 | 5.76 |

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, HW-SW codesign improves bandwidth nearly 12 times compared with software encryption. The improvement is only about 6 times for small data blocks (4 KB) because the hardware engines are utilized less efficiently. Compared with hardware-only encryption, the improvement from HW-SW codesign is less pronounced for large data blocks, because software encryption without AES-NI contributes little. Still, HW-SW codesign yields a 45%~51% improvement for small data blocks and an additional 200~300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with hardware engines, we suggest that the remaining idle CPU capacity be used for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hours the user could take full advantage of both hardware and software for maximum overall performance; during normal working time with fewer encryption requests, the user could allocate only the management resources needed for the HACs and offload as much as possible.

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 258.41 | 419.18 | 504.74 | 608.54 | 679.39 | 720.47 | 741.12 | 756.26 |
| SW-NI | 342.91 | 613.02 | 857.61 | 1182.25 | 1451.02 | 1653.04 | 1761.68 | 1831.19 |
| HW | 228.04 | 409.20 | 573.66 | 814.16 | 980.63 | 1179.44 | 1276.61 | 1336.44 |
| HW-SW codesign | 338.96 | 608.02 | 851.59 | 1173.38 | 1416.41 | 1705.11 | 1818.99 | 1884.28 |

Figure 20: Network bandwidth (MB/s) with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. Table 6 shows the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384. For a clear comparison and analysis, we converted RPS to network bandwidth in MB/s and present the results in Figure 20.
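The conversion from RPS to network bandwidth can be illustrated with a minimal sketch. It assumes that bandwidth is simply the request rate times the page size with 1 MB = 1024 KB, and it reads the SW entry for 4 KB pages as 66153.10 requests per second; the function name `rps_to_mb_per_s` is a hypothetical helper, not part of the original tooling.

```python
def rps_to_mb_per_s(rps: float, page_kb: float) -> float:
    """Convert requests per second to MB/s, assuming 1 MB = 1024 KB."""
    return rps * page_kb / 1024.0


# Assumed reading of the SW flow with 4 KB pages and 16 processes.
sw_rps_4kb = 66153.10
print(round(rps_to_mb_per_s(sw_rps_4kb, 4), 2))  # 258.41
```

Under these assumptions the result agrees with the SW bandwidth reported for 4 KB pages, which suggests the figure was derived from the RPS table in this way.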

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small block sizes; for example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance is better than with software-only or hardware-only encryption.
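As a worked check of the improvement figures, the sketch below computes a relative improvement as (new - baseline) / baseline. The bandwidth values are the 4 KB, 16-process readings taken as 338.96 MB/s (HW-SW codesign), 228.04 MB/s (HW), and 342.91 MB/s (SW-NI); the helper name `improvement_pct` is an assumption made for illustration.

```python
def improvement_pct(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


hw_sw, hw, sw_ni = 338.96, 228.04, 342.91  # MB/s at 4 KB pages (assumed readings)
print(round(improvement_pct(hw_sw, hw), 2))     # 48.64
print(round(improvement_pct(hw_sw, sw_ni), 2))  # -1.15
```

The negative second value reflects that, for small pages, HW-SW codesign is slightly slower than AES-NI.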

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, HW-SW codesign still leaves room for further optimization. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. With HW-SW codesign, however, there is 12.51%~18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%~23.83%, and there is also a 2.85%~3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

| Page size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 66153.10 | 53654.50 | 32303.20 | 19473.40 | 10870.30 | 5763.76 | 2964.48 | 1512.51 |
| SW-NI | 87784.10 | 78466.20 | 54887.00 | 37831.90 | 23216.30 | 13224.30 | 7046.71 | 3662.37 |
| HW | 58378.50 | 52377.60 | 36714.30 | 26053.10 | 15690.00 | 9435.54 | 5106.42 | 2672.87 |
| HW-SW codesign | 86679.40 | 77827.00 | 45499.50 | 34357.70 | 22623.60 | 13670.90 | 7275.66 | 3770.97 |

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

| Page size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 65885.90 | 52987.80 | 32030.00 | 19391.20 | 10803.80 | 5744.14 | 2964.18 | 1503.48 |
| SW-NI | 86302.50 | 77827.15 | 53017.20 | 37116.10 | 22920.40 | 13092.50 | 7020.85 | 3657.35 |
| HW | 58499.10 | 54891.60 | 37946.40 | 26972.10 | 16401.90 | 10041.90 | 5486.14 | 2881.66 |
| HW-SW codesign | 85177.20 | 77211.90 | 46404.00 | 36166.60 | 24875.10 | 15347.70 | 8325.19 | 4391.45 |

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs. SW-NI improvement (%) | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW codesign vs. HW improvement (%) | 48.64 | 48.59 | 48.45 | 44.12 | 44.44 | 44.57 | 42.49 | 40.99 |
| CPU idle (%) | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle).

If we expect higher network bandwidth for better encryption performance, we can utilize the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to further explore the performance of HW-SW codesign, we increased the number of processes to use the idle CPU. We tested with 32 processes; Table 7 shows the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384. For a clear comparison and analysis, we converted the RPS to network bandwidth in MB/s and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign achieves a 49%~52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the 16-process case, the influence is more obvious for large block sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling decision. If the page size is equal to or larger than 64 KB, the overall performance is better than with software-only or hardware-only encryption.

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 257.37 | 413.97 | 500.47 | 605.98 | 675.24 | 718.02 | 741.05 | 751.74 |
| SW-NI | 337.12 | 608.26 | 828.39 | 1159.88 | 1432.53 | 1636.56 | 1755.21 | 1828.68 |
| HW | 228.51 | 421.03 | 592.91 | 842.88 | 1025.12 | 1255.24 | 1371.54 | 1440.83 |
| HW-SW codesign | 334.03 | 603.22 | 820.70 | 1159.42 | 1562.83 | 1923.08 | 2092.16 | 2196.43 |

Figure 22: Network bandwidth (MB/s) with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs. SW-NI improvement (%) | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11 |
| HW-SW codesign vs. HW improvement (%) | 46.17 | 43.27 | 38.42 | 37.55 | 52.45 | 53.20 | 52.54 | 52.44 |
| CPU idle (%) | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60 |

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle).

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network improvement and CPU idle in Figure 23. As we can see from the figure, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we apply the request filter to choose the most suitable encryption path. Hardware engines do not outperform AES-NI for small pages because of the context switch cost; therefore, ACSA automatically adopts software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for a page size of 512 KB.
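The behavior of the request filter described above can be sketched as a simple size-based dispatcher. This is only an illustration under stated assumptions: the 64 KB cut-off is taken from the crossover observed with 32 processes, and the names `SMALL_PAGE_LIMIT_KB`, `choose_engine`, and `hw_queue_free` are hypothetical and do not come from the ACSA implementation, whose actual decision also depends on system load.

```python
SMALL_PAGE_LIMIT_KB = 64  # hypothetical cut-off, taken from the 32-process crossover


def choose_engine(page_kb: float, hw_queue_free: bool = True,
                  limit_kb: float = SMALL_PAGE_LIMIT_KB) -> str:
    """Pick an encryption path for one request.

    Small pages go to software with AES-NI, because the invocation cost of
    the accelerator dominates; large pages are offloaded when a hardware
    queue is available.
    """
    if page_kb < limit_kb or not hw_queue_free:
        return "SW-NI"
    return "HW"


for size_kb in (4, 32, 64, 512):
    print(size_kb, choose_engine(size_kb))
```

A real scheduler would presumably tune the cut-off at run time rather than fix it; the constant here only keeps the example short.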

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs. SW-NI improvement, 16 processes (%) | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW codesign vs. SW-NI improvement, 32 processes (%) | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11 |
| CPU idle, 16 processes (%) | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |
| CPU idle, 32 processes (%) | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60 |

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs).

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

| Page size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 36334.70 | 25207.40 | 12791.60 | 6900.93 | 3589.73 | 1833.76 | 924.40 | 466.12 |
| HW | 58275.70 | 50717.80 | 25772.70 | 14525.70 | 7645.65 | 3902.54 | 1968.62 | 990.88 |

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology enables a better trade-off between performance and CPU idle. It is possible to get the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user could use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI; the other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6~2.1 times the performance of software encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, this is a 60%~112% improvement compared with software encryption. Besides the improved performance, hardware encryption still leaves 13.62%~89.54% CPU idle.

According to the OpenSSL test, the maximum performance improvement is about 12 times. However, the RPS improvement in end-to-end testing is only about 2 times. The cause is a constraint of the testing platform: we found that software decryption in the client has become a bottleneck, with all client CPUs fully loaded.

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 141.93 | 196.93 | 199.87 | 215.65 | 224.36 | 229.22 | 231.10 | 233.06 |
| HW | 227.64 | 396.23 | 402.70 | 453.93 | 477.85 | 487.82 | 492.16 | 495.44 |

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth (MB/s).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW vs. SW improvement (%) | 60.39 | 101.20 | 101.48 | 110.49 | 112.99 | 112.82 | 112.96 | 112.58 |
| CPU idle (%) | 13.62 | 15.40 | 53.15 | 70.40 | 75.07 | 86.04 | 88.17 | 89.54 |

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which is able to select the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even though there is advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for computation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. With a 40 Gbps network setup, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtain about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtain a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics (ICACCI 2012), pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article David Brumley and Dan Boneh: Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design (ICED 2008), Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits (ISIC 2016), Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011 (ICECCT'11), pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service (IWQoS 2010), IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation (ISMS 2010), pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM Cortex-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, O'Reilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER 2011), pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/

[32] http://httpd.apache.org/docs/current/programs/ab.html

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference (IGSC 2016), pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Figure 4: Interrupt processing flow without interrupt integration. (Requests pass through the Adaptive Scheduler on the CPU to HAC1, HAC2, ..., HACn; each HAC goes through Init, Computing, and Interrupt stages and raises its own interrupt: irq 1, irq 2, ..., irq n.)

provides an API for the kernel to receive data delivered from CryptoDev and constructs them into data blocks that the HACs can resolve. The driver then delivers the prepared encryption data to the hardware engines through PQs. Once an interrupt signal from the HACs is detected, the driver fetches the computing results and returns them to the upper layer for further transmission.

Hardware accelerators are the computing units for encryption/decryption. These components get tasks from the processing queues and can compute in parallel. A HAC raises an interrupt when an encryption computation completes, informing the upper layer that results can be returned.

3.3. Dynamic Interrupt Aggregation. Interrupts are useful for returning results in acceleration systems. However, we find that interrupt processing also induces considerable additional overhead in Web Server applications when the number of concurrent requests is high. As illustrated in Figure 4, computing tasks are assigned to HACs through the Adaptive Scheduler (please refer to Section 4 for details of the Adaptive Scheduler). Each accelerator works independently, and one interrupt occurs for each encryption task to signal that the results are ready once computing is complete. We analyzed the overhead induced by each interrupt, which includes the following:

(1) When an interrupt is generated, the CPU invokes interrupt processing, which requires one context switch. This overhead is denoted as $C_{cs}$.

(2) The interrupt handling routine reads the encrypted results and responds to the peripherals. The cost induced by these operations is expressed as $C_{irq}$.

(3) Once the interrupt handling routine is complete, the upper application returns and continues the following processing, which generates another context switch. This cost is again $C_{cs}$.

Based on the analysis above, the expense induced by one interrupt can be calculated as $C_{interrupt} = 2C_{cs} + C_{irq}$. If the requests for hardware encryption are concentrated, the interrupt frequency will be very high, and a high interrupt frequency incurs considerable cost for the whole ACSA system. Therefore, to reduce the overall system cost, we propose adaptive interrupt aggregation in this work.

As shown in Figure 5, instead of one interrupt per encryption, the hardware accelerators raise an interrupt only after multiple encryption tasks have been executed. Similarly, all the waiting processes in the upper layer are woken up by the aggregated interrupt to return the ciphertext. For scenarios with fewer requests, we configure a timeout threshold to ensure that interrupts are still processed correctly.

Figure 5: Working flow of adaptive interrupt aggregation. (Set the interrupt aggregation threshold N; set the timeout threshold T; wait for the hardware interrupt; if the interrupt number ≥ N or the timeout expires, generate the interrupt and wake up the upper layer; otherwise keep waiting.)

An aggregation query is added to the improved interrupt handling to check the completion of hardware encryption. For all the detected tasks, only one interrupt is raised to inform the HAC driver that N encryptions have been finished by the accelerators. Here, N is the number of encryption tasks finished during the query period. For a better trade-off between overhead and response delay, the threshold N is adjustable according to the system load. If the working load is high, the aggregation threshold N is increased automatically to decrease the interrupt frequency. Otherwise, N is decreased to raise the interrupt processing frequency, thus decreasing the response delay.
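The aggregation logic can be sketched in a few lines of user-space Python. This is a simulation under stated assumptions, not the driver code: the class `InterruptAggregator`, its method names, the load thresholds 0.75 and 0.25, and the doubling/halving policy for N are all hypothetical choices made only to illustrate a count threshold N combined with a timeout T.

```python
class InterruptAggregator:
    """Toy model of interrupt aggregation with threshold N and timeout T."""

    def __init__(self, threshold: int, timeout: float,
                 min_n: int = 1, max_n: int = 64):
        self.threshold = threshold      # N: finished tasks per aggregated interrupt
        self.timeout = timeout          # T: longest wait for the oldest finished task
        self.min_n, self.max_n = min_n, max_n
        self.pending = 0                # finished tasks not yet reported
        self.first_pending_at = None    # completion time of the oldest pending task

    def on_task_done(self, now: float) -> int:
        """Record one finished task; return how many tasks are reported (0 if none)."""
        if self.pending == 0:
            self.first_pending_at = now
        self.pending += 1
        return self._maybe_flush(now)

    def on_tick(self, now: float) -> int:
        """Periodic check so that a lightly loaded system still gets its results."""
        return self._maybe_flush(now)

    def _maybe_flush(self, now: float) -> int:
        if self.pending == 0:
            return 0
        timed_out = now - self.first_pending_at >= self.timeout
        if self.pending >= self.threshold or timed_out:
            flushed, self.pending = self.pending, 0
            self.first_pending_at = None
            return flushed              # one aggregated interrupt covers `flushed` tasks
        return 0

    def adapt(self, load: float) -> None:
        """Hypothetical policy: raise N under high load, lower it under low load."""
        if load > 0.75:
            self.threshold = min(self.threshold * 2, self.max_n)
        elif load < 0.25:
            self.threshold = max(self.threshold // 2, self.min_n)


agg = InterruptAggregator(threshold=4, timeout=1.0)
print([agg.on_task_done(t) for t in (0.0, 0.1, 0.2, 0.3)])  # fourth task triggers a flush
```

Passing timestamps explicitly keeps the model deterministic; a real driver would rely on a hardware or kernel timer instead.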


Before interrupt aggregation, the system overhead for n interrupts is

$$C_1 = n \times (2C_{cs} + C_{irq}). \quad (1)$$

With the proposed interrupt aggregation, the overhead is reduced to

$$C_2 = 2C_{cs} + n \times C_{irq}. \quad (2)$$

From these formulas, we can see an obvious reduction in context switch overhead with interrupt aggregation. The reduction grows as the number of interrupts n increases for a given application scenario.
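The two overhead formulas can be compared numerically with a small sketch. The cost values used below (5 time units per context switch, 2 per interrupt routine) are purely illustrative assumptions, and the function names are hypothetical; the point is only that the saving equals 2(n - 1) context switches.

```python
def overhead_without_aggregation(n: int, c_cs: float, c_irq: float) -> float:
    """Formula (1): every task pays two context switches plus the routine cost."""
    return n * (2 * c_cs + c_irq)


def overhead_with_aggregation(n: int, c_cs: float, c_irq: float) -> float:
    """Formula (2): two context switches in total, the routine cost once per task."""
    return 2 * c_cs + n * c_irq


C_CS, C_IRQ = 5.0, 2.0  # illustrative costs in arbitrary time units
for n in (1, 10, 100):
    c1 = overhead_without_aggregation(n, C_CS, C_IRQ)
    c2 = overhead_with_aggregation(n, C_CS, C_IRQ)
    print(n, c1, c2, c1 - c2)  # the difference is 2 * (n - 1) * C_CS
```

For n = 1 the two costs coincide, which matches the intuition that aggregation only pays off once several tasks share one interrupt.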

4. Adaptive Scheduler Based on HW-SW Codesign

As a heterogeneous computing platform, ACSA includes multiple processing cores and hardware accelerators. To take full advantage of both accelerators and CPUs, the Adaptive Scheduler is designed to realize optimal resource allocation. The scheduler calls the OpenSSL library or invokes accelerators for crypto computing according to the features of the application requests. However, invoking hardware accelerators also induces extra management overhead for a specific system. Furthermore, in both ARM and x86 based architectures, most systems support cryptographic instructions such as AES-NI. Therefore, we first analyze the working flow in detail to explore the difference in cost and performance between the AES-NI and accelerator encryption modes. In this way, we are able to gather statistics on the time spent in each segment of the working flow and figure out the hardware offloading conditions for SSL/TLS based connections.

4.1. Overhead Analysis for Workload Offloading. To guarantee universality, we adopt the standard Crypto API framework in the kernel for the invocation of hardware accelerators, which incurs overhead for Mode Switches and Context Switches. The detailed processing steps and overhead are summarized as follows:

(1) Passing the key through ioctl and creating a crypto session: ioctl(ctx->cfd, CIOCGSESSION, &ctx->sess). This process induces two Mode Switches (entering the kernel and returning from the kernel to user space).

(2) Passing the request data (original data) through ioctl to the driver in the kernel: ioctl(ctx->cfd, CIOCCRYPT, &cryp). This step generates another Mode Switch.

(3) After the kernel submits the request to the driver, it calls the function waitfor() and waits for the completion of hardware computing. During this period, the current user program enters a sleep state and the kernel schedules another process. This stage causes one context switch.

(4) The hardware accelerator performs crypto computing asynchronously and raises an interrupt after the execution is complete. The interrupt processing results in a context switch.

(5) Once the interrupt processing (generated in step (4)) is completed, the function complete() is invoked to notify the user program that the request has been executed. The kernel then schedules the submitting user-space process in the next scheduling cycle; this produces a context switch.

(6) The process that submitted the request gets the encrypted results and returns to user space; i.e., the second ioctl returns, generating a mode switch between user space and kernel space.

Based on the above analysis, we decompose the overhead of HAC invocation into three parts: the hardware initialization time $t_{init}$, the accelerator execution time $t_{exec}$, and the time for interrupt processing and mode switches $t_{irq}$. Therefore, the overhead of hardware acceleration can be denoted as $t_{hw} = t_{init} + t_{exec} + t_{irq}$.
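As a quick illustration of this decomposition, the sketch below sums the three components for a single invocation. The sample values are the 16 KB row of Table 1 (in µs); the function name is hypothetical and is not part of ACSA.

```python
def hw_overhead_us(t_init: float, t_exec: float, t_irq: float) -> float:
    """Time of one accelerator invocation: t_hw = t_init + t_exec + t_irq."""
    return t_init + t_exec + t_irq


# 16 KB row of Table 1, all values in microseconds.
t_hw_16k = hw_overhead_us(t_init=1.866, t_exec=21.500, t_irq=7.984)
print(f"t_hw for a 16 KB block: {t_hw_16k:.3f} us")
```

Only $t_{init}$ and $t_{irq}$ are pure invocation cost; $t_{exec}$ is the useful work done by the accelerator.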

To determine the hardware offloading conditions for SSL/TLS-based applications, we tested the execution time of three working flows for different block sizes: $t_{sw}$, the time needed for software computing with AES-NI (denoted as SW-NI); $t_{hw}$, the time needed for hardware encryption with accelerators (denoted as HW); and the time needed with NULL CIPHER. In SW-NI mode, all SSL/TLS connection and encryption operations are performed by the CPUs with the available OpenSSL library. In HW mode, the crypto computing is offloaded to hardware accelerators, and software accesses the hardware engines through the CryptoDev methodology (see Section 3 for details). NULL CIPHER mode only performs a scatter walk over the incoming scatter list, without any crypto operation. The scatter-walk time captures the cost of the mode switch between user space and the kernel and of scanning the scatter list. This overhead is unavoidable when invoking hardware accelerators. NULL CIPHER mode is used to measure the general cost of the Crypto API framework and to calculate the corresponding overheads $t_{init}$ and $t_{irq}$.

We use the standard benchmark speed to obtain the overhead time in the different working modes. For a given block size, we record the total execution time of 1000 encryptions and calculate the average value.

4.2. Request Filtering and Data Aggregation. As analyzed above, using hardware acceleration induces additional overhead. To take full advantage of the hardware acceleration engines, request filtering is applied first in the Adaptive Scheduler. As illustrated in Figure 6, if the data size is small, the scheduler chooses software encryption with AES-NI to avoid the additional cost. Otherwise, hardware accelerators are invoked through CryptoDev. To further reduce the invocation overhead, request aggregation can follow if hardware acceleration is adopted.

The block size threshold is defined by a cost comparison. Offloading is practical only if the additional overhead of invoking the hardware is less than the SW-NI execution time. That is, the following criterion should be satisfied:

$$t_{sw} > c_{hw} = t_{init} + t_{irq}, \quad \text{i.e.,} \quad T_{diff} = t_{sw} - (t_{init} + t_{irq}) > 0. \tag{3}$$


Table 1: The running time for accelerator invocation and AES-NI. All times are in µs; the last three columns are the components of the execution time with hardware acceleration, $t_{hw}$.

Block size   AES-NI      Initialization   Encryption   Invocation cost
(bytes)      $t_{sw}$    $t_{init}$       $t_{exec}$   $t_{irq}$
64           0.098       2.188            1.100        7.552
128          0.156       1.816            1.180        7.814
256          0.282       1.546            1.360        7.854
512          0.517       1.460            1.660        7.680
1024         0.985       1.918            2.300        7.282
2048         1.936       1.614            3.600        7.546
4096         3.838       1.874            6.140        7.506
8192         7.653       1.774            11.280       7.806
16384        15.259      1.866            21.500       7.984
32768        30.479      2.034            42.000       8.636
65536        60.902      2.128            82.940       10.212

Figure 6: Working flow of Adaptive Scheduler. [Diagram blocks: Web Server; OpenSSL; request filtering (small block size: software encryption with AES-NI; large block size: request aggregation, n × grain_size); CryptoDev; hardware accelerators; TCP package.]

Therefore, we need to measure the execution time of software computing $t_{sw}$ and the accelerator invocation cost $c_{hw}$ for various block sizes and determine the threshold from the difference between the two parameters.

Here we take AES-128-CBC as an example to introduce the threshold determination for request filtering.

From the measured data in Table 1, we can plot the difference between $t_{sw}$ and $c_{hw}$ for varying block sizes (as shown in Figure 7).

As the trend in Figure 7 shows, when the block size is < 16 KB, $t_{sw} - c_{hw} < 0$ and the running time with AES-NI is smaller. When the block size is ⩾ 16 KB, $t_{sw} - c_{hw} > 0$ and the cost of software computing is larger than the cost of invoking the accelerator.

Therefore, we set the threshold for workload offloading in this example to 16 KB. If the request block size is 16 KB or larger, the scheduler invokes hardware for acceleration; otherwise, only the software working mode is adopted.
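The threshold determination can be reproduced from the Table 1 measurements. The following is a minimal sketch, assuming the table values are given in µs; the names `offload_threshold` and `choose_path` are hypothetical and only illustrate criterion (3) and the filtering decision, not the actual scheduler code.

```python
# (block size in bytes, t_sw, t_init, t_irq); times in microseconds, from Table 1.
TABLE_1 = [
    (64, 0.098, 2.188, 7.552), (128, 0.156, 1.816, 7.814),
    (256, 0.282, 1.546, 7.854), (512, 0.517, 1.460, 7.680),
    (1024, 0.985, 1.918, 7.282), (2048, 1.936, 1.614, 7.546),
    (4096, 3.838, 1.874, 7.506), (8192, 7.653, 1.774, 7.806),
    (16384, 15.259, 1.866, 7.984), (32768, 30.479, 2.034, 8.636),
    (65536, 60.902, 2.128, 10.212),
]


def offload_threshold(rows):
    """Smallest measured size from which T_diff = t_sw - (t_init + t_irq) > 0
    holds for that size and every larger measured size (None if it never holds)."""
    threshold = None
    for size, t_sw, t_init, t_irq in sorted(rows, reverse=True):
        if t_sw - (t_init + t_irq) > 0:
            threshold = size
        else:
            break
    return threshold


def choose_path(block_size, threshold):
    """Request filtering: hardware at or above the threshold, AES-NI below it."""
    if threshold is not None and block_size >= threshold:
        return "HW"
    return "SW-NI"


THRESHOLD = offload_threshold(TABLE_1)
print(THRESHOLD, choose_path(4096, THRESHOLD), choose_path(65536, THRESHOLD))
```

With these measurements the computed threshold is 16384 bytes, which agrees with the 16 KB value chosen in the text.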

Figure 7: The difference between $t_{sw}$ and $c_{hw}$ for varying block sizes.

If hardware acceleration is adopted, OpenSSL performs SSL processing on the original data and then sends the request to the accelerators through CryptoDev and the hardware driver. For each request, OpenSSL performs segmentation if the request data is larger than the grain size. Here the grain size is defined as the unit of an OpenSSL processing block. For each grain block, OpenSSL first performs compression, MAC computation, explicit IV addition, and padding, and then delivers the encapsulated grain blocks to the hardware driver through CryptoDev one by one. Therefore, one invocation is needed per block encryption, and the next data block is processed only after the previous one is complete. Figure 8 demonstrates an example with a grain size of 16 KB, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, the request is segmented into 16 grain blocks according to the processing unit restriction in OpenSSL, so 16 invocations are executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request is 16 × $c_{hw}$. This processing flow produces a lot of additional overhead, resulting in low utilization of the accelerators.

To further reduce the invocation cost of hardware accelerators, request aggregation is proposed in this paper to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted with a single accelerator invocation. The design flow for data aggregation is detailed as follows:

(1) Modify the configuration file nginx.cfg for Nginx so that the function ngx_ssl_write() can get a data block with a length of n × grain_size.

(2) Extend the function ssl3_write_byte() to support a processing length of n × grain_size.

(3) Strengthen the function do_ssl3_write() with the data aggregation operation. In this function, the requested data of n × grain_size follows the SSL processing flow (compression, MAC, padding, etc.) and is then aggregated into a single package for lower-layer processing.

(4) Increase the buffer size for data writes to support storage of the aggregated data.

(5) Revise the encryption function evp_cipher() to issue the aggregated data block to the hardware for crypto computing.

(6) Decompose the encrypted data into n segments after the hardware computation completes, and add a TCP header to each segment.

(7) Send the n TCP packages one by one by calling ssl3_write_pending().

As illustrated in Figure 9, for a request of n blocks (each block of grain size), the proposed methodology first performs data preprocessing (MAC, padding, etc.), then encapsulates the n segments together and issues them to the hardware driver, and finally performs data encryption through one accelerator invocation. In this way, the overhead of Mode Switches and Context Switches is reduced to 1/n of that of the original solution, which greatly improves the utilization of the hardware engines.
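The effect of aggregation on the number of accelerator invocations can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the per-invocation cost `c_hw_us` is treated as a constant, which is an assumption (Table 1 shows that it grows slightly with block size).

```python
import math


def invocations(data_size: int, grain_size: int, n_aggregate: int = 1) -> int:
    """Accelerator invocations needed when n_aggregate grain blocks share one call."""
    blocks = math.ceil(data_size / grain_size)
    return math.ceil(blocks / n_aggregate)


def invocation_overhead_us(data_size: int, grain_size: int,
                           c_hw_us: float, n_aggregate: int = 1) -> float:
    """Total mode/context switch overhead, assuming a constant cost per invocation."""
    return invocations(data_size, grain_size, n_aggregate) * c_hw_us


KB = 1024
# The 64 KB request of Figures 8 and 9 with a 16 KB grain size.
print(invocations(64 * KB, 16 * KB))                 # one call per grain block
print(invocations(64 * KB, 16 * KB, n_aggregate=4))  # all blocks in one call
```

Under this model, aggregating n blocks divides the invocation overhead by n, which is the 1/n reduction described above.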

5. Maximize Resource Utilization with Minimal Management Cost

For most Web Server applications, highly concurrent requests need to be processed in time, which is why we integrate multiple CPUs and accelerators in a single server. On a heterogeneous platform with both CPUs and hardware engines, it is important to exploit the respective computing strengths of each. However, how can the hardware engines be fully utilized at the least system cost? How should concurrent processes be scheduled to get the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy for resource allocation from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and to obtain the best system performance through hardware and software codesign.

For a clearer description, we first define the relevant parameters, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume there are N CPUs available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. Assume further that there are P available processes in total, of which $P_{hw}$ processes are activated for the hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is denoted as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw}. \tag{4}$$
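Equation (4) can be evaluated directly. In the sketch below, $B_{sw}$ is interpreted as the AES-NI bandwidth of a single CPU, which is an assumption made so that the product with $(N - N_{hw})$ is meaningful. The sample numbers (7.52 GB/s for hardware with one managing CPU, 1.01 GB/s for one CPU with AES-NI) are taken from the AES-128-CBC example later in this section, and the function name is hypothetical.

```python
def max_codesign_bandwidth(b_hw: float, b_sw_per_cpu: float, n: int, n_hw: int) -> float:
    """Theoretical upper bound of Equation (4): B_max = B_hw + (N - N_hw) * B_sw."""
    if not 0 <= n_hw <= n:
        raise ValueError("n_hw must be between 0 and n")
    return b_hw + (n - n_hw) * b_sw_per_cpu


# N = 16 CPUs, one of them managing the accelerators; bandwidths in GB/s.
b_max = max_codesign_bandwidth(b_hw=7.52, b_sw_per_cpu=1.01, n=16, n_hw=1)
print(f"B_max = {b_max:.2f} GB/s")
```

This is an upper bound only; measured software bandwidth does not scale perfectly linearly with the number of CPUs.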

The core of the MM algorithm is to find an allocation strategy with parameters $N_{hw}$ and $P_{hw}$ that achieves the best system performance with the available resources through hardware and software codesign. The $N_{hw}$ found by the MM algorithm should be the least number of CPUs needed for accelerator management, while the $P_{hw}$ found by MM should be the least number of processes needed. Through the MM strategy, we can fully utilize the hardware accelerators with the least occupation of system resources. Thus, the remaining CPUs are free for software computing with AES-NI or for other operations as needed [29, 30].

The major steps of the MM strategy are as follows:

(S1) Activate i CPUs (i = 1, 2, ..., N) and i processes for software encryption, and get the maximum encryption bandwidth $SP_i$ under the condition that the CPUs are completely loaded.


Figure 8: Existing processing flow for a 64 KB data without data aggregation. [Diagram: the original data is segmented into four 16384-byte blocks; in the Record Protocol Unit each block goes through Hash, Add explicit IV, and Padding; the plaintexts are sent to the crypto unit for Encryption one by one; a header is then added to form each TCP data package.]

Figure 9: Proposed processing flow for a 64 KB data with data aggregation. [Diagram: the original data is segmented into four 16384-byte blocks; in the Record Protocol Unit each block goes through Hash, Add explicit IV, and Padding; all plaintexts are sent to the Hardware Crypto Unit at once; a header is then added to each ciphertext to form the TCP data packages.]


Table 2: Parameters for MM algorithm.

Parameter     Description
N             The number of available CPUs in the system
$N_{sw}$      The number of CPUs used for software encryption with AES-NI
$N_{hw}$      The number of CPUs responsible for the processes of hardware invocation
$P_{sw}$      The processes invoked for software encryption with AES-NI
$P_{hw}$      The processes invoked for hardware encryption
$P_{total}$   The total number of active processes
$HP_{i,j}$    The maximum bandwidth with hardware encryption with i CPUs and j processes, completely loaded
$SP_i$        The maximum bandwidth with software encryption with i CPUs and i processes, completely loaded
$P_{i,j}$     The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$)

Figure 10: CPUs and processes allocation with MM strategy. [Diagram: the Web Server allocates processes to the CPUs; one group of processes is invoked for software encryption (AES-NI) and the other group is invoked for hardware encryption (HACs).]

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ under the condition that the CPUs are completely loaded.

(S3) Calculate the performance difference between hardware encryption and software computing: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) one by one and get the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum of the $P_{i,j}$ values obtained in (S4), max($P_{i,j}$) (i = 1, 2, ..., N). From max($P_{i,j}$), the parameters can be determined: the number of CPUs used for accelerator invocation, $N_{hw} = i$; the number of processes for hardware encryption, $P_{hw} = j$; the number of CPUs for software encryption, $N_{sw} = N - N_{hw}$; and the corresponding number of processes, $P_{sw} = N_{sw}$. The total number of processes to activate is $P_{total} = P_{sw} + P_{hw}$.

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example for the exploration of the parameters i and j, assuming N = 16. The detailed process followed by MM is as follows:

(S1) Set the working mode to software encryption through the Adaptive Scheduler in ACSA. All the encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, increase the workload until the CPU is completely loaded, and get the maximum encryption bandwidth, 1.01 GB/s.

(S3) Activate 2∼16 CPUs in turn, with 2∼16 processes respectively. As in (S2), the maximum encryption bandwidth can be obtained for each number of CPUs and processes. We summarize $SP_i$ (i = 1, 2, ..., 16) in Figure 12.

(S4) Set the working mode to hardware encryption through the Adaptive Scheduler in ACSA. All the encryption requests are processed through hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, increase the workload until the CPU is completely loaded, and get the maximum encryption bandwidth, 7.52 GB/s in hardware working mode.

(S6) Keep one CPU activated and invoke 2∼n processes in turn. Record the maximum encryption bandwidth when the CPU is completely loaded in each case. If the bandwidth with n processes invoked is less than or equal to the bandwidth with n-1 processes invoked, the parameter j is confirmed as n-1 and the test can be stopped.

(S7) Activate 2∼16 CPUs in turn and follow steps (S5) and (S6) to confirm the number of processes at which the maximum encryption bandwidth is achieved. We recorded the number of processes for each number of CPUs in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption; the results are recorded in Figure 13.

Algorithm 1: The strategy for MM algorithm.

Input: N, the number of available CPUs in the system
Output:
    $N_{sw}$, the number of CPUs used for software encryption with AES-NI
    $N_{hw}$, the number of CPUs responsible for the processes of hardware invocation
    $P_{sw}$, the processes invoked for software encryption with AES-NI
    $P_{hw}$, the processes invoked for hardware encryption
    $P_{total}$, the total number of active processes

/* test for the maximum bandwidth with AES-NI for different CPU numbers; in general, the process number is equal to the CPU number */
(1)  for CPU number i from 1 to N do
(2)      calculate $SP_i$    /* $SP_i$ is the maximum bandwidth when the CPU number is i */
(3)  end for
/* test for the maximum bandwidth with hardware for different CPU numbers and process numbers */
(4)  for CPU number i from 1 to N do
(5)      for process number j from 1 to 32 do
(6)          calculate $HP_{i,j}$    /* $HP_{i,j}$ is the maximum bandwidth when the CPU number and process number are i and j */
(7)      end for
(8)  end for
/* calculate the maximum performance difference between AES-NI computing and hardware encryption */
(9)  let maxdiff ← −∞
(10) for CPU number i from 1 to N do
(11)     for process number j from 1 to 32 do
(12)         $P_{i,j}$ ← ($HP_{i,j}$ − $SP_i$)
(13)         if (maxdiff < $P_{i,j}$) then
(14)             maxdiff ← $P_{i,j}$
(15)             $N_{hw}$ ← i
(16)             $P_{hw}$ ← j
(17)             $N_{sw}$ ← (N − i)
(18)             $P_{sw}$ ← $N_{sw}$
(19)             $P_{total}$ ← ($P_{sw}$ + $P_{hw}$)
(20)         end if
(21)     end for
(22) end for

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

N          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
$P_{hw}$   12  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16
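The stopping rule of step (S6) can be sketched as a small search loop. The `measure` callback stands in for running the benchmark with a given number of processes and is hypothetical; the bandwidth curve used in the demonstration is synthetic and is not measured data from this work.

```python
from typing import Callable


def find_saturating_processes(measure: Callable[[int], float],
                              max_processes: int = 32) -> int:
    """Add processes one at a time; stop when n processes are no faster than
    n - 1 processes and return n - 1, as described in step (S6)."""
    prev_bw = measure(1)
    for n in range(2, max_processes + 1):
        bw = measure(n)
        if bw <= prev_bw:
            return n - 1
        prev_bw = bw
    return max_processes


def synthetic_bandwidth(processes: int) -> float:
    """Made-up curve: bandwidth grows linearly, then saturates at 7.52 GB/s."""
    return min(0.7 * processes, 7.52)


print(find_saturating_processes(synthetic_bandwidth))
```

The rule assumes the bandwidth curve rises monotonically until saturation; noisy measurements would need averaging before the comparison.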

From Figure 13, we find that the maximum difference occurs with one CPU for hardware invocation ($N_{hw}$ = 1) and $P_{hw}$ = 12 corresponding processes. Based on the test results, 27 processes should be invoked for the system, of which 12 are scheduled for hardware management and the remaining 15 can be allocated to software encryption with AES-NI if no other work is needed.
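The selection described above can be replayed on the reported measurements. The sketch below is a simplification of Algorithm 1, under the assumption that the hardware bandwidth for each CPU count has already been maximized over the process count j (steps (S6) and (S7)), so only i is searched. The bandwidth lists are the bar values of Figure 12 in GB/s, the process counts come from Table 3, and `mm_allocate` is a hypothetical name.

```python
# Software (AES-NI) bandwidth SP_i for i = 1..16 CPUs, GB/s (Figure 12).
SP = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
# Best hardware bandwidth for i = 1..16 CPUs, GB/s (Figure 12).
HP = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
      7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
# Process count P_hw at which that hardware bandwidth is reached (Table 3).
P_HW_BEST = [12] + [16] * 15


def mm_allocate(n, sp, hp, p_hw_best):
    """Pick the CPU count i that maximizes HP_i - SP_i and derive the allocation."""
    best_i = max(range(1, n + 1), key=lambda i: hp[i - 1] - sp[i - 1])
    n_hw = best_i
    p_hw = p_hw_best[best_i - 1]
    n_sw = n - n_hw
    p_sw = n_sw  # one software process per remaining CPU
    return {"N_hw": n_hw, "P_hw": p_hw, "N_sw": n_sw,
            "P_sw": p_sw, "P_total": p_sw + p_hw}


allocation = mm_allocate(16, SP, HP, P_HW_BEST)
print(allocation)
```

For these values the sketch selects one CPU and 12 processes for hardware invocation and 15 processes for software encryption, 27 in total, matching the result stated in the text.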


Figure 11: The working flow of (a) software encryption and (b) hardware encryption. [(a) User space: Web Server, OpenSSL, software crypto lib. (b) User space and kernel space: Web Server, OpenSSL, CryptoDev, Crypto API, hardware driver, hardware engine.]

Figure 12: The encryption performance with hardware/software working mode at different CPU number. [Bar chart "AES-256-CBC Crypto Performance"; x-axis: the number of CPUs (1 to 16); y-axis: encryption bandwidth (GB/s). Software: 1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17, 9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97. Hardware: 7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62.]

Figure 13: The performance differences between hardware and software encryption. [Chart "The performance differences between hardware and software encryption at different CPU number"; x-axis: the number of CPUs (1 to 16); y-axis: performance differences (GB/s). Values: 6.51, 5.57, 4.56, 3.54, 2.57, 1.45, 0.45, -0.55, -1.57, -2.6, -3.6, -4.64, -5.66, -6.69, -7.71, -8.35.]


Table 4: Hardware environment of testing.

Hardware       Number   Configurations
ARM Server     2        CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB
Network Card   4        10 Gbps each
Net cable      4        2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables

Table 5: Software environment of testing.

Software Configuration   Software Version
Operating System         Linux-4.1.27
OpenSSL                  OpenSSL-1.0.2j
Nginx                    Nginx-1.11.6
Benchmark                ab (apache benchmark)

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients). [Diagram: the Server and the Testing machine (Client) are connected through four 10 Gbps Ethernet ports.]

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we established a test platform for ACSA with 40 Gbps network bandwidth. As shown in Table 4, we deployed 2 Web Servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the highly concurrent requirements of a large number of clients, we established the networking environment with 4 network cards, each contributing 10 Gbps of bandwidth.

As listed in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4.1.27, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (apache benchmark) [32] is a widely used testing toolset for website performance. We chose this benchmark for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed with the standard benchmark speed in OpenSSL. With speed, we can measure the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms include AES-256-CBC and 3DES-CBC, of which AES-CBC can be accelerated through AES-NI while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM cores. The performance is measured in MB/s in this testing, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service that ACSA can provide. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (requests per second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to observe the performance for different data sizes. We take full advantage of the 4 network cards to run 32 ab processes so as to ensure testing pressure with a high workload.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with the CPU, where the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, where the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy; this flow is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth with the four different working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and the other flows in Figure 16, in which the encryption flow with AES-NI is the best case for software encryption.


Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC. [Bar chart "The encryption bandwidths with different working flows (AES-256-CBC)"; x-axis: block size; y-axis: encryption bandwidth (MB/s). Data labels:]

Block size   SW        SW-NI      HW        HW-SW codesign
4 KB         1272.68   11772.10   1650.75   11446.83
8 KB         1272.31   11804.24   3188.08   11635.55
16 KB        1269.26   11817.64   5891.48   12576.91
32 KB        1263.54   11809.72   5932.79   13948.63
64 KB        1263.99   11802.95   5954.03   15952.37
128 KB       1258.23   11810.72   5964.93   16989.31
256 KB       1247.49   11733.47   5970.58   16956.30
512 KB       1251.86   11658.08   5966.48   16855.39

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW. [Chart "16 CPUs, 32 processes, AES-256-CBC: HW-SW co-design encryption bandwidths improvement"; x-axis: block size; y-axis: encryption bandwidth improvement (%). Data labels:]

Block size   vs SW-NI (%)   vs HW (%)   vs SW (%)
4 KB         -2.76          593.43      799.43
8 KB         -1.43          264.97      814.52
16 KB        6.42           113.48      890.89
32 KB        18.11          135.11      1003.93
64 KB        35.16          167.93      1162.06
128 KB       43.85          184.82      1250.25
256 KB       44.51          184.00      1259.23
512 KB       44.58          182.50      1246.43


Figure 17: Encryption bandwidths with hardware engines at different number of CPUs. [Chart "The encryption bandwidths of HW with different number of CPU (3DES-CBC)"; x-axis: block size (4 KB to 512 KB); y-axis: encryption bandwidth (MB/s); one series per CPU count from 1 to 16.]

As we can see from Figures 15 and 16, encryption with hardware engines performs better than the crypto lib but worse than AES-NI. The reason is the invocation cost of hardware engines, with its context switches and mode switches, as analyzed in Section 4.1. Besides, for AES-NI the encryption function is accelerated directly through instructions and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We get a 799%∼1246% performance improvement compared with software encryption and a 593%∼182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18%∼44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than that of AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for the data size check and the scheduling decision, the maximum encryption bandwidth is somewhat lower than that of AES-NI. However, this influence decreases as the data size increases, because the scheduling check is performed less frequently. If the data size is 16 KB or larger, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
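The improvement percentages quoted above follow from the bandwidths in Figure 15. The following sketch recomputes them for the 128 KB block size; the function name is hypothetical, and the input values are the Figure 15 data labels in MB/s.

```python
def improvement_percent(new_bw: float, baseline_bw: float) -> float:
    """Relative bandwidth improvement of new_bw over baseline_bw, in percent."""
    return (new_bw / baseline_bw - 1.0) * 100.0


# 128 KB block size, bandwidths in MB/s (Figure 15).
SW, SW_NI, HW, HW_SW = 1258.23, 11810.72, 5964.93, 16989.31

for name, baseline in (("SW-NI", SW_NI), ("HW", HW), ("SW", SW)):
    print(f"HW-SW codesign vs {name}: {improvement_percent(HW_SW, baseline):.2f}%")
```

A negative result, as for the 4 KB and 8 KB blocks compared with SW-NI, means the codesign flow is slower than the baseline for that block size.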

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, so data filtering and dynamic scheduling are not worthwhile. Therefore, we only applied the MM strategy to fully utilize the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by the many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can get almost the best encryption bandwidth with only four CPUs.

If we want to get maximum performance with both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines, with only four CPUs used (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As we can see from Figure 18, even with 16 CPUs responsible for the 3DES encryption, the software encryption bandwidth is still very small (only about 271 MB/s). Since it is computation intensive, this algorithm occupies a lot of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for larger block sizes. The reason is that more invocation cost is induced for smaller data blocks. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

Figure 18: Encryption bandwidth for different working flows. [Bar chart "The encryption bandwidths with different working flows (DES3-CBC)"; x-axis: block size; y-axis: encryption bandwidth (MB/s). Data labels:]

Block size   SW       HW with 4 CPUs   HW-SW codesign
4 KB         271.69   1270.39          1921.24
8 KB         271.68   2429.30          3536.86
16 KB        271.20   3453.60          3667.12
32 KB        271.78   3489.52          3686.88
64 KB        271.66   3495.98          3696.00
128 KB       271.39   3499.26          3701.05
256 KB       271.74   3500.83          3703.26
512 KB       271.69   3499.89          3701.54

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW. [Chart "16 CPUs, 32 processes, DES3-CBC: HW-SW co-design encryption bandwidths improvement"; x-axis: block size; y-axis: encryption bandwidth improvement (%). Data labels:]

Block size   vs SW (%)   vs HW with 4 CPUs (%)
4 KB         607.14      51.23
8 KB         1201.85     45.59
16 KB        1252.18     6.18
32 KB        1256.57     5.66
64 KB        1260.52     5.72
128 KB       1263.74     5.77
256 KB       1262.80     5.78
512 KB       1262.41     5.76

To have a clear comparison we concluded the perfor-mance improvement in Figure 19 As we can see there isnearly 12 times bandwidth improvement through HW-SW

codesign compared with software encryption There is only6 times improvement for small data blocks (4 K) due tothe worse utilization of hardware engines Compared withonly hardware encryption the improvement with HW-SWcodesign is not so obvious for large data blocksThe reason isthe poor contribution through software encryption withoutAES-NI Still we can get 45sim51 improvement for smalldata blocks with HW-SW codesign and get additional 200sim

18 Security and Communication Networks

Page size 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

SW 258.41 419.18 504.74 608.54 679.39 720.47 741.12 756.26

SW-NI 342.91 613.02 857.61 1182.25 1451.02 1653.04 1761.68 1831.19

HW 228.04 409.20 573.66 814.16 980.63 1179.44 1276.61 1336.44

HW-SW co-design 338.96 608.02 851.59 1173.38 1416.41 1705.11 1818.99 1884.28

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). [Chart: network bandwidth (MB/s) against page size for SW, SW-NI, HW, and HW-SW co-design.]

300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU capacity be used for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hours the user could take full advantage of both hardware and software for maximum overall performance; during normal working time, with fewer encryption requirements, the user could allocate only the management resources needed for the HACs and aim for the best offloading.

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) software encryption with AES-NI instruction acceleration (SW-NI); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth (MB/s) and present the results in Figure 20.
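The conversion from RPS to network bandwidth can be sketched as follows. This is an illustrative Python sketch, not the authors' tooling: the function name is hypothetical, it assumes one page per request and 1 MB = 1024 KB, and it reads the SW-NI entries of the 16-process RPS table with two decimal places (for example, 3662.37 requests per second for 512 KB pages), which is an assumption about the original formatting.

```python
def rps_to_mb_per_s(rps, page_kb):
    """Convert requests per second to MB/s for a fixed page size in KB.

    Assumes every request returns exactly one page and 1 MB = 1024 KB.
    """
    return rps * page_kb / 1024.0


# Assumed reading of the SW-NI row (16 processes): RPS for 4 KB and 512 KB pages.
for page_kb, rps in ((4, 87784.10), (512, 3662.37)):
    print(f"{page_kb} KB pages: {rps_to_mb_per_s(rps, page_kb):.2f} MB/s")
```

Under these assumptions the results agree, to within rounding, with the SW-NI bandwidths reported for Figure 20.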

As Figure 20 shows, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small block sizes; for example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of size detection and the scheduling decision. If the page size is 128 KB or larger, the overall performance exceeds that of both software-only and hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. With HW-SW codesign, however, there is 12.51% to 18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, and there is also a 2.85% to 3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page Size (KB) 4 8 16 32 64 128 256 512

SW 66153.10 53654.50 32303.20 19473.40 10870.30 5763.76 2964.48 1512.51

SW-NI 87784.10 78466.20 54887.00 37831.90 23216.30 13224.30 7046.71 3662.37

HW 58378.50 52377.60 36714.30 26053.10 15690.00 9435.54 5106.42 2672.87

HW-SW co-design 86679.40 77827.00 45499.50 34357.70 22623.60 13670.90 7275.66 3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page Size (KB) 4 8 16 32 64 128 256 512

SW 65885.90 52987.80 32030.00 19391.20 10803.80 5744.14 2964.18 1503.48

SW-NI 86302.50 77827.15 53017.20 37116.10 22920.40 13092.50 7020.85 3657.35

HW 58499.10 54891.60 37946.40 26972.10 16401.90 10041.90 5486.14 2881.66

HW-SW co-design 85177.20 77211.90 46404.00 36166.60 24875.10 15347.70 8325.19 4391.45

Page size 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

HW-SW co-design vs SW-NI improvement (%) -1.15 -0.81 -0.70 -0.75 -2.38 3.15 3.25 2.90

HW-SW co-design vs HW improvement (%) 48.64 48.59 48.45 44.12 44.44 44.57 42.49 40.99

CPU idle (%) 1.50 1.60 1.70 1.80 15.50 18.00 20.00 20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). [Chart: network bandwidth improvement and CPU idle against page size.]

If we want higher network bandwidth, and thus better encryption performance, we can use the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As Figure 21 shows, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with HW-SW codesign, we increase the number of processes to use the idle CPU capacity. We tested with 32 processes; the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth (MB/s) and present the results in Figure 22.

As Figure 22 shows, HW-SW codesign achieves a 49% to 52% bandwidth improvement over hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. Unlike the case with 16 processes, however, the influence is more obvious for large block sizes, since more CPU idle is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling


Page size 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

SW 257.37 413.97 500.47 605.98 675.24 718.02 741.05 751.74

SW-NI 337.12 608.26 828.39 1159.88 1432.53 1636.56 1755.21 1828.68

HW 228.51 421.03 592.91 842.88 1025.12 1255.24 1371.54 1440.83

HW-SW co-design 334.03 603.22 820.70 1159.42 1562.83 1923.08 2092.16 2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). [Chart: network bandwidth (MB/s) against page size for SW, SW-NI, HW, and HW-SW co-design.]

Page size 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

HW-SW co-design vs SW-NI improvement (%) -0.92 -0.83 -0.93 -0.04 9.10 17.51 19.20 20.11

HW-SW co-design vs HW improvement (%) 46.17 43.27 38.42 37.55 52.45 53.20 52.54 52.44

CPU idle (%) 2.50 2.00 2.00 1.80 1.50 1.80 1.50 1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). [Chart: network bandwidth improvement and CPU idle against page size.]

decision. If the page size is 64 KB or larger, the overall performance exceeds that of both software-only and hardware-only encryption.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption path. Hardware engines do not perform as well as AES-NI here because of the context-switch cost. Therefore, ACSA automatically adopts software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for a page size of 512 KB.
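The improvement figures quoted here are relative gains over a baseline. A minimal Python sketch of that calculation follows; the function name is hypothetical, and the bandwidths are the 512 KB values for 32 processes read with two decimal places (HW-SW codesign 2196.43 MB/s, SW-NI 1828.68 MB/s, HW 1440.83 MB/s), which is an assumption about the original formatting.

```python
def improvement_percent(new, baseline):
    """Relative gain of `new` over `baseline`, expressed in percent."""
    return (new - baseline) / baseline * 100.0


# Assumed 512 KB bandwidths in MB/s for 32 processes.
hw_sw, sw_ni, hw = 2196.43, 1828.68, 1440.83
print(f"vs SW-NI: {improvement_percent(hw_sw, sw_ni):.2f}%")
print(f"vs HW:    {improvement_percent(hw_sw, hw):.2f}%")
```

With these inputs the gain over AES-NI comes out at roughly 20%, in line with the figure stated above.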


Page size 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

HW-SW co-design vs SW-NI improvement, 16 processes (%) -1.15 -0.81 -0.70 -0.75 -2.38 3.15 3.25 2.90

HW-SW co-design vs SW-NI improvement, 32 processes (%) -0.92 -0.83 -0.93 -0.04 9.10 17.51 19.20 20.11

CPU idle, 16 processes (%) 1.50 1.60 1.70 1.80 15.50 18.00 20.00 20.50

CPU idle, 32 processes (%) 2.50 2.00 2.00 1.80 1.50 1.80 1.50 1.60

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs). [Chart: network bandwidth improvement and CPU idle against page size.]

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page Size (KB) 4 8 16 32 64 128 256 512

SW 36334.70 25207.40 12791.60 6900.93 3589.73 1833.76 924.40 466.12

HW 58275.70 50717.80 25772.70 14525.70 7645.65 3902.54 1968.62 990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of available processes. From the experimental results, we draw the following conclusions.

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology makes a better trade-off between performance and CPU idle possible. It is possible to get the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user could use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI. The other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We get 1.6 to 2.1 times the performance with hardware encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60% to 112% improvement compared with software encryption. Besides the performance improvement, 13.62% to 89.54% CPU idle is still available with hardware encryption.

According to the OpenSSL test, the maximum performance improvement is 12 times. However, there is only a 2 times RPS improvement in end-to-end testing. The problem is a constraint of the testing platform: we found that software decryption in the client has become a bottleneck. All the CPUs of the client are completely loaded.

7. Conclusions and Future Work

In this work we presented ACSA, an Adaptive Crypto System based on Accelerators, which is able to adopt the crypto mode adaptively and dynamically according to the request character and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU


Page size 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

SW 141.93 196.93 199.87 215.65 224.36 229.22 231.10 233.06

HW 227.64 396.23 402.70 453.93 477.85 487.82 492.16 495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth. [Chart: network bandwidth (MB/s) against page size for SW and HW.]

Page size 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

HW vs SW improvement (%) 60.39 101.20 101.48 110.49 112.99 112.82 112.96 112.58

CPU idle (%) 13.62 15.40 53.15 70.40 75.07 86.04 88.17 89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle. [Chart: network bandwidth improvement and CPU idle against page size.]

occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies, such as data aggregation, to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we get a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy


efficiency exploration [34], providing emerging solutions for data centers besides X86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article David Brumley and Dan Boneh, Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM cortex TM-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.


[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Before interrupt aggregation, the system overhead for n interrupts is

$C_1 = n \times (2C_{cs} + C_{irq})$. (1)

With the proposed interrupt aggregation, the overhead is reduced to

$C_2 = 2C_{cs} + n \times C_{irq}$. (2)

The formulas show an obvious reduction in context-switch overhead with interrupt aggregation. The reduction grows as the number of interrupts n increases for a given application scenario.
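Equations (1) and (2) can be compared numerically with a short Python sketch. The function names and the cost values below are hypothetical and chosen only for illustration; the point is that the saving, $C_1 - C_2 = 2(n - 1)C_{cs}$, grows linearly with n.

```python
def overhead_without_aggregation(n, c_cs, c_irq):
    """Equation (1): every interrupt pays two context switches plus its handler."""
    return n * (2 * c_cs + c_irq)


def overhead_with_aggregation(n, c_cs, c_irq):
    """Equation (2): one pair of context switches is shared by all n handlers."""
    return 2 * c_cs + n * c_irq


# Hypothetical costs in microseconds, not measured values.
c_cs, c_irq = 3.0, 1.0
for n in (1, 4, 16):
    c1 = overhead_without_aggregation(n, c_cs, c_irq)
    c2 = overhead_with_aggregation(n, c_cs, c_irq)
    print(f"n={n}: C1={c1}, C2={c2}, saved={c1 - c2}")
```

For n = 1 the two expressions coincide, which matches the intuition that aggregation only pays off when several interrupts are batched.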

4. Adaptive Scheduler Based on HW-SW Codesign

As a heterogeneous computing platform, ACSA includes multiple processing cores and hardware accelerators. To take full advantage of both accelerators and CPUs, the Adaptive Scheduler is designed to realize optimal resource allocation. The scheduler calls the OpenSSL library or invokes accelerators for crypto computing according to the features of application requests. However, invoking hardware accelerators also induces extra management overhead for a specific system. Furthermore, in both ARM and X86 based architectures, most systems support cryptographic instructions, such as AES-NI. Therefore, we first analyze the working flow in detail to explore the difference in cost and performance between the AES-NI and accelerator encryption modes. In this way, we gather statistics on the time spent in each segment of the working flow and determine the hardware offloading conditions for SSL/TLS-based connections.

4.1. Overhead Analysis for Workload Offloading. To guarantee universality, we adopt the standard Crypto API framework in the kernel for the invocation of hardware accelerators, which includes the overhead of mode switches and context switches. The detailed processing steps and overheads are summarized as follows.

(1) Pass the key through ioctl and create the crypto session: ioctl(ctx->cfd, CIOCGSESSION, &ctx->sess). This process induces two mode switches (entering the kernel, and returning from the kernel to user space).

(2) Pass the request data (original data) through ioctl to the driver in the kernel: ioctl(ctx->cfd, CIOCCRYPT, &cryp). This step generates another mode switch.

(3) After the kernel submits the request to the driver, it calls function waitfor() and waits for the completion of the hardware computation. During this period the current user program enters a sleep state and the kernel schedules another process. This stage causes one context switch.

(4) The hardware accelerator performs the crypto computation asynchronously and raises an interrupt after execution is complete. The interrupt processing results in a context switch.

(5) Once the interrupt processing (generated in step (4)) is completed, function complete() is invoked to notify the user program that the request has been executed. The kernel then schedules the submitting user-space process in the next scheduling cycle; this produces a context switch.

(6) The process that submitted the request gets the encrypted results and returns to user space; that is, the second ioctl returns, generating a mode switch between user space and kernel space.

Based on the above analysis, we decompose the overhead of HAC invocation into three parts: hardware initialization time $t_{init}$, accelerator execution time $t_{exec}$, and the time for interrupt processing and mode switch $t_{irq}$. Therefore, the overhead of hardware acceleration can be denoted as $t_{hw} = t_{init} + t_{exec} + t_{irq}$.

To figure out the hardware offloading conditions for SSL/TLS-based applications, we measured the execution time of three working flows for different block sizes: $t_{sw}$, the time needed for software computing with AES-NI (denoted SW-NI); $t_{hw}$, the time needed for hardware encryption with accelerators (denoted HW); and the time needed with NULL CIPHER. In SW-NI mode, all SSL/TLS connection and encryption operations are performed by the CPUs with the available OpenSSL library. In HW mode, the crypto computing is offloaded to hardware accelerators, and software accesses the hardware engines through the CryptoDev methodology (see Section 3 for details). In NULL CIPHER mode, only the scatter walk over the incoming scatter list is performed, without any crypto operation. The scatter-walk time captures the cost of the mode switch between user space and the kernel and of scanning the scatter list. This overhead is unavoidable when invoking hardware accelerators. NULL CIPHER mode is used to measure the general cost of the Crypto API framework and to calculate the corresponding overheads $t_{init}$ and $t_{irq}$.

We use the standard benchmark speed to obtain the overhead time in the different working modes. For a given block size, we record the total execution time of 1000 encryptions and calculate the average value.
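The averaging procedure can be sketched in Python as follows. This is not the benchmark used in the paper: the function names are hypothetical, and a SHA-256 digest stands in for the cipher call purely to give the loop measurable work.

```python
import hashlib
import time


def average_time_us(operation, block, runs=1000):
    """Run operation(block) `runs` times and return the mean time in microseconds."""
    start = time.perf_counter()
    for _ in range(runs):
        operation(block)
    elapsed = time.perf_counter() - start
    return elapsed / runs * 1e6


def stand_in_workload(block):
    """Placeholder for the cipher call; it hashes the block and is NOT encryption."""
    return hashlib.sha256(block).digest()


for size in (64, 1024, 16384):
    block = bytes(size)
    print(size, "bytes:", round(average_time_us(stand_in_workload, block), 3), "us")
```

Timing the whole loop and dividing by the run count, rather than timing each call, keeps the timer overhead out of the per-operation figure.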

4.2. Request Filtering and Data Aggregation. As analyzed above, using acceleration induces additional overhead. To make full use of the hardware acceleration engines, request filtering is applied first in the Adaptive Scheduler. As illustrated in Figure 6, if the data size is small, the scheduler chooses software encryption with AES-NI to avoid the additional cost. Otherwise, hardware accelerators are invoked through CryptoDev. To further reduce the invocation overhead, request aggregation can follow if hardware acceleration is adopted.

The block size threshold is determined by a cost comparison. Offloading is practical only if the additional overhead of invoking hardware is less than the SW-NI execution time. That is, the following criterion should be satisfied:

$t_{sw} > c_{hw} = t_{init} + t_{irq}$,

i.e., $T_{diff} = t_{sw} - (t_{init} + t_{irq}) > 0$. (3)


Table 1: The running time for accelerator invocation and AES-NI. Columns: block size (bytes); execution time with AES-NI $t_{sw}$ (µs); and the three components of the execution time with hardware acceleration $t_{hw}$ (µs), namely initialization $t_{init}$, encryption $t_{exec}$, and invocation cost $t_{irq}$.

Block size $t_{sw}$ $t_{init}$ $t_{exec}$ $t_{irq}$

64 0.098 2.188 1.100 7.552

128 0.156 1.816 1.180 7.814

256 0.282 1.546 1.360 7.854

512 0.517 1.460 1.660 7.680

1024 0.985 1.918 2.300 7.282

2048 1.936 1.614 3.600 7.546

4096 3.838 1.874 6.140 7.506

8192 7.653 1.774 11.280 7.806

16384 15.259 1.866 21.500 7.984

32768 30.479 2.034 42.000 8.636

65536 60.902 2.128 82.940 10.212

Figure 6: Working flow of Adaptive Scheduler. [Diagram: a TCP package from the Web server enters OpenSSL and passes through request filtering; small block sizes go to software encryption with AES-NI, while large block sizes go through request aggregation (n × grain_size) and CryptoDev to the hardware accelerators.]

Therefore, we need to measure the execution time of software computing $t_{sw}$ and the accelerator invocation cost $c_{hw}$ for various block sizes, and determine the threshold according to the difference between the two parameters.

Here we take AES-128-CBC as an example to introduce the threshold determination for request filtering.

From the measured data in Table 1, we can plot the difference between $t_{sw}$ and $c_{hw}$ as a function of block size (as shown in Figure 7).

As the trend in Figure 7 shows, when the block size is smaller than 16 KB, $t_{sw} - c_{hw} < 0$, so the running time with AES-NI is smaller. When the block size is at least 16 KB, $t_{sw} - c_{hw} > 0$, so the overhead of software computing is bigger than the cost of invoking the accelerator.

Therefore, we set the threshold for workload offloading in this example to 16 KB. If the request block size is 16 KB or larger, the scheduler invokes the hardware for acceleration; otherwise, only the software working mode is adopted.
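The threshold determination above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the rows are transcribed from Table 1 with the decimal placement assumed, and the names `find_threshold` and `choose_engine` are hypothetical.

```python
# Rows transcribed from Table 1 (times in microseconds; decimal placement is
# an assumption of this sketch): (block size in bytes, t_sw, t_init, t_irq).
TABLE1 = [
    (64, 0.098, 2.188, 7.552),
    (128, 0.156, 1.816, 7.814),
    (256, 0.282, 1.546, 7.854),
    (512, 0.517, 1.460, 7.680),
    (1024, 0.985, 1.918, 7.282),
    (2048, 1.936, 1.614, 7.546),
    (4096, 3.838, 1.874, 7.506),
    (8192, 7.653, 1.774, 7.806),
    (16384, 15.259, 1.866, 7.984),
    (32768, 30.479, 2.034, 8.636),
    (65536, 60.902, 2.128, 10.212),
]


def t_diff(t_sw, t_init, t_irq):
    """T_diff = t_sw - (t_init + t_irq), as in equation (3)."""
    return t_sw - (t_init + t_irq)


def find_threshold(rows):
    """Smallest measured block size from which T_diff stays positive."""
    threshold = None
    # Walk from the largest block size downwards until T_diff turns negative.
    for size, t_sw, t_init, t_irq in sorted(rows, reverse=True):
        if t_diff(t_sw, t_init, t_irq) > 0:
            threshold = size
        else:
            break
    return threshold


def choose_engine(block_size, threshold):
    """Request filtering: offload only blocks at or above the threshold."""
    return "hardware" if block_size >= threshold else "software"


threshold = find_threshold(TABLE1)
print(threshold, choose_engine(4096, threshold), choose_engine(32768, threshold))
```

With the Table 1 values, the sign of $T_{diff}$ changes between the 8192-byte and 16384-byte rows, which matches the 16 KB threshold chosen in the text.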

Figure 7: The difference between $t_{sw}$ and $c_{hw}$ for variable block size.

If hardware acceleration is adopted, OpenSSL performs SSL processing on the original data and then sends the request to the accelerators through CryptoDev and the hardware driver. For each request, OpenSSL segments the data if the request is larger than the grain size. Here the grain size is defined as the unit block of OpenSSL processing. For each grain block, OpenSSL first performs compression, MAC computation, explicit IV addition, and padding, and then delivers the encapsulated grain block to the hardware driver through CryptoDev, one block at a time. Therefore, one invocation is needed for each block encryption, and the next data block is processed only after the former one is complete. Figure 8 demonstrates an example with a grain size of 16 KB, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, the request is segmented into 16 grain blocks according to the processing unit restriction in OpenSSL, so 16 invocations are executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request is $16 \times c_{hw}$. This processing flow produces a lot of additional overhead, resulting in low utilization of the accelerators.
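The per-grain invocation overhead described above can be made concrete with a short calculation. This is a sketch under stated assumptions: the helper names are hypothetical, the example uses the 64 KB request of Figure 8, and $c_{hw}$ is taken as $t_{init} + t_{irq}$ from the 16384-byte row of Table 1 (decimal placement assumed).

```python
def grain_blocks(request_bytes, grain_bytes):
    """Number of grain blocks, and hence accelerator invocations, per request."""
    if request_bytes <= 0 or grain_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-request_bytes // grain_bytes)  # ceiling division


def invocation_overhead_us(request_bytes, grain_bytes, c_hw_us):
    """Total invocation overhead when every grain block is sent separately."""
    return grain_blocks(request_bytes, grain_bytes) * c_hw_us


KB = 1024
# Assumed c_hw for a 16 KB block: t_init + t_irq from Table 1.
C_HW_16K_US = 1.866 + 7.984

blocks = grain_blocks(64 * KB, 16 * KB)
overhead = invocation_overhead_us(64 * KB, 16 * KB, C_HW_16K_US)
print(blocks, round(overhead, 2))
```

The overhead grows linearly with the number of grain blocks, which is the cost that request aggregation is meant to remove.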

To further reduce the invocation cost of hardware accelerators, request aggregation is proposed in this paper to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted with a single accelerator invocation. The design flow for data aggregation is detailed as follows.

(1) Modify the configuration file nginx.cfg for Nginx so that the function ngx_ssl_write() can get a data block with a length of n × grain_size.

(2) Extend the function ssl3_write_bytes() to support a processing length of n × grain_size.

(3) Strengthen the function do_ssl3_write() with a data aggregation operation. In this function, the requested data of n × grain_size follows the SSL processing flow (compression, MAC, padding, etc.) and is then aggregated into a single package for lower-layer processing.

(4) Increase the buffer size for data write to support aggregated data storage.

(5) Revise the encryption function EVP_Cipher() to issue the aggregated data block to the hardware for crypto computing.

(6) Decompose the encrypted data into n segments after the hardware computing is completed, and add a TCP header to each segment.

(7) Send the n TCP packages one by one by calling ssl3_write_pending().
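The effect of the seven steps can be illustrated with a toy model that only counts accelerator invocations. It is not OpenSSL code: `CountingAccelerator`, `send_one_by_one`, and `send_aggregated` are hypothetical names, the "encryption" is a placeholder XOR rather than a real cipher, and the per-grain preprocessing (MAC, explicit IV, padding) is omitted.

```python
from typing import List

GRAIN_SIZE = 16 * 1024  # 16 KB grain size, as in the paper's example


class CountingAccelerator:
    """Hypothetical stand-in for the CryptoDev path; counts invocations."""

    def __init__(self) -> None:
        self.invocations = 0

    def encrypt(self, data: bytes) -> bytes:
        self.invocations += 1
        # Placeholder transform, NOT real encryption: XOR every byte with 0x5A.
        return bytes(b ^ 0x5A for b in data)


def split_grains(data: bytes, grain: int = GRAIN_SIZE) -> List[bytes]:
    """Cut a buffer into grain-sized segments (the last one may be shorter)."""
    return [data[i:i + grain] for i in range(0, len(data), grain)]


def send_one_by_one(data: bytes, hw: CountingAccelerator) -> List[bytes]:
    """Original flow: one accelerator invocation per grain block."""
    return [hw.encrypt(grain) for grain in split_grains(data)]


def send_aggregated(data: bytes, hw: CountingAccelerator) -> List[bytes]:
    """Aggregated flow: join n grain blocks, encrypt once, split into n segments."""
    aggregated = b"".join(split_grains(data))  # steps (3) and (4)
    encrypted = hw.encrypt(aggregated)         # step (5): single invocation
    return split_grains(encrypted)             # step (6): n segments again


payload = bytes(range(256)) * 256  # 64 KB of sample data
hw_plain, hw_agg = CountingAccelerator(), CountingAccelerator()
segments_plain = send_one_by_one(payload, hw_plain)
segments_agg = send_aggregated(payload, hw_agg)
print(hw_plain.invocations, hw_agg.invocations, len(segments_agg))
```

Both flows return the same number of segments for the TCP layer; only the number of trips to the accelerator changes. With a real chained mode such as CBC, the aggregated ciphertext would depend on how the IVs are handled, which this placeholder does not model.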

As illustrated in Figure 9, for a request of n blocks (each of grain size), the proposed methodology first performs data preprocessing (MAC, padding, etc.), then encapsulates the n segments together and issues them to the hardware driver, and finally performs data encryption through one accelerator invocation. In this way, the overhead of mode switches and context switches is reduced to 1/n of that of the original solution, which greatly improves the utilization of the hardware engines.
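The 1/n reduction can be written out as a short derivation. It assumes, as a simplification not stated in the source, that the per-invocation cost $c_{hw}$ is the same for a single grain block and for the aggregated block; Table 1 suggests that the invocation cost grows only mildly with block size, so the ratio is approximate.

$$C_{orig} = n \cdot c_{hw}, \qquad C_{agg} = c_{hw}, \qquad \frac{C_{agg}}{C_{orig}} = \frac{1}{n}.$$

For a 64 KB request with a 16 KB grain size (n = 4), the switching overhead under this assumption drops to one quarter of the original.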

5. Maximize Resource Utilization with Minimal Management Cost

For most Web Server applications, highly concurrent requests need to be processed in time, which is why we integrate multiple CPUs and accelerators in a single server. On a heterogeneous platform with both CPUs and hardware engines, it is important to bring the respective computing strengths into full play. However, how can the hardware engines be fully utilized at the least system cost? How should the concurrent processes be scheduled to get the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy for resource allocation from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and to obtain the best system performance through hardware and software codesign.

For a better description, we first define the relevant parameters, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume there are N CPUs available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. Of the P total available processes in the system, $P_{hw}$ processes are activated for the hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is denoted as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw}. \tag{4}$$
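Equation (4) can be evaluated directly. The sketch below reads $B_{sw}$ as the AES-NI bandwidth contributed by each CPU left for software encryption, which is an interpretation rather than something the text states explicitly; the function name and the sample numbers are hypothetical.

```python
def max_codesign_bandwidth(b_hw, b_sw_per_cpu, n_total, n_hw):
    """B_max = B_hw + (N - N_hw) * B_sw, following equation (4)."""
    if not 0 <= n_hw <= n_total:
        raise ValueError("n_hw must lie between 0 and n_total")
    return b_hw + (n_total - n_hw) * b_sw_per_cpu


# Hypothetical inputs: 7.5 GB/s from the accelerators, 1.0 GB/s per software
# CPU, 16 CPUs in total, 1 CPU reserved for accelerator invocation.
print(max_codesign_bandwidth(7.5, 1.0, 16, 1))
```

The formula makes the goal of the MM strategy visible: the fewer CPUs needed to keep the accelerators saturated, the more CPUs remain to add software bandwidth.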

The core of the MM algorithm is to find an allocation strategy with parameters $N_{hw}$ and $P_{hw}$ that obtains the best system performance from the available resources through hardware and software codesign. The parameter $N_{hw}$ found by the MM algorithm should be the least number of CPUs needed for accelerator management, while the parameter $P_{hw}$ should be the least number of processes needed. Through the MM strategy, we can make full use of the hardware accelerators with the least occupation of system resources. Thus, the remaining CPUs are free for software computing with AES-NI or for other operations as needed [29, 30].

The major steps of the MM strategy are as follows.

(S1) Activate i CPUs (i = 1, 2, ..., N) and i processes for software encryption, and obtain the maximum encryption bandwidth $SP_i$ with the CPUs completely loaded.


Figure 8: Existing processing flow for a 64 KB data without data aggregation. (Diagram stages for each 16384-byte segment of the original data: segmentation, hash, record protocol unit, add explicit IV, padding, encryption in the crypto unit, add header, TCP data package. The plaintexts are sent to the hardware one by one.)

Figure 9: Proposed processing flow for a 64 KB data with data aggregation. (Diagram stages for each 16384-byte segment of the original data: segmentation, hash, record protocol unit, add explicit IV, padding, hardware crypto unit, add header, TCP data package. All plaintexts are sent to the hardware at once.)


Table 2: Parameters for MM algorithm.

| Parameter | Description |
|---|---|
| N | The number of available CPUs in the system |
| $N_{sw}$ | The number of CPUs used for software encryption with AES-NI |
| $N_{hw}$ | The number of CPUs responsible for the processes of hardware invocation |
| $P_{sw}$ | The processes invoked for software encryption with AES-NI |
| $P_{hw}$ | The processes invoked for hardware encryption |
| $P_{total}$ | The total number of active processes |
| $HP_{i,j}$ | The maximum bandwidth with hardware encryption with i CPUs and j processes, completely loaded |
| $SP_i$ | The maximum bandwidth with software encryption with i CPUs and i processes, completely loaded |
| $P_{i,j}$ | The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$) |

Figure 10: CPUs and processes allocation with MM strategy. (Diagram: the Web Server allocates $P_{sw}$ processes on $N_{sw}$ CPUs to software encryption with AES-NI and $P_{hw}$ processes on $N_{hw}$ CPUs to hardware encryption on the HACs.)

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ with the CPUs completely loaded.

(S3) Calculate the performance difference between hardware encryption and software computing: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) one by one to obtain the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum of the $P_{i,j}$ values from (S4), $\max(P_{i,j})$ (i = 1, 2, ..., N). From $\max(P_{i,j})$ the parameters can be determined: the number of CPUs used for accelerator invocation is $N_{hw} = i$, the number of processes for hardware encryption is $P_{hw} = j$, the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$, and the corresponding number of processes is $P_{sw} = N_{sw}$. The total number of processes to be activated is $P_{total} = P_{sw} + P_{hw}$.
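Steps (S1) to (S5) amount to a search over measured bandwidths. The following sketch assumes the measurements are already available as plain dictionaries; the function name `mm_allocate`, the dictionary layout, the tie-breaking rule (prefer fewer CPUs, then fewer processes), and the sample numbers are all assumptions made for illustration.

```python
def mm_allocate(sp, hp, n_total):
    """Pick (N_hw, P_hw) that maximizes P_ij = HP_ij - SP_i.

    sp: {i: software bandwidth with i CPUs and i processes}
    hp: {(i, j): hardware bandwidth with i CPUs and j processes}
    Ties are broken in favor of fewer CPUs, then fewer processes (assumption).
    """
    best_key, best_ij = None, None
    for (i, j), hw_bw in hp.items():
        key = (hw_bw - sp[i], -i, -j)
        if best_key is None or key > best_key:
            best_key, best_ij = key, (i, j)
    if best_ij is None:
        raise ValueError("hp must contain at least one measurement")
    n_hw, p_hw = best_ij
    n_sw = n_total - n_hw  # remaining CPUs do software encryption
    p_sw = n_sw            # one software process per remaining CPU
    return {
        "n_hw": n_hw,
        "p_hw": p_hw,
        "n_sw": n_sw,
        "p_sw": p_sw,
        "p_total": p_sw + p_hw,
        "diff": best_key[0],
    }


# Hypothetical measurements for a 4-CPU system (arbitrary bandwidth units).
sp = {1: 1.0, 2: 2.0}
hp = {(1, 1): 3.0, (1, 2): 5.0, (2, 2): 5.5, (2, 4): 5.5}
result = mm_allocate(sp, hp, n_total=4)
print(result)
```

In this toy input, one CPU with two processes gives the largest difference, so three CPUs remain for software encryption.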

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example to explore the parameters i and j, assuming N = 16. The detailed process followed by MM is as follows.

(S1) Adopt software encryption as the working mode through the Adaptive Scheduler in ACSA. All the encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, augment the workload until the CPU is completely loaded, and obtain the maximum encryption bandwidth of 1.01 GB/s.

(S3) Activate 2 to 16 CPUs in turn, with 2 to 16 processes respectively. Similar to (S2), the maximum encryption bandwidth can be explored for the different numbers of CPUs and processes. We summarize $SP_i$ (i = 1, 2, ..., 16) in Figure 12.

(S4) Adopt hardware encryption as the working mode through the Adaptive Scheduler in ACSA. All the encryption requests are processed through hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, augment the workload until the CPU is completely loaded, and obtain the maximum encryption bandwidth of 7.52 GB/s in hardware working mode.

(S6) Keeping one CPU activated, invoke 2 to n processes in turn. Record the maximum encryption bandwidth with the CPU completely loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n-1 processes, the parameter j is confirmed as n-1 and the test can be stopped.

(S7) Activate 2 to 16 CPUs in turn and follow steps (S5) and (S6) to confirm the number of processes at which the maximum encryption bandwidth is achieved. The number of processes for each number of CPUs is recorded in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption. The results are recorded in Figure 13.

Algorithm 1: The strategy for MM algorithm.

Input:
  N: the number of available CPUs in the system
Output:
  $N_{sw}$: the number of CPUs used for software encryption with AES-NI
  $N_{hw}$: the number of CPUs responsible for the processes of hardware invocation
  $P_{sw}$: the processes invoked for software encryption with AES-NI
  $P_{hw}$: the processes invoked for hardware encryption
  $P_{total}$: the total number of active processes

/* Test the maximum bandwidth with AES-NI for different CPU numbers; in general, the process number is equal to the CPU number */
(1) for CPU number i from 1 to N do
(2)   calculate $SP_i$ /* $SP_i$ is the maximum bandwidth when the CPU number is i */
(3) end for
/* Test the maximum bandwidth with hardware for different CPU numbers and process numbers */
(4) for CPU number i from 1 to N do
(5)   for process number j from 1 to 32 do
(6)     calculate $HP_{i,j}$ /* $HP_{i,j}$ is the maximum bandwidth when the CPU number and process number are i and j */
(7)   end for
(8) end for
/* Calculate the maximum performance difference between AES-NI computing and hardware encryption */
(9) let maxdiff ← $P_{1,1}$
(10) for CPU number i from 1 to N do
(11)   for process number j from 1 to 32 do
(12)     $P_{i,j}$ ← ($HP_{i,j} - SP_i$)
(13)     if (maxdiff < $P_{i,j}$) then
(14)       maxdiff ← $P_{i,j}$
(15)       $N_{hw}$ ← i
(16)       $P_{hw}$ ← j
(17)       $N_{sw}$ ← (N - i)
(18)       $P_{sw}$ ← $N_{sw}$
(19)       $P_{total}$ ← ($P_{sw}$ + $P_{hw}$)
(20)     end if
(21)   end for
(22) end for

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

| N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $P_{hw}$ | 12 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
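The stopping rule of step (S6) can be sketched as a small search loop. The `measure` callback stands for "run the benchmark with j processes and return the bandwidth" and is hypothetical, as is the synthetic saturation curve used in the example; the upper bound of 32 processes follows the loop bound in Algorithm 1.

```python
def find_saturating_processes(measure, max_processes=32):
    """Increase the process count until the bandwidth stops growing (S6).

    measure(j) is a caller-supplied function returning the bandwidth measured
    with j processes. Returns (j, bandwidth) for the last count that still
    improved on its predecessor.
    """
    best_j, best_bw = 1, measure(1)
    for j in range(2, max_processes + 1):
        bw = measure(j)
        if bw <= best_bw:  # no gain over j - 1 processes: stop the test
            break
        best_j, best_bw = j, bw
    return best_j, best_bw


def synthetic_curve(j):
    """Synthetic bandwidth that rises with j and saturates at 7.52 (made up)."""
    return min(0.7 * j, 7.52)


print(find_saturating_processes(synthetic_curve))
```

Stopping at the first non-improving measurement keeps the number of benchmark runs small, at the price of assuming that bandwidth does not dip and then recover at higher process counts.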

From Figure 13, we find the maximum difference in the case of $N_{hw} = 1$, $P_{hw} = 12$; i.e., the number of CPUs is 1 and the corresponding number of processes is 12. Based on the test results, 27 processes should be invoked in the system, of which 12 are scheduled for hardware management and the remaining 15 can be allocated to software encryption with AES-NI if no other work is needed.
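The allocation above can be rechecked from the data labels of Figure 12 and Table 3. The values below are transcribed from those labels with the decimal placement assumed (GB/s); the variable names are illustrative only.

```python
# Bandwidths in GB/s for 1..16 CPUs, transcribed from Figure 12
# (decimal placement is an assumption of this sketch).
SW = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
HW = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61] + [7.62] * 9
P_HW_TABLE = [12] + [16] * 15  # Table 3: best process count per CPU count
N = 16

# Difference between hardware and software bandwidth for each CPU count.
diffs = [round(h - s, 2) for h, s in zip(HW, SW)]

best = max(range(N), key=lambda k: diffs[k])  # index of the largest difference
n_hw = best + 1
p_hw = P_HW_TABLE[best]
n_sw = N - n_hw
p_sw = n_sw
p_total = p_sw + p_hw
print(diffs[0], n_hw, p_hw, p_sw, p_total)
```

The computed differences reproduce the labels of Figure 13 (6.51 GB/s for one CPU, falling below zero from eight CPUs on), and the resulting split is 12 hardware processes plus 15 software processes.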


Figure 11: The working flow of (a) software encryption and (b) hardware encryption. (Diagram labels: (a) user space: Web Server, OpenSSL, software crypto lib; (b) user space and kernel space: Web Server, OpenSSL, CryptoDev, Crypto API, hardware driver, hardware engine.)

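Figure 12: The encryption performance with hardware/software working mode at different CPU numbers (chart title: AES-256-CBC Crypto Performance; encryption bandwidth in GB/s).

| Number of CPUs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Software | 1.01 | 2.04 | 3.06 | 4.08 | 5.05 | 6.13 | 7.16 | 8.17 | 9.19 | 10.22 | 11.22 | 12.26 | 13.28 | 14.31 | 15.33 | 15.97 |
| Hardware | 7.52 | 7.61 | 7.62 | 7.62 | 7.62 | 7.58 | 7.61 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 |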

Figure 13: The performance differences between hardware and software encryption at different CPU numbers (GB/s).

| Number of CPUs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Difference | 6.51 | 5.57 | 4.56 | 3.54 | 2.57 | 1.45 | 0.45 | -0.55 | -1.57 | -2.6 | -3.6 | -4.64 | -5.66 | -6.69 | -7.71 | -8.35 |


Table 4: Hardware environment of testing.

| Hardware | Number | Configurations |
|---|---|---|
| ARM Server | 2 | CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB |
| Network Card | 4 | 10 Gbps each |
| Net cable | 4 | 2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables |

Table 5: Software environment of testing.

| Software Configuration | Software Version |
|---|---|
| Operating System | Linux-4127 |
| OpenSSL | OpenSSL-1.0.2j |
| Nginx | Nginx-1116 |
| Benchmark | ab (Apache Benchmark) |

Figure 14: Testing network (ACSA acted as Server, the Testing machine as Clients). The server and the testing machine are connected through four 10 Gbps Ethernet ports.

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we established a test platform for ACSA with 40 Gbps of network bandwidth. As shown in Table 4, we deployed 2 Web Servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the highly concurrent requests from a mass of clients, we established the networking environment with 4 network cards, each contributing 10 Gbps of bandwidth.

As listed in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4127, extended to support HAC access. OpenSSL-1.0.2j is utilized to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (Apache Benchmark) [32] is a widely used testing tool for website performance. We choose this benchmark for end-to-end testing in a network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed with the standard OpenSSL benchmark speed, which measures the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms include AES-256-CBC and 3DES-CBC; AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM cores. The performance is reported in MB/s, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is utilized to evaluate the performance of the HTTPS service that can be provided by ACSA. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard tool ab is adopted for workload testing, and the performance indicator is RPS (Requests per Second). For further analysis, we choose the cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 8 different web pages to see the performance for different data sizes. We take full advantage of the 4 network cards to run 32 ab processes so as to ensure testing pressure with a high workload.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with the CPU, where the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, where the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy; this is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth with the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16, where the encryption flow with AES-NI is the best case for software encryption.


Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC (encryption bandwidth in MB/s).

| Block size | SW | SW-NI | HW | HW-SW codesign |
|---|---|---|---|---|
| 4 KB | 1272.68 | 11772.10 | 1650.75 | 11446.83 |
| 8 KB | 1272.31 | 11804.24 | 3188.08 | 11635.55 |
| 16 KB | 1269.26 | 11817.64 | 5891.48 | 12576.91 |
| 32 KB | 1263.54 | 11809.72 | 5932.79 | 13948.63 |
| 64 KB | 1263.99 | 11802.95 | 5954.03 | 15952.37 |
| 128 KB | 1258.23 | 11810.72 | 5964.93 | 16989.31 |
| 256 KB | 1247.49 | 11733.47 | 5970.58 | 16956.30 |
| 512 KB | 1251.86 | 11658.08 | 5966.48 | 16855.39 |

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW (16 CPUs, 32 processes, AES-256-CBC; encryption bandwidth improvement in %).

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs SW-NI | -2.76 | -1.43 | 6.42 | 18.11 | 35.16 | 43.85 | 44.51 | 44.58 |
| HW-SW codesign vs HW | 593.43 | 264.97 | 113.48 | 135.11 | 167.93 | 184.82 | 184.00 | 182.50 |
| HW-SW codesign vs SW | 799.43 | 814.52 | 890.89 | 1003.93 | 1162.06 | 1250.25 | 1259.23 | 1246.43 |


Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs (3DES-CBC; block sizes from 4 KB to 512 KB; one series per CPU count from 1 to 16; encryption bandwidth in MB/s).

As Figures 15 and 16 show, hardware encryption with hardware engines performs better than the crypto library but worse than AES-NI. The reason is the invocation cost of the hardware engines, with its context switches and mode switches, as analyzed in Section 4.1. Besides, with AES-NI the encryption function is accelerated directly through instructions, and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We obtain a 799% to 1246% performance improvement compared with software encryption and a 593% to 182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18% to 44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for checking the data size and making the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI. However, this influence decreases as the data size increases, because the scheduling check is performed less frequently. If the data size is 16 KB or larger, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
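The improvement percentages quoted above follow from the bandwidths in Figure 15 as a relative difference. A minimal check, with the bandwidth values transcribed from Figure 15 (decimal placement assumed) and a hypothetical helper name:

```python
def improvement_pct(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# MB/s values transcribed from Figure 15 (decimal placement assumed).
hw_sw_4k, sw_ni_4k = 11446.83, 11772.10
hw_sw_512k, sw_ni_512k = 16855.39, 11658.08

small = improvement_pct(hw_sw_4k, sw_ni_4k)
large = improvement_pct(hw_sw_512k, sw_ni_512k)
print(round(small, 2), round(large, 2))
```

The two results agree with the "HW-SW codesign vs SW-NI" entries of Figure 16 for 4 KB and 512 KB: a small loss below the threshold and a gain of roughly 44% for the largest blocks.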

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, and it is not reasonable to apply the data filter and dynamic scheduling. Therefore, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by the many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can get almost the best encryption bandwidth with only four CPUs.

To get maximum performance with both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to utilize only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto library (SW), (2) hardware encryption with hardware engines using only four CPUs (HW with four CPUs), and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As Figure 18 shows, even though 16 CPUs are responsible for the 3DES encryption, the encryption bandwidth is still very small (only 271 MB/s). Since it is computation intensive, this algorithm occupies a lot of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for larger block sizes. The reason is that more invocation cost is induced for smaller data blocks. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

Figure 18: Encryption bandwidth for different working flows (DES3-CBC; encryption bandwidth in MB/s).

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 271.69 | 271.68 | 271.20 | 271.78 | 271.66 | 271.39 | 271.74 | 271.69 |
| HW with 4 CPUs | 1270.39 | 2429.30 | 3453.60 | 3489.52 | 3495.98 | 3499.26 | 3500.83 | 3499.89 |
| HW-SW codesign | 1921.24 | 3536.86 | 3667.12 | 3686.88 | 3696.00 | 3701.05 | 3703.26 | 3701.54 |

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW (16 CPUs, 32 processes, DES3-CBC; encryption bandwidth improvement in %).

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs SW | 607.14 | 1201.85 | 1252.18 | 1256.57 | 1260.52 | 1263.74 | 1262.80 | 1262.41 |
| HW-SW codesign vs HW with 4 CPUs | 51.23 | 45.59 | 6.18 | 5.66 | 5.72 | 5.77 | 5.78 | 5.76 |

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, HW-SW codesign achieves a nearly 12-fold bandwidth improvement compared with software encryption. There is only a 6-fold improvement for small data blocks (4 KB) due to the worse utilization of the hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is not so obvious for large data blocks. The reason is the poor contribution of software encryption without AES-NI. Still, we get a 45% to 51% improvement for small data blocks with HW-SW codesign and an additional 200 to 300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU capacity be utilized for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hour, the user could take full advantage of both hardware and software for maximum overall performance; during normal working time with fewer encryption requests, the user could allocate only the management resources needed for the HACs and try to achieve the best offloading.

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth in MB/s).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 258.41 | 419.18 | 504.74 | 608.54 | 679.39 | 720.47 | 741.12 | 756.26 |
| SW-NI | 342.91 | 613.02 | 857.61 | 1182.25 | 1451.02 | 1653.04 | 1761.68 | 1831.19 |
| HW | 228.04 | 409.20 | 573.66 | 814.16 | 980.63 | 1179.44 | 1276.61 | 1336.44 |
| HW-SW codesign | 338.96 | 608.02 | 851.59 | 1173.38 | 1416.41 | 1705.11 | 1818.99 | 1884.28 |

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how the performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. Similar to the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto library (SW), (2) software encryption with AES-NI (SW-NI, instruction acceleration), (3) hardware encryption using hardware engines with the proposed design flow (HW), and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we show three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
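The conversion from RPS to network bandwidth is a simple scaling by the page size. The sketch below assumes that only the page payload is counted and that 1 MB = 1024 KB; both assumptions are inferred from the published numbers rather than stated in the text. The sample RPS values are taken from the SW row of the 16-process RPS table (decimal placement assumed).

```python
def rps_to_mb_per_s(rps, page_kb):
    """Network bandwidth in MB/s for `rps` requests per second of `page_kb` KB."""
    return rps * page_kb / 1024.0


# SW flow, 16 processes: 4 KB and 512 KB pages.
bw_4k = rps_to_mb_per_s(66153.10, 4)
bw_512k = rps_to_mb_per_s(1512.51, 512)
print(round(bw_4k, 2), round(bw_512k, 3))
```

Both results agree, to within rounding, with the SW bandwidths plotted for the 16-process case (258.41 MB/s and 756.26 MB/s).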

As Figure 20 shows, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small page sizes; for example, the maximum improvement reaches 48.64% for the 4 KB page. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of size detection and the scheduling decision. If the page size is equal to or bigger than 128 KB, the overall performance is better than with software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. With HW-SW codesign, however, there is 12.51% to 18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, and there is also a 2.85% to 3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

| Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 66153.10 | 53654.50 | 32303.20 | 19473.40 | 10870.30 | 5763.76 | 2964.48 | 1512.51 |
| SW-NI | 87784.10 | 78466.20 | 54887.00 | 37831.90 | 23216.30 | 13224.30 | 7046.71 | 3662.37 |
| HW | 58378.50 | 52377.60 | 36714.30 | 26053.10 | 15690.00 | 9435.54 | 5106.42 | 2672.87 |
| HW-SW codesign | 86679.40 | 77827.00 | 45499.50 | 34357.70 | 22623.60 | 13670.90 | 7275.66 | 3770.97 |

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

| Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 65885.90 | 52987.80 | 32030.00 | 19391.20 | 10803.80 | 5744.14 | 2964.18 | 1503.48 |
| SW-NI | 86302.50 | 77827.15 | 53017.20 | 37116.10 | 22920.40 | 13092.50 | 7020.85 | 3657.35 |
| HW | 58499.10 | 54891.60 | 37946.40 | 26972.10 | 16401.90 | 10041.90 | 5486.14 | 2881.66 |
| HW-SW codesign | 85177.20 | 77211.90 | 46404.00 | 36166.60 | 24875.10 | 15347.70 | 8325.19 | 4391.45 |

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle in %).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs SW-NI improvement | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW codesign vs HW improvement | 48.64 | 48.59 | 48.45 | 44.12 | 44.44 | 44.57 | 42.49 | 40.99 |
| CPU idle | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |

If a higher network bandwidth is required for better encryption performance, the remaining CPU resource can be utilized. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As Figure 21 shows, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with HW-SW codesign, we increased the number of processes to make use of the idle CPU. We ran the tests with 32 processes; the RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 7.

For clear comparison and analysis, we converted the RPS to network bandwidth (MB/s) and present the results in Figure 22.
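The conversion itself is not spelled out in the text. A minimal sketch, assuming each request returns exactly one page and 1 MB = 1024 KB (both assumptions, as is the function name), reproduces the software-only bandwidth values plotted for 32 processes from the corresponding RPS values:

```python
def rps_to_bandwidth_mbs(rps: float, page_size_kb: int) -> float:
    """Convert requests per second to MB/s, taking 1 MB = 1024 KB.

    Assumes every request transfers exactly one page of page_size_kb.
    """
    return rps * page_size_kb / 1024.0


# Software-only (SW) RPS for 32 processes, keyed by page size in KB.
sw_rps = {4: 65885.90, 512: 1503.48}

for page_kb, rps in sw_rps.items():
    print(f"{page_kb} KB: {rps_to_bandwidth_mbs(rps, page_kb):.2f} MB/s")
```

Under these assumptions the two sample points come out at about 257.37 MB/s and 751.74 MB/s, in line with the SW series of Figure 22.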

As Figure 22 shows, HW-SW codesign achieves a 49% to 52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. Unlike the case with 16 processes, however, the effect is more pronounced for large block sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of data size detection and the scheduling decision. If the page size is 64 KB or larger, the overall performance exceeds that of both software-only and hardware-only encryption.

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384. Chart title: 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth. Horizontal axis: page size; vertical axis: network bandwidth (MB/s). Plotted data:

Page size          4 KB    8 KB   16 KB    32 KB    64 KB   128 KB   256 KB   512 KB
SW               257.37  413.97  500.47   605.98   675.24   718.02   741.05   751.74
SW-NI            337.12  608.26  828.39  1159.88  1432.53  1636.56  1755.21  1828.68
HW               228.51  421.03  592.91   842.88  1025.12  1255.24  1371.54  1440.83
HW-SW co-design  334.03  603.22  820.70  1159.42  1562.83  1923.08  2092.16  2196.43

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384. Chart title: 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle. Horizontal axis: page size; vertical axes: network bandwidth improvement (%) and CPU idle (%). Plotted data:

Page size                                      4 KB   8 KB  16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)      -0.92  -0.83  -0.93  -0.04   9.10   17.51   19.20   20.11
HW-SW co-design vs HW improvement (%)         46.17  43.27  38.42  37.55  52.45   53.20   52.54   52.44
CPU idle (%)                                   2.50   2.00   2.00   1.80   1.50    1.80    1.50    1.60

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is expected, since the request filter chooses the most suitable encryption path: for small pages, hardware engines perform worse than AES-NI because of the context switch cost, so ACSA automatically adopts software encryption with AES-NI. For large pages, the CPU resource saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still obtain about 20% additional network bandwidth for a page size of 512 KB.

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384. Chart title: 16 CPUs, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle. Horizontal axis: page size; vertical axes: network bandwidth improvement (%) and CPU idle (%). Plotted data:

Page size                                                4 KB   8 KB  16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement, 16 processes (%)  -1.15  -0.81  -0.70  -0.75  -2.38    3.15    3.25    2.90
HW-SW co-design vs SW-NI improvement, 32 processes (%)  -0.92  -0.83  -0.93  -0.04   9.10   17.51   19.20   20.11
CPU idle, 16 processes (%)                               1.50   1.60   1.70   1.80  15.50   18.00   20.00   20.50
CPU idle, 32 processes (%)                               2.50   2.00   2.00   1.80   1.50    1.80    1.50    1.60

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)         4         8        16        32       64      128      256     512
SW              36334.70  25207.40  12791.60   6900.93  3589.73  1833.76   924.40  466.12
HW              58275.70  50717.80  25772.70  14525.70  7645.65  3902.54  1968.62  990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance of cipher suite ECDHE-RSA-AES256-SHA384 for various page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. The gain comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology enables a better trade-off between performance and CPU idle. It is possible to obtain the same network bandwidth/RPS as AES-NI while still having additional CPU resource available. The user can use the saved CPU for other important tasks or devote it to further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI, and software encryption therefore contributes little to HW-SW codesign for this cipher, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI; the second is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6 to 2.1 times the performance of software encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same test conditions, this is a 60% to 112% improvement over software encryption. In addition to the improved performance, 13.62% to 89.54% of the CPU remains idle with hardware encryption.
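The improvement figures can be cross-checked against the plotted bandwidths. The sketch below assumes that "improvement" means the relative gain over the software baseline, and uses the SW and HW bandwidth series (MB/s) of Figure 25 as we read them; the function and variable names are illustrative only.

```python
def improvement_percent(new_bw: float, base_bw: float) -> float:
    """Relative bandwidth gain of new_bw over base_bw, in percent."""
    return (new_bw - base_bw) / base_bw * 100.0


page_sizes_kb = [4, 8, 16, 32, 64, 128, 256, 512]
# Network bandwidth in MB/s for ECDHE-RSA-DES-CBC3-SHA (16 CPUs, 32 processes).
sw = [141.93, 196.93, 199.87, 215.65, 224.36, 229.22, 231.10, 233.06]
hw = [227.64, 396.23, 402.70, 453.93, 477.85, 487.82, 492.16, 495.44]

improvements = [improvement_percent(h, s) for h, s in zip(hw, sw)]
for size, gain in zip(page_sizes_kb, improvements):
    print(f"{size:>3} KB: {gain:.2f}%")
```

With this definition the 4 KB and 512 KB points evaluate to roughly 60.39% and 112.58%, which agrees with the reported 60% to 112% range.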

According to the test in OpenSSL, the maximum performance improvement is about 12 times. However, the RPS improvement in end-to-end testing is only about 2 times. The cause is a constraint of the testing platform: we found that the software decryption speed in the client had become a bottleneck, with all the client CPUs completely loaded.

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth. Horizontal axis: page size; vertical axis: network bandwidth (MB/s). Plotted data:

Page size    4 KB    8 KB   16 KB   32 KB   64 KB  128 KB  256 KB  512 KB
SW         141.93  196.93  199.87  215.65  224.36  229.22  231.10  233.06
HW         227.64  396.23  402.70  453.93  477.85  487.82  492.16  495.44

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle. Horizontal axis: page size; vertical axes: network bandwidth improvement (%) and CPU idle (%). Plotted data:

Page size                   4 KB    8 KB   16 KB   32 KB   64 KB  128 KB  256 KB  512 KB
HW vs SW improvement (%)   60.39  101.20  101.48  110.49  112.99  112.82  112.96  112.58
CPU idle (%)               13.62   15.40   53.15   70.40   75.07   86.04   88.17   89.54

7. Conclusions and Future Work

In this work, we presented ACSA (Adaptive Crypto System based on Accelerators), which adopts the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption offers the best performance in all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for computation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. By establishing a 40 Gbps network, we were able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtain about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtain a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, users can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. In addition, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article: David Brumley and Dan Boneh. Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005: Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM Cortex-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Table 1: The running time for accelerator invocation and AES-NI. The execution time with hardware acceleration, $t_{hw}$, is split into initialization ($t_{init}$), encryption ($t_{exec}$), and invocation cost ($t_{irq}$). All times are in us.

Block size (bytes)   AES-NI $t_{sw}$   Initialization $t_{init}$   Encryption $t_{exec}$   Invocation cost $t_{irq}$
64                         0.098                2.188                    1.100                    7.552
128                        0.156                1.816                    1.180                    7.814
256                        0.282                1.546                    1.360                    7.854
512                        0.517                1.460                    1.660                    7.680
1024                       0.985                1.918                    2.300                    7.282
2048                       1.936                1.614                    3.600                    7.546
4096                       3.838                1.874                    6.140                    7.506
8192                       7.653                1.774                   11.280                    7.806
16384                     15.259                1.866                   21.500                    7.984
32768                     30.479                2.034                   42.000                    8.636
65536                     60.902                2.128                   82.940                   10.212

Figure 6: Working flow of Adaptive Scheduler. (Diagram labels: Web server, OpenSSL, request filtering, small block size: software encryption with AES-NI, large block size: request aggregation with n × grain_size, CryptoDev, hardware accelerators, TCP package.)

Therefore, we need to measure the execution time of software computing $t_{sw}$ and the accelerator invocation cost $c_{hw}$ for various block sizes, and set the threshold according to the difference between the two parameters.

Here we take AES-128-CBC as an example to introduce the threshold determination for request filtering.

From the measured data in Table 1, we can plot the difference between $t_{sw}$ and $c_{hw}$ for variable block sizes (as shown in Figure 7).

As the trend in Figure 7 shows, when the block size is < 16 KB, $t_{sw} - c_{hw} < 0$, so the running time with AES-NI is smaller. When the block size is ⩾ 16 KB, $t_{sw} - c_{hw} > 0$, so the overhead of software computing is larger than that of accelerator encryption.

Therefore, in this example we set the threshold for workload offloading to 16 KB. If the request block size is at least 16 KB, the scheduler invokes hardware acceleration; otherwise, only the software working mode is adopted.
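The threshold search can be sketched as follows. This is only an illustration: it assumes that $c_{hw}$ is approximated by the invocation-cost column $t_{irq}$ of Table 1 (the exact composition of $c_{hw}$ is not restated here), that the times are in microseconds as we read them from the table, and that a request exactly at the threshold goes to hardware. The function names are hypothetical.

```python
# (block size in bytes, t_sw in us, assumed c_hw = t_irq in us), from Table 1.
TABLE1 = [
    (64, 0.098, 7.552), (128, 0.156, 7.814), (256, 0.282, 7.854),
    (512, 0.517, 7.680), (1024, 0.985, 7.282), (2048, 1.936, 7.546),
    (4096, 3.838, 7.506), (8192, 7.653, 7.806), (16384, 15.259, 7.984),
    (32768, 30.479, 8.636), (65536, 60.902, 10.212),
]


def find_threshold(rows):
    """Return the smallest block size from which t_sw - c_hw stays positive.

    Walks from the largest block size downwards and stops at the first
    size where software is no slower than the invocation cost.
    Returns None if hardware never pays off for the measured sizes.
    """
    threshold = None
    for size, t_sw, c_hw in sorted(rows, reverse=True):
        if t_sw - c_hw > 0:
            threshold = size
        else:
            break
    return threshold


def choose_path(block_size, threshold):
    """Request filter: pick the encryption path for one request."""
    if threshold is not None and block_size >= threshold:
        return "hardware"
    return "software"


threshold = find_threshold(TABLE1)
print(threshold, choose_path(4096, threshold), choose_path(65536, threshold))
```

With the Table 1 values the crossover lands at 16384 bytes, matching the 16 KB threshold in the text; adding $t_{init}$ to the assumed $c_{hw}$ does not move the crossover for these measurements.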

Figure 7: The difference between $t_{sw}$ and $c_{hw}$ in the condition of variable block size.

If hardware acceleration is adopted, OpenSSL performs SSL processing on the original data and then sends the request to the accelerators through CryptoDev and the hardware driver. For each request, OpenSSL performs segmentation if the request data is larger than the grain size, where the grain size is defined as the unit of an OpenSSL processing block. For each grain block, OpenSSL first performs compression, MAC, adding the explicit IV, and padding, and then delivers the encapsulated grain blocks to the hardware driver through CryptoDev one by one. Therefore, one invocation is needed for each block encryption, and the next data block is processed only after the former one is complete. We demonstrate the example with a grain size of 16 KB in Figure 8, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, this request should be segmented to 8 grain blocks according to the processing unit restriction in OpenSSL, so 8 invocations will be executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request will be 8 × $c_{hw}$. This processing flow produces considerable additional overhead, resulting in low utilization of the accelerators.

To further reduce the invocation cost of hardware accelerators, request aggregation is proposed in this paper to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted with a single accelerator invocation. The design flow for data aggregation is as follows:

(1) Modify the configuration file nginx.cfg for Nginx so that the function ngx_ssl_write() can get a data block with a length of n × grain_size.

(2) Extend the function ssl3_write_byte() to support a processing length of n × grain_size.

(3) Strengthen the function do_ssl3_write() with the data aggregation operation. In this function, the requested data of n × grain_size follows the SSL processing flow (compression, MAC, padding, etc.) and is then aggregated as a single package for lower-level processing.

(4) Increase the buffer size for data write to support aggregated data storage.

(5) Revise the encryption function evp_cipher() and issue the aggregated data block to hardware for crypto computing.

(6) Decompose the encrypted data into n segments after the hardware computing has completed, and add a TCP header for each segment.

(7) Send the n TCP packages one by one through calls to ssl3_write_pending().

As illustrated in Figure 9, for a request of n blocks (each of grain size), the proposed methodology first performs data preprocessing (MAC, padding, etc.), then encapsulates the n segments together and issues them to the hardware driver, and finally performs data encryption through one accelerator invocation. In this way, the overhead of mode switches and context switches is reduced to 1/n of that of the original solution, which greatly improves the utilization of the hardware engines.
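The 1/n reduction can be illustrated with a small cost model. The sketch below is not the authors' implementation: it assumes a fixed per-invocation cost $c_{hw}$ that does not grow with the aggregated size (Table 1 suggests it grows only mildly), and the 8.0 us value and the function names are hypothetical. The 64 KB request with a 16 KB grain mirrors the case drawn in Figures 8 and 9.

```python
import math


def invocations(request_bytes: int, grain_bytes: int, aggregated: bool) -> int:
    """Number of accelerator invocations needed for one request."""
    blocks = math.ceil(request_bytes / grain_bytes)
    # With aggregation all grain blocks travel in a single invocation.
    return 1 if aggregated else blocks


def invocation_overhead_us(request_bytes: int, grain_bytes: int,
                           c_hw_us: float, aggregated: bool) -> float:
    """Total invocation overhead, assuming a constant cost per invocation."""
    return invocations(request_bytes, grain_bytes, aggregated) * c_hw_us


REQUEST = 64 * 1024   # 64 KB request, as in Figures 8 and 9
GRAIN = 16 * 1024     # 16 KB grain size
C_HW_US = 8.0         # hypothetical per-invocation cost in microseconds

plain = invocation_overhead_us(REQUEST, GRAIN, C_HW_US, aggregated=False)
merged = invocation_overhead_us(REQUEST, GRAIN, C_HW_US, aggregated=True)
print(plain, merged, merged / plain)
```

For n = 4 grain blocks the modelled overhead drops from 4 × $c_{hw}$ to 1 × $c_{hw}$, i.e. to 1/n of the original.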

5. Maximize Resource Utilization with Minimal Management Cost

For most Web server applications, highly concurrent requests need to be processed in time, which is why we integrate multiple CPUs and accelerators in a single server. On a heterogeneous platform with both CPUs and hardware engines, it is important to exploit the respective computing strengths of each. However, how can we fully utilize the hardware engines at the least system cost? How can we schedule the concurrent processes to get the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy for resource allocation from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and to obtain the best system performance through hardware and software codesign.

For clarity, we first define the parameters used, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume there are $N$ CPUs available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. There are $P$ processes available in total in the system, of which $P_{hw}$ processes are activated for hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is denoted as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw} \tag{4}$$
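Equation (4) can be evaluated directly. In the sketch below, $B_{sw}$ is interpreted as the AES-NI bandwidth of a single CPU, which is our reading of the multiplication by $(N - N_{hw})$ rather than something stated explicitly; the numeric inputs are hypothetical and serve only to show the arithmetic.

```python
def theoretical_max_bandwidth(b_hw: float, b_sw: float, n: int, n_hw: int) -> float:
    """Evaluate Eq. (4): B_max = B_hw + (N - N_hw) * B_sw.

    b_hw : maximum bandwidth of hardware encryption
    b_sw : maximum AES-NI bandwidth (assumed here to be per CPU)
    n    : number of available CPUs
    n_hw : number of CPUs reserved for accelerator invocation
    """
    if not 0 <= n_hw <= n:
        raise ValueError("n_hw must lie between 0 and n")
    return b_hw + (n - n_hw) * b_sw


# Hypothetical inputs in GB/s: 16 CPUs, one of them driving the accelerators.
print(theoretical_max_bandwidth(b_hw=6.0, b_sw=1.0, n=16, n_hw=1))
```

The fewer CPUs the accelerators need ($N_{hw}$ small), the more of the software term remains available, which is what the MM strategy tries to exploit.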

The core of the MM algorithm is to find an allocation strategy with parameters $N_{hw}$ and $P_{hw}$ that achieves the best system performance with the available resources through hardware and software codesign. The parameter $N_{hw}$ found by the MM algorithm should be the least number of CPUs needed for accelerator management, while the parameter $P_{hw}$ should be the least number of processes needed. Through the MM strategy, we can fully utilize the hardware accelerators with the least occupation of system resources. Thus the remaining CPUs are free for software computing with AES-NI or for other operations as needed [29, 30].

The major steps of the MM strategy are as follows:

(S1) Activate i CPUs (i = 1, 2, ..., N) with i processes for software encryption, and measure the maximum encryption bandwidth with the CPUs completely loaded, $SP_i$.

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth with the CPUs completely loaded, $HP_{i,j}$.

(S3) Calculate the performance difference between hardware encryption and software computing, $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) one by one to obtain the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum $P_{i,j}$ from (S4), $\max(P_{i,j})$ (i = 1, 2, ..., N). From $\max(P_{i,j})$, the parameters can be determined: the number of CPUs used for accelerator invocation $N_{hw} = i$, the number of processes for hardware encryption $P_{hw} = j$, the number of CPUs for software encryption $N_{sw} = N - N_{hw}$, and the corresponding number of processes $P_{sw} = N_{sw}$. The total number of processes to activate is $P_{total} = P_{sw} + P_{hw}$.

Figure 8: Existing processing flow for a 64 KB data without data aggregation. (Diagram stages for each 16384-byte segment: original data, segmentation, hash, add explicit IV, padding, encryption, add header, TCP data package; the plaintexts are sent to the hardware one by one.)

Figure 9: Proposed processing flow for a 64 KB data with data aggregation. (Same stages as Figure 8, except that all plaintexts are sent to the hardware crypto unit at once.)

Table 2: Parameters for MM algorithm.

Parameter     Description
$N$           The number of available CPUs in the system
$N_{sw}$      The number of CPUs used for software encryption with AES-NI
$N_{hw}$      The number of CPUs responsible for the processes of hardware invocation
$P_{sw}$      The processes invoked for software encryption with AES-NI
$P_{hw}$      The processes invoked for hardware encryption
$P_{total}$   The total number of active processes
$HP_{i,j}$    The maximum bandwidth with hardware encryption with i CPUs and j processes, completely loaded
$SP_i$        The maximum bandwidth with software encryption with i CPUs and i processes, completely loaded
$P_{i,j}$     The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$)

Figure 10: CPUs and processes allocation with MM strategy. (Diagram labels: Web server; $P_{sw}$ processes invoked for software encryption on $N_{sw}$ CPUs with AES-NI; $P_{hw}$ processes invoked for hardware encryption on $N_{hw}$ CPUs with the HACs.)

To introduce MM algorithm more clearly we take AES-128-CBC as an example for the exploration of parameter i jAssuming the N as 16 in this example The detailed processfollowed by MM is as follows

(S1) Set the working mode to software encryption through the Adaptive Scheduler in ACSA. All encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, then increase the workload until the CPU is completely loaded. The maximum encryption bandwidth obtained is 1.01 GB/s.

(S3) Activate 2 to 16 CPUs in order, with 2 to 16 processes respectively. As in (S2), the maximum encryption bandwidth is measured for each number of CPUs and processes. We summarize $SP_i$ ($i = 1, 2, \ldots, 16$) in Figure 12.

(S4) Set the working mode to hardware encryption through the Adaptive Scheduler in ACSA. All encryption requests are processed through hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, then increase the workload until the CPU is completely loaded. The maximum encryption bandwidth obtained in hardware working mode is 7.52 GB/s.

(S6) Keep one CPU activated and invoke 2 to n processes in order. Record the maximum encryption bandwidth with the CPU completely loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n-1 processes, the parameter j is fixed as n-1 and the test stops.

(S7) Activate 2 to 16 CPUs in turn and repeat steps (S5) and (S6) to determine the number of processes at which the maximum encryption bandwidth is achieved. The number of processes for each number of CPUs is recorded in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption. The results are recorded in Figure 13.

Algorithm 1: The strategy for the MM algorithm.

Input:
$N$: the number of available CPUs in the system
Output:
$N_{sw}$: the number of CPUs used for software encryption with AES-NI
$N_{hw}$: the number of CPUs responsible for the processes of hardware invocation
$P_{sw}$: the processes invoked for software encryption with AES-NI
$P_{hw}$: the processes invoked for hardware encryption
$P_{total}$: the total number of active processes

/* Test for the maximum bandwidth with AES-NI at different CPU numbers; in general, the process number is equal to the CPU number */
(1) for CPU number i from 1 to N do
(2)     calculate $SP_i$  /* $SP_i$ is the maximum bandwidth when the CPU number is i */
(3) end for
/* Test for the maximum bandwidth with hardware at different CPU numbers and process numbers */
(4) for CPU number i from 1 to N do
(5)     for process number j from 1 to 32 do
(6)         calculate $HP_{i,j}$  /* $HP_{i,j}$ is the maximum bandwidth when the CPU number and process number are i and j */
(7)     end for
(8) end for
/* Calculate the maximum performance difference between AES-NI computing and hardware encryption */
(9) let maxdiff ← $P_{1,1}$
(10) for CPU number i from 1 to N do
(11)     for process number j from 1 to 32 do
(12)         $P_{i,j}$ ← ($HP_{i,j}$ − $SP_i$)
(13)         if (maxdiff < $P_{i,j}$) then
(14)             maxdiff ← $P_{i,j}$
(15)             $N_{hw}$ ← i
(16)             $P_{hw}$ ← j
(17)             $N_{sw}$ ← (N − i)
(18)             $P_{sw}$ ← $N_{sw}$
(19)             $P_{total}$ ← ($P_{sw}$ + $P_{hw}$)
(20)         end if
(21)     end for
(22) end for

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

| N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $P_{hw}$ | 12 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
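Steps (S5) and (S6) amount to a stop-at-first-plateau search over the process count. The following Python sketch illustrates that search under stated assumptions: `measure` is a hypothetical callback that returns the maximum bandwidth for a given process count, the cap of 32 processes is borrowed from Algorithm 1, and `fake_measure` is a made-up curve used only for illustration. None of these names come from the ACSA implementation.

```python
from typing import Callable


def find_saturating_processes(measure: Callable[[int], float],
                              max_processes: int = 32) -> int:
    """Return the process count j at which bandwidth stops increasing.

    measure(n) is a hypothetical callback returning the maximum
    encryption bandwidth observed with n processes on a fixed CPU set.
    """
    best = measure(1)
    for n in range(2, max_processes + 1):
        current = measure(n)
        # (S6): stop once n processes are no better than n-1 processes.
        if current <= best:
            return n - 1
        best = current
    return max_processes


# Illustrative, made-up bandwidth curve that plateaus at 12 processes.
def fake_measure(n: int) -> float:
    return min(n, 12) * 0.6


saturation = find_saturating_processes(fake_measure)
```

Because the search stops at the first non-improving step, a noisy measurement could end it early; averaging several runs inside `measure` would be one way to make the stop condition more robust.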

From Figure 13, we find that the maximum difference occurs at $N_{hw}$ = 1 and $P_{hw}$ = 12, i.e., one CPU is used for hardware invocation and it runs 12 processes. Based on the test results, 27 processes should be invoked for the system: 12 are scheduled for hardware management, and the remaining 15 processes can be allocated to software encryption with AES-NI if no other work is required.
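The selection made by Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `mm_allocate` is hypothetical, the bandwidth values are the data labels of Figure 12 read as GB/s, and only the best process count per CPU number (Table 3) is supplied instead of the full 16 x 32 measurement grid. The sketch takes the maximum over all measured pairs directly, which sidesteps the initialization of maxdiff in line (9).

```python
from typing import Dict, List, Tuple


def mm_allocate(sp: List[float],
                hp: Dict[Tuple[int, int], float]) -> Dict[str, int]:
    """Pick (i, j) maximizing HP[i, j] - SP[i], following Algorithm 1.

    sp[i - 1] is the AES-NI bandwidth with i CPUs; hp[(i, j)] is the
    hardware bandwidth with i CPUs and j processes.
    """
    n = len(sp)
    (n_hw, p_hw), _ = max(
        (((i, j), bw - sp[i - 1]) for (i, j), bw in hp.items()),
        key=lambda item: item[1],
    )
    n_sw = n - n_hw
    p_sw = n_sw  # one software process per remaining CPU
    return {"N_hw": n_hw, "P_hw": p_hw, "N_sw": n_sw,
            "P_sw": p_sw, "P_total": p_sw + p_hw}


# Software (AES-NI) and hardware bandwidth in GB/s for 1..16 CPUs.
SP = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
HW = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
      7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
BEST_P = [12] + [16] * 15  # Table 3

HP = {(i + 1, BEST_P[i]): HW[i] for i in range(16)}
plan = mm_allocate(SP, HP)
```

With these inputs the largest gap is at one CPU and 12 processes, which reproduces the 12 + 15 = 27 process split described in the text.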


Figure 11: The working flow of (a) software encryption and (b) hardware encryption. Panel (a) runs in user space with the components Web Server, OpenSSL, and software crypto lib. Panel (b) spans user space and kernel space with the components Web Server, OpenSSL, CryptoDev, Crypto API, hardware driver, and hardware engine.

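Figure 12: The encryption performance with hardware/software working mode at different CPU numbers (AES-256-CBC crypto performance; encryption bandwidth in GB/s).

| Number of CPUs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Software | 1.01 | 2.04 | 3.06 | 4.08 | 5.05 | 6.13 | 7.16 | 8.17 | 9.19 | 10.22 | 11.22 | 12.26 | 13.28 | 14.31 | 15.33 | 15.97 |
| Hardware | 7.52 | 7.61 | 7.62 | 7.62 | 7.62 | 7.58 | 7.61 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 |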

Figure 13: The performance differences between hardware and software encryption at different CPU numbers (hardware minus software, in GB/s).

| Number of CPUs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Difference | 6.51 | 5.57 | 4.56 | 3.54 | 2.57 | 1.45 | 0.45 | -0.55 | -1.57 | -2.60 | -3.60 | -4.64 | -5.66 | -6.69 | -7.71 | -8.35 |


Table 4: Hardware environment of testing.

| Hardware | Number | Configurations |
| --- | --- | --- |
| ARM Server | 2 | CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB |
| Network Card | 4 | 10 Gbps each |
| Net cable | 4 | 2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables |

Table 5: Software environment of testing.

| Software Configuration | Software Version |
| --- | --- |
| Operating System | Linux-4127 |
| OpenSSL | OpenSSL-1.0.2j |
| Nginx | Nginx-1.11.6 |
| Benchmark | ab (Apache benchmark) |

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients). The server and the testing machine (client) are connected through four 10 Gbps Ethernet ports.

6. System Test and Analysis

To reflect real working environments, we established a test platform for ACSA with 40 Gbps of network bandwidth, as shown in Figure 14. As listed in Table 4, we deployed two servers for testing: one runs the ACSA crypto system and responds to HTTPS accesses, while the other acts as a client and generates the testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the highly concurrent requests from mass clients, we built the network environment with four network cards, each contributing 10 Gbps of bandwidth.

As listed in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4127, extended to support HAC access. OpenSSL-1.0.2j performs software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (Apache benchmark) [32] is a widely used testing toolset for website performance; we choose it for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed with the standard OpenSSL benchmark speed, which measures the maximum encryption bandwidth of a single algorithm. The tested encryption algorithms are AES-256-CBC and 3DES-CBC: AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance under different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM CPUs. The performance is reported in MB/s, i.e., the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing evaluates the performance of the HTTPS service that ACSA can provide. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (requests per second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Different from the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to observe the performance for different data sizes. We take full advantage of the 4 network cards to run 32 ab processes, which ensures a high testing pressure.
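A workload of this kind could be scripted by generating one ab invocation per cipher suite and page size. The sketch below only builds the argument lists; the URL pattern, the request count, and the concurrency level are hypothetical placeholders because the text does not state them, and the page sizes are the ones that appear in the result tables. The options `-n`, `-c`, and `-Z` are standard ab options for the request count, the concurrency, and the SSL/TLS cipher suite.

```python
from typing import List


def ab_command(url: str, cipher_suite: str,
               requests: int = 100000, concurrency: int = 100) -> List[str]:
    """Build the argument list for one ab (ApacheBench) process."""
    return ["ab",
            "-n", str(requests),      # total number of requests (assumed value)
            "-c", str(concurrency),   # concurrent requests (assumed value)
            "-Z", cipher_suite,       # SSL/TLS cipher suite to negotiate
            url]


SUITES = ["ECDHE-RSA-AES256-SHA384", "ECDHE-RSA-DES-CBC3-SHA"]
PAGES_KB = [4, 8, 16, 32, 64, 128, 256, 512]

# Hypothetical URL layout: one static page per size on the server under test.
commands = [ab_command(f"https://server.example/page_{kb}k.html", suite)
            for suite in SUITES for kb in PAGES_KB]
```

Each list can be handed to `subprocess.run`, and several copies can be started in parallel to approximate the 32 ab processes mentioned above.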

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared four working flows for AES-256-CBC: (1) software encryption with the CPU, in which the crypto computing is executed through OpenSSL libcrypto (denoted as SW); (2) software encryption with AES-NI, in which the software computing is accelerated through the special instruction set AES-NI (denoted as SW-NI); (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy (denoted as HW); and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy (denoted as HW-SW codesign). For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16, in which the encryption flow with AES-NI is the best case for software encryption.


Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC (encryption bandwidth in MB/s).

| Block size | SW | SW-NI | HW | HW-SW codesign |
| --- | --- | --- | --- | --- |
| 4 KB | 1272.68 | 11772.10 | 1650.75 | 11446.83 |
| 8 KB | 1272.31 | 11804.24 | 3188.08 | 11635.55 |
| 16 KB | 1269.26 | 11817.64 | 5891.48 | 12576.91 |
| 32 KB | 1263.54 | 11809.72 | 5932.79 | 13948.63 |
| 64 KB | 1263.99 | 11802.95 | 5954.03 | 15952.37 |
| 128 KB | 1258.23 | 11810.72 | 5964.93 | 16989.31 |
| 256 KB | 1247.49 | 11733.47 | 5970.58 | 16956.30 |
| 512 KB | 1251.86 | 11658.08 | 5966.48 | 16855.39 |

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW (16 CPUs, 32 processes, AES-256-CBC; encryption bandwidth improvement of HW-SW co-design in %).

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HW-SW co-design vs SW-NI | -2.76 | -1.43 | 6.42 | 18.11 | 35.16 | 43.85 | 44.51 | 44.58 |
| HW-SW co-design vs HW | 593.43 | 264.97 | 113.48 | 135.11 | 167.93 | 184.82 | 184.00 | 182.50 |
| HW-SW co-design vs SW | 799.43 | 814.52 | 890.89 | 1003.93 | 1162.06 | 1250.25 | 1259.23 | 1246.43 |
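The improvement figures appear to be relative gains over the respective baseline. A short Python check, taking the 4 KB data labels of Figure 15 as 1272.68, 11772.10, 1650.75, and 11446.83 MB/s for SW, SW-NI, HW, and HW-SW codesign (reading the extracted labels with two decimal places is an assumption), reproduces the 4 KB column of Figure 16. The helper name `improvement_percent` is illustrative.

```python
def improvement_percent(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# 4 KB bandwidths in MB/s, as read from Figure 15 (decimal placement assumed).
sw, sw_ni, hw, codesign = 1272.68, 11772.10, 1650.75, 11446.83

vs_sw_ni = improvement_percent(codesign, sw_ni)  # about -2.76
vs_hw = improvement_percent(codesign, hw)        # about 593.43
vs_sw = improvement_percent(codesign, sw)        # about 799.43
```

Note that a gain of 593% over HW means the codesign bandwidth is roughly 6.9 times the HW bandwidth, not 5.9 times.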


Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs (3DES-CBC; x-axis: block size from 4 KB to 512 KB; y-axis: encryption bandwidth in MB/s; one series per CPU number from 1 to 16).

As we can see from Figures 15 and 16, hardware encryption with hardware engines performs better than the crypto lib but worse than AES-NI. The reason is the invocation cost of the hardware engines, with context switches and mode switches, as we analyzed in Section 4.1. Besides, for AES-NI the encryption function is accelerated directly through instructions, and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We get a 799% to 1246% performance improvement compared with software encryption and a 593% to 182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow gets an 18% to 44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case: ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for the data size check and the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI. However, this influence decreases as the data size increases, on account of the reduced frequency of scheduling checks. If the data size is larger than 16 KB, AES-NI and the hardware engines cooperate to deliver a better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
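The data filter described above can be pictured as a size check in front of the two engines. The following Python sketch is an illustration only: the class `Scheduler`, the field `hw_slots`, and the fallback to AES-NI when no hardware process is free are assumptions made for the example, while the 16 KB threshold is the value reported in the text.

```python
from dataclasses import dataclass

THRESHOLD_BYTES = 16 * 1024  # data-filter threshold reported in the text


@dataclass
class Scheduler:
    """Toy request filter; hw_slots is a hypothetical count of free
    hardware-invocation processes."""
    hw_slots: int

    def dispatch(self, size_bytes: int) -> str:
        # Small requests never pay the hardware invocation cost.
        if size_bytes < THRESHOLD_BYTES:
            return "aes-ni"
        # Large requests go to the accelerator while a process is free.
        if self.hw_slots > 0:
            self.hw_slots -= 1
            return "hardware"
        # Assumed fallback: remaining large requests use AES-NI.
        return "aes-ni"


scheduler = Scheduler(hw_slots=1)
routes = [scheduler.dispatch(4 * 1024),
          scheduler.dispatch(64 * 1024),
          scheduler.dispatch(64 * 1024)]
```

A real scheduler would also release a slot when a hardware request completes; that bookkeeping is left out to keep the example short.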

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, so the data filter and dynamic scheduling are not worthwhile here. Therefore, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by the many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we get almost the best encryption bandwidth with only four CPUs.

If we want maximum performance from both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines, with only four CPUs used (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As we can see from Figure 18, even though there are 16 CPUs responsible for the 3DES encryption, the software encryption bandwidth is still very small (only 271 MB/s). Since 3DES is computation intensive, it occupies a lot of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for larger block sizes. The reason is that more invocation cost is induced for smaller data blocks. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

Figure 18: Encryption bandwidth for different working flows (3DES-CBC; encryption bandwidth in MB/s).

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SW | 271.69 | 271.68 | 271.20 | 271.78 | 271.66 | 271.39 | 271.74 | 271.69 |
| HW with 4 CPUs | 1270.39 | 2429.30 | 3453.60 | 3489.52 | 3495.98 | 3499.26 | 3500.83 | 3499.89 |
| HW-SW co-design | 1921.24 | 3536.86 | 3667.12 | 3686.88 | 3696.00 | 3701.05 | 3703.26 | 3701.54 |

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW (16 CPUs, 32 processes, 3DES-CBC; encryption bandwidth improvement of HW-SW co-design in %).

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HW-SW co-design vs SW | 607.14 | 1201.85 | 1252.18 | 1256.57 | 1260.52 | 1263.74 | 1262.80 | 1262.41 |
| HW-SW co-design vs HW with 4 CPUs | 51.23 | 45.59 | 6.18 | 5.66 | 5.72 | 5.77 | 5.78 | 5.76 |

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes; network bandwidth in MB/s).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SW | 258.41 | 419.18 | 504.74 | 608.54 | 679.39 | 720.47 | 741.12 | 756.26 |
| SW-NI | 342.91 | 613.02 | 857.61 | 1182.25 | 1451.02 | 1653.04 | 1761.68 | 1831.19 |
| HW | 228.04 | 409.20 | 573.66 | 814.16 | 980.63 | 1179.44 | 1276.61 | 1336.44 |
| HW-SW co-design | 338.96 | 608.02 | 851.59 | 1173.38 | 1416.41 | 1705.11 | 1818.99 | 1884.28 |

To have a clear comparison, we summarize the performance improvement in Figure 19. As we can see, there is a nearly 12 times bandwidth improvement through HW-SW codesign compared with software encryption. There is only a 6 times improvement for small data blocks (4 KB) due to the worse utilization of the hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is not so obvious for large data blocks. The reason is the poor contribution of software encryption without AES-NI. Still, we get a 45% to 51% improvement for small data blocks with HW-SW codesign and an additional 200 to 300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU capacity be used for other important tasks, such as database management. Of course, it is up to the user to decide on the trade-off between CPU idle and encryption performance. For example, during rush hour the user could take full advantage of both hardware and software for maximum overall performance; during normal working time with fewer encryption requests, the user could allocate only the management resources needed for the HACs and offload as much as possible.
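The trade-off described above can be written down as a simple allocation rule. The sketch below is hypothetical: the function `plan_cpus` and its `peak_load` flag are illustrative names, and the default of four CPUs for hardware management follows the 3DES-CBC measurements in this section.

```python
from typing import Dict


def plan_cpus(peak_load: bool, total_cpus: int = 16,
              hw_mgmt_cpus: int = 4) -> Dict[str, int]:
    """Split CPUs between hardware management, software crypto, and other work."""
    spare = total_cpus - hw_mgmt_cpus
    if peak_load:
        # Rush hour: spend every spare CPU on software encryption.
        return {"hw_mgmt": hw_mgmt_cpus, "sw_crypto": spare, "other": 0}
    # Normal load: keep only the CPUs that drive the accelerators.
    return {"hw_mgmt": hw_mgmt_cpus, "sw_crypto": 0, "other": spare}


rush_hour = plan_cpus(peak_load=True)
normal = plan_cpus(peak_load=False)
```

In practice the switch between the two plans would be driven by a load metric such as the request rate, which is outside the scope of this example.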

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how the performance varies with the request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. Similar to the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. We show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
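The conversion from RPS to network bandwidth appears to be the request rate multiplied by the page size. The Python sketch below assumes 1 MB = 1024 KB and counts only the page payload, and it reads the SW entries of the 16-process RPS table with two decimal places (66153.10, 53654.50, and 5763.76 requests per second for 4 KB, 8 KB, and 128 KB pages); under those assumptions it reproduces the SW bandwidths plotted in Figure 20.

```python
def rps_to_mb_per_s(rps: float, page_kb: int) -> float:
    """Convert requests per second to MB/s, assuming 1 MB = 1024 KB."""
    return rps * page_kb / 1024.0


# SW flow, 16 CPUs and 16 processes (decimal placement assumed).
software_rps = {4: 66153.10, 8: 53654.50, 128: 5763.76}

bandwidth = {kb: round(rps_to_mb_per_s(rps, kb), 2)
             for kb, rps in software_rps.items()}
```

The results are about 258.41, 419.18, and 720.47 MB/s, which shows why RPS falls with page size while the transferred volume per second rises.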

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small page sizes; for example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of size detection and scheduling decisions. If the page size is equal to or larger than 128 KB, the overall performance exceeds both software-only and hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. With HW-SW codesign, however, there is 12.51% to 18.51% CPU idle for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, and there is also a 2.85% to 3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

| Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SW | 66153.10 | 53654.50 | 32303.20 | 19473.40 | 10870.30 | 5763.76 | 2964.48 | 1512.51 |
| SW-NI | 87784.10 | 78466.20 | 54887.00 | 37831.90 | 23216.30 | 13224.30 | 7046.71 | 3662.37 |
| HW | 58378.50 | 52377.60 | 36714.30 | 26053.10 | 15690.00 | 9435.54 | 5106.42 | 2672.87 |
| HW-SW co-design | 86679.40 | 77827.00 | 45499.50 | 34357.70 | 22623.60 | 13670.90 | 7275.66 | 3770.97 |

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

| Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SW | 65885.90 | 52987.80 | 32030.00 | 19391.20 | 10803.80 | 5744.14 | 2964.18 | 1503.48 |
| SW-NI | 86302.50 | 77827.15 | 53017.20 | 37116.10 | 22920.40 | 13092.50 | 7020.85 | 3657.35 |
| HW | 58499.10 | 54891.60 | 37946.40 | 26972.10 | 16401.90 | 10041.90 | 5486.14 | 2881.66 |
| HW-SW co-design | 85177.20 | 77211.90 | 46404.00 | 36166.60 | 24875.10 | 15347.70 | 8325.19 | 4391.45 |

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes; network bandwidth improvement and CPU idle in %).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HW-SW co-design vs SW-NI improvement | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW co-design vs HW improvement | 48.64 | 48.59 | 48.45 | 44.12 | 44.44 | 44.57 | 42.49 | 40.99 |
| CPU idle | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |

If we expect a higher network bandwidth for better encryption performance, we can use the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with HW-SW codesign, we increase the number of processes to use the idle CPU capacity. We tested with 32 processes and show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth in MB/s and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign gets a 49% to 52% bandwidth improvement compared with hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, different from the 16-process case, the influence is more obvious for large page sizes, since more idle CPU capacity is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and scheduling decisions. If the page size is equal to or larger than 64 KB, the overall performance exceeds both software-only and hardware-only encryption.

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs, 32 processes; network bandwidth in MB/s).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SW | 257.37 | 413.97 | 500.47 | 605.98 | 675.24 | 718.02 | 741.05 | 751.74 |
| SW-NI | 337.12 | 608.26 | 828.39 | 1159.88 | 1432.53 | 1636.56 | 1755.21 | 1828.68 |
| HW | 228.51 | 421.03 | 592.91 | 842.88 | 1025.12 | 1255.24 | 1371.54 | 1440.83 |
| HW-SW co-design | 334.03 | 603.22 | 820.70 | 1159.42 | 1562.83 | 1923.08 | 2092.16 | 2196.43 |

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs, 32 processes; network bandwidth improvement and CPU idle in %).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HW-SW co-design vs SW-NI improvement | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11 |
| HW-SW co-design vs HW improvement | 46.17 | 43.27 | 38.42 | 37.55 | 52.45 | 53.20 | 52.54 | 52.44 |
| CPU idle | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60 |

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption path. Hardware engines do not perform as well as AES-NI for small pages because of the context switch cost; therefore, ACSA automatically adopts software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve the encryption bandwidth. Even compared with AES-NI, we still get an additional 20% of network bandwidth for a page size of 512 KB.


Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle in %).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HW-SW co-design vs SW-NI improvement, 16 processes | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW co-design vs SW-NI improvement, 32 processes | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11 |
| CPU idle, 16 processes | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |
| CPU idle, 32 processes | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60 |

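Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

| Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SW | 36334.70 | 25207.40 | 12791.60 | 6900.93 | 3589.73 | 1833.76 | 924.40 | 466.12 |
| HW | 58275.70 | 50717.80 | 25772.70 | 14525.70 | 7645.65 | 3902.54 | 1968.62 | 990.88 |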

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign gets the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign gets a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology enables a better trade-off between performance and CPU idle. It is possible to get the same network bandwidth/RPS as AES-NI and still have additional CPU resources available. The user can use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI; the other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We get 1.6 to 2.1 times the performance when hardware encryption is adopted.
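The 1.6 to 2.1 times range can be checked directly against Table 8. The Python sketch below reads the table entries with two decimal places, which is an assumption about the extracted digits, and computes the HW/SW ratio per page size; the variable names are illustrative.

```python
PAGE_KB = [4, 8, 16, 32, 64, 128, 256, 512]

# RPS for ECDHE-RSA-DES-CBC3-SHA, 16 CPUs and 32 processes (Table 8).
SW_RPS = [36334.70, 25207.40, 12791.60, 6900.93,
          3589.73, 1833.76, 924.40, 466.12]
HW_RPS = [58275.70, 50717.80, 25772.70, 14525.70,
          7645.65, 3902.54, 1968.62, 990.88]

# Speedup of hardware encryption over software encryption per page size.
speedups = {kb: hw / sw for kb, sw, hw in zip(PAGE_KB, SW_RPS, HW_RPS)}

lowest_kb = min(speedups, key=speedups.get)
lowest, highest = min(speedups.values()), max(speedups.values())
```

The smallest ratio occurs for 4 KB pages, where the hardware invocation cost weighs most heavily on each request.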

We summarize the network bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large pages. Under the same testing conditions, there is a 60% to 112% improvement compared with software encryption. Besides the performance improvement, 13.62% to 89.54% of the CPU remains idle with hardware encryption.

According to the OpenSSL test, there is at most a 12 times performance improvement. However, there is only a 2 times RPS improvement in end-to-end testing. The problem is a constraint of the testing platform: we found that the software decryption speed in the client has become the bottleneck, with all client CPUs completely loaded.

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth (MB/s).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SW | 141.93 | 196.93 | 199.87 | 215.65 | 224.36 | 229.22 | 231.10 | 233.06 |
| HW | 227.64 | 396.23 | 402.70 | 453.93 | 477.85 | 487.82 | 492.16 | 495.44 |

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle (%).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HW vs SW improvement | 60.39 | 101.20 | 101.48 | 110.49 | 112.99 | 112.82 | 112.96 | 112.58 |
| CPU idle | 13.62 | 15.40 | 53.15 | 70.40 | 75.07 | 86.04 | 88.17 | 89.54 |

7. Conclusions and Future Work

In this work, we presented ACSA (Adaptive Crypto System based on Accelerators), which adopts the crypto mode adaptively and dynamically according to the request characteristics and the system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance in all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. By establishing 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we get a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, users can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers in different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics (ICACCI 2012), pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] D. Brumley and D. Boneh, "Remote timing attacks are practical," in Proceedings of the 12th USENIX Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design (ICED 2008), Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits (ISIC 2016), Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011 (ICECCT'11), pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service (IWQoS 2010), IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (AES-NI)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation (ISMS 2010), pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM Cortex-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.


[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, vol. 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, O'Reilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER 2011), pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference (IGSC 2016), pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Figure 7: The difference between $t_{sw}$ and $c_{hw}$ under variable block sizes.

block to the hardware driver through CryptoDev one by one. Therefore, one invocation is needed per block encryption, and the next data block is processed only after the former one is complete. We demonstrate the example with a grain size of 16 KB in Figure 8, following the traditional SSL processing flow. Assuming the original data size is 256 KB and the grain size is 16 KB, this request is segmented into 16 grain blocks according to the processing unit restriction in OpenSSL, so 16 invocations are executed for hardware encryption. As described before, each invocation involves two mode switches and three context switches. In total, the overhead for the 256 KB request is $16 \times c_{hw}$. This processing flow produces a lot of additional overhead, resulting in low utilization of the accelerators.

To further reduce the invocation cost of hardware accelerators, request aggregation is proposed in this paper to improve resource utilization. Through aggregation, multiple grain blocks can be encrypted through a single accelerator invocation. The design flow for data aggregation is detailed as follows.

(1) Modify the configuration file nginx.cfg for Nginx so that the function ngx_ssl_write() can get a data block with a length of n × grain size.

(2) Extend the function ssl3_write_byte() to support a processing length of n × grain size.

(3) Strengthen the function do_ssl3_write() with a data aggregation operation. In this function, the requested data of n × grain size follows the SSL processing flow (compression, MAC, padding, etc.) and is then aggregated as a single package for lower-level processing.

(4) Increase the buffer size for data writes to support aggregated data storage.

(5) Revise the encryption function evp_cipher() and issue the aggregated data block to hardware for crypto computing.

(6) Decompose the encrypted data into n segments after the hardware computing completes, and add a TCP header to each segment.

(7) Send the n TCP packages one by one by calling ssl3_write_pending().

As illustrated in Figure 9, for a request of n blocks (each block of grain size), the proposed methodology first performs data preprocessing (MAC, padding, etc.), then encapsulates the n segments together, issues them to the hardware driver, and finally performs data encryption through one accelerator invocation. In this way, the overhead of mode switches and context switches is reduced to 1/n of that of the original solution, which greatly improves the utilization of the hardware engines.
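The 1/n reduction can be illustrated with a minimal cost model. This is only a sketch under stated assumptions: the function name `invocation_overhead` and the unit cost `C_HW` are hypothetical, and the model counts only the per-invocation cost $c_{hw}$, ignoring encryption time itself. The 64 KB request with a 16 KB grain size mirrors the example of Figures 8 and 9.

```python
def invocation_overhead(request_bytes, grain_bytes, c_hw, aggregate):
    """Total accelerator invocation overhead for one request.

    Without aggregation, every grain block triggers its own invocation.
    With aggregation, all grain blocks share a single invocation.
    """
    n_blocks = -(-request_bytes // grain_bytes)  # ceiling division
    invocations = 1 if aggregate else n_blocks
    return invocations * c_hw


KB = 1024
C_HW = 1.0  # hypothetical cost of one invocation, in arbitrary units

baseline = invocation_overhead(64 * KB, 16 * KB, C_HW, aggregate=False)
aggregated = invocation_overhead(64 * KB, 16 * KB, C_HW, aggregate=True)

print(baseline)               # overhead with one invocation per grain block
print(aggregated)             # overhead with a single aggregated invocation
print(aggregated / baseline)  # ratio equals 1/n for n grain blocks
```

With four grain blocks the ratio is 1/4, which matches the 1/n statement above.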

5. Maximize Resource Utilization with Minimal Management Cost

For most Web Server applications, highly concurrent requests need to be processed in time, which is why we integrate multiple CPUs and accelerators in a single server. On a heterogeneous platform with both CPUs and hardware engines, it is important to exploit the computing strengths of each. However, how can hardware engines be fully utilized at the least system cost? How should concurrent processes be scheduled to get the best trade-off between performance and overhead? To solve these problems, we propose the MM strategy for resource allocation from an overall point of view. The core of the methodology is to maximize resource utilization with minimal management cost and to obtain the best system performance through hardware and software codesign.

For a clearer description, we first define the relevant parameters, as shown in Table 2.

The MM allocation strategy is shown in Figure 10 and Algorithm 1. Assume there are $N$ CPUs available in the system, of which $N_{hw}$ CPUs are responsible for accelerator invocation, so the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$. Of the $P$ total available processes in the system, $P_{hw}$ processes are activated for hardware engines and $P_{sw}$ processes are used for software computing.

For the same number of CPUs, if the maximum bandwidth of hardware encryption is recorded as $B_{hw}$ and the maximum bandwidth with AES-NI is denoted as $B_{sw}$, then the theoretical maximum bandwidth with hardware and software codesign is

$$B_{max} = B_{hw} + (N - N_{hw}) \times B_{sw} \tag{4}$$
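As a quick numerical illustration of equation (4), the sketch below evaluates the bound for one allocation. It assumes that $B_{sw}$ is the AES-NI bandwidth contributed by each software CPU; the function name and the input values are hypothetical and chosen only for illustration.

```python
def theoretical_max_bandwidth(b_hw, b_sw_per_cpu, n_total, n_hw):
    """Evaluate B_max = B_hw + (N - N_hw) * B_sw from equation (4)."""
    if n_hw not in range(n_total + 1):
        raise ValueError("n_hw must be between 0 and n_total")
    return b_hw + (n_total - n_hw) * b_sw_per_cpu


# Hypothetical inputs: 16 CPUs, 1 CPU managing the accelerator,
# 7.5 GB/s from hardware and 1.0 GB/s per software CPU.
b_max = theoretical_max_bandwidth(b_hw=7.5, b_sw_per_cpu=1.0, n_total=16, n_hw=1)
print(b_max)
```

The bound grows with every CPU that is released from accelerator management, which is why the strategy looks for the smallest $N_{hw}$ that still saturates the hardware.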

The core of the MM algorithm is to find an allocation strategy with parameters $N_{hw}$ and $P_{hw}$ that achieves the best system performance with the available resources through hardware and software codesign. The parameter $N_{hw}$ found by the MM algorithm should be the smallest number of CPUs needed for accelerator management, while the parameter $P_{hw}$ should be the smallest number of processes needed. Through the MM strategy, we can fully utilize the hardware accelerators with the least occupation of system resources. Thus, the remaining CPUs are free for software computing with AES-NI or for other operations as needed [29, 30].

The major steps of the MM strategy are as follows.

(S1) Activate i CPUs (i = 1, 2, ..., N) and i processes for software encryption, and measure the maximum encryption bandwidth $SP_i$ with the CPUs fully loaded.


Figure 8: Existing processing flow for 64 KB of data without data aggregation. Stages shown: segmentation of the original data into record protocol units, hash, add explicit IV, padding, encryption (crypto unit), and add header (TCP data package). The plaintexts are sent to the hardware one by one.

Figure 9: Proposed processing flow for 64 KB of data with data aggregation. Stages shown: segmentation of the original data into record protocol units, hash, add explicit IV, padding, hardware crypto unit (plaintext to ciphertext), and add header (TCP data package). All plaintexts are sent to the hardware at once.


Table 2: Parameters for the MM algorithm.

| Parameter | Description |
|---|---|
| $N$ | The number of available CPUs in the system |
| $N_{sw}$ | The number of CPUs used for software encryption with AES-NI |
| $N_{hw}$ | The number of CPUs responsible for the processes of hardware invocation |
| $P_{sw}$ | The processes invoked for software encryption with AES-NI |
| $P_{hw}$ | The processes invoked for hardware encryption |
| $P_{total}$ | The total number of active processes |
| $HP_{i,j}$ | The maximum bandwidth with hardware encryption with i CPUs and j processes, fully loaded |
| $SP_i$ | The maximum bandwidth with software encryption with i CPUs and i processes, fully loaded |
| $P_{i,j}$ | The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$) |

Figure 10: CPU and process allocation with the MM strategy. The Web Server allocates $P_{sw}$ processes for software encryption (AES-NI) on $N_{sw}$ CPUs and $P_{hw}$ processes for hardware encryption (HAC) on $N_{hw}$ CPUs.

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ with the CPUs fully loaded.

(S3) Calculate the performance difference between hardware encryption and software computing: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) one by one and obtain the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum of the $P_{i,j}$ values from (S4), $\max(P_{i,j})$ for i = 1, 2, ..., N. From $\max(P_{i,j})$, the parameters are determined: the number of CPUs used for accelerator invocation is $N_{hw} = i$, the number of processes for hardware encryption is $P_{hw} = j$, the number of CPUs for software encryption is $N_{sw} = N - N_{hw}$, and the corresponding number of processes is $P_{sw} = N_{sw}$. The total number of processes to activate is $P_{total} = P_{sw} + P_{hw}$.

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example for the exploration of the parameters i and j, assuming N is 16. The detailed process followed by MM is as follows.

(S1) Set the working mode to software encryption through the Adaptive Scheduler in ACSA. All the encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, increase the workload until the CPU is fully loaded, and obtain the maximum encryption bandwidth of 1.01 GB/s.

(S3) Activate 2~16 CPUs in order, with 2~16 processes correspondingly. Similar to (S2), the maximum encryption bandwidth can be explored at the different


Input: $N$, the number of available CPUs in the system
Output: $N_{sw}$, the number of CPUs used for software encryption with AES-NI
        $N_{hw}$, the number of CPUs responsible for the processes of hardware invocation
        $P_{sw}$, the processes invoked for software encryption with AES-NI
        $P_{hw}$, the processes invoked for hardware encryption
        $P_{total}$, the total number of active processes

/* Test the maximum bandwidth with AES-NI for different CPU numbers; in general, the process number is equal to the CPU number */
for CPU number i from 1 to N do
    calculate $SP_i$    /* $SP_i$ is the maximum bandwidth when the CPU number is i */
end for
/* Test the maximum bandwidth with hardware for different CPU numbers and process numbers */
for CPU number i from 1 to N do
    for process number j from 1 to 32 do
        calculate $HP_{i,j}$    /* $HP_{i,j}$ is the maximum bandwidth when the CPU number is i and the process number is j */
    end for
end for
/* Calculate the maximum performance difference between hardware encryption and AES-NI computing */
let maxdiff ← −∞
for CPU number i from 1 to N do
    for process number j from 1 to 32 do
        $P_{i,j}$ ← ($HP_{i,j} - SP_i$)
        if ($P_{i,j}$ > maxdiff) then
            maxdiff ← $P_{i,j}$
            $N_{hw}$ ← i
            $P_{hw}$ ← j
            $N_{sw}$ ← (N − i)
            $P_{sw}$ ← $N_{sw}$
            $P_{total}$ ← ($P_{sw}$ + $P_{hw}$)
        end if
    end for
end for

Algorithm 1: The strategy for the MM algorithm.
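The selection step of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `mm_allocate` and the dictionary layout are assumptions, and the measurements are the values read from Figure 12 and Table 3 with decimal points assumed (for example 1.01 GB/s for one software CPU). Only the best process count per CPU number is available in the text, so the hardware table holds one entry per CPU number rather than all 32 process counts.

```python
def mm_allocate(n_total, sp, hp):
    """Pick the allocation with the largest hardware-minus-software gap.

    sp maps a CPU count i to SP_i; hp maps (i, j) to HP_{i,j}.
    Keys are visited in sorted order with a strict comparison, so the
    smallest (i, j) wins ties, i.e. the least resource occupation.
    """
    max_diff = float("-inf")
    best = None
    for (i, j) in sorted(hp):
        diff = hp[(i, j)] - sp[i]
        if diff > max_diff:
            max_diff = diff
            n_sw = n_total - i
            best = {"n_hw": i, "p_hw": j, "n_sw": n_sw,
                    "p_sw": n_sw, "p_total": n_sw + j}
    return best


# Software bandwidth SP_i in GB/s for i = 1..16 (decimal placement assumed).
SP = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
# Best hardware bandwidth per CPU count and the process count that reached it.
HP_BEST = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
           7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
P_HW = [12] + [16] * 15

sp = {i + 1: value for i, value in enumerate(SP)}
hp = {(i + 1, P_HW[i]): value for i, value in enumerate(HP_BEST)}

allocation = mm_allocate(16, sp, hp)
print(allocation)
```

Under these assumed inputs the sketch selects one CPU with 12 processes for accelerator management and leaves 15 CPUs for software encryption, 27 processes in total.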

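Table 3: The number of processes $P_{hw}$ at which the best crypto performance is achieved, for different numbers of available CPUs.

| N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $P_{hw}$ | 12 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |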

number of CPUs and processes. We summarize the $SP_i$ (i = 1, 2, ..., 16) in Figure 12.

(S4) Set the working mode to hardware encryption through the Adaptive Scheduler in ACSA. All the encryption requests are processed through hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, increase the workload until the CPU is fully loaded, and obtain the maximum encryption bandwidth of 7.52 GB/s in hardware working mode.

(S6) Keeping one CPU activated, invoke 2~n processes in order. Record the maximum encryption bandwidth with the CPU fully loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n−1 processes, the parameter j is confirmed as n−1 and the test can be stopped.

(S7) Activate 2~16 CPUs in turn and follow steps (S5) and (S6) to confirm the number of processes at which the maximum encryption bandwidth is achieved. We record the number of processes for each number of CPUs in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption; the results are shown in Figure 13.
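The stopping rule of step (S6) can be sketched as a simple search loop. The names `find_saturating_process_count` and `fake_measure` are hypothetical, and the measurement function is a synthetic stand-in for a real benchmark run; the upper bound of 32 processes follows the loop range of Algorithm 1.

```python
def find_saturating_process_count(measure_bandwidth, max_processes=32):
    """Increase the process count until bandwidth stops improving.

    measure_bandwidth(j) returns the maximum bandwidth observed with
    j processes on a fixed number of CPUs. The search stops at the first
    j whose bandwidth is not higher than that of j - 1 and returns j - 1.
    """
    best_j = 1
    best_bw = measure_bandwidth(1)
    for j in range(2, max_processes + 1):
        bw = measure_bandwidth(j)
        if best_bw >= bw:
            break
        best_j, best_bw = j, bw
    return best_j, best_bw


def fake_measure(j):
    """Synthetic stand-in: bandwidth grows linearly, then saturates at 12."""
    return min(j, 12) * 0.6


best_j, best_bw = find_saturating_process_count(fake_measure)
print(best_j, best_bw)
```

Stopping at the first non-improving count keeps the number of benchmark runs small, at the cost of possibly missing a later improvement if the measured curve is not monotonic.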

From Figure 13, we find that the maximum difference occurs for $N_{hw} = 1$ and $P_{hw} = 12$; i.e., the number of CPUs is 1 and the corresponding number of processes is 12. Based on the test results, 27 processes should be invoked for the system, of which 12 are scheduled for hardware management and the remaining 15 processes can be allocated for software encryption with AES-NI if no other work is needed.


Figure 11: The working flow of (a) software encryption and (b) hardware encryption. Components shown in (a): Web Server, OpenSSL, and the software crypto lib, all in user space. Components shown in (b): Web Server and OpenSSL in user space; CryptoDev, Crypto API, hardware driver, and hardware engine in kernel space.

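| Number of CPUs | Software (GB/s) | Hardware (GB/s) |
|---|---|---|
| 1 | 1.01 | 7.52 |
| 2 | 2.04 | 7.61 |
| 3 | 3.06 | 7.62 |
| 4 | 4.08 | 7.62 |
| 5 | 5.05 | 7.62 |
| 6 | 6.13 | 7.58 |
| 7 | 7.16 | 7.61 |
| 8 | 8.17 | 7.62 |
| 9 | 9.19 | 7.62 |
| 10 | 10.22 | 7.62 |
| 11 | 11.22 | 7.62 |
| 12 | 12.26 | 7.62 |
| 13 | 13.28 | 7.62 |
| 14 | 14.31 | 7.62 |
| 15 | 15.33 | 7.62 |
| 16 | 15.97 | 7.62 |

Figure 12: The encryption performance with hardware/software working mode at different numbers of CPUs (AES-256-CBC crypto performance; encryption bandwidth in GB/s).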

| Number of CPUs | Performance difference (GB/s) |
|---|---|
| 1 | 6.51 |
| 2 | 5.57 |
| 3 | 4.56 |
| 4 | 3.54 |
| 5 | 2.57 |
| 6 | 1.45 |
| 7 | 0.45 |
| 8 | -0.55 |
| 9 | -1.57 |
| 10 | -2.60 |
| 11 | -3.60 |
| 12 | -4.64 |
| 13 | -5.66 |
| 14 | -6.69 |
| 15 | -7.71 |
| 16 | -8.35 |

Figure 13: The performance differences between hardware and software encryption at different numbers of CPUs.


Table 4: Hardware environment of testing.

| Hardware | Number | Configurations |
|---|---|---|
| ARM Server | 2 | CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB |
| Network Card | 4 | 10 Gbps each |
| Net cable | 4 | 2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables |

Table 5: Software environment of testing.

| Software Configuration | Software Version |
|---|---|
| Operating System | Linux-4.1.27 |
| OpenSSL | OpenSSL-1.0.2j |
| Nginx | Nginx-1.11.6 |
| Benchmark | ab (Apache Benchmark) |

Figure 14: Testing network (CASC acts as the Server and the testing machine as Clients). The Server and the testing machine are connected through four 10 Gbps Ethernet ports.

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we established a test platform for ACSA with 40 Gbps network bandwidth. As shown in Table 4, we deployed 2 Web Servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and a memory size of 128 GB. To satisfy the highly concurrent requirements of mass clients, we established the networking environment with 4 network cards, each contributing 10 Gbps of bandwidth.

As illustrated in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4.1.27, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of accelerators. ab (Apache Benchmark) [32] is a widely used tool for testing website performance. We choose this benchmark for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed through the standard benchmark speed in OpenSSL. With speed, we can measure the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms include AES-256-CBC and 3DES-CBC; AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM CPUs. The performance is measured in MB/s in this testing, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service that can be provided by ACSA. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard tool ab is adopted for workload testing, and the performance indicator is RPS (Requests per Second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to see the performance for different data sizes. We take full advantage of the 4 network cards to run 32 ab processes so as to apply a high testing workload.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with CPU, in which the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, in which the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators with our proposed optimization methods, adaptive scheduling and the MM strategy; this testing is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16, in which the encryption flow with AES-NI is the best case for software encryption.


| Block size | SW (MB/s) | SW-NI (MB/s) | HW (MB/s) | HW-SW codesign (MB/s) |
|---|---|---|---|---|
| 4 KB | 1272.68 | 11772.10 | 1650.75 | 11446.83 |
| 8 KB | 1272.31 | 11804.24 | 3188.08 | 11635.55 |
| 16 KB | 1269.26 | 11817.64 | 5891.48 | 12576.91 |
| 32 KB | 1263.54 | 11809.72 | 5932.79 | 13948.63 |
| 64 KB | 1263.99 | 11802.95 | 5954.03 | 15952.37 |
| 128 KB | 1258.23 | 11810.72 | 5964.93 | 16989.31 |
| 256 KB | 1247.49 | 11733.47 | 5970.58 | 16956.30 |
| 512 KB | 1251.86 | 11658.08 | 5966.48 | 16855.39 |

Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC.

| Improvement (%) | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs SW-NI | -2.76 | -1.43 | 6.42 | 18.11 | 35.16 | 43.85 | 44.51 | 44.58 |
| HW-SW codesign vs HW | 593.43 | 264.97 | 113.48 | 135.11 | 167.93 | 184.82 | 184.00 | 182.50 |
| HW-SW codesign vs SW | 799.43 | 814.52 | 890.89 | 1003.93 | 1162.06 | 1250.25 | 1259.23 | 1246.43 |

Figure 16: Performance comparisons of HW-SW codesign with SW-NI, HW, and SW (16 CPUs, 32 processes, AES-256-CBC; encryption bandwidth improvement in percent).
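The improvement figures can be reproduced from the bandwidths of Figure 15 as a relative gain over the baseline. The sketch below is illustrative: the function name `improvement_percent` is hypothetical, and the bandwidth values are the Figure 15 readings for the smallest and largest block sizes with decimal points assumed (MB/s).

```python
def improvement_percent(new_bw, base_bw):
    """Relative bandwidth gain of new_bw over base_bw, in percent."""
    return (new_bw - base_bw) / base_bw * 100.0


# AES-256-CBC bandwidths in MB/s (decimal placement assumed).
bandwidth = {
    "4 KB": {"SW": 1272.68, "SW-NI": 11772.10, "HW": 1650.75, "HW-SW": 11446.83},
    "512 KB": {"SW": 1251.86, "SW-NI": 11658.08, "HW": 5966.48, "HW-SW": 16855.39},
}

for size, bw in bandwidth.items():
    for baseline in ("SW-NI", "HW", "SW"):
        gain = improvement_percent(bw["HW-SW"], bw[baseline])
        print(f"{size}: HW-SW codesign vs {baseline}: {gain:.2f}%")
```

A negative value, as for 4 KB against SW-NI, means the codesign flow is slower than the baseline for that block size.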


Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs (3DES-CBC; one series per CPU count from 1 to 16; x-axis: block size from 4 KB to 512 KB; y-axis: encryption bandwidth in MB/s).

As we can see from Figures 15 and 16, encryption with hardware engines achieves better performance than the crypto lib but worse than AES-NI. The reason is the invocation cost of hardware engines, with context switches and mode switches, as analyzed in Section 4.1. Besides, for AES-NI, the encryption function is accelerated directly through instructions, and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We obtain a 799%~1246% performance improvement compared with software encryption and a 593%~182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow obtains an 18%~44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold for the data filter is configured as 16 KB; in this case, ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for the data size check and the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI alone. However, this influence decreases as the data size increases, on account of the reduced frequency of scheduling checks. If the data size is bigger than 16 KB, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
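The data filter described above can be pictured as a size-based dispatch. The following is a deliberately simplified sketch, not ACSA's scheduler: `choose_engine`, its return labels, and the `accelerator_available` flag are hypothetical, and the real system also balances load between AES-NI and the accelerators for large requests.

```python
THRESHOLD_BYTES = 16 * 1024  # data filter threshold of 16 KB from the text


def choose_engine(request_bytes, accelerator_available=True,
                  threshold=THRESHOLD_BYTES):
    """Route a request to AES-NI or to the hardware accelerator.

    Requests smaller than the threshold stay on the CPU, because the
    invocation cost of the accelerator outweighs its benefit for them.
    """
    if threshold > request_bytes or not accelerator_available:
        return "aes-ni"
    return "accelerator"


for size_kb in (4, 8, 16, 64):
    print(size_kb, "KB ->", choose_engine(size_kb * 1024))
```

The per-request check itself consumes CPU time, which is consistent with the slightly lower bandwidth reported for 4 KB and 8 KB blocks.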

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, so the data filter and dynamic scheduling are not worthwhile here. Therefore, we only applied the MM strategy to fully utilize the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can get almost the best encryption bandwidth with only four CPUs.

If we want to get maximum performance with both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines, with only four CPUs utilized (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As we can see from Figure 18, even though 16 CPUs are responsible for the 3DES encryption, the encryption bandwidth is still very small (only 271 MB/s). Since it is computation intensive, this algorithm occupies a lot of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by hardware crypto engines, and the improvement is more obvious for


| Working flow (MB/s) | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 271.69 | 271.68 | 271.20 | 271.78 | 271.66 | 271.39 | 271.74 | 271.69 |
| HW with 4 CPUs | 1270.39 | 2429.30 | 3453.60 | 3489.52 | 3495.98 | 3499.26 | 3500.83 | 3499.89 |
| HW-SW codesign | 1921.24 | 3536.86 | 3667.12 | 3686.88 | 3696.00 | 3701.05 | 3703.26 | 3701.54 |

Figure 18: Encryption bandwidth for different working flows (3DES-CBC).

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KBHW-SW co-design vs SW improvement 60714 120185 125218 125657 126052 126374 126280 126241HW-SW co-design vs HW with 4 CPUs improvement 5123 4559 618 566 572 577 578 576

block size

16 CPUs 32 processes DES3-CBC HW-SW co-design encryption bandwidths improvement

HW-SW co-design vs SW improvement HW-SW co-design vs HW with 4 CPUs improvement

000

20000

40000

60000

80000

100000

120000

140000

Encr

yptio

n Ba

ndw

idth

s Im

prov

emen

t

Figure 19 Performance comparisons with HW-SW codesign HW with 4 CPUs and SW

larger block sizesThe reason ismore invocation cost inducedfor smaller data blocks The remaining CPU idle can betransformed for software encryption and the maximumaggregate bandwidth could achieve 3703MBs with hardwareand software codesign
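As a rough sanity check of the aggregate figure, the codesign bandwidth can be estimated by adding the hardware bandwidth obtained with four CPUs to the share of software bandwidth expected from the other 12 CPUs. The sketch below is not the authors' method: it assumes that software throughput scales linearly with the number of CPUs, and it uses the 256 KB readings of Figure 18 with decimal points restored. The function name is hypothetical.

```python
# Figure 18 readings at 256 KB blocks, in MB/s (assumed: decimal points restored).
SW_16_CPUS = 271.74          # software-only 3DES-CBC on all 16 CPUs
HW_4_CPUS = 3500.83          # hardware engines driven by 4 CPUs
MEASURED_CODESIGN = 3703.26  # HW-SW codesign on 16 CPUs


def estimate_codesign_bandwidth(hw_bw, sw_bw_all_cpus, total_cpus, hw_cpus):
    """Estimate aggregate bandwidth.

    Assumption (not from the paper): software bandwidth scales linearly
    with the number of CPUs assigned to it.
    """
    sw_cpus = total_cpus - hw_cpus
    sw_share = sw_bw_all_cpus * sw_cpus / total_cpus
    return hw_bw + sw_share


estimate = estimate_codesign_bandwidth(HW_4_CPUS, SW_16_CPUS, total_cpus=16, hw_cpus=4)
print(f"estimated {estimate:.2f} MB/s, measured {MEASURED_CODESIGN:.2f} MB/s")
```

Under this assumption the estimate lands within 1% of the measured codesign bandwidth, which is consistent with the software share being small for 3DES.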

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, HW-SW codesign gives a nearly 12 times bandwidth improvement over software encryption. There is only a 6 times improvement for small data blocks (4 KB), due to the worse utilization of the hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is not so obvious for large data blocks. The reason is the poor contribution of software encryption without AES-NI. Still, we get a 45%~51% improvement for small data blocks with HW-SW codesign and an additional 200~300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPUs be used for other important tasks, such as database management. Of course, it is up to the user to decide on the trade-off between CPU idle and encryption performance. For example, during rush hour the user could take full advantage of both hardware and software for maximum overall performance; during normal working time, with fewer encryption requests, the user could allocate only the management resources needed for the HACs and offload as much as possible.

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW (MB/s) | 258.41 | 419.18 | 504.74 | 608.54 | 679.39 | 720.47 | 741.12 | 756.26 |
| SW-NI (MB/s) | 342.91 | 613.02 | 857.61 | 1182.25 | 1451.02 | 1653.04 | 1761.68 | 1831.19 |
| HW (MB/s) | 228.04 | 409.20 | 573.66 | 814.16 | 980.63 | 1179.44 | 1276.61 | 1336.44 |
| HW-SW co-design (MB/s) | 338.96 | 608.02 | 851.59 | 1173.38 | 1416.41 | 1705.11 | 1818.99 | 1884.28 |

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).
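The improvement percentages quoted for 3DES-CBC follow from the bandwidth readings by the usual relative-difference formula. The sketch below recomputes them. The bandwidth lists are the Figure 18 readings with decimal points restored, which is an assumption about the garbled extraction, and `improvement_percent` is a hypothetical helper name.

```python
def improvement_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


block_sizes_kb = [4, 8, 16, 32, 64, 128, 256, 512]
# Figure 18 readings in MB/s (assumed: decimal points restored).
sw = [271.69, 271.68, 271.20, 271.78, 271.66, 271.39, 271.74, 271.69]
hw4 = [1270.39, 2429.30, 3453.60, 3489.52, 3495.98, 3499.26, 3500.83, 3499.89]
codesign = [1921.24, 3536.86, 3667.12, 3686.88, 3696.00, 3701.05, 3703.26, 3701.54]

vs_sw = [improvement_percent(c, s) for c, s in zip(codesign, sw)]
vs_hw = [improvement_percent(c, h) for c, h in zip(codesign, hw4)]

for size, a, b in zip(block_sizes_kb, vs_sw, vs_hw):
    print(f"{size:>3} KB: vs SW {a:8.2f}%   vs HW with 4 CPUs {b:6.2f}%")
```

For 4 KB blocks this gives roughly 607% over software (about 6 times) and roughly 51% over hardware with four CPUs, matching the values discussed in the text.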

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we can obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. Similar to the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) software encryption with AES-NI instruction acceleration (SW-NI); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth (MB/s) and present the results in Figure 20.
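The paper does not spell out the conversion from RPS to network bandwidth. A minimal sketch, assuming each request transfers exactly one page and 1 MB = 1024 KB, is shown below; with these assumptions the 16-process SW and SW-NI entries for 4 KB pages reproduce the corresponding Figure 20 readings (decimal points restored). The function name is hypothetical.

```python
def rps_to_mb_per_s(rps, page_kb):
    """Convert requests per second to MB/s.

    Assumptions (not stated in the paper): one page per request,
    1 MB = 1024 KB, and protocol overhead is ignored.
    """
    return rps * page_kb / 1024.0


# 16-process RPS readings for 4 KB pages (assumed: decimal points restored).
sw_rps_4kb = 66153.10
sw_ni_rps_4kb = 87784.10

print(f"SW:    {rps_to_mb_per_s(sw_rps_4kb, 4):.2f} MB/s")
print(f"SW-NI: {rps_to_mb_per_s(sw_ni_rps_4kb, 4):.2f} MB/s")
```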

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small page sizes. For example, the maximum improvement reaches 48.64% for the 4 KB page case. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance is better than with either software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. For HW-SW codesign, however, there is 12.51%~18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%~23.83%, and there is also a 2.85%~3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

| Page size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 66153.10 | 53654.50 | 32303.20 | 19473.40 | 10870.30 | 5763.76 | 2964.48 | 1512.51 |
| SW-NI | 87784.10 | 78466.20 | 54887.00 | 37831.90 | 23216.30 | 13224.30 | 7046.71 | 3662.37 |
| HW | 58378.50 | 52377.60 | 36714.30 | 26053.10 | 15690.00 | 9435.54 | 5106.42 | 2672.87 |
| HW-SW co-design | 86679.40 | 77827.00 | 45499.50 | 34357.70 | 22623.60 | 13670.90 | 7275.66 | 3770.97 |

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

| Page size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 65885.90 | 52987.80 | 32030.00 | 19391.20 | 10803.80 | 5744.14 | 2964.18 | 1503.48 |
| SW-NI | 86302.50 | 77827.15 | 53017.20 | 37116.10 | 22920.40 | 13092.50 | 7020.85 | 3657.35 |
| HW | 58499.10 | 54891.60 | 37946.40 | 26972.10 | 16401.90 | 10041.90 | 5486.14 | 2881.66 |
| HW-SW co-design | 85177.20 | 77211.90 | 46404.00 | 36166.60 | 24875.10 | 15347.70 | 8325.19 | 4391.45 |

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW-NI improvement (%) | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW co-design vs HW improvement (%) | 48.64 | 48.59 | 48.45 | 44.12 | 44.44 | 44.57 | 42.49 | 40.99 |
| CPU idle (%) | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle).

If we expect a higher network bandwidth for better encryption performance, we can use the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with HW-SW codesign, we increase the number of processes to use the idle CPU. We tested with 32 processes; the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth (MB/s) and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign achieves a 49%~52% bandwidth improvement over hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the 16-process case, the influence is more obvious for large page sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling decision. If the page size is equal to or larger than 64 KB, the overall performance is better than with either software-only or hardware-only encryption.

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW (MB/s) | 257.37 | 413.97 | 500.47 | 605.98 | 675.24 | 718.02 | 741.05 | 751.74 |
| SW-NI (MB/s) | 337.12 | 608.26 | 828.39 | 1159.88 | 1432.53 | 1636.56 | 1755.21 | 1828.68 |
| HW (MB/s) | 228.51 | 421.03 | 592.91 | 842.88 | 1025.12 | 1255.24 | 1371.54 | 1440.83 |
| HW-SW co-design (MB/s) | 334.03 | 603.22 | 820.70 | 1159.42 | 1562.83 | 1923.08 | 2092.16 | 2196.43 |

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW-NI improvement (%) | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11 |
| HW-SW co-design vs HW improvement (%) | 46.17 | 43.27 | 38.42 | 37.55 | 52.45 | 53.20 | 52.54 | 52.44 |
| CPU idle (%) | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60 |

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle).

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption path. Hardware engines do not perform as well as AES-NI for small pages because of the context-switch cost. Therefore, ACSA automatically adopted software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign could be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for a page size of 512 KB.
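The request filter described above can be pictured as a size-based dispatch. The sketch below is only an illustration, not ACSA's implementation: the 64 KB cut-off is a hypothetical value taken from the crossover seen in the 32-process results, and the names `Request`, `choose_engine`, and the `hw_queue_full` flag are invented for this example.

```python
from dataclasses import dataclass

# Hypothetical cut-off: below this page size AES-NI was at least as fast
# as the codesign flow in the 32-process measurements.
SMALL_PAGE_THRESHOLD_KB = 64


@dataclass
class Request:
    page_kb: int


def choose_engine(request, hw_queue_full=False):
    """Pick an encryption path for one request.

    Small pages go to software (AES-NI) because the hardware invocation
    and context-switch cost dominates; large pages go to the hardware
    accelerator (HAC) unless it is saturated. `hw_queue_full` is an
    illustrative stand-in for a system-load signal.
    """
    if request.page_kb < SMALL_PAGE_THRESHOLD_KB or hw_queue_full:
        return "AES-NI"
    return "HAC"


for size in (4, 32, 64, 512):
    print(size, "KB ->", choose_engine(Request(page_kb=size)))
```

A real scheduler would also weigh the current CPU load, as the paper's adaptive scheduling does; the single boolean here only marks where such a signal would enter.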


| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW-NI improvement, 16 processes (%) | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW co-design vs SW-NI improvement, 32 processes (%) | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11 |
| CPU idle, 16 processes (%) | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |
| CPU idle, 32 processes (%) | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60 |

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs).

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

| Page size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 36334.70 | 25207.40 | 12791.60 | 6900.93 | 3589.73 | 1833.76 | 924.40 | 466.12 |
| HW | 58275.70 | 50717.80 | 25772.70 | 14525.70 | 7645.65 | 3902.54 | 1968.62 | 990.88 |

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign gives a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology makes a better trade-off between performance and CPU idle possible. It is possible to get the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user could use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI, and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance for two working flows in this section. The first is software encryption without AES-NI. The other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We get 1.6~2.1 times the performance with hardware encryption.

We summarize the network bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60%~112% improvement compared with software encryption. Besides the performance improvement, 13.62%~89.54% CPU idle is still available with hardware encryption.

According to the OpenSSL test, there is a maximum 12 times performance improvement. However, there is only a 2 times RPS improvement in end-to-end testing. The problem is a constraint of the testing platform. We found that the software decryption speed in the client has become a bottleneck: all the CPUs of the client are completely loaded.

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW (MB/s) | 141.93 | 196.93 | 199.87 | 215.65 | 224.36 | 229.22 | 231.10 | 233.06 |
| HW (MB/s) | 227.64 | 396.23 | 402.70 | 453.93 | 477.85 | 487.82 | 492.16 | 495.44 |

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW vs SW improvement (%) | 60.39 | 101.20 | 101.48 | 110.49 | 112.99 | 112.82 | 112.96 | 112.58 |
| CPU idle (%) | 13.62 | 15.40 | 53.15 | 70.40 | 75.07 | 86.04 | 88.17 | 89.54 |

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

7. Conclusions and Future Work

In this work, we presented ACSA (Adaptive Crypto System based on Accelerators), which is able to adopt the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well in computation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. By setting up a 40 Gbps network, we were able to evaluate the system performance in real applications under a high workload with various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtain about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtain a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., “Big data: survey, technologies, opportunities, and challenges,” The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, “Security and privacy protection of social networks in big data era,” Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, “A survey on security issues in service delivery models of cloud computing,” Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, “Trust but verify: auditing the secure internet of things,” in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, “Hardware-software co-design of AES on FPGA,” in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, “Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip,” ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, “Security methods for web-based applications on embedded system,” in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, “Article: David Brumley and Dan Boneh, Remote Timing Attacks are Practical,” in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, “Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA,” in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, “Accelerating OpenSSL’s ECC with low cost reconfigurable hardware,” in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, “An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions,” in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT’11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, “Design and test of a scalable security processor,” in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, “10 Gbps implementation of TLS/SSL accelerator on FPGA,” in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, “System architecture implementation for a high-performance Network Security Processor,” in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, “Intel advanced encryption standard instructions (aes-ni),” Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, “Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security,” in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, “Improving the performance of a scalable encryption algorithm (SEA) using FPGA,” International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., “A 3 GHz dual core processor ARM cortex TM-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization,” IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, “Energy- and cost-efficiency analysis of ARM-based clusters,” in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, “Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors,” Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, “High-speed VLSI architectures for the AES algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, “Very compact FPGA implementation of the AES algorithm,” in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, “A study of encryption algorithms (RSA, DES, 3DES and AES) for information security,” International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, “A comparative analysis of SHA and MD5 algorithm,” Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, “Improving I/O forwarding throughput with data compression,” in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, “Using underutilized CPU resources to enhance its reliability,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., “Holistically evaluating the environmental impacts in modern computing systems,” in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, “Growth in data center electricity use 2005 to 2010,” A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, “Green computing: new challenges and opportunities,” in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


[Figure 8 diagram. Labels: Original data; segmentation into 16384-byte records; Record Protocol Unit (Hash, Add explicit IV, Padding); Crypto unit (Encryption); Add header; TCP data package. Note in the figure: send plaintexts to the hardware one by one.]

Figure 8: Existing processing flow for a 64 KB data without data aggregation.

[Figure 9 diagram. Labels: Original data; segmentation into 16384-byte records; Record Protocol Unit (Hash, Add explicit IV, Padding); Hardware Crypto Unit (plaintext to ciphertext); Add header; TCP data package. Note in the figure: send all plaintexts to the hardware at once.]

Figure 9: Proposed processing flow for a 64 KB data with data aggregation.
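The difference between the two flows can be illustrated by counting hardware invocations. The sketch below is a simplified model, not the authors' code: it splits a payload into records of at most 16384 bytes (the record size shown in the figures) and counts one hardware call per record without aggregation, or a single call for the whole batch with aggregation. Hashing, IV, padding, and headers are ignored, and the function names are hypothetical.

```python
RECORD_SIZE = 16384  # bytes per record, as labeled in Figures 8 and 9


def split_into_records(payload, record_size=RECORD_SIZE):
    """Split a payload into consecutive records of at most record_size bytes."""
    return [payload[i:i + record_size] for i in range(0, len(payload), record_size)]


def hardware_calls(payload, aggregate):
    """Number of hardware invocations needed to encrypt the payload.

    Without aggregation each record is submitted separately; with
    aggregation all records are submitted in one batch.
    """
    records = split_into_records(payload)
    if aggregate and records:
        return 1
    return len(records)


page = bytes(64 * 1024)  # a 64 KB page
print("without aggregation:", hardware_calls(page, aggregate=False))
print("with aggregation:   ", hardware_calls(page, aggregate=True))
```

Since each invocation carries a fixed cost (system call, context switch, driver work), reducing four calls to one for a 64 KB page is where the aggregation gain in this simplified model comes from.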


Table 2: Parameters for MM algorithm.

| Parameter | Description |
|---|---|
| $N$ | The number of available CPUs in the system |
| $N_{sw}$ | The number of CPUs used for software encryption with AES-NI |
| $N_{hw}$ | The number of CPUs responsible for the processes of hardware invocation |
| $P_{sw}$ | The processes invoked for software encryption with AES-NI |
| $P_{hw}$ | The processes invoked for hardware encryption |
| $P_{total}$ | The total number of active processes |
| $HP_{i,j}$ | The maximum bandwidth with hardware encryption with $i$ CPUs and $j$ processes, completely loaded |
| $SP_i$ | The maximum bandwidth with software encryption with $i$ CPUs and $i$ processes, completely loaded |
| $P_{i,j}$ | The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$) |

[Figure 10 diagram. The Web Server allocates the CPUs into two groups: one group runs the processes invoked for software encryption (AES-NI), and the other runs the processes invoked for hardware encryption (HACs).]

Figure 10: CPUs and processes allocation with MM strategy.

(S2) Activate $i$ CPUs ($i = 1, 2, \ldots, N$), increase the number of processes $j$ for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ under the condition that the CPUs are completely loaded.

(S3) Calculate the performance difference between software computing and hardware encryption: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases $i = 1, 2, \ldots, N$, follow steps (S1), (S2), and (S3) one by one and get the value $P_{i,j}$ at each number of CPUs.

(S5) Find the maximum $P_{i,j}$ from (S4), $\max(P_{i,j})$ over $i = 1, 2, \ldots, N$. From $\max(P_{i,j})$, the parameters can be determined: the number of CPUs used for accelerator invocation, $N_{hw} = i$; the number of processes for hardware encryption, $P_{hw} = j$; the number of CPUs for software encryption, $N_{sw} = N - N_{hw}$; and the corresponding number of processes, $P_{sw} = N_{sw}$. The total number of processes to be activated is $P_{total} = P_{sw} + P_{hw}$.
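The selection in steps (S3) to (S5) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the measured bandwidths are passed in as plain dictionaries (`sp[i]` for $SP_i$, `hp[(i, j)]` for $HP_{i,j}$), the function name `mm_allocate` is hypothetical, and the toy numbers in the usage part are made up purely to exercise the code.

```python
def mm_allocate(n_cpus, sp, hp):
    """Pick the CPU/process split that maximizes P[i, j] = HP[i, j] - SP[i].

    sp: dict mapping CPU count i -> software bandwidth SP_i
    hp: dict mapping (i, j) -> hardware bandwidth HP_{i,j}
    Returns a dict with N_hw, P_hw, N_sw, P_sw, and P_total.
    """
    best = None  # (difference, i, j)
    for (i, j), hw_bw in hp.items():
        if i > n_cpus:
            continue  # cannot use more CPUs than the system has
        diff = hw_bw - sp[i]
        if best is None or diff > best[0]:
            best = (diff, i, j)
    if best is None:
        raise ValueError("no hardware measurements for the given CPU count")
    _, n_hw, p_hw = best
    n_sw = n_cpus - n_hw   # remaining CPUs go to software encryption
    p_sw = n_sw            # one software process per remaining CPU
    return {"N_hw": n_hw, "P_hw": p_hw, "N_sw": n_sw, "P_sw": p_sw,
            "P_total": p_sw + p_hw}


# Toy measurements (made up for illustration only).
toy_sp = {1: 1.0, 2: 2.0}
toy_hp = {(1, 1): 3.0, (1, 2): 5.0, (2, 1): 4.0, (2, 2): 5.5}
result = mm_allocate(4, toy_sp, toy_hp)
print(result)
```

With the toy data the largest difference occurs at one CPU and two hardware processes, so the other three CPUs are left for software encryption.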

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example for the exploration of parameters $i$ and $j$, assuming $N = 16$. The detailed process followed by MM is as follows.

(S1) Adopt software encryption as the working mode through the Adaptive Scheduler in ACSA. All the encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, and increase the workload until the CPU is completely loaded. The maximum encryption bandwidth obtained is 1.01 GB/s.

(S3) Activate 2∼16 CPUs in order, corresponding to 2∼16 processes, respectively. Similar to (S2), the maximum encryption bandwidth can be explored at the different numbers of CPUs and processes. We summarize $SP_i$ (i = 1, 2, ..., 16) in Figure 12.

(S4) Adopt hardware encryption as the working mode through the Adaptive Scheduler in ACSA. All the encryption requests are processed through hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, and increase the workload until the CPU is completely loaded. The maximum encryption bandwidth obtained is 7.52 GB/s in hardware working mode.

(S6) Keep one CPU activated and invoke 2∼n processes in order. Record the maximum encryption bandwidth when the CPU is completely loaded in each case. If the bandwidth with n processes is less than or equal to the bandwidth with n-1 processes, the parameter j is confirmed as n-1 and the test can be stopped.

(S7) Activate 2∼16 CPUs in turn and follow steps (S5) and (S6) to confirm the number of processes at which the maximum encryption bandwidth is achieved. The number of processes at each number of CPUs is recorded in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption. The results are recorded in Figure 13.

Algorithm 1: The strategy for the MM algorithm.

Input: N: the number of available CPUs in the system.
Output:
    $N_{sw}$: the number of CPUs used for software encryption with AES-NI.
    $N_{hw}$: the number of CPUs responsible for the processes of hardware invocation.
    $P_{sw}$: the processes invoked for software encryption with AES-NI.
    $P_{hw}$: the processes invoked for hardware encryption.
    $P_{total}$: the total number of active processes.

/* Test for the maximum bandwidth with AES-NI at different CPU numbers; in general, the process number is equal to the CPU number. */
(1) for CPU number i from 1 to N do
(2)     calculate $SP_i$    /* $SP_i$ is the maximum bandwidth when the CPU number is i */
(3) end for
/* Test for the maximum bandwidth with hardware at different CPU numbers and process numbers. */
(4) for CPU number i from 1 to N do
(5)     for process number j from 1 to 32 do
(6)         calculate $HP_{i,j}$    /* $HP_{i,j}$ is the maximum bandwidth when the CPU number and process number are i and j */
(7)     end for
(8) end for
/* Calculate the maximum performance difference between AES-NI computing and hardware encryption. */
(9) let maxdiff ← $P_{1,1}$
(10) for CPU number i from 1 to N do
(11)     for process number j from 1 to 32 do
(12)         $P_{i,j}$ ← ($HP_{i,j}$ − $SP_i$)
(13)         if (maxdiff < $P_{i,j}$) then
(14)             maxdiff ← $P_{i,j}$
(15)             $N_{hw}$ ← i
(16)             $P_{hw}$ ← j
(17)             $N_{sw}$ ← (N − i)
(18)             $P_{sw}$ ← $N_{sw}$
(19)             $P_{total}$ ← ($P_{sw}$ + $P_{hw}$)
(20)         end if
(21)     end for
(22) end for

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

N          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
$P_{hw}$   12  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16
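The stopping rule of step (S6) can be sketched as a small search loop. This is a minimal illustration, not the authors' test harness: `measure_bandwidth` is a hypothetical callback standing in for a real benchmark run, `toy_curve` is an invented saturation curve, and the upper bound of 32 processes is taken from Algorithm 1.

```python
from typing import Callable


def find_saturating_process_count(
    measure_bandwidth: Callable[[int], float],
    max_processes: int = 32,
) -> int:
    """Return j, the process count after which bandwidth stops improving."""
    previous_bw = measure_bandwidth(1)
    for n in range(2, max_processes + 1):
        current_bw = measure_bandwidth(n)
        if current_bw <= previous_bw:
            # n processes are no better than n-1: confirm j = n-1 and stop.
            return n - 1
        previous_bw = current_bw
    # Bandwidth was still increasing at the upper bound of the search.
    return max_processes


def toy_curve(processes: int) -> float:
    """Hypothetical curve: linear growth that saturates at 7.52 GB/s."""
    return min(0.65 * processes, 7.52)


if __name__ == "__main__":
    print(find_saturating_process_count(toy_curve))
```

In a real run the callback would load one CPU completely and report the measured bandwidth, so each call is expensive; stopping at the first non-improving process count keeps the number of measurements small.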

From Figure 13, we find that the maximum difference occurs in the case of N = 1, $P_{hw}$ = 12, i.e., the number of CPUs is 1 and the corresponding number of processes is 12. Based on the test results, 27 processes should be invoked for the system, of which 12 are scheduled for hardware management and the remaining 15 can be allocated to software encryption with AES-NI if no other work is needed.
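The selection in Algorithm 1 can be reproduced with a short script. The sketch below makes several assumptions: the bandwidth values are read from Figure 12 with two decimal places (GB/s), the hardware table only contains the best process count per CPU number from Table 3 rather than the full 1∼32 sweep, and the search starts from negative infinity instead of $P_{1,1}$ because $P_{1,1}$ is not part of that reduced table. The names `Allocation` and `mm_strategy` are illustrative.

```python
import math
from typing import Dict, NamedTuple, Tuple


class Allocation(NamedTuple):
    n_hw: int     # CPUs that manage hardware invocation
    p_hw: int     # processes that invoke the hardware engines
    n_sw: int     # CPUs left for AES-NI software encryption
    p_sw: int     # software processes (one per software CPU)
    p_total: int  # total active processes


def mm_strategy(n: int, sp: Dict[int, float],
                hp: Dict[Tuple[int, int], float]) -> Allocation:
    """Pick (i, j) that maximizes HP[i, j] - SP[i], as in Algorithm 1."""
    max_diff = -math.inf
    best = None
    for (i, j), hw_bw in sorted(hp.items()):
        diff = hw_bw - sp[i]
        if diff > max_diff:  # strict comparison, as in line (13)
            max_diff = diff
            n_sw = n - i
            best = Allocation(i, j, n_sw, n_sw, n_sw + j)
    if best is None:
        raise ValueError("hp must contain at least one measurement")
    return best


# Values as read from Figure 12 and Table 3 (assumed decimal placement).
SP = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
HP_BEST = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61] + [7.62] * 9
P_HW = [12] + [16] * 15

if __name__ == "__main__":
    sp = {i + 1: bw for i, bw in enumerate(SP)}
    hp = {(i + 1, P_HW[i]): bw for i, bw in enumerate(HP_BEST)}
    print(mm_strategy(16, sp, hp))
```

With these inputs the script selects one CPU and 12 processes for hardware management and 15 CPUs with 15 processes for software encryption, 27 processes in total, which agrees with the allocation stated in the text.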

[Figure 11, panel (a), user space: Web Server, OpenSSL, software crypto lib. Panel (b), user space: Web Server, OpenSSL; kernel space: CryptoDev, Crypto API, hardware driver, hardware engine.]

Figure 11: The working flow of (a) software encryption and (b) hardware encryption.

[Figure 12 data: AES-256-CBC crypto performance, encryption bandwidth (GB/s) versus the number of CPUs.]

CPUs       1     2     3     4     5     6     7     8     9     10     11     12     13     14     15     16
Software   1.01  2.04  3.06  4.08  5.05  6.13  7.16  8.17  9.19  10.22  11.22  12.26  13.28  14.31  15.33  15.97
Hardware   7.52  7.61  7.62  7.62  7.62  7.58  7.61  7.62  7.62  7.62   7.62   7.62   7.62   7.62   7.62   7.62

Figure 12: The encryption performance with hardware/software working mode at different CPU numbers.

[Figure 13 data: performance difference between hardware and software encryption (GB/s) versus the number of CPUs.]

CPUs        1     2     3     4     5     6     7     8      9      10     11     12     13     14     15     16
Difference  6.51  5.57  4.56  3.54  2.57  1.45  0.45  -0.55  -1.57  -2.60  -3.60  -4.64  -5.66  -6.69  -7.71  -8.35

Figure 13: The performance differences between hardware and software encryption.

Table 4: Hardware environment of testing.

Hardware       Number   Configurations
ARM Server     2        CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB
Network Card   4        10 Gbps each
Net cable      4        2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables

Table 5: Software environment of testing.

Software Configuration   Software Version
Operating System         Linux-4127
OpenSSL                  OpenSSL-1.0.2j
Nginx                    Nginx-1116
Benchmark                ab (Apache Benchmark)

[Figure 14: the server and the testing machine (client) are connected by four 10 Gbps links through 10 Gbps Ethernet ports.]

Figure 14: Testing network (ACSA acted as Server, the testing machine as Clients).

6. System Test and Analysis

To reflect real working environments, we established a test platform for ACSA with 40 Gbps network bandwidth, as shown in Figure 14. As shown in Table 4, we deployed two servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units, whose main frequency is 2.1 GHz, and 128 GB of memory. To satisfy the highly concurrent requests from a large number of clients, we built the networking environment with 4 network cards, each of which contributes 10 Gbps of bandwidth.

As illustrated in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4127, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of accelerators. ab (Apache Benchmark) [32] is a widely used testing tool for website performance. We choose this benchmark for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed through the standard benchmark speed in OpenSSL. With speed, we can measure the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms are AES-256-CBC and 3DES-CBC; AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM CPUs. The performance is measured in MB/s in this testing, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service that can be provided by ACSA. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps networking, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (requests per second). For further analysis, we choose the cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 8 different web pages (4 KB to 512 KB) to see the performance for different data sizes. We take full advantage of the 4 network cards to run 32 ab processes, so as to guarantee testing pressure with a high workload.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with CPU, in which the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, in which the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy; this testing is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16; the encryption flow with AES-NI is the best case for software encryption.
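The improvement percentages compared in Figure 16 follow from the bandwidths in Figure 15 as a relative difference. The snippet below is a minimal sketch of that arithmetic; the 4 KB bandwidth values are read from Figure 15 assuming two decimal places (MB/s), and the helper name `improvement_percent` is illustrative.

```python
def improvement_percent(new_bw: float, baseline_bw: float) -> float:
    """Relative improvement of new_bw over baseline_bw, in percent."""
    return (new_bw - baseline_bw) / baseline_bw * 100.0


# AES-256-CBC, 4 KB blocks, MB/s (values as read from Figure 15).
SW, SW_NI, HW, CODESIGN = 1272.68, 11772.10, 1650.75, 11446.83

if __name__ == "__main__":
    for name, baseline in (("SW-NI", SW_NI), ("HW", HW), ("SW", SW)):
        pct = improvement_percent(CODESIGN, baseline)
        print(f"HW-SW codesign vs {name}: {pct:.2f}%")
```

A negative value, as for SW-NI at 4 KB, means the codesign flow is slower than the baseline for that block size.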

[Figure 15 data: encryption bandwidth (MB/s) with different working flows for AES-256-CBC, by block size.]

Block size   SW        SW-NI      HW        HW-SW codesign
4 KB         1272.68   11772.10   1650.75   11446.83
8 KB         1272.31   11804.24   3188.08   11635.55
16 KB        1269.26   11817.64   5891.48   12576.91
32 KB        1263.54   11809.72   5932.79   13948.63
64 KB        1263.99   11802.95   5954.03   15952.37
128 KB       1258.23   11810.72   5964.93   16989.31
256 KB       1247.49   11733.47   5970.58   16956.30
512 KB       1251.86   11658.08   5966.48   16855.39

Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC.

[Figure 16 data: encryption bandwidth improvement (%) of HW-SW codesign for AES-256-CBC with 16 CPUs and 32 processes, by block size.]

Block size   vs SW-NI   vs HW    vs SW
4 KB         -2.76      593.43   799.43
8 KB         -1.43      264.97   814.52
16 KB        6.42       113.48   890.89
32 KB        18.11      135.11   1003.93
64 KB        35.16      167.93   1162.06
128 KB       43.85      184.82   1250.25
256 KB       44.51      184.00   1259.23
512 KB       44.58      182.50   1246.43

Figure 16: Performance comparisons of HW-SW codesign with SW-NI, HW, and SW.

[Figure 17: encryption bandwidth (MB/s, axis from 0 to 4000) of HW for 3DES-CBC versus block size (4 KB to 512 KB), with one curve for each number of CPUs from 1 to 16.]

Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs.

As we can see from Figures 15 and 16, hardware encryption with hardware engines performs better than the crypto lib but worse than AES-NI. The reason is the invocation cost of the hardware engines, with context switches and mode switches, as we analyzed in Section 4.1. Besides, for AES-NI, the encryption function is accelerated directly through instructions, and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We obtain a 799%∼1246% performance improvement compared with software encryption and an improvement of 593% (4 KB) to 182% (512 KB) compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18%∼44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than that of AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for checking the data size and making the scheduling decision, the maximum encryption bandwidth is somewhat lower than that of AES-NI. However, this influence decreases as the data size increases, because scheduling checks become less frequent. If the data size is bigger than 16 KB, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
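The data filter described above amounts to a size comparison against a configurable threshold. The following is a minimal sketch of that decision only, not the ACSA implementation: the function name `choose_engine`, the string labels, and the treatment of a request of exactly 16 KB (sent to the hardware path here) are assumptions.

```python
THRESHOLD_BYTES = 16 * 1024  # data filter threshold used in this experiment


def choose_engine(size_bytes: int, threshold: int = THRESHOLD_BYTES) -> str:
    """Route small requests to AES-NI and larger ones to the accelerators."""
    if size_bytes < threshold:
        return "aes-ni"    # invocation cost would dominate on hardware
    return "hardware"      # large enough to amortize context/mode switches


if __name__ == "__main__":
    for kb in (4, 8, 16, 64, 512):
        print(f"{kb:>3} KB -> {choose_engine(kb * 1024)}")
```

The check itself costs CPU time on every request, which is consistent with the slightly lower bandwidth than pure AES-NI reported for 4 KB and 8 KB blocks.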

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software. It is therefore not reasonable to apply the data filter and dynamic scheduling. Instead, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can get almost the best encryption bandwidth with only four CPUs.

If we want to get the maximum performance from both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines, using only four CPUs (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).
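A rough cross-check of this 4 + 12 CPU split is possible from the 512 KB values reported in Figure 18. The estimate below assumes that software 3DES throughput scales linearly with the number of CPUs, which the paper does not state, and reads the bandwidths with two decimal places (MB/s); it is only a plausibility sketch.

```python
SW_16_CPUS = 271.69      # software 3DES on 16 CPUs, 512 KB blocks (MB/s)
HW_4_CPUS = 3499.89      # hardware engines managed by 4 CPUs (MB/s)
MEASURED_CODESIGN = 3701.54  # HW-SW codesign with 16 CPUs (MB/s)


def estimate_codesign(hw_bw: float, sw_bw_all: float,
                      sw_cpus: int, total_cpus: int) -> float:
    """Hardware bandwidth plus a linearly scaled software share."""
    return hw_bw + sw_bw_all * sw_cpus / total_cpus


if __name__ == "__main__":
    estimate = estimate_codesign(HW_4_CPUS, SW_16_CPUS, 12, 16)
    print(f"estimated: {estimate:.2f} MB/s, measured: {MEASURED_CODESIGN} MB/s")
```

Under that assumption the 12 software CPUs add about 204 MB/s, which is close to the measured gap between the codesign flow and hardware with four CPUs.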

As we can see from Figure 18, even though 16 CPUs are responsible for the 3DES encryption, the encryption bandwidth is still very small (only 271 MB/s). Since it is computation intensive, this algorithm occupies a lot of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by hardware crypto engines, and the improvement is more obvious for larger block sizes. The reason is that more invocation cost is induced for smaller data blocks. The remaining idle CPU can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

[Figure 18 data: encryption bandwidth (MB/s) with different working flows for 3DES-CBC, by block size.]

Block size   SW       HW with 4 CPUs   HW-SW codesign
4 KB         271.69   1270.39          1921.24
8 KB         271.68   2429.30          3536.86
16 KB        271.20   3453.60          3667.12
32 KB        271.78   3489.52          3686.88
64 KB        271.66   3495.98          3696.00
128 KB       271.39   3499.26          3701.05
256 KB       271.74   3500.83          3703.26
512 KB       271.69   3499.89          3701.54

Figure 18: Encryption bandwidth for different working flows.

[Figure 19 data: encryption bandwidth improvement (%) of HW-SW codesign for 3DES-CBC with 16 CPUs and 32 processes, by block size.]

Block size   vs SW     vs HW with 4 CPUs
4 KB         607.14    51.23
8 KB         1201.85   45.59
16 KB        1252.18   6.18
32 KB        1256.57   5.66
64 KB        1260.52   5.72
128 KB       1263.74   5.77
256 KB       1262.80   5.78
512 KB       1262.41   5.76

Figure 19: Performance comparisons of HW-SW codesign, HW with 4 CPUs, and SW.

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, there is nearly a 12-fold bandwidth improvement through HW-SW codesign compared with software encryption. There is only a 6-fold improvement for small data blocks (4 KB) due to the worse utilization of hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is not so obvious for large data blocks. The reason is the poor contribution of software encryption without AES-NI. Still, we get a 45%∼51% improvement for small data blocks with HW-SW codesign and an additional 200∼300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with hardware engines, we suggest that the remaining idle CPU be used for other important tasks, such as database management. Of course, it is up to the user to decide on the trade-off between CPU idle and encryption performance. For example, during rush hours, the user could take full advantage of both hardware and software for maximum overall performance; during normal working hours with fewer encryption requests, the user could allocate only the management resources needed for HACs and offload as much as possible.

[Figure 20 data: network bandwidth (MB/s) for ECDHE-RSA-AES256-SHA384 with 16 CPUs and 16 processes, by page size.]

Page size   SW       SW-NI     HW        HW-SW codesign
4 KB        258.41   342.91    228.04    338.96
8 KB        419.18   613.02    409.20    608.02
16 KB       504.74   857.61    573.66    851.59
32 KB       608.54   1182.25   814.16    1173.38
64 KB       679.39   1451.02   980.63    1416.41
128 KB      720.47   1653.04   1179.44   1705.11
256 KB      741.12   1761.68   1276.61   1818.99
512 KB      756.26   1831.19   1336.44   1884.28

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384.

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how the performance varies with different request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. Similar to the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes on 16 CPUs, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data, in megabytes, per second), and CPU idle.

(1) 16 CPUs, 16 Processes. We show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth (MB/s) and present the results in Figure 20.
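The conversion from RPS to network bandwidth can be sketched as page size times request rate. The snippet below assumes that only the page payload is counted and that 1 MB equals 1024 KB; with those assumptions the 4 KB RPS values of the 16-process run (read with two decimal places) reproduce the corresponding bandwidths in Figure 20. The helper name `rps_to_mb_per_s` is illustrative.

```python
def rps_to_mb_per_s(rps: float, page_size_kb: float) -> float:
    """Convert requests per second to payload bandwidth in MB/s."""
    return rps * page_size_kb / 1024.0


if __name__ == "__main__":
    # 4 KB pages, 16 CPUs and 16 processes (RPS values from the RPS table).
    print(f"SW:    {rps_to_mb_per_s(66153.10, 4):.2f} MB/s")
    print(f"SW-NI: {rps_to_mb_per_s(87784.10, 4):.2f} MB/s")
```

Protocol overhead such as TLS records and HTTP headers is not included in this estimate, so the bandwidth actually carried on the wire would be somewhat higher.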

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small page sizes. For example, the maximum improvement reaches 48.64% for the 4 KB page. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of size detection and the scheduling decision. If the page size is equal to or bigger than 128 KB, the overall performance exceeds that of software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. For HW-SW codesign, however, there is 12.51%∼18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%∼23.83%, and there is also a 2.85%∼3.38% performance improvement.

Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)    4          8          16         32         64         128        256       512
SW                66153.10   53654.50   32303.20   19473.40   10870.30   5763.76    2964.48   1512.51
SW-NI             87784.10   78466.20   54887.00   37831.90   23216.30   13224.30   7046.71   3662.37
HW                58378.50   52377.60   36714.30   26053.10   15690.00   9435.54    5106.42   2672.87
HW-SW codesign    86679.40   77827.00   45499.50   34357.70   22623.60   13670.90   7275.66   3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)    4          8          16         32         64         128        256       512
SW                65885.90   52987.80   32030.00   19391.20   10803.80   5744.14    2964.18   1503.48
SW-NI             86302.50   77827.15   53017.20   37116.10   22920.40   13092.50   7020.85   3657.35
HW                58499.10   54891.60   37946.40   26972.10   16401.90   10041.90   5486.14   2881.66
HW-SW codesign    85177.20   77211.90   46404.00   36166.60   24875.10   15347.70   8325.19   4391.45

[Figure 21 data: network bandwidth improvement (%) of HW-SW codesign and CPU idle (%) for ECDHE-RSA-AES256-SHA384 with 16 CPUs and 16 processes, by page size.]

Page size   vs SW-NI   vs HW   CPU idle
4 KB        -1.15      48.64   1.50
8 KB        -0.81      48.59   1.60
16 KB       -0.70      48.45   1.70
32 KB       -0.75      44.12   1.80
64 KB       -2.38      44.44   15.50
128 KB      3.15       44.57   18.00
256 KB      3.25       42.49   20.00
512 KB      2.90       40.99   20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384.

If we expect a higher network bandwidth for better encryption performance, we can utilize the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with HW-SW codesign, we increase the number of processes to utilize the CPU idle. We tested with 32 processes and show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth (MB/s) and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign achieves a 49%∼52% bandwidth improvement compared with hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the case with 16 processes, the influence is more obvious for large page sizes, since more CPU idle is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling decision. If the page size is equal to or bigger than 64 KB, the overall performance exceeds that of software-only or hardware-only encryption.

[Figure 22 data: network bandwidth (MB/s) for ECDHE-RSA-AES256-SHA384 with 16 CPUs and 32 processes, by page size.]

Page size   SW       SW-NI     HW        HW-SW codesign
4 KB        257.37   337.12    228.51    334.03
8 KB        413.97   608.26    421.03    603.22
16 KB       500.47   828.39    592.91    820.70
32 KB       605.98   1159.88   842.88    1159.42
64 KB       675.24   1432.53   1025.12   1562.83
128 KB      718.02   1636.56   1255.24   1923.08
256 KB      741.05   1755.21   1371.54   2092.16
512 KB      751.74   1828.68   1440.83   2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384.

[Figure 23 data: network bandwidth improvement (%) of HW-SW codesign and CPU idle (%) for ECDHE-RSA-AES256-SHA384 with 16 CPUs and 32 processes, by page size.]

Page size   vs SW-NI   vs HW   CPU idle
4 KB        -0.92      46.17   2.50
8 KB        -0.83      43.27   2.00
16 KB       -0.93      38.42   2.00
32 KB       -0.04      37.55   1.80
64 KB       9.10       52.45   1.50
128 KB      17.51      53.20   1.80
256 KB      19.20      52.54   1.50
512 KB      20.11      52.44   1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable since we applied the request filter to choose the most suitable encryption method. Hardware engines do not perform as well as AES-NI due to the context switch cost. Therefore, ACSA automatically adopted software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve the encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for the 512 KB page size.

[Figure 24: network bandwidth improvement (%) of HW-SW codesign versus SW-NI and CPU idle (%) for ECDHE-RSA-AES256-SHA384 on 16 CPUs, by page size, with 16 processes and with 32 processes. The plotted series repeat the "vs SW-NI" and "CPU idle" values of Figures 21 and 23.]

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384.

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)   4          8          16         32         64        128       256       512
SW               36334.70   25207.40   12791.60   6900.93    3589.73   1833.76   924.40    466.12
HW               58275.70   50717.80   25772.70   14525.70   7645.65   3902.54   1968.62   990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of available processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology offers a better trade-off between performance and CPU idle. It is possible to get the same network bandwidth/RPS as AES-NI but still have additional CPU resources available. The user could use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance of 2 working flows in this section. The first is software encryption without AES-NI. The other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6∼2.1 times the performance of software encryption.

We summarize the bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large pages. Under the same testing conditions, there is a 60%∼112% improvement compared with software encryption. Besides the improved performance, 13.62%∼89.54% CPU idle is still available with hardware encryption.

According to the OpenSSL test, there is a maximum 12-fold performance improvement. However, there is only a 2-fold RPS improvement in end-to-end testing. The problem is a constraint of the testing platform: we found that the software decryption speed in the client has become a bottleneck. All the CPUs of the client are completely loaded.

[Figure 25 data: network bandwidth (MB/s) for ECDHE-RSA-DES-CBC3-SHA with 16 CPUs and 32 processes, by page size.]

Page size   SW       HW
4 KB        141.93   227.64
8 KB        196.93   396.23
16 KB       199.87   402.70
32 KB       215.65   453.93
64 KB       224.36   477.85
128 KB      229.22   487.82
256 KB      231.10   492.16
512 KB      233.06   495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

[Figure 26 data: network bandwidth improvement (%) of HW over SW and CPU idle (%) for ECDHE-RSA-DES-CBC3-SHA with 16 CPUs and 32 processes, by page size.]

Page size   HW vs SW   CPU idle
4 KB        60.39      13.62
8 KB        101.20     15.40
16 KB       101.48     53.15
32 KB       110.49     70.40
64 KB       112.99     75.07
128 KB      112.82     86.04
256 KB      112.96     88.17
512 KB      112.58     89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

7. Conclusions and Future Work

In this work, we presented ACSA (Adaptive Crypto System based on Accelerators), which is able to adopt the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well in calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies, such as data aggregation, to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. With the 40 Gbps networking we established, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about a 12-fold acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we get a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., “Big data: survey, technologies, opportunities, and challenges,” The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, “Security and privacy protection of social networks in big data era,” Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, “A survey on security issues in service delivery models of cloud computing,” Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, “Trust but verify: auditing the secure internet of things,” in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, “Hardware-software co-design of AES on FPGA,” in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, “Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip,” ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, “Security methods for web-based applications on embedded system,” in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, “Article David Brumley and Dan Boneh, Remote Timing Attacks are Practical,” in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, “Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA,” in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, “Accelerating OpenSSL’s ECC with low cost reconfigurable hardware,” in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, “An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions,” in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT’11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, “Design and test of a scalable security processor,” in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, “10 Gbps implementation of TLS/SSL accelerator on FPGA,” in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, “System architecture implementation for a high-performance Network Security Processor,” in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, “Intel advanced encryption standard instructions (aes-ni),” Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, “Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security,” in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, “Improving the performance of a scalable encryption algorithm (SEA) using FPGA,” International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., “A 3 GHz dual core processor ARM Cortex™-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization,” IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, “Energy- and cost-efficiency analysis of ARM-based clusters,” in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, “Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors,” Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, “High-speed VLSI architectures for the AES algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, “Very compact FPGA implementation of the AES algorithm,” in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, “A study of encryption algorithms (RSA, DES, 3DES and AES) for information security,” International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, “A comparative analysis of SHA and MD5 algorithm,” Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, “Improving I/O forwarding throughput with data compression,” in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, “Using underutilized CPU resources to enhance its reliability,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., “Holistically evaluating the environmental impacts in modern computing systems,” in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, “Growth in data center electricity use 2005 to 2010,” A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, “Green computing: new challenges and opportunities,” in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Security and Communication Networks 11

Table 2: Parameters for MM algorithm.

| Parameter | Description |
|---|---|
| $N$ | The number of available CPUs in the system |
| $N_{sw}$ | The number of CPUs used for software encryption with AES-NI |
| $N_{hw}$ | The number of CPUs responsible for the processes of hardware invocation |
| $P_{sw}$ | The processes invoked for software encryption with AES-NI |
| $P_{hw}$ | The processes invoked for hardware encryption |
| $P_{total}$ | The total number of active processes |
| $HP_{i,j}$ | The maximum bandwidth with hardware encryption, with i CPUs and j processes, completely loaded |
| $SP_i$ | The maximum bandwidth with software encryption, with i CPUs and i processes, completely loaded |
| $P_{i,j}$ | The bandwidth difference between hardware encryption and software encryption ($HP_{i,j} - SP_i$) |

Figure 10: CPUs and processes allocation with MM strategy. (The web server allocates $N_{sw}$ CPUs, running $P_{sw}$ processes, to software encryption with AES-NI, and $N_{hw}$ CPUs, running $P_{hw}$ processes, to the invocation of the hardware accelerators, HACs.)

(S2) Activate i CPUs (i = 1, 2, ..., N), increase the number of processes j for hardware invocation, and find the maximum encryption bandwidth $HP_{i,j}$ under the condition that the CPUs are completely loaded.

(S3) Calculate the performance difference between software computing and hardware encryption: $P_{i,j} = HP_{i,j} - SP_i$.

(S4) For the cases i = 1, 2, ..., N, follow steps (S1), (S2), and (S3) one by one and obtain the value $P_{i,j}$ for each number of CPUs.

(S5) Find the maximum $P_{i,j}$ of (S4), $\max(P_{i,j}) \in \{P_{i,j}\}$ (i = 1, 2, ..., N). Through $\max(P_{i,j})$, the parameters can be determined: the number of CPUs used for accelerator invocation $N_{hw} = i$, the number of processes for hardware encryption $P_{hw} = j$, the number of CPUs for software encryption $N_{sw} = N - N_{hw}$, and the corresponding number of processes $P_{sw} = N_{sw}$. The total number of processes to be activated is $P_{total} = P_{sw} + P_{hw}$.
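The selection in steps (S3) to (S5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_allocation`, the dictionary layout of the measurements, and the toy bandwidth values are hypothetical and are not part of ACSA; only the formulas $P_{i,j} = HP_{i,j} - SP_i$, $N_{sw} = N - N_{hw}$, $P_{sw} = N_{sw}$, and $P_{total} = P_{sw} + P_{hw}$ come from the text.

```python
from typing import Dict, Tuple


def choose_allocation(
    n_cpus: int,
    sp: Dict[int, float],
    hp: Dict[Tuple[int, int], float],
) -> Dict[str, int]:
    """Pick the CPU/process split that maximizes P[i, j] = HP[i, j] - SP[i].

    sp[i]      -- measured software (AES-NI) bandwidth with i CPUs and i processes
    hp[(i, j)] -- measured hardware bandwidth with i CPUs and j processes
    """
    # (S3)/(S4): bandwidth difference for every measured (i, j) pair.
    diff = {(i, j): bw - sp[i] for (i, j), bw in hp.items()}
    # (S5): the pair with the largest difference fixes N_hw and P_hw.
    n_hw, p_hw = max(diff, key=diff.get)
    n_sw = n_cpus - n_hw
    p_sw = n_sw
    return {
        "N_hw": n_hw,
        "P_hw": p_hw,
        "N_sw": n_sw,
        "P_sw": p_sw,
        "P_total": p_sw + p_hw,
    }


# Toy measurements (illustrative values only, not taken from the experiments).
toy_sp = {1: 1.0, 2: 2.0}
toy_hp = {(1, 1): 3.0, (1, 2): 5.0, (2, 1): 4.0, (2, 2): 5.5}
allocation = choose_allocation(2, toy_sp, toy_hp)
print(allocation)
```

Keeping the measurements in plain dictionaries separates the (expensive) bandwidth tests from the (cheap) selection, so the same selection can be rerun if the number of available CPUs changes.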

To introduce the MM algorithm more clearly, we take AES-128-CBC as an example for the exploration of the parameters i and j, assuming N = 16. The detailed process followed by MM is as follows.

(S1) Set the working mode to software encryption through the Adaptive Scheduler in ACSA. All the encryption requests are processed through AES-NI. The working flow is illustrated in Figure 11(a).

(S2) Activate one CPU and one process of ACSA, increase the workload until the CPU is completely loaded, and record the maximum encryption bandwidth, which is 1.01 GB/s.

(S3) Activate 2 to 16 CPUs in turn, with 2 to 16 processes, respectively. As in (S2), the maximum encryption bandwidth is measured for each number of CPUs and processes. The resulting $SP_i$ (i = 1, 2, ..., 16) are summarized in Figure 12.

(S4) Set the working mode to hardware encryption through the Adaptive Scheduler in ACSA. All the encryption requests are processed through hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, increase the workload until the CPU is completely loaded, and record the maximum encryption bandwidth, which is 7.52 GB/s in hardware working mode.

(S6) Keep one CPU activated and invoke 2 to n processes in turn. Record the maximum encryption bandwidth when the CPU is completely loaded in each case. If the bandwidth with n processes invoked is less than or equal to the bandwidth with n-1 processes invoked, the parameter j is confirmed as n-1 and the test can be stopped.

(S7) Activate 2 to 16 CPUs in turn and follow steps (S5) and (S6) to confirm the number of processes at which the maximum encryption bandwidth is achieved. The number of processes for each number of CPUs is recorded in Table 3, and the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption. The results are recorded in Figure 13.

Table 3: The number of processes $P_{hw}$ at different numbers of available CPUs when the best crypto performance is achieved.

| N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $P_{hw}$ | 12 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |

Algorithm 1: The strategy for MM algorithm.

    Input:
        N: the number of available CPUs in the system
    Output:
        N_sw: the number of CPUs used for software encryption with AES-NI
        N_hw: the number of CPUs responsible for the processes of hardware invocation
        P_sw: the processes invoked for software encryption with AES-NI
        P_hw: the processes invoked for hardware encryption
        P_total: the total number of active processes

    /* Test the maximum bandwidth with AES-NI for different numbers of CPUs;
       in general, the number of processes equals the number of CPUs. */
    for CPU number i from 1 to N do
        calculate SP_i          /* maximum bandwidth when the CPU number is i */
    end for

    /* Test the maximum bandwidth with hardware for different numbers of CPUs and processes. */
    for CPU number i from 1 to N do
        for process number j from 1 to 32 do
            calculate HP_{i,j}  /* maximum bandwidth with i CPUs and j processes */
        end for
    end for

    /* Calculate the maximum performance difference between AES-NI computing and hardware encryption. */
    maxdiff ← −∞
    for CPU number i from 1 to N do
        for process number j from 1 to 32 do
            P_{i,j} ← HP_{i,j} − SP_i
            if maxdiff < P_{i,j} then
                maxdiff ← P_{i,j}
                N_hw ← i
                P_hw ← j
                N_sw ← N − i
                P_sw ← N_sw
                P_total ← P_sw + P_hw
            end if
        end for
    end for
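The stopping rule of step (S6) can be written as a short search loop. The sketch below is a minimal illustration, not the authors' test harness: `measure` is a hypothetical callback that returns the saturated bandwidth for a given number of processes, the toy bandwidth model is invented for the example, and the cap of 32 processes is taken from the loop bound used in Algorithm 1.

```python
from typing import Callable


def find_saturating_processes(
    measure: Callable[[int], float],
    max_processes: int = 32,
) -> int:
    """Return j, the process count after which bandwidth stops improving.

    measure(n) is assumed to return the maximum encryption bandwidth
    observed with n processes while the CPU is completely loaded.
    """
    prev_bw = measure(1)
    for n in range(2, max_processes + 1):
        bw = measure(n)
        # (S6): if n processes are no better than n-1, then j = n-1 and we stop.
        if bw <= prev_bw:
            return n - 1
        prev_bw = bw
    return max_processes


def toy_bandwidth(n: int) -> float:
    """Illustrative model only: bandwidth grows linearly, then saturates at 12 processes."""
    return min(n, 12) * 0.6


j = find_saturating_processes(toy_bandwidth)
print(j)
```

Stopping at the first non-improving step keeps the number of measurements small, at the cost of assuming that bandwidth does not dip and recover at a higher process count.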

From Figure 13, we find that the maximum difference occurs in the case of $N_{hw}$ = 1 and $P_{hw}$ = 12; i.e., one CPU is used for hardware invocation and runs 12 processes. Based on the test results, 27 processes should be invoked for the system, of which 12 are scheduled for hardware management, and the remaining 15 processes can be allocated to software encryption with AES-NI if no other work is required.
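As a worked check of this result, the differences of Figure 13 can be recomputed from the bandwidths of Figure 12 and the process counts of Table 3. The sketch below assumes the values are in GB/s with two decimal places (the decimal points are restored from the plotted axis range); the variable names are illustrative only.

```python
# Software (AES-NI) bandwidth SP_i for i = 1..16 CPUs, in GB/s (Figure 12).
sp = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
      9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
# Best hardware bandwidth HP_{i,j} for i = 1..16 CPUs, in GB/s (Figure 12).
hp = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
      7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
# Number of processes at which HP_{i,j} was reached (Table 3).
p_hw_table = [12] + [16] * 15

n_cpus = 16
# P_{i,j} = HP_{i,j} - SP_i, rounded to the two decimals used in Figure 13.
diff = [round(h - s, 2) for h, s in zip(hp, sp)]

best_index = max(range(n_cpus), key=lambda k: diff[k])
n_hw = best_index + 1          # CPUs used for hardware invocation
p_hw = p_hw_table[best_index]  # processes for hardware encryption
n_sw = n_cpus - n_hw           # CPUs left for AES-NI
p_sw = n_sw                    # one software process per CPU
p_total = p_sw + p_hw

print(diff)
print(n_hw, p_hw, n_sw, p_sw, p_total)
```

The difference is largest for a single CPU because the hardware bandwidth saturates almost immediately, while the AES-NI bandwidth keeps growing by roughly 1 GB/s per additional CPU.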

Figure 11: The working flow of (a) software encryption and (b) hardware encryption. (Panel (a), user space: Web Server, OpenSSL, software crypto lib. Panel (b), user space and kernel space: Web Server, OpenSSL, CryptoDev, Crypto API, hardware driver, hardware engine.)

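| Number of CPUs | Software (GB/s) | Hardware (GB/s) |
|---|---|---|
| 1 | 1.01 | 7.52 |
| 2 | 2.04 | 7.61 |
| 3 | 3.06 | 7.62 |
| 4 | 4.08 | 7.62 |
| 5 | 5.05 | 7.62 |
| 6 | 6.13 | 7.58 |
| 7 | 7.16 | 7.61 |
| 8 | 8.17 | 7.62 |
| 9 | 9.19 | 7.62 |
| 10 | 10.22 | 7.62 |
| 11 | 11.22 | 7.62 |
| 12 | 12.26 | 7.62 |
| 13 | 13.28 | 7.62 |
| 14 | 14.31 | 7.62 |
| 15 | 15.33 | 7.62 |
| 16 | 15.97 | 7.62 |

Figure 12: The encryption performance with hardware/software working mode at different numbers of CPUs (panel title: AES-256-CBC Crypto Performance; y-axis: encryption bandwidth in GB/s).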

| Number of CPUs | Performance difference, hardware minus software (GB/s) |
|---|---|
| 1 | 6.51 |
| 2 | 5.57 |
| 3 | 4.56 |
| 4 | 3.54 |
| 5 | 2.57 |
| 6 | 1.45 |
| 7 | 0.45 |
| 8 | -0.55 |
| 9 | -1.57 |
| 10 | -2.60 |
| 11 | -3.60 |
| 12 | -4.64 |
| 13 | -5.66 |
| 14 | -6.69 |
| 15 | -7.71 |
| 16 | -8.35 |

Figure 13: The performance differences between hardware and software encryption at different numbers of CPUs.

Table 4: Hardware environment of testing.

| Hardware | Number | Configurations |
|---|---|---|
| ARM Server | 2 | CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB |
| Network Card | 4 | 10 Gbps each |
| Net cable | 4 | 2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables |

Table 5: Software environment of testing.

| Software Configuration | Software Version |
|---|---|
| Operating System | Linux-4.1.27 |
| OpenSSL | OpenSSL-1.0.2j |
| Nginx | Nginx-1.11.6 |
| Benchmark | ab (apache benchmark) |

Figure 14: Testing network (ACSA acted as the server, the testing machine as clients; the two machines are connected through four 10 Gbps Ethernet ports).

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we established a test platform for ACSA with 40 Gbps network bandwidth. As shown in Table 4, we deployed two servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units, whose main frequency is 2.1 GHz, and 128 GB of memory. To satisfy the high-concurrency requirements of a large number of clients, we built the network environment with four network cards, each of which contributes 10 Gbps of bandwidth.

As listed in Table 5, we adopt Nginx [31] as the web server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4.1.27, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (apache benchmark) [32] is a widely used testing toolset for website performance. We choose this benchmark for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed through the standard benchmark speed in OpenSSL. With speed, we can measure the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms include AES-256-CBC and 3DES-CBC, of which AES-CBC can be accelerated through AES-NI while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM cores. The performance is reported in MB/s in this testing, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service that can be provided by ACSA. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (requests per second). For further analysis, we choose the cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Different from the OpenSSL testing, the tested data are web pages instead of data blocks. We tested web pages of different sizes to see how the performance depends on data size. We take full advantage of the four network cards to run 32 ab processes so as to ensure a high testing workload.

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4 KB, 8 KB, 16 KB, 32 KB, 64 KB, 128 KB, 256 KB, and 512 KB. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared four working flows for AES-256-CBC: (1) software encryption with the CPU, where the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, where the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy; this working flow is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and the other flows in Figure 16, in which the encryption flow with AES-NI is the best case for software encryption.

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW (MB/s) | 1272.68 | 1272.31 | 1269.26 | 1263.54 | 1263.99 | 1258.23 | 1247.49 | 1251.86 |
| SW-NI (MB/s) | 11772.10 | 11804.24 | 11817.64 | 11809.72 | 11802.95 | 11810.72 | 11733.47 | 11658.08 |
| HW (MB/s) | 1650.75 | 3188.08 | 5891.48 | 5932.79 | 5954.03 | 5964.93 | 5970.58 | 5966.48 |
| HW-SW codesign (MB/s) | 11446.83 | 11635.55 | 12576.91 | 13948.63 | 15952.37 | 16989.31 | 16956.30 | 16855.39 |

Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC.

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs SW-NI improvement (%) | -2.76 | -1.43 | 6.42 | 18.11 | 35.16 | 43.85 | 44.51 | 44.58 |
| HW-SW codesign vs HW improvement (%) | 593.43 | 264.97 | 113.48 | 135.11 | 167.93 | 184.82 | 184.00 | 182.50 |
| HW-SW codesign vs SW improvement (%) | 799.43 | 814.52 | 890.89 | 1003.93 | 1162.06 | 1250.25 | 1259.23 | 1246.43 |

Figure 16: Performance comparisons of HW-SW codesign with SW-NI, HW, and SW (16 CPUs, 32 processes, AES-256-CBC).
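The improvement percentages reported for the AES-256-CBC flows are consistent with the usual relative-difference formula, (new − base) / base × 100. The sketch below recomputes the 4 KB entries from the Figure 15 bandwidths; the helper name `improvement_percent` is illustrative, and the bandwidth values assume two restored decimal places.

```python
def improvement_percent(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# Bandwidths at a 4 KB block size, in MB/s (Figure 15).
sw, sw_ni, hw, codesign = 1272.68, 11772.10, 1650.75, 11446.83

vs_sw_ni = round(improvement_percent(codesign, sw_ni), 2)
vs_hw = round(improvement_percent(codesign, hw), 2)
vs_sw = round(improvement_percent(codesign, sw), 2)
print(vs_sw_ni, vs_hw, vs_sw)
```

A negative value, as for SW-NI at 4 KB, means that HW-SW codesign is slower than the baseline for that block size.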

Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs (3DES-CBC; x-axis: block size from 4 KB to 512 KB; y-axis: encryption bandwidth in MB/s; one series per CPU count from 1 to 16).

As we can see from Figures 15 and 16, encryption with hardware engines achieves better performance than the crypto lib but worse than AES-NI. The reason is the invocation cost of the hardware engines, with context switches and mode switches, as we analyzed in Section 4.1. Besides, for AES-NI the encryption function is accelerated directly through instructions, and no additional operations are needed. The proposed HW-SW codesign is the best choice in almost all situations. We obtain a 799% to 1246% performance improvement compared with software encryption and a 593% to 182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18% to 44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than that of AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for the data size check and the scheduling decision, the maximum encryption bandwidth is somewhat lower than that of AES-NI. However, this influence decreases as the data size increases, because the scheduling check is performed less frequently. If the data size is larger than 16 KB, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign yields the best encryption bandwidth compared with software-only or hardware-only encryption.
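The data filter described above amounts to a size check in front of the two encryption paths. The following is a minimal sketch under stated assumptions: the names `THRESHOLD_BYTES` and `select_engine` are hypothetical, the real Adaptive Scheduler is integrated into OpenSSL and is not shown here, and routing a request of exactly 16 KB to the hardware path is an assumption (the text only states that smaller requests go to AES-NI).

```python
THRESHOLD_BYTES = 16 * 1024  # data-filter threshold of 16 KB, as configured in this test


def select_engine(data_size_bytes: int, threshold: int = THRESHOLD_BYTES) -> str:
    """Return which encryption path a request of the given size would take.

    Requests smaller than the threshold stay on the CPU (AES-NI), because the
    invocation cost of the hardware engine would dominate for small blocks.
    """
    if data_size_bytes < threshold:
        return "aes-ni"
    return "hardware"


for size_kb in (4, 8, 16, 64):
    print(size_kb, "KB ->", select_engine(size_kb * 1024))
```

The check itself consumes CPU time on every request, which is consistent with the slightly lower bandwidth observed for 4 KB and 8 KB blocks compared with plain AES-NI.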

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software. It is therefore not reasonable to apply the data filter and dynamic scheduling, and we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by the many context and mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can obtain almost the best encryption bandwidth with only four CPUs.

If we want to get maximum performance from both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW), (2) hardware encryption with hardware engines using only four CPUs (HW with four CPUs), and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).

As we can see from Figure 18, even though there are 16 CPUs responsible for the 3DES encryption, the software encryption bandwidth is still very small (only about 271 MB/s). Since 3DES is computation intensive, this algorithm occupies a large share of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for larger block sizes. The reason is that more invocation cost is induced for smaller data blocks. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW (MB/s) | 271.69 | 271.68 | 271.20 | 271.78 | 271.66 | 271.39 | 271.74 | 271.69 |
| HW with 4 CPUs (MB/s) | 1270.39 | 2429.30 | 3453.60 | 3489.52 | 3495.98 | 3499.26 | 3500.83 | 3499.89 |
| HW-SW codesign (MB/s) | 1921.24 | 3536.86 | 3667.12 | 3686.88 | 3696.00 | 3701.05 | 3703.26 | 3701.54 |

Figure 18: Encryption bandwidth for different working flows (3DES-CBC).

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW codesign vs SW improvement (%) | 607.14 | 1201.85 | 1252.18 | 1256.57 | 1260.52 | 1263.74 | 1262.80 | 1262.41 |
| HW-SW codesign vs HW with 4 CPUs improvement (%) | 51.23 | 45.59 | 6.18 | 5.66 | 5.72 | 5.77 | 5.78 | 5.76 |

Figure 19: Performance comparisons of HW-SW codesign with HW with 4 CPUs and SW (16 CPUs, 32 processes, 3DES-CBC).

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, there is a nearly 12 times bandwidth improvement through HW-SW codesign compared with software encryption. There is only a 6 times improvement for small data blocks (4 KB) due to the worse utilization of the hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is not so obvious for large data blocks. The reason is the poor contribution of software encryption without AES-NI. Still, we get a 45% to 51% improvement for small data blocks with HW-SW codesign and an additional 200 to 300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU capacity be used for other important tasks, such as database management. Of course, it is up to the user to decide on the trade-off between CPU idle and encryption performance. For example, during rush hours the user could take full advantage of both hardware and software for maximum overall performance; during normal working time with fewer encryption requests, the user could allocate only the management resources needed for the HACs and offload as much work as possible.

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW (MB/s) | 258.41 | 419.18 | 504.74 | 608.54 | 679.39 | 720.47 | 741.12 | 756.26 |
| SW-NI (MB/s) | 342.91 | 613.02 | 857.61 | 1182.25 | 1451.02 | 1653.04 | 1761.68 | 1831.19 |
| HW (MB/s) | 228.04 | 409.20 | 573.66 | 814.16 | 980.63 | 1179.44 | 1276.61 | 1336.44 |
| HW-SW codesign (MB/s) | 338.96 | 608.02 | 851.59 | 1173.38 | 1416.41 | 1705.11 | 1818.99 | 1884.28 |

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the web server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how the performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto lib (SW), (2) software encryption with AES-NI (SW-NI, instruction acceleration), (3) hardware encryption using hardware engines with the proposed design flow (HW), and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
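The conversion from RPS to network bandwidth is not spelled out in the text. A minimal sketch, assuming bandwidth = RPS × page size with 1 MB = 1024 KB, reproduces the SW, SW-NI, and HW values plotted for 4 KB pages with 16 processes; the function name `rps_to_mb_per_s` is illustrative, and the RPS values assume two restored decimal places.

```python
def rps_to_mb_per_s(rps: float, page_size_kb: float) -> float:
    """Convert requests per second into MB/s, assuming 1 MB = 1024 KB."""
    return rps * page_size_kb / 1024.0


# RPS for 4 KB pages with 16 CPUs and 16 processes.
rps_4kb = {"SW": 66153.10, "SW-NI": 87784.10, "HW": 58378.50}

bandwidth_4kb = {flow: round(rps_to_mb_per_s(rps, 4), 2) for flow, rps in rps_4kb.items()}
print(bandwidth_4kb)
```

This conversion counts only the page payload; protocol overhead such as TLS records and HTTP headers is not included under this assumption.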

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small page sizes. For example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than that of AES-NI. The reason is the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance is better than that of software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. For HW-SW codesign, however, there is 12.51% to 18.51% more CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, and there is also a 2.85% to 3.38% performance improvement.

Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

| Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 66153.10 | 53654.50 | 32303.20 | 19473.40 | 10870.30 | 5763.76 | 2964.48 | 1512.51 |
| SW-NI | 87784.10 | 78466.20 | 54887.00 | 37831.90 | 23216.30 | 13224.30 | 7046.71 | 3662.37 |
| HW | 58378.50 | 52377.60 | 36714.30 | 26053.10 | 15690.00 | 9435.54 | 5106.42 | 2672.87 |
| HW-SW codesign | 86679.40 | 77827.00 | 45499.50 | 34357.70 | 22623.60 | 13670.90 | 7275.66 | 3770.97 |

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

| Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 65885.90 | 52987.80 | 32030.00 | 19391.20 | 10803.80 | 5744.14 | 2964.18 | 1503.48 |
| SW-NI | 86302.50 | 77827.15 | 53017.20 | 37116.10 | 22920.40 | 13092.50 | 7020.85 | 3657.35 |
| HW | 58499.10 | 54891.60 | 37946.40 | 26972.10 | 16401.90 | 10041.90 | 5486.14 | 2881.66 |
| HW-SW codesign | 85177.20 | 77211.90 | 46404.00 | 36166.60 | 24875.10 | 15347.70 | 8325.19 | 4391.45 |

[Figure: 16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle, by page size. Values in %.]

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW-NI improvement | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW co-design vs HW improvement | 48.64 | 48.59 | 48.45 | 44.12 | 44.44 | 44.57 | 42.49 | 40.99 |
| CPU idle | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384.

If we want higher network bandwidth, that is, better encryption performance, we can utilize the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore further performance gains with HW-SW codesign, we increased the number of processes to utilize the idle CPU. We tested with 32 processes; the RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 7.

For a clear comparison and analysis, we converted the RPS to network bandwidth (MB/s) and present the results in Figure 22.
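The conversion from RPS to bandwidth can be sketched as follows. This is a minimal illustration, not the authors' script: the function name is hypothetical, and the factor of 1024 KB per MB is an assumption inferred from the fact that it reproduces the SW values plotted in Figure 22 from the SW row of Table 7 (read with two decimal places).

```python
def rps_to_mb_per_s(rps, page_kb):
    """Convert requests per second to bandwidth in MB/s.

    Assumes each request transfers one page of `page_kb` kilobytes
    and that 1 MB = 1024 KB (an assumption, see the lead-in).
    """
    return rps * page_kb / 1024.0


# SW row of Table 7 (32 processes), RPS per page size in KB.
sw_rps = {4: 65885.90, 8: 52987.80, 512: 1503.48}

for page_kb, rps in sw_rps.items():
    print(page_kb, "KB:", round(rps_to_mb_per_s(rps, page_kb), 2), "MB/s")
```

Protocol overhead such as TLS record headers and HTTP headers is ignored here, so the result is the page payload bandwidth only.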

As we can see from Figure 22, HW-SW codesign achieves a 49%~52% bandwidth improvement over hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the case with 16 processes, the effect is more obvious for large block sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and scheduling

[Figure: 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth (MB/s) by page size.]

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 257.37 | 413.97 | 500.47 | 605.98 | 675.24 | 718.02 | 741.05 | 751.74 |
| SW-NI | 337.12 | 608.26 | 828.39 | 1159.88 | 1432.53 | 1636.56 | 1755.21 | 1828.68 |
| HW | 228.51 | 421.03 | 592.91 | 842.88 | 1025.12 | 1255.24 | 1371.54 | 1440.83 |
| HW-SW co-design | 334.03 | 603.22 | 820.70 | 1159.42 | 1562.83 | 1923.08 | 2092.16 | 2196.43 |

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384.

[Figure: 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle, by page size. Values in %.]

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW-NI improvement | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11 |
| HW-SW co-design vs HW improvement | 46.17 | 43.27 | 38.42 | 37.55 | 52.45 | 53.20 | 52.54 | 52.44 |
| CPU idle | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60 |

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384.

decision. If the page size is equal to or larger than 64 KB, the overall performance is better than with software-only or hardware-only encryption.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption path. Hardware engines do not perform as well as AES-NI for small pages because of the context-switch cost; therefore, ACSA automatically adopted software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for the 512 KB page size.

[Figure: 16 CPUs, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle, by page size. Values in %.]

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW-NI improvement, 16 processes | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90 |
| HW-SW co-design vs SW-NI improvement, 32 processes | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11 |
| CPU idle, 16 processes | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50 |
| CPU idle, 32 processes | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60 |

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384.

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

| Page size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| SW | 36334.70 | 25207.40 | 12791.60 | 6900.93 | 3589.73 | 1833.76 | 924.40 | 466.12 |
| HW | 58275.70 | 50717.80 | 25772.70 | 14525.70 | 7645.65 | 3902.54 | 1968.62 | 990.88 |

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance of cipher suite ECDHE-RSA-AES256-SHA384 for diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology offers a better trade-off between performance and CPU idle. It is possible to achieve the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user can use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI, and software encryption therefore contributes little to HW-SW codesign, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI; the other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption achieves 1.6~2.1 times the performance of software encryption.
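The speedup range can be rechecked directly from the Table 8 data. The following is a small sketch under the assumption that the RPS values carry two decimal places; the function names are illustrative and not part of ACSA.

```python
# RPS from Table 8 (16 CPUs, 32 processes), page sizes 4 KB to 512 KB.
sw_rps = [36334.70, 25207.40, 12791.60, 6900.93, 3589.73, 1833.76, 924.40, 466.12]
hw_rps = [58275.70, 50717.80, 25772.70, 14525.70, 7645.65, 3902.54, 1968.62, 990.88]


def speedups(hw, sw):
    """Ratio of hardware RPS to software RPS for each page size."""
    return [h / s for h, s in zip(hw, sw)]


def improvement_percent(hw, sw):
    """Relative improvement of hardware over software, in percent."""
    return [(h / s - 1.0) * 100.0 for h, s in zip(hw, sw)]


ratios = speedups(hw_rps, sw_rps)
print("speedup range:", round(min(ratios), 1), "to", round(max(ratios), 1))
print("improvement at 4 KB (%):", round(improvement_percent(hw_rps, sw_rps)[0], 2))
```

Because both flows serve pages of the same size, the RPS ratio equals the bandwidth ratio, so the same numbers describe the bandwidth improvement.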

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60%~112% improvement compared with software encryption. Besides the improved performance, 13.62%~89.54% of the CPU remains idle with hardware encryption.

According to the OpenSSL test, the maximum performance improvement is 12 times. However, the RPS improvement in end-to-end testing is only 2 times. The problem is a constraint of the testing platform: we found that software decryption on the client has become a bottleneck, with all of the client's CPUs completely loaded.

7. Conclusions and Future Work

In this work, we presented ACSA (Adaptive Crypto System based on Accelerators), which is able to select the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU

[Figure: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth (MB/s) by page size.]

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 141.93 | 196.93 | 199.87 | 215.65 | 224.36 | 229.22 | 231.10 | 233.06 |
| HW | 227.64 | 396.23 | 402.70 | 453.93 | 477.85 | 487.82 | 492.16 | 495.44 |

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

[Figure: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle, by page size. Values in %.]

| Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW vs SW improvement | 60.39 | 101.20 | 101.48 | 110.49 | 112.99 | 112.82 | 112.96 | 112.58 |
| CPU idle | 13.62 | 15.40 | 53.15 | 70.40 | 75.07 | 86.04 | 88.17 | 89.54 |

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. By establishing a 40 Gbps network, we were able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the 3DES encryption algorithm, which is not supported by AES-NI, we obtained about 12 times acceleration with accelerators. For AES, a typical encryption algorithm supported by instruction acceleration, we obtained a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. Furthermore, users can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to their working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed methodology applies regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for ssl/tls, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article: David Brumley and Dan Boneh, Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM Cortex TM-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Input: N, the number of available CPUs in the system.
Output:
  N_sw: the number of CPUs used for software encryption with AES-NI
  N_hw: the number of CPUs responsible for the processes of hardware invocation
  P_sw: the processes invoked for software encryption with AES-NI
  P_hw: the processes invoked for hardware encryption
  P_total: the total number of active processes

/* Test the maximum bandwidth with AES-NI for different CPU numbers; in general, the process number is equal to the CPU number */
(1) for CPU number i from 1 to N do
(2)   calculate SP_i   /* SP_i is the maximum bandwidth when the CPU number is i */
(3) end for
/* Test the maximum bandwidth with hardware for different CPU numbers and process numbers */
(4) for CPU number i from 1 to N do
(5)   for process number j from 1 to 32 do
(6)     calculate HP_{i,j}   /* HP_{i,j} is the maximum bandwidth when the CPU number is i and the process number is j */
(7)   end for
(8) end for
/* Calculate the maximum performance difference between AES-NI computing and hardware encryption */
(9) let maxdiff ← P_{1,1}
(10) for CPU number i from 1 to N do
(11)   for process number j from 1 to 32 do
(12)     P_{i,j} ← (HP_{i,j} - SP_i)
(13)     if (maxdiff < P_{i,j}) then
(14)       maxdiff ← P_{i,j}
(15)       N_hw ← i
(16)       P_hw ← j
(17)       N_sw ← (N - i)
(18)       P_sw ← N_sw
(19)       P_total ← (P_sw + P_hw)
(20)     end if
(21)   end for
(22) end for

Algorithm 1: The strategy for MM algorithm.
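The selection loop of Algorithm 1 (steps 9 to 22) can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name, the dictionary layout of the measurements, and the toy numbers are assumptions. It also starts `maxdiff` at negative infinity rather than at P_{1,1}, a deliberate deviation so that the outputs are set even when the configuration (1, 1) is the best one.

```python
from math import inf


def mm_strategy(sp, hp, n_cpus, max_processes=32):
    """Pick the hardware CPU/process split with the largest HP - SP gap.

    sp[i]      : measured max AES-NI bandwidth with i CPUs
    hp[(i, j)] : measured max hardware bandwidth with i CPUs, j processes
    """
    maxdiff = -inf
    plan = None
    for i in range(1, n_cpus + 1):
        for j in range(1, max_processes + 1):
            if (i, j) not in hp:
                continue  # no measurement for this configuration
            diff = hp[(i, j)] - sp[i]
            if diff > maxdiff:
                maxdiff = diff
                n_sw = n_cpus - i  # remaining CPUs go to AES-NI
                plan = {
                    "N_hw": i, "P_hw": j,
                    "N_sw": n_sw, "P_sw": n_sw,  # one process per software CPU
                    "P_total": n_sw + j,
                    "maxdiff": diff,
                }
    return plan


# Toy measurements (hypothetical values, for illustration only).
sp = {1: 1.0, 2: 2.0}
hp = {(1, 1): 3.0, (1, 2): 5.0, (2, 1): 3.5, (2, 2): 5.5}
print(mm_strategy(sp, hp, n_cpus=2))
```

With these toy values the largest gap is at one CPU with two hardware processes, leaving one CPU and one process for software encryption.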

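Table 3: The number of processes P_hw at different numbers of available CPUs when the best crypto performance is achieved.

| N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P_hw | 12 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |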

number of CPUs and processes. We summarize SP_i (i = 1, 2, ..., 16) in Figure 12.

(S4) Adopt hardware encryption as the working mode through the Adaptive Scheduler in ACSA. All encryption requests are processed by hardware accelerators. The working flow is illustrated in Figure 11(b).

(S5) Activate one CPU and one process of ACSA, and increase the workload until the CPU is completely loaded, to get the maximum encryption bandwidth as 7.52 GB/s in software working mode.

(S6) Keeping one CPU activated, invoke 2~n processes in order. Record the maximum encryption bandwidth when the CPU is completely loaded in each case. If the bandwidth with n processes invoked is less than or equal to the bandwidth with n-1 processes invoked, the parameter j is confirmed as n-1 and the test can be stopped.

(S7) Activate 2~16 CPUs in turn and follow steps (S5) and (S6) to determine the number of processes at which the maximum encryption bandwidth is achieved. We recorded the number of processes for each number of CPUs in Table 3; the corresponding bandwidth is shown in Figure 12.

(S8) Compute the performance difference between software and hardware encryption; the results are recorded in Figure 13.
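The stopping rule of step (S6) can be sketched as a simple search. This is a hedged illustration: `measure_bandwidth` stands for a hypothetical measurement callback (for example, a wrapper around a benchmark run), and the cap of 32 processes is borrowed from the loop bounds of Algorithm 1.

```python
def find_saturation_point(measure_bandwidth, max_processes=32):
    """Increase the process count until bandwidth stops improving.

    measure_bandwidth(n) must return the maximum bandwidth observed
    with n processes. Returns (best_process_count, best_bandwidth).
    """
    best_n = 1
    best_bw = measure_bandwidth(1)
    for n in range(2, max_processes + 1):
        bw = measure_bandwidth(n)
        if bw <= best_bw:
            break  # n processes are no better than n-1: stop, as in (S6)
        best_n, best_bw = n, bw
    return best_n, best_bw


# Hypothetical bandwidth curve that saturates at 12 processes.
def fake_measurement(n):
    return min(n, 12) * 0.6


print(find_saturation_point(fake_measurement))
```

Repeating this search for each CPU count, as step (S7) describes, would fill in one column of Table 3 per run.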

From Figure 13, we find the maximum difference in the case N = 1, P_hw = 12; that is, the number of CPUs is 1 and the corresponding number of processes is 12. Based on the test results, 27 processes should be invoked for the system, of which 12 are scheduled for hardware management, and the remaining 15 can be allocated to software encryption with AES-NI if no other work is needed.
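As a worked check of this result, the differences in Figure 13 can be recomputed from the Figure 12 bandwidths. The sketch below assumes the values are in GB/s with two decimal places, and that the series growing linearly with the CPU count is the software (AES-NI) one; the variable names are illustrative.

```python
# Figure 12 data for 1..16 CPUs, in GB/s (decimal placement is an assumption).
software = [1.01, 2.04, 3.06, 4.08, 5.05, 6.13, 7.16, 8.17,
            9.19, 10.22, 11.22, 12.26, 13.28, 14.31, 15.33, 15.97]
hardware = [7.52, 7.61, 7.62, 7.62, 7.62, 7.58, 7.61, 7.62,
            7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62, 7.62]
# Table 3: hardware process count at the best bandwidth for each CPU count.
p_hw_table = [12] + [16] * 15

total_cpus = 16
# Hardware minus software bandwidth for each CPU count (Figure 13).
diffs = [round(h - s, 2) for h, s in zip(hardware, software)]

best_index = max(range(total_cpus), key=lambda k: diffs[k])
n_hw = best_index + 1          # CPUs dedicated to hardware invocation
p_hw = p_hw_table[best_index]  # hardware management processes
n_sw = total_cpus - n_hw       # CPUs left for AES-NI
p_sw = n_sw                    # one software process per remaining CPU
p_total = p_hw + p_sw

print(diffs[0], n_hw, p_hw, p_sw, p_total)
```

The hardware bandwidth is nearly flat beyond one CPU while the AES-NI bandwidth scales with the CPU count, which is why dedicating a single CPU to hardware invocation gives the largest gap.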

[Figure: (a) blocks in user space: Web Server, OpenSSL, Software Crypto lib. (b) blocks across user space and kernel space: Web Server, OpenSSL, CryptoDev, Crypto API, Hardware driver, Hardware engine.]

Figure 11: The working flow of (a) software encryption and (b) hardware encryption.

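[Figure: AES-256-CBC crypto performance; encryption bandwidth (GB/s) by number of CPUs.]

| Number of CPUs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Software | 1.01 | 2.04 | 3.06 | 4.08 | 5.05 | 6.13 | 7.16 | 8.17 | 9.19 | 10.22 | 11.22 | 12.26 | 13.28 | 14.31 | 15.33 | 15.97 |
| Hardware | 7.52 | 7.61 | 7.62 | 7.62 | 7.62 | 7.58 | 7.61 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 | 7.62 |

Figure 12: The encryption performance with hardware/software working mode at different CPU number.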

[Figure: the performance differences (GB/s) between hardware and software encryption at different CPU number.]

| Number of CPUs | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Difference | 6.51 | 5.57 | 4.56 | 3.54 | 2.57 | 1.45 | 0.45 | -0.55 | -1.57 | -2.6 | -3.6 | -4.64 | -5.66 | -6.69 | -7.71 | -8.35 |

Figure 13: The performance differences between hardware and software encryption.

Table 4: Hardware environment of testing.

| Hardware | Number | Configurations |
|---|---|---|
| ARM Server | 2 | CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB |
| Network Card | 4 | 10 Gbps each |
| Net cable | 4 | 2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables |

Table 5: Software environment of testing.

| Software Configuration | Software Version |
|---|---|
| Operating System | Linux-4.1.27 |
| OpenSSL | OpenSSL-1.0.2j |
| Nginx | Nginx-1.11.6 |
| Benchmark | ab (apache benchmark) |

[Figure: the Server and the Testing machine (Client) are connected by four 10 Gbps links through 10 Gbps Ethernet ports.]

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients).

6. System Test and Analysis

To reflect real working environments, as shown in Figure 14, we established a test platform for ACSA with 40 Gbps network bandwidth. As shown in Table 4, we deployed two Web Servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the highly concurrent requests from a large number of clients, we set up the network environment with four network cards, each contributing 10 Gbps of bandwidth.

As listed in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous, event-driven approach to handle requests. The operating system is Linux-4.1.27, extended to support HAC access. OpenSSL-1.0.2j is used to perform software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of accelerators. Ab (apache benchmark) [32] is a widely used testing tool for website performance; we chose this benchmark for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed through the standard benchmark speed in OpenSSL, with which we can measure the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms include AES-256-CBC and 3DES-CBC; AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM CPUs. The performance is measured in MB/s, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing is used to evaluate the performance of the HTTPS service provided by ACSA. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard tool ab is adopted for workload testing, and the performance indicator is RPS (Requests per Second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to see the performance for different data sizes. We take full advantage of the 4 network cards to provide 32 ab processes so as to maintain testing pressure with a high workload.
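The two testing levels map onto two command-line tools. The sketch below only builds the argument lists and does not run anything. The exact options the authors used are not given in the paper, so the request count, concurrency, URL, and helper names here are hypothetical placeholders; `-evp` and `-multi` for `openssl speed` and `-n`, `-c`, and `-Z` for `ab` are standard options of those tools.

```python
def openssl_speed_cmd(cipher, processes):
    """Argument list for an `openssl speed` run on one EVP cipher."""
    return ["openssl", "speed", "-evp", cipher, "-multi", str(processes)]


def ab_cmd(url, cipher_suite, requests, concurrency):
    """Argument list for an `ab` run against an HTTPS URL."""
    return ["ab", "-n", str(requests), "-c", str(concurrency),
            "-Z", cipher_suite, url]


# Hypothetical examples; host name and page name are placeholders.
print(openssl_speed_cmd("aes-256-cbc", 32))
print(ab_cmd("https://server.example/page_4k.html",
             "ECDHE-RSA-AES256-SHA384", requests=100000, concurrency=100))
```

Such lists could be passed to `subprocess.run` on a machine where both tools are installed; launching 32 `ab` processes in parallel, as described above, would repeat the second command once per process.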

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 32K, 64K, 128K, 256K, and 512K. All results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared four working flows for AES-256-CBC: (1) software encryption with the CPU, where the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW. (2) Software encryption with AES-NI, where the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI for clarity. (3) Hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW. (4) Hardware encryption with accelerators together with our proposed optimization methods, adaptive scheduling and the MM strategy; this testing is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth with the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16, in which the encryption flow with AES-NI is the best case for software encryption.

[Figure: the encryption bandwidths (MB/s) with different working flows (AES-256-CBC), by block size.]

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| SW | 1272.68 | 1272.31 | 1269.26 | 1263.54 | 1263.99 | 1258.23 | 1247.49 | 1251.86 |
| SW-NI | 11772.10 | 11804.24 | 11817.64 | 11809.72 | 11802.95 | 11810.72 | 11733.47 | 11658.08 |
| HW | 1650.75 | 3188.08 | 5891.48 | 5932.79 | 5954.03 | 5964.93 | 5970.58 | 5966.48 |
| HW-SW codesign | 11446.83 | 11635.55 | 12576.91 | 13948.63 | 15952.37 | 16989.31 | 16956.30 | 16855.39 |

Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC.

[Figure: 16 CPUs, 32 processes, AES-256-CBC HW-SW co-design encryption bandwidth improvement, by block size. Values in %.]

| Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB |
|---|---|---|---|---|---|---|---|---|
| HW-SW co-design vs SW-NI improvement | -2.76 | -1.43 | 6.42 | 18.11 | 35.16 | 43.85 | 44.51 | 44.58 |
| HW-SW co-design vs HW improvement | 593.43 | 264.97 | 113.48 | 135.11 | 167.93 | 184.82 | 184.00 | 182.50 |
| HW-SW co-design vs SW improvement | 799.43 | 814.52 | 890.89 | 1003.93 | 1162.06 | 1250.25 | 1259.23 | 1246.43 |

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW.

[Figure: the encryption bandwidths of HW with different numbers of CPUs (3DES-CBC); x-axis: block size (4 KB to 512 KB); y-axis: encryption bandwidth (MB/s); one series per CPU count from 1 to 16.]

Figure 17: Encryption bandwidths with hardware engines at different number of CPUs.

As we can see from Figures 15 and 16, encryption with hardware engines performs better than the crypto lib but worse than AES-NI. The reason is the invocation cost of hardware engines, with context switches and mode switches, as analyzed in Section 4.1. Besides, for AES-NI, the encryption function is accelerated directly through instructions, and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We get a 799%~1246% performance improvement compared with software encryption and a 593%~182% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18%~44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold for the data filter is configured as 16 KB in this case: ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for the data size check and the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI. However, this influence decreases as the data size increases, because scheduling checks become less frequent. If the data size is bigger than 16 KB, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
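The data filter described above amounts to a size threshold check. The following minimal sketch uses hypothetical names and return labels; routing a block of exactly 16 KB to the cooperative path is an assumption, chosen because Figure 16 shows a positive improvement over AES-NI at 16 KB.

```python
THRESHOLD_BYTES = 16 * 1024  # data filter threshold stated in the text


def select_crypto_path(data_size_bytes, threshold=THRESHOLD_BYTES):
    """Return which encryption path a request of this size would take.

    "aes-ni": software encryption with AES-NI (small requests)
    "hw-sw" : cooperative hardware and software encryption
    """
    if data_size_bytes < threshold:
        return "aes-ni"
    return "hw-sw"


for size_kb in (4, 8, 16, 512):
    print(size_kb, "KB ->", select_crypto_path(size_kb * 1024))
```

The check itself costs CPU time on every request, which is consistent with the slightly lower bandwidth reported for 4 KB and 8 KB blocks compared with plain AES-NI.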

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the hardware engines perform much better than software for this algorithm. Data filtering and dynamic scheduling are therefore not worthwhile, and we applied only the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidth of the hardware engines with different numbers of CPUs. As shown in Figure 17, the hardware accelerators perform poorly when the data size is small, because of the cost induced by the many context and mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. The results show that almost the best encryption bandwidth is reached with only four CPUs.

To get maximum performance from both hardware and software encryption engines, the remaining CPUs can be allocated to software encryption. For most block sizes, it is reasonable to use only four CPUs for hardware encryption, which leaves the other 12 CPUs available for more encryption tasks. For a clearer presentation, we tested three working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) hardware encryption with hardware engines using only four CPUs (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).
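One way to picture the allocation reasoning above is to pick the smallest CPU count whose hardware bandwidth is already close to the peak and hand the rest to software. The sketch below is an illustration under stated assumptions, not the MM algorithm itself: the function `split_cpus`, the 2% tolerance, and the sample measurements are all hypothetical (the sample values are made up and are not taken from Figure 17).

```python
from typing import Dict, Tuple


def split_cpus(hw_bandwidth: Dict[int, float], total_cpus: int,
               tolerance: float = 0.02) -> Tuple[int, int]:
    """Return (CPUs for hardware encryption, CPUs left for software).

    hw_bandwidth maps a CPU count to the hardware-engine bandwidth
    measured with that many CPUs driving the engines.
    """
    if not hw_bandwidth:
        raise ValueError("no measurements given")
    peak = max(hw_bandwidth.values())
    # Walk the CPU counts in increasing order and stop at the first one
    # that is within `tolerance` of the peak bandwidth.
    for cpus in sorted(hw_bandwidth):
        if cpus <= total_cpus and hw_bandwidth[cpus] >= (1.0 - tolerance) * peak:
            return cpus, total_cpus - cpus
    raise ValueError("no measured CPU count fits within total_cpus")


# Made-up measurements in MB/s, for illustration only.
EXAMPLE = {1: 900.0, 2: 1800.0, 3: 2700.0, 4: 3480.0, 8: 3500.0, 16: 3500.0}

if __name__ == "__main__":
    print(split_cpus(EXAMPLE, total_cpus=16))
```

With measurements that saturate at four CPUs, as the text reports for most block sizes, this kind of rule yields the 4/12 split used in the experiments.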

As Figure 18 shows, even with 16 CPUs dedicated to 3DES encryption, the software encryption bandwidth remains very small (only about 271 MB/s). Because 3DES is computation intensive, it occupies a large share of system resources when no instruction acceleration is available. The hardware crypto engines accelerate the encryption computation greatly, and the improvement is more obvious for

Security and Communication Networks 17

Figure 18: Encryption bandwidth for different working flows (3DES-CBC). Values are encryption bandwidth in MB/s.

Block size        4 KB      8 KB      16 KB     32 KB     64 KB     128 KB    256 KB    512 KB
SW                271.69    271.68    271.20    271.78    271.66    271.39    271.74    271.69
HW with 4 CPUs    1270.39   2429.30   3453.60   3489.52   3495.98   3499.26   3500.83   3499.89
HW-SW codesign    1921.24   3536.86   3667.12   3686.88   3696.00   3701.05   3703.26   3701.54

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW (16 CPUs, 32 processes, 3DES-CBC). Values are encryption bandwidth improvements in percent.

Block size                        4 KB     8 KB      16 KB     32 KB     64 KB     128 KB    256 KB    512 KB
HW-SW codesign vs SW              607.14   1201.85   1252.18   1256.57   1260.52   1263.74   1262.80   1262.41
HW-SW codesign vs HW (4 CPUs)     51.23    45.59     6.18      5.66      5.72      5.77      5.78      5.76

larger block sizes. The reason is that smaller data blocks induce more invocation cost. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

For a clear comparison, we summarize the performance improvement in Figure 19. HW-SW codesign improves bandwidth by nearly 12 times compared with software encryption. For small data blocks (4 KB), the improvement is only 6 times because the hardware engines are utilized less effectively. Compared with hardware-only encryption, the improvement from HW-SW codesign is less pronounced for large data blocks, because software encryption without AES-NI contributes little. Still, HW-SW codesign yields a 45% to 51% improvement for small data blocks and an additional 200 to


Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). Values are network bandwidth in MB/s.

Page size         4 KB     8 KB     16 KB    32 KB     64 KB     128 KB    256 KB    512 KB
SW                258.41   419.18   504.74   608.54    679.39    720.47    741.12    756.26
SW-NI             342.91   613.02   857.61   1182.25   1451.02   1653.04   1761.68   1831.19
HW                228.04   409.20   573.66   814.16    980.63    1179.44   1276.61   1336.44
HW-SW codesign    338.96   608.02   851.59   1173.38   1416.41   1705.11   1818.99   1884.28

300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to total crypto performance compared with the hardware engines, we suggest using the remaining idle CPU capacity for other important tasks, such as database management. Of course, it is up to the user to decide on the trade-off between CPU idle and encryption performance. For example, during rush hours the user could take full advantage of both hardware and software for maximum overall performance; during normal working hours with fewer encryption requests, the user could allocate only the management resources needed for the HACs and offload as much as possible.
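The improvement figures quoted for 3DES-CBC follow from the usual relative-gain formula, (new - baseline) / baseline. The short sketch below recomputes a few of them. The bandwidth values are the 4 KB and 512 KB entries of Figure 18, read under the assumption that the flattened numbers carry two decimal places (for example, 27169 is read as 271.69 MB/s); the helper name `improvement_pct` is illustrative.

```python
def improvement_pct(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Bandwidths in MB/s; decimal placement is an assumption (see lead-in).
SW = {"4 KB": 271.69, "512 KB": 271.69}
HW_4_CPUS = {"4 KB": 1270.39, "512 KB": 3499.89}
CODESIGN = {"4 KB": 1921.24, "512 KB": 3701.54}

if __name__ == "__main__":
    for size in ("4 KB", "512 KB"):
        vs_sw = improvement_pct(CODESIGN[size], SW[size])
        vs_hw = improvement_pct(CODESIGN[size], HW_4_CPUS[size])
        print(f"{size}: vs SW {vs_sw:.2f}%, vs HW with 4 CPUs {vs_hw:.2f}%")
```

Under that reading, an improvement of roughly 600% at 4 KB corresponds to the "6 times" in the text, and roughly 1260% at 512 KB to the "nearly 12 times".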

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes on 16 CPUs, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (megabytes of encrypted data per second), and CPU idle.

(1) 16 CPUs, 16 Processes. Table 6 lists the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
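The conversion from RPS to network bandwidth is a single multiplication: requests per second times page size. The sketch below assumes 1 MB = 1024 KB and reads the flattened table values with two decimal places (for example, 6615310 as 66153.10 requests per second); both are assumptions, chosen because they reproduce the SW bandwidths plotted for 16 processes. The function name is illustrative.

```python
def rps_to_mb_per_s(rps: float, page_size_kb: float) -> float:
    """Convert requests per second into MB/s for pages of page_size_kb KB."""
    if rps < 0 or page_size_kb <= 0:
        raise ValueError("rps must be >= 0 and page_size_kb must be > 0")
    return rps * page_size_kb / 1024.0  # assumes 1 MB = 1024 KB


# SW working flow, 16 processes; decimal placement is an assumption.
SW_RPS = {4: 66153.10, 8: 53654.50}

if __name__ == "__main__":
    for page_kb, rps in SW_RPS.items():
        print(f"{page_kb} KB: {rps_to_mb_per_s(rps, page_kb):.2f} MB/s")
```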

As Figure 20 shows, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction in invocation cost brought by the proposed data aggregation and adaptive scheduling. The effect is most obvious for small page sizes; for example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of size detection and the scheduling decision. If the page size is 128 KB or larger, HW-SW codesign outperforms both software-only and hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, HW-SW codesign still leaves room for further optimization. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, AES-NI leaves no CPU idle at all. With HW-SW codesign, by contrast, 12.51% to 18.51% more CPU idle is available than with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, together with a 2.85% to 3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)    4          8          16         32         64         128        256       512
SW                66153.10   53654.50   32303.20   19473.40   10870.30   5763.76    2964.48   1512.51
SW-NI             87784.10   78466.20   54887.00   37831.90   23216.30   13224.30   7046.71   3662.37
HW                58378.50   52377.60   36714.30   26053.10   15690.00   9435.54    5106.42   2672.87
HW-SW codesign    86679.40   77827.00   45499.50   34357.70   22623.60   13670.90   7275.66   3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)    4          8          16         32         64         128        256       512
SW                65885.90   52987.80   32030.00   19391.20   10803.80   5744.14    2964.18   1503.48
SW-NI             86302.50   77827.15   53017.20   37116.10   22920.40   13092.50   7020.85   3657.35
HW                58499.10   54891.60   37946.40   26972.10   16401.90   10041.90   5486.14   2881.66
HW-SW codesign    85177.20   77211.90   46404.00   36166.60   24875.10   15347.70   8325.19   4391.45

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). Values are in percent.

Page size                   4 KB    8 KB    16 KB   32 KB   64 KB   128 KB   256 KB   512 KB
HW-SW codesign vs SW-NI     -1.15   -0.81   -0.70   -0.75   -2.38   3.15     3.25     2.90
HW-SW codesign vs HW        48.64   48.59   48.45   44.12   44.44   44.57    42.49    40.99
CPU idle                    1.50    1.60    1.70    1.80    15.50   18.00    20.00    20.50

To obtain higher network bandwidth and thus better encryption performance, the remaining CPU resources can be utilized. As an example of the trade-off between CPU idle and performance, we present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As Figure 21 shows, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. To explore better performance with HW-SW codesign, we therefore increase the number of processes to use the idle CPU. We tested with 32 processes and list the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth in MB/s and present the results in Figure 22.

As Figure 22 shows, HW-SW codesign achieves a 49% to 52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction in invocation cost brought by the proposed data aggregation and adaptive scheduling. Unlike the case with 16 processes, however, the effect is more obvious for large page sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling


Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). Values are network bandwidth in MB/s.

Page size         4 KB     8 KB     16 KB    32 KB     64 KB     128 KB    256 KB    512 KB
SW                257.37   413.97   500.47   605.98    675.24    718.02    741.05    751.74
SW-NI             337.12   608.26   828.39   1159.88   1432.53   1636.56   1755.21   1828.68
HW                228.51   421.03   592.91   842.88    1025.12   1255.24   1371.54   1440.83
HW-SW codesign    334.03   603.22   820.70   1159.42   1562.83   1923.08   2092.16   2196.43

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). Values are in percent.

Page size                   4 KB    8 KB    16 KB   32 KB   64 KB   128 KB   256 KB   512 KB
HW-SW codesign vs SW-NI     -0.92   -0.83   -0.93   -0.04   9.10    17.51    19.20    20.11
HW-SW codesign vs HW        46.17   43.27   38.42   37.55   52.45   53.20    52.54    52.44
CPU idle                    2.50    2.00    2.00    1.80    1.50    1.80     1.50     1.60

decision. If the page size is 64 KB or larger, HW-SW codesign outperforms both software-only and hardware-only encryption.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. For small pages, HW-SW codesign maintains almost the same performance as AES-NI. This is expected, since the request filter chooses the most suitable encryption path: hardware engines do not perform as well as AES-NI on small pages because of the context switch cost, so ACSA automatically adopts software encryption with AES-NI for them. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still gain an additional 20% of network bandwidth for a page size of 512 KB.


Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs). Values are in percent.

Page size                                  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB   256 KB   512 KB
HW-SW codesign vs SW-NI, 16 processes      -1.15   -0.81   -0.70   -0.75   -2.38   3.15     3.25     2.90
HW-SW codesign vs SW-NI, 32 processes      -0.92   -0.83   -0.93   -0.04   9.10    17.51    19.20    20.11
CPU idle, 16 processes                     1.50    1.60    1.70    1.80    15.50   18.00    20.00    20.50
CPU idle, 32 processes                     2.50    2.00    2.00    1.80    1.50    1.80     1.50     1.60

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)    4          8          16         32         64        128       256       512
SW                36334.70   25207.40   12791.60   6900.93    3589.73   1833.76   924.40    466.12
HW                58275.70   50717.80   25772.70   14525.70   7645.65   3902.54   1968.62   990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance of cipher suite ECDHE-RSA-AES256-SHA384 for diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign improves bandwidth by 53.20% over hardware-only encryption and by 20.11% over AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology enables a better trade-off between performance and CPU idle. It is possible to reach the same network bandwidth (RPS) as AES-NI while still having additional CPU resources available. The user can use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance of two working flows in this section: software encryption without AES-NI, and hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6 to 2.1 times the performance of software encryption.

We summarize the network bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large pages. Under the same testing conditions, this is a 60% to 112% improvement over software encryption. Besides the improved performance, 13.62% to 89.54% of the CPU remains idle with hardware encryption.

In the OpenSSL testing, the maximum performance improvement is about 12 times. In end-to-end testing, however, the RPS improves by only about 2 times. The cause is a constraint of the testing platform: we found that the software decryption speed on the client had become a bottleneck, with all of the client's CPUs completely loaded.

7. Conclusions and Future Work

In this work, we presented ACSA (Adaptive Crypto System based on Accelerators), which adopts the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU


Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth. Values are network bandwidth in MB/s.

Page size    4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW           141.93   196.93   199.87   215.65   224.36   229.22   231.10   233.06
HW           227.64   396.23   402.70   453.93   477.85   487.82   492.16   495.44

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle. Values are in percent.

Page size                 4 KB    8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW vs SW improvement      60.39   101.20   101.48   110.49   112.99   112.82   112.96   112.58
CPU idle                  13.62   15.40    53.15    70.40    75.07    86.04    88.17    89.54

occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. By establishing 40 Gbps networking, we were able to evaluate the system performance in real applications under a high workload on various benchmarks and system configurations. For the 3DES encryption algorithm, which is not supported by AES-NI, accelerators deliver about 12 times acceleration. For AES, a typical encryption algorithm supported by instruction acceleration, we obtain a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology applies regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy


efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we intend to study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article: David Brumley and Dan Boneh, Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM cortex TM -A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.


[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Figure 11: The working flow of (a) software encryption and (b) hardware encryption. Panel (a), user space: Web Server, OpenSSL, software crypto lib. Panel (b), user space and kernel space: Web Server, OpenSSL, CryptoDev, Crypto API, hardware driver, hardware engine.

Figure 12: The encryption performance with hardware/software working mode at different CPU numbers (AES-256-CBC crypto performance). Values are encryption bandwidth in GB/s.

Number of CPUs    Software    Hardware
1                 1.01        7.52
2                 2.04        7.61
3                 3.06        7.62
4                 4.08        7.62
5                 5.05        7.62
6                 6.13        7.58
7                 7.16        7.61
8                 8.17        7.62
9                 9.19        7.62
10                10.22       7.62
11                11.22       7.62
12                12.26       7.62
13                13.28       7.62
14                14.31       7.62
15                15.33       7.62
16                15.97       7.62

Figure 13: The performance differences between hardware and software encryption at different CPU numbers. Values are hardware bandwidth minus software bandwidth in GB/s.

Number of CPUs    Difference (GB/s)
1                 6.51
2                 5.57
3                 4.56
4                 3.54
5                 2.57
6                 1.45
7                 0.45
8                 -0.55
9                 -1.57
10                -2.60
11                -3.60
12                -4.64
13                -5.66
14                -6.69
15                -7.71
16                -8.35


Table 4: Hardware environment of testing.

Hardware        Number    Configurations
ARM Server      2         CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB
Network Card    4         10 Gbps each
Net cable       4         2 Gigabit cables (Category 7A); 2 10 Gbps fiber cables

Table 5: Software environment of testing.

Software Configuration    Software Version
Operating System          Linux-4127
OpenSSL                   OpenSSL-1.0.2j
Nginx                     Nginx-1116
Benchmark                 ab (Apache Benchmark)

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients). The server and the testing machine (client) are connected through four 10 Gbps Ethernet ports.

6. System Test and Analysis

To reflect real working environments, we established a test platform for ACSA with 40 Gbps network bandwidth, as shown in Figure 14. As listed in Table 4, we deployed two servers for testing: one is used as the ACSA crypto system to respond to HTTPS accesses, while the other is used as a client to generate testing workloads. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the highly concurrent requests from a large number of clients, we built the networking environment with four network cards, each contributing 10 Gbps of bandwidth.

As listed in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4127, extended to support HAC access. OpenSSL-1.0.2j performs software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (Apache Benchmark) [32] is a widely used tool for testing website performance; we choose it for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed with speed, the standard benchmark in OpenSSL, which measures the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms are AES-256-CBC and 3DES-CBC; AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARM cores. The performance is reported in MB/s, which indicates the amount of data that can be encrypted per second.

(2) End-to-End Testing for Full System. End-to-end testing evaluates the performance of the HTTPS service that ACSA can provide. We use the testing machine to generate HTTPS requests from different clients. Over the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard tool ab is adopted for workload testing, and the performance indicator is RPS (requests per second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike in the OpenSSL testing, the tested data are web pages instead of data blocks. We tested web pages of eight different sizes to see the performance for different data sizes. We take full advantage of the four network cards to run 32 ab processes, which ensures sufficient testing pressure with a high workload.
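As an illustration of how one such ab process might be launched with a fixed cipher suite, the sketch below builds the command line without executing it. The options used (`-n` for the number of requests, `-c` for concurrency, `-Z` for the SSL/TLS cipher suite) are documented ab options; the URL, the request and concurrency counts, and the helper name `build_ab_command` are hypothetical and are not the settings used in the experiments.

```python
from typing import List


def build_ab_command(url: str, requests: int, concurrency: int,
                     cipher_suite: str) -> List[str]:
    """Build the argument list for one ab run against an HTTPS URL."""
    if requests <= 0 or concurrency <= 0:
        raise ValueError("requests and concurrency must be positive")
    if concurrency > requests:
        # ab refuses a concurrency level above the total request count.
        raise ValueError("concurrency cannot exceed requests")
    return ["ab", "-n", str(requests), "-c", str(concurrency),
            "-Z", cipher_suite, url]


if __name__ == "__main__":
    # Hypothetical target page and load parameters.
    cmd = build_ab_command("https://server.example/page_4k.html",
                           requests=100000, concurrency=100,
                           cipher_suite="ECDHE-RSA-AES256-SHA384")
    print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run` once per client process; running several such processes in parallel, one group per network card, would approximate the 32-process setup described above.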

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared four working flows for AES-256-CBC: (1) software encryption with the CPU, where the crypto computing is executed through OpenSSL libcrypto (denoted SW); (2) software encryption with AES-NI, where the software computing is accelerated through the special instruction set AES-NI (denoted SW-NI); (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy (denoted HW); and (4) hardware encryption with accelerators together with our proposed optimizations, adaptive scheduling and the MM strategy (denoted HW-SW codesign). For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of the proposed methodology, Figure 16 summarizes the bandwidth comparison between HW-SW codesign and AES-NI, where the flow with AES-NI is the best case for software encryption.
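The improvement percentages in Figure 16 appear to be relative changes in bandwidth with respect to the baseline flow. The sketch below reproduces two of them from the 4 KB bandwidths in Figure 15, assuming the recorded values carry two decimal places (for example, 11446.83 MB/s for HW-SW codesign); the function name is our own.

```python
def improvement_percent(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# 4 KB blocks, AES-256-CBC, bandwidths in MB/s (Figure 15).
codesign, sw_ni, hw = 11446.83, 11772.10, 1650.75

vs_sw_ni = improvement_percent(codesign, sw_ni)  # slightly negative
vs_hw = improvement_percent(codesign, hw)        # several hundred percent

print(round(vs_sw_ni, 2), round(vs_hw, 2))
```

A negative result means the codesign flow is slower than the baseline, which is the case against AES-NI for the two smallest block sizes.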

Security and Communication Networks 15

Figure 15: Encryption bandwidths (MB/s) with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC.

Block size   SW        SW-NI      HW        HW-SW codesign
4 KB         1272.68   11772.10   1650.75   11446.83
8 KB         1272.31   11804.24   3188.08   11635.55
16 KB        1269.26   11817.64   5891.48   12576.91
32 KB        1263.54   11809.72   5932.79   13948.63
64 KB        1263.99   11802.95   5954.03   15952.37
128 KB       1258.23   11810.72   5964.93   16989.31
256 KB       1247.49   11733.47   5970.58   16956.30
512 KB       1251.86   11658.08   5966.48   16855.39

Figure 16: Performance comparisons between HW-SW codesign and SW-NI/HW (16 CPUs, 32 processes, AES-256-CBC; encryption bandwidth improvement of HW-SW co-design in %).

Block size   vs SW-NI   vs HW     vs SW
4 KB         -2.76      593.43    799.43
8 KB         -1.43      264.97    814.52
16 KB        6.42       113.48    890.89
32 KB        18.11      135.11    1003.93
64 KB        35.16      167.93    1162.06
128 KB       43.85      184.82    1250.25
256 KB       44.51      184.00    1259.23
512 KB       44.58      182.50    1246.43


Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs (3DES-CBC; x-axis: block size from 4 KB to 512 KB; y-axis: encryption bandwidth in MB/s; one series per CPU count from 1 to 16).

As Figures 15 and 16 show, encryption with hardware engines alone performs better than the crypto library but worse than AES-NI. The reason is the invocation cost of the hardware engines, with its context switches and mode switches, as analyzed in Section 4.1. With AES-NI, in contrast, the encryption function is accelerated directly through instructions and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We obtain a 799% to 1246% performance improvement compared with software encryption and a 182% to 593% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18% to 44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed for the data size check and the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI alone. However, this influence decreases as the data size increases, because the scheduling check occurs less frequently. If the data size is larger than 16 KB, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign yields the best encryption bandwidth compared with software-only or hardware-only encryption.
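The data filter described above can be pictured with the following minimal sketch. It is not the ACSA implementation: the function name, the string labels, and the accelerator_busy fallback are assumptions made for illustration, and only the 16 KB threshold comes from the text. Sending requests of exactly 16 KB to the hardware path is also an assumption, suggested by the improvement observed at that size.

```python
THRESHOLD_BYTES = 16 * 1024  # data-filter threshold reported for this test


def select_engine(data_size, accelerator_busy=False):
    """Pick an encryption path for one request (illustrative policy only)."""
    if data_size < THRESHOLD_BYTES:
        # Small requests: the invocation cost of the accelerator would dominate.
        return "aes-ni"
    # Large requests go to the accelerator; falling back to AES-NI when the
    # accelerator is saturated is an assumed behavior, not taken from the text.
    return "aes-ni" if accelerator_busy else "hardware"


for size_kb in (4, 8, 16, 64):
    print(size_kb, "KB ->", select_engine(size_kb * 1024))
```

The size check itself costs CPU cycles on every request, which is consistent with the slightly lower bandwidth than pure AES-NI reported for 4 KB and 8 KB blocks.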

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, and applying the data filter and dynamic scheduling is not worthwhile. Therefore, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators perform poorly when the data size is small; the reason is the cost induced by many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can obtain almost the best encryption bandwidth with only four CPUs.

If we want maximum performance from both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption, so the other 12 available CPUs can take on more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) hardware encryption with hardware engines, using only four CPUs (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).
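The allocation idea in this paragraph, giving the hardware engines only as many CPUs as they need to approach their peak bandwidth and leaving the rest for software encryption, can be sketched as follows. The function, the 2% tolerance, and the bandwidth figures are hypothetical and only mimic the saturation behavior reported for Figure 17; they are not measured values and not the authors' MM algorithm.

```python
def split_cpus(bandwidth_by_cpus, total_cpus, tolerance=0.02):
    """Return (hardware_cpus, software_cpus).

    hardware_cpus is the smallest CPU count whose measured bandwidth is
    within `tolerance` of the best bandwidth observed.
    """
    peak = max(bandwidth_by_cpus.values())
    hw_cpus = min(
        n for n, bw in bandwidth_by_cpus.items() if bw >= (1.0 - tolerance) * peak
    )
    return hw_cpus, total_cpus - hw_cpus


# Hypothetical hardware-engine bandwidths (MB/s) per number of managing CPUs.
illustrative = {1: 900.0, 2: 1800.0, 3: 2700.0, 4: 3480.0, 8: 3495.0, 16: 3500.0}

allocation = split_cpus(illustrative, total_cpus=16)
print(allocation)
```

With these illustrative numbers the split is four CPUs for the hardware engines and twelve for software encryption, the same division the text arrives at.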

As Figure 18 shows, even with 16 CPUs dedicated to 3DES encryption, the software encryption bandwidth is still very small (only about 271 MB/s). Since it is computation intensive, this algorithm occupies a large share of system resources when there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for


Figure 18: Encryption bandwidth (MB/s) for different working flows (3DES-CBC).

Block size   SW       HW with 4 CPUs   HW-SW co-design
4 KB         271.69   1270.39          1921.24
8 KB         271.68   2429.30          3536.86
16 KB        271.20   3453.60          3667.12
32 KB        271.78   3489.52          3686.88
64 KB        271.66   3495.98          3696.00
128 KB       271.39   3499.26          3701.05
256 KB       271.74   3500.83          3703.26
512 KB       271.69   3499.89          3701.54

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW (16 CPUs, 32 processes, 3DES-CBC; encryption bandwidth improvement of HW-SW co-design in %).

Block size   vs SW     vs HW with 4 CPUs
4 KB         607.14    51.23
8 KB         1201.85   45.59
16 KB        1252.18   6.18
32 KB        1256.57   5.66
64 KB        1260.52   5.72
128 KB       1263.74   5.77
256 KB       1262.80   5.78
512 KB       1262.41   5.76

larger block sizes. The reason is that smaller data blocks induce more invocation cost. The remaining idle CPU can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, there is a nearly 12 times bandwidth improvement through HW-SW codesign compared with software encryption. There is only a 6 times improvement for small data blocks (4 K) due to the poorer utilization of the hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is less obvious for large data blocks. The reason is the poor contribution of software encryption without AES-NI. Still, we can get a 45% to 51% improvement for small data blocks with HW-SW codesign and an additional 200 to


Figure 20: Network bandwidth (MB/s) with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

Page size   SW       SW-NI     HW        HW-SW co-design
4 KB        258.41   342.91    228.04    338.96
8 KB        419.18   613.02    409.20    608.02
16 KB       504.74   857.61    573.66    851.59
32 KB       608.54   1182.25   814.16    1173.38
64 KB       679.39   1451.02   980.63    1416.41
128 KB      720.47   1653.04   1179.44   1705.11
256 KB      741.12   1761.68   1276.61   1818.99
512 KB      756.26   1831.19   1336.44   1884.28

300 MB/s encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU be used for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hours the user could take full advantage of both hardware and software for maximum overall performance; during normal working time with fewer encryption requests, the user could allocate only the management resources needed for the HACs and aim for the best offloading.

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the web server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to observe how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is given in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
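The conversion from RPS to network bandwidth appears to be a simple product of request rate and page size, with 1 MB taken as 1024 KB. A minimal sketch, assuming the recorded values carry two decimal places (for example, 66153.10 requests per second for the SW flow with 4 KB pages):

```python
def rps_to_mb_per_s(rps, page_size_kb):
    """Convert requests per second to MB/s for pages of a fixed size."""
    return rps * page_size_kb / 1024.0


# SW flow, 4 KB pages, 16 processes.
sw_4k = rps_to_mb_per_s(66153.10, 4)
print(round(sw_4k, 2))
```

The result agrees with the SW bandwidth reported for 4 KB pages in Figure 20 (258.41 MB/s), which supports this reading of the conversion.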

As Figure 20 shows, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small block sizes; for example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance is better than with software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative for small requests, HW-SW codesign still leaves room for further optimization. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. With HW-SW codesign, however, there is 12.51% to 18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, and there is also a 2.85% to 3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)    4          8          16         32         64         128        256       512
SW                66153.10   53654.50   32303.20   19473.40   10870.30   5763.76    2964.48   1512.51
SW-NI             87784.10   78466.20   54887.00   37831.90   23216.30   13224.30   7046.71   3662.37
HW                58378.50   52377.60   36714.30   26053.10   15690.00   9435.54    5106.42   2672.87
HW-SW co-design   86679.40   77827.00   45499.50   34357.70   22623.60   13670.90   7275.66   3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)    4          8          16         32         64         128        256       512
SW                65885.90   52987.80   32030.00   19391.20   10803.80   5744.14    2964.18   1503.48
SW-NI             86302.50   77827.15   53017.20   37116.10   22920.40   13092.50   7020.85   3657.35
HW                58499.10   54891.60   37946.40   26972.10   16401.90   10041.90   5486.14   2881.66
HW-SW co-design   85177.20   77211.90   46404.00   36166.60   24875.10   15347.70   8325.19   4391.45

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement of HW-SW co-design and CPU idle, all in %).

Page size   vs SW-NI   vs HW   CPU idle
4 KB        -1.15      48.64   1.50
8 KB        -0.81      48.59   1.60
16 KB       -0.70      48.45   1.70
32 KB       -0.75      44.12   1.80
64 KB       -2.38      44.44   15.50
128 KB      3.15       44.57   18.00
256 KB      3.25       42.49   20.00
512 KB      2.90       40.99   20.50

If we want higher network bandwidth for better encryption performance, we can make use of the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As Figure 21 shows, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to further explore better performance with HW-SW codesign, we increase the number of processes to use the idle CPU. We tested with 32 processes and give the RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 7. For a clear comparison and analysis, we convert the RPS to network bandwidth in MB/s and present the results in Figure 22.

As Figure 22 shows, HW-SW codesign achieves a 49% to 52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction in invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the case with 16 processes, the influence is more obvious for large block sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling


Figure 22: Network bandwidth (MB/s) with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

Page size   SW       SW-NI     HW        HW-SW co-design
4 KB        257.37   337.12    228.51    334.03
8 KB        413.97   608.26    421.03    603.22
16 KB       500.47   828.39    592.91    820.70
32 KB       605.98   1159.88   842.88    1159.42
64 KB       675.24   1432.53   1025.12   1562.83
128 KB      718.02   1636.56   1255.24   1923.08
256 KB      741.05   1755.21   1371.54   2092.16
512 KB      751.74   1828.68   1440.83   2196.43

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement of HW-SW co-design and CPU idle, all in %).

Page size   vs SW-NI   vs HW   CPU idle
4 KB        -0.92      46.17   2.50
8 KB        -0.83      43.27   2.00
16 KB       -0.93      38.42   2.00
32 KB       -0.04      37.55   1.80
64 KB       9.10       52.45   1.50
128 KB      17.51      53.20   1.80
256 KB      19.20      52.54   1.50
512 KB      20.11      52.44   1.60

decision. If the page size is equal to or larger than 64 KB, the overall performance is better than with software-only or hardware-only encryption.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption path. Hardware engines do not perform as well as AES-NI for small pages because of the context-switch cost; therefore, ACSA automatically adopts software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for a page size of 512 KB.


Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs; HW-SW co-design vs SW-NI improvement and CPU idle, all in %).

Page size   Improvement (16 proc.)   Improvement (32 proc.)   CPU idle (16 proc.)   CPU idle (32 proc.)
4 KB        -1.15                    -0.92                    1.50                  2.50
8 KB        -0.81                    -0.83                    1.60                  2.00
16 KB       -0.70                    -0.93                    1.70                  2.00
32 KB       -0.75                    -0.04                    1.80                  1.80
64 KB       -2.38                    9.10                     15.50                 1.50
128 KB      3.15                     17.51                    18.00                 1.80
256 KB      3.25                     19.20                    20.00                 1.50
512 KB      2.90                     20.11                    20.50                 1.60

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)   4          8          16         32         64        128       256       512
SW               36334.70   25207.40   12791.60   6900.93    3589.73   1833.76   924.40    466.12
HW               58275.70   50717.80   25772.70   14525.70   7645.65   3902.54   1968.62   990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions.

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology makes a better trade-off between performance and CPU idle possible. It is possible to reach the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user could use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance for two working flows in this section. The first is software encryption without AES-NI. The other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We obtain 1.6 to 2.1 times the performance when hardware encryption is adopted.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60% to 112% improvement compared with software encryption. Besides the improved performance, 13.62% to 89.54% CPU idle remains available with hardware encryption.

According to the OpenSSL test, the maximum performance improvement is 12 times, whereas the RPS improvement in end-to-end testing is only 2 times. The limitation is the testing platform: we found that software decryption on the client has become a bottleneck, with all client CPUs completely loaded.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which is able to adopt the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU


Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth (MB/s).

Page size   SW       HW
4 KB        141.93   227.64
8 KB        196.93   396.23
16 KB       199.87   402.70
32 KB       215.65   453.93
64 KB       224.36   477.85
128 KB      229.22   487.82
256 KB      231.10   492.16
512 KB      233.06   495.44

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle (all in %).

Page size   HW vs SW improvement   CPU idle
4 KB        60.39                  13.62
8 KB        101.20                 15.40
16 KB       101.48                 53.15
32 KB       110.49                 70.40
64 KB       112.99                 75.07
128 KB      112.82                 86.04
256 KB      112.96                 88.17
512 KB      112.58                 89.54

occupation is considerable compared with hardware crypto. Although HACs perform well in calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of the hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtain about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtain a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we intend to study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., “Big data: survey, technologies, opportunities, and challenges,” The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, “Security and privacy protection of social networks in big data era,” Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, “A survey on security issues in service delivery models of cloud computing,” Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, “Trust but verify: auditing the secure internet of things,” in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, “Hardware-software co-design of AES on FPGA,” in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, “Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip,” ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, “Security methods for web-based applications on embedded system,” in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, “Article David Brumley and Dan Boneh, Remote Timing Attacks are Practical,” in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, “Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA,” in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, “Accelerating OpenSSL’s ECC with low cost reconfigurable hardware,” in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, “An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions,” in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT’11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, “Design and test of a scalable security processor,” in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, “10 Gbps implementation of TLS/SSL accelerator on FPGA,” in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, “System architecture implementation for a high-performance Network Security Processor,” in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, “Intel advanced encryption standard instructions (aes-ni),” Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, “Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security,” in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, “Improving the performance of a scalable encryption algorithm (SEA) using FPGA,” International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., “A 3 GHz dual core processor ARM Cortex TM-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization,” IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, “Energy- and cost-efficiency analysis of ARM-based clusters,” in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, “Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors,” Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.


[24] X. Zhang and K. K. Parhi, “High-speed VLSI architectures for the AES algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, “Very compact FPGA implementation of the AES algorithm,” in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, “A study of encryption algorithms (RSA, DES, 3DES and AES) for information security,” International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, “A comparative analysis of SHA and MD5 algorithm,” Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, “Improving I/O forwarding throughput with data compression,” in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, “Using underutilized CPU resources to enhance its reliability,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., “Holistically evaluating the environmental impacts in modern computing systems,” in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, “Growth in data center electricity use 2005 to 2010,” A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, “Green computing: new challenges and opportunities,” in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Table 4: Hardware environment of testing.

Hardware | Number | Configurations
ARM Server | 2 | CPU: ARM Cortex-A57, 16 cores; main frequency: 2.1 GHz; memory: 128 GB
Network Card | 4 | 10 Gbps each
Net cable | 4 | 2 Gigabit cables (Category 7A), 2 10 Gbps fiber cables

Table 5: Software environment of testing.

Software Configuration | Software Version
Operating System | Linux-4127
OpenSSL | OpenSSL-1.0.2j
Nginx | Nginx-1116
Benchmark | ab (Apache benchmark)

Figure 14: Testing network (CASC acted as Server, the Testing machine as Clients). The server and the testing machine (client) are connected through four 10 Gbps Ethernet ports.

6. System Test and Analysis

To reflect real working environments, we established a test platform for ACSA with 40 Gbps of network bandwidth, as shown in Figure 14. As listed in Table 4, we deployed 2 servers for testing: one runs the ACSA crypto system and responds to HTTPS accesses, while the other acts as a client that generates the testing workload. Both machines are equipped with 16 ARM processing units with a main frequency of 2.1 GHz and 128 GB of memory. To satisfy the high-concurrency requirements of a large number of clients, we built the networking environment with 4 network cards, each contributing 10 Gbps of bandwidth.

As illustrated in Table 5, we adopt Nginx [31] as the Web Server, which uses an asynchronous event-driven approach to handle requests. The operating system is Linux-4127, extended to support HAC access. OpenSSL-1.0.2j performs software encryption through the cryptographic library libcrypto. The Adaptive Scheduler and the MM algorithm are also integrated into OpenSSL for efficient utilization of the accelerators. Ab (Apache benchmark) [32] is a widely used toolset for testing website performance; we choose this benchmark for end-to-end testing over the network.

6.1. Testing Methodology. To guarantee correctness and completeness, the proposed ACSA is evaluated at two working levels.

(1) OpenSSL Testing on the Server Side. This testing is performed through the standard benchmark speed in OpenSSL. With speed we can measure the maximum encryption bandwidth for a single algorithm. The tested encryption algorithms are AES-256-CBC and 3DES-CBC: AES-CBC can be accelerated through AES-NI, while 3DES-CBC is not supported by AES-NI. To explore the performance for different configurations and data characteristics, we vary the number of processes, the block size for encryption, and the number of available ARMs. Performance in this testing is reported in MB/s, which indicates the amount of data that can be encrypted per second.
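As a rough illustration of how such a run could be launched, the sketch below only assembles the command line for the stock `openssl speed` tool. The `-evp` and `-multi` options are standard; the helper name `build_speed_command` is hypothetical, and the sketch does not reproduce the 4 KB to 512 KB block sizes reported here, which the stock tool in OpenSSL 1.0.2 does not appear to offer and which presumably needed a modified benchmark.

```python
from typing import List


def build_speed_command(cipher: str, processes: int) -> List[str]:
    """Return an argv list for `openssl speed` on one EVP cipher.

    `cipher` is an OpenSSL EVP cipher name, e.g. "aes-256-cbc" or
    "des-ede3-cbc" (the EVP name for 3DES-CBC). `processes` maps to
    the -multi option, which runs that many benchmarks in parallel.
    """
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return ["openssl", "speed", "-multi", str(processes), "-evp", cipher]


# 32 parallel processes, matching the process count used in this section.
AES_CMD = build_speed_command("aes-256-cbc", 32)
DES3_CMD = build_speed_command("des-ede3-cbc", 32)
```

The lists can be passed to `subprocess.run` on a machine with OpenSSL installed; selecting a hardware engine would additionally need the `-engine` option with an engine name that depends on the platform.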

(2) End-to-End Testing for Full System. End-to-end testing evaluates the performance of the HTTPS service that ACSA can provide. We use the testing machine to generate HTTPS requests from different clients. Through the established 40 Gbps network, we can increase the workload to simulate highly concurrent accesses in real application scenarios. The standard toolset ab is adopted for workload testing, and the performance indicator is RPS (requests per second). For further analysis, we choose cipher suites based on the same algorithms tested in OpenSSL. Two cipher suites are tested in our experiments: ECDHE-RSA-AES256-SHA384 and ECDHE-RSA-DES-CBC3-SHA. Unlike the OpenSSL testing, the tested data are web pages instead of data blocks. We tested 7 different web pages to see the performance for different data sizes. We take full advantage of the 4 network cards to run 32 ab processes, so that the test applies a high workload.
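A minimal sketch of one such ab invocation is given below. The options `-n` (total requests), `-c` (concurrency) and `-Z` (cipher suite) are standard ab options; the URL, the request counts and the helper name `build_ab_command` are placeholders chosen for illustration, not values taken from the experiments.

```python
from typing import List


def build_ab_command(url: str, total_requests: int, concurrency: int,
                     cipher_suite: str) -> List[str]:
    """Return an argv list for one ab (Apache benchmark) HTTPS run."""
    if total_requests < concurrency:
        raise ValueError("total_requests must be >= concurrency")
    return [
        "ab",
        "-n", str(total_requests),  # total number of requests to issue
        "-c", str(concurrency),     # requests kept in flight at once
        "-Z", cipher_suite,         # force a specific TLS cipher suite
        url,
    ]


# Hypothetical server name and page; one of the two tested cipher suites.
AB_CMD = build_ab_command(
    "https://acsa-server.example/page_4k.html",
    total_requests=100000,
    concurrency=100,
    cipher_suite="ECDHE-RSA-AES256-SHA384",
)
```

Running 32 such processes in parallel, spread over the 4 network cards, would approximate the workload described above.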

6.2. OpenSSL Testing. We tested the encryption bandwidth for both AES-256-CBC and 3DES-CBC in this section. For each experiment, we evaluated the maximum encryption bandwidth with different block sizes, including 4K, 8K, 16K, 32K, 64K, 128K, 256K, and 512K. All the results are presented in MB/s (megabytes per second) in Figures 15 and 16.

6.2.1. AES-256-CBC. We tested and compared 4 working flows for AES-256-CBC: (1) software encryption with the CPU, where the crypto computing is executed through OpenSSL libcrypto; we denote this working flow as SW; (2) software encryption with AES-NI, where the software computing is accelerated through the special instruction set AES-NI; this working flow is denoted as SW-NI; (3) hardware encryption with accelerators but without adaptive scheduling and the MM strategy; this working mode is denoted as HW; and (4) hardware encryption with accelerators and our proposed optimization methods, adaptive scheduling and the MM strategy; this flow is denoted as HW-SW codesign. For fairness, the adaptive interrupt aggregation is applied to all the testing flows.

Figure 15 shows the encryption bandwidth of the four working flows. To better present the performance improvement of our proposed methodology, we summarize the bandwidth comparison between HW-SW codesign and AES-NI in Figure 16, in which the encryption flow with AES-NI represents the best case for software encryption.

Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC. Encryption bandwidth (MB/s) by block size:

Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
SW | 1272.68 | 1272.31 | 1269.26 | 1263.54 | 1263.99 | 1258.23 | 1247.49 | 1251.86
SW-NI | 11772.10 | 11804.24 | 11817.64 | 11809.72 | 11802.95 | 11810.72 | 11733.47 | 11658.08
HW | 1650.75 | 3188.08 | 5891.48 | 5932.79 | 5954.03 | 5964.93 | 5970.58 | 5966.48
HW-SW codesign | 11446.83 | 11635.55 | 12576.91 | 13948.63 | 15952.37 | 16989.31 | 16956.30 | 16855.39

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW (16 CPUs, 32 processes, AES-256-CBC). Encryption bandwidth improvement (%) by block size:

Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
HW-SW co-design vs SW-NI | -2.76 | -1.43 | 6.42 | 18.11 | 35.16 | 43.85 | 44.51 | 44.58
HW-SW co-design vs HW | 593.43 | 264.97 | 113.48 | 135.11 | 167.93 | 184.82 | 184.00 | 182.50
HW-SW co-design vs SW | 799.43 | 814.52 | 890.89 | 1003.93 | 1162.06 | 1250.25 | 1259.23 | 1246.43
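The improvement percentages in Figure 16 appear to be relative gains of the HW-SW codesign bandwidth over each baseline. The sketch below states that relation explicitly; the function name is ours, and the sample values are the 4 KB readings of Figure 15 interpreted with two decimal places, which is an assumption about the extracted figure data.

```python
def improvement_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new / baseline - 1.0) * 100.0


# 4 KB block size, bandwidths in MB/s (assumed decimal placement).
HW_SW, SW_NI, HW, SW = 11446.83, 11772.10, 1650.75, 1272.68

VS_SW_NI = improvement_percent(HW_SW, SW_NI)  # slightly negative
VS_HW = improvement_percent(HW_SW, HW)        # several hundred percent
VS_SW = improvement_percent(HW_SW, SW)
```

Under this reading the three results agree with the first column of Figure 16 to two decimal places, which supports the decimal placement used for both figures.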

Figure 17: Encryption bandwidths with hardware engines at different number of CPUs (3DES-CBC, HW flow; encryption bandwidth in MB/s versus block size from 4 KB to 512 KB, one series per CPU count from 1 to 16).

As we can see from Figures 15 and 16, encryption with hardware engines alone performs better than the crypto library but worse than AES-NI. The reason is the invocation cost of hardware engines, with context switches and mode switches, as analyzed in Section 4.1. With AES-NI, in contrast, the encryption function is accelerated directly through instructions and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We obtain a 799% to 1246% performance improvement compared with software encryption and a 182% to 593% improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow achieves an 18% to 44% performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold for the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption if the data size is smaller than the threshold. Since CPU resources are needed to check the data size and make the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI alone. This influence decreases as the data size increases, because the scheduling check runs less frequently. If the data size is bigger than 16 KB, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign gives the best encryption bandwidth compared with software-only or hardware-only encryption.
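The data filter described above can be pictured as a simple size test. The following is only a sketch under stated assumptions: the names `choose_engine` and `THRESHOLD_BYTES` are hypothetical, requests of exactly 16 KB are assumed to take the accelerated path, and the system-load input that the adaptive scheduler also uses is not modelled.

```python
THRESHOLD_BYTES = 16 * 1024  # data filter threshold used in this test


def choose_engine(data_size: int, threshold: int = THRESHOLD_BYTES) -> str:
    """Pick an encryption path from the request size alone.

    Requests below the threshold stay on the CPU with AES-NI, so they
    avoid the invocation cost of the hardware engine; larger requests
    become eligible for the accelerator.
    """
    if data_size < 0:
        raise ValueError("data_size must be non-negative")
    return "aes-ni" if data_size < threshold else "accelerator"


# MM split reported for this test: 13 of 32 processes drive the
# accelerators and the remaining 19 run software encryption.
TOTAL_PROCESSES = 32
HW_PROCESSES = 13
SW_PROCESSES = TOTAL_PROCESSES - HW_PROCESSES
```

For example, a 4 KB block is routed to AES-NI while a 64 KB block is routed to the accelerator, which matches the behaviour reported for block sizes on either side of the threshold.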

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, and applying the data filter and dynamic scheduling is not worthwhile. Therefore, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators perform poorly when the data size is small; the reason is the cost induced by many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can get almost the best encryption bandwidth with only four CPUs.

If we want maximum performance from both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption, so we can take full advantage of the other 12 available CPUs for more encryption tasks. For a better presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) hardware encryption with hardware engines, with only four CPUs utilized (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).
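The allocation reasoning above (give the hardware engines only as many CPUs as they need to saturate, and leave the rest for software encryption) can be sketched as follows. This is not the MM algorithm itself: the function name, the 2% tolerance and the bandwidth samples are all hypothetical, chosen only so that saturation occurs at four CPUs as reported.

```python
from typing import Dict, Tuple


def split_cpus(hw_bandwidth: Dict[int, float], total_cpus: int,
               tolerance: float = 0.02) -> Tuple[int, int]:
    """Return (hw_cpus, sw_cpus).

    hw_cpus is the smallest measured CPU count whose hardware
    bandwidth is within `tolerance` of the best measured value;
    the remaining CPUs are left for software encryption.
    """
    if not hw_bandwidth:
        raise ValueError("no measurements given")
    peak = max(hw_bandwidth.values())
    for cpus in sorted(hw_bandwidth):
        if hw_bandwidth[cpus] >= (1.0 - tolerance) * peak:
            return cpus, total_cpus - cpus
    raise AssertionError("unreachable: the peak always qualifies")


# Hypothetical MB/s readings per CPU count, for illustration only.
SAMPLE = {1: 900.0, 2: 1800.0, 3: 2700.0, 4: 3480.0, 8: 3500.0, 16: 3500.0}
HW_CPUS, SW_CPUS = split_cpus(SAMPLE, total_cpus=16)
```

With these sample readings the split is four CPUs for the hardware engines and 12 for software encryption; a looser tolerance trades a little hardware bandwidth for more free CPUs.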

As we can see from Figure 18, even though 16 CPUs are responsible for the 3DES encryption, the software encryption bandwidth is still very small (only 271 MB/s). Since 3DES is computation intensive, this algorithm occupies a lot of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by hardware crypto engines, and the improvement is more obvious for larger block sizes, because smaller data blocks induce more invocation cost. The remaining idle CPU can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

Figure 18: Encryption bandwidth for different working flows (DES3-CBC). Encryption bandwidth (MB/s) by block size:

Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
SW | 271.69 | 271.68 | 271.20 | 271.78 | 271.66 | 271.39 | 271.74 | 271.69
HW with 4 CPU | 1270.39 | 2429.30 | 3453.60 | 3489.52 | 3495.98 | 3499.26 | 3500.83 | 3499.89
HW-SW co-design | 1921.24 | 3536.86 | 3667.12 | 3686.88 | 3696.00 | 3701.05 | 3703.26 | 3701.54

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW (16 CPUs, 32 processes, DES3-CBC). Encryption bandwidth improvement (%) by block size:

Block size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
HW-SW co-design vs SW | 607.14 | 1201.85 | 1252.18 | 1256.57 | 1260.52 | 1263.74 | 1262.80 | 1262.41
HW-SW co-design vs HW with 4 CPUs | 51.23 | 45.59 | 6.18 | 5.66 | 5.72 | 5.77 | 5.78 | 5.76

To give a clear comparison, we summarize the performance improvement in Figure 19. As we can see, HW-SW codesign gives nearly 12 times bandwidth improvement compared with software encryption. The improvement is only 6 times for small data blocks (4 K), due to the worse utilization of hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is not so obvious for large data blocks, because software encryption without AES-NI contributes little. Still, we get a 45% to 51% improvement for small data blocks with HW-SW codesign, and an additional 200 to 300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with hardware engines, we suggest that the remaining idle CPU be utilized for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hour the user could take full advantage of both hardware and software for maximum overall performance; during normal working time with fewer encryption requests, the user could allocate only the management resources needed for HACs and try to achieve the best offloading.

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). Network bandwidth (MB/s) by page size:

Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
SW | 258.41 | 419.18 | 504.74 | 608.54 | 679.39 | 720.47 | 741.12 | 756.26
SW-NI | 342.91 | 613.02 | 857.61 | 1182.25 | 1451.02 | 1653.04 | 1761.68 | 1831.19
HW | 228.04 | 409.20 | 573.66 | 814.16 | 980.63 | 1179.44 | 1276.61 | 1336.44
HW-SW co-design | 338.96 | 608.02 | 851.59 | 1173.38 | 1416.41 | 1705.11 | 1818.99 | 1884.28

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web Server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. Similar to the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto lib (SW); (2) software encryption with AES-NI (SW-NI, instruction acceleration); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we show three performance indexes in this section: RPS (requests per second), MB/s (encrypted data, in megabytes, per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
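The conversion from RPS to network bandwidth is not spelled out in the text. A reading that reproduces the reported software-only figures is bandwidth = RPS x page size in KB / 1024, sketched below; the function name is ours, and the two sample rows assume that the tabulated RPS and bandwidth values carry two decimal places.

```python
def rps_to_mb_per_s(rps: float, page_kb: float) -> float:
    """Convert requests per second to MB/s for a fixed page size in KB."""
    if rps < 0 or page_kb <= 0:
        raise ValueError("rps must be >= 0 and page_kb must be > 0")
    return rps * page_kb / 1024.0


# Software-only flow with 16 processes (assumed decimal placement):
# 66153.10 RPS at 4 KB and 53654.50 RPS at 8 KB.
BW_4K = rps_to_mb_per_s(66153.10, 4)
BW_8K = rps_to_mb_per_s(53654.50, 8)
```

These evaluate to roughly 258.41 and 419.18 MB/s, the values plotted for the SW flow at 4 KB and 8 KB. The relation counts page payload only, so protocol overhead is ignored.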

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small page sizes; for example, the maximum improvement reaches 48.64% for the 4 KB page. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of size detection and the scheduling decision. If the page size is equal to or bigger than 128 KB, the overall performance exceeds that of software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative for small requests, HW-SW codesign still leaves room for further optimization. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, AES-NI leaves no CPU idle at all. With HW-SW codesign, however, there is 12.51% to 18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, and there is also a 2.85% to 3.38% performance improvement.

Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512
SW | 66153.10 | 53654.50 | 32303.20 | 19473.40 | 10870.30 | 5763.76 | 2964.48 | 1512.51
SW-NI | 87784.10 | 78466.20 | 54887.00 | 37831.90 | 23216.30 | 13224.30 | 7046.71 | 3662.37
HW | 58378.50 | 52377.60 | 36714.30 | 26053.10 | 15690.00 | 9435.54 | 5106.42 | 2672.87
HW-SW co-design | 86679.40 | 77827.00 | 45499.50 | 34357.70 | 22623.60 | 13670.90 | 7275.66 | 3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512
SW | 65885.90 | 52987.80 | 32030.00 | 19391.20 | 10803.80 | 5744.14 | 2964.18 | 1503.48
SW-NI | 86302.50 | 77827.15 | 53017.20 | 37116.10 | 22920.40 | 13092.50 | 7020.85 | 3657.35
HW | 58499.10 | 54891.60 | 37946.40 | 26972.10 | 16401.90 | 10041.90 | 5486.14 | 2881.66
HW-SW co-design | 85177.20 | 77211.90 | 46404.00 | 36166.60 | 24875.10 | 15347.70 | 8325.19 | 4391.45

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). Network bandwidth improvement (%) and CPU idle (%) by page size:

Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
HW-SW co-design vs SW-NI improvement | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90
HW-SW co-design vs HW improvement | 48.64 | 48.59 | 48.45 | 44.12 | 44.44 | 44.57 | 42.49 | 40.99
CPU idle | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50

If we expect higher network bandwidth for better encryption performance, we can utilize the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to further explore better performance with HW-SW codesign, we increase the number of processes to use the idle CPU. We tested with 32 processes; the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 7. For a clear comparison and analysis, we convert the RPS to network bandwidth in MB/s and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign achieves a 49% to 52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the 16-process case, the influence is more obvious for large block sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of data size detection and the scheduling decision. If the page size is equal to or bigger than 64 KB, the overall performance exceeds that of software-only or hardware-only encryption.

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). Network bandwidth (MB/s) by page size:

Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
SW | 257.37 | 413.97 | 500.47 | 605.98 | 675.24 | 718.02 | 741.05 | 751.74
SW-NI | 337.12 | 608.26 | 828.39 | 1159.88 | 1432.53 | 1636.56 | 1755.21 | 1828.68
HW | 228.51 | 421.03 | 592.91 | 842.88 | 1025.12 | 1255.24 | 1371.54 | 1440.83
HW-SW co-design | 334.03 | 603.22 | 820.70 | 1159.42 | 1562.83 | 1923.08 | 2092.16 | 2196.43

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs). Network bandwidth improvement (%) and CPU idle (%) by page size:

Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
HW-SW co-design vs SW-NI improvement | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11
HW-SW co-design vs HW improvement | 46.17 | 43.27 | 38.42 | 37.55 | 52.45 | 53.20 | 52.54 | 52.44
CPU idle | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption method. Hardware engines do not perform as well as AES-NI for small pages due to the context switch cost; therefore, ACSA automatically adopts software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for page size 512 KB.

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs). Network bandwidth improvement (%) and CPU idle (%) by page size:

Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
HW-SW co-design vs SW-NI improvement, 16 processes | -1.15 | -0.81 | -0.70 | -0.75 | -2.38 | 3.15 | 3.25 | 2.90
HW-SW co-design vs SW-NI improvement, 32 processes | -0.92 | -0.83 | -0.93 | -0.04 | 9.10 | 17.51 | 19.20 | 20.11
CPU idle, 16 processes | 1.50 | 1.60 | 1.70 | 1.80 | 15.50 | 18.00 | 20.00 | 20.50
CPU idle, 32 processes | 2.50 | 2.00 | 2.00 | 1.80 | 1.50 | 1.80 | 1.50 | 1.60

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page Size (KB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512
SW | 36334.70 | 25207.40 | 12791.60 | 6900.93 | 3589.73 | 1833.76 | 924.40 | 466.12
HW | 58275.70 | 50717.80 | 25772.70 | 14525.70 | 7645.65 | 3902.54 | 1968.62 | 990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions.

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology offers a better trade-off between performance and CPU idle. It is possible to get the same network bandwidth/RPS as AES-NI but still have additional CPU resources available. The user can use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance for 2 working flows in this section. The first is software encryption without AES-NI; the other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We get 1.6 to 2.1 times the performance when we adopt hardware encryption.
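As a quick check of the stated speedup range, the ratio of hardware to software RPS can be computed per page size from Table 8. The values below assume two decimal places in the tabulated RPS, which is our reading of the extracted table; the function name is hypothetical.

```python
from typing import List

# Table 8 RPS values per page size (assumed decimal placement).
PAGE_KB = [4, 8, 16, 32, 64, 128, 256, 512]
SW_RPS = [36334.70, 25207.40, 12791.60, 6900.93,
          3589.73, 1833.76, 924.40, 466.12]
HW_RPS = [58275.70, 50717.80, 25772.70, 14525.70,
          7645.65, 3902.54, 1968.62, 990.88]


def speedups(hw: List[float], sw: List[float]) -> List[float]:
    """Element-wise ratio of hardware RPS to software RPS."""
    if len(hw) != len(sw):
        raise ValueError("series must have the same length")
    return [h / s for h, s in zip(hw, sw)]


RATIOS = speedups(HW_RPS, SW_RPS)
LOWEST, HIGHEST = min(RATIOS), max(RATIOS)
```

The smallest ratio occurs at the 4 KB page and the largest at 64 KB; rounded to one decimal place they are 1.6 and 2.1, consistent with the range quoted above.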

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60% to 112% improvement compared with software encryption. Besides the improved performance, 13.62% to 89.54% of the CPU remains idle with hardware encryption.

According to the OpenSSL test, the maximum performance improvement is 12 times. However, there is only a 2 times RPS improvement in end-to-end testing. The problem is a constraint of the testing platform: we found that the software decryption speed in the client had become a bottleneck, with all the client's CPUs completely loaded.

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth. Network bandwidth (MB/s) by page size:

Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
SW | 141.93 | 196.93 | 199.87 | 215.65 | 224.36 | 229.22 | 231.10 | 233.06
HW | 227.64 | 396.23 | 402.70 | 453.93 | 477.85 | 487.82 | 492.16 | 495.44

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle. Network bandwidth improvement (%) and CPU idle (%) by page size:

Page size | 4 KB | 8 KB | 16 KB | 32 KB | 64 KB | 128 KB | 256 KB | 512 KB
HW vs SW improvement | 60.39 | 101.20 | 101.48 | 110.49 | 112.99 | 112.82 | 112.96 | 112.58
CPU idle | 13.62 | 15.40 | 53.15 | 70.40 | 75.07 | 86.04 | 88.17 | 89.54

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which is able to adopt a crypto mode adaptively and dynamically according to the request character and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to advance the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of a 40 Gbps network, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we get a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended for energy efficiency exploration [34], providing emerging solutions for data centers besides X86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), the Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1-11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464-474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118-1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56-C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article: David Brumley and Dan Boneh, Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13-16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005: Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91-96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374-379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM cortex TM -A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812-826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115-123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C D Schmidt and C D CranorHalf-SyncHalf-Async 1998

24 Security and Communication Networks

[24] X Zhang and K K Parhi ldquoHigh-speed VLSI architectures forthe AES algorithmrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 12 no 9 pp 957ndash967 2004

[25] P Chodowiec and K Gaj ldquoVery compact FPGA implementa-tion of the AES algorithmrdquo in Proceedings of the InternationalWorkshop on Cryptographic Hardware and Embedded SystemsSpringer Heidelberg Berlin Germany 2003

[26] G Singh ldquoA study of encryption algorithms (RSA DES 3DESand AES) for information securityrdquo International Journal ofComputer Applications vol 67 no 19 pp 33ndash38 2013

[27] G Piyush and K Sandeep ldquoA comparative analysis of SHA andMD5 algorithmrdquo Architecture 1 p 5 2014

[28] J Viega P Chandra and M Messier Network Security withOpenSSL Oreilly Publications 1st edition 2002

[29] B Welton D Kimpe J Cope C M Patrick K Iskra andR Ross ldquoImproving IO forwarding throughput with datacompressionrdquo in Proceedings of the 2011 IEEE InternationalConference on Cluster Computing CLUSTER 2011 pp 438ndash445USA September 2011

[30] A Timor A Mendelson Y Birk and N Suri ldquoUsing under-utilized CPU resources to enhance its reliabilityrdquo IEEE Trans-actions on Dependable and Secure Computing vol 7 no 1 pp94ndash109 2010

[31] httpnginxorgen[32] httphttpdapacheorgdocscurrentprogramsabhtml[33] httpwwwintelcomcontentwwwusenarchitecture-and-

technologyintel-quick-assist-technology-overviewhtml[34] D Kline N Parshook X Ge et al ldquoHolistically evaluating

the environmental impacts in modern computing systemsrdquoin Proceedings of the 7th International Green and SustainableComputing Conference IGSC 2016 pp 1ndash8 China November2016

[35] K Jonathan ldquoGrowth in data center electricity use 2005 to 2010rdquoA report by Analytical Press completed at the request of The NewYork Times 9 2011

[36] A K Jones ldquoGreen computing new challenges and opportuni-tiesrdquo in Proceedings of the on Great Lakes Symposium on VLSIp 3 ACM 2017


The encryption bandwidths with different working flows (AES-256-CBC), in MB/s:

Block size            4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
SW                 1272.68   1272.31   1269.26   1263.54   1263.99   1258.23   1247.49   1251.86
SW-NI             11772.10  11804.24  11817.64  11809.72  11802.95  11810.72  11733.47  11658.08
HW                 1650.75   3188.08   5891.48   5932.79   5954.03   5964.93   5970.58   5966.48
HW-SW codesign    11446.83  11635.55  12576.91  13948.63  15952.37  16989.31  16956.30  16855.39

Figure 15: Encryption bandwidths with SW, SW-NI, HW, and HW-SW codesign for algorithm AES-256-CBC.

16 CPUs, 32 processes, AES-256-CBC: HW-SW codesign encryption bandwidth improvement (%):

Block size                      4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
HW-SW codesign vs SW-NI        -2.76     -1.43      6.42     18.11     35.16     43.85     44.51     44.58
HW-SW codesign vs HW          593.43    264.97    113.48    135.11    167.93    184.82    184.00    182.50
HW-SW codesign vs SW          799.43    814.52    890.89   1003.93   1162.06   1250.25   1259.23   1246.43

Figure 16: Performance comparisons with HW-SW codesign and SW-NI/HW.

[Line chart: the encryption bandwidths of HW with different numbers of CPUs (3DES-CBC); x-axis: block size (4 KB to 512 KB); y-axis: encryption bandwidth (MB/s); one series per CPU count from 1 to 16.]

Figure 17: Encryption bandwidths with hardware engines at different number of CPUs.

As Figures 15 and 16 show, encryption with the hardware engines outperforms the software crypto library but falls short of AES-NI. The reason is the invocation cost of the hardware engines, which includes the context switches and mode switches analyzed in Section 4.1. With AES-NI, by contrast, the encryption function is accelerated directly by instructions, and no additional operations are needed. The proposed HW-SW codesign is the best choice in almost all situations. It improves performance by 799% to 1246% over software encryption and by 182% to 593% over hardware-only encryption. Even compared with AES-NI, the proposed working flow gains 18% to 44% for large data blocks. For block sizes below 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than that of AES-NI. The reason is that the threshold of the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI whenever the data size is smaller than the threshold. Because CPU resources are needed to check the data size and make the scheduling decision, the maximum encryption bandwidth is somewhat lower than that of AES-NI alone. This overhead shrinks as the data size grows, because the scheduling check runs less frequently. For data sizes of 16 KB and above, AES-NI and the hardware engines cooperate to deliver better overall system performance. Following the MM strategy, in this case we allocated 13 processes to hardware encryption and 19 to software encryption. The recorded results show that hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
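The size-based routing described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the ACSA implementation: the names `Dispatcher`, `route`, and `release_hw` are hypothetical, and falling back to AES-NI when every hardware worker is busy is our assumption. Only the 16 KB threshold and the 13/19 process split are taken from the text.

```python
from dataclasses import dataclass

# Threshold from the text: data smaller than 16 KB is encrypted with AES-NI.
THRESHOLD_BYTES = 16 * 1024


@dataclass
class Dispatcher:
    """Hypothetical size-based data filter (illustrative only)."""

    hw_workers: int = 13  # processes reserved for the hardware engines (from the text)
    sw_workers: int = 19  # processes reserved for AES-NI software encryption (from the text)
    hw_busy: int = 0      # hardware workers currently occupied

    def route(self, size_bytes: int) -> str:
        """Return "aesni" or "hw" for a request of the given size."""
        if size_bytes < THRESHOLD_BYTES:
            # Small block: the invocation cost would dominate, so stay on the CPU.
            return "aesni"
        if self.hw_busy < self.hw_workers:
            self.hw_busy += 1
            return "hw"
        # Assumption: when all hardware workers are occupied, use AES-NI instead.
        return "aesni"

    def release_hw(self) -> None:
        """Mark one hardware worker as free again."""
        if self.hw_busy > 0:
            self.hw_busy -= 1


if __name__ == "__main__":
    d = Dispatcher()
    for size_kb in (4, 8, 16, 64):
        print(size_kb, "KB ->", d.route(size_kb * 1024))
```

In this sketch the size check is a single comparison per request, which is consistent with the observation that the filtering overhead matters less as blocks grow and fewer checks are made per byte.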

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the hardware engines perform much better than software for this algorithm, and applying the data filter and dynamic scheduling is not worthwhile. Therefore, we applied only the MM strategy, to make full use of the hardware engines at minimal management cost. We first evaluated the encryption bandwidth of the hardware engines with different numbers of CPUs. As shown in Figure 17, the hardware accelerators perform poorly when the data size is small, because of the cost induced by the many context and mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. The results also show that four CPUs are enough to reach almost the best encryption bandwidth.

To get maximum performance from both the hardware and software encryption engines, the remaining CPUs can be allocated to software encryption. For most block sizes it is reasonable to use only four CPUs for hardware encryption, which leaves the other 12 CPUs available for additional encryption tasks. For a clearer presentation, we tested three working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) hardware encryption with the hardware engines, using only four CPUs (HW with four CPUs); and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).
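The idea of giving the hardware engines only as many CPUs as they need, and handing the rest to software encryption, can be illustrated with the short sketch below. It is not the MM algorithm from the paper: the function names, the 2% tolerance, and the bandwidth numbers in `measured` are all hypothetical values chosen for illustration (the per-CPU values of Figure 17 are not reproduced here).

```python
from typing import Dict, Tuple


def min_cpus_for_hw(bandwidth_by_cpus: Dict[int, float], tolerance: float = 0.02) -> int:
    """Smallest CPU count whose bandwidth is within `tolerance` of the peak."""
    if not bandwidth_by_cpus:
        raise ValueError("no measurements given")
    peak = max(bandwidth_by_cpus.values())
    return min(
        cpus
        for cpus, bandwidth in bandwidth_by_cpus.items()
        if bandwidth >= (1.0 - tolerance) * peak
    )


def split_cpus(total_cpus: int, hw_cpus: int) -> Tuple[int, int]:
    """Return (CPUs managing hardware engines, CPUs left for software)."""
    return hw_cpus, total_cpus - hw_cpus


if __name__ == "__main__":
    # Invented numbers (MB/s) for one block size; not measured data.
    measured = {1: 900.0, 2: 1800.0, 4: 3480.0, 8: 3495.0, 16: 3500.0}
    hw = min_cpus_for_hw(measured)
    print(split_cpus(16, hw))  # prints (4, 12)
```

With these invented measurements the sketch selects four CPUs for the hardware engines and leaves 12 for software encryption, matching the split described in the text.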

As Figure 18 shows, even with 16 CPUs dedicated to 3DES encryption, the software encryption bandwidth remains very small (only about 271 MB/s). Because 3DES is computation intensive, it occupies a large share of system resources when no instruction acceleration is available. The hardware crypto engines accelerate the encryption greatly, and the improvement is more pronounced for larger block sizes, because smaller data blocks induce more invocation cost. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

The encryption bandwidths with different working flows (DES3-CBC), in MB/s:

Block size            4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
SW                  271.69    271.68    271.20    271.78    271.66    271.39    271.74    271.69
HW with 4 CPUs     1270.39   2429.30   3453.60   3489.52   3495.98   3499.26   3500.83   3499.89
HW-SW codesign     1921.24   3536.86   3667.12   3686.88   3696.00   3701.05   3703.26   3701.54

Figure 18: Encryption bandwidth for different working flows.

For a clear comparison, we summarize the performance improvement in Figure 19. HW-SW codesign achieves a nearly 12 times bandwidth improvement over software encryption. For small data blocks (4 KB) the improvement is only about 6 times, because the hardware engines are utilized less effectively. Compared with hardware-only encryption, the improvement from HW-SW codesign is modest for large data blocks, because software encryption without AES-NI contributes little. Still, HW-SW codesign yields a 45% to 51% improvement for small data blocks and an additional 200 to 300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to total crypto performance compared with the hardware engines, we suggest using the remaining idle CPU capacity for other important tasks, such as database management. Of course, the trade-off between CPU idle and encryption performance is left to the user. For example, during peak hours the user could exploit both hardware and software for maximum overall performance; during normal working hours with fewer encryption requests, the user could allocate only the management resources needed by the HACs and offload as much as possible.

16 CPUs, 32 processes, DES3-CBC: HW-SW codesign encryption bandwidth improvement (%):

Block size                            4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
HW-SW codesign vs SW                607.14   1201.85   1252.18   1256.57   1260.52   1263.74   1262.80   1262.41
HW-SW codesign vs HW with 4 CPUs     51.23     45.59      6.18      5.66      5.72      5.77      5.78      5.76

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW.

16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384: network bandwidth (MB/s):

Page size             4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
SW                  258.41    419.18    504.74    608.54    679.39    720.47    741.12    756.26
SW-NI               342.91    613.02    857.61   1182.25   1451.02   1653.04   1761.68   1831.19
HW                  228.04    409.20    573.66    814.16    980.63   1179.44   1276.61   1336.44
HW-SW codesign      338.96    608.02    851.59   1173.38   1416.41   1705.11   1818.99   1884.28

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384.
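The 3DES-CBC improvement percentages reported in Figure 19 follow from the Figure 18 bandwidths by the usual relative-gain formula. The sketch below recomputes the 4 KB column as a check; the function name is ours, and the bandwidths are read from Figure 18 with two decimal places, which is our reading of the extracted digits (it agrees with the "271 MB/s" and "3703 MB/s" values quoted in the text).

```python
def improvement_percent(new_mb_s: float, base_mb_s: float) -> float:
    """Relative bandwidth gain of `new_mb_s` over `base_mb_s`, in percent."""
    return (new_mb_s - base_mb_s) / base_mb_s * 100.0


if __name__ == "__main__":
    # 4 KB column of Figure 18 (MB/s), read with two decimal places.
    sw, hw_4cpu, codesign = 271.69, 1270.39, 1921.24

    print(f"vs SW: {improvement_percent(codesign, sw):.2f}%")
    print(f"vs HW with 4 CPUs: {improvement_percent(codesign, hw_4cpu):.2f}%")
```

The two results agree with the 4 KB entries of Figure 19 (about 607% over software and about 51% over hardware with four CPUs), which is where the "6 times" and "45% to 51%" statements in the text come from.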

6.3. End-to-End Testing. We adopt a stress test to evaluate end-to-end performance. By pushing different workloads to the Web server, we obtain the encryption bandwidth and CPU idle under different working conditions. We tested various page sizes to observe how performance varies with request characteristics.

6.3.1. Cipher Suite ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) software encryption with AES-NI instruction acceleration (SW-NI); (3) hardware encryption using the hardware engines with the proposed design flow (HW); and (4) hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request page sizes, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and with 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (megabytes of encrypted data per second), and CPU idle.

(1) 16 CPUs, 16 Processes. Table 6 lists the RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384. For clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
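The conversion from RPS to network bandwidth is not spelled out in the text. A minimal sketch, assuming that bandwidth is simply RPS multiplied by the page size and that 1 MB = 1024 KB, is given below; this assumption reproduces, for example, the SW and SW-NI 4 KB entries of Figure 20 from the 16-process RPS table, read with two decimal places.

```python
def rps_to_mb_per_s(rps: float, page_kb: float) -> float:
    """Convert requests per second to MB/s, assuming 1 MB = 1024 KB."""
    return rps * page_kb / 1024.0


if __name__ == "__main__":
    # 4 KB pages, 16 processes: RPS read from the RPS table with two decimals.
    print(round(rps_to_mb_per_s(66153.10, 4), 2))  # SW
    print(round(rps_to_mb_per_s(87784.10, 4), 2))  # SW-NI
```

The results, about 258.41 MB/s and 342.91 MB/s, match the corresponding bandwidth values plotted in Figure 20.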

As Figure 20 shows, HW-SW codesign achieves greater network bandwidth than hardware-only encryption, because the proposed data aggregation and adaptive scheduling greatly reduce the invocation cost. The effect is more pronounced for small page sizes; for example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is smaller than 128 KB, the network bandwidth of HW-SW codesign is slightly lower than that of AES-NI, because of the cost of size detection and scheduling decisions. If the page size is 128 KB or larger, HW-SW codesign outperforms both software-only and hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, HW-SW codesign still leaves room for further optimization. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, AES-NI leaves no CPU idle at all. With HW-SW codesign, however, 12.51% to 18.51% more CPU is idle than with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, together with a 2.85% to 3.38% performance improvement.

Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)           4         8        16        32        64       128       256       512
SW                66153.10  53654.50  32303.20  19473.40  10870.30   5763.76   2964.48   1512.51
SW-NI             87784.10  78466.20  54887.00  37831.90  23216.30  13224.30   7046.71   3662.37
HW                58378.50  52377.60  36714.30  26053.10  15690.00   9435.54   5106.42   2672.87
HW-SW codesign    86679.40  77827.00  45499.50  34357.70  22623.60  13670.90   7275.66   3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)           4         8        16        32        64       128       256       512
SW                65885.90  52987.80  32030.00  19391.20  10803.80   5744.14   2964.18   1503.48
SW-NI             86302.50  77827.15  53017.20  37116.10  22920.40  13092.50   7020.85   3657.35
HW                58499.10  54891.60  37946.40  26972.10  16401.90  10041.90   5486.14   2881.66
HW-SW codesign    85177.20  77211.90  46404.00  36166.60  24875.10  15347.70   8325.19   4391.45

16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384: network bandwidth improvement and CPU idle (%):

Page size                       4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
HW-SW codesign vs SW-NI        -1.15     -0.81     -0.70     -0.75     -2.38      3.15      3.25      2.90
HW-SW codesign vs HW           48.64     48.59     48.45     44.12     44.44     44.57     42.49     40.99
CPU idle                        1.50      1.60      1.70      1.80     15.50     18.00     20.00     20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384.

To obtain higher network bandwidth, and thus better encryption performance, the remaining CPU resources can be put to use. As an example of the trade-off between CPU idle and performance, we also present test results with more processes below.

(2) 16 CPUs, 32 Processes. As Figure 21 shows, the HW-SW codesign working flow leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with HW-SW codesign, we increased the number of processes to use the idle CPU. We tested with 32 processes; Table 7 lists the RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384.

For clear comparison and analysis, we convert the RPS to network bandwidth in MB/s and present the results in Figure 22.

As Figure 22 shows, HW-SW codesign achieves a 49% to 52% bandwidth improvement over hardware-only encryption, because the proposed data aggregation and adaptive scheduling greatly reduce the invocation cost. Unlike the case with 16 processes, however, the effect is more pronounced for large page sizes, since more idle CPU is available for performance improvement. If the page size is smaller than 64 KB, the network bandwidth of HW-SW codesign is slightly lower than that of AES-NI, because of the cost of data size detection and scheduling decisions. If the page size is 64 KB or larger, HW-SW codesign outperforms both software-only and hardware-only encryption.

16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384: network bandwidth (MB/s):

Page size             4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
SW                  257.37    413.97    500.47    605.98    675.24    718.02    741.05    751.74
SW-NI               337.12    608.26    828.39   1159.88   1432.53   1636.56   1755.21   1828.68
HW                  228.51    421.03    592.91    842.88   1025.12   1255.24   1371.54   1440.83
HW-SW codesign      334.03    603.22    820.70   1159.42   1562.83   1923.08   2092.16   2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384.

16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384: network bandwidth improvement and CPU idle (%):

Page size                       4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
HW-SW codesign vs SW-NI        -0.92     -0.83     -0.93     -0.04      9.10     17.51     19.20     20.11
HW-SW codesign vs HW           46.17     43.27     38.42     37.55     52.45     53.20     52.54     52.44
CPU idle                        2.50      2.00      2.00      1.80      1.50      1.80      1.50      1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. For small pages, HW-SW codesign maintains almost the same performance as AES-NI. This is expected, since the request filter chooses the most suitable encryption method: the hardware engines do not perform as well as AES-NI for small pages because of the context switch cost, so ACSA automatically adopts software encryption with AES-NI for them. For large pages, the CPU resources saved by hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still obtain an additional 20% of network bandwidth for a page size of 512 KB.

16 CPUs, ECDHE-RSA-AES256-SHA384: network bandwidth improvement and CPU idle (%):

Page size                                     4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
HW-SW codesign vs SW-NI, 16 processes        -1.15     -0.81     -0.70     -0.75     -2.38      3.15      3.25      2.90
HW-SW codesign vs SW-NI, 32 processes        -0.92     -0.83     -0.93     -0.04      9.10     17.51     19.20     20.11
CPU idle, 16 processes                        1.50      1.60      1.70      1.80     15.50     18.00     20.00     20.50
CPU idle, 32 processes                        2.50      2.00      2.00      1.80      1.50      1.80      1.50      1.60

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384.

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)           4         8        16        32        64       128       256       512
SW                36334.70  25207.40  12791.60   6900.93   3589.73   1833.76    924.40    466.12
HW                58275.70  50717.80  25772.70  14525.70   7645.65   3902.54   1968.62    990.88

(3) Test Conclusion. In this section we tested and analyzed the end-to-end performance of cipher suite ECDHE-RSA-AES256-SHA384 for various page sizes and numbers of processes. From the experimental results we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign improves bandwidth by 53.20% over hardware-only encryption and by 20.11% over AES-NI. The gains come from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology enables a better trade-off between performance and CPU idle. It is possible to reach the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user can use the saved CPU for other important tasks or exploit it for further performance improvement.

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance of only two working flows in this section: software encryption without AES-NI, and hardware encryption with the crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6 to 2.1 times the performance of software encryption.

We summarize the network bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, this is a 60% to 112% improvement over software encryption. Besides the improved performance, hardware encryption still leaves 13.62% to 89.54% of the CPU idle.

In the OpenSSL test, the maximum performance improvement was about 12 times, whereas the end-to-end test shows only about a 2 times RPS improvement. The cause is a constraint of the testing platform: we found that the software decryption speed on the client had become a bottleneck, with all client CPUs fully loaded.

16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA: network bandwidth (MB/s):

Page size             4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
SW                  141.93    196.93    199.87    215.65    224.36    229.22    231.10    233.06
HW                  227.64    396.23    402.70    453.93    477.85    487.82    492.16    495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA: network bandwidth improvement and CPU idle (%):

Page size                       4 KB      8 KB     16 KB     32 KB     64 KB    128 KB    256 KB    512 KB
HW vs SW improvement           60.39    101.20    101.48    110.49    112.99    112.82    112.96    112.58
CPU idle                       13.62     15.40     53.15     70.40     75.07     86.04     88.17     89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

7. Conclusions and Future Work

In this work we presented ACSA (Adaptive Crypto System based on Accelerators), which selects the crypto mode adaptively and dynamically according to request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance in all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for computation, their invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of the hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. With a 40 Gbps network in place, we were able to evaluate system performance in real applications under high workload, on various benchmarks and system configurations. For the 3DES encryption algorithm, which is not supported by AES-NI, we obtained about 12 times acceleration with the accelerators. For AES, which is supported by instruction acceleration, we obtained a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. Furthermore, users can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to their working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed methodology applies regardless of the type of CPU and accelerator. In addition, this work is based on an ARM server and can be extended to explore energy efficiency [34], providing emerging solutions for data centers beyond x86-based architectures [35, 36]. In future work, building on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and codes are available on the lab server. Anyone that would like to obtain access could send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N Khan I Yaqoob I A T Hashem et al ldquoBig data sur-vey technologies opportunities and challengesrdquo The ScientificWorld Journal vol 2014 Article ID 712826 18 pages 2014

[2] L Li K Ota Z Zhang and Y Liu ldquoSecurity and privacyprotection of social networks in big data erardquo MathematicalProblems in Engineering vol 2018 Article ID 6872587 2 pages2018

[3] S Subashini and V Kavitha ldquoA survey on security issues inservice deliverymodels of cloud computingrdquo Journal ofNetworkand Computer Applications vol 34 no 1 pp 1ndash11 2011

[4] J Wilson R S Wahby H Corrigan-Gibbs D Boneh P Levisand K Winstein ldquoTrust but verify auditing the secure internetof thingsrdquo in Proceedings of the 15th ACM International Confer-ence on Mobile Systems Applications and Services MobiSys pp464ndash474 USA 2017

[5] A Eric Young J TimHudson andR EngelschallOpenSSLTheOpen Source Toolkit for ssltls 2011

[6] S Baskaran and P Rajalakshmi ldquoHardware-software co-designof AES on FPGArdquo in Proceedings of the 2012 InternationalConference on Advances in Computing Communications andInformatics ICACCI 2012 pp 1118ndash1122 India August 2012

[7] L Bossuet M Grand L Gaspar V Fischer and G Gog-niat ldquoArchitectures of flexible symmetric key crypto engines-asurvey From hardware coprocessor to multi-crypto-processorsystem on chiprdquo ACM Computing Surveys vol 45 no 4 2013

[8] C Maharak and B Sowanwanichakul ldquoSecurity methods forweb-based applications on embedded systemrdquo in Proceedingsof the IEEE TENCON 2004 - 2004 IEEE Region 10 ConferenceAnalog and Digital Techniques in Electrical Engineering ppC56ndashC59Thailand November 2004

[9] A B Smith C D Jones and E F Roberts ldquoArticle DavidBrumley andDanBoneh Remote TimingAttacks are Practicalrdquoin Proceedings of the 12th Usenix Security Symposium 2003

[10] V P Nambiar and M M Zabidi Accelerating the AES Encryp-tion Function in OpenSSL for Embedded Systems IndersciencePublishers 2009

[11] MKhalilMNazrin andYWHau ldquoImplementation of SHA-2hash function for a digital signature System-on-Chip in FPGArdquoin Proceedings of the 2008 International Conference on ElectronicDesign ICED 2008 Malaysia December 2008

[12] D B Roy S Agrawal C Reberio and D MukhopadhyayldquoAccelerating OpenSSLrsquos ECC with low cost reconfigurablehardwarerdquo in Proceedings of the 2016 International Symposiumon Integrated Circuits ISIC 2016 Singapore December 2016

[13] A Thiruneelakandan and T Thirumurugan ldquoAn approachtowards improved cyber security by hardware acceleration ofOpenSSL cryptographic functionsrdquo in Proceedings of the 1stInternational Conference on Electronics Communication andComputing Technologies 2011 ICECCTrsquo11 pp 13ndash16 IndiaSeptember 2011

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM cortex TM -A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

24 Security and Communication Networks

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/

[32] http://httpd.apache.org/docs/current/programs/ab.html

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.

Figure 17: Encryption bandwidths with hardware engines at different numbers of CPUs (3DES-CBC). Series: 1 to 16 CPUs; x-axis: block size (4 KB to 512 KB); y-axis: encryption bandwidth (MB/s).

As we can see from Figures 15 and 16, encryption with hardware engines performs better than the crypto lib but worse than AES-NI. The reason is the invocation cost of the hardware engines, which involves context switches and mode switches, as we analyzed in Section 4.1. With AES-NI, in contrast, the encryption function is accelerated directly through instructions and no additional operations are needed. The proposed HW-SW codesign is optimal in almost all situations. We can get 799~1246 performance improvement compared with software encryption and 593~182 improvement compared with hardware-only encryption. Even compared with AES-NI, the proposed working flow can get 18~44 performance increase for large data blocks. If the block size is less than 16 KB, such as 4 KB and 8 KB, the maximum encryption bandwidth is slightly lower than with AES-NI. The reason is that the threshold for the data filter is configured as 16 KB in this case, so ACSA invokes AES-NI for data encryption whenever the data size is smaller than the threshold. Since CPU resources are needed for the data size check and the scheduling decision, the maximum encryption bandwidth is somewhat lower than with AES-NI alone. However, this influence decreases as the data size grows, because scheduling checks become less frequent. If the data size is larger than 16 KB, AES-NI and the hardware engines cooperate to deliver better overall system performance. According to the MM strategy, in this case we allocated 13 processes for hardware encryption and 19 for software encryption. As the recorded results show, hardware and software codesign achieves the best encryption bandwidth compared with software-only or hardware-only encryption.
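The data filter described above can be pictured as a simple size check. The following is a minimal sketch only: the function name, the string labels, and the fallback to AES-NI when no hardware engine is available are assumptions made for illustration, not the ACSA implementation. Only the 16 KB threshold comes from the text.

```python
THRESHOLD_BYTES = 16 * 1024  # 16 KB data-filter threshold quoted in the text


def choose_engine(data_size: int, hw_available: bool = True) -> str:
    """Pick an encryption path for one request (illustrative only).

    Requests below the threshold go to AES-NI; larger requests go to the
    hardware engine. Falling back to AES-NI when no engine is available
    is an assumption of this sketch, not something the paper specifies.
    """
    if data_size < THRESHOLD_BYTES:
        return "aes-ni"
    return "hw" if hw_available else "aes-ni"


for size_kb in (4, 8, 16, 64):
    print(f"{size_kb} KB -> {choose_engine(size_kb * 1024)}")
```

Because the check runs on every request, it costs some CPU time even when the outcome is always AES-NI, which matches the small bandwidth loss reported for 4 KB and 8 KB blocks.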

6.2.2. 3DES-CBC. Since AES-NI does not support 3DES-CBC, the performance with hardware engines is much better than with software, and applying the data filter and dynamic scheduling is not worthwhile. Therefore, we only applied the MM strategy to make full use of the hardware engines with minimal management cost. We first evaluated the encryption bandwidths with hardware engines at different numbers of CPUs. As shown in Figure 17, the hardware accelerators do not perform well when the data size is small; the reason is the cost induced by the many context/mode switches. For encryption with hardware engines, the larger the data blocks, the better the performance. As the results show, we can reach almost the best encryption bandwidth with only four CPUs.
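One way to see why larger blocks help is a simple amortization model: each call to the engine pays a fixed invocation cost, so the achieved bandwidth approaches the engine's peak only when that cost is small relative to the transfer time. The sketch below assumes this model; the peak rate and per-call cost are hypothetical placeholders, not measurements from the paper.

```python
def effective_bandwidth(block_bytes: int, peak_bytes_per_s: float,
                        invoke_cost_s: float) -> float:
    """Bytes per second achieved when every call pays a fixed invocation cost."""
    transfer_time = block_bytes / peak_bytes_per_s
    return block_bytes / (invoke_cost_s + transfer_time)


# Hypothetical parameters, chosen only to show the shape of the curve.
PEAK = 3.5e9        # assumed engine peak, bytes/s
INVOKE_COST = 5e-6  # assumed fixed cost per call, seconds

for kb in (4, 16, 64, 512):
    bw = effective_bandwidth(kb * 1024, PEAK, INVOKE_COST)
    print(f"{kb:>4} KB -> {bw / 1e6:8.1f} MB/s")
```

Under this model the bandwidth rises monotonically with block size and never exceeds the peak, which is the qualitative behaviour reported for the hardware engines.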

If we want maximum performance from both hardware and software encryption engines, we can allocate the remaining CPUs to software encryption. Here, for most block sizes, it is reasonable to use only four CPUs for hardware encryption. Therefore, we can take full advantage of the other 12 available CPUs for more encryption tasks. For a clearer presentation, we tested three different working flows: (1) software encryption with the OpenSSL crypto lib (SW), (2) hardware encryption with hardware engines using only four CPUs (HW with four CPUs), and (3) hardware and software codesign with 16 CPUs (HW-SW codesign).
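The split between hardware-driving CPUs and software CPUs can be sketched as choosing the smallest CPU count whose hardware bandwidth is within a tolerance of the best observed value. This is an illustration of the idea of maximizing utilization with minimal overhead, not the paper's MM algorithm; the function name, the 2% tolerance, and the sample measurements are all hypothetical.

```python
from typing import Dict, Tuple


def split_cpus(hw_bandwidth_by_cpus: Dict[int, float], total_cpus: int,
               tolerance: float = 0.02) -> Tuple[int, int]:
    """Return (cpus_for_hardware, cpus_for_software).

    Picks the fewest CPUs whose measured hardware bandwidth is within
    `tolerance` of the peak, and leaves the rest for software encryption.
    """
    peak = max(hw_bandwidth_by_cpus.values())
    hw_cpus = min(n for n, bw in hw_bandwidth_by_cpus.items()
                  if bw >= (1.0 - tolerance) * peak)
    return hw_cpus, total_cpus - hw_cpus


# Placeholder measurements (MB/s per CPU count), invented for illustration.
measured = {1: 900.0, 2: 1800.0, 4: 3480.0, 8: 3500.0, 16: 3500.0}
print(split_cpus(measured, total_cpus=16))
```

With these placeholder numbers the sketch assigns four CPUs to the hardware engines and twelve to software, mirroring the allocation described in the text.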

As we can see from Figure 18, even though 16 CPUs are responsible for the 3DES encryption, the software encryption bandwidth is still very small (only 271 MB/s). Since it is computation intensive, this algorithm occupies a large share of system resources if there is no instruction acceleration. The encryption computation is accelerated greatly by the hardware crypto engines, and the improvement is more obvious for

Encryption bandwidth (MB/s) with different working flows (DES3-CBC)

Block size        4 KB      8 KB      16 KB     32 KB     64 KB     128 KB    256 KB    512 KB
SW                271.69    271.68    271.20    271.78    271.66    271.39    271.74    271.69
HW with 4 CPUs    1270.39   2429.30   3453.60   3489.52   3495.98   3499.26   3500.83   3499.89
HW-SW co-design   1921.24   3536.86   3667.12   3686.88   3696.00   3701.05   3703.26   3701.54

Figure 18: Encryption bandwidth for different working flows.

Encryption bandwidth improvement (%) of HW-SW co-design (16 CPUs, 32 processes, DES3-CBC)

Block size                   4 KB      8 KB      16 KB     32 KB     64 KB     128 KB    256 KB    512 KB
vs SW                        607.14    1201.85   1252.18   1256.57   1260.52   1263.74   1262.80   1262.41
vs HW with 4 CPUs            51.23     45.59     6.18      5.66      5.72      5.77      5.78      5.76

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW.

larger block sizes. The reason is that more invocation cost is induced for smaller data blocks. The remaining idle CPU can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

To give a clear comparison, we summarize the performance improvement in Figure 19. As we can see, there is nearly 12 times bandwidth improvement through HW-SW codesign compared with software encryption. There is only 6 times improvement for small data blocks (4 KB) due to the poorer utilization of the hardware engines. Compared with hardware-only encryption, the improvement with HW-SW codesign is less obvious for large data blocks. The reason is the poor contribution of software encryption without AES-NI. Still, we get a 45%~51% improvement for small data blocks with HW-SW codesign and an additional 200~

Network bandwidth (MB/s), 16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384

Page size         4 KB      8 KB      16 KB     32 KB     64 KB     128 KB    256 KB    512 KB
SW                258.41    419.18    504.74    608.54    679.39    720.47    741.12    756.26
SW-NI             342.91    613.02    857.61    1182.25   1451.02   1653.04   1761.68   1831.19
HW                228.04    409.20    573.66    814.16    980.63    1179.44   1276.61   1336.44
HW-SW co-design   338.96    608.02    851.59    1173.38   1416.41   1705.11   1818.99   1884.28

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384.

300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU be used for other important tasks, such as database management. Of course, it is up to the user to decide the trade-off between CPU idle and encryption performance. For example, during rush hour the user could take full advantage of both hardware and software for maximum overall performance; during normal working time with fewer encryption requirements, the user could allocate only the management resources needed for the HACs and aim for the best offloading.
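The improvement percentages in Figure 19 appear to follow directly from the bandwidths in Figure 18. As a worked check, the sketch below recomputes the 4 KB column, reading the extracted bandwidth values with two decimal places (an assumption that is consistent with the 271 MB/s and 3703 MB/s quoted in the text). The helper name is illustrative.

```python
def improvement_pct(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


# 4 KB column of Figure 18 (MB/s), decimals assumed as described above.
sw = 271.69
hw_4cpu = 1270.39
hw_sw = 1921.24

print(f"HW-SW vs SW:        {improvement_pct(hw_sw, sw):.2f}%")
print(f"HW-SW vs HW (4 CPU): {improvement_pct(hw_sw, hw_4cpu):.2f}%")
```

The two results correspond to the roughly 6 times and 51% improvements cited for small data blocks.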

6.3. End-to-End Testing. We adopt a pressure test to evaluate the end-to-end performance. By pushing different workloads to the Web server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how performance varies with different request characteristics.

6.3.1. Cipher Suite: ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto lib (SW), (2) software encryption with AES-NI (SW-NI, instruction acceleration), (3) hardware encryption using hardware engines with the proposed design flow (HW), and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data, in megabytes, per second), and CPU idle.

(1) 16 CPUs, 16 Processes. We show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth in MB/s and present the results in Figure 20.
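The conversion from RPS to network bandwidth can be reproduced as requests per second times page size, assuming 1 MB = 1024 KB. The sketch below checks this against two 4 KB entries measured with 16 processes, reading the extracted values with two decimal places (an assumption); the function name is illustrative.

```python
def rps_to_mb_per_s(rps: float, page_kb: float) -> float:
    """Convert requests per second at a given page size (KB) to MB/s."""
    return rps * page_kb / 1024.0


# 4 KB pages, 16 processes: SW and SW-NI request rates (decimals assumed).
print(f"SW:    {rps_to_mb_per_s(66153.10, 4):.2f} MB/s")
print(f"SW-NI: {rps_to_mb_per_s(87784.10, 4):.2f} MB/s")
```

Both values agree with the corresponding 4 KB bandwidths plotted for the 16-process case, which supports this reading of the conversion.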

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small block sizes. For example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance exceeds that of software-only or hardware-only encryption.

Although the bandwidth improvement of the proposed methodology is not obvious, and is even slightly negative, for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. For HW-SW codesign, however, there is 12.51%~18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%~23.83%, and there is also a 2.85%~3.38% performance improvement.

Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)    4          8          16         32         64         128        256        512
SW                66153.10   53654.50   32303.20   19473.40   10870.30   5763.76    2964.48    1512.51
SW-NI             87784.10   78466.20   54887.00   37831.90   23216.30   13224.30   7046.71    3662.37
HW                58378.50   52377.60   36714.30   26053.10   15690.00   9435.54    5106.42    2672.87
HW-SW co-design   86679.40   77827.00   45499.50   34357.70   22623.60   13670.90   7275.66    3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)    4          8          16         32         64         128        256        512
SW                65885.90   52987.80   32030.00   19391.20   10803.80   5744.14    2964.18    1503.48
SW-NI             86302.50   77827.15   53017.20   37116.10   22920.40   13092.50   7020.85    3657.35
HW                58499.10   54891.60   37946.40   26972.10   16401.90   10041.90   5486.14    2881.66
HW-SW co-design   85177.20   77211.90   46404.00   36166.60   24875.10   15347.70   8325.19    4391.45

Network bandwidth improvement (%) and CPU idle (%), 16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384

Page size                  4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW-SW co-design vs SW-NI   -1.15    -0.81    -0.70    -0.75    -2.38    3.15     3.25     2.90
HW-SW co-design vs HW      48.64    48.59    48.45    44.12    44.44    44.57    42.49    40.99
CPU idle                   1.50     1.60     1.70     1.80     15.50    18.00    20.00    20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384.

If we want higher network bandwidth for better encryption performance, we can use the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore further performance gains with HW-SW codesign, we increased the number of processes to use the idle CPU. We tested with 32 processes and show the RPS with different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth in MB/s and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign achieves a 49%~52% bandwidth improvement compared with hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the case with 16 processes, the influence is more obvious for large block sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling

Network bandwidth (MB/s), 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384

Page size         4 KB      8 KB      16 KB     32 KB     64 KB     128 KB    256 KB    512 KB
SW                257.37    413.97    500.47    605.98    675.24    718.02    741.05    751.74
SW-NI             337.12    608.26    828.39    1159.88   1432.53   1636.56   1755.21   1828.68
HW                228.51    421.03    592.91    842.88    1025.12   1255.24   1371.54   1440.83
HW-SW co-design   334.03    603.22    820.70    1159.42   1562.83   1923.08   2092.16   2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384.

Network bandwidth improvement (%) and CPU idle (%), 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384

Page size                  4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW-SW co-design vs SW-NI   -0.92    -0.83    -0.93    -0.04    9.10     17.51    19.20    20.11
HW-SW co-design vs HW      46.17    43.27    38.42    37.55    52.45    53.20    52.54    52.44
CPU idle                   2.50     2.00     2.00     1.80     1.50     1.80     1.50     1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384.

decision. If the page size is equal to or larger than 64 KB, the overall performance exceeds that of software-only or hardware-only encryption.

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption path. Hardware engines do not perform as well as AES-NI for small pages due to the context switch cost. Therefore, ACSA automatically adopts software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for a page size of 512 KB.

Network bandwidth improvement (%) and CPU idle (%), 16 CPUs, ECDHE-RSA-AES256-SHA384

Page size                                  4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW-SW co-design vs SW-NI, 16 processes     -1.15    -0.81    -0.70    -0.75    -2.38    3.15     3.25     2.90
HW-SW co-design vs SW-NI, 32 processes     -0.92    -0.83    -0.93    -0.04    9.10     17.51    19.20    20.11
CPU idle, 16 processes                     1.50     1.60     1.70     1.80     15.50    18.00    20.00    20.50
CPU idle, 32 processes                     2.50     2.00     2.00     1.80     1.50     1.80     1.50     1.60

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384.

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)    4          8          16         32         64         128        256        512
SW                36334.70   25207.40   12791.60   6900.93    3589.73    1833.76    924.40     466.12
HW                58275.70   50717.80   25772.70   14525.70   7645.65    3902.54    1968.62    990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 with diverse page sizes and different numbers of available processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology makes a better trade-off between performance and CPU idle possible. It is possible to get the same network bandwidth/RPS as with AES-NI while still having additional CPU resources available. The user can use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite: ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI. The other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We get 1.6~2.1 times the performance when we adopt hardware encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60%~112% improvement compared with software encryption. Besides the improved performance, there is still 13.62%~89.54% CPU idle available with hardware encryption.

According to the OpenSSL test, the maximum performance improvement is 12 times. However, there is only a 2 times RPS improvement in end-to-end testing. The problem is a constraint of the testing platform: we found that the software decryption speed in the client had become a bottleneck, with all client CPUs completely loaded.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which is able to adopt the crypto mode adaptively and dynamically according to the request character and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU

Network bandwidth (MB/s), 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA

Page size   4 KB      8 KB      16 KB     32 KB     64 KB     128 KB    256 KB    512 KB
SW          141.93    196.93    199.87    215.65    224.36    229.22    231.10    233.06
HW          227.64    396.23    402.70    453.93    477.85    487.82    492.16    495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

Network bandwidth improvement (%) and CPU idle (%), 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA

Page size                4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW vs SW improvement     60.39    101.20   101.48   110.49   112.99   112.82   112.96   112.58
CPU idle                 13.62    15.40    53.15    70.40    75.07    86.04    88.17    89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we get about 12 times acceleration with accelerators. For typical AES encryption, which is supported by instruction acceleration, we get a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy, freeing CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. In addition, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we intend to study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for ssl/tls, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article David Brumley and Dan Boneh. Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM cortex TM -A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

24 Security and Communication Networks

[24] X Zhang and K K Parhi ldquoHigh-speed VLSI architectures forthe AES algorithmrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 12 no 9 pp 957ndash967 2004

[25] P Chodowiec and K Gaj ldquoVery compact FPGA implementa-tion of the AES algorithmrdquo in Proceedings of the InternationalWorkshop on Cryptographic Hardware and Embedded SystemsSpringer Heidelberg Berlin Germany 2003

[26] G Singh ldquoA study of encryption algorithms (RSA DES 3DESand AES) for information securityrdquo International Journal ofComputer Applications vol 67 no 19 pp 33ndash38 2013

[27] G Piyush and K Sandeep ldquoA comparative analysis of SHA andMD5 algorithmrdquo Architecture 1 p 5 2014

[28] J Viega P Chandra and M Messier Network Security withOpenSSL Oreilly Publications 1st edition 2002

[29] B Welton D Kimpe J Cope C M Patrick K Iskra andR Ross ldquoImproving IO forwarding throughput with datacompressionrdquo in Proceedings of the 2011 IEEE InternationalConference on Cluster Computing CLUSTER 2011 pp 438ndash445USA September 2011

[30] A Timor A Mendelson Y Birk and N Suri ldquoUsing under-utilized CPU resources to enhance its reliabilityrdquo IEEE Trans-actions on Dependable and Secure Computing vol 7 no 1 pp94ndash109 2010

[31] httpnginxorgen[32] httphttpdapacheorgdocscurrentprogramsabhtml[33] httpwwwintelcomcontentwwwusenarchitecture-and-

technologyintel-quick-assist-technology-overviewhtml[34] D Kline N Parshook X Ge et al ldquoHolistically evaluating

the environmental impacts in modern computing systemsrdquoin Proceedings of the 7th International Green and SustainableComputing Conference IGSC 2016 pp 1ndash8 China November2016

[35] K Jonathan ldquoGrowth in data center electricity use 2005 to 2010rdquoA report by Analytical Press completed at the request of The NewYork Times 9 2011

[36] A K Jones ldquoGreen computing new challenges and opportuni-tiesrdquo in Proceedings of the on Great Lakes Symposium on VLSIp 3 ACM 2017


Block size       4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW               271.69   271.68   271.20   271.78   271.66   271.39   271.74   271.69
HW with 4 CPUs   1270.39  2429.30  3453.60  3489.52  3495.98  3499.26  3500.83  3499.89
HW-SW co-design  1921.24  3536.86  3667.12  3686.88  3696.00  3701.05  3703.26  3701.54

Figure 18: Encryption bandwidth for different working flows (DES3-CBC; encryption bandwidth in MB/s versus block size).

Block size                             4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW-SW co-design vs SW (%)              607.14   1201.85  1252.18  1256.57  1260.52  1263.74  1262.80  1262.41
HW-SW co-design vs HW with 4 CPUs (%)  51.23    45.59    6.18     5.66     5.72     5.77     5.78     5.76

Figure 19: Performance comparisons with HW-SW codesign, HW with 4 CPUs, and SW (16 CPUs, 32 processes, DES3-CBC; encryption bandwidth improvement versus block size).

larger block sizes. The reason is the higher invocation cost induced for smaller data blocks. The remaining idle CPU capacity can be used for software encryption, and the maximum aggregate bandwidth reaches 3703 MB/s with hardware and software codesign.

For a clear comparison, we summarize the performance improvement in Figure 19. As we can see, HW-SW codesign gives a nearly 12 times bandwidth improvement compared with software encryption. The improvement is only 6 times for small data blocks (4 KB), because the hardware engines are utilized less efficiently. Compared with hardware-only encryption, the improvement with HW-SW codesign is less pronounced for large data blocks, because software encryption without AES-NI contributes little. Still, HW-SW codesign yields a 45% to 51% improvement for small data blocks and an additional 200 to 300 MB/s of encryption bandwidth for large blocks. Since software encryption contributes little to the total crypto performance compared with the hardware engines, we suggest that the remaining idle CPU capacity be used for other important tasks, such as database management. Of course, it is up to the user to decide on the trade-off between CPU idle and encryption performance. For example, during rush hour the user could take full advantage of both hardware and software for maximum overall performance; during normal working time, with fewer encryption requests, the user could allocate only the management resources needed for the HACs and offload as much as possible.

Page size        4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW               258.41   419.18   504.74   608.54   679.39   720.47   741.12   756.26
SW-NI            342.91   613.02   857.61   1182.25  1451.02  1653.04  1761.68  1831.19
HW               228.04   409.20   573.66   814.16   980.63   1179.44  1276.61  1336.44
HW-SW co-design  338.96   608.02   851.59   1173.38  1416.41  1705.11  1818.99  1884.28

Figure 20: Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth in MB/s versus page size).

6.3. End-to-End Testing. We adopt a stress test to evaluate the end-to-end performance. By pushing different workloads to the Web server, we obtain the encryption bandwidth and CPU idle in different working situations. We tested diverse page sizes to see how performance varies with request characteristics.

6.3.1. Cipher Suite: ECDHE-RSA-AES256-SHA384. As in the OpenSSL testing, we evaluated four working flows: (1) software encryption with the OpenSSL crypto library (SW), (2) software encryption with AES-NI instruction acceleration (SW-NI), (3) hardware encryption using hardware engines with the proposed design flow (HW), and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we report three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 6. For a clear comparison and analysis, we convert RPS to network bandwidth (MB/s) and present the results in Figure 20.
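The conversion from RPS to network bandwidth can be sketched as follows. This is a minimal illustration rather than the authors' tooling: it assumes each request transfers exactly one page, that 1 MB = 1024 KB, and that the SW entry for 4 KB pages in Table 6 reads as 66153.10 requests per second. The function name `rps_to_mb_per_s` is hypothetical.

```python
def rps_to_mb_per_s(rps: float, page_size_kb: float) -> float:
    """Convert requests per second to network bandwidth in MB/s.

    Assumes one page per request and 1 MB = 1024 KB.
    """
    return rps * page_size_kb / 1024.0


# SW working flow, 16 processes, 4 KB pages (RPS read from Table 6).
print(round(rps_to_mb_per_s(66153.10, 4), 2))
```

Under these assumptions the result is about 258.41 MB/s, which agrees with the SW value for 4 KB pages in Figure 20.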

As we can see from Figure 20, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the large reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The effect is more obvious for small block sizes; for example, the maximum improvement reaches 48.64% for 4 KB pages. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance exceeds that of software-only or hardware-only encryption.
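To see why aggregation helps small requests most, consider a toy cost model in which every accelerator call pays a fixed invocation cost plus a per-byte processing time. The model and both constants (`INVOCATION_COST_US`, `ENGINE_MB_PER_S`) are assumptions made for this sketch only; they are not measurements from this work.

```python
INVOCATION_COST_US = 20.0   # hypothetical fixed cost per accelerator call (microseconds)
ENGINE_MB_PER_S = 4000.0    # hypothetical raw engine throughput (MB/s)


def effective_mb_per_s(block_kb: float, blocks_per_call: int = 1) -> float:
    """Effective throughput when `blocks_per_call` blocks share one invocation."""
    payload_mb = block_kb * blocks_per_call / 1024.0
    transfer_us = payload_mb / ENGINE_MB_PER_S * 1e6
    total_us = INVOCATION_COST_US + transfer_us
    return payload_mb / (total_us / 1e6)


for kb in (4, 64, 512):
    single = effective_mb_per_s(kb)
    batched = effective_mb_per_s(kb, blocks_per_call=16)
    print(f"{kb:>3} KB: single={single:8.1f} MB/s, 16 aggregated={batched:8.1f} MB/s")
```

In this model the fixed cost dominates for 4 KB blocks, so sharing one invocation among several blocks raises throughput sharply, while 512 KB blocks gain little because the per-byte time already dominates.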

Although the bandwidth improvement of the proposed methodology is not obvious, and the bandwidth is even slightly lower for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, AES-NI leaves no CPU idle at all. With HW-SW codesign, however, 12.51% to 18.51% of the CPU remains idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67% to 23.83%, and there is also a 2.85% to 3.38% performance improvement.

Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)   4         8         16        32        64        128       256      512
SW               66153.10  53654.50  32303.20  19473.40  10870.30  5763.76   2964.48  1512.51
SW-NI            87784.10  78466.20  54887.00  37831.90  23216.30  13224.30  7046.71  3662.37
HW               58378.50  52377.60  36714.30  26053.10  15690.00  9435.54   5106.42  2672.87
HW-SW co-design  86679.40  77827.00  45499.50  34357.70  22623.60  13670.90  7275.66  3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)   4         8         16        32        64        128       256      512
SW               65885.90  52987.80  32030.00  19391.20  10803.80  5744.14   2964.18  1503.48
SW-NI            86302.50  77827.15  53017.20  37116.10  22920.40  13092.50  7020.85  3657.35
HW               58499.10  54891.60  37946.40  26972.10  16401.90  10041.90  5486.14  2881.66
HW-SW co-design  85177.20  77211.90  46404.00  36166.60  24875.10  15347.70  8325.19  4391.45

Page size                                  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)   -1.15   -0.81   -0.70   -0.75   -2.38   3.15    3.25    2.90
HW-SW co-design vs HW improvement (%)      48.64   48.59   48.45   44.12   44.44   44.57   42.49   40.99
CPU idle (%)                               1.50    1.60    1.70    1.80    15.50   18.00   20.00   20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle versus page size).

If we want higher network bandwidth and thus better encryption performance, we can utilize the remaining CPU resources. As an example of the trade-off between CPU idle and performance, we also present test results with more processes below.

(2) 16 CPUs, 32 Processes. As we can see from Figure 21, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to explore better performance with HW-SW codesign, we increase the number of processes to utilize the idle CPU. We tested with 32 processes; the RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 7. For a clear comparison and analysis, we convert the RPS to network bandwidth (MB/s) and present the results in Figure 22.

As we can see from Figure 22, HW-SW codesign achieves a 49% to 52% bandwidth improvement compared with hardware-only encryption. The reason is the large reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the 16-process case, the effect is more obvious for large block sizes, since more idle CPU is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI, because of the cost of data size detection and the scheduling decision. If the page size is equal to or larger than 64 KB, the overall performance exceeds that of software-only or hardware-only encryption.

Page size        4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW               257.37   413.97   500.47   605.98   675.24   718.02   741.05   751.74
SW-NI            337.12   608.26   828.39   1159.88  1432.53  1636.56  1755.21  1828.68
HW               228.51   421.03   592.91   842.88   1025.12  1255.24  1371.54  1440.83
HW-SW co-design  334.03   603.22   820.70   1159.42  1562.83  1923.08  2092.16  2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth in MB/s versus page size).

Page size                                  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)   -0.92   -0.83   -0.93   -0.04   9.10    17.51   19.20   20.11
HW-SW co-design vs HW improvement (%)      46.17   43.27   38.42   37.55   52.45   53.20   52.54   52.44
CPU idle (%)                               2.50    2.00    2.00    1.80    1.50    1.80    1.50    1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle versus page size).

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As we can see from the figure, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption method. Hardware engines do not perform as well as AES-NI here because of the context-switch cost; therefore, ACSA automatically adopts software encryption with AES-NI for small pages. For large pages, the CPU resources saved through hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for a page size of 512 KB.
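The size-based part of such a request filter can be sketched as below. This is only an illustration under stated assumptions: the names `Backend`, `choose_backend`, and `SIZE_THRESHOLD_KB` are hypothetical, the 64 KB cut-over is borrowed from the 32-process results rather than taken from the ACSA implementation, and the system-load input that ACSA also considers is left out.

```python
from enum import Enum


class Backend(Enum):
    SW_AES_NI = "software (AES-NI)"
    HW_ENGINE = "hardware engine"


# Hypothetical cut-over point, chosen only because the 32-process results
# show the codesign flow overtaking AES-NI from 64 KB pages upward.
SIZE_THRESHOLD_KB = 64


def choose_backend(page_size_kb: float, threshold_kb: float = SIZE_THRESHOLD_KB) -> Backend:
    """Pick an encryption path from the request size alone."""
    if page_size_kb < threshold_kb:
        return Backend.SW_AES_NI   # invocation/context-switch cost would dominate
    return Backend.HW_ENGINE       # large payload amortizes the offload cost


for size in (4, 32, 64, 512):
    print(size, "KB ->", choose_backend(size).value)
```

A real scheduler would presumably tune the threshold per platform and combine it with the current CPU and accelerator load; the fixed constant here is only a placeholder.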

Page size                                                4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement, 16 processes (%)   -1.15   -0.81   -0.70   -0.75   -2.38   3.15    3.25    2.90
HW-SW co-design vs SW-NI improvement, 32 processes (%)   -0.92   -0.83   -0.93   -0.04   9.10    17.51   19.20   20.11
CPU idle, 16 processes (%)                               1.50    1.60    1.70    1.80    15.50   18.00   20.00   20.50
CPU idle, 32 processes (%)                               2.50    2.00    2.00    1.80    1.50    1.80    1.50    1.60

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle versus page size).

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)  4         8         16        32        64       128      256      512
SW              36334.70  25207.40  12791.60  6900.93   3589.73  1833.76  924.40   466.12
HW              58275.70  50717.80  25772.70  14525.70  7645.65  3902.54  1968.62  990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance of cipher suite ECDHE-RSA-AES256-SHA384 for diverse page sizes and different numbers of processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology offers a better trade-off between performance and CPU idle. It is possible to reach the same network bandwidth/RPS as AES-NI while still having additional CPU resources available. The user can use the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite: ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI; the other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6 to 2.1 times the performance of software encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, this is a 60% to 112% improvement compared with software encryption. Besides the improved performance, 13.62% to 89.54% of the CPU remains idle with hardware encryption.
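The improvement percentages can be reproduced from the bandwidth values with a simple relative-gain calculation. The sketch below assumes the Figure 25 entries for 4 KB and 512 KB pages read as 141.93 and 233.06 MB/s (SW) and 227.64 and 495.44 MB/s (HW); the helper name `improvement_percent` is hypothetical.

```python
def improvement_percent(new_mb_per_s: float, baseline_mb_per_s: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new_mb_per_s / baseline_mb_per_s - 1.0) * 100.0


# (SW, HW) network bandwidth in MB/s for 4 KB and 512 KB pages, read from Figure 25.
samples = {4: (141.93, 227.64), 512: (233.06, 495.44)}

for page_kb, (sw, hw) in samples.items():
    print(f"{page_kb} KB: {improvement_percent(hw, sw):.2f}%")
```

Under these readings the gains come out at roughly 60.4% for 4 KB pages and 112.6% for 512 KB pages, in line with the range quoted above.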

According to the OpenSSL test, the maximum performance improvement is about 12 times. However, the RPS improvement in end-to-end testing is only about 2 times. The cause is a constraint of the testing platform: we found that software decryption on the client has become the bottleneck, with all client CPUs fully loaded.

Page size  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
SW         141.93  196.93  199.87  215.65  224.36  229.22  231.10  233.06
HW         227.64  396.23  402.70  453.93  477.85  487.82  492.16  495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth (MB/s versus page size).

Page size                  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW vs SW improvement (%)   60.39   101.20  101.48  110.49  112.99  112.82  112.96  112.58
CPU idle (%)               13.62   15.40   53.15   70.40   75.07   86.04   88.17   89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which adopts the crypto mode adaptively and dynamically according to the request characteristics and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for computation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of 40 Gbps networking, we were able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtained about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtained a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers beyond x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), the Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article David Brumley and Dan Boneh. Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005: Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM cortex TM -A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


18 Security and Communication Networks

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

SW 25841 41918 50474 60854 67939 72047 74112 75626

SW-NI 34291 61302 85761 118225 145102 165304 176168 183119

HW 22804 40920 57366 81416 98063 117944 127661 133644

HW-SW co-design 33896 60802 85159 117338 141641 170511 181899 188428

000

20000

40000

60000

80000

100000

120000

140000

160000

180000

200000

Net

wor

k Ba

ndw

idth

(MB

s)

page size

16 CPUs 16 processes ECDHE-RSA-AES256-SHA384 network bandwidth

SW SW-NI HW HW-SW co-design

Figure 20 Network bandwidth with 16 processes for ECDHE-RSA-AES256-SHA384

300 MBs encryption bandwidth for large blocks Sincethe software encryption contributes little for total cryptoperformance compared with hardware engines we suggestthat remaining CPU idle utilized for other important taskssuch as database managements Of course it depends on userto decide how tomake a decision about the trade-off betweenthe CPU idle and encryption performance For example if itis rush hour the user could take full advantages of both hard-ware and software for maximum overall performance fornormal working time with fewer encryption requirementsthe user could only allocate needed management resource forHACs and try to have a best offloading

63 End-to-End Testing We adopt a pressure test to eval-uate the end-to-end performance Through pushing differ-ent workload for Web Server we can get the encryptionbandwidth and CPU idle in different working situations Wetested diverse page sizes to see the performance diversity fordifferent request characters

6.3.1. Cipher Suite: ECDHE-RSA-AES256-SHA384. Similar to the OpenSSL testing, we evaluated four different working flows: (1) software encryption with the OpenSSL crypto library (SW); (2) software encryption with AES-NI instruction acceleration (SW-NI); (3) hardware encryption using hardware engines with the proposed design flow (HW); and (4) hardware encryption with hardware-software codesign (HW-SW codesign), which is further optimized with MM and data aggregation. For different request pages, we further evaluated the trade-off between bandwidth and CPU idle with 16 processes and 32 processes, respectively. To better present the advantages of the proposed ACSA, we show three performance indexes in this section: RPS (requests per second), MB/s (encrypted data in megabytes per second), and CPU idle.

(1) 16 CPUs, 16 Processes. The RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 is shown in Table 6. For a clear comparison and analysis, we convert the RPS to network bandwidth (MB/s) and present the results in Figure 20.
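The conversion from RPS to bandwidth is not spelled out in the text. A minimal sketch, assuming each request returns exactly one page and that 1 MB is taken as 1024 KB (both are assumptions; the function name `rps_to_mb_per_s` is hypothetical), would be:

```python
def rps_to_mb_per_s(rps: float, page_kb: float) -> float:
    """Convert requests per second into MB/s, assuming one page per request."""
    return rps * page_kb / 1024.0  # 1 MB taken as 1024 KB (assumption)


# SW flow, 4 KB pages, 16 processes: RPS value taken from Table 6.
sw_4kb = rps_to_mb_per_s(66153.10, 4)
print(f"{sw_4kb:.2f} MB/s")
```

Under these assumptions the result agrees with the 258.41 MB/s reported for SW at 4 KB in Figure 20, which suggests the plotted bandwidth excludes protocol overhead.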

As Figure 20 shows, HW-SW codesign achieves greater network bandwidth than hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. The influence is more obvious for small block sizes; for example, the maximum improvement reaches 48.64% for the 4 KB page. If the page size is less than 128 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of size detection and the scheduling decision. If the page size is equal to or larger than 128 KB, the overall performance exceeds that of both software-only and hardware-only encryption.
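The improvement percentages quoted here can be checked against the plotted bandwidth values. The sketch below assumes "improvement" means the relative gain over the baseline flow; the helper name `improvement_percent` is hypothetical.

```python
def improvement_percent(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Bandwidth values (MB/s) for 4 KB pages, 16 processes, read from Figure 20.
hw_sw, hw, sw_ni = 338.96, 228.04, 342.91
print(f"vs HW:    {improvement_percent(hw_sw, hw):+.2f}%")
print(f"vs SW-NI: {improvement_percent(hw_sw, sw_ni):+.2f}%")
```

With this definition, the 4 KB case gives a gain of 48.64% over hardware-only encryption and a small loss of 1.15% against AES-NI, in line with the values reported for Figure 21.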

Although the bandwidth improvement of the proposed methodology is not obvious, and the bandwidth is even slightly lower for small requests, there is still room for further optimization with HW-SW codesign. As shown in Figure 21, even though the network bandwidth is slightly better with AES-NI, there is no CPU idle at all. With HW-SW codesign, however, there is 12.51%∼18.51% CPU idle compared with AES-NI for smaller page sizes. If the page size is larger than 64 KB, the CPU idle increases to 21.67%∼23.83%, and there is also a 2.85%∼3.38% performance improvement.


Table 6: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs, 16 processes).

Page size (KB)    4          8          16         32         64         128        256       512
SW                66153.10   53654.50   32303.20   19473.40   10870.30   5763.76    2964.48   1512.51
SW-NI             87784.10   78466.20   54887.00   37831.90   23216.30   13224.30   7046.71   3662.37
HW                58378.50   52377.60   36714.30   26053.10   15690.00   9435.54    5106.42   2672.87
HW-SW co-design   86679.40   77827.00   45499.50   34357.70   22623.60   13670.90   7275.66   3770.97

Table 7: RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes).

Page size (KB)    4          8          16         32         64         128        256       512
SW                65885.90   52987.80   32030.00   19391.20   10803.80   5744.14    2964.18   1503.48
SW-NI             86302.50   77827.15   53017.20   37116.10   22920.40   13092.50   7020.85   3657.35
HW                58499.10   54891.60   37946.40   26972.10   16401.90   10041.90   5486.14   2881.66
HW-SW co-design   85177.20   77211.90   46404.00   36166.60   24875.10   15347.70   8325.19   4391.45

[Chart: 16 CPUs, 16 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle; x-axis: page size; y-axes: network bandwidth improvement (%) and CPU idle (%)]

Page size                                   4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)    -1.15   -0.81   -0.70   -0.75   -2.38   3.15    3.25    2.90
HW-SW co-design vs HW improvement (%)       48.64   48.59   48.45   44.12   44.44   44.57   42.49   40.99
CPU idle (%)                                1.50    1.60    1.70    1.80    15.50   18.00   20.00   20.50

Figure 21: RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384.

If we expect a higher network bandwidth for better encryption performance, we could utilize the remaining CPU resource. As an example of the trade-off between CPU idle and performance, we also present the test results with more processes below.

(2) 16 CPUs, 32 Processes. As Figure 21 shows, the working flow with HW-SW codesign leaves extra CPU idle compared with AES-NI. Therefore, to further explore better performance with HW-SW codesign, we increase the number of processes to utilize the CPU idle. We tested with 32 processes and show the RPS of the different working flows for cipher suite ECDHE-RSA-AES256-SHA384 in Table 7.

For a clear comparison and analysis, we convert the RPS to network bandwidth (MB/s) and present the results in Figure 22.

[Chart: 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth; x-axis: page size; y-axis: network bandwidth (MB/s)]

Page size         4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW                257.37   413.97   500.47   605.98   675.24   718.02   741.05   751.74
SW-NI             337.12   608.26   828.39   1159.88  1432.53  1636.56  1755.21  1828.68
HW                228.51   421.03   592.91   842.88   1025.12  1255.24  1371.54  1440.83
HW-SW co-design   334.03   603.22   820.70   1159.42  1562.83  1923.08  2092.16  2196.43

Figure 22: Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384.

[Chart: 16 CPUs, 32 processes, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle; x-axis: page size; y-axes: network bandwidth improvement (%) and CPU idle (%)]

Page size                                   4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)    -0.92   -0.83   -0.93   -0.04   9.10    17.51   19.20   20.11
HW-SW co-design vs HW improvement (%)       46.17   43.27   38.42   37.55   52.45   53.20   52.54   52.44
CPU idle (%)                                2.50    2.00    2.00    1.80    1.50    1.80    1.50    1.60

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384.

As Figure 22 shows, HW-SW codesign achieves a 49%∼52% bandwidth improvement over hardware-only encryption. The reason is the great reduction of invocation cost through the proposed data aggregation and adaptive scheduling. Nevertheless, unlike the case with 16 processes, the influence is more obvious for large block sizes, since more CPU idle is available for performance improvement. If the page size is less than 64 KB, the network bandwidth with HW-SW codesign is slightly lower than with AES-NI. The reason is the cost of data size detection and the scheduling decision. If the page size is equal to or larger than 64 KB, the overall performance exceeds that of both software-only and hardware-only encryption.
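To see why aggregation reduces invocation cost, consider a toy cost model in which every accelerator invocation pays a fixed overhead on top of the time spent processing bytes. This is only an illustrative sketch: the function `accelerator_time_us`, the overhead of 50 µs, and the throughput of 1000 bytes/µs are assumed values, not measurements from this work.

```python
import math


def accelerator_time_us(num_requests: int, bytes_per_request: int,
                        invocation_overhead_us: float, bytes_per_us: float,
                        batch_size: int = 1) -> float:
    """Total time when `batch_size` requests share one accelerator invocation."""
    invocations = math.ceil(num_requests / batch_size)
    processing_us = num_requests * bytes_per_request / bytes_per_us
    return invocations * invocation_overhead_us + processing_us


# Illustrative numbers only: 64 requests of 4 KB, 50 us per invocation, 1000 bytes/us.
unbatched = accelerator_time_us(64, 4096, 50.0, 1000.0, batch_size=1)
batched = accelerator_time_us(64, 4096, 50.0, 1000.0, batch_size=16)
print(f"unbatched: {unbatched:.1f} us, batched: {batched:.1f} us")
```

In this model the processing term is the same in both cases; only the number of fixed-cost invocations shrinks, which is why the effect is strongest when individual requests are small.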

To illustrate the advantage of HW-SW codesign more clearly, we summarize the network bandwidth improvement and CPU idle in Figure 23. As the figure shows, for small pages HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption method. Hardware engines do not perform as well as AES-NI here because of the context switch cost. Therefore, ACSA automatically adopted software encryption with AES-NI for small pages. For large pages, the CPU resource saved through hardware and software codesign could be utilized to further improve encryption bandwidth. Even compared with AES-NI, we still get an additional 20% network bandwidth for page size 512 KB.
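The behavior of the request filter described above can be pictured as a size-based dispatch. The following is a simplified sketch, not the ACSA implementation: the function `choose_engine`, the 64 KB cut-over, and the rule for ciphers without instruction support are all assumptions made for illustration, and the real scheduler also takes system load into account.

```python
SMALL_PAGE_LIMIT_KB = 64  # hypothetical cut-over point, not a value given in the text


def choose_engine(page_kb: float, cipher_has_cpu_instructions: bool = True,
                  limit_kb: float = SMALL_PAGE_LIMIT_KB) -> str:
    """Pick an encryption path for one request based on its size."""
    if not cipher_has_cpu_instructions:
        # No instruction-level acceleration to fall back on (assumed rule).
        return "HW"
    # Small requests: the accelerator's invocation/context-switch cost dominates.
    return "SW-NI" if page_kb < limit_kb else "HW"


for size_kb in (4, 32, 64, 512):
    print(size_kb, "KB ->", choose_engine(size_kb))
```

A fixed threshold keeps the decision cheap, which matters because the detection and scheduling step is itself part of the overhead observed for small pages.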


[Chart: 16 CPUs, ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle; x-axis: page size; y-axes: network bandwidth improvement (%) and CPU idle (%)]

Page size                                                 4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement, 16 processes (%)    -1.15   -0.81   -0.70   -0.75   -2.38   3.15    3.25    2.90
HW-SW co-design vs SW-NI improvement, 32 processes (%)    -0.92   -0.83   -0.93   -0.04   9.10    17.51   19.20   20.11
CPU idle, 16 processes (%)                                1.50    1.60    1.70    1.80    15.50   18.00   20.00   20.50
CPU idle, 32 processes (%)                                2.50    2.00    2.00    1.80    1.50    1.80    1.50    1.60

Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384.

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)    4          8          16         32         64        128       256       512
SW                36334.70   25207.40   12791.60   6900.93    3589.73   1833.76   924.40    466.12
HW                58275.70   50717.80   25772.70   14525.70   7645.65   3902.54   1968.62   990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance of cipher suite ECDHE-RSA-AES256-SHA384 for diverse page sizes and different numbers of available CPU processes. From the experimental results, we draw the following conclusions:

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, HW-SW codesign achieves a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As summarized in Figure 24, the proposed methodology makes a better trade-off between performance and CPU idle possible. It is possible to get the same network bandwidth/RPS as AES-NI while still having additional CPU resource available. The user could utilize the saved CPU for other important tasks or take full advantage of it for further performance improvement.

6.3.2. Cipher Suite: ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and the contribution of software encryption to HW-SW codesign is small, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI. The other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. We get 1.6∼2.1 times the performance if we adopt hardware encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 495.44 MB/s for large data blocks. Under the same testing conditions, there is a 60%∼112% improvement over software encryption. Besides the improved performance, 13.62%∼89.54% CPU idle is still available with hardware encryption.

According to the test in OpenSSL, there is a maximum 12 times performance improvement. However, there is only a 2 times RPS improvement in end-to-end testing. The problem is a constraint of the testing platform: we found that the software decryption speed in the client had become a bottleneck. All the CPUs of the client are completely loaded.
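The gap between the two speedups can be reasoned about with a simple bottleneck model, in which end-to-end throughput is limited by the slower of server and client. The sketch below uses normalized, purely illustrative capacities (they are not measured values), and the function name `end_to_end_speedup` is hypothetical.

```python
def end_to_end_speedup(server_before: float, server_after: float,
                       client_capacity: float) -> float:
    """Speedup seen by the whole pipeline when only the server side gets faster."""
    throughput_before = min(server_before, client_capacity)
    throughput_after = min(server_after, client_capacity)
    return throughput_after / throughput_before


# Normalized, illustrative units: server goes from 1 to 12, client tops out at 2.
print(end_to_end_speedup(1.0, 12.0, client_capacity=2.0))
```

In this model a 12 times faster server yields only a 2 times end-to-end gain once the client saturates, so a faster client (or more client machines) would be needed to expose the full server-side improvement.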

[Chart: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth; x-axis: page size; y-axis: network bandwidth (MB/s)]

Page size   4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
SW          141.93   196.93   199.87   215.65   224.36   229.22   231.10   233.06
HW          227.64   396.23   402.70   453.93   477.85   487.82   492.16   495.44

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth.

[Chart: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle; x-axis: page size; y-axes: network bandwidth improvement (%) and CPU idle (%)]

Page size                   4 KB     8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW vs SW improvement (%)    60.39    101.20   101.48   110.49   112.99   112.82   112.96   112.58
CPU idle (%)                13.62    15.40    53.15    70.40    75.07    86.04    88.17    89.54

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle.

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which is able to adopt the crypto mode adaptively and dynamically according to the request character and system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to advance the contribution of hardware crypto engines, but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. Through the establishment of 40 Gbps networking, we are able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we could get about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we could get a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, the user could adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.

The proposed design methodology possesses universal properties. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed design methodology is applicable regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we intend to study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., “Big data: survey, technologies, opportunities, and challenges,” The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, “Security and privacy protection of social networks in big data era,” Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, “A survey on security issues in service delivery models of cloud computing,” Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, “Trust but verify: auditing the secure internet of things,” in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, “Hardware-software co-design of AES on FPGA,” in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, “Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip,” ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, “Security methods for web-based applications on embedded system,” in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, “Article: David Brumley and Dan Boneh. Remote Timing Attacks are Practical,” in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, “Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA,” in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, “Accelerating OpenSSL’s ECC with low cost reconfigurable hardware,” in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, “An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions,” in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT’11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, “Design and test of a scalable security processor,” in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, “10 Gbps implementation of TLS/SSL accelerator on FPGA,” in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, “System architecture implementation for a high-performance Network Security Processor,” in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, “Intel advanced encryption standard instructions (aes-ni),” Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, “Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security,” in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, “Improving the performance of a scalable encryption algorithm (SEA) using FPGA,” International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., “A 3 GHz dual core processor ARM cortex TM -A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization,” IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, “Energy- and cost-efficiency analysis of ARM-based clusters,” in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, “Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors,” Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, “High-speed VLSI architectures for the AES algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, “Very compact FPGA implementation of the AES algorithm,” in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, “A study of encryption algorithms (RSA, DES, 3DES and AES) for information security,” International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, “A comparative analysis of SHA and MD5 algorithm,” Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, “Improving I/O forwarding throughput with data compression,” in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, “Using underutilized CPU resources to enhance its reliability,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/

[32] http://httpd.apache.org/docs/current/programs/ab.html

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

[34] D. Kline, N. Parshook, X. Ge et al., “Holistically evaluating the environmental impacts in modern computing systems,” in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, “Growth in data center electricity use 2005 to 2010,” A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, “Green computing: new challenges and opportunities,” in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Page 19: Hardware/Software Adaptive Cryptographic …downloads.hindawi.com/journals/scn/2018/7631342.pdfSecurityandCommunicationNetworks Forexample,workin[] implementedAESaccelerationfor embedded

Security and Communication Networks 19

Table 6 RPS with different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs 16 processes)

Page Size (KB) 4 8 16 32 64 128 256 512SW 6615310 5365450 3230320 1947340 1087030 576376 296448 151251SW-NI 8778410 7846620 5488700 3783190 2321630 1322430 704671 366237HW 5837850 5237760 3671430 2605310 1569000 943554 510642 267287HW-SW co-design 8667940 7782700 4549950 3435770 2262360 1367090 727566 377097

Table 7 RPS with different working flows for ECDHE-RSA-AES256-SHA384 (32 processes)

Page Size (KB) 4 8 16 32 64 128 256 512SW 6588590 5298780 3203000 1939120 1080380 574414 296418 150348SW-NI 8630250 7782715 5301720 3711610 2292040 1309250 702085 365735HW 5849910 5489160 3794640 2697210 1640190 1004190 548614 288166HW-SW co-design 8517720 7721190 4640400 3616660 2487510 1534770 832519 439145

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

HW-SW co-design vs SW-NI improvement -115 -081 -070 -075 -238 315 325 290

HW-SW co-design vs HW improvement 4864 4859 4845 4412 4444 4457 4249 4099

CPU idle 150 160 170 180 1550 1800 2000 2050

pagesize

16 CPUs 16 processes ECDHE-RSA-AES256-SHA384 network bandwidthimprovement and CPU idle

HW-SW co-design vs SW-NI improvement HW-SW co-design vs HW improvement CPU idle

minus1000

000

1000

2000

3000

4000

5000

6000

Net

wor

k Ba

ndw

idth

Impr

ovem

ent

minus1000

000

1000

2000

3000

4000

5000

CPU

idle

Figure 21 RPS improvement and CPU remaining with 16 processes for ECDHE-RSA-AES256-SHA384

If we expect to get a higher network bandwidth for a betterencryption performance we could utilize the remaining CPUresource As an example of the trade-off between CPU idleand performance we also presented the test results withmoreprocesses as below

(2) 16 CPUs 32 Processes As we can see from Figure 21 theworking flow with HW-SW codesign can get extra CPU idlecompared with AES-NITherefore to further explore a betterperformancewithHW-SWcodesignwe increase the numberof processes to utilize the CPU idleWe tested the results with32 processes and showed theRPSwith differentworking flowsfor cipher suite ECDHE-RSA-AES256-SHA384 in Table 6

For a clear comparison and analysis we transit the RPSto network bandwidth MBs and presented the results inFigure 22

As we can see from Figure 22 HW-SW codesign canget 49sim52 bandwidth improvement compared with onlyhardware encryption The reason is great reduction of invo-cation cost through proposed data aggregation and adaptivescheduling Nevertheless different with 16 processes theinfluence is more obvious for large block sizes since thereare more CPU idles available for performance improvementIf the page size is less than 64 KB the network bandwidthwith HW-SW codesign is a little bit lower than AES-NIThe reason is the cost for data size detection and scheduling

20 Security and Communication Networks

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

SW 25737 41397 50047 60598 67524 71802 74105 75174

SW-NI 33712 60826 82839 115988 143253 163656 175521 182868

HW 22851 42103 59291 84288 102512 125524 137154 144083

HW-SW co-design 33403 60322 82070 115942 156283 192308 209216 219643

16 CPUs 32 processes ECDHE-RSA-AES256-SHA384 network bandwidth

000

50000

100000

150000

200000

250000

Net

wor

k Ba

ndw

idth

(MB

s)

page size

SW SW-NI HW HW-SW co-design

Figure 22 Network bandwidth with 32 processes for ECDHE-RSA-AES256-SHA384

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

HW-SW co-design vs SW-NI improvement -092 -083 -093 -004 910 1751 1920 2011

HW-SW co-design vs HW improvement 4617 4327 3842 3755 5245 5320 5254 5244

CPU idle 250 200 200 180 150 180 150 160

page size

16 CPUs 32 processes ECDHE-RSA-AES256-SHA384 network bandwidthimprovement and CPU idle

minus1000

000

1000

2000

3000

4000

5000

CPU

idle

minus1000

000

1000

2000

3000

4000

5000

6000

Net

wor

k Ba

ndw

idth

Impr

ovem

ent

HW-SW co-design vs SW-NI improvement HW-SW co-design vs HW improvement CPU idle

Figure 23 RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384

decision If the page size is equal or bigger than 64 KBthe overall performance is advanced compared with onlysoftware encryption or only hardware encryption

To illustrate the advantage of HW-SW codesign moreclearly we conclude the network improvement and CPU idlein Figure 23 As we can see from the figure for small pagesHW-SW codesign almost maintains the same performancewith AES-NI It is reasonable since we applied the request

filter to choose the most suitable encryption way Hardwareengines are not so excellent compared with AES-NI due tocontextswitch cost Therefore ACSA automatically adoptedsoftware encryption with AES-NI for small pages For largepages the saved CPU resource through hardware and soft-ware codesign could be utilized to further improve encryp-tion bandwidth Even compared with AES-NI we still get anadditional 20 network bandwidth for page size 512 KB

Security and Communication Networks 21

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KBHW-SW co-design vs SW-NI improvement 16 processes -115 -081 -070 -075 -238 315 325 290HW-SW co-design vs SW-NI improvement 32 processes -092 -083 -093 -004 910 1751 1920 2011CPU idle 16 processes 150 160 170 180 1550 1800 2000 2050CPU idle 32 processes 250 200 200 180 150 180 150 160

page size

16 CPUs ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle

HW-SW co-design vs SW-NI improvement 16 processes HW-SW co-design vs SW-NI improvement 32 processesCPU idle 16 processes CPU idle 32 processes

minus500

000

500

1000

1500

2000

2500

Net

wor

k Ba

ndw

idth

Impr

ovem

ent

minus500

000

500

1000

1500

2000

2500

CPU

idle

Figure 24 Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384

Table 8 ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs 32 processes)

Page Size (KB) 4 8 16 32 64 128 256 512SW 3633470 2520740 1279160 690093 358973 183376 92440 46612HW 5827570 5071780 2577270 1452570 764565 390254 196862 99088

(3) Test Conclusion In this section we tested and analyzedthe end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 in diverse page sizes at different availableCPU processes Through the experiment results we canfigure out the conclusions as follows

(1) Generally speaking proposed HW-SW codesigncould get the best performance compared with onlysoftware or hardware encryption For best case HW-SW codesign could get 5320 bandwidth improve-ment compared with only hardware encryption and2011 improvement compared with AES-NI Thecontribution comes from aggregation strategy andadaptive scheduling

(2) As we concluded in Figure 24 the proposed method-ology could provide a possibility for a better trade-off between performance and CPU idle It is possibleto get the same network bandwidthRPS with AES-NI but still have additional available CPU resourceThe user could utilize the saved CPU for otherimportant tasks or take full advantage of it for furtherperformance improvement

632 Cipher Suite ECDHE-RSA-DES-CBC3-SHA Since3DES is not supported by AES-NI yet and the contributionwith software encryption is little for HW-SW codesign wetested the end-to-end performance for 2 working flows inthis section The first one is software encryption withoutAES-NI The other one is hardware encryption with crypto

engines The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown as Table 8 We can get 16sim21 timesperformance if we adopted hardware encryption

We concluded the encryption bandwidth and the perfor-mance improvement in Figures 25 and 26 The maximumnetwork bandwidth with hardware encryption could achieve49544 MBS for large data blocks For the same testingconditions there are 60sim112 improvement comparedwithsoftware encryption Besides performance improved thereare still 1362sim8954 CPU idle available for hardwareencryption

According to the test in OpenSSL there are maximum 12times performance improvement However there are only 2times RPS improvement for end-to-end testing The problemis the constraint of the testing platform We found thedecryption speed with software in the client has been abottleneck All the CPUs are completely loaded for the client

Figure 25: Network bandwidth (MB/s) for ECDHE-RSA-DES-CBC3-SHA with 16 CPUs and 32 processes.

Page size   4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
SW          141.93  196.93  199.87  215.65  224.36  229.22  231.10  233.06
HW          227.64  396.23  402.70  453.93  477.85  487.82  492.16  495.44

Figure 26: Network bandwidth improvement and CPU idle for ECDHE-RSA-DES-CBC3-SHA with 16 CPUs and 32 processes.

Page size                  4 KB   8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
HW vs SW improvement (%)   60.39  101.20  101.48  110.49  112.99  112.82  112.96  112.58
CPU idle (%)               13.62  15.40   53.15   70.40   75.07   86.04   88.17   89.54

7. Conclusions and Future Work

In this work, we presented ACSA, an Adaptive Crypto System based on Accelerators, which selects the crypto mode adaptively and dynamically according to the request characteristics and the system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU occupation is considerable compared with hardware crypto. Although HACs perform well for calculation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. By establishing a 40 Gbps network, we were able to evaluate the system performance in real applications with a high workload on various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtained about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtained a 53.20% bandwidth improvement compared with hardware-only encryption and a 20.11% improvement compared with AES-NI. Furthermore, users can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.
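The idea behind data aggregation, amortizing the per-call invocation cost of a hardware engine over many small blocks, can be sketched as follows. This is a minimal illustration and not the ACSA implementation: the `Aggregator` class, the `submit` callback, and the 64 KB threshold are all hypothetical names and values.

```python
class Aggregator:
    """Buffers small blocks and hands them to the engine in one batch.

    Hypothetical sketch: `submit` stands in for whatever call sends a buffer
    to a hardware crypto engine; `threshold` is an illustrative batch size.
    """

    def __init__(self, submit, threshold=64 * 1024):
        self.submit = submit
        self.threshold = threshold
        self.pending = []
        self.pending_bytes = 0

    def add(self, block: bytes):
        """Queue one block and flush once enough bytes have accumulated."""
        self.pending.append(block)
        self.pending_bytes += len(block)
        if self.pending_bytes >= self.threshold:
            self.flush()

    def flush(self):
        """Send everything queued so far as a single engine invocation."""
        if self.pending:
            self.submit(b"".join(self.pending))
            self.pending = []
            self.pending_bytes = 0


calls = []
agg = Aggregator(submit=calls.append, threshold=64 * 1024)
for _ in range(32):           # 32 blocks of 4 KB = 128 KB in total
    agg.add(bytes(4 * 1024))
agg.flush()                   # push out any remainder
print(len(calls))             # 2 engine invocations instead of 32
```

A real implementation would also have to preserve per-request boundaries and bound the time a block may wait in the buffer; both are omitted here.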

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed methodology is applicable regardless of the type of CPU and accelerator. In addition, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we intend to study resource allocation strategies for heterogeneous server centers with different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), the Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article David Brumley and Dan Boneh, Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005: Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM Cortex™-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/.

[32] http://httpd.apache.org/docs/current/programs/ab.html.

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html.

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.


Figure 22: Network bandwidth (MB/s) with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs).

Page size         4 KB    8 KB    16 KB   32 KB    64 KB    128 KB   256 KB   512 KB
SW                257.37  413.97  500.47  605.98   675.24   718.02   741.05   751.74
SW-NI             337.12  608.26  828.39  1159.88  1432.53  1636.56  1755.21  1828.68
HW                228.51  421.03  592.91  842.88   1025.12  1255.24  1371.54  1440.83
HW-SW co-design   334.03  603.22  820.70  1159.42  1562.83  1923.08  2092.16  2196.43

Figure 23: RPS improvement and CPU remaining with 32 processes for ECDHE-RSA-AES256-SHA384 (16 CPUs; network bandwidth improvement and CPU idle).

Page size                                    4 KB   8 KB   16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement (%)     -0.92  -0.83  -0.93  -0.04  9.10   17.51   19.20   20.11
HW-SW co-design vs HW improvement (%)        46.17  43.27  38.42  37.55  52.45  53.20   52.54   52.44
CPU idle (%)                                 2.50   2.00   2.00   1.80   1.50   1.80    1.50    1.60

decision. If the page size is 64 KB or larger, the overall performance exceeds that of software-only or hardware-only encryption.
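The 64 KB crossover can be recomputed from the bandwidth values of Figure 22. The sketch below is an independent check, not the authors' script. It assumes the bandwidth values carry two decimal places (for example, 2196.43 MB/s for the codesign at 512 KB), an assumption chosen because it reproduces the reported improvement percentages.

```python
# Figure 22 (ECDHE-RSA-AES256-SHA384, 16 CPUs, 32 processes), in MB/s.
# Assumption: values carry two decimal places.
PAGE_KB = [4, 8, 16, 32, 64, 128, 256, 512]
SW_NI = [337.12, 608.26, 828.39, 1159.88, 1432.53, 1636.56, 1755.21, 1828.68]
HW = [228.51, 421.03, 592.91, 842.88, 1025.12, 1255.24, 1371.54, 1440.83]
CODESIGN = [334.03, 603.22, 820.70, 1159.42, 1562.83, 1923.08, 2092.16, 2196.43]


def improvement_pct(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


vs_ni = [improvement_pct(c, n) for c, n in zip(CODESIGN, SW_NI)]
vs_hw = [improvement_pct(c, h) for c, h in zip(CODESIGN, HW)]

# Smallest page size at which the codesign beats both single-mode flows.
crossover = next(p for p, a, b in zip(PAGE_KB, vs_ni, vs_hw) if a > 0 and b > 0)

for p, a, b in zip(PAGE_KB, vs_ni, vs_hw):
    print(f"{p:>4} KB  vs SW-NI {a:+6.2f}%  vs HW {b:+6.2f}%")
print("crossover page size (KB):", crossover)
```

Below the crossover the codesign trails AES-NI by less than 1%, which matches the statement that it keeps almost the same performance for small pages.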

To illustrate the advantage of the HW-SW codesign more clearly, we summarize the network improvement and CPU idle in Figure 23. As the figure shows, for small pages the HW-SW codesign maintains almost the same performance as AES-NI. This is reasonable, since we applied the request filter to choose the most suitable encryption method. Hardware engines do not compare well with AES-NI for small pages because of the context switch cost. Therefore, ACSA automatically adopted software encryption with AES-NI for small pages. For large pages, the CPU resources saved through the hardware and software codesign can be used to further improve encryption bandwidth. Even compared with AES-NI, we still obtain an additional 20% network bandwidth for a page size of 512 KB.
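The size-based part of such a request filter can be sketched as a simple threshold rule. This is an illustration only and not the ACSA filter itself: the `Mode` enum, the `choose_mode` function, and the 64 KB default are hypothetical, with the threshold taken from the crossover observed in Figure 23. The paper also takes system load into account, which is left out here.

```python
from enum import Enum


class Mode(Enum):
    SOFTWARE_AESNI = "software encryption with AES-NI"
    CODESIGN = "hardware and software codesign"


# Assumption: 64 KB is the crossover seen in Figure 23; a real deployment
# would tune this value and also consult the current system load.
THRESHOLD_KB = 64


def choose_mode(page_kb, threshold_kb=THRESHOLD_KB):
    """Pick the encryption path for one request based on its page size."""
    if page_kb < threshold_kb:
        return Mode.SOFTWARE_AESNI   # invocation cost dominates for small pages
    return Mode.CODESIGN             # large pages amortize the engine overhead


for size in (4, 32, 64, 512):
    print(f"{size:>4} KB -> {choose_mode(size).value}")
```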


Figure 24: Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384 (16 CPUs).

Page size                                                  4 KB   8 KB   16 KB  32 KB  64 KB  128 KB  256 KB  512 KB
HW-SW co-design vs SW-NI improvement, 16 processes (%)     -1.15  -0.81  -0.70  -0.75  -2.38  3.15    3.25    2.90
HW-SW co-design vs SW-NI improvement, 32 processes (%)     -0.92  -0.83  -0.93  -0.04  9.10   17.51   19.20   20.11
CPU idle, 16 processes (%)                                 1.50   1.60   1.70   1.80   15.50  18.00   20.00   20.50
CPU idle, 32 processes (%)                                 2.50   2.00   2.00   1.80   1.50   1.80    1.50    1.60

Table 8: ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs, 32 processes).

Page size (KB)  4         8         16        32        64       128      256      512
SW              36334.70  25207.40  12791.60  6900.93   3589.73  1833.76  924.40   466.12
HW              58275.70  50717.80  25772.70  14525.70  7645.65  3902.54  1968.62  990.88

(3) Test Conclusion. In this section, we tested and analyzed the end-to-end performance of cipher suite ECDHE-RSA-AES256-SHA384 for diverse page sizes and different numbers of available CPU processes. From the experimental results, we draw the following conclusions.

(1) Generally speaking, the proposed HW-SW codesign achieves the best performance compared with software-only or hardware-only encryption. In the best case, the HW-SW codesign obtains a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. The contribution comes from the aggregation strategy and adaptive scheduling.

(2) As we concluded in Figure 24 the proposed method-ology could provide a possibility for a better trade-off between performance and CPU idle It is possibleto get the same network bandwidthRPS with AES-NI but still have additional available CPU resourceThe user could utilize the saved CPU for otherimportant tasks or take full advantage of it for furtherperformance improvement

632 Cipher Suite ECDHE-RSA-DES-CBC3-SHA Since3DES is not supported by AES-NI yet and the contributionwith software encryption is little for HW-SW codesign wetested the end-to-end performance for 2 working flows inthis section The first one is software encryption withoutAES-NI The other one is hardware encryption with crypto

engines The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown as Table 8 We can get 16sim21 timesperformance if we adopted hardware encryption

We concluded the encryption bandwidth and the perfor-mance improvement in Figures 25 and 26 The maximumnetwork bandwidth with hardware encryption could achieve49544 MBS for large data blocks For the same testingconditions there are 60sim112 improvement comparedwithsoftware encryption Besides performance improved thereare still 1362sim8954 CPU idle available for hardwareencryption

According to the test in OpenSSL there are maximum 12times performance improvement However there are only 2times RPS improvement for end-to-end testing The problemis the constraint of the testing platform We found thedecryption speed with software in the client has been abottleneck All the CPUs are completely loaded for the client

7 Conclusions and Future Work

In this work we presented ACSA Adaptive Crypto Sys-tem based on Accelerators which is able to adopt cryptomode adaptively and dynamically according to the requestcharacter and system load We surveyed and analyzeddifferent working flows with SSLTLS firstly and foundneither hardware nor software encryption can claim thebest performance for all the application scenarios Eventhere is advanced instruction acceleration for AES the CPU

22 Security and Communication Networks

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB

SW 14193 19693 19987 21565 22436 22922 23110 23306

HW 22764 39623 40270 45393 47785 48782 49216 49544

page size

16 CPUs 32 processes ECDHE-RSA-DES-CBC3-SHA network bandwidth

SW HW

000

10000

20000

30000

40000

50000

60000N

etw

ork

Band

wid

th (M

Bs)

Figure 25 16 CPUs 32 processes ECDHE-RSA-DES-CBC3-SHA network bandwidth

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KBHW vs SW improvement 6039 10120 10148 11049 11299 11282 11296 11258CPU idle 1362 1540 5315 7040 7507 8604 8817 8954

page size

16 CPUs 32 processes ECDHE -RSA-DES-CBC3-SHA network bandwidthimprovement and CPU idle

HW vs SW improvement CPU idle

000

2000

4000

6000

8000

10000

12000

Net

wor

k Ba

ndw

idth

Impr

ovem

ent

00010002000300040005000600070008000900010000

CPU

idle

Figure 26 16 CPUs 32 processes ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle

occupation is considerable compared with hardware cryptoAlthough HACs performs well for calculation the invocationcost cannot be ignored for small data blocks We not onlyproposed optimization strategies such as data aggregationto advance the contribution with hardware crypto enginesbut also presented MM strategy (maximizing utilizationwith minimal overhead) and adaptive scheduling to takefull advantage of both software and hardware encryptionThrough the establishment of 40 Gbps networking we areable to evaluate the system performance in real applicationswith a high workload on various benchmarks and systemconfigurations For the encryption algorithm 3DES which isnot supported inAES-NI we could get about 12 times acceler-ationwith accelerators For typical encryptionAES supported

by instruction acceleration we could get 5320 bandwidthimprovement compared with only hardware encryption and2011 improvement compared with AES-NI Furthermoreuser could adjust the trade-off between CPU occupation andencryption performance through MM strategy to free CPUsaccording to the working requirements

Proposed design methodology possesses universal prop-erties The design flows of ACSA and MM algorithm areapplicable to other similar designs with hardware acceler-ation for security web access [33] As long as the designis heterogeneous architecture with CPUs and acceleratorsthe proposed design methodology is applicable regardlessof the type of CPU and accelerator Besides this work isbased on ARM server which can be furthered for energy

Security and Communication Networks 23

efficiency exploration [34] providing emerging solutionsfor data center besides X86 based architectures [35 36] Infuturework based on our current understanding of hardwareand software encryption features our research attempt is tostudy resource allocation strategies for heterogeneous servercenters in different architectures from the perspective ofenergy efficiency rather than performance

Data Availability

The data and codes are available on the lab server Anyonethat would like to obtain access could send an email to thecorresponding author

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

This work is supported by the National Natural ScienceFoundation of China (61502061) Chongqing ApplicationFoundation and Research in Cutting-Edge Technologies(cstc2015jcyjA40016) and the Fundamental Research Fundsfor the Central Universities (106112017CDJXY180004)

References

[1] N Khan I Yaqoob I A T Hashem et al ldquoBig data sur-vey technologies opportunities and challengesrdquo The ScientificWorld Journal vol 2014 Article ID 712826 18 pages 2014

[2] L Li K Ota Z Zhang and Y Liu ldquoSecurity and privacyprotection of social networks in big data erardquo MathematicalProblems in Engineering vol 2018 Article ID 6872587 2 pages2018

[3] S Subashini and V Kavitha ldquoA survey on security issues inservice deliverymodels of cloud computingrdquo Journal ofNetworkand Computer Applications vol 34 no 1 pp 1ndash11 2011

[4] J Wilson R S Wahby H Corrigan-Gibbs D Boneh P Levisand K Winstein ldquoTrust but verify auditing the secure internetof thingsrdquo in Proceedings of the 15th ACM International Confer-ence on Mobile Systems Applications and Services MobiSys pp464ndash474 USA 2017

[5] A Eric Young J TimHudson andR EngelschallOpenSSLTheOpen Source Toolkit for ssltls 2011

[6] S Baskaran and P Rajalakshmi ldquoHardware-software co-designof AES on FPGArdquo in Proceedings of the 2012 InternationalConference on Advances in Computing Communications andInformatics ICACCI 2012 pp 1118ndash1122 India August 2012

[7] L Bossuet M Grand L Gaspar V Fischer and G Gog-niat ldquoArchitectures of flexible symmetric key crypto engines-asurvey From hardware coprocessor to multi-crypto-processorsystem on chiprdquo ACM Computing Surveys vol 45 no 4 2013

[8] C Maharak and B Sowanwanichakul ldquoSecurity methods forweb-based applications on embedded systemrdquo in Proceedingsof the IEEE TENCON 2004 - 2004 IEEE Region 10 ConferenceAnalog and Digital Techniques in Electrical Engineering ppC56ndashC59Thailand November 2004

[9] A B Smith C D Jones and E F Roberts ldquoArticle DavidBrumley andDanBoneh Remote TimingAttacks are Practicalrdquoin Proceedings of the 12th Usenix Security Symposium 2003

[10] V P Nambiar and M M Zabidi Accelerating the AES Encryp-tion Function in OpenSSL for Embedded Systems IndersciencePublishers 2009

[11] MKhalilMNazrin andYWHau ldquoImplementation of SHA-2hash function for a digital signature System-on-Chip in FPGArdquoin Proceedings of the 2008 International Conference on ElectronicDesign ICED 2008 Malaysia December 2008

[12] D B Roy S Agrawal C Reberio and D MukhopadhyayldquoAccelerating OpenSSLrsquos ECC with low cost reconfigurablehardwarerdquo in Proceedings of the 2016 International Symposiumon Integrated Circuits ISIC 2016 Singapore December 2016

[13] A Thiruneelakandan and T Thirumurugan ldquoAn approachtowards improved cyber security by hardware acceleration ofOpenSSL cryptographic functionsrdquo in Proceedings of the 1stInternational Conference on Electronics Communication andComputing Technologies 2011 ICECCTrsquo11 pp 13ndash16 IndiaSeptember 2011

[14] C Su C Wang K Cheng C Huang and C Wu ldquoDesignand test of a scalable security processorrdquo in Proceedings ofthe ASP-DAC 2005 Asia and South Pacific Design AutomationConference p 372 Shanghai China January 2005

[15] T Isobe S Tsutsumi K Seto K Aoshima and K Kariyaldquo10Gbps implementation of TLSSSL accelerator on FPGArdquo inProceedings of the 2010 IEEE 18th International Workshop onQuality of Service IWQoS 2010 IEEE China June 2010

[16] H Wang G Bai and C H Zodiac ldquoSystem architectureimplementation for a high-performance Network SecurityProcessorrdquo in Proceedings of the International Conference onApplication-Specific Systems Architectures and Processors pp91ndash96 IEEE Belgium July 2008

[17] R Jeffrey ldquoIntel advanced encryption standard instructions(aes-ni)rdquo Tech Rep Intel 2010

[18] M Khalil-Hani V P Nambiar and M N Marsono ldquoHard-ware acceleration of OpenSSL cryptographic functions forhigh-performance internet securityrdquo in Proceedings of theUKSimAMSS 1st International Conference on Intelligent Sys-tems Modelling and Simulation ISMS 2010 pp 374ndash379 UKJanuary 2010

[19] B P Kumar P Ezhumalai and P Ramesh ldquoImproving theperformance of a scalable encryption algorithm (SEA) usingFPGArdquo International Journal of Computer Science and NetworkSecurity vol 10 no 2 2010

[20] D Jacquet F Hasbani P Flatresse et al ldquoA 3 GHz dual coreprocessor ARM cortex TM -A9 in 28 nm UTBB FD-SOICMOS with ultra-wide voltage range and energy efficiencyoptimizationrdquo IEEE Journal of Solid-State Circuits vol 49 no4 pp 812ndash826 2014

[21] Z Ou B Pang Y Deng J K Nurminen A Yla-Jaaski andP Hui ldquoEnergy- and cost-efficiency analysis of ARM-basedclustersrdquo in Proceedings of the 12th IEEEACM InternationalSymposium on Cluster Cloud and Grid Computing CCGrid2012 pp 115ndash123 Canada May 2012

[22] J L Bez E E Bernart F F dos Santos L M Schnorr and PO Navaux ldquoPerformance and energy efficiency analysis of HPCphysics simulation applications in a cluster of ARMprocessorsrdquoConcurrency and Computation Practice and Experience vol 29no 22 p e4014 2017

[23] C D Schmidt and C D CranorHalf-SyncHalf-Async 1998

24 Security and Communication Networks

[24] X Zhang and K K Parhi ldquoHigh-speed VLSI architectures forthe AES algorithmrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 12 no 9 pp 957ndash967 2004

[25] P Chodowiec and K Gaj ldquoVery compact FPGA implementa-tion of the AES algorithmrdquo in Proceedings of the InternationalWorkshop on Cryptographic Hardware and Embedded SystemsSpringer Heidelberg Berlin Germany 2003

[26] G Singh ldquoA study of encryption algorithms (RSA DES 3DESand AES) for information securityrdquo International Journal ofComputer Applications vol 67 no 19 pp 33ndash38 2013

[27] G Piyush and K Sandeep ldquoA comparative analysis of SHA andMD5 algorithmrdquo Architecture 1 p 5 2014

[28] J Viega P Chandra and M Messier Network Security withOpenSSL Oreilly Publications 1st edition 2002

[29] B Welton D Kimpe J Cope C M Patrick K Iskra andR Ross ldquoImproving IO forwarding throughput with datacompressionrdquo in Proceedings of the 2011 IEEE InternationalConference on Cluster Computing CLUSTER 2011 pp 438ndash445USA September 2011

[30] A Timor A Mendelson Y Birk and N Suri ldquoUsing under-utilized CPU resources to enhance its reliabilityrdquo IEEE Trans-actions on Dependable and Secure Computing vol 7 no 1 pp94ndash109 2010

[31] httpnginxorgen[32] httphttpdapacheorgdocscurrentprogramsabhtml[33] httpwwwintelcomcontentwwwusenarchitecture-and-

technologyintel-quick-assist-technology-overviewhtml[34] D Kline N Parshook X Ge et al ldquoHolistically evaluating

the environmental impacts in modern computing systemsrdquoin Proceedings of the 7th International Green and SustainableComputing Conference IGSC 2016 pp 1ndash8 China November2016

[35] K Jonathan ldquoGrowth in data center electricity use 2005 to 2010rdquoA report by Analytical Press completed at the request of The NewYork Times 9 2011

[36] A K Jones ldquoGreen computing new challenges and opportuni-tiesrdquo in Proceedings of the on Great Lakes Symposium on VLSIp 3 ACM 2017

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Active and Passive Electronic Components

VLSI Design

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Shock and Vibration

Hindawiwwwhindawicom Volume 2018

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Journal ofEngineeringVolume 2018

SensorsJournal of

Hindawiwwwhindawicom Volume 2018

International Journal of

RotatingMachinery

Hindawiwwwhindawicom Volume 2018

Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Navigation and Observation

International Journal of

Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom

Page 21: Hardware/Software Adaptive Cryptographic …downloads.hindawi.com/journals/scn/2018/7631342.pdfSecurityandCommunicationNetworks Forexample,workin[] implementedAESaccelerationfor embedded

Security and Communication Networks 21

4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KBHW-SW co-design vs SW-NI improvement 16 processes -115 -081 -070 -075 -238 315 325 290HW-SW co-design vs SW-NI improvement 32 processes -092 -083 -093 -004 910 1751 1920 2011CPU idle 16 processes 150 160 170 180 1550 1800 2000 2050CPU idle 32 processes 250 200 200 180 150 180 150 160

page size

16 CPUs ECDHE-RSA-AES256-SHA384 network bandwidth improvement and CPU idle

HW-SW co-design vs SW-NI improvement 16 processes HW-SW co-design vs SW-NI improvement 32 processesCPU idle 16 processes CPU idle 32 processes

minus500

000

500

1000

1500

2000

2500

Net

wor

k Ba

ndw

idth

Impr

ovem

ent

minus500

000

500

1000

1500

2000

2500

CPU

idle

Figure 24 Bandwidth improvement and CPU idle comparison between different working flows for ECDHE-RSA-AES256-SHA384

Table 8 ECDHE-RSA-DES-CBC3-SHA RPS with different working flows (16 CPUs 32 processes)

Page Size (KB) 4 8 16 32 64 128 256 512SW 3633470 2520740 1279160 690093 358973 183376 92440 46612HW 5827570 5071780 2577270 1452570 764565 390254 196862 99088

(3) Test Conclusion In this section we tested and analyzedthe end-to-end performance for cipher suite ECDHE-RSA-AES256-SHA384 in diverse page sizes at different availableCPU processes Through the experiment results we canfigure out the conclusions as follows

(1) Generally speaking proposed HW-SW codesigncould get the best performance compared with onlysoftware or hardware encryption For best case HW-SW codesign could get 5320 bandwidth improve-ment compared with only hardware encryption and2011 improvement compared with AES-NI Thecontribution comes from aggregation strategy andadaptive scheduling

(2) As we concluded in Figure 24 the proposed method-ology could provide a possibility for a better trade-off between performance and CPU idle It is possibleto get the same network bandwidthRPS with AES-NI but still have additional available CPU resourceThe user could utilize the saved CPU for otherimportant tasks or take full advantage of it for furtherperformance improvement

6.3.2. Cipher Suite ECDHE-RSA-DES-CBC3-SHA. Since 3DES is not supported by AES-NI and software encryption contributes little to HW-SW codesign for this cipher, we tested the end-to-end performance of two working flows in this section. The first is software encryption without AES-NI. The other is hardware encryption with crypto engines. The RPS results for cipher suite ECDHE-RSA-DES-CBC3-SHA are shown in Table 8. Hardware encryption delivers 1.6 to 2.1 times the performance of software encryption.

We summarize the encryption bandwidth and the performance improvement in Figures 25 and 26. The maximum network bandwidth with hardware encryption reaches 4954.4 MB/s for large data blocks. Under the same test conditions, this is a 60% to 112% improvement over software encryption. Besides the performance gain, 13.62% to 89.54% of the CPU remains idle with hardware encryption.
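The relation between the RPS values of Table 8 and the bandwidth and improvement figures can be checked with a few lines of arithmetic. The sketch below assumes that bandwidth is simply RPS multiplied by page size with 1 MB = 1024 KB, and that the Table 8 entries carry one decimal place; both are assumptions chosen because they reproduce the values reported for Figures 25 and 26. The function names are illustrative only.

```python
# Page sizes in KB and requests per second (RPS) as read from Table 8
# (one decimal place assumed for each entry).
PAGE_KB = [4, 8, 16, 32, 64, 128, 256, 512]
SW_RPS = [363347.0, 252074.0, 127916.0, 69009.3, 35897.3, 18337.6, 9244.0, 4661.2]
HW_RPS = [582757.0, 507178.0, 257727.0, 145257.0, 76456.5, 39025.4, 19686.2, 9908.8]


def bandwidth_mb_s(rps: float, page_kb: int) -> float:
    """Payload bandwidth in MB/s, assuming 1 MB = 1024 KB."""
    return rps * page_kb / 1024.0


def improvement_pct(hw: float, sw: float) -> float:
    """Relative gain of hardware over software encryption, in percent."""
    return (hw / sw - 1.0) * 100.0


for kb, sw, hw in zip(PAGE_KB, SW_RPS, HW_RPS):
    print(f"{kb:>3} KB  SW {bandwidth_mb_s(sw, kb):7.1f} MB/s  "
          f"HW {bandwidth_mb_s(hw, kb):7.1f} MB/s  "
          f"gain {improvement_pct(hw, sw):6.2f}%")
```

Under these assumptions the 512 KB row gives about 4954.4 MB/s for hardware encryption, and the 4 KB row gives a gain of roughly 60.4%, in line with the numbers quoted in the text.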

According to the test in OpenSSL, the maximum performance improvement is 12 times. However, the RPS improvement in end-to-end testing is only 2 times. The cause is a constraint of the testing platform: we found that software decryption on the client has become the bottleneck, and all of the client's CPUs are fully loaded.
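The gap between the two figures can be read as a two-stage bottleneck: a request completes only when both the server and the client have finished their work, so the slower side bounds the sustained rate. The following sketch uses normalised, purely illustrative rates (they are not measurements from this work), and `end_to_end_rps` is a hypothetical helper name.

```python
def end_to_end_rps(server_rps: float, client_rps: float) -> float:
    """Sustained request rate when server and client work in series:
    the slower side limits the whole path."""
    return min(server_rps, client_rps)


# Illustrative, normalised rates (assumptions, not measured values).
SW_SERVER = 1.0                # server using software 3DES
HW_SERVER = 12.0 * SW_SERVER   # server using crypto engines, 12x in isolation
CLIENT_LIMIT = 2.0             # client software decryption saturates here

baseline = end_to_end_rps(SW_SERVER, CLIENT_LIMIT)
accelerated = end_to_end_rps(HW_SERVER, CLIENT_LIMIT)
speedup = accelerated / baseline
print(f"end-to-end speedup: {speedup:.1f}x")
```

With these numbers the server-side gain of 12 times collapses to 2 times end to end, because the client cannot decrypt any faster.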

7. Conclusions and Future Work

In this work, we presented ACSA (Adaptive Crypto System based on Accelerators), which adopts the crypto mode adaptively and dynamically according to the request characteristics and the system load. We first surveyed and analyzed different working flows with SSL/TLS and found that neither hardware nor software encryption can claim the best performance for all application scenarios. Even with advanced instruction acceleration for AES, the CPU

Figure 25: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth (MB/s) by page size.

Page size  4 KB    8 KB    16 KB   32 KB   64 KB   128 KB  256 KB  512 KB
SW         1419.3  1969.3  1998.7  2156.5  2243.6  2292.2  2311.0  2330.6
HW         2276.4  3962.3  4027.0  4539.3  4778.5  4878.2  4921.6  4954.4

Figure 26: 16 CPUs, 32 processes, ECDHE-RSA-DES-CBC3-SHA network bandwidth improvement and CPU idle by page size.

Page size              4 KB    8 KB     16 KB    32 KB    64 KB    128 KB   256 KB   512 KB
HW vs. SW improvement  60.39%  101.20%  101.48%  110.49%  112.99%  112.82%  112.96%  112.58%
CPU idle               13.62%  15.40%   53.15%   70.40%   75.07%   86.04%   88.17%   89.54%

occupation is considerable compared with hardware crypto. Although HACs perform well for computation, the invocation cost cannot be ignored for small data blocks. We not only proposed optimization strategies such as data aggregation to increase the contribution of hardware crypto engines but also presented the MM strategy (maximizing utilization with minimal overhead) and adaptive scheduling to take full advantage of both software and hardware encryption. By establishing a 40 Gbps network, we were able to evaluate the system performance in real applications under a high workload with various benchmarks and system configurations. For the encryption algorithm 3DES, which is not supported by AES-NI, we obtained about 12 times acceleration with accelerators. For the typical encryption algorithm AES, which is supported by instruction acceleration, we obtained a 53.20% bandwidth improvement over hardware-only encryption and a 20.11% improvement over AES-NI. Furthermore, the user can adjust the trade-off between CPU occupation and encryption performance through the MM strategy to free CPUs according to the working requirements.
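The interplay of invocation cost, data aggregation, and load-aware selection can be illustrated with a small cost model. This is a minimal sketch under stated assumptions, not the MM algorithm or the ACSA scheduler themselves: the linear cost model, the `CryptoCosts` parameters, the `min_idle` threshold, and all numeric values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CryptoCosts:
    """Hypothetical cost parameters; none of these values come from the paper."""
    sw_mb_s: float       # software (instruction-accelerated) throughput, MB/s
    hw_mb_s: float       # accelerator throughput, MB/s
    hw_invoke_us: float  # fixed cost of one accelerator invocation, microseconds


def sw_time_us(c: CryptoCosts, nbytes: int) -> float:
    # bytes / (MB/s) yields microseconds when 1 MB = 10**6 bytes
    return nbytes / c.sw_mb_s


def hw_time_us(c: CryptoCosts, nbytes: int) -> float:
    return c.hw_invoke_us + nbytes / c.hw_mb_s


def break_even_bytes(c: CryptoCosts) -> float:
    """Block size above which a single accelerator call beats software."""
    if c.hw_mb_s <= c.sw_mb_s:
        return math.inf
    return c.hw_invoke_us / (1.0 / c.sw_mb_s - 1.0 / c.hw_mb_s)


def choose_engine(c: CryptoCosts, nbytes: int, cpu_idle: float,
                  min_idle: float = 0.10) -> str:
    """Return "hw" or "sw" for one request.

    cpu_idle is the idle fraction of the CPUs (0.0 to 1.0). When the CPUs
    are nearly saturated, the request is offloaded regardless of its size.
    """
    if cpu_idle < min_idle:
        return "hw"
    return "hw" if hw_time_us(c, nbytes) < sw_time_us(c, nbytes) else "sw"


def aggregated_hw_time_us(c: CryptoCosts, sizes: list[int]) -> float:
    """Data aggregation: several blocks share one invocation cost."""
    return c.hw_invoke_us + sum(sizes) / c.hw_mb_s


costs = CryptoCosts(sw_mb_s=1000.0, hw_mb_s=4000.0, hw_invoke_us=30.0)
print(choose_engine(costs, 4096, cpu_idle=0.5))    # small block, idle CPUs
print(choose_engine(costs, 65536, cpu_idle=0.5))   # large block
print(choose_engine(costs, 4096, cpu_idle=0.05))   # small block, busy CPUs
```

With these illustrative parameters the break-even size is about 40 KB, so a 4 KB request stays in software while the CPUs have headroom. Aggregating sixteen 4 KB blocks into one accelerator call pays the invocation cost once instead of sixteen times, which is the effect the aggregation strategy relies on.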

The proposed design methodology is general. The design flows of ACSA and the MM algorithm are applicable to other similar designs with hardware acceleration for secure web access [33]. As long as the design is a heterogeneous architecture with CPUs and accelerators, the proposed methodology applies regardless of the type of CPU and accelerator. Besides, this work is based on an ARM server, which can be extended to energy efficiency exploration [34], providing emerging solutions for data centers besides x86-based architectures [35, 36]. In future work, based on our current understanding of hardware and software encryption features, we will study resource allocation strategies for heterogeneous server centers on different architectures from the perspective of energy efficiency rather than performance.

Data Availability

The data and code are available on the lab server. Anyone who would like to obtain access can send an email to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61502061), Chongqing Application Foundation and Research in Cutting-Edge Technologies (cstc2015jcyjA40016), and the Fundamental Research Funds for the Central Universities (106112017CDJXY180004).

References

[1] N. Khan, I. Yaqoob, I. A. T. Hashem et al., "Big data: survey, technologies, opportunities, and challenges," The Scientific World Journal, vol. 2014, Article ID 712826, 18 pages, 2014.

[2] L. Li, K. Ota, Z. Zhang, and Y. Liu, "Security and privacy protection of social networks in big data era," Mathematical Problems in Engineering, vol. 2018, Article ID 6872587, 2 pages, 2018.

[3] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1–11, 2011.

[4] J. Wilson, R. S. Wahby, H. Corrigan-Gibbs, D. Boneh, P. Levis, and K. Winstein, "Trust but verify: auditing the secure internet of things," in Proceedings of the 15th ACM International Conference on Mobile Systems, Applications, and Services, MobiSys, pp. 464–474, USA, 2017.

[5] A. Eric Young, J. Tim Hudson, and R. Engelschall, OpenSSL: The Open Source Toolkit for SSL/TLS, 2011.

[6] S. Baskaran and P. Rajalakshmi, "Hardware-software co-design of AES on FPGA," in Proceedings of the 2012 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2012, pp. 1118–1122, India, August 2012.

[7] L. Bossuet, M. Grand, L. Gaspar, V. Fischer, and G. Gogniat, "Architectures of flexible symmetric key crypto engines-a survey: From hardware coprocessor to multi-crypto-processor system on chip," ACM Computing Surveys, vol. 45, no. 4, 2013.

[8] C. Maharak and B. Sowanwanichakul, "Security methods for web-based applications on embedded system," in Proceedings of the IEEE TENCON 2004 - 2004 IEEE Region 10 Conference: Analog and Digital Techniques in Electrical Engineering, pp. C56–C59, Thailand, November 2004.

[9] A. B. Smith, C. D. Jones, and E. F. Roberts, "Article David Brumley and Dan Boneh, Remote Timing Attacks are Practical," in Proceedings of the 12th Usenix Security Symposium, 2003.

[10] V. P. Nambiar and M. M. Zabidi, Accelerating the AES Encryption Function in OpenSSL for Embedded Systems, Inderscience Publishers, 2009.

[11] M. Khalil, M. Nazrin, and Y. W. Hau, "Implementation of SHA-2 hash function for a digital signature System-on-Chip in FPGA," in Proceedings of the 2008 International Conference on Electronic Design, ICED 2008, Malaysia, December 2008.

[12] D. B. Roy, S. Agrawal, C. Reberio, and D. Mukhopadhyay, "Accelerating OpenSSL's ECC with low cost reconfigurable hardware," in Proceedings of the 2016 International Symposium on Integrated Circuits, ISIC 2016, Singapore, December 2016.

[13] A. Thiruneelakandan and T. Thirumurugan, "An approach towards improved cyber security by hardware acceleration of OpenSSL cryptographic functions," in Proceedings of the 1st International Conference on Electronics, Communication and Computing Technologies 2011, ICECCT'11, pp. 13–16, India, September 2011.

[14] C. Su, C. Wang, K. Cheng, C. Huang, and C. Wu, "Design and test of a scalable security processor," in Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference, p. 372, Shanghai, China, January 2005.

[15] T. Isobe, S. Tsutsumi, K. Seto, K. Aoshima, and K. Kariya, "10 Gbps implementation of TLS/SSL accelerator on FPGA," in Proceedings of the 2010 IEEE 18th International Workshop on Quality of Service, IWQoS 2010, IEEE, China, June 2010.

[16] H. Wang, G. Bai, and C. H. Zodiac, "System architecture implementation for a high-performance Network Security Processor," in Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, pp. 91–96, IEEE, Belgium, July 2008.

[17] R. Jeffrey, "Intel advanced encryption standard instructions (aes-ni)," Tech. Rep., Intel, 2010.

[18] M. Khalil-Hani, V. P. Nambiar, and M. N. Marsono, "Hardware acceleration of OpenSSL cryptographic functions for high-performance internet security," in Proceedings of the UKSim/AMSS 1st International Conference on Intelligent Systems, Modelling and Simulation, ISMS 2010, pp. 374–379, UK, January 2010.

[19] B. P. Kumar, P. Ezhumalai, and P. Ramesh, "Improving the performance of a scalable encryption algorithm (SEA) using FPGA," International Journal of Computer Science and Network Security, vol. 10, no. 2, 2010.

[20] D. Jacquet, F. Hasbani, P. Flatresse et al., "A 3 GHz dual core processor ARM Cortex™-A9 in 28 nm UTBB FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization," IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, 2014.

[21] Z. Ou, B. Pang, Y. Deng, J. K. Nurminen, A. Yla-Jaaski, and P. Hui, "Energy- and cost-efficiency analysis of ARM-based clusters," in Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2012, pp. 115–123, Canada, May 2012.

[22] J. L. Bez, E. E. Bernart, F. F. dos Santos, L. M. Schnorr, and P. O. Navaux, "Performance and energy efficiency analysis of HPC physics simulation applications in a cluster of ARM processors," Concurrency and Computation: Practice and Experience, vol. 29, no. 22, p. e4014, 2017.

[23] C. D. Schmidt and C. D. Cranor, Half-Sync/Half-Async, 1998.

[24] X. Zhang and K. K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, "Very compact FPGA implementation of the AES algorithm," in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, "A study of encryption algorithms (RSA, DES, 3DES and AES) for information security," International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, "A comparative analysis of SHA and MD5 algorithm," Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, Oreilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, "Improving I/O forwarding throughput with data compression," in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, "Using underutilized CPU resources to enhance its reliability," IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/

[32] http://httpd.apache.org/docs/current/programs/ab.html

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

[34] D. Kline, N. Parshook, X. Ge et al., "Holistically evaluating the environmental impacts in modern computing systems," in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, "Growth in data center electricity use 2005 to 2010," A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, "Green computing: new challenges and opportunities," in Proceedings of the on Great Lakes Symposium on VLSI, p. 3, ACM, 2017.

Page 23: Hardware/Software Adaptive Cryptographic …downloads.hindawi.com/journals/scn/2018/7631342.pdfSecurityandCommunicationNetworks Forexample,workin[] implementedAESaccelerationfor embedded

Security and Communication Networks 23

efficiency exploration [34] providing emerging solutionsfor data center besides X86 based architectures [35 36] Infuturework based on our current understanding of hardwareand software encryption features our research attempt is tostudy resource allocation strategies for heterogeneous servercenters in different architectures from the perspective ofenergy efficiency rather than performance

Data Availability

The data and codes are available on the lab server Anyonethat would like to obtain access could send an email to thecorresponding author

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

Acknowledgments

This work is supported by the National Natural ScienceFoundation of China (61502061) Chongqing ApplicationFoundation and Research in Cutting-Edge Technologies(cstc2015jcyjA40016) and the Fundamental Research Fundsfor the Central Universities (106112017CDJXY180004)

References

[1] N Khan I Yaqoob I A T Hashem et al ldquoBig data sur-vey technologies opportunities and challengesrdquo The ScientificWorld Journal vol 2014 Article ID 712826 18 pages 2014

[2] L Li K Ota Z Zhang and Y Liu ldquoSecurity and privacyprotection of social networks in big data erardquo MathematicalProblems in Engineering vol 2018 Article ID 6872587 2 pages2018

[3] S Subashini and V Kavitha ldquoA survey on security issues inservice deliverymodels of cloud computingrdquo Journal ofNetworkand Computer Applications vol 34 no 1 pp 1ndash11 2011

[4] J Wilson R S Wahby H Corrigan-Gibbs D Boneh P Levisand K Winstein ldquoTrust but verify auditing the secure internetof thingsrdquo in Proceedings of the 15th ACM International Confer-ence on Mobile Systems Applications and Services MobiSys pp464ndash474 USA 2017

[5] A Eric Young J TimHudson andR EngelschallOpenSSLTheOpen Source Toolkit for ssltls 2011

[6] S Baskaran and P Rajalakshmi ldquoHardware-software co-designof AES on FPGArdquo in Proceedings of the 2012 InternationalConference on Advances in Computing Communications andInformatics ICACCI 2012 pp 1118ndash1122 India August 2012

[7] L Bossuet M Grand L Gaspar V Fischer and G Gog-niat ldquoArchitectures of flexible symmetric key crypto engines-asurvey From hardware coprocessor to multi-crypto-processorsystem on chiprdquo ACM Computing Surveys vol 45 no 4 2013

[8] C Maharak and B Sowanwanichakul ldquoSecurity methods forweb-based applications on embedded systemrdquo in Proceedingsof the IEEE TENCON 2004 - 2004 IEEE Region 10 ConferenceAnalog and Digital Techniques in Electrical Engineering ppC56ndashC59Thailand November 2004

[9] A B Smith C D Jones and E F Roberts ldquoArticle DavidBrumley andDanBoneh Remote TimingAttacks are Practicalrdquoin Proceedings of the 12th Usenix Security Symposium 2003

[10] V P Nambiar and M M Zabidi Accelerating the AES Encryp-tion Function in OpenSSL for Embedded Systems IndersciencePublishers 2009

[11] MKhalilMNazrin andYWHau ldquoImplementation of SHA-2hash function for a digital signature System-on-Chip in FPGArdquoin Proceedings of the 2008 International Conference on ElectronicDesign ICED 2008 Malaysia December 2008

[12] D B Roy S Agrawal C Reberio and D MukhopadhyayldquoAccelerating OpenSSLrsquos ECC with low cost reconfigurablehardwarerdquo in Proceedings of the 2016 International Symposiumon Integrated Circuits ISIC 2016 Singapore December 2016

[13] A Thiruneelakandan and T Thirumurugan ldquoAn approachtowards improved cyber security by hardware acceleration ofOpenSSL cryptographic functionsrdquo in Proceedings of the 1stInternational Conference on Electronics Communication andComputing Technologies 2011 ICECCTrsquo11 pp 13ndash16 IndiaSeptember 2011

[14] C Su C Wang K Cheng C Huang and C Wu ldquoDesignand test of a scalable security processorrdquo in Proceedings ofthe ASP-DAC 2005 Asia and South Pacific Design AutomationConference p 372 Shanghai China January 2005

[15] T Isobe S Tsutsumi K Seto K Aoshima and K Kariyaldquo10Gbps implementation of TLSSSL accelerator on FPGArdquo inProceedings of the 2010 IEEE 18th International Workshop onQuality of Service IWQoS 2010 IEEE China June 2010

[16] H Wang G Bai and C H Zodiac ldquoSystem architectureimplementation for a high-performance Network SecurityProcessorrdquo in Proceedings of the International Conference onApplication-Specific Systems Architectures and Processors pp91ndash96 IEEE Belgium July 2008

[17] R Jeffrey ldquoIntel advanced encryption standard instructions(aes-ni)rdquo Tech Rep Intel 2010

[18] M Khalil-Hani V P Nambiar and M N Marsono ldquoHard-ware acceleration of OpenSSL cryptographic functions forhigh-performance internet securityrdquo in Proceedings of theUKSimAMSS 1st International Conference on Intelligent Sys-tems Modelling and Simulation ISMS 2010 pp 374ndash379 UKJanuary 2010

[19] B P Kumar P Ezhumalai and P Ramesh ldquoImproving theperformance of a scalable encryption algorithm (SEA) usingFPGArdquo International Journal of Computer Science and NetworkSecurity vol 10 no 2 2010

[20] D Jacquet F Hasbani P Flatresse et al ldquoA 3 GHz dual coreprocessor ARM cortex TM -A9 in 28 nm UTBB FD-SOICMOS with ultra-wide voltage range and energy efficiencyoptimizationrdquo IEEE Journal of Solid-State Circuits vol 49 no4 pp 812ndash826 2014

[21] Z Ou B Pang Y Deng J K Nurminen A Yla-Jaaski andP Hui ldquoEnergy- and cost-efficiency analysis of ARM-basedclustersrdquo in Proceedings of the 12th IEEEACM InternationalSymposium on Cluster Cloud and Grid Computing CCGrid2012 pp 115ndash123 Canada May 2012

[22] J L Bez E E Bernart F F dos Santos L M Schnorr and PO Navaux ldquoPerformance and energy efficiency analysis of HPCphysics simulation applications in a cluster of ARMprocessorsrdquoConcurrency and Computation Practice and Experience vol 29no 22 p e4014 2017

[23] C D Schmidt and C D CranorHalf-SyncHalf-Async 1998

24 Security and Communication Networks

[24] X. Zhang and K. K. Parhi, “High-speed VLSI architectures for the AES algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.

[25] P. Chodowiec and K. Gaj, “Very compact FPGA implementation of the AES algorithm,” in Proceedings of the International Workshop on Cryptographic Hardware and Embedded Systems, Springer, Heidelberg, Berlin, Germany, 2003.

[26] G. Singh, “A study of encryption algorithms (RSA, DES, 3DES and AES) for information security,” International Journal of Computer Applications, vol. 67, no. 19, pp. 33–38, 2013.

[27] G. Piyush and K. Sandeep, “A comparative analysis of SHA and MD5 algorithm,” Architecture, 1, p. 5, 2014.

[28] J. Viega, P. Chandra, and M. Messier, Network Security with OpenSSL, O'Reilly Publications, 1st edition, 2002.

[29] B. Welton, D. Kimpe, J. Cope, C. M. Patrick, K. Iskra, and R. Ross, “Improving I/O forwarding throughput with data compression,” in Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, pp. 438–445, USA, September 2011.

[30] A. Timor, A. Mendelson, Y. Birk, and N. Suri, “Using underutilized CPU resources to enhance its reliability,” IEEE Transactions on Dependable and Secure Computing, vol. 7, no. 1, pp. 94–109, 2010.

[31] http://nginx.org/en/

[32] http://httpd.apache.org/docs/current/programs/ab.html

[33] http://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html

[34] D. Kline, N. Parshook, X. Ge et al., “Holistically evaluating the environmental impacts in modern computing systems,” in Proceedings of the 7th International Green and Sustainable Computing Conference, IGSC 2016, pp. 1–8, China, November 2016.

[35] K. Jonathan, “Growth in data center electricity use 2005 to 2010,” A report by Analytical Press, completed at the request of The New York Times, 9, 2011.

[36] A. K. Jones, “Green computing: new challenges and opportunities,” in Proceedings of the Great Lakes Symposium on VLSI, p. 3, ACM, 2017.
