2012 IEEE 19th International Conference on Web Services (ICWS), Honolulu, HI, USA, 24-29 June 2012

Overcoming Large Data Transfer Bottlenecks in RESTful Service Orchestrations

Ryan K L Ko, Markus Kirchberg, Bu Sung Lee
Service Platform Lab, HP Labs Singapore
{ryan.ko, markus.kirchberg, francis.lee}@hp.com

Elroy Chew
School of Computing, National University of Singapore
[email protected]

Abstract- As RESTful (Representational State Transfer) services are closely coupled to HTTP (Hypertext Transfer Protocol), which in turn sits above the connection-based TCP (Transmission Control Protocol), it is common for RESTful services to experience latency and transfer inefficiencies, especially in situations requiring the services to transfer large-scale data (i.e. gigabytes of data and above) in RESTful workflows. Such inefficiencies are undesirable and impractical, and are compounded for RESTful service orchestrations in data-intensive industries such as Big Data analytics, cloud computing and life sciences. In this paper, we propose a novel non-invasive technique, Fast-Optimised-REST (FOREST), which enables RESTful services to overcome the traditional bottlenecks experienced during the transfer of large sets of data. The initial experimental results show promise, demonstrating reductions of up to 80% from the original RESTful data transfer times for extremely large data sets.

Keywords- FOREST, REST, Big Data, Large Data Transfers, Service Orchestrations, Workflows, Transfer Bottlenecks, RESTful Services, UDT, TCP

I. INTRODUCTION

The increased adoption of REST (Representational State Transfer) for service orchestrations is mainly due to the simplicity and elegance of communicating data amongst RESTful services via HTTP (Hypertext Transfer Protocol) methods (e.g. GET, POST, and PUT), and the quick time-to-market resulting from the simple orchestration requirements. While REST works well for bite-sized data transfers (e.g. XML files of a few kilobytes), it experiences undesirable transfer bottlenecks for large data sets. Some commonly seen examples are RESTful bio-medical workflows crunching and mining data for insights on drug effectiveness using RapidMiner, or RESTful scientific workflows composed on jOpera running in grids. These workflows are sensitive to the order of the steps and to the control flows of the underlying services. Very often, large data analyses are bottlenecked by slow data transfer, leading to inefficiency and user frustration.

When we observe the nature of REST, we start to identify REST’s dependency on HTTP, which is closely coupled to the Transmission Control Protocol (TCP). TCP, being a connection-based protocol, has to ensure almost zero loss and high integrity of data. However, in some situations, conducting checks for each transferred packet may be too “heavy-weight” for RESTful services. There has to be a way to maintain this integrity while reducing the transfer bottleneck times currently experienced.

In this paper, we propose a novel, non-invasive, legacy-preserving solution, FOREST (Fast Optimal REST), for fast transfers of large data sets over REST services, with reliable receipt and control of transfer rates and security, by encapsulating the original RESTful TCP data in UDT (UDP-based Data Transfer) [1-3]. FOREST’s results show promising improvements, reducing the transfer time by up to 80% compared with traditional RESTful service orchestrations.

II. RELATED WORK

A. Industry UDP-related Solutions

At the time of writing, several UDT and UDP-based solutions exist in the market. However, these solutions do not directly address the fundamental large data transfer problems of existing RESTful services. Examples are Globus Online and Aspera.

B. Our Focus vs. Current Industry Focus

To the best of our knowledge, there is still an absence of practical solutions for services experiencing large data transfer bottlenecks. With the increased adoption of, and reliance on, service-oriented workflows for data analytics and cloud computing services, it will only be a matter of time before implementers of Web service interfaces require more efficient methods for data transfers, so that the workflows can complete execution within a reasonable and acceptable time while preserving existing architectures for the next technology.

III. FAST OPTIMAL REST (FOREST)

A. Key Idea - Hybridizing 2 Technologies

Our solution hybridizes the fast transfer properties of UDT (UDP-based Data Transfer) with the encapsulation features of iproxy.

Figure 1: High-level overview of our non-invasive FOREST solution

[Figure 1 flow: Sender ReST service sends data (e.g. HTTP PUT) -> sender side: data meant to be sent by TCP -> sender side: TCP data encapsulated as UDT payload via iproxy -> data sent efficiently by UDT -> receiver extracts UDT payload that contains ReST TCP data via iproxy -> ReST receiver service receives HTTP PUT data 'thinking' it was received via TCP.]

As shown in Figure 1, and in contrast to traditional approaches, our proposed Fast Optimal REST (FOREST) encapsulates the sender REST service’s TCP connections into the payload of the UDT protocol and sends them efficiently to the receiver, which then extracts the contents of the UDT payload as the data is received, just as it typically would with TCP.

2012 IEEE 19th International Conference on Web Services
978-0-7695-4752-7/12 $26.00 © 2012 IEEE
DOI 10.1109/ICWS.2012.107

The first part of our hybridized solution is iproxy [4], which is a method of encapsulating the sender’s TCP connections into the payload of UDP datagrams and sending them to the receiver’s end. iproxy implements a connection-based protocol over UDP, identifying UDP packets by their source port, source IP address and Client ID. It comprises a client-side proxy and a server-side proxy that allow arbitrary TCP/IP services to run over broadcast, multicast or unicast UDP. It was originally conceived as a way to configure, through a web-based interface, servers that had not yet been given an IP address on the LAN [4]. iproxy was designed to enable networked devices to be configured without prior knowledge of their network setup [4]. This property is desirable for Web services, since it aids the initial setup and deployment onto a new network.
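To make the connection-identification idea concrete, the following is a minimal sketch (not iproxy's actual code) of how a proxy could demultiplex datagrams by source IP address, source port and Client ID. The `ConnectionTable` class, its method names and the sample addresses are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical connection key: (source IP, source port, client ID).
ConnKey = Tuple[str, int, int]


@dataclass
class ConnectionTable:
    """Groups datagram payloads into per-connection byte streams."""
    streams: Dict[ConnKey, bytearray] = field(default_factory=dict)

    def on_datagram(self, src_ip: str, src_port: int,
                    client_id: int, payload: bytes) -> ConnKey:
        # Datagrams sharing the same triple belong to the same logical connection.
        key = (src_ip, src_port, client_id)
        self.streams.setdefault(key, bytearray()).extend(payload)
        return key


# Two logical connections from the same host and port, told apart by client ID.
table = ConnectionTable()
table.on_datagram("10.0.0.5", 40000, 1, b"PUT /data ")
table.on_datagram("10.0.0.5", 40000, 2, b"GET /status")
table.on_datagram("10.0.0.5", 40000, 1, b"HTTP/1.1")
```

Keying on the full triple rather than on the address alone is what lets several encapsulated TCP connections share one UDP endpoint in this sketch.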

The second part of our hybridized solution is the use of UDT as the backbone of large data transfers over a UDP channel [1-3]. UDT’s algorithm is known for its efficiency, as it takes a constant time to ramp up to 90% of the link capacity. UDT is also fair when all flows share the same end-to-end link capacity, because a flow with a lower sending rate has an increase parameter at least as large as that of a flow with a higher sending rate. UDT is friendly to TCP in networks with a low bandwidth-delay product (BDP), as UDT increases its speed slowly in low-BDP environments. UDT is connection-oriented in order to handle flow and congestion control.
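For readers unfamiliar with the term, the bandwidth-delay product is the link bandwidth multiplied by the round-trip time, i.e. the amount of data that can be in flight at once. The short sketch below computes it for two hypothetical links; the bandwidth and round-trip figures are illustrative assumptions, not measurements from this paper.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product in bytes: bits in flight divided by 8."""
    return bandwidth_bits_per_s * rtt_seconds / 8


# Hypothetical low-BDP link: 100 Mbit/s LAN with a 1 ms round trip.
lan_bdp = bdp_bytes(100e6, 0.001)

# Hypothetical high-BDP link: 1 Gbit/s WAN with a 100 ms round trip.
wan_bdp = bdp_bytes(1e9, 0.100)

print(f"LAN BDP: {lan_bdp / 1e3:.1f} kB, WAN BDP: {wan_bdp / 1e6:.1f} MB")
```

The three orders of magnitude between the two values illustrate why a sender must keep far more data in flight to fill a long, fast link than a local one.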

On their own, neither iproxy nor UDT is able to handle large data transfers for RESTful services. Given the nature of RESTful transfers, we propose a means of high-speed large data transfer for REST services that hybridizes the two by encapsulating TCP connections into UDT datagrams.

B. How FOREST Works

Initial experiments have shown that the approach is promising as an alternative data transfer mechanism where TCP does not function well, for example in overcoming TCP’s inefficiency in high-BDP networks. We have modified UDT to accept incoming TCP connections, which are encapsulated in the UDT payload and transferred over UDP to the receiver’s end whenever a RESTful service orchestration takes place.

For each RESTful data transfer, the sender REST service:

• Listens for any outgoing TCP connections with (large) data from REST services; and

• Encapsulates the TCP connection into the payload of the UDT datagram and then sends it over a UDP channel to the port of the receiver.

At the same time, the receiver REST service:

• Listens for any incoming UDT datagrams at the assigned listening port; and

• Forwards any incoming datagram to the relevant REST service after extracting the encapsulated TCP data from the payload of the UDT datagram.
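The sender and receiver roles listed above can be sketched as follows. This is an in-memory illustration only, under several assumptions: a Python list stands in for the UDT-over-UDP channel, the `Datagram`, `SenderProxy` and `ReceiverProxy` names are hypothetical, and the sequence-number reordering shown here is a simplified stand-in for the reliability that UDT itself provides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Datagram:
    conn_id: int    # identifies the encapsulated TCP connection
    seq: int        # position of this chunk within the connection
    payload: bytes  # slice of the original TCP byte stream


class SenderProxy:
    """Chunks outgoing TCP bytes and places them on the datagram channel."""

    def __init__(self, channel: List[Datagram], chunk_size: int = 1400):
        self.channel = channel
        self.chunk_size = chunk_size

    def forward(self, conn_id: int, tcp_bytes: bytes) -> int:
        count = 0
        for offset in range(0, len(tcp_bytes), self.chunk_size):
            chunk = tcp_bytes[offset:offset + self.chunk_size]
            self.channel.append(Datagram(conn_id, count, chunk))
            count += 1
        return count


class ReceiverProxy:
    """Extracts payloads and hands them to the REST service in order."""

    def __init__(self, deliver: Callable[[int, bytes], None]):
        self.deliver = deliver
        self.expected: Dict[int, int] = {}
        self.pending: Dict[int, Dict[int, bytes]] = {}

    def on_datagram(self, dgram: Datagram) -> None:
        nxt = self.expected.get(dgram.conn_id, 0)
        if dgram.seq < nxt:
            return  # duplicate of a chunk that was already delivered
        pending = self.pending.setdefault(dgram.conn_id, {})
        pending[dgram.seq] = dgram.payload
        # Flush every chunk that is now contiguous with what was delivered.
        while nxt in pending:
            self.deliver(dgram.conn_id, pending.pop(nxt))
            nxt += 1
        self.expected[dgram.conn_id] = nxt


channel: List[Datagram] = []
received = bytearray()
sender = SenderProxy(channel, chunk_size=4)
receiver = ReceiverProxy(lambda conn_id, data: received.extend(data))

request = b"PUT /dataset HTTP/1.1\r\n\r\npayload"
sent = sender.forward(conn_id=1, tcp_bytes=request)
for dgram in reversed(channel):  # simulate out-of-order arrival
    receiver.on_datagram(dgram)
```

Because the receiving REST service only ever sees the reassembled byte stream, it needs no changes, which is the sense in which the approach is non-invasive.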

UDT datagrams come in two kinds, data packets and control packets [2,3]. The first bit of the packet header differentiates them: 0 marks a data packet and 1 marks a control packet.

C. FOREST RESTful Connection Setup

Our approach uses the original UDT protocol to communicate between the sender and the receiver, making use of Protocol Connection Handshaking packets (i.e. Control Packet TYPE 000 [3]) to set up the connection between them. It is also important to note that UDT only requires a simple two-way handshake, whereas TCP needs a three-way handshake to establish a connection.

D. TCP and UDT Timers

Essentially, our technique is a proxy mechanism, and the proxy process must be able to handle both the TCP timers at the incoming interface and the UDT timers at the outgoing interface without the two interfering with each other during data transmission. Some of the methods we employed in our prototype to separate the entities include:

1. A client id for mapping: With the client id, the proxy is able to identify which incoming connections are received at port 80; packets can be listened for, detected, and then processed accordingly.

2. A multimap of TCP timers: multimap <int EST, int RETRANS, int DELAY, int PERSIST, int KEEP_ALIVE, int FIN_WAIT, int 2MSL> TCP

This multimap of TCP timers (refer to [5]) keeps track of incoming TCP connections so that the two separate entities in the proxy (i.e. TCP and UDT) have a ‘membrane’ between them, across which only the TCP data is passed for encapsulation in the payload of UDT datagrams in the iproxy component of our implementation. Moreover, the TCP timers are set to time out much later than the UDT timers, so the TCP timers should not get mixed with the UDT ones. However, problems might arise if, for example, UDT datagrams are lost too often and the timers time out too often. In such a situation, accumulated EXP timers (0.5 s each) might affect the TCP timers; thus, having a data structure to contain the timers is essential for safekeeping them.

E. Encapsulation and Decapsulation

As mentioned above, the best method found so far for sending large data between REST services is to encapsulate the TCP connection into the payload of the UDT datagram and send it over the high-speed UDP channel to the receiver. This preserves the TCP data, so that the receiver can retrieve the information embedded in the UDT payload and pass it on to the relevant TCP service for processing. Moreover, if only the UDT header is corrupted, the payload can still be processed provided the receiver predicts the correct sequence number; the UDT datagram is then not treated as a lost packet, which avoids a retransmission.
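A byte-level sketch of encapsulation and decapsulation is given below. It assumes a deliberately simplified, hypothetical 8-byte header: the first 32-bit word carries the data/control flag in its first bit (0 for data, 1 for control, as described earlier) followed by a 31-bit sequence number, and the second word carries a client id. The real UDT header has further fields, so this is an illustration of the idea rather than the wire format used by the prototype.

```python
import struct

# Simplified, hypothetical header: two 32-bit big-endian words.
HEADER = struct.Struct("!II")
CONTROL_FLAG = 0x80000000  # first bit set means control packet
SEQ_MASK = 0x7FFFFFFF      # remaining 31 bits hold the sequence number


def encapsulate(seq: int, client_id: int, tcp_payload: bytes) -> bytes:
    """Wrap raw TCP bytes in a data packet (first header bit is 0)."""
    if not 0 <= seq <= SEQ_MASK:
        raise ValueError("sequence number must fit in 31 bits")
    return HEADER.pack(seq, client_id) + tcp_payload


def is_control(packet: bytes) -> bool:
    """Return True when the first header bit marks a control packet."""
    (first_word,) = struct.unpack_from("!I", packet)
    return bool(first_word & CONTROL_FLAG)


def decapsulate(packet: bytes):
    """Return (seq, client_id, tcp_payload) for a data packet."""
    first_word, client_id = HEADER.unpack_from(packet)
    if first_word & CONTROL_FLAG:
        raise ValueError("control packets carry no encapsulated TCP data")
    return first_word & SEQ_MASK, client_id, packet[HEADER.size:]


packet = encapsulate(seq=7, client_id=42, tcp_payload=b"PUT /dataset HTTP/1.1")
seq, client_id, payload = decapsulate(packet)
```

The payload is carried through untouched, so the receiving side can hand exactly the original TCP bytes to the target service.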

Figure 2: Screen Capture of Console Output for Transfer of 103 MB of Data between 2 RESTful Services on FOREST

F. Implementation of FOREST Prototype

The prototype, which incorporates the previously discussed algorithms and protocols and demonstrates efficient large data transfer over REST services, was built with Python calling both the iproxy and UDT APIs (see Figure 2). Python’s dynamic typing and interpreted nature make it well suited to prototyping and implementing higher-level systems; moreover, the ability to extend Python with C++ code makes it practical to achieve high performance. It also allows newer versions of the UDT libraries to be relinked without recoding the interface. Python’s high-level web framework Django is used to model the RESTful web service for large data transfer, and HTTP methods such as GET, PUT, POST and DELETE are used throughout the framework. A user-friendly web interface was built to allow easy uploading and downloading of large files across multiple terminals. Users can simply log in to the portal and submit their files, while others can view those files and request to download them via a secure channel using UDT as the backend transport network.

IV. PRELIMINARY RESULTS

We conducted experiments on a simple sequential orchestration of two RESTful Web Services. To simulate different network environments, we used two different setups for the RESTful Web Services (see Table 1):

1. A direct connection between two servers, each hosting a RESTful service, using a 10/100 Ethernet LAN cable.

2. A wireless connection between two servers, each hosting a RESTful service, via a wireless router.

Table 1: Comparison of timings from experiment of data transfers between services via FOREST and traditional REST

In the 10/100 Ethernet use case, TCP achieved 20 to 30 Mbps, while the FOREST approach sustained a stable 100 to 105 Mbps. A slow start was experienced at the beginning, but this was expected. In the wireless use case, the UDT-based FOREST approach averaged about 4-5 Mbps, whereas TCP achieved 1-2 Mbps. As shown in Table 1, FOREST reduced the original RESTful transfer times by up to 80.5% in the Ethernet use case and by up to 67.4% in the wireless use case.

Therefore, both sets of data indicate that FOREST is a more efficient technique that can enable significant time savings for large data transfers in RESTful orchestrations. This set of promising results also paves the way for the following future work and necessary validations.
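The percentage reductions and approximate throughputs quoted in this section can be recomputed from the Table 1 timings, as in the sketch below. The timings are transcribed from Table 1 and converted to seconds; the function names are ours, and the throughput helper assumes 1 MB = 10^6 bytes and 1 Mbps = 10^6 bits per second.

```python
# (REST seconds, FOREST seconds) per connection type and file size, from Table 1.
TABLE_1 = {
    ("10/100 Ethernet", "103.8 MB"): (41, 8),
    ("10/100 Ethernet", "1.01 GB"): (6 * 60 + 8, 1 * 60 + 21),
    ("10/100 Ethernet", "3.03 GB"): (18 * 60 + 23, 3 * 60 + 53),
    ("Wireless LAN", "103.8 MB"): (9 * 60 + 23, 3 * 60 + 7),
    ("Wireless LAN", "1.01 GB"): (2 * 3600 + 2 * 60 + 51, 42 * 60 + 11),
    # The last FOREST timing is printed as 1h58m96s, i.e. 7176 seconds.
    ("Wireless LAN", "3.03 GB"): (6 * 3600 + 7 * 60 + 7, 1 * 3600 + 58 * 60 + 96),
}


def percent_reduced(rest_seconds: float, forest_seconds: float) -> float:
    """Reduction in transfer time relative to the REST baseline, in percent."""
    return 100.0 * (rest_seconds - forest_seconds) / rest_seconds


def throughput_mbps(size_megabytes: float, seconds: float) -> float:
    """Average throughput, assuming decimal megabytes and megabits."""
    return size_megabytes * 8 / seconds


reductions = {
    key: round(percent_reduced(rest, forest), 1)
    for key, (rest, forest) in TABLE_1.items()
}
for key, value in reductions.items():
    print(key, f"{value}%")

# Average throughput for the 103.8 MB file over Ethernet.
forest_ethernet_mbps = throughput_mbps(103.8, 8)
rest_ethernet_mbps = throughput_mbps(103.8, 41)
```

Under these assumptions the recomputed reductions match the percentages listed in Table 1, and the Ethernet throughputs for the smallest file fall within the ranges reported in the text.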

V. FUTURE WORK

Given the success of the initial experiments, we must now ensure that the benefits in efficiency and transfer rates can be replicated in real production environments and large-scale data-intensive workflows. We will embark on (1) benchmarking and testing for scale, (2) testing with extremely large data sets (in the terabyte and petabyte range), and (3) addressing security aspects in FOREST.

VI. CONCLUDING REMARKS

In this paper, we presented the approach and promising initial results of Fast Optimal REST (FOREST), a technique that reduces the transfer times of (large) data between RESTful services. The approach addresses the main bottleneck of REST data transfers: REST’s tight coupling to, and dependency on, HTTP and ultimately TCP-based transfers.

Our non-invasive approach is a hybridization of iproxy (for encapsulation and decapsulation) and UDT (for fast transfer), a UDP-based data transfer protocol that efficiently upholds the integrity associated with connection-based approaches. Potential timer conflicts between the different protocols were also managed and found to be a non-issue. On their own, neither iproxy nor UDT is able to handle large data transfers for RESTful services.

Results from our initial experiments show promise, reducing the original RESTful TCP transfer times for large data sets by 66 to 80%. This is ongoing research, and we believe it is a first step towards solving the large data transfer bottleneck faced by RESTful services, a pressing problem given the current rise of Big Data analytics and cloud computing.

REFERENCES

[1] Grossman, R., Gu, Y., Hong, X., Antony, A., Blom, J., Dijkstra, F., de Laat, C. 2005. Teraflows over Gigabit WANs with UDT, Journal of Future Generation Computer Systems - Special issue: High-speed networks and services for data-intensive grids: The DataTAG project, 21, 4 (April 2005), Elsevier Science Publishers B. V., Amsterdam, The Netherlands. DOI= http://dx.doi.org/10.1016/j.future.2004.10.007

[2] Gu, Y., Grossman, R. 2004. UDT: A Transport Protocol for Data Intensive Applications, Internet Engineering Task Force (IETF), Internet Draft. Available at: http://tools.ietf.org/html/draft-gg-udt-00

[3] Gu, Y., Hong, X., Grossman, R. 2004. Experiences in Design and Implementation of a High Performance Transport Protocol, Supercomputing 2004, Proceedings of the ACM/IEEE SC2004 Conference, 06-12 Nov. 2004. DOI= http://dx.doi.org/10.1109/SC.2004.24

[4] Horms, S. 2002. iproxy: Running TCP services over UDP, linux.conf.au, 6-9 February 2002, University of Queensland, Brisbane, Queensland, Australia. Available at: http://horms.net/projects/iproxy/iproxy_paper/stuff/iproxy_paper.pdf

[5] Chen, D. 2000. Overview of TCP structure and operations, Linux Based Environment for Experimental Evaluation of Standard TCP Algorithms. Available at: https://dbsgrad.sfsu.edu/~mmurphy/dachen/documents/Overview.htm

Connection       Protocol  Size of file       Time taken   % reduced from REST
10/100 Ethernet  REST      103.8 MB (Small)   41s          0
10/100 Ethernet  REST      1.01 GB (Medium)   6m 8s        0
10/100 Ethernet  REST      3.03 GB (Large)    18m 23s      0
10/100 Ethernet  FOREST    103.8 MB (Small)   8s           80.5%
10/100 Ethernet  FOREST    1.01 GB (Medium)   1m 21s       78.0%
10/100 Ethernet  FOREST    3.03 GB (Large)    3m 53s       78.9%
Wireless LAN     REST      103.8 MB (Small)   9m 23s       100
Wireless LAN     REST      1.01 GB (Medium)   2h 2m 51s    100
Wireless LAN     REST      3.03 GB (Large)    6h 7m 7s     100
Wireless LAN     FOREST    103.8 MB (Small)   3m 07s       66.8%
Wireless LAN     FOREST    1.01 GB (Medium)   42m 11s      65.7%
Wireless LAN     FOREST    3.03 GB (Large)    1h 59m 36s   67.4%
