1. BDE Storage

1.1. System Overview

The BDE storage agent (SA) is the storage-related component of BigData Express (BDE). It manages both the local and the shared storage systems that DTNs use for data transfer. The former is typically dedicated to a single node, while the latter is reached over network links and is often shared by a number of DTN clients as well as by non-data-transfer tasks (such as HPC jobs). The BDE storage design considers both scenarios. To facilitate high-performance and predictable data transfer, the BDE storage agent monitors and estimates the status of local and remote storage systems. SA maintains a messaging channel to the BDE server to facilitate scheduling of end-to-end data transfers over the network.

The overall architecture of SA is shown in Figure 1. SA has two main functions: 1) SA periodically profiles and estimates the throughput and capacity of storage devices (blue dotted line in Figure 1). We consider both the case where performance statistics are collected and exposed by the file system (e.g., Lustre, XFS) and the case where performance statistics are not directly available (e.g., XT3, XT4). For the latter, our methodology is to periodically probe the storage performance to gauge the storage load. In conjunction with a longer-term (e.g., week-long), one-time performance log captured in advance, this hybrid approach can capture the storage workload characteristics without an intrusive impact on the file systems. 2) SA handles commands from the BDE server (dark solid line in Figure 1) to perform certain actions, e.g., performance estimation, and reports status back to the BDE server.

Figure 1: BDE storage architecture

1.2. System Design

1. SA flow chart

Figure 2 illustrates the high-level flow chart of SA:

1. Specify the storage configuration in the JSON file, and initiate an SA instance.

2. SA connects to the RabbitMQ and MongoDB services based on the specified service parameters.

3. SA spawns a new pthread to wait for commands from the BDE server. Commands can be processed synchronously or asynchronously (see the startup sketch after Figure 2).

Figure 2: SA flow chart
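The startup steps above can be summarized in code. Below is a minimal sketch, with placeholder helpers standing in for the actual JSON parsing and the RabbitMQ/MongoDB client calls; it is illustrative only, not the SA implementation.

// Minimal sketch of the SA startup flow described above. The helper functions
// are placeholders: a real agent would use the RabbitMQ and MongoDB client
// libraries with the parameters from the JSON configuration file.
#include <pthread.h>
#include <cstdio>
#include <string>

struct SAConfig {
    std::string agent_name;
    std::string agent_type;   // "local" or "shared"
    // ... RabbitMQ/MongoDB parameters as in the sample configuration file
};

// Step 1: parse the JSON configuration (placeholder body).
static SAConfig load_config(const std::string& path) {
    (void)path;
    return SAConfig{"bde1", "local"};
}

// Step 2: connect to RabbitMQ and MongoDB (placeholder bodies).
static bool connect_rabbitmq(const SAConfig&) { return true; }
static bool connect_mongodb(const SAConfig&)  { return true; }

// Step 3: thread body that waits for BDE server commands and processes them,
// synchronously or asynchronously depending on the command.
static void* command_loop(void* arg) {
    (void)arg;
    return nullptr;
}

int main(int argc, char** argv) {
    SAConfig cfg = load_config(argc > 1 ? argv[1] : "local_agent_config.json");
    if (!connect_rabbitmq(cfg) || !connect_mongodb(cfg)) {
        std::fprintf(stderr, "failed to connect to RabbitMQ/MongoDB\n");
        return 1;
    }
    pthread_t cmd_thread;
    pthread_create(&cmd_thread, nullptr, command_loop, &cfg);
    pthread_join(cmd_thread, nullptr);
    return 0;
}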

2. Sample JSON Configuration File

The BDE storage agent uses a JSON configuration file that lets system administrators set and adjust storage agent parameters, for example the storage agent type (local or shared) and the BDE server address and port. Here is a sample JSON configuration file:

{
  "agent_name": "bde1",
  // the type of storage agent, "local" or "shared"
  "agent_type": "local",
  "rmq": {
    "rmq_host": "yosemite.fnal.gov",
    "rmq_port": 5671,
    "rmq_cacert": "/home/bde/BDE/bigdata-express-storage/cacert.pem.yosemite"
  },
  "mongodb": {
    "db_host": "yosemite.fnal.gov",
    "db_port": 27017,
    "db_name": "bde",
    "db_cacert": "",
    "db_username": "bde",
    "db_password": "bde"
  },
  "disk_device": "sda",
  "iozone": {
    "iozone_path": "/home/bde/BDE/bigdata-express-storage/install/bin/iozone",
    "iozone_test_dir": ""   // the dir where iozone places read/write test files
  }
}

3. SA Registration Message

As soon as SA starts, it connects to the MongoDB service and registers itself:

storage: Storage node registration
{
  id: string,        # unique storage id
  name: string,      # storage name (for display)
  queue: string,     # RabbitMQ queue name
  type: string,      # storage type, either "local" or "shared"
  max_bw: float,     # maximum I/O bandwidth, in KB/s
  timestamp: int     # timestamp when the registration happens
}
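As a concrete illustration, such a registration document could be assembled with JsonCpp along the following lines. All field values here are hypothetical examples, and the actual MongoDB insertion performed by SA is omitted.

// Sketch: build a registration document matching the schema above.
#include <json/json.h>
#include <ctime>
#include <iostream>

Json::Value make_registration() {
    Json::Value reg;
    reg["id"]        = "bde1-storage";          // unique storage id (hypothetical)
    reg["name"]      = "bde1 local storage";    // display name (hypothetical)
    reg["queue"]     = "bde1_storage_queue";    // RabbitMQ queue name (hypothetical)
    reg["type"]      = "local";                 // "local" or "shared"
    reg["max_bw"]    = 1048576.0;               // maximum I/O bandwidth in KB/s (example value)
    reg["timestamp"] = static_cast<Json::Int64>(std::time(nullptr));
    return reg;
}

int main() {
    // In SA this document would be inserted into MongoDB; here we just print it.
    std::cout << make_registration().toStyledString() << std::endl;
    return 0;
}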

4. Command Message

Below is a list of the command messages used for communication between the BDE server and SA. Each command is composed of a request and a response field. For example, when SA receives the get_estimation_result command, it populates the response field with the most recent estimation and sends it back to the BDE server. (An illustrative dispatch sketch follows the command list.)

1. estimate:

Request the storage agent to start a storage bandwidth estimation.

request:
{
  cmd: "estimate"
  params: { window: int }
}

response:
{
  # The storage agent replies with an id immediately and executes the bandwidth
  # capacity estimation asynchronously. After the estimation finishes, the BDE
  # server can look up the estimation result in MongoDB with this id.
  # estimate_id format: "agent_name" + "_" + "estimate" + "_" + "counter",
  # for example "bde1_estimate_1".
  estimate_id: string,
  code: int    # 0 means success, 1 means fail
}

2. get_estimate_state:

request:{cmd: "get_estimate_state"params: {}}response:{code: int# code = 0 means thread ends. code = 1 means thread is alive.}

3. stop_estimate:

request:{cmd: "stop_estimate"params: {}}response:{code: int# code = 0 stop successfully, other means error}

4. get_estimation_result:

request:{cmd: "get_estimation_result"params: { }}response:{code: intid: string, # storage idtimestamp: int, # timestamp when the estimation happensmax_bw: float # KB/s...}

5. get_status:

Get the immediate status report.

request:
{
  cmd: "get_status"
  params: {}
}

response:
{
  code: int
  read_bw: float     # current read bandwidth, in KB/s
  write_bw: float    # current write bandwidth, in KB/s
}
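As referenced above, the following is a minimal sketch of how an agent might dispatch these commands. It is illustrative only: the response values are made-up examples, and the real SA runs the estimation asynchronously in a separate thread rather than filling in constants.

// Sketch: dispatch the command messages listed above and build a JSON response.
#include <json/json.h>
#include <iostream>
#include <string>

Json::Value handle_command(const Json::Value& request) {
    const std::string cmd = request["cmd"].asString();
    Json::Value response;
    if (cmd == "estimate") {
        // Start an asynchronous bandwidth estimation and reply immediately.
        response["estimate_id"] = "bde1_estimate_1";   // "agent_name"_"estimate"_"counter"
        response["code"] = 0;                          // 0 = success, 1 = fail
    } else if (cmd == "get_estimate_state") {
        response["code"] = 1;                          // 1 = estimation thread still alive
    } else if (cmd == "stop_estimate") {
        response["code"] = 0;                          // 0 = stopped successfully
    } else if (cmd == "get_estimation_result") {
        response["code"] = 0;
        response["id"] = "bde1-storage";               // storage id (hypothetical)
        response["timestamp"] = 1489416331;            // example timestamp
        response["max_bw"] = 1048576.0;                // KB/s (example value)
    } else if (cmd == "get_status") {
        response["code"] = 0;
        response["read_bw"]  = 250000.0;               // current read bandwidth, KB/s (example)
        response["write_bw"] = 180000.0;               // current write bandwidth, KB/s (example)
    } else {
        response["code"] = 1;                          // unknown command
    }
    return response;
}

int main() {
    Json::Value req;
    req["cmd"] = "get_status";
    std::cout << handle_command(req).toStyledString();
    return 0;
}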

5. Storage Estimation

The key challenge in the BDE SA design is the storage estimation mechanism. The root cause is that, unlike in SDN, there is no explicit QoS mechanism in the storage systems at DOE computing and data centers. In particular, if the storage system is shared by multiple systems, e.g., DTNs and other compute clusters, there will be I/O traffic that is completely outside the control of the BDE server. In this case, SA cannot accurately report the file system load to the BDE server because it is not aware of all the I/O traffic occurring in the storage system. We aim to solve this problem by predicting the storage load from a combination of long-term and short-term storage performance characteristics. The intuitions are that 1) storage loads at large computing centers exhibit long-term characteristics that can guide the prediction; for example, if the storage system is shared with HPC systems, it will exhibit bursty traffic due to periodic checkpoint/restart I/O, and we anticipate that the storage load during the day is higher than at night due to the additional load from code compilation and interactive data analysis by users; and 2) we need to take advantage of short-term storage performance characteristics to capture transient behavior, such as transient I/O spikes.

The long-term and short-term storage characteristics can be captured in two ways:

· They can be obtained directly from the file system statistics. For example, in Lustre, performance data is offered under the /proc/fs/lustre/ directory on a per-storage-target basis as well as a per-partition basis. Below is a sample of I/O statistics at a given timestamp.

snapshot_time 1489416331.300873 secs.usecs
req_waittime 385839 samples [usec] 54 35074474 4824456722 29488479371845124
req_active 385839 samples [reqs] 1 125 458249 2689305
read_bytes 60018 samples [bytes] 0 1048576 41776455844 43393523956120020
write_bytes 76050 samples [bytes] 4 1048576 78198908904 81824500299661100
ost_setattr 83595 samples [usec] 72 120221 26280402 123084230356
ost_read 60018 samples [usec] 116 826529 601715500 30569666128858
ost_write 76050 samples [usec] 359 1301603 481516308 33144037859630
ost_connect 1 samples [usec] 217793 217793 217793 47433790849
ost_punch 567 samples [usec] 111 10686 203257 590632809
ost_statfs 51307 samples [usec] 3296 591975 897293373 43944791865651
ost_sync 24 samples [usec] 145 1237 9015 5094331
ldlm_cancel 35645 samples [usec] 61 35074474 2784595864 29379148691771844
obd_ping 420 samples [usec] 95 108385 2076795 59817972281

We can use the read_bytes and write_bytes counters to quantitatively capture the file system load at a given time.
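For illustration, the sketch below shows one way to turn these cumulative counters into a throughput estimate: sample the sum column of read_bytes/write_bytes twice and divide the difference by the sampling interval. The stats file path is a placeholder (the actual location depends on the Lustre target), and this is a sketch rather than the SA implementation.

// Sketch: estimate read/write throughput from the cumulative byte counters
// in a Lustre stats file by differencing two samples.
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Return the cumulative "sum" column for a counter (e.g., "read_bytes").
long long read_counter_sum(const std::string& stats_path, const std::string& counter) {
    std::ifstream ifs(stats_path);
    std::string line;
    while (std::getline(ifs, line)) {
        std::istringstream iss(line);
        std::string name, samples_word, unit;
        long long count = 0, min = 0, max = 0, sum = 0;
        if (iss >> name >> count >> samples_word >> unit >> min >> max >> sum && name == counter)
            return sum;
    }
    return -1;  // counter not found or stats file unavailable
}

int main() {
    // Placeholder path; the real path depends on the Lustre target layout.
    const std::string stats = "/proc/fs/lustre/obdfilter/<target>/stats";
    const int interval_sec = 5;
    long long r0 = read_counter_sum(stats, "read_bytes");
    long long w0 = read_counter_sum(stats, "write_bytes");
    std::this_thread::sleep_for(std::chrono::seconds(interval_sec));
    long long r1 = read_counter_sum(stats, "read_bytes");
    long long w1 = read_counter_sum(stats, "write_bytes");
    if (r0 < 0 || w0 < 0 || r1 < 0 || w1 < 0) { std::cerr << "stats unavailable\n"; return 1; }
    std::cout << "read:  " << (r1 - r0) / 1024.0 / interval_sec << " KB/s\n"
              << "write: " << (w1 - w0) / 1024.0 / interval_sec << " KB/s\n";
    return 0;
}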

· Storage statistics are not available from the file system. In this case, BDE SA uses small messages to actively probe the file system to obtain its throughput. A key difficulty is that the probes cannot be too intrusive, or they will impact data transfer jobs. To this end, we plan to design adaptive probes, so that when the file system load increases, we can adjust the probe size or the probe frequency to reduce the impact on the system load.
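The following sketch shows one possible shape for such an adaptive probe. The load threshold, probe sizes, and intervals are arbitrary illustration values rather than tuned BDE parameters, and probe_bandwidth_kb_per_s() is a placeholder for a real small read/write probe (e.g., a short iozone run).

// Sketch: adapt probe size and probing interval to the observed storage load.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <thread>

// Placeholder probe: write/read probe_kb of data and return the observed
// bandwidth in KB/s. In SA this role is played by short iozone runs.
double probe_bandwidth_kb_per_s(int probe_kb) {
    (void)probe_kb;
    return 500000.0;  // dummy value for the sketch
}

int main() {
    int    probe_kb     = 64 * 1024;   // start with a 64 MB probe (illustrative)
    int    interval_sec = 30;          // and a 30 s probing interval (illustrative)
    double max_bw       = 1000000.0;   // previously measured maximum bandwidth, KB/s

    for (int i = 0; i < 5; ++i) {      // a few iterations for illustration
        double bw   = probe_bandwidth_kb_per_s(probe_kb);
        double load = 1.0 - bw / max_bw;  // crude load estimate: low probe bandwidth => busy system
        if (load > 0.7) {
            // System looks busy: shrink the probe and probe less often to stay unobtrusive.
            probe_kb     = std::max(probe_kb / 2, 4 * 1024);
            interval_sec = std::min(interval_sec * 2, 300);
        } else {
            // System looks idle: larger, more frequent probes give better estimates.
            probe_kb     = std::min(probe_kb * 2, 256 * 1024);
            interval_sec = std::max(interval_sec / 2, 10);
        }
        std::cout << "probe=" << probe_kb << " KB, interval=" << interval_sec
                  << " s, observed=" << bw << " KB/s\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));  // shortened sleep for the sketch
    }
    return 0;
}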

1.3. Implementation

The local storage agent (LSA) is implemented in C++ using a number of third-party libraries. It uses the JsonCpp library to parse the JSON configuration file and the JSON messages from the BDE server, and it uses the capabilities provided by MongoDB and RabbitMQ to interact with the BDE server. In addition, multi-threading is used to handle messages and commands with higher concurrency.

Local Storage Agent API

1. do_init

Parse the configuration passed in JSON format, and conduct initial iozone tests to determine the maximum potential I/O bandwidth of the given storage devices.

2. do_registration

Register the configuration information and the initial test results in MongoDB.

3. do_command

Receive a JSON command message from the RabbitMQ server, parse it, and launch tests accordingly.

4. get_usage

Use the external df command to get the usage status of the given storage devices.

5. get_realtime_bandwidth

Use the external iostat command to get the current read/write bandwidth of the given storage devices.

6. get_potential_bandwidth

Use the iozone toolkit to measure the maximum potential I/O bandwidth of the given storage devices.
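Taken together, the API above suggests an interface roughly like the following declaration. This is an illustrative sketch only; the actual member signatures and return types in the LSA source may differ.

// Illustrative declaration of the Local Storage Agent interface described above
// (declaration only; member definitions omitted).
#include <json/json.h>
#include <string>

class LocalStorageAgent {
public:
    // Parse the JSON configuration and run initial iozone tests to find the
    // maximum potential I/O bandwidth of the configured storage device.
    bool do_init(const std::string& config_path);

    // Register the configuration and initial test results in MongoDB.
    bool do_registration();

    // Receive a JSON command message from RabbitMQ, parse it, launch the
    // corresponding test, and return the JSON response for the BDE server.
    Json::Value do_command(const Json::Value& request);

    // Query the external df command for the usage of the configured device.
    double get_usage();

    // Query the external iostat command for the current read/write bandwidth.
    void get_realtime_bandwidth(double& read_kb_per_s, double& write_kb_per_s);

    // Run iozone to measure the maximum potential I/O bandwidth (KB/s).
    double get_potential_bandwidth();

private:
    Json::Value config_;        // parsed configuration
    double      max_bw_ = 0.0;  // measured maximum bandwidth, KB/s
};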

1.4. Install Storage Agent

1. Install and configure Kerberos on BDE nodes

2. Download the source code from BDE repository via git

· git clone ssh://[email protected]/cvs/projects/bigdata-express-storage

3. Install third party modules:

· Run the bootstrap_libraries.sh script in the bigdata-express-storage directory to download all third-party libraries, such as the MongoDB and RabbitMQ libraries.

· Install iozone

wget http://www.iozone.org/src/current/iozone-3-465.i386.rpm
rpm -ivh /path/to/iozone-3-465.i386.rpm

4. Build storage agent

· Run “cmake .” in the bigdata-express-storage directory to generate the Makefile from CMakeLists.txt

· Run “make”; the executable local_storage_agent_d will be generated under bigdata-express-storage/src

5. Update the JSON configuration file

· Specify the pathnames of iozone and the storage agent in local_agent_config.json.

Details of the JSON storage configuration can be found at https://cdcvs.fnal.gov/redmine/projects/bigdata-express/wiki/Local_Storage_Agent_Configure_File

