DOI: 10.23883/IJRTER.2017.3180.LWA51
PARALLEL DATA COMPRESSION IN CLINICAL DATA WITH
CHUNKING ALGORITHM ON A CLOUD
Ms. R. Hemavathy1, Mrs. G. Sangeethalakshmi2, Ms. A. Anitha3
1Research Scholar, Dept of Computer Science and Applications, D.K.M College for Women (Autonomous), Vellore, Tamilnadu, India
2Assistant Professor, Dept of Computer Science and Applications, D.K.M College for Women (Autonomous), Vellore, Tamilnadu, India
3Research Scholar, Dept of Computer Science and Applications, D.K.M College for Women (Autonomous), Vellore, Tamilnadu, India
Abstract - The emergence of massive clinical datasets presents both challenges and opportunities in data storage and analysis. Advances in information and communication technology offer the most viable solutions to big data storage and analysis in terms of efficiency and scalability. As data progressively grows within data centers, cloud storage systems continuously face challenges in saving storage capacity and in providing the capabilities necessary to move big data within an acceptable time frame. The storage pressure on cloud storage systems caused by the explosive growth of data is increasing by the day; in particular, a vast amount of redundant data wastes a great deal of storage space. Data deduplication can effectively reduce the size of data by eliminating redundant data in storage systems, and it therefore plays a major role in cloud data storage. Current big sensing clinical data processing on Cloud has adopted some data compression techniques. However, due to the high volume and velocity of big sensing data, traditional data compression techniques lack sufficient efficiency and scalability. Instead of compressing basic data units, the compression is conducted over partitioned data chunks. In the deduplication technology, data are broken down into multiple pieces called "chunks". The chunking algorithm uses a new parallel processing framework; the Two Thresholds Two Divisors (TTTD) algorithm is used as the chunking mechanism and controls the variation of the chunk size. It is vital that big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data. To restore the original data sets, restoration functions and predictions will be designed. Successful restoration of Electronic Health Records helps improve patient safety and quality of care.
Key Words: Deduplication, Two Thresholds Two Divisors (TTTD)
I. INTRODUCTION
Big sensing data from very different types of sensing systems (e.g. healthcare, video, satellite, meteorology, earthquake monitoring, traffic tracking, advanced physics simulations, genomics, biological studies, and so on) are highly heterogeneous and exhibit the typical traits of real-world data, the four 'V's: Volume, Variety, Velocity and Veracity. To overcome the system issues arising from the four 'V's of big sensing data, the development of big data processing on Cloud is becoming increasingly popular. Cloud computing offers a promising platform for large-scale processing with its powerful computation capability, storage, scalability, resource utilization and low cost, and has attracted significant attention in connection with big data.
To reduce the time and space cost for big data, particularly big sensing data processing on Cloud, different techniques have been proposed and developed. However, because of the size and speed of big sensing data in the real world, the existing data compression and reduction techniques still need to be improved. It has been well recognized that big sensing data or big data sets from mesh
networks such as sensor systems and social networks can take the form of big graph data. To process such big graph data, current techniques normally introduce complex and multiple iterations.
Healthcare is one of the most important areas for developing and developed countries in sustaining their valuable human resources. Today the healthcare industry is flooded with huge quantities of data that require validation and proper analysis, and big data analytics can play a major role in processing and analyzing healthcare data in various forms to deliver appropriate applications. India has the second largest population in the world, and this growing population over-burdens the healthcare infrastructure within the country. The exponential growth of data over the last decade has introduced a new domain in the field of information technology referred to as big data. Here we propose applying MapReduce to India's big medical data, in order to further improve data size reduction, cut the processing time cost and remove the iterations involved in processing big sensing data. Instead of compressing basic data units, the compression is conducted over partitioned data chunks. In the deduplication technology, data are broken down into multiple pieces referred to as "chunks".
The chunking algorithm uses a new parallel processing framework; the Two Thresholds Two Divisors (TTTD) algorithm is used as the chunking mechanism and controls the variation of the chunk size. It is vital that big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data, especially streaming big sensing data on Cloud. With this technique, the big sensing data stream is first filtered to form standard data chunks based on our predefined similarity model. Then, the incoming sensing data stream is compressed according to the generated standard data chunks. With this data compression we aim to improve compression efficiency by avoiding traditional compression based on each data unit, which is space and time costly due to low-level data traversal and manipulation. At the same time, because the compression happens at a higher data chunk level, it reduces the chance of introducing excessive iteration and recursion, which prove to be the main trouble in processing big graph data.
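The following is a minimal Python sketch of this chunk-referencing idea, not the authors' implementation: the fixed chunk size, the byte-level Jaccard similarity and the 0.9 threshold are illustrative stand-ins for the predefined similarity model and the TTTD chunking described later.

import hashlib

CHUNK_SIZE = 4096           # illustrative fixed size; TTTD produces variable-sized chunks
SIMILARITY_THRESHOLD = 0.9  # illustrative threshold for the similarity model

def byte_similarity(a: bytes, b: bytes) -> float:
    # Crude byte-level Jaccard similarity used as a stand-in similarity model.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def compress_stream(stream: bytes):
    # Replace chunks that resemble an already-seen standard chunk with a reference.
    standard_chunks = {}   # fingerprint -> standard chunk bytes
    output = []            # mix of ("ref", fingerprint) and ("raw", bytes)
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        match = next((fp for fp, std in standard_chunks.items()
                      if byte_similarity(chunk, std) >= SIMILARITY_THRESHOLD), None)
        if match is not None:
            output.append(("ref", match))      # store only a reference
        else:
            fp = hashlib.sha1(chunk).hexdigest()
            standard_chunks[fp] = chunk        # register a new standard chunk
            output.append(("raw", chunk))
    return output, standard_chunks

An incoming chunk that is sufficiently similar to an already stored standard chunk is replaced by a short reference, which is where the space saving comes from.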
II. RELATED WORK
2.1 Healthcare In India
Although the government has promised to introduce digitization for maintaining medical records, this is not yet the reality. The country does not even have standardization of common medical terminologies. In big data processing the data must be processed in a distributed environment. Analyzing data such as medical information requires statistical and mining approaches, and delivering the data with a fast response time is a high priority.
2.2 Data Compression In Cloud Computing
Earlier work designed the cloud environment based on the captured requirements and presented its implementation on Amazon Web Services. It also presents an experiment running the MapReduce system in a cloud environment to validate the proposed framework, and evaluates the experiment against criteria such as processing speed, data-storage usage, latency and cost efficiency.
2.3 Analysis Of Big Data Compression
Paper | Feature | Advantages | Disadvantages
Big Data Processing in Cloud Environments | Big data processing techniques from system and application aspects | MapReduce optimization strategies and applications | Grid and cloud computing have both intended to access large amounts of computing power by aggregating resources.
A survey of large scale data management approaches in cloud environments | Mechanisms of deploying data-intensive applications in the cloud | Economical processing of large scale data on the cloud is provided. | The latency gap between multi-core CPUs and mechanical hard disks is growing every year, which makes the challenges of data-intensive computing harder to overcome.
A platform for scalable one-pass analytics using MapReduce | MapReduce model extended to incremental one-pass analytics | Magnitude reduction of internal data spills | Long latencies make it unsuitable for producing incremental results.
Stream as you go: The case for incremental data access and processing in the cloud | Processing data based on a stream data management architecture | MapReduce-based DNA sequence analysis application | Long-term storage of complete data sets is an explicit requirement.
Very fast estimation for result and accuracy of big data analytics: The EARL system | Approximate results based on samples | Response times much shorter than those obtained in the actual computations | Error estimation is required to determine if node recovery is necessary.
2.4 Existing System
• In the existing system, to process big graph data, current techniques normally introduce complex and multiple iterations.
• Iterations and recursive algorithms may cause computation problems such as parallel memory bottlenecks, deadlocks in data access, and algorithm inefficiency.
• Even with Cloud platform, the task of big data processing may introduce unacceptable time
cost, or even lead to processing failures.
2.5 Drawback
• Existing approaches mainly concentrate on static scenarios such as backup and archive systems, and are not suitable for cloud storage systems due to the dynamic nature of the data.
• Current techniques normally introduce complex and multiple iterations. Iterations and
recursive algorithms may cause computation problems such as parallel memory bottlenecks,
deadlocks on data accessing, algorithm inefficiency.
• It may introduce unacceptable time cost, or even lead to processing failures.
2.6 Proposed Algorithm
2.6.1 Two Thresholds Two Divisors (TTTD) Chunking
• Our proposed system processes big data on the cloud using a compression technique based on the Huffman algorithm, overcoming some drawbacks of the MapReduce approach used in the existing system.
• The data to be stored on the cloud after compression will remain available in its original size on the local server.
• At processing time this data on the local server is divided into chunks so that it can be processed in parallel and more quickly (a hedged sketch of this step follows this list).
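Below is a minimal sketch of the per-chunk parallel compression step, assuming fixed-size chunks for brevity; zlib's DEFLATE (which uses Huffman coding internally) stands in for a pure Huffman coder, and the worker count is illustrative.

import zlib
from concurrent.futures import ProcessPoolExecutor

def split_into_chunks(data: bytes, chunk_size: int = 64 * 1024):
    # Fixed-size splitting used only for brevity; TTTD (Section VI) produces
    # variable-sized, content-defined chunks instead.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def compress_chunk(chunk: bytes) -> bytes:
    # DEFLATE includes Huffman coding; it stands in here for the paper's
    # Huffman-based compressor applied to a single chunk.
    return zlib.compress(chunk, level=6)

def parallel_compress(data: bytes, workers: int = 4):
    chunks = split_into_chunks(data)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(compress_chunk, chunks))
    ratio = sum(len(c) for c in compressed) / max(len(data), 1)
    return compressed, ratio   # ratio could drive the compression level bar

The returned ratio indicates to what extent the file size has been reduced, which corresponds to the compression level indicator mentioned in Section VI.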
2.6.2 Advantages Of Proposed System
• The compression happens at a higher data chunk level.
• It reduces the chance of introducing excessive iteration and recursion, which prove to be the main trouble in processing big graph data.
III. SYSTEM DESIGN
Architecture Diagram
Figure1. Architecture
IV. MODULES
• Similarity Model
• Data Chunk Generation and Formation
• Data Compression
• Distributed Storage Network
• Data driven scheduling on Cloud
4.1 Similarity Model
• Currently, five types of models are commonly used: the common element approach, template models, geometric models, feature models and Geon theory.
• The proposed models below are related to the geometric model and the common element approach, for numerical data and text data respectively.
• Our similarity models work on two types of data sets, multidimensional numerical data and text data (an illustrative sketch follows this list).
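As an illustration only (the paper does not give the exact formulas), the two similarity models might be realized as a Euclidean-distance measure for multidimensional numerical records and a Jaccard measure for tokenized text; the functions below are assumptions, not the authors' definitions.

import math

def numeric_similarity(x, y):
    # Geometric-model style similarity: inverse Euclidean distance between
    # two equal-length numerical records (e.g. vital-sign vectors).
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + dist)   # maps distance into (0, 1]

def text_similarity(s, t):
    # Common-element style similarity: Jaccard overlap of word sets.
    ws, wt = set(s.lower().split()), set(t.lower().split())
    return len(ws & wt) / len(ws | wt) if ws | wt else 1.0

if __name__ == "__main__":
    print(numeric_similarity([98.6, 72.0, 120.0], [98.4, 72.5, 119.0]))
    print(text_similarity("acute chest pain", "chest pain acute onset"))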
4.2 Data Chunk Generation and Formation
• In the problem analysis, we have introduced the essential idea of data chunk based
compression.
• Under this scheme, the data are not compressed by encoding or data prediction one unit at a time; the approach is similar to frequent-component compression.
• The difference is that frequent-component compression recognizes only simple data units, whereas our data chunk based compression recognizes complex data partitions and patterns during the compression process.
4.3 Data Compression
• MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
• Computational processing can occur on data stored either in a file system (unstructured) or in
a database (structured).
• MapReduce can take advantage of locality of data, processing data on or near the storage
assets to reduce data transmission.
• "Map" function: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
• "Reduce" function: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve (a toy example follows this list).
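As a toy illustration of the map and reduce roles just described (not the paper's actual job), the sketch below counts chunk fingerprints with a map step and a reduce step; the SHA-1 fingerprinting is an assumption used only to spot redundant chunks.

import hashlib
from collections import defaultdict

def map_phase(chunks):
    # Map: each worker emits (fingerprint, 1) for every chunk it receives.
    return [(hashlib.sha1(c).hexdigest(), 1) for c in chunks]

def reduce_phase(pairs):
    # Reduce: sum the counts per fingerprint to find duplicated chunks.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

if __name__ == "__main__":
    chunks = [b"reading-A", b"reading-B", b"reading-A"]   # "reading-A" is redundant
    print(reduce_phase(map_phase(chunks)))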
4.4 Distributed storage network
• Distributed networking is a distributed computing network system, said to be "distributed" when the computer programs and the data to be worked on are spread out over more than one computer.
• Prior to the emergence of low-cost desktop computing power, computing was generally centralized on one computer. Although such centers still exist, distributed networking applications and data operate more efficiently over a mix of desktop workstations, local area network servers, regional servers, Web servers and other servers.
4.5 Data driven scheduling on Cloud
• MapReduce and TTTD are the approaches used for reducing the big data size through data compression.
• However, a smaller data size does not necessarily mean a shorter processing time, which is closely related to the task division and workload distribution over Cloud.
• In order to achieve a shorter processing time, we introduce a novel scheduling algorithm based on the compressed data sets.
• For our data driven scheduling and its mapping, two types of mapping strategies, node based mapping and edge based mapping, are introduced first for comparison.
4.6 Mapping for Cloud scheduling with real network nodes
• The most direct and easy way to distribute and schedule the big data processing task over Cloud is based on the real-world topology of the network itself.
• Under this mapping scheme, the mapping algorithm is quite simple, and the computation resources are divided and distributed to each node for simulating and analyzing the data flows in a real-world network.
4.7 Mapping for Cloud scheduling with data exchange edges
• Instead of directly allocating the computation resources over Cloud according to the real
world network topology, the mapping can be also carried out based on the data exchanging
edge between nodes.
• Each edge that carries data flows in the network will be simulated and analyzed with a computation unit from Cloud (a small sketch of both mapping strategies follows this list).
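A minimal sketch of the two mapping strategies, assuming the network is given as an adjacency list; the round-robin assignment of Cloud compute units is an illustrative stand-in for the actual scheduling decision.

from itertools import cycle

# Hypothetical sensor network: node -> neighbours it exchanges data with.
topology = {"A": ["B", "C"], "B": ["C"], "C": []}
compute_units = ["vm-0", "vm-1", "vm-2"]

def node_based_mapping(topology, units):
    # One Cloud compute unit per real network node.
    pool = cycle(units)
    return {node: next(pool) for node in topology}

def edge_based_mapping(topology, units):
    # One Cloud compute unit per data-exchange edge between nodes.
    pool = cycle(units)
    return {(src, dst): next(pool) for src, nbrs in topology.items() for dst in nbrs}

if __name__ == "__main__":
    print(node_based_mapping(topology, compute_units))
    print(edge_based_mapping(topology, compute_units))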
V. DFD
VI. IMPLEMENTATION
6.1 Algorithm
Two Thresholds Two Divisors (TTTD) chunking
• Our proposed system processes big data on the cloud using a compression technique based on the Huffman algorithm, overcoming some drawbacks of the MapReduce approach used in the existing system.
• The data to be stored on the cloud after compression will remain available in its original size on the local server.
• At processing time this data on the local server is divided into chunks so that it can be processed in parallel and more quickly.
• To divide big data into chunks we use the TTTD algorithm. The compression is applied to every single chunk on the cloud server, and a compression level bar shows to what extent the file size has been reduced while storing it on the cloud.
• The TTTD algorithm was proposed by HP Laboratories at Palo Alto, California. This algorithm uses the same idea as the BSW algorithm.
• In addition, the TTTD algorithm uses four parameters, the maximum threshold, the
minimum threshold, the main divisor, and the second divisor, to avoid the problems of the
BSW algorithm.
• The maximum and minimum thresholds are used to eliminate very large-sized and very
small-sized chunks in order to control the variations of chunk-size.
[Data flow diagram: the Data Owner logs in and uploads patient details; the Cloud Server stores encrypted data and maintains patient, data owner and end-user details; the User logs in, views symptoms and patient details, sends a request to the Authority, and downloads the patient file; the Authority performs an authority check and gives permission to the User.]
• The main divisor plays the same role as the BSW algorithm and can be used to make the
chunk-size close to our expected chunk-size.
• Usually, the value of the second divisor is half that of the main divisor. Because of its higher match probability, the second divisor helps the algorithm determine a backup breakpoint for a chunk in case no breakpoint can be found with the main divisor (a sketch of the algorithm follows).
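The following Python sketch illustrates TTTD as described above; it is not the authors' code. The parameter values (minimum 460, maximum 2800, main divisor 540, second divisor 270) are commonly cited defaults for an expected chunk size of roughly 1 KB, and a simple polynomial rolling hash stands in for the Rabin fingerprint of the original HP algorithm.

T_MIN, T_MAX = 460, 2800        # two thresholds on chunk size (bytes)
MAIN_D, SECOND_D = 540, 270     # two divisors; the second is half of the main
WINDOW = 48                     # sliding-window width in bytes (assumed)
BASE, MOD = 257, (1 << 31) - 1
OUT = pow(BASE, WINDOW, MOD)    # weight of the byte leaving the window

def tttd_chunks(data: bytes):
    chunks, start, backup, h = [], 0, -1, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD                 # byte enters the window
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * OUT) % MOD  # byte leaves the window
        length = i - start + 1
        if length < T_MIN:
            continue                                # chunk still too small
        if h % SECOND_D == SECOND_D - 1:
            backup = i                              # remember a backup breakpoint
        if h % MAIN_D == MAIN_D - 1:
            chunks.append(data[start:i + 1])        # main-divisor breakpoint
            start, backup = i + 1, -1
        elif length >= T_MAX:
            cut = backup if backup != -1 else i     # use backup, else hard cut
            chunks.append(data[start:cut + 1])
            start, backup = cut + 1, -1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks

Every emitted chunk except possibly the last is guaranteed to lie between the minimum and maximum thresholds, which is exactly the chunk-size control described in the bullets above.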
Figure2. Chunking data
VII. RESULTS
SCREEN SHOTS
Screenshots (omitted here) cover: home page, owner login, patient details, view patient details, user login, user view of patient details, cloud owner details, cloud patient details, authority key generation, and authority permission granting.
VIII. CONCLUSION AND FUTURE ENHANCEMENT
In this paper, we proposed a novel scalable data compression based on similarity calculation among the partitioned data chunks with Cloud computing. For proper granularity, an effective and efficient chunking algorithm is a must: if the data is chunked accurately, it increases the throughput and the net deduplication performance. The file level chunking method is efficient for deduplicating small files, but less suitable for a big-file or backup environment. The TTTD-S algorithm not only achieves significant improvements in running time and average chunk size, but also obtains better control over the variation of chunk size by reducing the number of large chunks. In the current technique only text documents are considered; in future, image files will also be considered for the data compression technique.
REFERENCES
[1] S. Tsuchiya, Y. Sakamoto, Y. Tsuchimoto and V. Lee, "Big Data Processing in Cloud Environments," FUJITSU
Science and Technology Journal, 48(2): 159-168, 2012.
[2] “Big data: science in the petabyte era: Community cleverness Required” Nature 455 (7209): 1, 2008.
[3] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica
and M. Zaharia, “A view of cloud computing,” Communications of the ACM 53(4): 50-58, 2010.
[4] R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg and I. Brandic, “Cloud computing and emerging it platforms: Vision,
hype, and reality for delivering computing as the 5th utility,” Future Generation Computer Systems 25(6): 599-616,
2009.
[5] L. Wang, J. Zhan, W. Shi and Y. Liang, “In cloud, can scientific communities benefit from the economies of scale?”
IEEE Transactions on Parallel and Distributed Systems 23(2): 296-303, 2012.
[6] S. Sakr, A. Liu, D. Batista, and M. Alomari, “A survey of large scale data management approaches in cloud
environments,” Communications Surveys & Tutorials, IEEE, 13(3): 311–336, 2011.
[7] B. Li, E. Mazur, Y. Diao, A. McGregor and P. Sheny, “A platform for scalable one-pass analytics using mapreduce,”
in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'11), 2011, pp.
985-996.
[8] R. Kienzler, R. Bruggmann, A. Ranganathan and N. Tatbul, “Stream as you go: The case for incremental data access
and processing in the cloud,” IEEE ICDE International Workshop on Data Management in the Cloud (DMC'12),
2012.
[9] C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V.B.N. Rao, V. Sankarasubramanian, S.
Seth, C. Tian, T. ZiCornell and X. Wang, “Nova: Continuous pig/hadoop workflows,” Proceedings of the ACM
SIGMOD International Conference on Management of Data (SIGMOD'11), pp. 1081-1090, 2011.
[10] K.H. Lee, Y.J. Lee, H. Choi, Y.D. Chung and B. Moon, “Parallel data processing with mapreduce: A survey,” ACM
SIGMOD Record 40(4): 11-20, 2012.
[11] X. Zhang, C. Liu, S. Nepal and J. Chen, “An Efficient Quasiidentifier Index based Approach for Privacy
Preservation over Incremental Data Sets on Cloud,” Journal of Computer and System Sciences (JCSS), 79(5): 542-
555, 2013.