DOI: 10.23883/IJRTER.2017.3180.LWA51
PARALLEL DATA COMPRESSION IN CLINICAL DATA WITH
CHUNKING ALGORITHM ON A CLOUD
Ms. R. Hemavathy1, Mrs. G. Sangeethalakshmi2, Ms. A. Anitha3
1Research Scholar, Dept of Computer Science and Applications, D.K.M College for Women (Autonomous), Vellore, Tamilnadu, India
2Assistant Professor, Dept of Computer Science and Applications, D.K.M College for Women (Autonomous), Vellore, Tamilnadu, India
3Research Scholar, Dept of Computer Science and Applications, D.K.M College for Women (Autonomous), Vellore, Tamilnadu, India
Abstract - The emergence of massive clinical datasets presents both challenges and opportunities in data storage and analysis. Advances in information and communication technology offer the most viable solutions to big data storage and analysis in terms of efficiency and scalability. As data progressively grows within data centers, cloud storage systems continuously face challenges in saving storage capacity and in providing the capabilities necessary to move big data within an acceptable time frame. The storage pressure on cloud storage systems caused by the explosive growth of data is increasing by the day; in particular, a vast amount of redundant data wastes a great deal of storage space. Data deduplication can effectively reduce the size of data by eliminating redundant data in storage systems, and it therefore plays a major role in cloud data storage. Current big sensing clinical data processing on Cloud has adopted some data compression techniques. However, due to the high volume and velocity of big sensing data, traditional data compression techniques lack sufficient efficiency and scalability. Instead of compressing basic data units, the compression is conducted over partitioned data chunks. In the deduplication technology, data are broken down into multiple pieces called "chunks". The chunking algorithm uses a new parallel processing framework; the Two Thresholds Two Divisors (TTTD) algorithm is used as the chunking mechanism and controls the variation of the chunk size. It is vital that big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data. To restore the original data sets, restoration functions and predictions will be designed. Successful restoration of Electronic Health Records helps improve patient safety and quality of care.
Key Words: Deduplication, Two Thresholds Two Divisors (TTTD)
I. INTRODUCTION
Big sensing data from very different types of sensing systems (e.g. healthcare, video, satellite, meteorology, earthquake monitoring, traffic tracking, advanced physics simulations, genomics, biological studies, and so on) are highly heterogeneous and exhibit the typical traits of real-world data, the four 'V's: Volume, Variety, Velocity and Veracity. To overcome the system issues arising from the four 'V's of big sensing data, the development of big data processing on Cloud is becoming increasingly popular. Cloud computing offers a promising platform for large-scale processing with its powerful computation capability, storage, scalability, resource utilization and low cost, and has attracted significant attention in connection with big data.
To reduce the time and space cost for big data, particularly big sensing data processing on Cloud, different techniques have been proposed and developed. However, because of the size and speed of big sensing data in the real world, the existing data compression and reduction techniques still need to be improved. It has been well recognized that big sensing data or big data sets from mesh
networks such as sensor systems and social networks can take the form of big graph data. To process such big graph data, current techniques normally introduce complex and multiple iterations.
Healthcare is one of the most important areas for developing and developed countries in sustaining their valuable human resources. Today the healthcare industry is flooded with huge quantities of data that require validation and proper analysis, and big data analytics can play a major role in processing and analyzing healthcare data in various forms to deliver appropriate applications. India has the second largest population in the world, and this growing population over-burdens the healthcare infrastructure within the country. The exponential growth of data over the last decade has introduced a new domain in the field of information technology referred to as big data. Here we propose applying MapReduce to India's big medical data, in order to further improve data size reduction, cut the processing time cost and remove the iterations involved in processing big sensing data. Instead of compressing basic data units, the compression is conducted over partitioned data chunks. In the deduplication technology, data are broken down into multiple pieces referred to as "chunks".
The chunking algorithm uses a new parallel processing framework; the Two Thresholds Two Divisors (TTTD) algorithm is used as the chunking mechanism and controls the variation of the chunk size. It is vital that big data solutions be multithreaded and that data access approaches be precisely tailored to large volumes of semi-structured/unstructured data, especially streaming big sensing data on Cloud. With this technique, the big sensing data stream is first filtered to form standard data chunks based on our predefined similarity model. Then, the incoming sensing data stream is compressed according to the generated standard data chunks. With this data compression we aim to improve compression efficiency by avoiding traditional compression based on each data unit, which is space and time costly due to low-level data traversal and manipulation. At the same time, because the compression happens at a higher data chunk level, it reduces the chance of introducing excessive iteration and recursion, which prove to be the main trouble in processing big graph data.
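The following is a minimal Python sketch of this chunk-referencing idea, not the authors' implementation: the fixed chunk size, the byte-level Jaccard similarity and the 0.9 threshold are illustrative stand-ins for the predefined similarity model and the TTTD chunking described later.

import hashlib

CHUNK_SIZE = 4096           # illustrative fixed size; TTTD produces variable-sized chunks
SIMILARITY_THRESHOLD = 0.9  # illustrative threshold for the similarity model

def byte_similarity(a: bytes, b: bytes) -> float:
    # Crude byte-level Jaccard similarity used as a stand-in similarity model.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def compress_stream(stream: bytes):
    # Replace chunks that resemble an already-seen standard chunk with a reference.
    standard_chunks = {}   # fingerprint -> standard chunk bytes
    output = []            # mix of ("ref", fingerprint) and ("raw", bytes)
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        match = next((fp for fp, std in standard_chunks.items()
                      if byte_similarity(chunk, std) >= SIMILARITY_THRESHOLD), None)
        if match is not None:
            output.append(("ref", match))      # store only a reference
        else:
            fp = hashlib.sha1(chunk).hexdigest()
            standard_chunks[fp] = chunk        # register a new standard chunk
            output.append(("raw", chunk))
    return output, standard_chunks

An incoming chunk that is sufficiently similar to an already stored standard chunk is replaced by a short reference, which is where the space saving comes from.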
II. RELATED WORK
2.1 Healthcare In India
Although the government has promised to introduce digitization for maintaining medical records, this is not yet the reality. The country does not even have standardization of common medical terminologies. In big data processing the data must be processed in a distributed environment. Analyzing data such as medical information requires statistical and mining approaches, and delivering the data with a fast response time is a high priority.
2.2 Data Compression In Cloud Computing
Earlier work designed the cloud environment based on the captured requirements and presented its implementation on Amazon Web Services. It also presents an experiment running the MapReduce system in a cloud environment to validate the proposed framework, and evaluates the experiment against criteria such as processing speed, data-storage usage, latency and cost efficiency.
2.3 Analysis Of Big Data Compression
Paper | Feature | Advantages | Disadvantages
Big Data Processing in Cloud Environments | Big data processing techniques from system and application aspects | MapReduce optimization strategies and applications | Grid and cloud computing have both intended to access large amounts of computing power by aggregating resources.
A survey of large scale data management approaches in cloud environments | Mechanisms of deploying data-intensive applications in the cloud | Economical processing of large scale data on the cloud is provided. | The latency gap between multi-core CPUs and mechanical hard disks is growing every year, which makes the challenges of data-intensive computing harder to overcome.
A platform for scalable one-pass analytics using MapReduce | MapReduce model extended to incremental one-pass analytics | Magnitude reduction of internal data spills | Long latencies make it unsuitable for producing incremental results.
Stream as you go: The case for incremental data access and processing in the cloud | Processing data based on a stream data management architecture | MapReduce-based DNA sequence analysis application | Long-term storage of complete data sets is an explicit requirement.
Very fast estimation for result and accuracy of big data analytics: The EARL system | Approximate results based on samples | Response times much shorter than those obtained in the actual computations | Error estimation is required to determine if node recovery is necessary.
2.4 Existing System
• In the existing system, to process big graph data, current techniques normally introduce complex and multiple iterations.
• Iterations and recursive algorithms may cause computation problems such as parallel memory bottlenecks, deadlocks in data access, and algorithm inefficiency.
• Even with Cloud platform, the task of big data processing may introduce unacceptable time
cost, or even lead to processing failures.
2.5 Drawback
• Existing approaches mainly concentrate on static scenarios such as backup and archive systems, and are not suitable for cloud storage systems due to the dynamic nature of the data.
• Current techniques normally introduce complex and multiple iterations. Iterations and
recursive algorithms may cause computation problems such as parallel memory bottlenecks,
deadlocks on data accessing, algorithm inefficiency.
• It may introduce unacceptable time cost, or even lead to processing failures.
2.6 Proposed Algorithm
2.6.1 Two Thresholds Two Divisors (TTTD) Chunking
• Our proposed system processes big data on the cloud using a compression technique based on the Huffman algorithm, overcoming some drawbacks of the MapReduce approach used in the existing system.
• The data to be stored on the cloud after compression will remain available in its original size on the local server.
• At processing time this data on the local server is divided into chunks so that it can be processed in parallel and more quickly (a hedged sketch of this step follows this list).
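Below is a minimal sketch of the per-chunk parallel compression step, assuming fixed-size chunks for brevity; zlib's DEFLATE (which uses Huffman coding internally) stands in for a pure Huffman coder, and the worker count is illustrative.

import zlib
from concurrent.futures import ProcessPoolExecutor

def split_into_chunks(data: bytes, chunk_size: int = 64 * 1024):
    # Fixed-size splitting used only for brevity; TTTD (Section VI) produces
    # variable-sized, content-defined chunks instead.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def compress_chunk(chunk: bytes) -> bytes:
    # DEFLATE includes Huffman coding; it stands in here for the paper's
    # Huffman-based compressor applied to a single chunk.
    return zlib.compress(chunk, level=6)

def parallel_compress(data: bytes, workers: int = 4):
    chunks = split_into_chunks(data)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(compress_chunk, chunks))
    ratio = sum(len(c) for c in compressed) / max(len(data), 1)
    return compressed, ratio   # ratio could drive the compression level bar

The returned ratio indicates to what extent the file size has been reduced, which corresponds to the compression level indicator mentioned in Section VI.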
2.6.2 Advantages Of Proposed System
• The compression happens at a higher data chunk level.
• It reduces the chance of introducing excessive iteration and recursion, which prove to be the main trouble in processing big graph data.
III. SYSTEM DESIGN
Architecture Diagram
Figure1. Architecture
IV. MODULES
• Similarity Model
• Data Chunk Generation and Formation
• Data Compression
• Distributed Storage Network
• Data driven scheduling on Cloud
4.1 Similarity Model
• Currently, five types of models are commonly used: the common element approach, template models, geometric models, feature models and Geon theory.
• The proposed models below are related to the geometric model and the common element approach, for numerical data and text data respectively.
• Our similarity models work on two types of data sets, multidimensional numerical data and text data (an illustrative sketch follows this list).
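As an illustration only (the paper does not give the exact formulas), the two similarity models might be realized as a Euclidean-distance measure for multidimensional numerical records and a Jaccard measure for tokenized text; the functions below are assumptions, not the authors' definitions.

import math

def numeric_similarity(x, y):
    # Geometric-model style similarity: inverse Euclidean distance between
    # two equal-length numerical records (e.g. vital-sign vectors).
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + dist)   # maps distance into (0, 1]

def text_similarity(s, t):
    # Common-element style similarity: Jaccard overlap of word sets.
    ws, wt = set(s.lower().split()), set(t.lower().split())
    return len(ws & wt) / len(ws | wt) if ws | wt else 1.0

if __name__ == "__main__":
    print(numeric_similarity([98.6, 72.0, 120.0], [98.4, 72.5, 119.0]))
    print(text_similarity("acute chest pain", "chest pain acute onset"))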
4.2 Data Chunk Generation and Formation
• In the problem analysis, we have introduced the essential idea of data chunk based
compression.
• Under this scheme, the data are not compressed by encoding or data prediction one unit at a time; the approach is similar to frequent-component compression.
• The difference is that frequent-component compression recognizes only simple data units, whereas our data chunk based compression recognizes complex data partitions and patterns during the compression process.
4.3 Data Compression
• MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid.
• Computational processing can occur on data stored either in a file system (unstructured) or in
a database (structured).
• MapReduce can take advantage of locality of data, processing data on or near the storage
assets to reduce data transmission.
• "Map" function: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.
• "Reduce" function: The master node then collects the answers to all the sub-problems and combines them in some way to form the output, the answer to the problem it was originally trying to solve (a toy example follows this list).
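As a toy illustration of the map and reduce roles just described (not the paper's actual job), the sketch below counts chunk fingerprints with a map step and a reduce step; the SHA-1 fingerprinting is an assumption used only to spot redundant chunks.

import hashlib
from collections import defaultdict

def map_phase(chunks):
    # Map: each worker emits (fingerprint, 1) for every chunk it receives.
    return [(hashlib.sha1(c).hexdigest(), 1) for c in chunks]

def reduce_phase(pairs):
    # Reduce: sum the counts per fingerprint to find duplicated chunks.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

if __name__ == "__main__":
    chunks = [b"reading-A", b"reading-B", b"reading-A"]   # "reading-A" is redundant
    print(reduce_phase(map_phase(chunks)))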
4.4 Distributed storage network
• Distributed networking is a distributed computing network system, said to be "distributed" when the computer programs and the data to be worked on are spread out over more than one computer.
• Prior to the emergence of low-cost desktop computing power, computing was generally centralized on one computer. Although such centers still exist, distributed networking applications and data operate more efficiently over a mix of desktop workstations, local area network servers, regional servers, Web servers and other servers.
4.5 Data driven scheduling on Cloud
• MapReduce and TTTD are the approaches used for reducing the big data size through data compression.
• However, a smaller data size does not necessarily mean a shorter processing time, which is closely related to the task division and workload distribution over Cloud.
• In order to achieve a shorter processing time, we introduce a novel scheduling algorithm based on the compressed data sets.
• For our data driven scheduling and its mapping, two types of mapping strategies, node based mapping and edge based mapping, are introduced first for comparison.
4.6 Mapping for Cloud scheduling with real network nodes
• The most direct and easy way to distribute and schedule the big data processing task over Cloud is based on the real-world topology of the network itself.
• Under this mapping scheme, the mapping algorithm is quite simple, and the computation resources are divided and distributed to each node for simulating and analyzing the data flows in a real-world network.
4.7 Mapping for Cloud scheduling with data exchange edges
• Instead of directly allocating the computation resources over Cloud according to the real
world network topology, the mapping can be also carried out based on the data exchanging
edge between nodes.
• Each edge that carries data flows in the network will be simulated and analyzed with a computation unit from Cloud (a small sketch of both mapping strategies follows this list).
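A minimal sketch of the two mapping strategies, assuming the network is given as an adjacency list; the round-robin assignment of Cloud compute units is an illustrative stand-in for the actual scheduling decision.

from itertools import cycle

# Hypothetical sensor network: node -> neighbours it exchanges data with.
topology = {"A": ["B", "C"], "B": ["C"], "C": []}
compute_units = ["vm-0", "vm-1", "vm-2"]

def node_based_mapping(topology, units):
    # One Cloud compute unit per real network node.
    pool = cycle(units)
    return {node: next(pool) for node in topology}

def edge_based_mapping(topology, units):
    # One Cloud compute unit per data-exchange edge between nodes.
    pool = cycle(units)
    return {(src, dst): next(pool) for src, nbrs in topology.items() for dst in nbrs}

if __name__ == "__main__":
    print(node_based_mapping(topology, compute_units))
    print(edge_based_mapping(topology, compute_units))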
V. DFD
VI. IMPLEMENTATION
6.1 Algorithm
Two Thresholds Two Divisors (TTTD) chunking
• Our proposed system processes big data on the cloud using a compression technique based on the Huffman algorithm, overcoming some drawbacks of the MapReduce approach used in the existing system.
• The data to be stored on the cloud after compression will remain available in its original size on the local server.
• At processing time this data on the local server is divided into chunks so that it can be processed in parallel and more quickly.
• To divide big data into chunks we use the TTTD algorithm. The compression is applied to every single chunk on the cloud server, and a compression level bar shows to what extent the file size has been reduced while storing it on the cloud.
• The TTTD algorithm was proposed by HP Laboratories at Palo Alto, California. This algorithm uses the same idea as the BSW algorithm.
• In addition, the TTTD algorithm uses four parameters, the maximum threshold, the
minimum threshold, the main divisor, and the second divisor, to avoid the problems of the
BSW algorithm.
• The maximum and minimum thresholds are used to eliminate very large-sized and very
small-sized chunks in order to control the variations of chunk-size.
[Data flow diagram: the Data Owner logs in and uploads patient details; the Cloud Server stores encrypted data and maintains patient, data owner and end-user details; the User logs in, views symptoms and patient details, sends a request to the Authority, and downloads the patient file; the Authority performs an authority check and gives permission to the User.]
• The main divisor plays the same role as the BSW algorithm and can be used to make the
chunk-size close to our expected chunk-size.
• Usually, the value of the second divisor is half that of the main divisor. Because of its higher match probability, the second divisor helps the algorithm determine a backup breakpoint for a chunk in case no breakpoint can be found with the main divisor (a sketch of the algorithm follows).
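The following Python sketch illustrates TTTD as described above; it is not the authors' code. The parameter values (minimum 460, maximum 2800, main divisor 540, second divisor 270) are commonly cited defaults for an expected chunk size of roughly 1 KB, and a simple polynomial rolling hash stands in for the Rabin fingerprint of the original HP algorithm.

T_MIN, T_MAX = 460, 2800        # two thresholds on chunk size (bytes)
MAIN_D, SECOND_D = 540, 270     # two divisors; the second is half of the main
WINDOW = 48                     # sliding-window width in bytes (assumed)
BASE, MOD = 257, (1 << 31) - 1
OUT = pow(BASE, WINDOW, MOD)    # weight of the byte leaving the window

def tttd_chunks(data: bytes):
    chunks, start, backup, h = [], 0, -1, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD                 # byte enters the window
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * OUT) % MOD  # byte leaves the window
        length = i - start + 1
        if length < T_MIN:
            continue                                # chunk still too small
        if h % SECOND_D == SECOND_D - 1:
            backup = i                              # remember a backup breakpoint
        if h % MAIN_D == MAIN_D - 1:
            chunks.append(data[start:i + 1])        # main-divisor breakpoint
            start, backup = i + 1, -1
        elif length >= T_MAX:
            cut = backup if backup != -1 else i     # use backup, else hard cut
            chunks.append(data[start:cut + 1])
            start, backup = cut + 1, -1
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks

Every emitted chunk except possibly the last is guaranteed to lie between the minimum and maximum thresholds, which is exactly the chunk-size control described in the bullets above.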
Figure2. Chunking data
VII. RESULTS
SCREEN SHOTS
Screenshots (omitted here) cover: home page, owner login, patient details, view patient details, user login, user view of patient details, cloud owner details, cloud patient details, authority key generation, and authority permission granting.
VIII. CONCLUSION AND FUTURE ENHANCEMENT
In this paper, we proposed a novel scalable data compression based on similarity calculation among the partitioned data chunks with Cloud computing. For proper granularity, an effective and efficient chunking algorithm is a must: if the data is chunked accurately, it increases the throughput and the net deduplication performance. The file level chunking method is efficient for deduplicating small files, but less suitable for a big-file or backup environment. The TTTD-S algorithm not only achieves significant improvements in running time and average chunk size, but also obtains better control over the variation of chunk size by reducing the number of large chunks. In the current technique only text documents are considered; in future, image files will also be considered for the data compression technique.
REFERENCES
[1] S. Tsuchiya, Y. Sakamoto, Y. Tsuchimoto and V. Lee, "Big Data Processing in Cloud Environments," FUJITSU
Science and Technology Journal, 48(2): 159-168, 2012.
[2] “Big data: science in the petabyte era: Community cleverness Required” Nature 455 (7209): 1, 2008.
[3] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica
and M. Zaharia, “A view of cloud computing,” Communications of the ACM 53(4): 50-58, 2010.
[4] R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg and I. Brandic, “Cloud computing and emerging it platforms: Vision,
hype, and reality for delivering computing as the 5th utility,” Future Generation Computer Systems 25(6): 599-616,
2009.
[5] L. Wang, J. Zhan, W. Shi and Y. Liang, “In cloud, can scientific communities benefit from the economies of scale?”
IEEE Transactions on Parallel and Distributed Systems 23(2): 296-303, 2012.
[6] S. Sakr, A. Liu, D. Batista, and M. Alomari, “A survey of large scale data management approaches in cloud
environments,” Communications Surveys & Tutorials, IEEE, 13(3): 311–336, 2011.
[7] B. Li, E. Mazur, Y. Diao, A. McGregor and P. Sheny, “A platform for scalable one-pass analytics using mapreduce,”
in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'11), 2011, pp.
985-996.
[8] R. Kienzler, R. Bruggmann, A. Ranganathan and N. Tatbul, “Stream as you go: The case for incremental data access
and processing in the cloud,” IEEE ICDE International Workshop on Data Management in the Cloud (DMC'12),
2012.
[9] C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V.B.N. Rao, V. Sankarasubramanian, S.
Seth, C. Tian, T. ZiCornell and X. Wang, “Nova: Continuous pig/hadoop workflows,” Proceedings of the ACM
SIGMOD International Conference on Management of Data (SIGMOD'11), pp. 1081-1090, 2011.
[10] K.H. Lee, Y.J. Lee, H. Choi, Y.D. Chung and B. Moon, “Parallel data processing with mapreduce: A survey,” ACM
SIGMOD Record 40(4): 11-20, 2012.
[11] X. Zhang, C. Liu, S. Nepal and J. Chen, “An Efficient Quasiidentifier Index based Approach for Privacy
Preservation over Incremental Data Sets on Cloud,” Journal of Computer and System Sciences (JCSS), 79(5): 542-
555, 2013.