Fast detection of transformed data leaks[mithun_p_c]

FAST DETECTION OF TRANSFORMED DATA LEAKS

Submitted by: JOSNA KRISHNA

S7 CSE, Roll No.: 35

CONTENTS:
• INTRODUCTION
• SENSITIVE DATA IN COMPANIES
• DATA LEAKAGE: HOW?
• DANGER
• TOWARDS SECURITY
• EXISTING SYSTEM
• PROPOSED SYSTEM
• INTO THE ALGORITHM
• CONCLUSION

INTRODUCTION

DATA LEAKAGE: Data leakage is the unauthorized transmission of sensitive data or information from within an organization to an external destination.

SENSITIVE DATA OF COMPANIES INCLUDES:

• Intellectual property
• Financial information
• Patient information
• Personal credit card data
• Other information, depending on the business and the industry

DATA LEAKAGE: HOW?

• In the course of business, data must be handed over to trusted third parties for some operations.
• Sometimes these trusted third parties may act as points of data leakage.
• Data leakage mainly happens due to human error.

EXAMPLES:

• A hospital may give patient records to researchers who will devise new treatments.
• A company may have partnerships with other companies that require sharing customer data.
• An enterprise may outsource its data processing, so data must be given to various other companies.

DANGER:
• The number of leaked sensitive data records has grown 10 times in recent years.
• Accidental data leakage poses a greater risk than vulnerable software.
• Sensitive data leakage is more likely where there is no end-to-end encryption (example: PGP, Pretty Good Privacy).

TOWARDS SECURITY:
• Prevent direct access to cleartext sensitive data.
• Deploy a screening tool:
  - To scan computer file systems.
  - To scan server storage.
  - To inspect outbound network traffic.

• Data leak detection differs from antivirus (AV) and network intrusion detection systems (NIDS).

DATA LEAK DETECTION HAS:

-> New security requirements
-> Algorithmic challenges

Algorithmic challenges:
- Data transformation
- Scalability

•Direct usage of Automata-based string matching is not possible.

EXISTING SYSTEM: It is based on set intersection. The operation is performed on two sets of n-grams, one from the content and one from the sensitive data. This method is also used to detect similar documents in:

• The web
• Shared malicious traffic patterns
• Malware
• E-mail spam
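The set-intersection idea can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of any existing product: the function names `ngrams` and `intersection_rate`, the use of byte strings, and the default n = 3 are all assumptions made for the example.

```python
def ngrams(data: bytes, n: int) -> set:
    """Return the set of all n-byte substrings (n-grams) of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def intersection_rate(sensitive: bytes, content: bytes, n: int = 3) -> float:
    """Fraction of the sensitive data's n-grams that also occur in content.

    Hypothetical scoring rule for illustration: |S ∩ C| / |S|.
    """
    s, c = ngrams(sensitive, n), ngrams(content, n)
    return len(s & c) / len(s) if s else 0.0


if __name__ == "__main__":
    print(intersection_rate(b"secret", b"the secret is out"))  # 1.0
    print(intersection_rate(b"secret", b"nothing here"))       # 0.0
```

A high rate suggests that the content contains the sensitive data; a threshold on this rate would then raise an alert.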

EXISTING SYSTEMS ARE:

• Symantec DLP
• Identity Finder
• Global Velocity
• GoCloud DLP, etc.

DISADVANTAGES OF EXISTING SYSTEM:

• Set intersection is orderless (the ordering of shared n-grams is not analyzed).
• It generates false alerts, especially when n is set to a small value.
• It cannot detect partial data leakage.
• It is therefore not an adequate detection method.
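The orderless weakness is easy to reproduce. The sketch below is a hypothetical illustration (the helper `ngram_set` and the sample strings are made up): the content never contains the sensitive string, yet every 2-gram of the sensitive string appears in it, so a pure set comparison reports a full match.

```python
def ngram_set(data: bytes, n: int) -> set:
    """Set of all n-byte substrings of data (illustrative helper)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


sensitive = b"abcdef"
# The same 2-grams as in the sensitive data, but in reverse order.
content = b"ef de cd bc ab"

shared = ngram_set(sensitive, 2) & ngram_set(content, 2)
all_shared = shared == ngram_set(sensitive, 2)   # every sensitive 2-gram is found
really_leaked = sensitive in content             # but the data itself is not there

if __name__ == "__main__":
    print(all_shared, really_leaked)  # True False
```

An order-aware comparison would not be fooled here, because the shared n-grams do not occur in the order they have in the sensitive data.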

PROPOSED SYSTEM: It uses a sequence alignment algorithm, executed on:
• The sampled sensitive data sequence.
• The sampled content being inspected.

The alignment yields the amount of sensitive data contained in the content, which gives higher accuracy.

FACING THE CHALLENGES:

The scalability issue is solved by sampling both the sensitive data and the content sequence before aligning.

A pair of algorithms is used:

• Comparable Sampling Algorithm
• Sampling Oblivious Alignment Algorithm

Together they give high detection specificity and handle both pervasive and localized modifications.

ABOUT THE ALGORITHM:

o The Comparable Sampling Algorithm yields the same samples of a sequence wherever the sampling starts and ends.

o The Sampling Oblivious Alignment Algorithm infers the similarity between the original unsampled sequences from their samples, using dynamic programming.

CONTINUATION: In this method, both the sensitive data and the content sequence are sampled. The alignment is performed on the sampled sequences. Here, a 'comparable sampling' property is used. Both algorithms run faster on a GPU than on a CPU, which promises high-speed security scanning.

INTO THE ALGORITHMS

COMPARABLE SAMPLING ALGORITHM

Requirements:

Definition 1: A substring is a consecutive segment of the original string.

Definition 2: A subsequence does not require its items to be consecutive in the original string.

Definition 3: Given a string x that is a substring of y, comparable sampling on x and y yields x' and y' such that x' is similar to a substring of y'.

Definition 4: Given x as a substring of y, a subsequence-preserving sampling on x and y yields two subsequences x' and y' such that x' is a substring of y'.
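The difference between a substring (Definition 1) and a subsequence (Definition 2) can be checked with two small helpers. This is a minimal sketch; the names `is_substring` and `is_subsequence` are chosen for this example and work on any Python sequences.

```python
def is_substring(x, y) -> bool:
    """True if x occurs in y as a consecutive segment."""
    n, m = len(x), len(y)
    return any(list(y[i:i + n]) == list(x) for i in range(m - n + 1))


def is_subsequence(x, y) -> bool:
    """True if the items of x occur in y in order, not necessarily adjacent."""
    it = iter(y)
    # "item in it" consumes the iterator up to the match, which preserves order.
    return all(item in it for item in x)


if __name__ == "__main__":
    y = [5, 1, 9, 8, 5]
    print(is_substring([1, 9, 8], y))   # True: consecutive in y
    print(is_substring([1, 8], y))      # False: 9 sits in between
    print(is_subsequence([1, 8], y))    # True: order is kept, gaps allowed
    print(is_subsequence([8, 1], y))    # False: wrong order
```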

ADVANTAGES:

It is deterministic and subsequence preserving.

The algorithm is unbiased.

It yields the same samples of a sequence wherever the sampling starts and ends.

COMPARABLE SAMPLING ALGORITHM:

Input: an array S of items, a size |w| for a sliding window w, a selection function f(w, N) that selects the N smallest items from a window w, i.e., f = min(w, N)
Output: a sampled array T

initialize T as an empty array of size |S|
w ← read(S, |w|)
let w.head and w.tail be indices in S corresponding to the higher-indexed end and lower-indexed end of w, respectively
collection mc ← min(w, N)
while w is within the boundary of S do
    mp ← mc
    move w toward high index by 1
    mc ← min(w, N)
    if mc ≠ mp then
        item en ← collectionDiff(mc, mp)
        item eo ← collectionDiff(mp, mc)
        if en < eo then
            write value en to T at w.head's position
        else
            write value eo to T at w.tail's position
        end if
    end if
end while

ALGORITHM ANALYSIS: We set up our sampling procedure with a sliding window of size 6 (i.e., |w| = 6) and N = 3. The input sequence is 1,5,1,9,8,5,3,2,4,8. The initial sliding window is w = [1,5,1,9,8,5] and the collection is mc = {1,1,5}.
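The sampling procedure and this setup can be tried out with the Python sketch below. It is one possible reading of the pseudocode, not a reference implementation. Assumptions made here: the function name `comparable_sample` is hypothetical; unsampled positions are marked with `None`; the N smallest items are recomputed by sorting each window (simple, but not the efficient variant); and "w.tail's position" is read as the position of the item that has just left the window.

```python
from collections import Counter


def comparable_sample(S, window_size, N):
    """Sliding-window sampling sketch: returns T with sampled items kept
    at their original positions and None everywhere else."""
    T = [None] * len(S)
    tail, head = 0, window_size - 1          # window covers S[tail..head]
    if head >= len(S):
        return T                             # input shorter than one window
    # mc: multiset of the N smallest items in the current window
    mc = Counter(sorted(S[tail:head + 1])[:N])
    while head + 1 < len(S):                 # window can still move right
        mp = mc
        tail += 1
        head += 1
        mc = Counter(sorted(S[tail:head + 1])[:N])
        if mc != mp:
            # The two multisets differ by exactly one item each way.
            (e_new,) = (mc - mp).elements()  # item that joined the N smallest
            (e_old,) = (mp - mc).elements()  # item that dropped out of them
            if e_new < e_old:
                T[head] = e_new              # the item that just entered
            else:
                T[tail - 1] = e_old          # assumption: item that just left
    return T


if __name__ == "__main__":
    S = [1, 5, 1, 9, 8, 5, 3, 2, 4, 8]
    print(sorted(S[:6])[:3])                 # initial mc: [1, 1, 5]
    print(comparable_sample(S, 6, 3))
    # [1, None, 1, None, None, None, None, 2, None, None]
```

Because the choice depends only on local window contents, two overlapping inputs select the same items in their shared region, which is what makes the samples comparable.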

COMPLEXITY:

The complexity of sampling with the min(w, N) selection function is O(n log|w|), or O(n) when |w| is treated as a constant, where n is the size of the input and |w| is the size of the window.

The factor O(log|w|) comes from maintaining the smallest N items within the window.

SAMPLING OBLIVIOUS ALIGNMENT ALGORITHM

Requirements:

The algorithm runs on compact sampled sequences L.

Extra fields for scoring matrix cells in dynamic programming.

An extra step in the recurrence relation for updating the null region (a run of consecutive sampled-out items).

A complex weight function that computes similarities between two null regions.

SECURITY ADVANTAGES :

Order-aware comparison

High tolerance to pattern variation

Capability of detecting partial leaks

Consistent

SAMPLING OBLIVIOUS ALIGNMENT ALGORITHM

Input: a weight function fw; the visited cells in the H matrix that are adjacent to H(i, j): H(i-1, j-1), H(i, j-1), and H(i-1, j); and the i-th and j-th items L^a_i and L^b_j in the two sampled sequences L^a and L^b, respectively.
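To show how a cell H(i, j) is filled from its three visited neighbours, here is a plain local-alignment recurrence (Smith-Waterman style) in Python. It is swapped in purely for illustration and is not the sampling oblivious algorithm itself: it has no extra cell fields, no null-region update step, and the weight function is replaced by assumed constant scores (`match`, `mismatch`, `gap`). The function name `local_alignment_score` is also hypothetical.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    H[i][j] is computed from H[i-1][j-1], H[i][j-1] and H[i-1][j],
    and is never allowed to drop below 0 (local alignment).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Stand-in for the weight function: fixed match/mismatch score.
            w = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + w,   # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,     # skip an item of a
                H[i][j - 1] + gap,     # skip an item of b
            )
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(local_alignment_score([1, 1, 2], [9, 1, 1, 2, 9]))        # 3
    print(local_alignment_score([1, 1, 2, 5, 5], [9, 1, 1, 2, 9]))  # 3
    print(local_alignment_score([1, 2, 3], [7, 8, 9]))              # 0
```

The second call shows why alignment can expose a partial leak: only part of the first sequence appears in the second, and the score reflects the length of that shared, ordered part.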

CONCLUSION :

• Presented here is a content inspection technique for detecting sensitive data leakage.
• The detection approach is based on aligning two sampled sequences for similarity comparison.
• Our alignment method is useful for common data leak scenarios.
