
SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE DATA

Presented By: AKSHAY (1CE10CS006)

Guided By: Mr. Deepak N R, Asst. Prof., Dept. of CSE

ABSTRACT

• Long Tail Science Data

• Big Data Problem

• Data Bridge

INTRODUCTION

• Datasets.

• Highly distributed, not well organized or curated, not easily discoverable or reusable.

• Long tail of science data – the massive number of relatively small datasets.

• Contain rich information that can be used to maximize new scientific discoveries.

• Big Data problem: how do we enable easier discoverability and use of the massive number of smaller datasets?

DATABRIDGE VISION

• DataBridge is an indexing mechanism for scientific datasets, similar to web search engines that help find web pages of interest.

• Tags, metadata, contexts and naming conventions to identify relevancy.

• It will map datasets connected by multi-dimensional relationships.

• Maximize the usefulness of long tail data for scientific research.

• Facilitate searching for collaborators.

• Enable data set publication as a means of communication.

• Assist scientists in discovering “interesting” data sets by automatically forming communities of data.

MULTIDIMENSIONAL NETWORK

• Each node represents a single data set.

• Each edge represents the similarity between two data sets.

• Line thickness denotes strength of similarity.
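A minimal sketch of how such a network could be held in memory, assuming each edge stores one similarity strength per dimension. The class name `DatasetNetwork`, the dimension labels, and the dataset names are hypothetical and are not taken from DataBridge.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetNetwork:
    # edges[(a, b)][dimension] = similarity strength, assumed to lie in [0, 1]
    edges: dict = field(default_factory=dict)

    def add_similarity(self, a, b, dimension, strength):
        # Edges are undirected, so the pair is stored in sorted order.
        key = tuple(sorted((a, b)))
        self.edges.setdefault(key, {})[dimension] = strength

    def strength(self, a, b, dimension):
        # A missing edge or dimension is treated as zero similarity.
        return self.edges.get(tuple(sorted((a, b))), {}).get(dimension, 0.0)

    def neighbors(self, node, dimension):
        # Map each neighbor of `node` to the edge strength in one dimension.
        found = {}
        for (a, b), dims in self.edges.items():
            if dimension in dims and node in (a, b):
                found[b if a == node else a] = dims[dimension]
        return found


net = DatasetNetwork()
net.add_similarity("survey_2010", "census_2011", "data", 0.8)
net.add_similarity("census_2011", "survey_2010", "user", 0.3)
```

In a visualization, the stored strength would drive the line thickness of each edge.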

BUILD A SOCIAL NETWORK FOR SCIENTIFIC DATA

• Instrument known data:

- Use DataVerse Network and iRODS.

- DataVerse contains social science and political data.

- iRODS is used by many academic institutions and government agencies around the world.

•   Investigate similarity measures:

- Data to Data Connections: metadata and derived data about the data set.

- User to Data Connections: metadata about the usage and users of the data set.

- Method to Data Connections: metadata about the analysis of the data set.

DATA TO DATA SIMILARITY MEASURES

• Use native and “derived” metadata.

• Native metadata provided with the dataset.

• Derived metadata e.g. from the Hive ontology engine.

• Use “categorical” similarity measures such as occurrence frequency to produce a similarity matrix for non-numeric data.
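As an illustration, the sketch below builds a similarity matrix from categorical metadata using one common formulation of the occurrence frequency measure. Matching values score 1, and mismatching values score 1 / (1 + log(N/f(x)) · log(N/f(y))), where N is the number of data sets and f is the value's frequency. The slides do not state which formula DataBridge uses, so the formula, the function name, and the sample records are assumptions.

```python
import math
from collections import Counter


def occurrence_frequency_matrix(records):
    """records: list of dicts mapping attribute name -> categorical value."""
    n = len(records)
    attrs = sorted({a for r in records for a in r})
    # How often each value occurs for each attribute across all records.
    freq = {a: Counter(r.get(a) for r in records) for a in attrs}

    def attr_similarity(a, x, y):
        if x == y:
            return 1.0
        # Mismatches between frequent values are penalized less than
        # mismatches between rare values.
        return 1.0 / (1.0 + math.log(n / freq[a][x]) * math.log(n / freq[a][y]))

    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = sum(
                attr_similarity(a, records[i].get(a), records[j].get(a))
                for a in attrs
            )
            matrix[i][j] = total / len(attrs)  # average over attributes
    return matrix


# Hypothetical native metadata for four data sets.
records = [
    {"discipline": "sociology", "format": "csv"},
    {"discipline": "sociology", "format": "csv"},
    {"discipline": "politics", "format": "tab"},
    {"discipline": "sociology", "format": "tab"},
]
matrix = occurrence_frequency_matrix(records)
```

Each entry `matrix[i][j]` can then serve as the edge weight between data sets i and j.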

USER TO DATA SIMILARITY MEASURES

• Create audit trails tracking:

- Use of data sets in published papers.

- Views and downloads of data sets.

- Owners of data sets.

•   Calculate similarity of data sets from audit trails.

•   Use frequency and recency of access as a measure of data value.
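A rough sketch of both ideas, assuming an audit trail is available as a list of access timestamps and a set of user identifiers per data set. The exponential decay with a 30-day half-life and the Jaccard overlap are illustrative choices, not measures named in the slides.

```python
import math
from datetime import datetime, timedelta


def access_value(access_times, now, half_life_days=30.0):
    # Each access contributes 1.0 when fresh and halves every half_life_days,
    # so the score rewards both frequent and recent access.
    decay = math.log(2) / half_life_days
    return sum(
        math.exp(-decay * (now - t).total_seconds() / 86400.0)
        for t in access_times
    )


def user_overlap(users_a, users_b):
    # Jaccard similarity of the sets of users who accessed each data set.
    a, b = set(users_a), set(users_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


now = datetime(2014, 1, 1)
recent = access_value([now, now - timedelta(days=30)], now)
overlap = user_overlap({"u1", "u2"}, {"u2", "u3"})
```

Here `recent` combines one access today with one access a half-life ago, and `overlap` is the fraction of users the two data sets share.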

METHODS TO DATA SIMILARITY MEASURES

•   Create an ontology of analytic methods and applications.

•   Gather information about the usage of methods on data sets.

•   Calculate similarity from ontology and usage information.
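One possible way to combine the ontology with usage information is sketched below. Each data set's list of applied methods is expanded with the methods' ancestors in the ontology, and the expanded sets are compared with Jaccard similarity. The ontology contents and the function names are hypothetical, and the slides do not specify how DataBridge computes this similarity.

```python
# Hypothetical ontology of analytic methods, stored as child -> parent.
PARENT = {
    "linear_regression": "regression",
    "logistic_regression": "regression",
    "regression": "statistical_method",
    "k_means": "clustering",
    "clustering": "statistical_method",
}


def with_ancestors(methods):
    # Expand each method with every ancestor up to the ontology root.
    expanded = set()
    for method in methods:
        while method is not None:
            expanded.add(method)
            method = PARENT.get(method)
    return expanded


def method_similarity(methods_a, methods_b):
    # Jaccard similarity over the ancestor-expanded method sets.
    a, b = with_ancestors(methods_a), with_ancestors(methods_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


close = method_similarity({"linear_regression"}, {"logistic_regression"})
far = method_similarity({"linear_regression"}, {"k_means"})
```

Two data sets analysed with sibling methods score higher than two analysed with methods that only share the ontology root.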

COMMUNITY DETECTION

• Investigate a number of community detection algorithms, e.g.:

- Spatial algorithms based on Euclidean or Manhattan distance.

- Algorithms based on adjacency relationships.
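As a simple illustration of the adjacency-based family, the sketch below drops edges whose similarity falls under a threshold and treats each connected component of the remaining graph as a community. This is about the simplest such method and is not necessarily one that DataBridge uses. The function name, the threshold, and the sample edges are assumptions.

```python
from collections import defaultdict


def communities(edges, threshold):
    """edges: iterable of (node_a, node_b, similarity) triples."""
    adjacency = defaultdict(set)
    nodes = set()
    for a, b, similarity in edges:
        nodes.update((a, b))
        if similarity >= threshold:  # keep only sufficiently strong edges
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


edges = [("A", "B", 0.9), ("B", "C", 0.8), ("C", "D", 0.2), ("D", "E", 0.7)]
groups = communities(edges, threshold=0.5)
```

Raising the threshold splits the network into smaller, more tightly related communities.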

QUERY INTERFACES

• Simple network visualization as shown already.

- Add thresholds and other filters.

• Multi-dimensional queries.

- Simple relational calculus.

- Ontology of similarity and community detection methods combined with SPARQL.
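A threshold filter across several dimensions might look like the sketch below, assuming edges are stored as a mapping from a data set pair to per-dimension strengths. The function name `query_edges`, the dimension labels, and the sample data are hypothetical.

```python
def query_edges(edges, minimums):
    """Return the pairs whose strength meets every per-dimension minimum.

    edges:    dict mapping (a, b) -> {dimension: strength}
    minimums: dict mapping dimension -> minimum required strength
    """
    return sorted(
        pair
        for pair, dims in edges.items()
        # A dimension missing from an edge counts as zero similarity.
        if all(dims.get(d, 0.0) >= m for d, m in minimums.items())
    )


edges = {
    ("census_2011", "survey_2010"): {"data": 0.8, "user": 0.3},
    ("poll_2012", "survey_2010"): {"data": 0.6, "user": 0.7},
    ("census_2011", "poll_2012"): {"data": 0.2},
}
strong_in_both = query_edges(edges, {"data": 0.5, "user": 0.5})
```

Passing a single dimension reproduces the simple threshold filter, while several dimensions give a conjunctive multi-dimensional query.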

DATABRIDGE PROGRESS TO DATE

•   Initial gatherer that populates metadata store from our local DVN.

•   Metadata store is the file system.

•   Relevance engine with one algorithm.

•   Network database is currently both Neo4j and Titan.

DATABRIDGE CONCLUSION

• A promising start on a challenging research problem.

•   Outstanding issues:

- What metrics will we use to compare various similarity and community detection algorithms?

- What happens when we scale up?

- What sort of queries and query interfaces will be most effective?

- How best to encourage publication?

REFERENCES

[1] P. Holdren, “Memorandum For The Heads Of Executive Departments And Agencies. SUBJECT: Increasing Access to the Results of Federally Funded Scientific Research,” Feb. 2013, http://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf

[2] Databridge, http://www.databridge.web.unc.edu

[3] DVN, The Dataverse Network, http://thedata.org

[4] C. Palmer and C. Faloutsos, “Electricity based external similarity of categorical attributes,” in: Proceedings of PAKDD, 2003, pp. 486-500.

[5] M. Crosas, “A Data Sharing Story,” J eScience Librarianship 1(3): Article 7, 2012.

[6] A. Rajasekar, H. Kum, M. Crosas, J. Crabtree, S. Sankaran, H. Lander, T. Carsey, G. King, and J. Zhan, “The DataBridge,” Science Journal, ASE, in press, 2013.

THANK YOU!