SOCIOMETRIC METHODS FOR RELEVANCY ANALYSIS OF LONG TAIL SCIENCE DATA
Presented By: AKSHAY (1CE10CS006)
Guided By: Mr. Deepak N R, Asst. Prof., Dept. of CSE
INTRODUCTION
• Scientific datasets are highly distributed, not well organized or curated, and not easily discoverable or reusable.
• Long tail of science data: the massive number of relatively small datasets.
• These datasets contain rich information that can be used to maximize new scientific discoveries.
• Big Data problem: how do we enable easier discovery and use of the massive number of smaller datasets?
DATABRIDGE VISION
• DataBridge is an indexing mechanism for scientific datasets, similar to the web search engines that help find web pages of interest.
• It uses tags, metadata, contexts, and naming conventions to identify relevancy.
• It will map datasets connected by multi-dimensional relationships.
• Maximize the usefulness of long tail data for scientific research.
• Facilitate searching for collaborators.
• Enable data set publication as a means of communication.
• Assist scientists in discovering “interesting” data sets by automatically forming communities of data.
• Each node represents a single data set.
• Each edge represents the similarity between two data sets.
• Line thickness denotes the strength of the similarity.
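The node and edge picture can be sketched as a small weighted graph. This is only an illustration: the data set names, the scores, and the helper `build_similarity_graph` are hypothetical and do not come from DataBridge itself.

```python
from collections import defaultdict

def build_similarity_graph(similarities):
    """Build an undirected weighted graph from (dataset_a, dataset_b, score) triples."""
    graph = defaultdict(dict)
    for a, b, score in similarities:
        # Similarity is symmetric, so store the edge in both directions.
        graph[a][b] = score
        graph[b][a] = score
    return dict(graph)

# Hypothetical data sets and similarity scores in [0, 1].
edges = [
    ("survey_2010", "survey_2012", 0.9),
    ("survey_2010", "census_extract", 0.4),
    ("rainfall_nc", "streamflow_nc", 0.7),
]
graph = build_similarity_graph(edges)

# Strongest neighbour of a node, i.e. the thickest line in the drawing.
best = max(graph["survey_2010"], key=graph["survey_2010"].get)
print(best)
```

Storing the score on the edge keeps the "line thickness" available to any later filtering or visualization step.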
BUILD A SOCIAL NETWORK FOR SCIENTIFIC DATA
• Instrument known data:
- Use the DataVerse Network and iRODS.
- DataVerse contains social science and political data.
- iRODS is used by many academic institutions and government agencies around the world.
• Investigate similarity measures:
- Data to Data Connections: metadata and derived data about the data set.
- User to Data Connections: metadata about the usage and users of the data set.
- Method to Data Connections: metadata about the analysis of the data set.
DATA TO DATA SIMILARITY MEASURES
• Use native and “derived” metadata.
• Native metadata is provided with the dataset.
• Derived metadata comes from tools such as the Hive ontology engine.
• Use “categorical” similarity measures, such as occurrence frequency, to produce a similarity matrix for non-numeric data.
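As a rough sketch of how a categorical measure yields a similarity matrix, the code below uses one common formulation of the occurrence frequency measure: matching values score 1, and mismatches between rare values are penalized more than mismatches between frequent ones. The slides do not say which variant DataBridge uses, so the formula, the function name, and the sample metadata are all assumptions.

```python
import math
from collections import Counter

def occurrence_frequency_matrix(records):
    """Pairwise similarity of records with categorical attributes.

    Matching values score 1; mismatching values score
    1 / (1 + log(N / f(x)) * log(N / f(y))), where f counts how often a
    value occurs in that attribute. Per-attribute scores are averaged.
    """
    n = len(records)
    n_attrs = len(records[0])
    freqs = [Counter(r[k] for r in records) for k in range(n_attrs)]

    def attr_sim(k, x, y):
        if x == y:
            return 1.0
        return 1.0 / (1.0 + math.log(n / freqs[k][x]) * math.log(n / freqs[k][y]))

    return [
        [sum(attr_sim(k, a[k], b[k]) for k in range(n_attrs)) / n_attrs for b in records]
        for a in records
    ]

# Hypothetical metadata: (discipline, file format) per data set.
records = [
    ("sociology", "csv"),
    ("sociology", "spss"),
    ("politics", "csv"),
    ("sociology", "csv"),
]
matrix = occurrence_frequency_matrix(records)
```

Each entry `matrix[i][j]` can then serve as the weight of the edge between data sets `i` and `j`.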
USER TO DATA SIMILARITY MEASURES
• Create audit trails tracking:
- Use of data sets in published papers.
- Views and downloads of data sets.
- Owners of data sets.
• Calculate similarity of data sets from audit trails.
• Use frequency and recency of access as a measure of data value.
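A minimal sketch of both ideas, assuming a flat audit trail of access events. The record layout, the choice of Jaccard overlap for similarity, and the exponential decay with a 30-day half-life are illustrative assumptions, not details given in the slides.

```python
import math

# Hypothetical audit trail: (user, data set, action, days since the event).
audit_trail = [
    ("alice", "survey_2010", "download", 2),
    ("alice", "survey_2012", "view", 5),
    ("bob", "survey_2010", "view", 40),
    ("bob", "survey_2012", "download", 35),
    ("carol", "rainfall_nc", "download", 1),
]

def users_of(trail, dataset):
    """Set of users who accessed the given data set."""
    return {user for user, ds, _, _ in trail if ds == dataset}

def jaccard_similarity(trail, ds_a, ds_b):
    """Share of users who touched both data sets among users who touched either."""
    a, b = users_of(trail, ds_a), users_of(trail, ds_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def data_value(trail, dataset, half_life_days=30.0):
    """Sum of accesses, each discounted by age, so recent and frequent use scores higher."""
    decay = math.log(2) / half_life_days
    return sum(math.exp(-decay * age) for _, ds, _, age in trail if ds == dataset)

print(jaccard_similarity(audit_trail, "survey_2010", "survey_2012"))
print(data_value(audit_trail, "survey_2010"))
```

The half-life is a tunable parameter: a shorter value favors recency, a longer one favors raw access frequency.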
METHODS TO DATA SIMILARITY MEASURES
• Create an ontology of analytic methods and applications.
• Gather information about the usage of methods on data sets.
• Calculate similarity from ontology and usage information.
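One way to combine an ontology with usage information is sketched below: each data set's methods are expanded with their ancestor categories, and the expanded sets are compared. The toy ontology, the usage records, and the Jaccard comparison are assumptions made for illustration only.

```python
# Hypothetical ontology: each method maps to its parent category (None at the root).
ontology = {
    "analysis": None,
    "regression": "analysis",
    "linear_regression": "regression",
    "logistic_regression": "regression",
    "clustering": "analysis",
    "k_means": "clustering",
}

# Hypothetical usage records: which methods were applied to which data set.
usage = {
    "survey_2010": {"linear_regression"},
    "survey_2012": {"logistic_regression"},
    "rainfall_nc": {"k_means"},
}

def with_ancestors(methods, ontology):
    """Expand a set of methods with every ancestor category in the ontology."""
    expanded = set()
    for method in methods:
        while method is not None:
            expanded.add(method)
            method = ontology[method]
    return expanded

def method_similarity(ds_a, ds_b, usage, ontology):
    """Jaccard overlap of the ancestor-expanded method sets of two data sets."""
    a = with_ancestors(usage[ds_a], ontology)
    b = with_ancestors(usage[ds_b], ontology)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(method_similarity("survey_2010", "survey_2012", usage, ontology))
print(method_similarity("survey_2010", "rainfall_nc", usage, ontology))
```

Expanding with ancestors lets two data sets analyzed with different but related methods (here, two kinds of regression) score higher than two analyzed with unrelated methods.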
COMMUNITY DETECTION
• Investigate a number of community detection algorithms, e.g.:
- Spatial algorithms based on distance measures such as Euclidean or Manhattan distance.
- Algorithms based on adjacency relationships.
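The two families can be illustrated together with a deliberately simple stand-in: treat two data sets as adjacent when their distance is below a cutoff, then take the connected components as communities. This is not one of the algorithms the project evaluated; the feature vectors, the cutoff, and the function names are hypothetical.

```python
def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute coordinate differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(p, q))

def communities(points, distance, max_distance):
    """Group items into connected components of the 'closer than max_distance' graph."""
    names = list(points)
    # Adjacency relationship: two data sets are adjacent if they are close enough.
    adjacent = {
        a: {b for b in names if b != a and distance(points[a], points[b]) <= max_distance}
        for a in names
    }
    seen, groups = set(), []
    for start in names:
        if start in seen:
            continue
        # Depth-first walk over adjacent nodes collects one community.
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacent[node] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical feature vectors for four data sets.
points = {
    "survey_2010": (0.0, 0.0),
    "survey_2012": (1.0, 0.0),
    "rainfall_nc": (10.0, 10.0),
    "streamflow_nc": (10.0, 11.0),
}
groups = communities(points, euclidean, max_distance=2.0)
print(groups)
```

Passing `manhattan` instead of `euclidean` changes only the distance measure, which makes it easy to compare how the choice affects the resulting communities.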
QUERY INTERFACES
• Simple network visualization, as shown already.
- Add thresholds and other filters.
• Multi-dimensional queries.
- Simple relational calculus.
- Ontology of similarity and community detection methods combined with SPARQL.
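A threshold filter over several similarity dimensions might look like the sketch below. It is plain Python rather than relational calculus or SPARQL, and the edge records, dimension names (`data`, `user`, `method`), and `query` helper are hypothetical.

```python
# Hypothetical edges: similarity of a data set pair along each dimension.
edges = [
    {"pair": ("survey_2010", "survey_2012"), "data": 0.9, "user": 0.8, "method": 0.5},
    {"pair": ("survey_2010", "census_extract"), "data": 0.4, "user": 0.1, "method": 0.6},
    {"pair": ("rainfall_nc", "streamflow_nc"), "data": 0.7, "user": 0.2, "method": 0.9},
]

def query(edges, **min_scores):
    """Return pairs whose similarity meets every requested per-dimension threshold."""
    return [
        edge["pair"]
        for edge in edges
        if all(edge[dim] >= threshold for dim, threshold in min_scores.items())
    ]

# Single-dimension threshold, then a query combining two dimensions.
strong_data = query(edges, data=0.6)
strong_data_and_method = query(edges, data=0.6, method=0.8)
print(strong_data)
print(strong_data_and_method)
```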
DATABRIDGE PROGRESS TO DATE
• Initial gatherer that populates the metadata store from our local DVN.
• The metadata store is the file system.
• Relevance engine with one algorithm.
• Both Neo4j and Titan currently serve as the network database.
DATABRIDGE CONCLUSION
• A promising start on a challenging research problem.
• Outstanding issues:
- What metrics will we use to compare the various similarity and community detection algorithms?
- What happens when we scale up?
- What sort of queries and query interfaces will be most effective?
- How best to encourage publication?