
Mining and Utilizing Dataset Relevancy from Oceanographic ...

Date post: 30-May-2020
Earth Science Technology Forum (ESTF2017), June 13-15, 2017, Pasadena, CA. Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access. NASA AIST (NNX15AM85G). Dr. Chaowei (Phil) Yang, Mr. Yongyao Jiang, Ms. Yun Li, Geography and GeoInformation Science, George Mason University; Mr. Edward M Armstrong, Mr. Thomas Huang, Mr. David Moroni, Mr. Chris Finch, Dr. Lewis J. McGibbney, Mr. Frank Greguska, Mr. Gary Chen, Jet Propulsion Laboratory, NASA
Transcript
Page 1: Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access

Earth Science Technology Forum (ESTF2017), June 13-15, 2017, Pasadena, CA

NASA AIST (NNX15AM85G)

Dr. Chaowei (Phil) Yang, Mr. Yongyao Jiang, Ms. Yun Li
Geography and GeoInformation Science, George Mason University

Mr. Edward M Armstrong, Mr. Thomas Huang, Mr. David Moroni, Mr. Chris Finch, Dr. Lewis J. McGibbney, Mr. Frank Greguska, Mr. Gary Chen
Jet Propulsion Laboratory, NASA

Page 2: Agenda

•  Project background
–  Problems
–  Objectives
–  Functions

•  System
–  Log mining
–  Query semantics
–  Ranking
–  Recommendation

•  Results
•  Next step

Page 3: Data Discovery Problems

•  Keyword-based matching (traditional search engines)
–  User query: ocean wind
–  Final query: ocean AND wind

•  Reveal the real intent of user query
–  ocean wind = “ocean wind” OR “greco” OR “surface wind” OR “mackerel breeze” …

•  PO.DAAC UWG Recommendation 2014-07

•  NASA ESDSWG Search Relevance Recommendations 2016 & 2017
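
The query rewriting described on this page can be sketched as follows. This is a minimal illustration, not the MUDROD implementation: the function name `expand_query`, the similarity scores, and the 0.8 threshold are all assumptions made for the example.

```python
def expand_query(query, related_terms, threshold=0.8):
    """Build an OR query from the original query and its related terms.

    related_terms is a list of (term, similarity) pairs; only terms whose
    similarity reaches the (assumed) threshold are added to the query.
    """
    terms = [query]
    for term, similarity in related_terms:
        if similarity >= threshold and term != query:
            terms.append(term)
    # Quote every term so multi-word phrases stay together.
    return " OR ".join(f'"{t}"' for t in terms)


# Hypothetical similarity scores, for illustration only.
related = [("surface wind", 0.9), ("greco", 0.85), ("ocean current", 0.4)]
expanded = expand_query("ocean wind", related)
print(expanded)
```

Terms below the threshold are dropped, so the expanded query stays focused on the closest vocabulary.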

Page 4: Objectives

•  Analyze web logs to discover user knowledge (query and data relationships)
•  Construct knowledge base by combining semantics and profile analyzer
•  Improve data discovery by 1) better ranking; 2) recommendation; 3) ontology navigation

Page 5: Functions/Modules

•  Web log preprocessing
•  Semantic analysis of user queries & navigation
•  Machine learning based search ranking
•  Data recommendation

Page 6: Web log processing

Page 7: Web logs

•  Requests sent from client (e.g. browser, cmd line tool, etc.) recorded by server
•  Log files provided by PO.DAAC (HTTP(S), FTP)

Client IP: 68.180.228.99
Request date/time: [31/Jan/2015:23:59:13 -0800]
Request: "GET /datasetlist/... HTTP/1.1"
HTTP code: 200
Bytes returned: 84779
Referrer/previous page: "/ghrsst/"
User agent/browser: "Mozilla/5.0 ..."

68.180.228.99 - - [31/Jan/2015:23:59:13 -0800] "GET /datasetlist/... HTTP/1.1" 200 84779 "/ghrsst/" "Mozilla/5.0 ..."
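
As a rough sketch of how such a log line could be split into the fields listed on this page, the following assumes the entry follows the common Apache "combined" log layout, which the sample appears to match. The pattern and the name `parse_log_line` are illustrative, not taken from the MUDROD code.

```python
import re
from datetime import datetime

# Assumed layout: ip, two ignored fields, [time], "request", status,
# bytes, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


line = ('68.180.228.99 - - [31/Jan/2015:23:59:13 -0800] '
        '"GET /datasetlist/... HTTP/1.1" 200 84779 "/ghrsst/" "Mozilla/5.0 ..."')
record = parse_log_line(line)
print(record["ip"], record["status"], record["referrer"])
```

Lines that do not fit the pattern return `None`, so malformed entries can be counted or skipped rather than crashing the preprocessing step.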

Page 8: Data preprocess

Goal: reconstruct user browsing pattern (search history & clickstream) from a set of raw logs

[Figure: Web logs → User identification, Crawler detection, Structure reconstruction, Session identification → Search history, Clickstream]

Additional steps include: word normalization, stop words removal, and stemming
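
Two of the steps named above, crawler detection and session identification, can be illustrated with a small sketch. The slides do not state the rules used, so everything here is an assumption: users are identified by IP address, crawlers by keywords in the user agent, and a session ends after 30 minutes of inactivity.

```python
from collections import defaultdict

CRAWLER_HINTS = ("bot", "crawler", "spider", "slurp")  # assumed keyword list
SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold


def is_crawler(user_agent):
    """Flag a request as a crawler if its user agent contains a known hint."""
    agent = user_agent.lower()
    return any(hint in agent for hint in CRAWLER_HINTS)


def build_sessions(records, gap=SESSION_GAP_SECONDS):
    """Group (ip, timestamp_seconds, user_agent, url) records into sessions."""
    by_user = defaultdict(list)
    for ip, timestamp, agent, url in records:
        if not is_crawler(agent):
            by_user[ip].append((timestamp, url))

    sessions = []
    for ip, hits in by_user.items():
        hits.sort()
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            # A long pause closes the current session and starts a new one.
            if hit[0] - previous[0] > gap:
                sessions.append((ip, [url for _, url in current]))
                current = []
            current.append(hit)
        sessions.append((ip, [url for _, url in current]))
    return sessions


records = [
    ("1.1.1.1", 0, "Mozilla/5.0", "/search?q=sst"),
    ("1.1.1.1", 60, "Mozilla/5.0", "/dataset/A"),
    ("1.1.1.1", 4000, "Mozilla/5.0", "/search?q=wind"),
    ("2.2.2.2", 10, "Googlebot/2.1", "/dataset/B"),
]
sessions = build_sessions(records)
print(sessions)
```

Each resulting session is an ordered list of requests, which is the raw material for the search history and clickstream outputs.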

Page 9: Reconstructed session structure

Page 10: Data preprocess results

1. User search history
2. Clickstream

Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) “Reconstructing Sessions from Data Discovery and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery” ISPRS International Journal of Geo-Information, 5, 54.

Page 11: Semantic similarity

Page 12: Existing ontology (SWEET)

•  SWEET (Raskin and Pan 2003)
•  Focus on only two relations
•  The closer, the more similar

Page 13: User search history

•  Create query–user matrix
•  Calculate binary cosine similarity

[Figure: conceptual example]
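
A minimal sketch of binary cosine similarity over a query–user matrix follows. Each query is represented by the set of users who issued it, which is equivalent to a 0/1 row of the matrix. The user IDs and queries are made up for illustration.

```python
import math


def binary_cosine(users_a, users_b):
    """Cosine similarity between two 0/1 vectors given as sets of user IDs."""
    a, b = set(users_a), set(users_b)
    if not a or not b:
        return 0.0
    # For binary vectors the dot product is the size of the intersection
    # and each norm is the square root of the set size.
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical query-user matrix: query -> users who searched for it.
query_users = {
    "ocean temperature": {"u1", "u2", "u3"},
    "sea surface temperature": {"u2", "u3", "u4"},
    "ocean wind": {"u5"},
}
similar = binary_cosine(query_users["ocean temperature"],
                        query_users["sea surface temperature"])
unrelated = binary_cosine(query_users["ocean temperature"],
                          query_users["ocean wind"])
print(round(similar, 2), unrelated)
```

Queries issued by overlapping groups of users score high, while queries with no users in common score zero.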

Page 14: Clickstream

•  Hypothesis: similar queries can result in similar clicking behavior
•  If two queries are similar, the data that get clicked after they are searched would be more likely to be similar

[Figure: Query a and Query b both linked to Data; are the queries similar?]

Page 15: Metadata

•  Hypothesis: semantically related terms tend to appear in the same metadata more frequently
•  Essentially the same as the clickstream analysis
•  Perform Latent Semantic Analysis (LSA) over the term–metadata matrix

[Figure: Query a and Query b both linked to Metadata; are the queries similar?]

Page 16: Integration

•  All four results could be converted to the same form: (Concept A, Concept B, Similarity)
•  Problem:
–  None of them is perfect (uncertainty in data, hypothesis and method)
–  Metadata and ontology might contain terms unknown to search engine end users
–  Sometimes, similarity values from different methods are inconsistent

Page 17: Integration

•  The maximum similarity of all of the components (large similarity appears to be more reliable)
•  The adjustment increment becomes larger when the similarity exists in more sources
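
The slide's formula did not survive extraction, so the following is only one possible reading of the two bullets above: start from the maximum similarity across sources and add a fixed increment for every additional source that also reports the pair, capped at 1.0. The increment value of 0.05 and the function name are assumptions, not values from the presentation.

```python
def integrate_similarity(source_sims, increment=0.05):
    """Combine per-source similarities for one concept pair.

    source_sims maps a source name to a similarity, or to None when the
    source has no value for this pair. The increment is an assumed
    constant, not a value given in the slides.
    """
    present = [s for s in source_sims.values() if s is not None]
    if not present:
        return 0.0
    boosted = max(present) + increment * (len(present) - 1)
    return min(1.0, boosted)


# Hypothetical per-source values for one concept pair.
pair = {
    "search_history": 0.66,
    "clickstream": 0.94,
    "metadata": 0.72,
    "sweet": None,
}
integrated = integrate_similarity(pair)
print(integrated)
```

A pair supported by a single source keeps that source's value unchanged, while agreement across sources pushes the score upward.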

Page 18: Semantic Similarity Calculation Workflow

Page 19: Results and evaluation

Query: ocean temperature
Search history: sea surface temperature (0.66), sea surface topography (0.56), ocean wind (0.56), aqua (0.49)
Clickstream: sea surface temperature (0.94), sst (0.94), group high resolution sea surface temperature dataset (0.89), ghrsst (0.87)
Metadata: sst (0.96), ghrsst (0.77), sea surface temperature (0.72), surface temperature (0.63), reynolds (0.58)
SWEET: None
Integrated list: sst (1.0), sea surface temperature (1.0), ghrsst (1.0), group high resolution sea surface temperature dataset (0.99), reynolds sea surface temperature (0.74)

Sample group | Overall accuracy
Most popular 10 queries | 88%
Least popular 10 queries | 61%
Randomly selected 10 queries | 83%

By domain experts

Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of Geographical Information Science (minor revision)

Page 20: What can we use it for?

•  Query suggestion
•  Query modification

Page 21: Search ranking

Page 22: Background

•  Ranking is a long-standing problem in geospatial data discovery
•  Typically, hundreds, even thousands of matches
•  Can get larger as more Earth observation data is being collected

Page 23: Objective and Methods

•  Put the most desired data at the top of the result list
•  What features can represent users’ search preferences for geospatial data?
•  How can the ranking function reach a balance of all these features?

•  Identified eleven features from
–  Geospatial metadata attributes
–  Query–metadata content overlap
–  User behavior from web logs

Page 24: Ranking features – Metadata attributes

Features | Description
Release date | The date when the data was published
Processing level (PL) | The processing level of image products, ranging from level 0 to level 4
Version number | The publish version of the data
Spatial resolution | The spatial resolution of the data
Temporal resolution | The temporal resolution of the data

•  Five metadata features
•  Verified by domain experts
•  Query-independent: static, depends on the data itself, won’t change with the query

Page 25: Spatial query-metadata overlap

•  Spatial similarity between query area and the coverage of a particular data
•  Overlap area normalized by the original area of query and data

Page 26: Ranking features – User behavior

•  All-time, monthly, user popularity, and semantic popularity (retrieved from web logs)
•  Semantic popularity: the number of times that the data has been clicked after searching a particular query and its highly related ones (query-dependent)

[Figure: query "ocean temperature" linked to related queries "ocean temperature", "sea surface temperature", and "sst" with weights 1.0, 1.0, 1.0; these link to Data A with counts 10, 5, 5]
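
The semantic popularity feature can be sketched as a weighted click count. The weighting by query similarity and the pairing of the counts 10, 5, 5 with the three queries are assumptions based on the figure residue, not a confirmed formula.

```python
def semantic_popularity(query, dataset, related_queries, click_counts):
    """Sum clicks on a dataset over a query and its related queries.

    related_queries maps a query to {related_query: similarity_weight};
    click_counts maps (query, dataset) to a click count. Weighting each
    count by the similarity is an assumption made for this sketch.
    """
    total = 0.0
    for related, weight in related_queries.get(query, {}).items():
        total += weight * click_counts.get((related, dataset), 0)
    return total


related_queries = {
    "ocean temperature": {
        "ocean temperature": 1.0,
        "sea surface temperature": 1.0,
        "sst": 1.0,
    }
}
click_counts = {
    ("ocean temperature", "Data A"): 10,
    ("sea surface temperature", "Data A"): 5,
    ("sst", "Data A"): 5,
}
popularity = semantic_popularity("ocean temperature", "Data A",
                                 related_queries, click_counts)
print(popularity)
```

Because the related queries depend on what the user searched for, the same dataset can receive different popularity values under different queries, which is what makes the feature query-dependent.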

Page 27: RankSVM

•  One of the well-recognized ML ranking algorithms
•  Convert a ranking problem into a classification problem that a regular SVM algorithm can solve
•  3 main steps
–  1) Standardize: mean = 0, std = 1
o  SVM is not scale invariant
o  Over-optimized
o  Longer to train
–  2) For any pair of training data, calculate the difference
–  3) A ranking problem becomes a binary classification problem, where SVM is applied to find the optimal decision boundary
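
The three steps can be sketched in plain Python as below. This is an illustration of the pairwise idea only, not the Spark MLlib-based implementation the presentation mentions: the linear SVM is trained with a simple subgradient descent on the hinge loss, and the feature values, relevance labels, and hyperparameters are made up.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Step 1: scale every feature column to mean 0 and std 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # avoid division by zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def pairwise_differences(rows, labels):
    """Step 2: turn each pair with different relevance into a +1/-1 example."""
    X, y = [], []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if labels[i] == labels[j]:
                continue
            diff = [a - b for a, b in zip(rows[i], rows[j])]
            sign = 1 if labels[i] > labels[j] else -1
            X.append(diff)
            y.append(sign)
            X.append([-d for d in diff])  # mirrored pair keeps classes balanced
            y.append(-sign)
    return X, y


def train_linear_svm(X, y, epochs=200, lr=0.01, reg=0.01, seed=0):
    """Step 3: fit a linear SVM (hinge loss, L2 penalty) by subgradient descent."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for k in order:
            margin = y[k] * sum(wi * xi for wi, xi in zip(w, X[k]))
            for d in range(len(w)):
                hinge_grad = y[k] * X[k][d] if margin < 1 else 0.0
                w[d] -= lr * (reg * w[d] - hinge_grad)
    return w


# Hypothetical features per dataset: [recency score, popularity].
features = [[3.0, 100.0], [2.0, 40.0], [1.0, 10.0], [0.0, 5.0]]
relevance = [3, 2, 1, 0]

Z = standardize(features)
X, y = pairwise_differences(Z, relevance)
w = train_linear_svm(X, y)
scores = [sum(wi * zi for wi, zi in zip(w, z)) for z in Z]
ranking = sorted(range(len(Z)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

At query time only the learned weight vector is needed: each candidate is scored with a dot product and the results are sorted by score.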

Page 28: Architecture

•  All of these (except for training) can be finished within 2 seconds
•  None of the mainstream open source ML libraries provides any ranking algorithm
•  Implemented it ourselves with the aid of Spark MLlib

[Figure (Sec. 15.4.2): Index, User query, Semantic query, Top K retrieval, Ranking model, Learning algorithm, Training data, Re-ranked results, Feature extractor, Web logs, User clicks, Knowledge base]

Page 29: NDCG(K) for five different ranking methods at varying K (1-40)

[Figure: NDCG(K) for five different ranking methods at varying K (1-40)]

Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision)
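
For readers unfamiliar with the metric, NDCG at K can be computed as in the sketch below. It uses the common exponential gain (2^rel - 1) with a log2 position discount; the slides do not say which gain variant was used, so that choice is an assumption.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of results in the order a ranker returned them.
perfect = ndcg_at_k([3, 2, 1, 0], 4)
reversed_order = ndcg_at_k([0, 1, 2, 3], 4)
print(perfect, round(reversed_order, 3))
```

A value of 1.0 means the ranker returned results in the ideal order for the first K positions; lower values indicate relevant items placed too far down.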

Page 30: Data recommendation

Page 31: How to recommend geospatial data?

•  Use geospatial metadata for content-based recommendation
-  Metadata spatiotemporal similarity
-  Metadata attribute similarity
-  Metadata content similarity

•  Leverage user behavior data for CF (collaborative filtering) recommendation
-  Session-based co-occurrence of data

Page 32: Geographic metadata

Attribute type | Attribute name | Attribute description
Spatiotemporal attributes | DatasetCoverage-EastLon | The East longitude of the bounding rectangle
Spatiotemporal attributes | DatasetCoverage-WestLon | The West longitude of the bounding rectangle
Spatiotemporal attributes | DatasetCoverage-NorthLat | The North latitude of the bounding rectangle
Spatiotemporal attributes | DatasetCoverage-SouthLat | The South latitude of the bounding rectangle
Spatiotemporal attributes | DatasetCoverage-StartTimeLong | The start time of the data
Spatiotemporal attributes | DatasetCoverage-StopTimeLong | The end time of the data
Categorical geographic attributes | DatasetRegion-Region | Region of dataset, such as global, Atlantic
Categorical geographic attributes | Dataset-ProjectionType | Projection type, like cylindrical lat-lon
Categorical geographic attributes | Dataset-ProcessingLevel | Data processing level
Categorical geographic attributes | DatasetPolicy-DataFormat | Data format, e.g. HDF, NetCDF
Categorical geographic attributes | DatasetSource-Sensor-ShortName | Short name of sensor
Ordinal geographic attributes | Dataset-TemporalResolution | Temporal resolution of dataset
Ordinal geographic attributes | Dataset-TemporalRepeat | Temporal resolution of dataset
Ordinal geographic attributes | Dataset-SpatialResolution | Spatial resolution of dataset
Descriptive attributes | Dataset-description | Describe the content of the dataset

Page 33: Spatiotemporal similarity

•  Spatial variables: NorthLat, SouthLat, WestLon, EastLon
•  Temporal variables: DatasetCoverage-StartTimeLong, StopTimeLong
•  Use volume overlap ratio to calculate similarity

$$\mathrm{spatiotemporal\_sim}(r_i, r_j) = \left(\frac{\mathrm{volume}(r_i \cap r_j)}{\mathrm{volume}(r_i)} + \frac{\mathrm{volume}(r_i \cap r_j)}{\mathrm{volume}(r_j)}\right) \cdot 0.5$$

$$\mathrm{volume}(r) = |\mathrm{eastlon} - \mathrm{westlon}| \cdot |\mathrm{southlat} - \mathrm{northlat}| \cdot |\mathrm{endtime} - \mathrm{starttime}|$$
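
The volume overlap ratio can be sketched as follows, treating each dataset's coverage as a box in longitude, latitude, and time. The `Coverage` class and the example coverages are hypothetical; the sketch also assumes boxes do not cross the antimeridian and that times are already numeric.

```python
from dataclasses import dataclass


@dataclass
class Coverage:
    west: float
    east: float
    south: float
    north: float
    start: float  # start time as a number, e.g. epoch seconds or years
    end: float


def volume(r):
    """Volume of the longitude x latitude x time box."""
    return abs(r.east - r.west) * abs(r.south - r.north) * abs(r.end - r.start)


def overlap_1d(lo1, hi1, lo2, hi2):
    """Length of the overlap between two intervals (0 if disjoint)."""
    return max(0.0, min(hi1, hi2) - max(lo1, lo2))


def intersection_volume(a, b):
    return (overlap_1d(a.west, a.east, b.west, b.east)
            * overlap_1d(a.south, a.north, b.south, b.north)
            * overlap_1d(a.start, a.end, b.start, b.end))


def spatiotemporal_sim(a, b):
    """Average of the overlap volume normalized by each box's own volume."""
    va, vb = volume(a), volume(b)
    if va == 0 or vb == 0:
        return 0.0
    inter = intersection_volume(a, b)
    return 0.5 * (inter / va + inter / vb)


global_10y = Coverage(-180, 180, -90, 90, 0, 10)
regional_5y = Coverage(-80, 0, 0, 60, 5, 10)
sim = spatiotemporal_sim(global_10y, regional_5y)
print(round(sim, 4))
```

Averaging the two ratios keeps the measure symmetric: a small regional dataset fully contained in a global one does not score 1.0, because the global box is mostly uncovered by the regional one.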

Page 34: Categorical similarity

•  Fixed number of values
•  No intrinsic ordering
•  sensor-name: "AMSR-E", "MODIS", "AVHRR-3" and "WindSat"

$$\mathrm{categorical\_var\_sim}(v_i, v_j) = \frac{|v_i \cap v_j|}{|v_i \cup v_j|}$$
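
The ratio above is the Jaccard index over the sets of values each dataset carries for an attribute. A minimal sketch, with made-up sensor assignments:

```python
def categorical_var_sim(values_i, values_j):
    """Jaccard similarity between two sets of categorical values."""
    a, b = set(values_i), set(values_j)
    if not a and not b:
        return 0.0  # assumed convention when both datasets lack the attribute
    return len(a & b) / len(a | b)


# Hypothetical sensor lists for two datasets.
sensors_i = {"AMSR-E", "MODIS"}
sensors_j = {"MODIS", "AVHRR-3", "WindSat"}
sensor_sim = categorical_var_sim(sensors_i, sensors_j)
print(sensor_sim)
```

For single-valued attributes the result collapses to 1.0 when the values match and 0.0 otherwise.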

Page 35: Ordinal similarity

An ordinal attribute is similar to a categorical attribute, but its values have a clear order, e.g. spatial resolution
•  Converted into rank from 1 to R
•  Normalize these ranks for similarity calculation

$$\mathrm{ordinal\_var\_sim}(v_i, v_j) = 1 - \left|\mathrm{norm\_rank}(v_i) - \mathrm{norm\_rank}(v_j)\right|$$

$$\mathrm{norm\_rank}(v_i) = \frac{\mathrm{Rank}_{v_i} + 1}{R + 1}$$
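
The two formulas can be applied as in the sketch below, which follows the normalization as it appears in the extracted slide text. The list of resolution levels is hypothetical.

```python
def norm_rank(rank, R):
    """Normalized rank, following the slide's (rank + 1) / (R + 1) form."""
    return (rank + 1) / (R + 1)


def ordinal_var_sim(value_i, value_j, ordered_levels):
    """Similarity of two ordinal values given the full ordered level list."""
    R = len(ordered_levels)
    # Ranks run from 1 to R in the order the levels are listed.
    rank = {level: pos for pos, level in enumerate(ordered_levels, start=1)}
    return 1 - abs(norm_rank(rank[value_i], R) - norm_rank(rank[value_j], R))


# Hypothetical spatial resolution levels, finest first.
levels = ["1 km", "4 km", "25 km", "100 km"]
adjacent = ordinal_var_sim("1 km", "4 km", levels)
far_apart = ordinal_var_sim("1 km", "100 km", levels)
print(round(adjacent, 2), round(far_apart, 2))
```

Values that sit next to each other in the ordering score close to 1, and the score falls linearly as the rank gap widens.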

Page 36: Descriptive similarity

Step 1: Phrase extraction
1.  Extract term candidates from metadata description with POS (part of speech) tagging
2.  Introduce “occurrence” and “strength” to filter out terms from candidates. “occurrence”: the number of occurrences of a term; “strength”: the number of words in a term

Original text
Aquarius Level 3 sea surface salinity (SSS) standard mapped image data contains gridded 1 degree spatial resolution SSS averaged over daily, 7 day, monthly, and seasonal time scales. This particular data set is the seasonal climatology, Ascending sea surface salinity product for version 4.0 of the Aquarius data set, which is the official end of prime mission public data release from the AQUARIUS/SAC-D mission. Only retrieved values for Ascending passes have been used to create this product. The Aquarius instrument is onboard the AQUARIUS/SAC-D satellite, a collaborative effort between NASA and the Argentinian Space Agency Comision Nacional de Actividades Espaciales (CONAE). The instrument consists of three radiometers in push broom alignment at incidence angles of 29, 38, and 46 degrees incidence angles relative to the shadow side of the orbit. Footprints for the beams are: 76 km (along-track) x 94 km (cross-track), 84 km x 120 km and 96 km x 156 km, yielding a total cross-track swath of 370 km. The radiometers measure brightness temperature at 1.413 GHz in their respective horizontal and vertical polarizations (TH and TV). A scatterometer operating at 1.26 GHz measures ocean backscatter in each footprint that is used for surface roughness corrections in the estimation of salinity. The scatterometer has an approximate 390 km swath.

Extracted terms
Radiometers Measure Brightness Temperature, AQUARIUS/SAC Mission, Image Data, Broom Alignment, Resolution SSS, AQUARIUS/SAC Satellite, Scatterometer, Scatterometer, Aquarius Data, Argentinian Space Agency Comision Nacional, Incidence Angles, Time Scales, Actividades Espaciales, Cross-track Swath, Official End, Aquarius Instrument, Shadow Side, Ascending Sea Surface Salinity Product, Level, Level, Surface Roughness Corrections, Data Release, Salinity, Density, Salinity, Density, AQUARIUS, L3, SSS, SMIA, SEASONAL-CLIMATOLOGY, V4

Page 37: Metadata abstract semantic similarity

Step 2: Represent metadata in the phrase vector space (the dimension is lower than that of the word feature space)

          Term1 Term2 Term3 Term4 Term5 Term6 Term7 … TermN
dataset1  1     0     1     0     0     0     0     … 1
dataset2  0     0     0     1     1     0     0     … 0
datasetk  1     0     1     0     0     0     0     … 1

Step 3: Calculate cosine similarity

Page 38: Session-based recommendation

       Session1 Session2 … SessionN
Data1  1        0        … 1
Data2  0        0        … 0
…
Datak  1        0        … 1

Calculate metadata similarity based on session-level co-occurrence

$$\mathrm{Similarity}(i, j) = \frac{N(i \cap j)}{\sqrt{N(i)} \cdot \sqrt{N(j)}}$$

N(i): the number of sessions in which dataset i was viewed or downloaded
N(j): the number of sessions in which dataset j was viewed or downloaded
N(i ∩ j): the number of sessions in which both datasets i and j were viewed or downloaded
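
The co-occurrence formula can be computed directly from reconstructed sessions, as in this minimal sketch. Each session is represented as the set of datasets viewed or downloaded in it; the session contents are made up.

```python
import math


def session_similarity(sessions, i, j):
    """Session-level co-occurrence similarity between datasets i and j."""
    n_i = sum(1 for s in sessions if i in s)
    n_j = sum(1 for s in sessions if j in s)
    n_ij = sum(1 for s in sessions if i in s and j in s)
    if n_i == 0 or n_j == 0:
        return 0.0  # dataset never seen in any session (cold start)
    return n_ij / (math.sqrt(n_i) * math.sqrt(n_j))


# Hypothetical sessions: each is the set of datasets touched in that session.
sessions = [{"A", "B"}, {"A"}, {"A", "B", "C"}, {"C"}]
sim_ab = session_similarity(sessions, "A", "B")
sim_bc = session_similarity(sessions, "B", "C")
print(round(sim_ab, 3), round(sim_bc, 3))
```

A dataset that appears in no session returns 0.0 here, which is the cold start limitation noted for session-based methods in this deck.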

Page 39: Hybrid recommendation

Recommendation method | Pros | Cons | Strategy
Descriptive similarity | 1. Natural language processing methods can be adopted to find latent semantic relationships | 1. Many datasets have nearly the same abstract with few words/values changed 2. It is hard to extract detailed attributes from the description | Used as the basic method of the recommendation algorithm
Attribute similarity (spatiotemporal, ordinal, categorical) | 1. As structured data, geographic metadata have many variables | 1. Variable values may be null or wrong 2. The quality depends on the weight assigned to every variable | As a supplement to semantic similarity
Session co-occurrence | 1. Reflects users’ preference | 1. Cold start problem: newly published data don’t have usage data | Fine-tune the recommendation list

$$\mathrm{Recommend}(i) = W_{ss} \cdot \mathrm{DescriptiveSimilarity}(i) + \sum W_{cv} \cdot \mathrm{CategoricalSimilarity}(i) + \sum W_{ov} \cdot \mathrm{OrdinalSimilarity}(i) + W_{stv} \cdot \mathrm{SpatioTemporalSimilarity}(i) + W_{so} \cdot \mathrm{SessionSimilarity}(i)$$
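
The weighted sum can be sketched as below. The weights are hypothetical placeholders, since the presentation does not list the values used, and the dictionary layout is just one convenient way to hold per-variable weights for the categorical and ordinal terms.

```python
def recommend_score(sims, weights):
    """Weighted sum of descriptive, attribute, and session similarities."""
    score = weights["descriptive"] * sims["descriptive"]
    # Categorical and ordinal terms are summed over their variables,
    # each with its own weight.
    score += sum(weights["categorical"][name] * value
                 for name, value in sims["categorical"].items())
    score += sum(weights["ordinal"][name] * value
                 for name, value in sims["ordinal"].items())
    score += weights["spatiotemporal"] * sims["spatiotemporal"]
    score += weights["session"] * sims["session"]
    return score


# Hypothetical weights and similarity values for one candidate dataset.
weights = {
    "descriptive": 0.4,
    "categorical": {"sensor": 0.1, "format": 0.05},
    "ordinal": {"spatial_resolution": 0.05},
    "spatiotemporal": 0.2,
    "session": 0.2,
}
sims = {
    "descriptive": 0.5,
    "categorical": {"sensor": 1.0, "format": 0.0},
    "ordinal": {"spatial_resolution": 0.8},
    "spatiotemporal": 0.5,
    "session": 0.25,
}
score = recommend_score(sims, weights)
print(round(score, 2))
```

Candidates would then be sorted by this score, with the session term acting as a fine-tuning signal once usage data exists.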

Page 40: Quantitative Evaluation

[Figure: precision (y-axis, 0 to 1.2) at positions 1 to 10 (x-axis) for hybrid similarity, word-based similarity, attribute similarity, and session-based similarity]

Hybrid similarity outperforms the other similarities since it integrates metadata attributes and user preference.

Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender System based on Metadata and User Behaviour (revision)

Page 41: Conclusion

•  Log mining enables a data portal to integrate implicit user preferences
•  Word similarity retrieved by data mining tasks expands any given query to improve search recall and precision
•  The rich set of ranking features and the ML algorithm provide substantial advantages over using other ranking methods
•  The recommendation algorithm can discover latent data relevancy
•  The proposed architecture enables the loosely coupled software structure of a data portal and avoids the cost of replacing the existing system

Page 42: Products

•  Publications
•  Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) Reconstructing Sessions from Data Discovery and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery. ISPRS International Journal of Geo-Information, 5, 54.
•  Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2016) Leverage cloud computing to improve data access log mining. IEEE Oceans 2016.
•  Yang, C., et al., 2017. Big Data and cloud computing: innovation opportunities and challenges. International Journal of Digital Earth, 10(1), pp. 13-53. (the 2nd most read paper of IJDE in its decadal history)
•  Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery. International Journal of Geographical Information Science (minor revision)
•  Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) Towards intelligent geospatial discovery: a machine learning ranking framework. International Journal of Digital Earth (minor revision)
•  Y. Li, Jiang, Y., C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A Geospatial Data Recommender System based on Metadata and User Behaviour (revision)
•  Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. J. McGibbney (2017) A smart web-based data discovery system for ocean sciences. (ongoing)

•  Source code: https://github.com/mudrod/mudrod
•  PO.DAAC Labs: http://mudrod.jpl.nasa.gov/
•  PD Leverage: http://pd.cloud.gmu.edu/

Page 43:

Component | Current TRL | Project end TRL | Description

Semantic search engine
Search Dispatcher | 7 | 7 | Translating a user search query into a set of new semantic queries
Similarity calculator | 7 | 7 | Calculating the semantic similarity from web logs, metadata, and ontology
Recommendation module | 7 | 7 | Recommending similar datasets to the clicked dataset
Ranking module | 7 | 7 | Re-ranking the search results based on RankSVM ML algorithm

Knowledge base
Ontology | 7 | 7 | Extensions from SWEET ontology for earth science data
Triple Store | 7 | 7 | ESIP ontology repository

Vocabulary linkage discovery engine
Profile analyzer | 7 | 7 | Extracting user browsing pattern from raw web logs

Web services/GUI
Ranking service/presenter | 7 | 7 | Providing and presenting the ranked results
Recommendation service/presenter | 7 | 7 | Providing and presenting the related datasets
Ontology navigation service/presenter | 7 | 7 | Providing and presenting related searches

Page 44: Next steps

•  Add more features (e.g., temporal similarity)
•  Create training data from web logs for RankSVM
•  Develop a query understanding module to better interpret user’s search intent (e.g. “ocean wind level 3” -> “ocean wind” AND “level 3”)
•  Support Solr
•  Support near real-time data ingestion to dynamically update knowledge base
•  Integration with DOMS and OceanXtremes for an ocean science analytics center
•  Leverage advanced computing techniques to speed up the process

Page 45: Presentations

•  Yang C., Jiang Y., Li Y., Armstrong E., Huang T., and Moroni D., 2015. “Utilizing Advanced IT Technologies to Support MUDROD to Advance Data Discovery and Access”, AGU, San Francisco, CA.
•  Yang C., Jiang Y., Li Y., Armstrong E., Huang T., and Moroni D., 2016. “Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access”, ESIP winter meeting 2016, Washington D.C.
•  Jiang Y., Yang C., Li Y., Armstrong E., Huang T., and Moroni D., 2016. “A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies - An Example with Oceanographic Data Discovery”, AAG 2016, San Francisco, CA.
•  Yang C., Jiang Y., Li Y., Armstrong E., Huang T., and Moroni D., 2016. “Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD) Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access”, PO.DAAC UWG, Pasadena, CA.
•  Li Y., Yang C., Jiang Y., Armstrong E., Huang T., and Moroni D., 2016. “Leveraging cloud computing to speed up user access log mining”, Oceans 16 MTS IEEE, Monterey, CA.
•  Jiang Y., Yang C., Li Y., Armstrong E., Huang T., and Moroni D., 2017. “Towards intelligent geospatial discovery: a machine learning ranking framework”, AAG 2017, Boston, MA.
•  Li Y., Yang C., Jiang Y., Armstrong E., Huang T., and Moroni D., 2017. “A geographic recommender system using metadata and user feedbacks”, AAG 2017, Boston, MA.

Page 46: Acknowledgements

1.  NASA AIST Program (NNX15AM85G)
2.  PO.DAAC SWEET Ontology Team (initially funded by ESTO)
3.  Hydrology DAAC Rahul Ramachandran (providing the earlier version of NOESIS)
4.  ESDIS for providing testing logs of CMR
5.  All team members at JPL and GMU

