1
Bettina Berendt
KU Leuven, Dept. of Computer Science, Hypermedia & Databases
www.cs.kuleuven.be/~berendt
3 December 2007 [updated version]
Intelligent bibliography creation and markup for authors:
The missing link between Google Scholar and plagiarism prevention?
2
Three related problems
3 (1) Why is literature search sub-optimal?
4 (2) How can we improve our knowledge sources?
5 (3) Why are academic standards degrading (and what can we do about this)?
6 ... and even more related questions
Literature search / understanding science
Quality of scientific writing (and data)
Quality of learning & teaching scientific method
Ranking of scientists
Patents
Quality assurance for Open Access
See Erik's talk on Thu
[Berendt & Havemann, Jahrbuch Wissenschaftsforschung 2007]
7 Context: Research areas and The Big Questions
What is & what are we doing with our data / information / knowledge?
The privilege (and responsibility) of Computer Scientists
8 The approach
Requirements analysis
Knowledge
9 Acknowledgements – work presented today (HU = Humboldt-Univ. zu Berlin, IIS = Inst. Information Systems)
Elke Brenstein – ex HU, Inst. Pedagogy and Informatics, now Lernen & Gestalten Consulting
Kai Dingel – ex HU IIS
Christoph Hanser – ex HU IIS
Frank Havemann – HU Inst. of Library Science
Sebastian Kolbe – HU IIS / TU, Comp.Sci.
Beate Krause – ex HU IIS; now Inst. of Knowledge Engineering, Univ. of Kassel & Research Center L3S, Hannover
Bert Wendland – ex HU, Digital Publishing Group, now Bibliothèque Nationale de France
The Citeseer and Citebase teams
+ many (other) students + colleagues!
10
Why are the problems related, and why should we care?
11 "Garbage in, garbage out" (or: quality in, quality out)
Literature search / understanding science
Quality of scientific writing (and data)
Quality of learning & teaching scientific method
Quality of citation metadata
Quality of information extraction algorithms
16
Background: Search functionalities available in Web-based DLs
17 Searching and navigating from the search result
Keyword search
Related by text similarity
Related by linkage
18 Similarity measures for determining neighbourhoods
are based on
links (citations)
text
usage
19 Similarity measures: (some) roots in bibliometrics / scientometrics
Co-citation analysis (Small, 1973, 1977; Small & Greenlee, 1980)
"specialties" (Small & Griffith, 1974)
– cluster of co-cited documents = the knowledge base of a specialty
"research fronts" (Garfield & Small, 1989)
Bibliographic coupling (Kessler, 1963)
Co-word analysis (Callon et al., 1983, 1986)
Combinations (e.g., Braam, Moed, & van Raan, 1991; Glenisson, Glänzel, Janssens, and De Moor, various 200x)
[PageRank: Pinski & Narin, 1976]
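As an aside on how such eigenvector-style influence measures work, here is a minimal power-iteration sketch of PageRank (illustrative code, not part of the slides; the `links` structure and node names are assumptions):

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each node to the list of nodes it cites.

    Returns a dict of ranks summing to 1 (damping factor d,
    fixed number of power iterations)."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = d * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling node: spread its rank uniformly over all nodes
                for u in nodes:
                    new[u] += d * rank[v] / n
        rank = new
    return rank
```

On a symmetric citation cycle every node ends up with equal rank; a node cited by many others accumulates more.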
20 Scientometric mappings
Choice of figures based on [Chen, Mapping Scientific Frontiers, Springer 2003]
21
Link-based similarity measures: basic forms
[Diagrams: direct citation (A cites B), bibliographic coupling (A and B both cite C), co-citation (C cites both A and B)]
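The three basic link relations can be computed directly from a citation graph; a minimal sketch (the `cites` dict of reference sets is an assumed representation, not from the slides):

```python
def direct_citation(cites, a, b):
    """True if a cites b or b cites a. cites[d] = set of docs d cites."""
    return b in cites.get(a, set()) or a in cites.get(b, set())

def bibliographic_coupling(cites, a, b):
    """Number of sources cited by both a and b."""
    return len(cites.get(a, set()) & cites.get(b, set()))

def co_citation(cites, a, b):
    """Number of documents that cite both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)
```

With `cites = {"A": {"C", "D"}, "B": {"C", "E"}, "X": {"A", "B"}}`, documents A and B are bibliographically coupled via C and co-cited by X.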
22 Link-based similarity: citing documents
23 Link-based similarity: cited documents
24 Link-based similarity: local co-citation neighbourhood
25 Link-based similarity: local bibliographic-coupling neighbourhood
26 Link-based similarity: local bibliographic-coupling neighbourhood
Active Bibliography: sources also cited by others
Active Bibliography Score = Common Citation Inverse Document Frequency
27 Text-based similarity (I)
28 Text-based similarity (I)
Similarity at the sentence level:
respects sentence structure (sequence, minus some data cleaning)
usually revisions of the document under consideration
Similarity at the text level:
based on bag-of-words and TF.IDF
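A minimal bag-of-words TF.IDF + cosine sketch of this kind of text-level similarity (tokenisation is assumed to have happened already; all names are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists; returns one TF.IDF dict per document."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Documents sharing discriminative terms score high; documents with disjoint vocabularies score zero.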
29 Text-based similarity (II)
30 Similarity based on the text of metadata
31 Usage-based similarity (here: community-based)
32 Interactive citation analysis tools: pros and cons
+ Sources are immediately accessible (if they are Open Access)
+ Relationships become visible (esp. when looking at the full text)
- Not available for all disciplines in the same quality
- Incomplete and incorrect document analysis
- Frustration when documents are not Open Access
- Algorithms are not always understandable
(e.g., Google Scholar: proprietary; Citeseer: open source)
- Intractability I: only local search, starting from one document
- Intractability II: no ranking or unclear ranking in result sets
[Berendt & Havemann, Jahrbuch Wissenschaftsforschung 2007]
33 Problem "intractability due to local search"
Local search in the neighbourhood of 1 document
No "top-down" grouping of documents
Why are groups useful?
Citation indices must be formed with reference to groups
Understanding a scientific field includes forming groups of concepts
Assumption: concepts are represented by groups
34
Our approach: search + grouping + interactivity
35
Build a tool that is
o user-friendly
o intelligent
o modular and extensible
36
[Berendt, Proc. AAAI Symposium KCVC 2005]
[Berendt, Dingel, & Hanser, Proc. ECDL 2006]
[Berendt & Krause, submitted]
[Berendt & Kolbe, in preparation]
37
System architecture
Web services
Text & link mining / Information Extraction tools
Databases (local and/or mirrored)
other WS and info. sources
VBA macro
38
39 Search; Retrieval [slides not included in the online version]
40
41
42 Organisation of the literature / bibliography construction [slide not included in the online version] – here's the old interface
43 Citation analysis and text analysis
[Diagrams as on slide 21: direct citation, bibliographic coupling, co-citation]
& similarity measure – e.g., Jaccard coeff. for co-citation / analogous for b.c.:
(no. of sources cited in both documents) / (no. of sources cited in at least 1 document)
& keywords (title & abstract, TF.IDF)
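The normalisation above is the plain Jaccard coefficient on the two documents' reference sets (or, for co-citation, on their citing-document sets); as a sketch:

```python
def jaccard(set_a, set_b):
    """Jaccard coefficient: |intersection| / |union| of two sets.

    For bibliographic coupling, pass the reference sets of two documents;
    for co-citation, pass the sets of documents citing each of them."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

Identical sets score 1.0, disjoint sets 0.0, which makes the measure comparable across documents with bibliographies of very different lengths.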
44 Current architecture of the clustering tool (partial view) [slide not included in the online version]
45 Enter full-text indexing (ex.: D. Mladenič's publications @ IST-WORLD; Pajntar & Ferlež, 200x; cf. DocumentAtlas / Ontogen: Fortuna, Mladenič & Grobelnik, 2005+)
46 Questions
How to make this scale better?
How to best combine link & text analysis?
Best cluster quality measures?
What else is needed to turn this into real ontology learning?
How to best support this as an interactive learning process?
47
48 Discussion
How to best support this as a collaborative learning process?
49
50 Writing
corrected, XML-annotated, and formatted
51
Information extraction: Reference parsing in 3 tools
... In our tool: involving the author (→ higher IE quality)
How to best learn regular expressions?
How to best support this as an interactive learning process?
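One way to picture regular-expression reference parsing: a single hand-written pattern for one common style ("Author, A. (Year). Title. Venue."). This is an illustrative sketch only; a real parser combines many such (ideally learned) patterns, and the pattern and function names here are assumptions:

```python
import re

# Named groups capture the fields of one reference style.
REF_PATTERN = re.compile(
    r"(?P<authors>[^()]+?)\s*"      # authors, up to the year parenthesis
    r"\((?P<year>\d{4})\)\.\s*"     # (YYYY).
    r"(?P<title>[^.]+)\.\s*"        # title, ended by the first period
    r"(?P<venue>.+?)\.?\s*$"        # venue, optional trailing period
)

def parse_reference(ref):
    """Return a dict of reference fields, or None if the style doesn't match."""
    m = REF_PATTERN.match(ref)
    return m.groupdict() if m else None
```

A mismatching string returns None rather than a garbled parse, which is where author involvement (confirming or correcting fields) can raise IE quality.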
52
(Other) uses in education
53 How can these ideas be used in education?
Tool + evaluations
In addition, tasks like these
and Google as an example of why citation networks are useful
Auxiliary (Web-based) materials on why + how to cite
Classes on these topics
54 Plagiarism detection market study
[Berendt, Humboldt-Universität CMS Journal, 2007]
56 A final question
Should we delegate to Google Scholar
and its proprietary algorithms
or ...?
57
Thank you …
… for your attention!