


Bettina Berendt

KU Leuven, Dept. of Computer Science, Hypermedia & Databases

www.cs.kuleuven.be/~berendt

3 December 2007 [updated version]

Intelligent bibliography creation and markup for authors:

The missing link between Google Scholar and plagiarism prevention?


Three related problems

(1) Why is literature search sub-optimal?

(2) How can we improve our knowledge sources?

(3) Why are academic standards degrading (and what can we do about this)?

... and even more related questions

[Diagram of related areas:]

Literature search / understanding science

Quality of scientific writing (and data)

Quality of learning & teaching

Scientific method

Ranking of scientists

Patents

Quality assurance for Open Access (see Erik's talk on Thu)

[Berendt & Havemann, Jahrbuch Wissenschaftsforschung 2007]

Context: Research areas and The Big Questions

What is & what are we doing with our data / information / knowledge?

The privilege (and responsibility) of Computer Scientists

The approach

Requirements analysis

Knowledge

Acknowledgements – work presented today (HU = Humboldt-Univ. zu Berlin, IIS = Inst. Information Systems)

Elke Brenstein – ex HU, Inst. Pedagogy and Informatics, now Lernen & Gestalten Consulting

Kai Dingel – ex HU IIS

Christoph Hanser – ex HU IIS

Frank Havemann – HU Inst. of Library Science

Sebastian Kolbe – HU IIS / TU, Comp.Sci.

Beate Krause – ex HU IIS; now Inst. of Knowledge Engineering, Univ. of Kassel & Research Center L3S, Hannover

Bert Wendland – ex HU, Digital Publishing Group, now Bibliothèque Nationale de France

The Citeseer and Citebase teams

+ many (other) students + colleagues!


Why are the problems related, and why should we care?

“Garbage in, garbage out” (or: quality in, quality out)

[Diagram of related areas:]

Literature search / understanding science

Quality of scientific writing (and data)

Quality of learning & teaching

Scientific method

Quality of citation metadata

Quality of information extraction algorithms


Background: Search functionalities available in Web-based DLs

Searching and navigating from the search result

Keyword search

Related by text similarity

Related by linkage

Similarity measures for determining neighbourhoods

are based on

links (citations)

text

usage

Similarity measures: (some) roots in bibliometrics / scientometrics

Co-citation analysis (Small, 1973, 1977; Small & Greenlee, 1980)

“specialties” (Small & Griffith, 1974)

– cluster of co-cited documents = the knowledge base of a specialty

“research fronts” (Garfield & Small, 1989)

Bibliographic coupling (Kessler, 1963)

Co-word analysis (Callon et al., 1983, 1986)

Combinations (e.g., Braam, Moed, & van Raan, 1991; Glenisson, Glänzel, Janssens, & De Moor, various 200x)

[PageRank: Pinski & Narin, 1976]

Scientometric mappings

Choice of figures based on [Chen, Mapping Scientific Frontiers, Springer 2003]


Link-based similarity measures: basic forms

[Diagram: direct citation, bibliographic coupling, and co-citation among documents A, B, C]
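The three basic link relations can be sketched directly in code. This is a minimal illustration, not the implementation of any of the tools discussed; `cites` is a hypothetical mapping from each document to the set of sources it cites.

```python
# Hypothetical citation data: document -> set of sources it cites.
cites = {
    "A": {"C", "D"},
    "B": {"C", "E"},
    "C": {"F"},
}

def direct_citation(a, b):
    """A cites B, or B cites A."""
    return b in cites.get(a, set()) or a in cites.get(b, set())

def bibliographic_coupling(a, b):
    """Number of sources cited by both A and B."""
    return len(cites.get(a, set()) & cites.get(b, set()))

def co_citation(a, b, corpus):
    """Number of documents in the corpus that cite both A and B."""
    return sum(1 for d in corpus if {a, b} <= cites.get(d, set()))

print(direct_citation("A", "C"))         # True: A cites C
print(bibliographic_coupling("A", "B"))  # 1: A and B share the source C
print(co_citation("C", "E", cites))      # 1: only B cites both C and E
```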

Link-based similarity: citing documents

Link-based similarity: cited documents

Link-based similarity: local co-citation neighbourhood

Link-based similarity: local bibliographic-coupling neighbourhood

Link-based similarity: local bibliographic-coupling neighbourhood

Active Bibliography

sources also cited by others

Active Bibliography Score

= Common Citation Inverse Document Frequency

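Reading "Common Citation × Inverse Document Frequency" in the usual way – a shared citation counts for more when the cited source is rarely cited overall – the score can be sketched as follows. The data and the exact weighting (1/frequency) are assumptions for illustration, not CiteSeer's precise formula.

```python
from collections import Counter

# Hypothetical citation data: document -> set of sources it cites.
cites = {
    "doc1": {"s1", "s2", "s3"},
    "doc2": {"s1", "s3"},
    "doc3": {"s1"},
}

# How often each source is cited across the whole collection.
citation_freq = Counter(s for refs in cites.values() for s in refs)

def ccidf(a, b):
    """Sum over shared citations, each weighted by its inverse frequency:
    a source cited by many documents contributes less."""
    shared = cites[a] & cites[b]
    return sum(1.0 / citation_freq[s] for s in shared)

# s1 is cited by all three documents (weight 1/3), s3 by two (weight 1/2).
score = ccidf("doc1", "doc2")  # 1/3 + 1/2
```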

Text-based similarity (I)

Similarity at the sentence level:

respects sentence structure (sequence, minus some data cleaning)

usually revisions of the document under consideration

Similarity at the text level:

based on bag-of-words and TF.IDF
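A minimal bag-of-words TF.IDF cosine similarity, with made-up example texts; the actual tools' preprocessing and weighting variants may differ.

```python
import math
from collections import Counter

# Hypothetical documents for illustration.
docs = {
    "d1": "citation analysis of scientific literature",
    "d2": "text analysis of scientific writing",
    "d3": "plagiarism detection tools",
}

tokenized = {d: t.split() for d, t in docs.items()}
N = len(docs)
# Document frequency: in how many documents does each word occur?
df = Counter(w for toks in tokenized.values() for w in set(toks))

def tfidf(doc):
    """Bag-of-words vector with TF * log(N / DF) weights."""
    tf = Counter(tokenized[doc])
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

def cosine(a, b):
    va, vb = tfidf(a), tfidf(b)
    dot = sum(va[w] * vb.get(w, 0.0) for w in va)
    norm = (math.sqrt(sum(x * x for x in va.values()))
            * math.sqrt(sum(x * x for x in vb.values())))
    return dot / norm if norm else 0.0
```

Documents d1 and d2 share several terms, so `cosine("d1", "d2")` is positive, while `cosine("d1", "d3")` is zero (no shared vocabulary).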

Text-based similarity (II)

Similarity based on the text of metadata

Usage-based similarity (here: community-based)

Interactive citation analysis tools: pros and cons

+ Sources are immediately accessible (if they are Open Access)

+ Relationships become visible (esp. when looking at the full text)

- Not available at the same quality for all disciplines

- Incomplete and incorrect document analysis

- Frustration when documents are not Open Access

- Algorithms are not always understandable

e.g., Google Scholar: proprietary; Citeseer: open source

- Intractability I: only local search, starting from one document

- Intractability II: no ranking or unclear ranking in result sets

[Berendt & Havemann, Jahrbuch Wissenschaftsforschung 2007]

Problem: “intractability due to local search”

Local search in the neighbourhood of 1 document

No “top-down” grouping of documents

Why are groups useful?

Citation indices must be formed with reference to groups

Understanding a scientific field includes forming groups of concepts

Assumption: Concepts are represented by groups
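One simple way to form such groups top-down is to threshold a pairwise similarity and take connected components. This is a hypothetical sketch of the idea, not the clustering used in the tool; `sim` holds made-up similarity values.

```python
# Hypothetical pairwise similarities between documents.
sim = {
    ("d1", "d2"): 0.8,
    ("d2", "d3"): 0.7,
    ("d4", "d5"): 0.9,
    ("d3", "d4"): 0.1,  # below threshold: keeps the two groups apart
}

def group(docs, sim, threshold=0.5):
    """Merge documents into groups (connected components) whenever
    their similarity reaches the threshold."""
    groups = []
    for d in docs:
        linked = [g for g in groups
                  if any(sim.get((a, d), sim.get((d, a), 0.0)) >= threshold
                         for a in g)]
        for g in linked:
            groups.remove(g)
        groups.append(set().union(*linked, {d}) if linked else {d})
    return groups

print(group(["d1", "d2", "d3", "d4", "d5"], sim))
# two groups: {d1, d2, d3} and {d4, d5}
```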


Our approach: Search + grouping + interactivity


Build a tool that is

o user-friendly

o intelligent

o modular and extensible


[Berendt, Proc. AAAI Symposium KCVC 2005]
[Berendt, Dingel, & Hanser, Proc. ECDL 2006]
[Berendt & Krause, submitted]
[Berendt & Kolbe, in preparation]


System architecture

[Diagram:]

Web services

Text & link mining / Information Extraction tools

Databases (local and/or mirrored)

other WS and info. sources

VBA macro


Search; Retrieval [slides not included in the online version]


Organisation of the literature / bibliography construction [slide not included in the online version] – here's the old interface


Citation analysis and text analysis

[Diagram: direct citation, bibliographic coupling, and co-citation among documents A, B, C]

& similarity measure – e.g., Jaccard coefficient for co-citation (analogous for bibliographic coupling):

(no. of sources cited in both documents) / (no. of sources cited in at least 1 document)

& keywords (title & abstract, TF.IDF)
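The Jaccard coefficient reads directly as code; `refs_a` and `refs_b` below are hypothetical sets of cited sources.

```python
def jaccard(refs_a, refs_b):
    """Jaccard coefficient over cited sources:
    |cited by both| / |cited by at least one|."""
    union = refs_a | refs_b
    return len(refs_a & refs_b) / len(union) if union else 0.0

# Hypothetical reference sets for two documents.
refs_a = {"Small1973", "Kessler1963", "Garfield1989"}
refs_b = {"Small1973", "Callon1983"}
print(jaccard(refs_a, refs_b))  # 1 shared of 4 total -> 0.25
```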

Current architecture of the clustering tool (partial view) [slide not included in the online version]

Enter full-text indexing (ex.: D. Mladenič's publications @ IST-WORLD; Pajntar & Ferlež, 200x; cf. DocumentAtlas / Ontogen: Fortuna, Mladenič & Grobelnik, 2005+)

Questions

How to make this scale better?

How to best combine link & text analysis?

Best cluster quality measures?

What else is needed to turn this into real ontology learning?

How to best support this as an interactive learning process?


Discussion

How to best support this as a collaborative learning process?


Writing

corrected, XML annotated, and formatted


Information extraction: Reference parsing in 3 tools

... In our tool: involving the author (→ higher IE quality)

How to best learn regular expressions?

How to best support this as an interactive learning process?
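As an illustration of regex-based reference parsing: the pattern and field names below are hypothetical, not those of any of the three tools, and real reference strings need far more robust patterns (which is exactly why learning the expressions, possibly with the author's help, matters).

```python
import re

# Hypothetical pattern for references of the form
# "Author, A. (Year). Title. Venue."
REF = re.compile(r"^(?P<authors>[^(]+)\((?P<year>\d{4})\)\.\s*(?P<title>[^.]+)\.")

m = REF.match("Small, H. (1973). Co-citation in the scientific literature. JASIS.")
if m:
    print(m.group("authors").strip())  # Small, H.
    print(m.group("year"))             # 1973
    print(m.group("title"))            # Co-citation in the scientific literature
```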


(Other) uses in education

How can these ideas be used in education?

Tool + evaluations

In addition, tasks like these, and Google as an example of why citation networks are useful

Auxiliary (Web-based) materials on why + how to cite

Classes on these topics

Plagiarism detection market study

[Berendt, Humboldt-Universität CMS Journal, 2007]

A final question

Should we delegate to Google Scholar and its proprietary algorithms – or ...?


Thank you …

… for your attention!