1

Bettina Berendt

KU Leuven, Dept. of Computer Science, Hypermedia & Databases

www.cs.kuleuven.be/~berendt

3 December 2007 [updated version]

Intelligent bibliography creation and markup for authors:

The missing link between Google Scholar and plagiarism prevention?

2

Three related problems

3

(1) Why is literature search sub-optimal?

4

(2) How can we improve our knowledge sources?

5

(3) Why are academic standards degrading (and what can we do about this)?

6

... and even more related questions

Literature search / understanding science

Quality of scientific writing (and data)

Quality of learning & teaching scientific method

Ranking of scientists

Patents

Quality assurance for Open Access

See Erik’s talk on Thu

[Berendt & Havemann, Jahrbuch Wissenschaftsforschung 2007]

7

Context: Research areas and The Big Questions

What is & what are we doing with our data / information / knowledge?

The privilege (and responsibility) of Computer Scientists

8

The approach

Requirements analysis

Knowledge

9

Acknowledgements – work presented today (HU = Humboldt-Univ. zu Berlin, IIS = Inst. Information Systems)

Elke Brenstein – ex HU, Inst. Pedagogy and Informatics, now Lernen & Gestalten Consulting

Kai Dingel – ex HU IIS

Christoph Hanser – ex HU IIS

Frank Havemann – HU Inst. of Library Science

Sebastian Kolbe – HU IIS / TU, Comp.Sci.

Beate Krause – ex HU IIS; now Inst. of Knowledge Engineering, Univ. of Kassel & Research Center L3S, Hannover

Bert Wendland – ex HU, Digital Publishing Group, now Bibliothèque Nationale de France

The Citeseer and Citebase teams

+ many (other) students + colleagues!

10

Why are the problems related, and why should we care?

11

“Garbage in, garbage out” (or: Quality in, quality out)

Literature search / understanding science

Quality of scientific writing (and data)

Quality of learning & teaching scientific method

Quality of citation metadata

Quality of information extraction algorithms

16

Background: Search functionalities available in Web-based DLs

17

Searching and navigating from the search result

Keyword search

Related by text similarity

Related by linkage

18

Similarity measures for determining neighbourhoods are based on

links (citations)

text

usage

19

Similarity measures: (Some) roots in bibliometrics / scientometrics

Co-citation analysis (Small, 1973, 1977; Small & Greenlee, 1980)

“specialties” (Small & Griffith, 1974)

– cluster of co-cited documents = the knowledge base of a specialty

“research fronts” (Garfield & Small, 1989)

Bibliographic coupling (Kessler, 1963)

Co-word analysis (Callon et al., 1983, 1986)

Combinations (e.g., Braam, Moed, & van Raan, 1991; Glenisson, Glänzel, Janssens, & De Moor, various 200x)

[PageRank: Pinski & Narin, 1976]

20

Scientometric mappings

Choice of figures based on [Chen, Mapping Scientific Frontiers, Springer 2003]

21

Link-based similarity measures: basic forms

[Figure: three diagrams over documents A, B and C, labelled direct citation, bibliographic coupling, and co-citation]
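
The three basic forms named on this slide can be made concrete on a toy citation graph. The following Python sketch is illustrative only: the graph, the document names and the helper functions are hypothetical and are not taken from the talk or from any of the tools discussed.

```python
# Toy citation graph (hypothetical data): each document maps to the set of
# documents it cites.
cites = {
    "A": {"B", "C"},
    "B": {"C"},
    "D": {"A", "B"},
    "E": {"A", "B", "C"},
}

def direct_citation(x, y):
    """True if x cites y or y cites x."""
    return y in cites.get(x, set()) or x in cites.get(y, set())

def bibliographic_coupling(x, y):
    """Sources cited by both x and y (shared references)."""
    return cites.get(x, set()) & cites.get(y, set())

def co_citation(x, y):
    """Documents that cite both x and y."""
    return {d for d, refs in cites.items() if x in refs and y in refs}
```

In this toy graph, A and B are linked by direct citation (A cites B), they are bibliographically coupled through C, and they are co-cited by D and E.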

22

Link-based similarity: citing documents

23

Link-based similarity: cited documents

24

Link-based similarity: local co-citation neighbourhood

25

Link-based similarity: local bibliographic-coupling neighbourhood

26

Link-based similarity: local bibliographic-coupling neighbourhood

Active Bibliography

sources also cited by others

Active Bibliography Score

= Common Citation × Inverse Document Frequency (CCIDF)
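
One possible reading of "Common Citation Inverse Document Frequency" is sketched below in Python: every cited source is weighted by the inverse of the number of documents that cite it, and the score of a document pair is the sum of these weights over the sources both documents cite. This is an assumption about the general idea, not the actual Citeseer implementation; the data and the function name `ccidf` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: document -> set of sources it cites.
cites = {
    "d1": {"s1", "s2", "s3"},
    "d2": {"s1", "s2"},
    "d3": {"s1"},
    "d4": {"s3"},
}

# Number of documents citing each source (the "document frequency" of a citation).
citation_freq = Counter(s for refs in cites.values() for s in refs)

def ccidf(x, y):
    """Sum of inverse citation frequencies over the sources cited by both x and y."""
    common = cites[x] & cites[y]
    return sum(1.0 / citation_freq[s] for s in common)
```

Under this weighting, sharing a rarely cited source counts for more than sharing a source that many documents cite, in analogy to TF.IDF for words.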

27

Text-based similarity (I)

28

Text-based similarity (I)

Similarity at the sentence level:

respects sentence structure (sequence, minus some data cleaning)

usually revisions of the document under consideration

Similarity at the text level:

based on bag-of-words and TF.IDF
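
Text-level similarity based on bag-of-words and TF.IDF can be sketched as follows. This is a minimal illustration in plain Python, assuming whitespace tokenisation, a logarithmic IDF and cosine similarity; the mini-corpus and all names are hypothetical, and the tools discussed here may weight and compare terms differently.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; real systems would use full texts or abstracts.
docs = {
    "d1": "citation analysis of scientific literature",
    "d2": "co-citation analysis and bibliographic coupling",
    "d3": "plagiarism detection in student writing",
}

def tokenize(text):
    """Very simple tokenisation: lower-case and split on whitespace."""
    return text.lower().split()

# Bag-of-words: word order is discarded, only term counts are kept.
bags = {d: Counter(tokenize(t)) for d, t in docs.items()}
n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for bag in bags.values() for term in bag)

def tfidf(bag):
    """Weight each term by term frequency times log inverse document frequency."""
    return {t: tf * math.log(n_docs / df[t]) for t, tf in bag.items()}

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = {d: tfidf(bag) for d, bag in bags.items()}
```

Here `cosine(vectors["d1"], vectors["d2"])` is small but positive because the two texts share the term "analysis", while d1 and d3 share no term and get a similarity of zero.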

29

Text-based similarity (II)

30

Similarity based on the text of metadata

31

Usage-based similarity (here: community-based)

32

Interactive citation analysis tools: Pros and cons

+ Sources are immediately accessible (if they are Open Access)

+ Relationships become visible (esp. when looking at the full text)

- Not available for all disciplines in the same quality

- Incomplete and incorrect document analysis

- Frustration when documents are not Open Access

- Algorithms are not always understandable

e.g., Google Scholar: proprietary; Citeseer: open source

- Intractability I: only local search, starting from one document

- Intractability II: no ranking or unclear ranking in result sets

[Berendt & Havemann, Jahrbuch Wissenschaftsforschung 2007]

33

Problem “intractability due to local search”

Local search in the neighbourhood of 1 document

No “top-down” grouping of documents

Why are groups useful?

Citation indices must be formed with reference to groups

Understanding a scientific field includes forming groups of concepts

Assumption: Concepts are represented by groups

34

Our approach: Search + grouping + interactivity

35

Build a tool that is

user-friendly

intelligent

modular and extensible

36

[Berendt, Proc. AAAI Symposium KCVC 2005]

[Berendt, Dingel, & Hanser, Proc. ECDL 2006]

[Berendt & Krause, submitted]

[Berendt & Kolbe, in preparation]

37

System architecture

[Figure: architecture diagram with the components Web services; text & link mining / information extraction tools; databases (local and/or mirrored); other Web services and information sources; VBA macro]

38

39

Search; Retrieval

[slides not included in the online version]

40

41

42

Organisation of the literature / bibliography construction

[slide not included in the online version] – here’s the old interface

43

Citation analysis and text analysis

[Figure: diagrams over documents A, B and C, labelled direct citation, bibliographic coupling, and co-citation]

& similarity measure – e.g., Jaccard coefficient for co-citation / analogous for bibliographic coupling:

(no. of sources cited in both documents) / (no. of sources cited in at least 1 document)

& keywords (title & abstract, TF.IDF)
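
The Jaccard coefficient on this slide is the size of the intersection of two sets divided by the size of their union. A minimal Python sketch follows; the function name and the sample sets are hypothetical. Applied to the sets of sources cited by two documents it gives the ratio written above, and the analogous variant applies the same function to the sets of documents that cite each of the two.

```python
def jaccard(set_a, set_b):
    """|A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Hypothetical sets of sources cited by two documents x and y.
refs_x = {"s1", "s2", "s3"}
refs_y = {"s2", "s3", "s4"}
shared_reference_similarity = jaccard(refs_x, refs_y)  # 2 shared out of 4 distinct

# Hypothetical sets of documents citing x and y, for the analogous variant.
citing_x = {"p1", "p2"}
citing_y = {"p2"}
shared_citer_similarity = jaccard(citing_x, citing_y)  # 1 shared out of 2 distinct
```

The coefficient ranges from 0 (no overlap) to 1 (identical sets), so it can be used directly as a similarity input for clustering.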

44

Current architecture of the clustering tool (partial view)

[slide not included in the online version]

45

Enter full-text indexing

(ex.: D. Mladenič’s publications @ IST-WORLD; Pajntar & Ferlež, 200x; cf. DocumentAtlas / Ontogen: Fortuna, Mladenič & Grobelnik, 2005+)

46

Questions

How to make this scale better?

How to best combine link & text analysis?

Best cluster quality measures?

What else is needed to turn this into real ontology learning?

How to best support this as an interactive learning process?

47

48

Discussion

How to best support this as a collaborative learning process?

49

50

Writing

corrected, XML annotated, and formatted

51

Information extraction: Reference parsing in 3 tools

... In our tool: involving the author (higher IE quality)

How to best learn regular expressions?

How to best support this as an interactive learning process?
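
To illustrate what regular-expression-based reference parsing can look like, here is a minimal Python sketch for one simple citation style, "Authors (Year). Title. Venue." The pattern, the field names and the sample string are hypothetical and are not the rules used in the tool or in the other tools compared on this slide.

```python
import re

# Hypothetical pattern for references of the form "Authors (Year). Title. Venue."
REFERENCE = re.compile(
    r"^(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>[^.]+)\.\s+(?P<venue>.+?)\.?$"
)

def parse_reference(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = REFERENCE.match(line.strip())
    return match.groupdict() if match else None

# Made-up sample reference, used only to exercise the pattern.
sample = "Doe, J. (2005). An example title. Example Journal."
parsed = parse_reference(sample)
```

A single pattern like this covers only one style. Lines for which `parse_reference` returns `None` are the natural candidates to hand back to the author for correction, which is one way to read "involving the author".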

52

(Other) uses in education

53

How can these ideas be used in education?

Tool + evaluations

In addition, tasks like these, and Google as an example of why citation networks are useful

Auxiliary (Web-based) materials on why + how to cite

Classes on these topics

54

Plagiarism detection market study

[Berendt, Humboldt-Universität CMS Journal, 2007]

56

A final question

Should we delegate to Google Scholar and its proprietary algorithms, or ...?

57

Thank you for your attention!

