The Erdős webgraph server


Discrete Applied Mathematics 167 (2014) 315–317

Contents lists available at ScienceDirect

Discrete Applied Mathematics

journal homepage: www.elsevier.com/locate/dam

Communication

The Erdős webgraph server

Rafael Ördög a,b, Dániel Bánky a,b, Balázs Szerencsi a,b, Péter Juhász a, Vince Grolmusz a,b,∗

a Institute of Mathematics, Eötvös University, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary
b Uratim Ltd., H-1118 Budapest, Hungary

Article info

Article history:
Received 17 September 2013
Accepted 18 October 2013
Available online 9 November 2013
Communicated by Endre Boros

Keywords: Webgraph

Abstract

We describe the new Erdős Webgraph Server, paying tribute to Paul Erdős, who died 17 years ago. The server is publicly available at http://web-graph.org. Much work has been done on webgraphs, but to the best of our knowledge, there is no other regularly refreshed, freely available webgraph on the net: the freshest we are aware of is two years old. Here the crawling process and the graph building strategy of the server are detailed.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

The Webgraph [3–5] is a well-studied representation of the World Wide Web by a directed graph, whose vertices correspond to the pages of the WWW, and in which a directed edge connects page X to page Y if there exists a hyperlink on page X referring to page Y. The webgraph plays a central role in computing the PageRank of webpages [2], and in discovering similar pages on the web [6,7].

While there is an enormous literature on the study of the webgraph (see, e.g., http://clair.si.umich.edu/~radev/webgraph/), we found that there are virtually no regularly refreshed webgraphs freely available for research purposes on the web. Therefore we created the Erdős Webgraph Server at the http://web-graph.org address.

2. Architecture

Since the current indexable web contains around 10^12 URLs, using individual URLs as the nodes of the constructed graph seemed infeasible. Therefore we applied the following definition.

Definition 1. The nodes of the graph are domain names, and two nodes, X and Y, are connected by a directed edge (X, Y) if there exists a document (URL) under the domain X that hyperlinks to a document (URL) under the domain Y.

By current estimates, there are around 150 million domains on the web, so constructing, maintaining and distributing such a graph seems to be manageable.

For example, in Fig. 1, the thick arrows are the edges in our graph; these edges were added because the thin arrows link from one document to another under distinct domains.
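Definition 1 can be illustrated with a minimal Python sketch that collapses page-level hyperlinks into domain-level edges. The sketch assumes that the "domain" of a URL is simply its host name and that links staying inside one domain are skipped; neither choice is stated in the text, and the function names `domain_of` and `domain_edges` are hypothetical.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host name of a URL (lower-cased by urlsplit), or None."""
    return urlsplit(url).hostname


def domain_edges(page_links):
    """Collapse (source URL, target URL) hyperlinks into a set of domain edges."""
    edges = set()
    for src_url, dst_url in page_links:
        src, dst = domain_of(src_url), domain_of(dst_url)
        # Assumption: links between documents of the same domain add no edge.
        if src and dst and src != dst:
            edges.add((src, dst))
    return edges


links = [
    ("http://a.example/page1.html", "http://b.example/index.html"),
    ("http://a.example/page2.html", "http://b.example/about.html"),
    ("http://a.example/page1.html", "http://a.example/page2.html"),
    ("http://b.example/index.html", "http://c.example/"),
]
print(sorted(domain_edges(links)))
```

The two hyperlinks from a.example to b.example yield a single edge, which is why the domain graph is so much smaller than the URL-level graph.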

∗ Corresponding author at: Institute of Mathematics, Eötvös University, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary. Tel.: +36 1 381 2226; fax: +36 1 381 2231.

E-mail address: [email protected] (V. Grolmusz).

0166-218X/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.dam.2013.10.032


Fig. 1. The edges of the Erdős Webgraph.

Fig. 2. The cyclic architecture of the crawler, collecting edges for the Erdős Webgraph Server.

2.1. Crawler design

Currently 50–100 crawling robots traverse the web. Our crawlers are identified as uCrawl robots.

The crawling process works as a cycle (see Fig. 2).

• First, an algorithm selects from our database server the URLs that need to be visited or revisited, sorted by priority.
• This program also mixes non-visited and visited URLs, creates special packets for the crawling machines, then places them on a large temporary storage.
• When a crawling robot runs out of addresses, it requests and receives a packet from the packet maker. It then visits those sites, respecting the contents of their robots.txt files.
• Each crawler collects new URLs from each site visited, and estimates a time for the next visit.
• From this information the crawling bot also makes packets, then puts them on another temporary storage server.
• From that storage, a packet loader process loads only the new addresses and only the new edges of the webgraph into the database, and removes the deprecated sites.
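The cycle above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the server's implementation: the function names, the packet size, the convention that a lower priority number is more urgent, and the in-memory `WEB` dictionary standing in for real HTTP fetching are all hypothetical. Revisit-time estimation and the removal of deprecated sites are omitted, and edges are kept at URL level for brevity.

```python
import heapq
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "uCrawl"  # robot name given in the text


def make_packets(frontier, packet_size):
    """Packet maker: order (priority, url) pairs and cut them into packets."""
    heap = list(frontier)  # assumption: lower priority value = more urgent
    heapq.heapify(heap)
    ordered = [heapq.heappop(heap)[1] for _ in range(len(heap))]
    return [ordered[i:i + packet_size] for i in range(0, len(ordered), packet_size)]


def allowed(url, robots_by_host):
    """Apply robots.txt rules; hosts without a stored policy are allowed (assumption)."""
    parser = robots_by_host.get(urlsplit(url).hostname)
    return parser is None or parser.can_fetch(USER_AGENT, url)


def crawl_packet(packet, fetch_links, robots_by_host):
    """Crawler: visit each permitted URL and return the (source, target) links found."""
    found = []
    for url in packet:
        if allowed(url, robots_by_host):
            found.extend((url, target) for target in fetch_links(url))
    return found


def load_packet(found, known_urls, known_edges):
    """Packet loader: add only the URLs and edges that are not stored yet."""
    new_urls = {target for _, target in found} - known_urls
    new_edges = set(found) - known_edges
    known_urls.update(new_urls)
    known_edges.update(new_edges)
    return new_urls, new_edges


# Toy run; WEB is a hypothetical stand-in for fetching pages and extracting links.
WEB = {
    "http://a.example/": ["http://b.example/", "http://a.example/private/x"],
    "http://a.example/private/x": ["http://secret.example/"],
    "http://b.example/": ["http://c.example/"],
}
policy = RobotFileParser()
policy.parse(["User-agent: *", "Disallow: /private/"])
robots_by_host = {"a.example": policy}

frontier = [(2, "http://b.example/"), (1, "http://a.example/"),
            (3, "http://a.example/private/x")]
known_urls = {url for _, url in frontier}
known_edges = set()
for packet in make_packets(frontier, packet_size=2):
    found = crawl_packet(packet, lambda url: WEB.get(url, []), robots_by_host)
    new_urls, new_edges = load_packet(found, known_urls, known_edges)
    print(sorted(new_urls), len(new_edges))
```

In this toy run the page under /private/ is never fetched, so the link it contains never reaches the database, while the newly discovered http://c.example/ becomes a candidate for the next cycle.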

3. Application

The current webgraph is available for download at http://web-graph.org.

The download contains one zipped file with the following format: each line holds two 26-character string identifiers separated by a TAB character. Every line defines a directed edge of the webgraph. The identifiers correspond to nodes: the first one is the tail of the edge, the second one is the head.


Fig. 3. The power law distribution of the degrees of the Erdős Webgraph.

For example, a row of the form

01324moja6i5ghdbhfe94iou9e 5ltb8q97ou2ui154lc4ohc9pqt

means that a URL in the domain with ID "01324moja6i5ghdbhfe94iou9e" links to a URL in the domain with ID "5ltb8q97ou2ui154lc4ohc9pqt". If one wishes to know the actual domain names behind these IDs, a domain dictionary application is available at http://web-graph.org/index.php/domain-dictionary.

The above examples translate to www.autocluster.hu and www.matech2000.hu, respectively.
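A reader for this edge-list format might look as follows. This is a minimal sketch assuming exactly the layout described above (two 26-character identifiers per line, separated by one TAB); the function names are hypothetical, and the name of the file inside the zipped download is not given in the text, so the sketch works on any iterable of text lines.

```python
ID_LENGTH = 26  # identifier length stated in the text


def parse_edge_line(line):
    """Split one line into (tail_id, head_id); raise ValueError if malformed."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 2 or any(len(field) != ID_LENGTH for field in fields):
        raise ValueError(f"malformed edge line: {line!r}")
    tail, head = fields
    return tail, head


def read_edges(lines):
    """Yield one (tail_id, head_id) pair per non-empty line."""
    for line in lines:
        if line.strip():
            yield parse_edge_line(line)


sample = ["01324moja6i5ghdbhfe94iou9e\t5ltb8q97ou2ui154lc4ohc9pqt\n"]
print(list(read_edges(sample)))
```

Because `read_edges` is a generator, it can be fed a file object directly and never needs to hold the whole edge list in memory; for the zipped download, the lines could come from `zipfile.ZipFile.open` wrapped in `io.TextIOWrapper`.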

4. The power law distribution of the data

The power law distribution [1] of the degrees of the constructed graph is clearly visible on the doubly logarithmic chart of Fig. 3.
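A rough way to check such a claim on the downloaded edge list is to tabulate the degree distribution and fit a straight line on the doubly logarithmic scale. The sketch below makes two assumptions that the text does not confirm: it uses in-degrees (the text does not say which degree Fig. 3 shows), and it uses a plain least-squares fit, which is only a crude estimator of a power-law exponent. The function names are hypothetical.

```python
import math
from collections import Counter


def in_degree_histogram(edges):
    """Map each in-degree d >= 1 to the number of nodes with that in-degree."""
    in_degree = Counter(head for _, head in edges)
    return Counter(in_degree.values())


def loglog_slope(histogram):
    """Least-squares slope of log(node count) against log(degree)."""
    points = [(math.log(d), math.log(c)) for d, c in histogram.items() if d > 0 and c > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Toy edge list: node "c" has in-degree 2, node "b" has in-degree 1.
toy_edges = [("a", "c"), ("b", "c"), ("a", "b")]
print(dict(in_degree_histogram(toy_edges)))
```

If the number of nodes of degree d behaves like C·d^(−γ), the points (log d, log count) lie on a line of slope −γ, so the returned slope is an estimate of −γ.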

Acknowledgments

The authors acknowledge the partial support of OTKA CNK 77780, of ERC Advanced Grant 227701 DISCRETECONT, of the European Union and the European Social Fund under the grant agreement no. TÁMOP 4.2.1/B-09/KMR-2010-0003.

References

[1] Albert-László Barabási, Réka Albert, Emergence of scaling in random networks, Science 286 (5439) (1999) 509–512.
[2] Sergey Brin, Lawrence Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems 30 (1998) 107–117.
[3] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener, Graph structure in the web, in: WWW2000, 2000.
[4] Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, Jon Kleinberg, Mining the Web's link structure, Computer 32 (8) (1999) 60–67.
[5] Soumen Chakrabarti, David A. Gibson, Kevin S. McCurley, Surfing the Web backwards, in: WWW1999, 1999, pp. 1679–1693.
[6] Jeffrey Dean, Monika R. Henzinger, Finding related pages in the world wide web, Computer Networks 31 (1999) 1467–1479.
[7] Taher H. Haveliwala, Aristides Gionis, Dan Klein, Piotr Indyk, Evaluating strategies for similarity search on the web, in: Proceedings of the 11th International Conference on World Wide Web, WWW'02, ACM, New York, NY, USA, 2002, pp. 432–442.

