Discrete Applied Mathematics 167 (2014) 315–317
Communication
The Erdős webgraph server
Rafael Ördög a,b, Dániel Bánky a,b, Balázs Szerencsi a,b, Péter Juhász a, Vince Grolmusz a,b,∗
a Institute of Mathematics, Eötvös University, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary
b Uratim Ltd., H-1118 Budapest, Hungary
Article info
Article history:
Received 17 September 2013
Accepted 18 October 2013
Available online 9 November 2013
Communicated by Endre Boros
Keywords: Webgraph
Abstract
We describe the new Erdős Webgraph Server, which pays tribute to Paul Erdős, who died 17 years ago. The server is publicly available at http://web-graph.org. Much work has been done on webgraphs, but to the best of our knowledge, there is no other regularly refreshed, freely available webgraph on the net: the freshest we are aware of is two years old. Here the crawling process and the graph building strategy of the server are detailed.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
The Webgraph [3–5] is a well-studied representation of the World Wide Web by a directed graph, whose vertices correspond to the pages of the WWW, and in which a directed edge connects page X to page Y if there exists a hyperlink on page X referring to page Y. The webgraph plays a central role in computing the PageRank of web pages [2] and in discovering similar pages on the web [6,7].
While there is an enormous literature on the study of the webgraph (see, e.g., http://clair.si.umich.edu/~radev/webgraph/), we found that there are virtually no regularly refreshed webgraphs freely available for research purposes on the web. Therefore we created the Erdős Webgraph Server at http://web-graph.org.
2. Architecture
Since the current indexable web contains around 10^12 URLs, using URLs as the nodes of the graph seemed infeasible. Therefore we applied the following definition.
Definition 1. The nodes of the graph are domain names, and two nodes, X and Y, are connected by a directed edge (X, Y) if there exists a document (URL) under the domain X that hyperlinks to a document (URL) under the domain Y.
By current estimates, there are around 150 million domains on the web, so constructing, maintaining and distributing such a graph seems to be manageable.
For example, in Fig. 1, the thick arrows are the edges in our graph; these edges were added because the thin arrows link one document to another under distinct domains.
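Definition 1 can be illustrated with a minimal Python sketch that collapses URL-level hyperlinks into domain-level edges. It is not the server's implementation: the helper names `domain_of` and `domain_edges` are hypothetical, the host name of a URL is assumed to stand for its domain, and links within a single domain are assumed to produce no edge.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lower-cased host name of a URL, or None if it has none."""
    host = urlsplit(url).hostname
    return host.lower() if host else None


def domain_edges(hyperlinks):
    """Collapse (source_url, target_url) links into a set of domain edges."""
    edges = set()
    for source_url, target_url in hyperlinks:
        x, y = domain_of(source_url), domain_of(target_url)
        # Assumption: links inside one domain are skipped (no self-loops).
        if x and y and x != y:
            edges.add((x, y))
    return edges


# Illustrative URLs only; they are not taken from the crawl.
links = [
    ("http://a.example/page1.html", "http://b.example/index.html"),
    ("http://a.example/page2.html", "http://b.example/about.html"),
    ("http://a.example/page1.html", "http://a.example/page2.html"),
]
print(sorted(domain_edges(links)))
```

Storing the edges in a set means that any number of document-level links between the same two domains contribute a single directed edge, which is what keeps the graph at the scale of domains rather than URLs.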
∗ Corresponding author at: Institute of Mathematics, Eötvös University, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary. Tel.: +36 1 381 2226; fax: +36 1 381 2231.
E-mail address: [email protected] (V. Grolmusz).
0166-218X/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.dam.2013.10.032
Fig. 1. The edges of the Erdős Webgraph.
Fig. 2. The cyclic architecture of the crawler, collecting edges for the Erdős Webgraph Server.
2.1. Crawler design
Currently 50–100 crawling robots traverse the web. Our crawlers are identified as uCrawl robots.
The crawling process works as a cycle (see Fig. 2).
• First, an algorithm determines, from our database server, the URLs that need to be visited or revisited, sorted by priority.
• This program also mixes non-visited and visited URLs, creates special packets for the crawling machines, then places them on a large temporary storage.
• If a crawling robot runs out of addresses and needs new URLs, it requests and receives a packet from the packet maker. It then visits those sites, respecting the contents of their robots.txt files.
• Each crawler collects new URLs from each site visited, and estimates a time for the next visit.
• From this information the crawling bot also makes packets, then puts them on another temporary storage server.
• From that storage, a packet loader process loads only the new addresses and only the new edges of the webgraph into the database, and removes the deprecated sites.
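Two steps of this cycle, packet making and the robots.txt check, can be sketched in Python as follows. This is an illustration under stated assumptions, not the uCrawl code: the function names, the packet size, the convention that a lower priority number is more urgent, and the use of "uCrawl" as the user-agent string are all hypothetical.

```python
import heapq
from urllib.robotparser import RobotFileParser

# Robot name taken from the text; the exact user-agent string is an assumption.
USER_AGENT = "uCrawl"


def make_packets(candidates, packet_size):
    """Split (priority, url) pairs into packets, most urgent URLs first.

    Assumption: a lower priority number means a more urgent visit. The
    candidates may freely mix visited and non-visited URLs.
    """
    heap = list(candidates)
    heapq.heapify(heap)
    packets, current = [], []
    while heap:
        _, url = heapq.heappop(heap)
        current.append(url)
        if len(current) == packet_size:
            packets.append(current)
            current = []
    if current:
        packets.append(current)  # last, possibly shorter, packet
    return packets


def is_allowed(url, robots_txt):
    """Check a URL against the text of the site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


# Illustrative data only.
candidates = [
    (2, "http://b.example/"),
    (1, "http://a.example/"),
    (3, "http://c.example/"),
]
robots_txt = "User-agent: *\nDisallow: /private/\n"

for packet in make_packets(candidates, packet_size=2):
    print(packet)
print(is_allowed("http://a.example/private/x.html", robots_txt))
```

In this sketch the robots.txt text is passed in directly so that the example runs without network access; a real crawler would fetch and cache it per site.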
3. Application
The current webgraph is available for download at http://web-graph.org.
The download contains one zipped file with the following format: each line contains two 26-character string identifiers separated by a TAB character. Every line defines a directed edge of the webgraph. The identifiers correspond to nodes: the first one is the tail of the edge, the second one is the head.
Fig. 3. The power law distribution of the degrees of the Erdős Webgraph.
For example, a row of the form
01324moja6i5ghdbhfe94iou9e	5ltb8q97ou2ui154lc4ohc9pqt
means that a URL in the domain with ID "01324moja6i5ghdbhfe94iou9e" links to a URL in the domain with ID "5ltb8q97ou2ui154lc4ohc9pqt". If one wishes to know the actual domain names behind these IDs, a domain dictionary application is available at http://web-graph.org/index.php/domain-dictionary.
The two IDs above translate to www.autocluster.hu and www.matech2000.hu, respectively.
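A file in this format can be read with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, the input is assumed to be the already unzipped text, and the sample line uses the two identifiers quoted in the text.

```python
ID_LENGTH = 26  # identifier length stated in the text


def parse_edge_line(line):
    """Split one TAB-separated line into a (tail_id, head_id) pair."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 2 or any(len(f) != ID_LENGTH for f in fields):
        raise ValueError(f"malformed edge line: {line!r}")
    tail_id, head_id = fields
    return tail_id, head_id


def read_edges(lines):
    """Yield (tail_id, head_id) for every non-empty line."""
    for line in lines:
        if line.strip():
            yield parse_edge_line(line)


sample = ["01324moja6i5ghdbhfe94iou9e\t5ltb8q97ou2ui154lc4ohc9pqt\n"]
for tail_id, head_id in read_edges(sample):
    print(tail_id, "->", head_id)
```

Because `read_edges` is a generator, it can be fed an open file object and will process the edge list line by line without loading the whole graph into memory.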
4. The power law distribution of the data
The power law distribution [1] of the degrees of the constructed graph is clearly visible on the doubly logarithmic chart of Fig. 3.
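A degree distribution of this kind can be tabulated from an edge list as in the following Python sketch. It does not reproduce how Fig. 3 was made: the function names are hypothetical, the choice between in-degree and out-degree is left as a parameter because the text does not specify it, and the least-squares slope on the log-log scale is only a rough visual check, not a rigorous estimate of a power-law exponent.

```python
import math
from collections import Counter


def degree_histogram(edges, kind="in"):
    """Map each degree k to the number of nodes with that in- or out-degree.

    Assumes the edges are distinct (tail, head) pairs; nodes of degree 0
    do not appear in the result.
    """
    index = 1 if kind == "in" else 0
    degree = Counter(edge[index] for edge in edges)
    return Counter(degree.values())


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(degree)."""
    points = [
        (math.log(k), math.log(c))
        for k, c in histogram.items()
        if k > 0 and c > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Illustrative toy graph, not data from the server.
edges = [("a", "c"), ("b", "c"), ("a", "b")]
print(degree_histogram(edges, kind="in"))
```

On a doubly logarithmic chart, a power law of the form count ∝ k^(−γ) appears as a straight line, and `loglog_slope` returns an estimate of −γ for the tabulated histogram.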
Acknowledgments
The authors acknowledge the partial support of OTKA CNK 77780, of ERC Advanced Grant 227701 DISCRETECONT, and of the European Union and the European Social Fund under grant agreement no. TÁMOP 4.2.1/B-09/KMR-2010-0003.
References
[1] Albert-László Barabási, Réka Albert, Emergence of scaling in random networks, Science 286 (5439) (1999) 509–512.
[2] Sergey Brin, Lawrence Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems 30 (1998) 107–117.
[3] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener, Graph structure in the web, in: WWW2000, 2000.
[4] Soumen Chakrabarti, Byron E. Dom, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins, David Gibson, Jon Kleinberg, Mining the Web's link structure, Computer 32 (8) (1999) 60–67.
[5] Soumen Chakrabarti, David A. Gibson, Kevin S. McCurley, Surfing the Web backwards, in: WWW1999, 1999, pp. 1679–1693.
[6] Jeffrey Dean, Monika R. Henzinger, Finding related pages in the world wide web, Computer Networks 31 (1999) 1467–1479.
[7] Taher H. Haveliwala, Aristides Gionis, Dan Klein, Piotr Indyk, Evaluating strategies for similarity search on the web, in: Proceedings of the 11th International Conference on World Wide Web, WWW'02, ACM, New York, NY, USA, 2002, pp. 432–442.