
Research Issues for Large Scale Digital Library Search Engines in the Cloud: CiteSeerX

or Why Consider CiteSeerX as a Cloud Testbed

C. Lee Giles, Pradeep Teregowda, Bhuvan Urgaonkar

Pennsylvania State University

University Park, PA

Data Varies with Discipline, or Small vs Big Science

Small vs Big science

“Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.”

‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)

Data is local

Data will not be shared

At some point “local” clouds will be needed

If you can’t move the data around, take the analysis/cloud to the data!

• Bandwidth of a van loaded with disks

Do all/most data manipulations locally
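The "bandwidth of a van loaded with disks" point can be made concrete with a little arithmetic. The sketch below is illustrative only: the 2 TB figure is the CiteSeerX collection size quoted later in these slides, while the 100 Mbit/s link, the 100 disks of 2 TB each, and the 24-hour trip are assumptions chosen for the example.

```python
def transfer_days(data_tb, link_mbps):
    """Days needed to send data_tb (decimal terabytes) over a link_mbps link."""
    bits = data_tb * 1e12 * 8            # terabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # megabits/s -> bits/s
    return seconds / 86400


def van_throughput_mbps(disks, disk_tb, trip_hours):
    """Effective bandwidth, in Mbit/s, of physically shipping disks."""
    bits = disks * disk_tb * 1e12 * 8
    return bits / (trip_hours * 3600) / 1e6


REPOSITORY_TB = 2.0  # collection size quoted in the slides

# Assumed values, not from the slides: link speed, disk count, trip time.
days_100mbps = transfer_days(REPOSITORY_TB, link_mbps=100)
van_mbps = van_throughput_mbps(disks=100, disk_tb=2.0, trip_hours=24)

print(f"2 TB over 100 Mbit/s: {days_100mbps:.2f} days")
print(f"Van with 100 x 2 TB disks, 24 h trip: {van_mbps:,.0f} Mbit/s")
```

Under these assumptions the van outperforms the network link by a wide margin, which is the argument for running the analysis where the data already lives.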

Clouds for digital libraries/search engines

Several cloud features are attractive for information retrieval systems such as digital libraries and search engines (storage and fast access):

Flexibility/growth

• Components such as crawlers, web interfaces, etc. can utilize resources on demand.

Management

• Utilizing cloud services potentially requires less investment in hardware and maintenance.

Reliability

• By deploying across sites (or adopting distribution services provided by vendors), systems are potentially more stable.

What about CiteSeerX?

SeerSuite - CiteSeerX

SeerSuite: framework for digital libraries

• Flexible, scalable, robust, portable, state-of-the-art machine learning extractors, open source.

Easy to create instances of SeerSuite, both production and research grade:

• CiteSeerX: computer science

• ChemXSeer: chemistry

• ArchSeer: archaeology

• CollabSeer

• EnronSeer

• YouSeer

Facilitates research

http://citeseerx.ist.psu.edu

How CiteSeerX is like a specialty search engine

CiteSeerX shares several components with digital libraries and search engines

Web Interface

• Both digital libraries and search engines provide interfaces to users to interact with the application

Crawlers

• Focused crawling of the web for scholarly/academic documents

Index

• Both digital libraries and search engines utilize an inverted index to provide efficient and fast access to users through search.

Databases

• Digital libraries maintain extensive metadata usually in relational databases.
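As a minimal illustration of the inverted index mentioned above, the sketch below maps each term to the set of documents containing it and answers a query by intersecting those sets. It is a toy example, not the CiteSeerX index: the function names, the whitespace tokenization, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, for illustration only.
docs = {
    "d1": "digital library search engine",
    "d2": "focused crawler for scholarly documents",
    "d3": "cloud hosting for a digital library",
}
index = build_inverted_index(docs)
print(search(index, "digital library"))
```

A lookup touches only the posting sets of the query terms rather than every document, which is why the structure gives fast access at search time.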

CiteSeerX as a testbed

Digital libraries continue to grow and be widely used

Cyberinfrastructure for scientists and academics

Google Scholar is very popular & to some invaluable

Publisher collections: ACM portal, Scopus, etc.; Library of Congress (NDLP)

DLs are usually poorly supported and have few monetization models

CiteSeerX is a digital library and a search engine

Features of CiteSeerX

Automatic acquisition of new documents by focused web crawling (1.5M documents, 20M citations, 2 TB, 1-4M authors; data regularly shared by rsync), 24/7 service

Interface for search widely used (2 M hits/day, 200K queries/day)

Full text indexing

Autonomous citation indexing, linking documents through citations.

Automatic metadata extraction for each document.

MyCiteSeer for personalization.

New features in development, e.g.

• Table extraction and search

• Algorithm extraction and search

Commercial grade open source code and data shared

4 systems:

• Production

• Crawling

• Staging

• Research

All or some can be cloudized

PSU

3 systems

Collection of Research Issues

Hosting cloud CiteSeerX instances

Economic issues

• Cost of hosting

• Cost of refactoring the source to be hosted in the cloud

Computational/technical issues

• What workflow to cloudize

• Component modification for efficient operation

• VM size: storage, memory and CPU sizing as a function of needs

• Establishing computational needs and availability clusters

• Appropriate load balancing across multiple sites

• Security of stored data, including metadata and user data

Policy issues

• Privacy of user data

• Copyright issues

CiteSeerX Architecture

Web Application

Focused Crawler

Document Conversion and Extraction

Document Ingestion

Data Storage

Maintenance Services

Federated Services

USENIX ‘10

CiteSeerX data transfer

Nodes for hosting IEEE Cloud ‘10

Hosting models for DLs

Component hosting

• SeerSuite is modular by design and architecture; host individual components across available infrastructure.

Content hosting

• CiteSeerX provides access to document metadata, copies and application content

• Host parts or complete set.

Peak load hosting

• Support the application during peak loads

• Support growth of traffic.

Focus on actual public cloud costs

• Google App Engine, EC2 estimates

USENIX HotCloud ‘10, IEEE Cloud ‘10

Component Hosting

Expense of hosting the whole of CiteSeerX may be prohibitive.

Solution: host a component or service, i.e.,

• Component/service code

• Data on which the component acts

• Interfaces, etc. associated with the component

Goal: identify optimal subset/components based on:

• Service growth

• Service usage

• New services

Component Hosting - Costs

Least expensive option - host the index for cases.

Most expensive - host web services.

Costs in $:

Component       Amazon EC2 Initial   Amazon EC2 Monthly   Google App Engine Initial   Google App Engine Monthly
Web Services    0                    1448.18              0                           942.53
Repository      0                    1000                 163.8                       593.21
Database        0                    858.89               12                          348.05
Index           0                    527.08               3.1                         83.48
Extraction      0                    499.02               0                           90.6
Crawler         0                    513.4                0                           105

Component Hosting – Lessons Learned

Hosting components is reasonable

Having a service oriented architecture helps

Amazon EC2

Computation costs dominate.

Google App Engine

Refactoring costs?

Refactoring required not just for components, but other services.

Storage and transfer costs may be optimized

A study of data transfer in the application gives insights to costs.

Approach suitable for meeting fixed budgets

How many components of an application can be hosted for a fixed budget?
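One way to read the fixed-budget question is as a selection problem over the monthly figures in the component cost table. The sketch below is an illustration under stated assumptions, not the method used in the cited papers: it ignores initial costs, treats components as independent, and uses a hypothetical budget of $1600 per month. Picking the cheapest components first maximizes how many fit within the budget.

```python
# Monthly costs in $ taken from the component cost table in these slides.
MONTHLY_COST = {
    "Web Services": {"ec2": 1448.18, "gae": 942.53},
    "Repository":   {"ec2": 1000.00, "gae": 593.21},
    "Database":     {"ec2": 858.89,  "gae": 348.05},
    "Index":        {"ec2": 527.08,  "gae": 83.48},
    "Extraction":   {"ec2": 499.02,  "gae": 90.60},
    "Crawler":      {"ec2": 513.40,  "gae": 105.00},
}


def components_within_budget(costs, provider, budget):
    """Greedily pick the cheapest components until the monthly budget is used up.

    Returns (list of chosen component names, total monthly cost).
    """
    by_cost = sorted(costs.items(), key=lambda item: item[1][provider])
    chosen, spent = [], 0.0
    for name, price in by_cost:
        if spent + price[provider] > budget:
            break  # every remaining component costs at least as much
        chosen.append(name)
        spent += price[provider]
    return chosen, round(spent, 2)


BUDGET = 1600.0  # hypothetical monthly budget in $, not from the slides

ec2_choice, ec2_spent = components_within_budget(MONTHLY_COST, "ec2", BUDGET)
gae_choice, gae_spent = components_within_budget(MONTHLY_COST, "gae", BUDGET)
print("EC2:", ec2_choice, ec2_spent)
print("App Engine:", gae_choice, gae_spent)
```

Counting components is only one possible objective. Weighting each component by service growth or usage, as listed under the component hosting goal, would turn this into a knapsack-style problem.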

Content Hosting – Lessons Learned

Hosting specific content relevant to peak load scenarios

• Easy to do: minimal refactoring required, affects a minimal set of components (presentation layer).

More complex scenarios need to be examined

• Hosting papers from the repository

• Hosting shards of the index

• Database

Peak Load – Lessons Learned

Hosting only during peak load conditions is economically feasible.

Growth potential

• Can be used to handle growth in traffic, instead of procuring new hardware.

• Hosting a specific component under stress, such as a database

• In such a case it will cost $400 to host the database in Amazon EC2.
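To see what hosting only during peak load might involve, the sketch below estimates how many hours per day, and how many queries, would spill over to the cloud once local capacity is exceeded. The hourly profile and the local capacity are hypothetical; only the daily total of 200K queries comes from the slides.

```python
def overflow_plan(hourly_queries, local_capacity):
    """Return (hours needing cloud capacity, queries beyond local capacity)."""
    overflow = [max(0, q - local_capacity) for q in hourly_queries]
    peak_hours = sum(1 for extra in overflow if extra > 0)
    return peak_hours, sum(overflow)


# Hypothetical 24-hour profile that adds up to the 200K queries/day
# quoted in the slides: 8 quiet hours, 12 normal hours, 4 peak hours.
profile = [4000] * 8 + [9000] * 12 + [15000] * 4

LOCAL_CAPACITY = 10000  # assumed queries/hour the local system can serve

peak_hours, overflow_queries = overflow_plan(profile, LOCAL_CAPACITY)
print(f"Total queries/day: {sum(profile)}")
print(f"Hours needing cloud capacity: {peak_hours}")
print(f"Queries served from the cloud: {overflow_queries}")
```

In this made-up profile only 4 of 24 hours need extra capacity, which is the kind of usage pattern where paying for cloud resources only at peak can be cheaper than procuring new hardware.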

Research Directions

• Similar to many discussed at this workshop, applied to DLs

• Explore policies for hosting based on: privacy/security, integrity/reliability, QoS, $ costs, local vs public

Architecture redesign utilizing cloud primitives and systems spanning multiple sites

Queues, Key Stores, Clusters

Optimization of existing features for automated VM-ing

Conclusions

Advantages of cloudizing CiteSeerX

Reliability, maintenance, potential cost savings

Different costs of hosting for all or parts of:

• Components

• Content

• Peak load

CiteSeerX working system: testbed?

• Data, storage, access, databases

• Growth, evolving features, users

$ savings depend on continued support; working local system may still be needed

Archival issues/ Google Scholar

NSF cloud focus?

Besides what has been proposed at this workshop

Clouds for science – what industry will not or can not support

Both for big and small science

Clouds for the “new” sciences – social, political, historical,… that have growing amounts of data

Also focus on data:

Data rules. Without data, there is no science.

Future Work

Cost of refactoring – particularly for Google App Engine.

Cost comparisons for other cloud offerings – Azure, Eucalyptus.

Privacy and user issues – myCiteSeer and private clouds.

Technical issues with cross hosting – load balancing, latency need to be addressed.

Virtualization in SeerSuite, components built with cloud hosting in mind (Federated Services).

References

GROSSMAN, R., AND GU, Y. Data mining using high performance data clouds: Experimental studies using sector and sphere. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008), ACM, pp. 920–927.

MIKA, P., AND TUMMARELLO, G. Web semantics in the clouds. IEEE Intelligent Systems 23, 5 (2008), 82–87.

NURMI, D., WOLSKI, R., GRZEGORCZYK, C., OBERTELLI, G., SOMAN, S., YOUSEFF, L., AND ZAGORODNOV, D. The eucalyptus open-source cloud-computing system. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid-Volume 00 (2009), IEEE Computer Society, pp. 124–131.

SINGH, A., SRIVATSA, M., AND LIU, L. Search-as-a-service: Outsourced search over outsourced storage. ACM Trans. Web 3, 4 (2009), 1–33.

TEREGOWDA, P., URGAONKAR, B., AND GILES, C. Cloud computing: A digital libraries perspective. In 3rd IEEE 2010 International Conference on Cloud Computing (2010).

TEREGOWDA, P. B., COUNCILL, I. G., FERNANDEZ, J. P. R., KASBHA, M., ZHENG, S., AND GILES, L. C. Seersuite: Developing a scalable and reliable application framework for building digital libraries by crawling the web. In USENIX Conference on Web Application Development (2010).

TEREGOWDA, P. B., URGAONKAR, B., AND GILES, C. L. Cost implications of moving to the cloud: A digital libraries perspective. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '10) (2010).

VAN DE SOMPEL, H., NELSON, M., LAGOZE, C., AND WARNER, S. Resource harvesting within the OAI-PMH framework. D-Lib Magazine 10, 12 (2004), 1082–9873.

WALKER, E., BRISKEN, W., AND ROMNEY, J. To lease or not to lease from storage clouds. Computer 43 (2010), 44–50.

WEIGEL, F., PANDA, B., RIEDEWALD, M., GEHRKE, J., AND CALIMLIM, M. Large-scale collaborative analysis and extraction of web data. Proc. VLDB Endow. 1, 2 (2008), 1476–1479.

WOOD, T., CECCHET, E., RAMAKRISHNAN, K., SHENOY, P., VAN DER MERWE, J., AND VENKATARAMANI, A. Disaster recovery as a cloud service: Economic benefits & deployment challenges. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (2010).
