Research Issues for Large Scale Digital Library Search Engines in the Cloud: CiteSeerX
or Why consider CiteSeerX as a Cloud Testbed
C. Lee Giles, Pradeep Teregowda, Bhuvan Urgaonkar
Pennsylvania State University
University Park, PA
Data Varies with Discipline or Small vs Big Science
“Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.”
‘Lost in a Sea of Science Data’ S.Carlson, The Chronicle of Higher Education (23/06/2006)
Data is local
Data will not be shared
At some point, “local” clouds will be needed
If you can’t move the data around, take the analysis/cloud to the data!
• Bandwidth of a van loaded with disks
Do all/most data manipulations locally
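A rough back-of-the-envelope sketch of the "van loaded with disks" point, in Python. The 2 TB figure is the collection size quoted elsewhere in these slides; the 100 Mbit/s link and the 5-hour drive are assumptions chosen only for illustration.

```python
def transfer_hours(size_tb: float, link_mbps: float) -> float:
    """Hours needed to push size_tb terabytes through a link_mbps megabit/s link."""
    bits = size_tb * 1e12 * 8          # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6)
    return seconds / 3600


def effective_mbps(size_tb: float, trip_hours: float) -> float:
    """Effective bandwidth of physically driving size_tb terabytes for trip_hours."""
    bits = size_tb * 1e12 * 8
    return bits / (trip_hours * 3600) / 1e6


if __name__ == "__main__":
    size_tb = 2.0        # collection size quoted in these slides
    link_mbps = 100.0    # assumed sustained network throughput
    trip_hours = 5.0     # assumed door-to-door drive time
    print(f"network: {transfer_hours(size_tb, link_mbps):.1f} h")
    print(f"van:     {effective_mbps(size_tb, trip_hours):.0f} Mbit/s effective")
```

Under these assumed numbers the network copy takes well over a day, while the drive is equivalent to a link several times faster, which is the argument for moving the analysis to the data.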
Clouds for digital libraries/search engines
Several cloud features are attractive for information retrieval systems such as digital libraries and search engines (storage and fast access)
Flexibility/growth
• Components such as crawlers, web interfaces, etc. can utilize resources on demand.
Management
• Utilizing cloud services potentially requires less investment in hardware and maintenance.
Reliability
• By deploying across sites (or adopting distribution services provided by vendors), systems are potentially more stable.
What about CiteSeerX?
SeerSuite - CiteSeerX
SeerSuite: framework for digital libraries
Flexible, scalable, robust, portable, state of the art machine learning extractors, open source.
Easy to create instances of SeerSuite, both production and research grade:
CiteSeerX: computer science
ChemXSeer: chemistry
ArchSeer: archaeology
CollabSeer
EnronSeer
YouSeer
Facilitates research
http://citeseerx.ist.psu.edu
How CiteSeerX is like a specialty search engine
CiteSeerX shares several components with digital libraries and search engines
Web Interface
• Both digital libraries and search engines provide interfaces to users to interact with the application
Crawlers
• Focused crawling of the web for scholarly/academic documents
Index
• Both digital libraries and search engines utilize an inverted index to provide efficient and fast access to users through search.
Databases
• Digital libraries maintain extensive metadata usually in relational databases.
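To make the "inverted index" component concrete, here is a minimal Python sketch. It is not SeerSuite's indexing code; the document IDs, sample texts, and function names are hypothetical, and tokenization is reduced to lowercasing and whitespace splitting.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits


# Hypothetical toy collection, for illustration only.
DOCS = {
    "d1": "focused crawling for scholarly documents",
    "d2": "citation indexing for digital libraries",
    "d3": "metadata extraction from scholarly documents",
}
INDEX = build_inverted_index(DOCS)
```

A query such as `search(INDEX, "scholarly documents")` only intersects two small posting sets instead of scanning every document, which is why both digital libraries and search engines rely on this structure.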
CiteSeerX as a testbed
Digital Libraries continue to grow and be widely used
Cyberinfrastructure for scientists and academics
Google Scholar is very popular & to some invaluable
Publisher collections: ACM portal, Scopus, etc.; Library of Congress (NDLP)
DLs are usually poorly supported and have few monetization models
CiteSeerX is a digital library and a search engine
Features of CiteSeerX
Automatic acquisition of new documents by focused web crawling (1.5M documents, 20M citations – 2TB, 1-4 M authors) (data regularly shared by rsync), 24/7 service
Interface for search widely used (2 M hits/day, 200K queries/day)
Full text indexing
Autonomous citation indexing, linking documents through citations.
Automatic metadata extraction for each document.
MyCiteSeer for personalization.
New features in development, e.g.
Table extraction and search
Algorithm extraction and search
Commercial grade open source code and data shared
4 systems:
• Production
• Crawling
• Staging
• Research
All or some can be cloudized
Collection of Research Issues
Hosting cloud CiteSeerX instances
Economic issues
• Cost of hosting
• Cost of refactoring the source to be hosted in the cloud
Computational/technical issues
• What workflow to cloudize
• Component modification for efficient operation
• VM size: storage, memory and CPU sizing as a function of needs
• Establishing computational needs and availability clusters
• Appropriate load balancing across multiple sites
• Security of data stored, including metadata and user data
Policy issues
• Privacy of user data
• Copyright issues
CiteSeerX Architecture
Web Application
Focused Crawler
Document Conversion and Extraction
Document Ingestion
Data Storage
Maintenance Services
Federated Services
Source: USENIX ‘10
CiteSeerX data transfer
Nodes for hosting
Source: IEEE Cloud ‘10
Hosting models for DLs
Component hosting
• SeerSuite is modular by design and architecture; host individual components across available infrastructure.
Content hosting
• CiteSeerX provides access to document metadata, copies and application content
• Host parts or complete set.
Peak load hosting
• Support the application during peak loads
• Support growth of traffic.
Focus on actual public cloud costs
• Google App Engine, EC2 estimates
USENIX HotCloud ‘10, IEEE Cloud ‘10
Component Hosting
Expense of hosting the whole of CiteSeerX may be prohibitive.
Solution: Host a component or service, i.e.,
• Component/service code
• Data on which the component acts
• Interfaces, etc. associated with the component
Goal: Identify optimal subset/components based on:
• Service growth
• Service usage
• New services
Component Hosting - Costs
Least expensive option - host the index for cases.
Most expensive - host web services.
Costs in $
Component       Amazon EC2            Google App Engine
                Initial   Monthly     Initial   Monthly
Web Services    0         1448.18     0         942.53
Repository      0         1000        163.8     593.21
Database        0         858.89      12        348.05
Index           0         527.08      3.1       83.48
Extraction      0         499.02      0         90.6
Crawler         0         513.4       0         105
Component Hosting – Lessons Learned
Hosting components is reasonable
Having a service oriented architecture helps
Amazon EC2
Computation costs dominate.
Google App Engine
Refactoring costs?
Refactoring required not just for components, but other services.
Storage and transfer costs may be optimized
A study of data transfer in the application gives insights to costs.
Approach suitable for meeting fixed budgets
How many components of an application can be hosted for a fixed budget.
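The fixed-budget question can be sketched in Python using the monthly figures from the component cost table. This is only an illustration: it ignores initial costs, and it simply maximizes the number of components by taking the cheapest first, whereas the slides suggest choosing by service growth and usage. The function name and the budgets are hypothetical.

```python
# Monthly hosting costs in $, copied from the component cost table.
MONTHLY_COST = {
    "Amazon EC2": {
        "Web Services": 1448.18, "Repository": 1000.00, "Database": 858.89,
        "Index": 527.08, "Extraction": 499.02, "Crawler": 513.40,
    },
    "Google App Engine": {
        "Web Services": 942.53, "Repository": 593.21, "Database": 348.05,
        "Index": 83.48, "Extraction": 90.60, "Crawler": 105.00,
    },
}


def components_within_budget(costs, budget):
    """Greedily pick the cheapest components until the monthly budget is exhausted.

    Taking components in ascending cost order maximizes how many fit;
    it says nothing about which components are most valuable to host.
    """
    chosen, total = [], 0.0
    for name, cost in sorted(costs.items(), key=lambda item: item[1]):
        if total + cost > budget:
            break  # every remaining component costs at least as much
        chosen.append(name)
        total += cost
    return chosen, round(total, 2)


if __name__ == "__main__":
    for provider, budget in (("Amazon EC2", 1600), ("Google App Engine", 300)):
        print(provider, components_within_budget(MONTHLY_COST[provider], budget))
```

With these hypothetical budgets, three components fit on either provider; a policy weighted by usage or growth would need a different selection rule.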
Content Hosting – Lessons Learned
Hosting specific content relevant to peak load scenarios
• Easy to do: minimal refactoring required, affects a minimal set of components (presentation layer).
More complex scenarios need to be examined
• Hosting papers from the repository
• Hosting shards of the index
• Database
Peak Load – Lessons Learned
Hosting only during peak load conditions is economically feasible.
Growth potential
• Can be used to handle growth in traffic, instead of procuring new hardware.
• Hosting a specific component under stress, such as a database
• In such a case it will cost $400 to host the database in Amazon EC2.
Research Directions
• Similar to many discussed at this workshop, applied to DLs
• Explore policies for hosting based on
  Privacy/Security
  Integrity/reliability
  QoS; $ Costs
  Local vs public
Architecture redesign utilizing cloud primitives and systems spanning multiple sites
Queues, Key Stores, Clusters
Optimization of existing features for automated VM-ing
Conclusions
Advantages of cloudizing CiteSeerX
• Reliability, maintenance, potential cost savings
Different costs of hosting for all or parts of
• Components
• Content
• Peak load
CiteSeerX working system – testbed?
Data, storage, access, databases
Growth, evolving features, users
$ savings depend on continued support; working local system may still be needed
Archival issues/ Google Scholar
NSF cloud focus?
Besides what has been proposed at this workshop
Clouds for science – what industry will not or cannot support
Both for big and small science
Clouds for the “new” sciences – social, political, historical,… that have growing amounts of data
Also focus on data:
Data rules. Without data, there is no science.
Future Work
Cost of refactoring – particularly for Google App Engine.
Cost comparisons for other cloud offerings – Azure, Eucalyptus.
Privacy and user issues – myCiteSeer and private clouds.
Technical issues with cross hosting – load balancing and latency need to be addressed.
Virtualization in SeerSuite, components built with cloud hosting in mind (Federated Services).
References
Grossman, R., and Gu, Y. Data mining using high performance data clouds: Experimental studies using Sector and Sphere. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), ACM, pp. 920–927.
Mika, P., and Tummarello, G. Web semantics in the clouds. IEEE Intelligent Systems 23, 5 (2008), 82–87.
Nurmi, D., Wolski, R., Grzegorczyk, C., Obertelli, G., Soman, S., Youseff, L., and Zagorodnov, D. The Eucalyptus open-source cloud-computing system. In Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (2009), IEEE Computer Society, pp. 124–131.
Singh, A., Srivatsa, M., and Liu, L. Search-as-a-service: Outsourced search over outsourced storage. ACM Trans. Web 3, 4 (2009), 1–33.
Teregowda, P., Urgaonkar, B., and Giles, C. Cloud computing: A digital libraries perspective. In 3rd IEEE 2010 International Conference on Cloud Computing (2010).
Teregowda, P. B., Councill, I. G., Fernandez, J. P. R., Kasbha, M., Zheng, S., and Giles, L. C. SeerSuite: Developing a scalable and reliable application framework for building digital libraries by crawling the web. In USENIX Conference on Web Application Development (2010).
Teregowda, P. B., Urgaonkar, B., and Giles, C. L. Cost implications of moving to the cloud: A digital libraries perspective. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '10) (2010).
Van de Sompel, H., Nelson, M., Lagoze, C., and Warner, S. Resource harvesting within the OAI-PMH framework. D-Lib Magazine 10, 12 (2004), 1082–9873.
Walker, E., Brisken, W., and Romney, J. To lease or not to lease from storage clouds. Computer 43 (2010), 44–50.
Weigel, F., Panda, B., Riedewald, M., Gehrke, J., and Calimlim, M. Large-scale collaborative analysis and extraction of web data. Proc. VLDB Endow. 1, 2 (2008), 1476–1479.
Wood, T., Cecchet, E., Ramakrishnan, K., Shenoy, P., van der Merwe, J., and Venkataramani, A. Disaster recovery as a cloud service: Economic benefits & deployment challenges. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (2010).