[IEEE 2009 Fifth International Conference on Soft Computing, Computing with Words and Perceptions in...
Highly Scalable Text Mining – Parallel Tagging Application

Firat Tekiner

School of Computing, Engineering and Physical Sciences, UCLan, Preston, PR1 2HE, UK.

[email protected]

Yoshimasa Tsuruoka, Jun'ichi Tsujii, Sophia Ananiadou

National Centre for Text Mining, School of Computer Science, University of Manchester.

Abstract: There is an urgent need to develop new text mining solutions using High Performance Computing (HPC) and grid environments to tackle the exponential growth in text data. Problem sizes increase by the day with the addition of new text documents. Labelling sequence data, as in part-of-speech (POS) tagging, chunking (shallow parsing) and named entity recognition, is one of the most important tasks in text mining. GENIA is a POS tagger specifically tuned for biomedical text, built on maximum entropy modelling and a state-of-the-art tagging algorithm. A parallel version of the GENIA tagger has been implemented and its performance compared on a number of different architectures, with a particular focus on scalability. Scaling to 512 processors has been achieved, and a method to scale to 10,000 processors is proposed for massively parallel text mining applications. The parallel implementation uses MPI to keep the code portable.

Keywords: HPC, Text Mining, Parsing, Parallel Computing.

I. INTRODUCTION

The continuing rapid growth of data and knowledge expressed in the scientific literature has spurred huge interest in text mining (TM). The individual researcher cannot easily keep up with the literature in their domain, and knowledge silos further prevent integration and cross-disciplinary knowledge sharing [1]. The expansion to new domains and the increase in scale will massively increase the amount of data to be processed by TM applications (from gigabytes to terabytes).

This work investigates approaches using high performance computing (HPC) to tackle the data deluge facing large-scale TM applications. Although TM applications are data independent, handling large text collections becomes an issue once full-text data is considered, because of the problem sizes involved. Each step in the TM pipeline adds further information to the initial raw text, so data size grows as processing progresses.

Initially, text mining efforts focused on abstracts, because information density is greatest in the abstract compared to other sections of the text [2]. Furthermore, access to abstracts is easier, and considerably less storage and computational resources are needed [3][4]. However, processing full-text articles instead of abstracts will allow researchers to discover hidden relationships that were not known before. Recent studies showed that an abstract's length is on average 3% of the entire article [5] and includes only 20% of the useful information that can be learned from the text [6][7]. TM applications are widely applied in the biology domain, and these applications will benefit from the additional information available in full-text documents [8][9].

Medline is a huge database of around 17 million references to articles. The collection of Medline abstracts contains around 1.7 billion words and is around 7GB (compressed) before processing. The output generated after processing the Medline abstracts with our TM tools is around 400GB, and we expect the output of processing full Medline articles to be on the order of tens of terabytes.

This work is the first step towards a general framework for large-scale text mining. The aim is to create a suite of text mining applications, based on state-of-the-art TM approaches, that exploit a number of HPC and Grid architectures to process terabytes of text in reasonable time.

The initial work focuses on tagging in HPC and Grid environments. We show how the tagging application has been scaled linearly to 96 processing cores, which will serve as the basis for the framework. The application is portable, as it is based on the de facto standard Message Passing Interface (MPI), and it has already run on three different HPC and Grid architectures without modification.
However, when scaling to larger numbers of processors, data and work distribution also become an issue, and more sophisticated load distribution models will need to be investigated. Given the unstructured nature of the data, this will become a major issue. In this work we lay the foundations of a parallel TM framework that should enable processing terabytes of text in reasonable time. The paper presents and discusses the associated challenges, the progress to date and the future work needed to handle full papers.

978-1-4244-3428-2/09/$25.00 ©2009 IEEE

The paper is organised as follows: Section 2 provides background on the TM pipeline and the tagger; Section 3 focuses on the parallel implementation and the execution environment; this is followed by experimental results and discussion in Section 4; finally, Section 5 concludes and discusses future work.

II. BACKGROUND

TM normally involves sequential processing of documents and the data generated from them. First, the documents are processed with natural language processing (NLP) techniques to analyse the linguistic structure of the sentences. The documents are then passed to an information extraction (IE) engine, which generates data by semantically analysing them. NLP is becoming increasingly important for accurate information extraction/retrieval from scientific literature [10]. The role of NLP in TM is to supply the IE phase with sophisticated linguistic analyses, often by annotating documents with information such as sentence boundaries, part-of-speech tags and deep semantic parsing results, which can then be read by the IE tools. Each of these steps adds further information to the initial raw text, and data size increases as processing proceeds. The data generated after every step is either saved to disk for future use or passed to the next step for further processing.

Our approach focuses on data parallelism, as task parallel approaches in this area do not provide the desired speed-up [11]. Dynamic work distribution with master/slave models (particularly task farming) appears ideally suited to supercomputing and grid resources, employing data parallel approaches to process unbalanced data sets (the length and structure of the sentences is not known before processing starts) [12][13]. Furthermore, the I/O requirements and I/O usage of each stage need to be balanced to achieve the optimum outcome. Therefore, in this work we apply a master/slave approach to parallelise applications, discussed in detail in Section 3.

Our focus is on part-of-speech (POS) tagging, often one of the first processing steps in language-based text mining.
This process adds appropriate linguistic knowledge to the text to assist further analysis by other tools: knowing the lexical class of a word makes deeper linguistic analysis, such as parsing, much easier [15][16][17]. The first stage involves tokenising the text by splitting it into a sequence of single-word units and punctuation. This includes splitting hyphenation, parentheses, quotations and contractions, which can otherwise cause errors in POS tagging algorithms. At this point it is possible to introduce linguistic stemming into the annotation, which predicts the base form of a word to assist later analysis or searching [18]. To ensure high accuracy, it is recommended that any tagging software be trained on annotated texts from the same domain as the target documents. Since this process comes early in the TM chain, any errors at this stage may grow cumulatively, so a highly accurate POS tagger is important [17][19].
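The tokenisation step described above can be sketched in a few lines of Python (illustrative regex rules only; the actual GENIA tokeniser is more elaborate, and the `tokenise` name is hypothetical):

```python
import re

def tokenise(sentence: str) -> list[str]:
    """Split a sentence into word and punctuation tokens.

    Illustrative rules only: separates hyphens, parentheses,
    quotations and common English contractions, as described
    in the text; the real GENIA tokeniser is more elaborate.
    """
    # Split contractions such as "doesn't" -> "does", "n't"
    sentence = re.sub(r"(\w)n't\b", r"\1 n't", sentence)
    # Put spaces around punctuation, hyphens, parentheses, quotes
    sentence = re.sub(r"([()\".,;:?!-])", r" \1 ", sentence)
    return sentence.split()

print(tokenise("IL-2 (interleukin-2) doesn't bind."))
```

Each token then becomes a unit for the tagger; note that the contraction is split before the punctuation pass so that the apostrophe in "n't" survives as part of its token.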

III. IMPLEMENTATION AND ENVIRONMENT

The parallel implementation of the GENIA Tagger works as follows. Abstracts are first cleaned, prepared and stored as an ASCII text file. A rule-based sentence splitter developed in-house is then applied to separate the sentences. Each sentence is written to a new line, and a blank line is inserted between abstracts so that the end of each abstract can be detected. This process is not computationally expensive and completes in less than a minute for a hundred thousand abstracts; therefore, to retain the interoperability and portability of the tools, we have not integrated it into the GENIA Tagger itself. Figure 1 shows an abstract view of how the parallel implementation works. Once the data is cleaned and prepared, the master node reads the split abstracts, packs them into groups of sentences (i.e. entire abstracts) and sends them to the slave nodes. The master continues to read and distribute the data until it reaches the end of the abstracts.
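The input format the master consumes (one sentence per line, a blank line between abstracts) suggests a simple grouping routine; the following Python sketch, with hypothetical names, illustrates how the cleaned file might be packed into per-abstract work units:

```python
def read_work_units(lines):
    """Group sentence lines into per-abstract work units.

    Input format as described in the text: one sentence per line,
    with a blank line marking the end of each abstract. Each work
    unit (one whole abstract) is what the master sends to a slave.
    """
    unit = []
    for line in lines:
        line = line.rstrip("\n")
        if line:                  # another sentence of the current abstract
            unit.append(line)
        elif unit:                # blank line: abstract boundary
            yield unit
            unit = []
    if unit:                      # last abstract may lack a trailing blank line
        yield unit

sample = ["First sentence.\n", "Second sentence.\n", "\n",
          "Another abstract.\n"]
print(list(read_work_units(sample)))
```

Because the routine is a generator, the master can stream work units out to slaves without holding the whole collection in memory.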

Figure 1: GENIA tagger parallel implementation

Each slave node, meanwhile, loads the probabilistic models obtained by training the application on annotated data, then waits for data from the master. When a slave receives an abstract, it splits it into sentences, as the GENIA tagger works on one sentence at a time. Once the abstract has been processed, the POS-tagged abstract is sent back to the master with a non-blocking send, so the slave can continue with the next abstract without waiting for the send to complete.
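The master/slave scheme described above can be illustrated with a thread-based task farm in plain Python (the actual implementation uses MPI; all names here are hypothetical, and `tag_abstract` merely stands in for the GENIA tagger):

```python
from queue import Queue
from threading import Thread

def tag_abstract(abstract):
    # Stand-in for tagging one abstract sentence by sentence.
    return [s.upper() for s in abstract]

def worker(tasks, results):
    # Each slave: take an abstract, tag it, return the result,
    # then immediately pick up the next work unit.
    while True:
        idx, abstract = tasks.get()
        if abstract is None:          # poison pill: no more work
            break
        results.put((idx, tag_abstract(abstract)))

def task_farm(abstracts, n_workers=4):
    tasks, results = Queue(), Queue()
    threads = [Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in enumerate(abstracts):  # master distributes work units
        tasks.put(item)
    for _ in threads:                  # one stop signal per worker
        tasks.put((None, None))
    for t in threads:
        t.join()
    out = [None] * len(abstracts)      # reassemble in original order
    while not results.empty():
        idx, tagged = results.get()
        out[idx] = tagged
    return out

print(task_farm([["a b."], ["c d.", "e f."]], n_workers=2))
```

Indexing each work unit lets the master reassemble results in input order even though slaves finish at unpredictable times, which is the same property the unbalanced sentence lengths demand of the MPI version.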

              BlueGene/L     BlueGene/P     Cray XT4         Cluster
CPU           700MHz IBM     850MHz IBM     2.8GHz AMD       3.0GHz Intel
                                            Opteron          Xeon
Memory/Core   1GB            2GB            3GB              1GB
Interconnect  1.4GB/s        3.4GB/s        7.6GB/s          1GB/s
I/O System    GPFS           GPFS           Parallel Lustre  RAID Disk
SMP Nodes     None           4              4                4
No of Cores   2048           4096           11328            128
O/S-Compiler  Linux,         Linux,         Linux,           Linux,
              xlC/C++ v8.0   xlC/C++ v8.0   PGI v7.04        Intel v10.0

Table 1: HPC Architectures

Each of these architectures is unique in terms of its processing capability, interconnect and file system. Given the amount of I/O performed, a number of approaches were investigated, and we found that for a text mining application it is best to use a single I/O node, because the application is data parallel and has a high processing time per data element. However, we believe this approach is also the reason we could not scale beyond 512 processors. We therefore propose a hierarchical approach, in which multiple master/worker groups are used to scale to thousands of processors.
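The proposed hierarchy can be sketched as a simple partitioning of processor ranks into sub-master groups (hypothetical layout code, not part of the measured implementation):

```python
def hierarchical_layout(n_procs, group_size):
    """Partition processor ranks into sub-master groups.

    Rank 0 is the top-level master; each remaining block of
    `group_size` ranks gets its own sub-master (the first rank of
    the block) feeding the workers in that block. A hypothetical
    layout illustrating the proposed hierarchy.
    """
    layout = {}
    ranks = list(range(1, n_procs))
    for i in range(0, len(ranks), group_size):
        block = ranks[i:i + group_size]
        layout[block[0]] = block[1:]   # sub-master -> its workers
    return layout

# 9 processors in groups of 4: two sub-masters with 3 workers each
print(hierarchical_layout(9, 4))
```

Each sub-master would perform I/O and distribution for its own block, so no single node has to feed thousands of workers directly.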

IV. RESULTS AND DISCUSSION

To test and run the simulations we used a local HPC system at the University of Manchester. The Bull Itanium2 system has 192 Itanium2 processing cores available to users, with a further 16 cores reserved for the system. It is configured as 24 compute nodes, each with four Intel Itanium2 Montecito dual-core 1.6GHz/8M-cache processors (i.e. 8 cores per node) and 16GB RAM. Each node also has up to 512GB of local scratch disk space, which is hidden from the user. Four of these nodes are configured as high-memory nodes with 32GB RAM. All nodes are connected by a single-rail Quadrics QsNetII (Elan4) interconnect. The system has around 10TB of central file store based on the Lustre distributed file system. MPI is supported across the whole system, with OpenMP possible within nodes. The operating system on Horace is Red Hat Enterprise version 3. The application was compiled with the Intel compiler 9.1 at the highest optimisation level.

Dataset                    Lines        Words          Characters
1 Million Abstracts        9,642,050    192,938,020    1,317,747,750
Entire Medline Abstracts   129,559,448  1,766,364,087  12,220,578,650

Figure 2 shows that our application scaled linearly for up to 96 processors. It can be concluded from the figure that the application scales both as the number of processors is increased and as the problem size is increased.

[Figure 2: GENIA tagger scaling. Processing time versus number of cores (16 to 1024) for Cray XT4, BG/L and BG/P.]

For example, ten thousand abstracts take around 30 minutes on a single processing core, whereas the parallel version takes around 27 seconds on 64 processors. This indicates good scaling, owing to the data-independent nature of the application. Processing could be parallelised at the sentence level, which would make the tasks easy to distribute. However, early experiments showed that sending small amounts of data too frequently (i.e. the master distributing data sentence by sentence) gave worse overall processing times than sending larger chunks (i.e. whole abstracts). This is due to poorer bandwidth utilisation (each MPI message carries a header) and the latency incurred by establishing communication between sender and receiver for every message. The time taken to process a given abstract is not known until processing starts, as it depends not only on the number and length of the sentences but also on their structure.

On the other hand, when the problem size is increased tenfold, from around ten thousand abstracts to a hundred thousand, the results show that the processing time also increases tenfold. Although the size of a text dataset depends on the length of its abstracts, we can assume that an average abstract has a certain number of words and sentences, and this is reflected in the experimental results shown in Figure 2.

Figure 3 shows the speedup gained as resources are increased: as the number of processors grows, processing time is reduced correspondingly, so the application can be said to scale linearly. This was an expected result given the data-independent nature of the application, the size of the dataset and the limited number of processors used.

However, the aim of our work is to process problem and data sizes almost a thousand times larger than the examples given above. Hence the application needs to scale to thousands of processors to process terabytes of data in reasonable time. Furthermore, one bottleneck will be maintaining and handling the data, given its size: disk and network I/O will be an issue, as will the interconnect of the supercomputing machine.
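The speed-up figures quoted above can be checked with a few lines of arithmetic (the efficiency slightly above 1 is an artefact of the rounded timings quoted in the text, not measured superlinear scaling):

```python
def speedup(t_serial, t_parallel, n_procs):
    """Return (speedup, parallel efficiency) from two timings."""
    s = t_serial / t_parallel
    return s, s / n_procs

# Figures quoted in the text: ~30 min serially vs ~27 s on 64 cores
s, eff = speedup(30 * 60, 27, 64)
print(round(s, 1), round(eff, 2))   # prints 66.7 1.04
```

A speedup of roughly 66x on 64 cores is consistent with the near-linear scaling claimed for the data-independent workload.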

[Figure 3: GENIA tagger achieved speed-up. Speedup versus number of cores (16 to 1024) for Cray XT4, BG/L, BG/P and Cluster, with the ideal linear speedup for comparison.]

V. CONCLUDING REMARKS

In this work, the parallel TM application developed has achieved linear scalability. Scaling to 96 processors has been achieved using data-independent approaches, and a hundred thousand abstracts have been processed in less than 5 minutes, whereas serial processing would take around 8 hours. This was an expected result, given the data-independent behaviour of the algorithms in consideration.

ACKNOWLEDGMENT

We thank the DEISA Consortium (www.deisa.eu), co-funded through the EU FP6 project RI-031513 and the FP7 project RI-222919, for support within the DEISA Extreme Computing Initiative.

REFERENCES

[1] S. Ananiadou and J. McNaught, "Introduction to Text Mining in Biology", in S. Ananiadou and J. McNaught (eds), Text Mining for Biology and Biomedicine, pp. 1-12, Artech House Books, 2006.
[2] L. Shi and F. Campagne, "Building a protein name dictionary from full text: a machine learning term extraction approach", BMC Bioinformatics, 6:88, 2005.
[3] E. P. G. Martin, E. G. Bremer, M. C. Guerin, C. DeSesa and O. Jouve, "Analysis of Protein/Protein Interactions Through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Text Articles", KELSI 2004, pp. 96-108.
[4] P. K. Shah, C. Perez-Iratxeta and M. A. Andrade, "Information extraction from full text scientific articles: Where are the keywords?", BMC Bioinformatics, 4:20, 2003.
[5] D. P. A. Corney, B. F. Buxton, W. B. Langdon and D. T. Jones, "BioRAT: extracting biological information from full-length papers", Bioinformatics, 20(17):3206-3213, 2004.
[6] J. Natarajan, D. Berrar, W. Dubitzky, C. Hack, Y. Zhang, C. DeSesa, J. R. Van Brocklyn and E. G. Bremer, "Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line", BMC Bioinformatics, 7:373, 2006.
[7] M. J. Schuemie, M. Weeber, B. J. A. Schijvenaars, E. M. van Mulligen, C. C. van der Eijk, R. Jelier, B. Mons and J. A. Kors, "Distribution of information in biomedical abstracts and full-text publications", Bioinformatics, 20:2597-2604, 2004.
[8] M. Hilario, A. Mitchell, J.-H. Kim, P. Bradley and T. Attwood, "Classifying Protein Fingerprints", PKDD 2004, Pisa, Italy.
[9] S. Ananiadou, D. B. Kell and J. Tsujii, "Text mining and its potential applications in systems biology", Trends in Biotechnology, 24(12), 2006.
[10] Y. Miyao, T. Ohta et al., "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases", Coling/ACL 2006, Sydney, Australia.
[11] T. Ninomiya, K. Torisawa and J. Tsujii, "An Agent-based Parallel HPSG Parser for Shared-memory Parallel Machines", Journal of Natural Language Processing, 8(1):21-48, January 2001.
[12] X. Qin, "Performance Comparisons of Load Balancing Algorithms for I/O-Intensive Workloads on Clusters", Journal of Network and Computer Applications, July 2006.
[13] H. Gonzalez-Velez, "Self-adaptive skeletal task farm for computational grids", Parallel Computing, 32(7-8):479-490, September 2006.
[14] T. Matsuzaki, Y. Miyao and J. Tsujii, "Efficient HPSG Parsing with Supertagging and CFG-filtering", Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), January 2007.
[15] K. Toutanova, D. Klein, C. Manning and Y. Singer, "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network", pp. 467-474, 2003.
[16] Y. Tsuruoka, Y. Tateishi et al., "Developing a robust part-of-speech tagger for biomedical text", Advances in Informatics, Proceedings 3746:382-392, 2005.
[17] K. Yoshida, "Ambiguous Part-of-Speech Tagging for Improving Accuracy and Domain Portability of Syntactic Parsers", Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), 2007.
[18] D. A. Hull, "Stemming algorithms: A case study for detailed evaluation", Journal of the American Society for Information Science, 47(1):70-84, 1996.
[19] A. Yakushiji, Y. Miyao et al., "Biomedical information extraction with predicate-argument structure patterns", First International Symposium on Semantic Mining in Biomedicine, 2005.

