XML processing and website scraping in Java - …samples.leanpub.com/javaxml-sample.pdf · XML...

XML processing and website scraping in Java

How to use JSoup and XMLBeam in practice

Gabor Laszlo Hajba

This book is for sale at http://leanpub.com/javaxml

This version was published on 2017-11-16

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get reader feedback, pivot until you have the right book and build traction once you do.

© 2014 - 2017 Gabor Laszlo Hajba


Tweet This Book!

Please help Gabor Laszlo Hajba by spreading the word about this book on Twitter!

The suggested hashtag for this book is #WebsiteScrapingWithJava.

Find out what other people are saying about the book by clicking on this link to search for this hashtag on Twitter:

#WebsiteScrapingWithJava


https://twitter.com/search?q=%23WebsiteScrapingWithJava


Also By Gabor Laszlo Hajba

CDI

Python 3 in Anger

http://leanpub.com/u/ghajba

http://leanpub.com/cdijavaee7

http://leanpub.com/python3inanger

Contents

Preface
    LeanPub
    What took me the most time?
    Acknowledgement

XML Processing and the Google App Engine
    Why GAE?
    Getting the data
    XML to HTML
    XML to PDF
    XML to RTF
    XML to “.*X”
    Exporting the files in GAE

XML Processing Advanced

XML processing when memory matters

Website scraping with JSoup and XMLBeam

Runtime comparison advanced

Upgrade to Java 8

Custom printing for HTML with JSoup

Printing XMLBeam projections

Preface

This is a book about using XML and HTML processing tools with the Java platform.

I do not want to explain every tool you can find on the internet. That would be overwhelming and bad for the timeline of this book. So I only look at the tools I encountered in my daily work. Sometimes I toyed with those tools outside my daily work and researched something new or tried out some new features.

One of those new features was the performance tuning of the Website Scraper application mentioned in Chapter 3, where I thought about the performance of the application (you’ll see that running it with the whole data set was a pain), changed some of the measurement parts and optimized the tools for a better comparison. At the end I parallelized the whole process with Java 8 and streams.

The tools I will mention and explain a bit are XMLBeam and JSoup. The first one is an awesome XML processing engine, the second one is a website scraper tool. However, you could use XMLBeam to do the same as JSoup, although the query language is a bit bothersome if you do not know XPath.

LeanPub

I publish my books on LeanPub because

• I can distribute new parts faster
• there are only digital versions: fewer trees have to die
• it is cheaper because you need less infrastructure (and I too buy the cheaper books myself)
• every purchased book gets the updates and corrections (no need to look up some errata)

So if someone does not want to pay a few dollars because there won’t be any updates and he/she would have to use some online errata, I can just say: do not worry, here you’ll get every update of the book, and they are already included in the price.

It can happen that I expand the book with some chapters of my own or write another sample application. When that happens, the new version of the book will be available here too. This is because technology changes so fast that after some time a book stuffed with old knowledge won’t be useful.

What took me the most time?

As I was writing this book, most of the time went into building a stable application base that I can use for performance measurement and for displaying the results. I had many ideas in mind, and as I developed the pieces I came upon more and better solutions and things I wanted to show you, so I went on and refactored parts of the code.

This meant that sometimes I had to redo runs with different configurations to have better results.

But I do not regret that I did these errands, if I may call them so. I learned new things and I have many ideas where to go on. Perhaps this will result in another book or more blog articles.

Acknowledgement

I have to thank Sven Ewald, the creator of XMLBeam[1], for reviewing my book’s chapters about XMLBeam. Besides this, he found time to answer my questions and provide me with samples along with his answers.

Because LeanPub currently cannot display the whole book’s Table of Contents in the sample, I use this workaround to show you what you’ll get if you buy this book. I know it is a bit awkward (you get some empty pages in your sample PDF), but this way you can see what comes in the full version.

[1] http://xmlbeam.org


XML Processing and the Google App Engine

In this chapter I’ll introduce you to XML processing and the Google App Engine (GAE).

Why GAE?

This is a good question. Mostly because I’ve worked with the GAE and encountered some problems with it and XML processing. So I thought I could share my problems and solutions with you. Perhaps you are interested in them, or they even help you to solve some problems of your own.

Writing about development for a GAE environment is always kind of “fun” because you have your solutions, and at the end you get a punch in the face from the GAE: some classes you want to use are not permitted. Then you start looking for a solution inside the feasible area.

This was the case when we (a co-worker and I) had the task of rendering an XML document (provided from somewhere somehow; it is not important in the current context) in various formats: PDF and RTF. As a bonus (because rendering those documents was not the easiest thing) I implemented a web-based display too, to see if we get the right data. Visualizing XML as HTML is always the easiest thing. For me at least.

Getting the data

The data came through a SOAP interface in an XML bundle. I will not go into detail on how to access the SOAP interface because it was not the easiest thing, and I’ve written an article on my blog about it some time ago. And SOAP is dead, REST is in, and currently HATEOAS is the new path you should walk when you work with remote structured data.

However XML is a good structured data format which you can use in many ways.

As you’ll see later, we needed an XML parser to extract the data from the transmitted XML. For this I created a quick and easy XML extractor which took the XML and extracted the required data into objects with some XPath expressions. It was not the best solution but it was the least time-consuming one. And it was good practice for me to work with XML.
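To make the idea of such an extractor more concrete, here is a minimal sketch that uses only the XPath API shipped with the JDK. The XML structure, the Person class and the XPath expressions are assumptions made up for this illustration; they are not the data or the classes of the original project.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical value object; the field names are assumptions for illustration.
final class Person {
    final String name;
    final String city;

    Person(String name, String city) {
        this.name = name;
        this.city = city;
    }
}

final class XPathExtractor {

    // Evaluates one XPath expression against the XML text and returns its string value.
    static String extract(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate: " + expression, e);
        }
    }

    // Maps the (assumed) <person> document onto the value object.
    static Person toPerson(String xml) {
        return new Person(
                extract(xml, "/person/name"),
                extract(xml, "/person/address/city"));
    }

    public static void main(String[] args) {
        String xml = "<person><name>Jane</name>"
                + "<address><city>Vienna</city></address></person>";
        Person person = toPerson(xml);
        System.out.println(person.name + " lives in " + person.city);
    }
}
```

The sketch parses the XML once per expression, which is quick to write but wasteful. For larger documents you would probably parse the input into a DOM once and evaluate all expressions against that.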

XML to HTML

As I mentioned, this was not a requirement, but I wanted to see results as soon as possible, so I added an HTML display of the XML input.

Converting XML to HTML is easy: you only need to do an XSL Transformation (XSLT) and then you are done. The result you get is an HTML file (or XML or text, depending on your configuration). But for GAE this is a no-go because you are not allowed to create files dynamically from your application.
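For readers who have not used XSLT from Java before, the following sketch shows what such a transformation looks like with the javax.xml.transform classes of the JDK. It writes the result into a String instead of a file. The stylesheet and the input document are invented for this example, and the text does not confirm whether this variant is permitted on GAE, so treat it as a plain illustration of the technique.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class InMemoryXslt {

    // Applies the stylesheet to the XML and returns the result as a String.
    static String transform(String xml, String xslt) {
        StringWriter out = new StringWriter();
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up stylesheet: renders the name of a <person> as a heading.
        String xslt = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"html\"/>"
                + "<xsl:template match=\"/person\">"
                + "<html><body><h1><xsl:value-of select=\"name\"/></h1></body></html>"
                + "</xsl:template></xsl:stylesheet>";
        System.out.println(transform("<person><name>Jane</name></person>", xslt));
    }
}
```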

Nevertheless, you can still display your XML data as an HTML page: you only have to add the stylesheet to your data and most browsers will display it correctly.

How to add the stylesheet?

You have to add a processing instruction that references the stylesheet to your XML data. For example:

<?xml-stylesheet type="text/xsl" href="stylesheets/detailHtml.xsl"?>

to transform the XML to HTML with XSLT (detailHtml.xsl contains the transformation information).

If you get your data from an interface (for example from a SOAP service) you have to be a bit tricky to get your XSL into your XML, because you get all of the data in one XML dataset. One solution is to replace the opening root node with the stylesheet node followed by the root node itself. With this workaround you can alter the XML dataset and display it along with XSLT. And this works with GAE too.

    String rootNode = "<rootNode>";
    // Strings are immutable: replaceFirst returns a new String, so keep the result
    xmlString = xmlString.replaceFirst(rootNode,
            "<?xml-stylesheet type=\"text/xsl\" href=\"stylesheets/detailHtml.xsl\"?>" + rootNode);

The example above is a little hack but you have to do this to add the stylesheet to the XML data.
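If the root element carries attributes or its name is not known at compile time, the same hack can be wrapped into a small helper. This is only a sketch: the class and method names are invented for this example, and it still works on the raw string instead of parsing the XML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StylesheetInjector {

    // Inserts an xml-stylesheet processing instruction directly before the
    // first opening tag of the given root element.
    static String addStylesheet(String xml, String rootName, String href) {
        String instruction = "<?xml-stylesheet type=\"text/xsl\" href=\"" + href + "\"?>";
        // "<rootName" must be followed by whitespace, ">" or "/" so that root
        // elements with attributes match, but longer element names do not.
        Pattern openingTag = Pattern.compile("<" + Pattern.quote(rootName) + "(?=[\\s>/])");
        Matcher matcher = openingTag.matcher(xml);
        if (!matcher.find()) {
            return xml; // root element not found: leave the input untouched
        }
        return xml.substring(0, matcher.start()) + instruction + xml.substring(matcher.start());
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><rootNode id=\"1\"><item/></rootNode>";
        System.out.println(addStylesheet(xml, "rootNode", "stylesheets/detailHtml.xsl"));
    }
}
```

Because the instruction is placed before the root element and not at the very start of the string, an existing XML declaration stays the first thing in the document.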

XML to PDF

Converting XML to PDF is simple too: with XSLT you create an XSL-FO (FO for Formatting Objects) document from your XML. An FO document is an XML document using element names (node names) from the FO namespace. After this you can send the resulting FO document to a render engine (for example Apache FOP) and you get your PDF.

It sounds simple; however, GAE does not allow some of the classes used by Apache FOP (for example AWT graphics). So another workaround is needed.

iText is a good alternative to FOP; however, it does not handle FO documents. Nevertheless, iText has an XmlWorker project which can be used to render XML (XHTML) documents. This sounded very good, so I gave it a try. To get XHTML from the XML I again used XSLT.

Unfortunately I had some problems with applying the required CSS to the XHTML output (some of the rules worked, some did not) and as far as I can remember the XmlWorker had some problems with displaying the required images too. Besides the images, there was a requirement to use specific fonts when displaying the texts, and this is hardly manageable too when it comes to XHTML to PDF conversion (or at least I did not find a good-enough solution).


So I ended up creating the PDF manually with iText, adding each element on its own, programmatically. To achieve this I created a custom XML extractor which split the provided XML result document into some classes (grouped by coherence) and added display information to these classes.

This was the least time-consuming solution. Alternatively I could have taken a look at Flying Saucer (which has the same purpose as the XmlWorker: to create PDF from XHTML), but as I mentioned we needed the data as quickly as possible. However, if I get some free time between my projects I’ll take a look at Flying Saucer and try out how good it is at generating the required PDF from XHTML.

XML to RTF

The second requirement was to create an RTF document. Why RTF? Because it can be displayed on various platforms (Windows, Mac OS, Linux), there are some open source tools to create RTF documents, and it had been used successfully in another project.

Well. What you have to know about RTF is: it was a Microsoft standard until 2008. Since then Microsoft has given it up and does not improve it any further. They work on their new standard (“.*X” is my shorthand for .docX, .xlsX, .pptX). Besides this, RTF was never supported 100% by any document editor other than MS Word. And the tool which had been used (namely jRTF) does not implement all of the features, so creating documents with it can be a pain in the neck.

iText had an RTF generator too (alongside the PDF generator) but it isn’t being improved anymore. So we did not even try to use it for our purposes.

But we did it, as far as we could. Some features (like embedding fonts) do not work, so we had to loosen the requirements for the RTF documents. We used jRTF in the end because it was the only tool which was available and fairly up-to-date. And as I mentioned previously: it was a pain in the neck to create the RTF documents. For example: pictures in the document are displayed as a single line in my Mac’s Open Office. Not the best thing, is it?

You could ask why RTF documents are needed if you have a PDF. Well, the PDF can contain too much information, or the display order of the data might not be the best (a good example of this is a CV or some management reports), and the users want to alter the document. An alternative would be a customizable application where you could pick and order the data you want to display. I suggested this feature but it was rejected because it would have needed more time to finish the task.

XML to “.*X”

Yes. Finally here it comes. I mentioned Microsoft’s new standard for documents above. And there is a possibility that the management will decide to create a Word document from the provided XML data.

Currently we’re evaluating the possibilities and features of frameworks such as Apache POI and docx4j because we already have the data extracted from the XML to create the docx programmatically. In parallel we are evaluating a solution to transform the XML data into a docx with XSLT. If we get any results I’ll publish a book update about the topic, but currently there is nothing in sight.

Exporting the files in GAE

As I mentioned previously: GAE does not allow writing files to the file system. So if you want to create a document (PDF, RTF, docX or anything else) you have to return it from your web application instead of saving it to the file system of the server. How do you do that?

If you have a class extending javax.servlet.http.HttpServlet in GAE this is easy.

Most of the libraries let you write the document (PDF or RTF; let’s stick to these two) to a java.io.OutputStream. If you use a java.io.ByteArrayOutputStream you can convert the result to a byte array (byte[]), and you use this byte array to enable downloading the created document. Let me show some example code with the overridden doGet method and a PDF document:

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // set the content type of the result
        resp.setContentType("application/pdf");
        // set the name of the resulting PDF file; "inline" asks the browser to display it
        resp.setHeader("Content-Disposition", "inline; filename=Result.pdf");
        // the headers are set, now write the document to the response
        OutputStream os = resp.getOutputStream();
        // createPDF() returns a ByteArrayOutputStream
        os.write(createPDF().toByteArray());
        os.flush();
        os.close();
    }
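The part that happens before the servlet writes its response can be tried out without a servlet container. The following sketch uses a stand-in for the document library: renderDocument is a hypothetical method that writes a tiny RTF snippet to whatever OutputStream it gets, much like a PDF or RTF library would do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class InMemoryExport {

    // Stand-in for a document library (an assumption for this example):
    // it only needs an OutputStream and does not care where the bytes end up.
    static void renderDocument(OutputStream target) throws IOException {
        target.write("{\\rtf1 Hello}".getBytes(StandardCharsets.US_ASCII));
    }

    // Renders into memory instead of a file and returns the raw bytes.
    static byte[] renderToBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            renderDocument(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] document = renderToBytes();
        System.out.println("Document size in bytes: " + document.length);
    }
}
```

In a servlet the returned array would be handed to the response’s output stream, so no file ever has to be created on the server.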

If you want to export an RTF you have to alter the content type as follows:

    // set the content type of the result
    resp.setContentType("application/rtf");
    // set the name of the resulting RTF file
    resp.setHeader("Content-Disposition", "inline; filename=Result.rtf");

I included the setHeader call only to show that it is a good idea to change the file extension from .pdf to .rtf as well.

With these settings RTF files will be downloaded, while PDF files are only displayed in the browser (if you have this option enabled and your browser is capable of rendering the PDF file without downloading it). However, if you do not want the browser to display the PDF file but just download it, you can extend the content disposition as follows:


// set the name of the resulting PDF file

resp.setHeader("Content-Disposition", "attachment; filename=Result.pdf");
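If the same servlet has to serve both variants, the header value can be built by a small helper. This is a sketch with invented names; the two disposition types it uses, inline and attachment, are the ones defined for the Content-Disposition header in RFC 6266.

```java
final class ContentDisposition {

    // Builds a Content-Disposition value: "attachment" forces a download,
    // "inline" asks the browser to display the document if it can.
    static String headerValue(boolean download, String fileName) {
        String type = download ? "attachment" : "inline";
        // Quote the file name so that names containing spaces stay intact;
        // double quotes inside the name are dropped to keep the value valid.
        return type + "; filename=\"" + fileName.replace("\"", "") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(headerValue(true, "Result.pdf"));
        System.out.println(headerValue(false, "Result.pdf"));
    }
}
```

In the servlet this would be used as resp.setHeader("Content-Disposition", ContentDisposition.headerValue(true, "Result.pdf")).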

XML Processing Advanced

XML processing when memory matters

Website scraping with JSoup and XMLBeam

Runtime comparison advanced

Upgrade to Java 8

Custom printing for HTML with JSoup

Printing XMLBeam projections
