Date post: | 30-Aug-2014 |
Category: |
Technology |
Upload: | tommorris |
Don’t Scrape, Glean.
Tom Morris
Scraping sucks.
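def lastlogin
  # Find the table cell, split on line breaks, keep the last 10 characters of the 10th piece
  date = (@hmodel/"//td[@class='text'][@width='193']").first.innerHTML.split("<br />")[9].strip[-10..-1]
  # Reassemble the date as YYYY-MM-DD
  return date[-4..-1] + "-" + date[-7..-6] + "-" + date[-10..-9]
end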
Hpricot for ‘Last login’ date on
MySpace.
try: lastlogin = self.soup.findAll(True, {"width": "193"})[0].br.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.nextSibling.string loginregex = re.compile( r"[0-9]/[0-9]+/[0-9]*") loginregex_inst = loginregex.search(lastlogin) if loginregex_inst is not None: self.lastlogin = loginregex_inst.group() except: pass
Taken from a Python/BeautifulSo
up library.
(The Ruby is prettier, but who’s
counting?)
getElementsByClassName("foo")[0].children
It’s an edge case. MySpace’s HTML is
worse than average.
But it is an ugly recipe for mental
turmoil.
The alternative?
flickr.getPhotos()
And you get back nice XML or JSON (or even SOAP!)
But ‘D.R.Y.’! APIs break that
principle.
This is the data equivalent of the
‘accessible version’.
Enter GRDDL.
GRDDL defines a transformation
process for XHTML » RDF.
XHTML? That’s what the
spec says.
HTML 4 works too. Tidy!
RDF? Yes. Trust me. It’s not evil.
GRDDL can work like a data stylesheet
on top of your HTML.
You simply use HTML (or XML) in the normal way...
...and define how the data
transformation works.
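How a page "defines" its transformation can be sketched roughly as follows. In GRDDL, an XHTML page opts in by naming the data-view profile on its head element and pointing at a transformation with a link whose rel is "transformation". The Python below only does the discovery step, using the standard library. The sample page and the example.org stylesheet URL are made up for illustration, and a real GRDDL agent would go on to fetch and run the stylesheet.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def find_transformations(xhtml_text):
    """Return the hrefs of GRDDL transformations declared in an XHTML page."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []
    # The page opts in to GRDDL by listing the data-view profile on <head>.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    hrefs = []
    for link in head.findall(f"{{{XHTML_NS}}}link"):
        # rel is a space-separated list of tokens.
        if "transformation" in link.get("rel", "").split():
            href = link.get("href")
            if href:
                hrefs.append(href)
    return hrefs


# Hypothetical page: the stylesheet URL is an assumption, not a real resource.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Links</title>
    <link rel="transformation" href="http://example.org/glean-nsfw.xsl" />
    <link rel="stylesheet" href="style.css" />
  </head>
  <body><p>Hello</p></body>
</html>"""

transformations = find_transformations(PAGE)
print(transformations)
```

The ordinary stylesheet link is ignored; only the link marked as a transformation is picked up.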
You can even use it as a bridge for
existing APIs and services.
Could even be used
for other formats than RDF. Atom?
Simple example: ‘Not Safe For Work’
I can write that. I can’t write xFolk
by hand.
Is ‘nsfw’ a good class name? No.
Do I care? No.
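A minimal sketch of what gleaning that class might look like, assuming the markup is simply a link with class "nsfw". GRDDL transformations are normally written in XSLT; this uses plain Python instead to show the idea. The property URI and the sample links are hypothetical, and hrefs are assumed to be absolute.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Hypothetical property URI, invented for this sketch.
NSFW_PROPERTY = "http://example.org/vocab#notSafeForWork"


def glean_nsfw(xhtml_text):
    """Return one N-Triples line for every link marked with class 'nsfw'."""
    root = ET.fromstring(xhtml_text)
    triples = []
    for a in root.iter(f"{{{XHTML_NS}}}a"):
        classes = a.get("class", "").split()
        href = a.get("href")
        # Assumes href is an absolute URI, as N-Triples requires.
        if href and "nsfw" in classes:
            triples.append(f'<{href}> <{NSFW_PROPERTY}> "true" .')
    return triples


# Hypothetical page with one flagged link and one ordinary link.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Links</title></head>
  <body>
    <p><a class="nsfw" href="http://example.org/risky">Risky</a></p>
    <p><a href="http://example.org/safe">Safe</a></p>
  </body>
</html>"""

triples = glean_nsfw(PAGE)
for line in triples:
    print(line)
```

The author keeps writing ordinary HTML with a class name; the mapping to RDF lives in the transformation, not in the page.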
The data layer becomes
separated like CSS is from HTML.
That’s the theory. Now for the demo.
irc.freenode.net #swig
#swhack