[320] Web 3: Selenium

Tyler Caraza-Harter

Review Decorators

def cache(fn):
    results = {}
    def wrapper(*args):
        if args not in results:
            rv = fn(*args)
            results[args] = rv
        return results[args]
    return wrapper

@cache
def add(x, y):
    print("ADD")
    return x+y

@cache
def range_sum(limit):
    total = 0
    for i in range(limit):
        total += i
    return total

print(add(1,2))
print(add(3,4))
print(add(1,2))

print(range_sum(50000000)) #1
print(range_sum(50000000)) #2

what is printed?

which call is faster?

how many functions get defined total?

how many results dicts will there be?
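One way to check answers to the questions above is to instrument the decorator. The sketch below repeats `cache` with one addition that is not on the slide: the wrapper exposes its `results` dict as an attribute, purely so it can be inspected. `add` records its calls in a list instead of printing, and `double` is a made-up second function.

```python
def cache(fn):
    results = {}                      # a fresh dict for every decorated function
    def wrapper(*args):
        if args not in results:       # miss: run the real function once
            results[args] = fn(*args)
        return results[args]          # hit: reuse the stored value
    wrapper.results = results         # assumption: exposed only for inspection
    return wrapper

calls = []                            # every time add's body actually runs

@cache
def add(x, y):
    calls.append((x, y))
    return x + y

@cache
def double(x):                        # hypothetical second cached function
    return 2 * x

first = add(1, 2)
second = add(3, 4)
third = add(1, 2)                     # same args as the first call
double(5)

print(first, second, third)
print(calls)                          # which calls reached add's body
print(add.results)
print(add.results is double.results)  # do the two functions share a dict?
```

Each use of `@cache` calls `cache` once, so each decorated function gets its own `wrapper` and its own `results`.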

Review Document Object Model

url: http://domain/rsrc.html

HTTP Response

HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 74

<html>
<head><script>...</script></head>
<body>
<h1>Welcome</h1>
<a href="about.html">About</a>
<a href="contact.html">Contact</a>
</body>
</html>

Browser

What does a web browser do when it gets some HTML in an HTTP response?

url: http://domain/rsrc.html

HTTP Response

HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 74

<html>
<body>
<h1>Welcome</h1>
<a href="about.html">About</a>
<a href="contact.html">Contact</a>
</body>
</html>



Before displaying a page, the browser uses the HTML to generate a Document Object Model (DOM tree).

html
  body
    h1
    a
    a

vocab: elements




html
  body
    h1
    a (attr: href)
    a (attr: href)

Elements may contain
• attributes




html
  body
    h1: Welcome
    a: About
    a: Contact

Elements may contain
• attributes
• text




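html
  body
    h1: Welcome
    a: About
    a: Contact

Elements may contain
• attributes
• text
• other elements

vocab: parent, child (body is the parent of h1 and of both a elements; each of them is a child of body)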
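The three kinds of content an element may hold (attributes, text, other elements) can be seen by walking the example HTML with Python's built-in `html.parser`. This is only a rough sketch of the idea: a real browser builds a much richer tree, and the `TreePrinter` class name is made up for illustration. Indentation stands in for the parent/child relationship.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Records one line per element or text node, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "about.html")]
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1          # whatever comes next is a child of this element

    def handle_endtag(self, tag):
        self.depth -= 1          # back up to the parent
        # note: tags with no closing tag (e.g. <img>) would need special handling

    def handle_data(self, data):
        if data.strip():         # skip whitespace between tags
            self.lines.append("  " * self.depth + "text: " + data.strip())

html = ('<html> <body> <h1>Welcome</h1> <a href="about.html">About</a> '
        '<a href="contact.html">Contact</a> </body> </html>')

p = TreePrinter()
p.feed(html)
p.close()
print("\n".join(p.lines))
```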




html
  body
    h1: Welcome
    a: About
    a: Contact
    table (added by JavaScript)

JavaScript (if there's an engine to execute it) may directly edit the DOM!

The original .html file doesn't change, but the result is equivalent to loading a file that already contained the change.


Welcome

About Contact

The browser renders (displays) the DOM tree, based on the original file and any JavaScript changes.



Web Scraping: Simple and Complicated

requests vs. Selenium

computer 1 (laptop)

Jupyter:
import requests
r = requests.get(...)

computer 2 (Virtual Machine)

IP address: 18.216.110.65

Flask Application

index.html, please [GET]

<html>
<body>
<img src="A.png">
<b>Hello</b>
<script src="B.js"></script>
</body>
</html>

requests module
- can fetch .html, .js, etc. files

Selenium
- can fetch .html, .js, etc. files
- can run a .js file in browser
- can grab HTML version of DOM after JavaScript has modified it


from selenium import webdriver
driver = webdriver.Chrome()

chromedriver

Flask Application

Note: Selenium is most commonly used for testing websites, but it works great for tricky scraping too.
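The difference between the two tools can be sketched as a pair of helpers: one returns the HTML exactly as the server sent it, the other returns the HTML version of the DOM after a browser has loaded the page. The function names are invented for illustration, and the third-party imports sit inside the functions so that the small `has_tag` check can be tried without either package installed. Content that JavaScript adds later (after a timer, say) may still need a wait before `page_source` shows it.

```python
import re

def raw_html(url):
    """HTML as sent by the server; no JavaScript is run."""
    import requests                      # third-party: pip install requests
    r = requests.get(url)
    r.raise_for_status()                 # fail loudly on 4xx/5xx responses
    return r.text

def rendered_html(url):
    """HTML version of the DOM after a real (headless) browser loads the page."""
    from selenium import webdriver       # third-party: pip install selenium
    from selenium.webdriver.chrome.options import Options
    options = Options()
    options.add_argument("--headless")   # no visible window
    b = webdriver.Chrome(options=options)
    try:
        b.get(url)
        return b.page_source
    finally:
        b.quit()                         # always shut the browser down

def has_tag(html, tag):
    """Crude text check for an opening <tag ...>; not a real HTML parse."""
    pattern = rf"<{re.escape(tag)}[\s>/]"
    return re.search(pattern, html, re.IGNORECASE) is not None

# Possible usage (needs network access, Chrome, and chromedriver):
# url = "https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page1.html"
# print(has_tag(raw_html(url), "table"), has_tag(rendered_html(url), "table"))
```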

Tricky Pages

https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/scrape.html


Install

Selenium Install

computer 1(laptop)


chromedriver

https://chromedriver.chromium.org/downloads

wget https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
echo $PATH
mv chromedriver ~/.local/bin/

sudo apt install chromium-browser

pip install selenium



trh@instance-1:/tmp$ chromium-browser --version
Chromium 80.0.3987.87 Built on Ubuntu , ...
trh@instance-1:/tmp$ chromedriver --version
ChromeDriver 80.0.3987.106 (...)

Check...
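The check can also be done in code. The sketch below assumes the usual guidance that ChromeDriver and the browser should agree on the first three fields of the version number (major, minor, build) while the last field may differ; treat that rule as an assumption to confirm against the ChromeDriver documentation. The helper names are made up, and the two input strings are copied from the terminal output above.

```python
import re

def version_parts(text):
    """Pull the first dotted version number out of a --version line."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return [int(part) for part in match.group(1).split(".")]

def builds_match(browser_line, driver_line, parts=3):
    """Compare the first `parts` fields (assumed rule: major.minor.build)."""
    return version_parts(browser_line)[:parts] == version_parts(driver_line)[:parts]

browser_line = "Chromium 80.0.3987.87 Built on Ubuntu , ..."
driver_line = "ChromeDriver 80.0.3987.106 (...)"

print(version_parts(browser_line))
print(version_parts(driver_line))
print(builds_match(browser_line, driver_line))
```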


Why Drivers?

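Languages: Python, Java, Ruby, JavaScript

Each language has its own Selenium module:
- Python module for Selenium
- Java module for Selenium
- Ruby module for Selenium
- JavaScript module for Selenium

All of the modules work through the same browser drivers:
- Chrome Driver
- Firefox Driver
- Edge Driver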

Examples

Starter Code

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

options = Options()
#options.headless = True
b = webdriver.Chrome(options=options)  # open browser window

b.get(????)  # go to a URL

print(b.page_source)  # get HTML for current page (including JavaScript changes)

try:
    elem = b.find_element_by_id(element_id)  # search for id=???? attributes
    print("found it")
except NoSuchElementException:  # no such element
    print("couldn't find it")

b.close()
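If anything between opening the browser and the final close raises an exception, the browser is left running. A minimal sketch of one way around that, assuming a small context manager is acceptable: `open_browser` and `make_chrome` are hypothetical names, not part of Selenium. The sketch calls `quit()`, which closes every window and ends the driver session, and it uses the `--headless` argument in place of the `headless` attribute.

```python
from contextlib import contextmanager

@contextmanager
def open_browser(make_browser):
    """Create a browser with make_browser() and always shut it down afterwards."""
    b = make_browser()
    try:
        yield b
    finally:
        b.quit()   # runs even if the body of the with-block raised

def make_chrome(headless=True):
    """Hypothetical factory; Selenium is imported only when this is called."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    options = Options()
    if headless:
        options.add_argument("--headless")
    return webdriver.Chrome(options=options)

# Possible usage (needs Chrome and chromedriver; url is whatever page you want):
# with open_browser(make_chrome) as b:
#     b.get(url)
#     print(b.page_source)
```

Passing the factory in as an argument keeps the cleanup logic independent of which driver is used.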

Example 1a: Late Loading Table (page1.html)

https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page1.html

added after 1 second
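Because the table shows up about a second after the page loads, a lookup made right away can fail. One hedged way to handle this is to retry the lookup until it succeeds or a deadline passes. The `wait_for` helper below is a hand-rolled sketch, not a Selenium function (Selenium ships its own `WebDriverWait` for the same job), and looking the table up by tag name is an assumption about the page.

```python
import time

def wait_for(find, retry_on=Exception, timeout=5.0, interval=0.1):
    """Call find() until it stops raising retry_on; re-raise once time is up."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return find()
        except retry_on:
            if time.monotonic() >= deadline:
                raise                 # give up: let the caller see the error
            time.sleep(interval)      # wait a little, then try again

def find_late_table(b, timeout=5.0):
    """Look for a <table> in browser b, retrying while it is missing."""
    # imported here so wait_for can be tried without Selenium installed
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    return wait_for(lambda: b.find_element(By.TAG_NAME, "table"),
                    retry_on=NoSuchElementException,
                    timeout=timeout)
```

With a browser `b` that has already loaded the page, `find_late_table(b)` would return the table element once it exists, or raise `NoSuchElementException` if it never appears within the timeout.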


Example 1b: Headless Mode and Screenshots

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

options = Options()
options.headless = True
b = webdriver.Chrome(options=options)

b.get(????)

from IPython.core.display import Image
b.save_screenshot("out.png")
Image("out.png")

b.close()

Example 2: Auto-Clicking Buttons

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

# same setup as the starter code
options = Options()
b = webdriver.Chrome(options=options)

b.get(????)

btn = b.find_element_by_id("BTN_ID")
btn.click()

b.close()

auto click: https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page2.html


Example 3: Entering Passwords

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

# same setup as the starter code
options = Options()
b = webdriver.Chrome(options=options)

b.get(????)

pw = b.find_element_by_id("pw")
pw.send_keys("fido")

b.close()



Example 4: Many Queries



Date post:	01-May-2020