class: center, middle, inverse, title-slide

# Data analysis II
## Webscraping, advanced
### Laurent Bergé
### University of Bordeaux, BxSE
### 09/12/2021

---
# A large project

- you have to deal with a large project, involving scraping 1000s of webpages

- the pages are in the vector `all_webpages`

- you've written a scraper function, `scrape_format`, that scrapes a single web page and formats it appropriately

--

.block[Q: How would you write the code for this project?]

---
# A large project: Code

If you wrote the following code .comment[or similar]:

```r
n = length(all_webpages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = scrape_format(all_webpages[i])
}
```

--

.fs-22[Then congratulations! 🎉]

--

.fs-22[.bold1[You didn't miss any of the rookie mistakes!]]

---
# What are the problems with my code?

.block[Q: Look closely at the code and find out three ***.underline[major] problems***.]

```r
n = length(all_webpages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = scrape_format(all_webpages[i])
}
```

--

.h-2em[]

This code breaks the .bold1[three laws of webscraping].

---
class: section

# The three laws of webscraping

---
class: fs-30

# The three laws of webscraping

1. Thou shalt not be in a hurry

--

2. Thou shalt separate the scraping from the formatting

--

3. Thou shalt anticipate failure

---
# Thou shalt not be in a hurry

- you would soon have figured out this law by yourself

- requests are costly for the server .comment[they cost run time + bandwidth]

- it is very easy to flood a server with thousands of requests per second: just write a loop

--

.block[Q: What happens when you flood the server with requests?]

--

.block[A: You get your IP blocked .comment[and rightly so]. End of the game..footnote[{star}Mostly temporarily though.]]

---
# Law 1: What to do

- always wait a few seconds between two page requests

- it's a bit of a gentleman's agreement to be *slow* in scraping.footnote[{star}Anyway, your bot will work at night, so who cares if your scraping is not instantaneous?]

- anyway, the incentives are aligned: break this agreement and you will quickly feel the consequences!

---
# Thou shalt separate the scraping from the formatting

Remember that you have 1000s of pages to scrape.

--

.block[Q: How do you create your data formatting code?]

--

### Typically:

1. scrape a few pages

2. write code to extract the relevant data from the HTML

3. turn that code into a function so it can be applied systematically to each page

---
# Law 2: Consequences of infringement

.block[Q: What happens if you tied the formatting to the scraping?]

--

.block[A: If the format of the pages changes, you're roasted! .bold1[You have to painfully rerun the scraping!]]

---
# Law 2: What to do?

In a large project you .color1[cannot anticipate] the format of all the pages from a mere sub-selection. So:

1. always save the HTML code on disk after scraping.footnote[{star}Since HTML pages tend to be pretty big, you can instead save on disk only an HTML container that you're sure will be in all the pages .comment[it reduces the size on disk without incurring problems].]

2. apply, and then debug, the formatting script only after all, .color2[or a large chunk of], the data is scraped

.comment[A minimal code sketch of point 1 comes on the next slide.]

--

.block[{To remember}
1. the data always change
2. you always want more stuff than what you anticipated .comment[and be sure this epiphany only happens *ex post*!]
]
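
---
# Laws 1 & 2: A minimal sketch

Before moving to law 3, here is a minimal sketch of what laws 1 and 2 look like for a single page: save the raw HTML right away, format later, and wait before the next request. It assumes the `xml2` package; the URL and save path are placeholders, and the full project version comes a few slides below.

```r
library(xml2)

# hypothetical URL and save path, just for illustration
url = "https://www.example.com/page_1.html"

# law 2: download the page and save the raw HTML to disk,
# the formatting will be done later, from the saved file
page = read_html(url)
writeLines(as.character(page), "path/to/saved/webpages/page_1.html")

# law 1: be polite, wait a bit before requesting the next page
Sys.sleep(2)
```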

---
# Thou shalt anticipate failure

Be prepared, even if it's coded properly:

--

.auto-fit.strong1[Your code will fail!]

--

If you don't acknowledge that... well... I'm afraid to say you're gonna suffer.

---
# Law 3: Why can my scraping function fail?

There are numerous reasons for scraping functions to fail:

- a CAPTCHA catches you red-handed and diverts the page requests
- your IP gets blocked
- the page you get is different from the one you anticipate
- the URL doesn't exist
- timeout
- other random reasons

---
# Law 3: Consequences

.block[Q: What happens if you didn't anticipate failure?]

--

.block[A: You have to:
1. find out where your scraping stopped.footnote[{altstar}Note that this is easy thanks to .bold1[Law 2].]
2. try to debug *ex post*
3. manually restart the loop where it last stopped, etc...
4. in sum: *pain*, *pain*, *pain*, .color1[which could be avoided]
]

---
# Law 3: What to do?

1. anticipate failure by:

  - checking the pages you obtain: are they what you expect? Save the problematic pages. Stop as you go.

  - catching errors such as connection problems. Parse the error, and either continue or stop .comment[depending on the severity of the error]. A small sketch of this comes on the next slide.

  - catching higher-level errors

--

2. rewrite the loop so that it restarts at the last problem
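
---
# Law 3: Catching errors, a sketch

Here is a minimal sketch of "catch the error, parse it, then continue or stop". The URL is a placeholder and I assume that `read_html()` mentions the HTTP code (e.g. 404) in its error message; adapt the parsing to the errors you actually observe.

```r
library(xml2)

url = "https://www.example.com/page_1.html"

# tryCatch() returns the error object instead of stopping the program
page = tryCatch(read_html(url), error = function(e) e)

if(inherits(page, "error")){
  msg = conditionMessage(page)
  if(grepl("404", msg)){
    # missing page: harmless, log it and move on to the next URL
    message("Page not found, skipping: ", url)
  } else {
    # anything else (timeout, connection reset, etc.): stop and investigate
    stop("Scraping failed for ", url, ": ", msg)
  }
}
```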

---
# Application of the laws

.fs-16[
```r
save_path = "path/to/saved/webpages/"

# SCRAPING

all_files = list.files(save_path, full.names = TRUE)
i_start = length(all_files) + 1
n = length(all_webpages)

for(i in i_start:n){

  status = scraper(all_webpages[i], save_path,
                   prefix = paste0("id_", i, "_"))

  if(!identical(status, 1)){
    stop("Error during the scraping, see 'status'.")
  }

  Sys.sleep(1.2)
}

# FORMATTING

all_pages = list.files(save_path, full.names = TRUE)
n = length(all_pages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = formatter(all_pages[i])
}
```
]

---
# Law 1: Thou shalt not be in a hurry

.fs-16[
```r
save_path = "path/to/saved/webpages/"

# SCRAPING

all_files = list.files(save_path, full.names = TRUE)
i_start = length(all_files) + 1
n = length(all_webpages)

for(i in i_start:n){

  status = scraper(all_webpages[i], save_path,
                   prefix = paste0("id_", i, "_"))

  if(!identical(status, 1)){
    stop("Error during the scraping, see 'status'.")
  }

*  # wait
*  Sys.sleep(1.2)
}

# FORMATTING

all_pages = list.files(save_path, full.names = TRUE)
n = length(all_pages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = formatter(all_pages[i])
}
```
]

---
# Law 2: Thou shalt separate the scraping from the formatting

.fs-16[
```r
save_path = "path/to/saved/webpages/"

*# SCRAPING

all_files = list.files(save_path, full.names = TRUE)
i_start = length(all_files) + 1
n = length(all_webpages)

for(i in i_start:n){

  status = scraper(all_webpages[i], save_path,
                   prefix = paste0("id_", i, "_"))

  if(!identical(status, 1)){
    stop("Error during the scraping, see 'status'.")
  }

  Sys.sleep(1.2)
}

*# FORMATTING

all_pages = list.files(save_path, full.names = TRUE)
n = length(all_pages)
res = vector("list", n)
for(i in 1:n){
  res[[i]] = formatter(all_pages[i])
}
```
]

---
# Law 3: Thou shalt anticipate failure

.fs-16[
```r
*save_path = "path/to/saved/webpages/"

# SCRAPING

# of course, I assume URLs are UNIQUE and ONLY web pages end up in the folder
*all_files = list.files(save_path, full.names = TRUE)
*i_start = length(all_files) + 1
*n = length(all_webpages)

*for(i in i_start:n){

  # files are saved on disk with the scraper function
  # we give an appropriate prefix to the files to make it tidy
*  status = scraper(all_webpages[i], save_path,
*                   prefix = paste0("id_", i, "_"))

  # (optional) the status variable returns 1 if everything is OK
  # otherwise it contains information helping to debug
*  if(!identical(status, 1)){
*    stop("Error during the scraping, see 'status'.")
*  }

  # wait
  Sys.sleep(1.2)
}
```
]

---
# Law 3: Cont'd

.fs-16[
```r
# FORMATTING

all_pages = list.files(save_path, full.names = TRUE)
n = length(all_pages)
res = vector("list", n)
for(i in 1:n){

*  # error handling can be looser here since the data formatting
*  # is typically very fast. We can correct errors as we go.
*  # If the formatting is slow, we can use the same procedure as for
*  # the scraper, by saving the results on the hard drive as we advance.

  res[[i]] = formatter(all_pages[i])
}
```
]

---
# Law 3: Still cont'd

.fs-16[
```r
scraper = function(url, save_path, prefix){
  # simplified version

*  page = try(read_html(url))
*  if(inherits(page, "try-error")){
    return(page)
  }

  # the URL is cleaned so that it yields a valid file name
  url_clean = gsub("[^[:alnum:]]+", "_", url)
  writeLines(as.character(page), paste0(save_path, prefix, url_clean, ".html"))

*  if(!is_page_content_ok(page)){
    return(page)
  }

  return(1)
}

formatter = function(path){
  # simplified version

  page = readLines(path)

*  if(!is_page_format_ok(page)){
    stop("Wrong formatting of the page. Revise code.")
  }

  extract_data(page)
}
```
]

???

```r
save_path = "path/to/saved/webpages/"

# SCRAPING

# of course, I assume URLs are UNIQUE and ONLY web pages end up in the folder
all_files = list.files(save_path, full.names = TRUE)
i_start = length(all_files) + 1
n = length(all_webpages)

for(i in i_start:n){

  # files are saved on disk with the scraper function
  # we give an appropriate prefix to the files to make it tidy
  status = scraper(all_webpages[i], save_path,
                   prefix = paste0("id_", i, "_"))

  # (optional) the status variable returns 1 if everything is OK
  # otherwise it contains information helping to debug
  if(!identical(status, 1)){
    stop("Error during the scraping, see 'status'.")
  }

  # wait
  Sys.sleep(1.2)
}

# FORMATTING

all_pages = list.files(save_path, full.names = TRUE)
n = length(all_pages)
res = vector("list", n)
for(i in 1:n){
  # error handling can be looser here since the data formatting
  # is typically very fast. We can correct errors as we go.
  # If the formatting is slow, we can use the same procedure as for
  # the scraper, by saving the results on the hard drive as we advance.
  res[[i]] = formatter(all_pages[i])
}

# Done. You may need some extra formatting still, but then it really depends
# on the problem.
```

---
# Laws make your life easy

Making the code fool-proof and easy to adapt requires some planning. But it's worth it!

--

If you follow the .color1[three laws of webscraping], you're ready to .bold1[handle large projects in peace]!

---
class: section

# Dynamic web pages

---
class: fs-21

# Static vs Dynamic

- .invisible[dynm]static: HTML in source code `\(=\)` HTML in browser

--

- .invisible[stt]dynamic: HTML in source code `\(\neq\)` HTML in browser

--

.block[Q: What makes the HTML in your browser change?]

--

.h-3em[]

.center.bold1.fs-40[javascript]

---
class: fs-28

# The language of the web

## Web's trinity:

--

.w-50.ta-right[HTML for] .color1[content]

--

.w-50.ta-right[CSS for] .color1[style]

--

.w-50.ta-right[javascript for] .color1[manipulation]

---
# What's JS?

### A programming language

- .strong1[javascript] is a regular programming language .comment[with the typical toolkit: conditions, loops, functions, classes]

--

### Which...

- specializes in modifying HTML content

---
# Why JS?

- imagine you're the webmaster of an e-commerce website

- if you had no javascript and a client searched "shirt" on your website, you'd have to manually create the results page in HTML

--

- with javascript, you fetch the results of the query from a database, and the HTML content is updated to fit the results of the query, with real-time information

- javascript is simply indispensable

---
# JS: what's the connection to webscraping?

- some webpages may decide to display some information only after some .color1[event] has occurred

- the event can be:
  - the main HTML has loaded
  - an HTML box comes, or is close to coming, on-screen .comment[e.g. think of Facebook]
  - something is clicked
  - etc!

--

.block[Q: So far, we only queried the server to get the source code of the webpage. What's the problem with that?]

--

.block[A: If you wanted to have access to some information that only appears after these events... well... you can't.]

---
# JS: How does it work?

- you can add javascript to an HTML page with the `<script>` tag

--

### Example:

```default
<script>
  let all_p = document.querySelectorAll("p");
  for(let p of all_p) p.style.display = "none";
</script>
```

---

Use a .bold1[CSS selector] to select all paragraphs in the document.

```default
<script>
*  let all_p = document.querySelectorAll("p");
  for(let p of all_p) p.style.display = "none";
</script>
```

---

Remove all paragraphs from view.

```default
<script>
  let all_p = document.querySelectorAll("p");
*  for(let p of all_p) p.style.display = "none";
</script>
```

---
# Back to our webpage

Let's go back to the webpage we created in the previous course. Add the following code:

```default
<button type="button" id="btn"> What is my favourite author?</button>

<script>
  let btn = document.querySelector("#btn");

  let showAuthor = function(){
    let p = document.createElement("p");
    p.innerHTML = "My favourite author is Shakespeare";
    this.replaceWith(p);
  }

  btn.addEventListener("click", showAuthor);
</script>
```

---
# What happened?

- you can access some critical information only after an event has been triggered .comment[here the click]

--

- that's how dynamic web pages work!

![](images/dynamic_webpage.png)

---
# Dynamic webpages: Can we scrape them?

- yes, but...

--

.block[Q: What do we need? .comment[and I hope the answer will be natural after this long introduction!]]

--

.block[A: Indeed! We need to run javascript on the source code, and keep it running as the page updates.]

--

### In other words...

We need a .bold1[web browser].

---
class: section

# Python + Selenium

---
# Requirements

- to scrape dynamic webpages, we'll use .color1[python] in combination with .color1[selenium]

- you need:
  - [python 3.X](https://www.python.org/downloads/) + an IDE (e.g. [pycharm](https://www.jetbrains.com/pycharm/))
  - install [selenium](https://selenium-python.readthedocs.io/index.html) in python using `pip install selenium` on the terminal
  - download the appropriate [driver for the browser](https://selenium-python.readthedocs.io/installation.html#drivers) (Chrome or Firefox) and put it on the PATH or in your working directory

---
# Checking the install

If the installation is all right, the following code should open a browser:

```python
from selenium import webdriver

driver = webdriver.Chrome()
```

---
# How does selenium work?

- selenium controls a browser: typically anything that .bold1[you] can do, .bold1[it] can do

- most common actions include:
  - accessing URLs
  - clicking on buttons
  - typing/filling forms
  - scrolling

- .color2[do you really do more than that?].footnote[{star}Selenium can do much more actually: it can even execute custom javascript code.]

---
# Selenium 101: An example

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://stackoverflow.com/")

btn_cookies = driver.find_element(By.CSS_SELECTOR, "button.js-accept-cookies")
btn_cookies.click()

search_input = driver.find_element(By.CSS_SELECTOR, "input.s-input__search")
search_input.send_keys("webscraping")
search_input.send_keys(Keys.RETURN)
```

---

Importing only the classes we'll use.

```python
*from selenium import webdriver
*from selenium.webdriver.common.by import By
*from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://stackoverflow.com/")

btn_cookies = driver.find_element(By.CSS_SELECTOR, "button.js-accept-cookies")
btn_cookies.click()

search_input = driver.find_element(By.CSS_SELECTOR, "input.s-input__search")
search_input.send_keys("webscraping")
search_input.send_keys(Keys.RETURN)
```

---

Launching the browser .comment[empty at the moment].

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

*driver = webdriver.Chrome()

driver.get("https://stackoverflow.com/")

btn_cookies = driver.find_element(By.CSS_SELECTOR, "button.js-accept-cookies")
btn_cookies.click()

search_input = driver.find_element(By.CSS_SELECTOR, "input.s-input__search")
search_input.send_keys("webscraping")
search_input.send_keys(Keys.RETURN)
```

---

Accessing the stackoverflow.footnote[{altstar}I'm sorry to target stackoverflow for webscraping but it's only for instructional purposes!] URL.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

*driver.get("https://stackoverflow.com/")

btn_cookies = driver.find_element(By.CSS_SELECTOR, "button.js-accept-cookies")
btn_cookies.click()

search_input = driver.find_element(By.CSS_SELECTOR, "input.s-input__search")
search_input.send_keys("webscraping")
search_input.send_keys(Keys.RETURN)
```

---

It's our first visit to the page, so cookies need to be agreed upon. After selecting.footnote[{star}Do not mistake `find_element` for `find_elements` (.color1[the s!]). The former returns an HTML element while the latter returns a list of elements.] the button to click with a .bold1[CSS selector], we click on it with the `click()` method.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://stackoverflow.com/")

*btn_cookies = driver.find_element(By.CSS_SELECTOR, "button.js-accept-cookies")
*btn_cookies.click()

search_input = driver.find_element(By.CSS_SELECTOR, "input.s-input__search")
search_input.send_keys("webscraping")
search_input.send_keys(Keys.RETURN)
```
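
---

An aside on the previous footnote: `find_elements` (.color1[with the s]) returns a list that you can loop over. The CSS class below is only an assumed example, not one taken from the page.

```python
# find_elements returns a (possibly empty) list of matching elements
links = driver.find_elements(By.CSS_SELECTOR, "a.question-hyperlink")

# each element has the same methods and attributes as with find_element
for link in links:
    print(link.text)
```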

---

Finally, we search the SO posts containing the term .color1[webscraping]. We first select the input element containing the search text. Then we type webscraping with the `send_keys()` method and finish by pressing enter (`Keys.RETURN`).

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("https://stackoverflow.com/")

btn_cookies = driver.find_element(By.CSS_SELECTOR, "button.js-accept-cookies")
btn_cookies.click()

*search_input = driver.find_element(By.CSS_SELECTOR, "input.s-input__search")
*search_input.send_keys("webscraping")
*search_input.send_keys(Keys.RETURN)
```

---
# Saving the results

To obtain the HTML of an element:

```python
body = driver.find_element(By.CSS_SELECTOR, "body")
body.get_attribute("innerHTML")
```

The expression `driver.find_element(By.CSS_SELECTOR, "body").get_attribute("innerHTML")`.footnote[{altstar}Remember that the term `driver` is only a generic name taken from the previous example. It could be anything else.] returns the HTML code .bold1[as it is currently displayed in the browser]. It has nothing to do with the source code!

---
# Saving the results II

To write the HTML to a file, you do it the usual Python way:

```python
outFile = open(path_to_save, "w")
outFile.write(html_to_save)
outFile.close()
```

---
# Good to know

Scrolling (executes javascript):

```python
# scrolls down by 1000 pixels
driver.execute_script("window.scrollBy(0, 1000)")

# goes back to the top of the page
driver.execute_script("window.scrollTo(0, 0)")
```

---
# Dynamic webpages: is that it?

Well, that's it folks!

You just have to automate the browser and save the results. Then you can do the data processing in your favorite language.

---
class: section

# Practice

---

Go to the website of [Peugeot](https://www.peugeot.fr/index.html).

1. Scrape information on the engine prices for a few models.

2. Create a class extracting the information for one model.

3. Run it on the first 3 models.
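
---

To get started with the practice, here is a minimal skeleton: it only opens the page, leaves some time for the javascript to run, and saves the HTML currently displayed in the browser. The file name is a placeholder and no Peugeot-specific selectors are given; inspecting the page and extracting the prices is the exercise.

```python
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.peugeot.fr/index.html")

# leave some time for the javascript to populate the page
time.sleep(5)

# save the HTML as it is currently displayed in the browser
with open("peugeot_home.html", "w", encoding="utf-8") as outFile:
    outFile.write(driver.page_source)

driver.quit()
```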