class: center, middle, inverse, title-slide # Data analysis II ## Webscraping, introduction ### Laurent Bergé ### University of Bordeaux, BxSE ### 09/12/2021 --- # What is webscraping? - webscraping is the art of collecting data from web pages - **anything** you see when browsing the internet *is* data - any data in a web page can be collected --- # Why doing that? Sometimes that's the only way to get the information you want! -- .block[{To consider} Web scraping is time consuming and is also costly in terms of resources (.color2[both for you and the server you're scraping]). You should think hard to alternative solutions first!.footnote[{`\\(\\star\\)`} One solution is just to ask the owner, e.g. there's often a dedicated API provided.]] --- # Three types of task in webscraping 1. organizing the web scraping (.color2[only for large tasks]) -- 2. getting the data from the web (.color2[actual web scraping]) -- 3. formatting the data ??? Steps 1 and 3 are overlooked but are extremely important --- # Typologies of web scraping tasks ![](images/web_typologies.png) --- # Objective - get you going for ambitious web scraping projects (.color1[the fun ones]) -- .h-1em[] .center.strong1[So you'll need to have a correct understanding of .underline[how the web works]!] --- # Outline - how static web pages work - practice with R - handling large projects - how dynamic web pages work - practice with Python and Selenium --- class: section # How does the web work? --- # The HTTP protocol -- ![](images/http_infography.png) ??? examples: you query wikipedia => messi client -> first gets the IP of the server via DNS (cached or via DNS servers) -> sends a GET HTTP (hypertext transfert protocol) -> ISP -> Routers -> server -> response -> all the way back if in HTTPS -> first shake hands (server sends certificate) -> then encryptions + decryptions, same route encryption = run time IP adresses! uniquely identify you in the network --- # Example of HTTP GET request ```{} GET HTTP/1.1 Host: developer.mozilla.org User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT) Accept-Language: en-us Accept-Encoding: gzip, deflate Connection: Keep-Alive ``` ??? talk about the user agent file => you can write whatever you want here, but you can also look like one regular browser --- # Server sends back a response .source[[mozilla](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview)] ```{} HTTP/1.1 200 OK Date: Sat, 09 Oct 2010 14:28:02 GMT Server: Apache Last-Modified: Tue, 01 Dec 2009 20:18:22 GMT ETag: "51142bc1-7449-479b075b2891b" Accept-Ranges: bytes Content-Length: 29769 Content-Type: text/html <!DOCTYPE html... (here come the 29769 bytes of the requested web page) ``` ??? famous codes: 200 / 404 / 403 (forbidden) --- # What's a webpage? -- - a web page is just code that is interpreted by your browser - the language in which the content is written is .bold1[HTML] -- - HTML is just about .color1[content]! -
let's inspect [the webscraping's wikipedia page](https://en.wikipedia.org/wiki/Web_scraping) ??? we look at the wikipedia page then inspect the html and find out the content --- # How does HTML work? - this is a markup language - the content of each HTML element is enclosed in tags - tags can have attributes - content only: multiple spaces are ignored -- .color2[... that's basically it!] --- # How does HTML work? Tags .source[[mozilla](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)] ![](images/html_moz_tags.JPG) --- # How does HTML work? Attributes .source[[mozilla](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)] ![](images/html_moze_attributes.JPG) --- # Empty tags Some tags don't need closing tags: - like `<img>` or `<br>` --- # Most common tags - `h1`-`h4`: headers - `p`: paragraph - `a`: link - `img`: image - `strong`: to emphasize text - `div`: generic box (.color2[this may be the most popular]) --- # HTML is just about boxes ![](images/html_box_model.JPG) --- # HTML: Example -
let's write our first web page! -- - Let's create a personal web page containing: - a brief description of who you are - stuff that you like, with at least one link - stuff that you don't like - a quote - an image ??? Do it in VScode, it's easier --- # Why is our web page boring? - HTML is only about content, .color1[not about style] -- - HTML is nothing without its best friend .bold1[CSS] -- - CSS is .it1[only about style] --- # How does CSS work? - CSS is a language indicating (to the browser) how to style your .color1[HTML] elements -- ![](images/css_moz.JPG) -- .h-1em[] .center.strong1[You can do a lot with CSS!] --- .source[[Adam Kuhn](https://codepen.io/cobra_winfrey/pen/YBNMdw), with only HTML and CSS.] <p class="codepen" data-height="600" data-default-tab="result" data-slug-hash="YBNMdw" data-user="cobra_winfrey" style="height: 600px; box-sizing: border-box; display: flex; align-items: center; justify-content: center; border: 2px solid; margin: 1em 0; padding: 1em;"> <span>See the Pen <a href="https://codepen.io/cobra_winfrey/pen/YBNMdw"> Stay Positive</a> by Adam Kuhn (<a href="https://codepen.io/cobra_winfrey">@cobra_winfrey</a>) on <a href="https://codepen.io">CodePen</a>.</span> </p> <script async src="https://cpwebassets.codepen.io/assets/embed/ei.js"></script> --- # CSS: Example -
let's add CSS to our web page! - use CSS to: - increase the [font-size](https://developer.mozilla.org/fr/docs/Web/CSS/font-size) of the paragraphs and set the [font-family](https://developer.mozilla.org/fr/docs/Web/CSS/font-family) to sans-serif - change the [background-color](https://developer.mozilla.org/fr/docs/Web/CSS/background-color) to .bg-color-Linen[Linen].footnote[{star}There are [140 predefined colors](https://www.w3schools.com/tags/ref_colornames.asp) in HTML.] - add a [border-radius](https://developer.mozilla.org/fr/docs/Web/CSS/border-radius) and a [box-shadow](https://developer.mozilla.org/fr/docs/Web/CSS/box-shadow) to the image --- # CSS: Can we do more? I would like to have: - the first paragraph in italic - the stuff that I like in green (.color-ForestGreen[**ForestGreen**]) - the stuff that I dislike in red (.color-Crimson[**Crimson**]) - the things that I *really*, **really**, like or dislike in bold -- .block[Q: At the moment, can I do that?] -- .block[A: Not yet, because we need to select precisely some elements! In sum, we need **selectors**.] --- class: section # CSS selectors --- # CSS selectors - .bold1[CSS selectors] indicate precisely which HTML element you want to style -- - typically, HTML tags will contain .color1[attributes] in order to be found via CSS selectors -- - the main attribute used in HTML is the `class`.footnote[The `id` attribute is usually less useful in webscraping.] --- # CSS selectors: Most common ways to select HTML elements .footer[See a list in [w3schools](https://www.w3schools.com/cssref/css_selectors.asp), and the [test page](https://www.w3schools.com/cssref/trysel.asp).] ```{} p : all "p" tags p span : all "span" contained in "p" tags p, a : all "p" and "a" tags #id1 : all elements with id equal to id1 .class1 : all elements of class "class1" p.class1 : all "p" elements of class "class1" p.class1 span : all "span" in "p" tags of class "class1" p > span : all "span" that are direct children of p h1 + p : all "p" that follow *directly* an "h1" (direct sibling) h1 ~ p : all "p" that follow an "h1" (siblings placed after) [id] : all elements with an existing "id" attribute [class^=my] : all elements whose class starts with "my" p[class*=low] : all "p" elements whose class contains the string low etc! ``` ??? LATER: two columns. Left the HTML, right the CSS selector. As I go along the different selectors, I highlight the HTML element that is selected --- # Exercize 1 .pull-left[ .block[Q: Select the following paragraph.] ] .pull-right[ .invisible[stuff] ] .fs-14[ ```default <h2>Who am I?</h2> <div class="intro"> <p>I'm <span class="age">34</span> and measure <span class="unit">1.70m</span>.</p> </div> <div class="info"> <div> <p id="like">What I like:</p> <ul> <li>Barcelona</li> <li>winning the Ballon d'Or every odd year</li> </ul> * <p class="extra">I forgot to say that I like scoring over 100 * goals per season.</p> </div> <div> <p class="dislike">What I don't like:</p> <ul> <li><span class="deadly-foe">Real Madrid</span></li> <li>leaving the club in which I've played since <span class="age">13</span></li> </ul> </div> </div> ``` ] --- # Exercize 1 .pull-left[ .block[Q: Select the following paragraph.] ] .pull-right[ .block[A: .w-1em[] `p.extra`] ] .fs-14[ ```default <h2>Who am I?</h2> <div class="intro"> <p>I'm <span class="age">34</span> and measure <span class="unit">1.70m</span>.</p> </div> <div class="info"> <div> <p id="like">What I like:</p> <ul> <li>Barcelona</li> <li>winning the Ballon d'Or every odd year</li> </ul> * <p class="extra">I forgot to say that I like scoring over 100 * goals per season.</p> </div> <div> <p class="dislike">What I don't like:</p> <ul> <li><span class="deadly-foe">Real Madrid</span></li> <li>leaving the club in which I've played since <span class="age">13</span></li> </ul> </div> </div> ``` ] --- # Exercize 2 .pull-left[ .block[Q: Select the two highlighted `div`s.] ] .pull-right[ .invisible[stuff] ] .fs-14[ ```default <h2>Who am I?</h2> <div class="intro"> <p>I'm <span class="age">34</span> and measure <span class="unit">1.70m</span>.</p> </div> <div class="info"> * <div> * <p id="like">What I like:</p> * <ul> * <li>Barcelona</li> * <li>winning the Ballon d'Or every odd year</li> * </ul> * <p class="extra">I forgot to say that I like scoring over 100 * goals per season.</p> * </div> * <div> * <p class="dislike">What I don't like:</p> * <ul> * <li><span class="deadly-foe">Real Madrid</span></li> * <li>leaving the club in which I've played since * <span class="age">13</span></li> * </ul> * </div> </div> ``` ] --- # Exercize 2 .pull-left[ .block[Q: Select the two highlighted `div`s.] ] .pull-right[ .block[A: .w-1em[] `div.info > div`] ] .fs-14[ ```default <h2>Who am I?</h2> <div class="intro"> <p>I'm <span class="age">34</span> and measure <span class="unit">1.70m</span>.</p> </div> <div class="info"> * <div> * <p id="like">What I like:</p> * <ul> * <li>Barcelona</li> * <li>winning the Ballon d'Or every odd year</li> * </ul> * <p class="extra">I forgot to say that I like scoring over 100 * goals per season.</p> * </div> * <div> * <p class="dislike">What I don't like:</p> * <ul> * <li><span class="deadly-foe">Real Madrid</span></li> * <li>leaving the club in which I've played since * <span class="age">13</span></li> * </ul> * </div> </div> ``` ] --- # Exercize 3 .pull-left[ .block[Q: Select the two highlighted `li`s.] ] .pull-right[ .invisible[stuff] ] .fs-14[ ```default <h2>Who am I?</h2> <div class="intro"> <p>I'm <span class="age">34</span> and measure <span class="unit">1.70m</span>.</p> </div> <div class="info"> <div> <p id="like">What I like:</p> <ul> * <li>Barcelona</li> * <li>winning the Ballon d'Or every odd year</li> </ul> <p class="extra">I forgot to say that I like scoring over 100 goals per season.</p> </div> <div> <p class="dislike">What I don't like:</p> <ul> <li><span class="deadly-foe">Real Madrid</span></li> <li>leaving the club in which I've played since <span class="age">13</span></li> </ul> </div> </div> ``` ] --- # Exercize 3 .pull-left[ .block[Q: Select the two highlighted `li`s.] ] .pull-right[ .block[A: .w-1em[] `#like ~ ul > li`] ] .fs-14[ ```default <h2>Who am I?</h2> <div class="intro"> <p>I'm <span class="age">34</span> and measure <span class="unit">1.70m</span>.</p> </div> <div class="info"> <div> <p id="like">What I like:</p> <ul> * <li>Barcelona</li> * <li>winning the Ballon d'Or every odd year</li> </ul> <p class="extra">I forgot to say that I like scoring over 100 goals per season.</p> </div> <div> <p class="dislike">What I don't like:</p> <ul> <li><span class="deadly-foe">Real Madrid</span></li> <li>leaving the club in which I've played since <span class="age">13</span></li> </ul> </div> </div> ``` ] --- # A note on classes An HTML element can have several classes separated with spaces: ```default <p class="first main low-key"> That's only an example! </p> ``` -- - when using `.class`, the class **is not** the full string `"first main low-key"` - there are three separate classes: `first`, `main` and `low-key`, which can be selected with the `p.class` syntax -- This means that the paragraph can be selected with either: ```css p.first p.main p.low-key p.first.main ``` --- # A note on classes: `[attr]` selection ```default <p class="first main low-key"> That's only an example! </p> ``` - even though you can select with `p.main`, you **cannot** with `p[class^=main]` - in `p[class^=text]` only the full string of the class is considered, not the three classes separately - you would have to use `p[class*=main]` -- ### Caveat... - you would also en up selecting the following element: ```default <p class="mainiac"> On the floor. </p> ``` --- # Selectors: Limitation Imagine a large HTML file. -- .block[Q: Can you select the following div? .comment[Tip: there's a hint in the title.]] ```default <div> This is an important text. <p id = "p1"> lorem ipsum </p> </div> ``` -- .block[A: No. There is no selector equal to: "select the div such that it contains a paragraph with id equal to p1".] -- - you **cannot select** .underline[parent elements] from children.footnote[{altstar} The selector `:has()` is new though but mostly unsupported.] - in other words: you can only go down the tree .comment[sometimes it's an issue when the child node is easier to select] --- # Selectors: XPath - .bold1[XPath] is a language to make selections in XML documents .comment[it is not linked to CSS] - it's like... a path to a document: `/path/to/object` but instead of having folders, you have tags.footnote[{altstar}It's actually more complicated than that, but it's a fair first approximation.] -- ```{} MAIN SYNTAX / : selects the direct descendant only // : selects any descendant .. : selects the parent @ : selects attributes (used in conditions) ``` -- - .bold1[Tip:] always start with a double slash. Ex: `//div/etc` will first select all `div` in the document before applying the rest of the commands..footnote[{star}Starting with a single slash leads to an absolute path from the root which is faster but much more error prone in webscraping where you .comment[likely] deal with deeply embedded webpages.] --- # XPath: predicates - you can add predicates in brackets .fs-14[ ```{} //div[@class='extra'] : selects all 'div's whose class is equal to 'extra' //div/p[2] : selects the second 'p's that are direct children of 'div's //div/p[last()] : selects the last 'p's that are direct children of 'div's //div[contains(@class, 'extra')] : selects all 'div's whose class contains 'extra' ``` ] ??? cond1 and cond2 not(cond) --- # XPath: Axes - you can add, before the tag, an axis in the form `axis::tag`.footnote[{star} You can find the list of axes in [W3schools](https://www.w3schools.com/xml/xpath_axes.asp).] .fs-14[ ```{} //span/parent::li : selects all 'li' that are parents of 'span' //p/following-sibling::* : selects any element following a 'p' that is a sibling of it ``` ] ??? //span[@class='deadly-foe']/parent::li --- # Why are selectors important? - when you scrape a web page, you don't want all the content from the web page: you focus only on .color1[specific elements] - you select elements using .bold1[CSS selectors] or .bold1[XPath] --- # Wrapping up .auto-fit[Selectors are .bold1[powerful tools] to select HTML elements] -- .block[{Resources} - [CSS](https://www.w3schools.com/cssref/css_selectors.asp) - [XPath](https://www.w3schools.com/xml/xpath_intro.asp) - a [website](https://quizzical-engelbart-d15a44.netlify.app/lean_css_xpath.html) I created to learn and test selectors ] -- .h-1em[] .center.strong1[Now let's go back to CSS!] --- # More CSS to our webpage -
let's add more CSS to our webpage - let's have: - the first paragraph in italic - the stuff that I like in green (.color-ForestGreen[**ForestGreen**]) - the stuff that I dislike in red (.color-Crimson[**Crimson**]) - the things that I *really*, **really**, like or dislike in bold ??? Idealement, il faut leur demander de creer une page web précise : par exemple un blog --- # What I see and what I scrape .block[{To remember} You don't scrape what you see in the browser, you scrape the HTML code which, after application of CSS styles, renders on your browser. It's important to set the CSS aside (.color2[that's why we need to understand what it does])! ] ??? You can scrape much more than visible elements! --- class: section # Scraping in R --- # Tools ### Good news You have readily available tools to webscrape in R and Python. In R, you'll need: - [rvest](https://rvest.tidyverse.org/) -- - and that's it! (.color2[for the easy stuff!]) --- # Exercise 1 1. go into [Messi's wikipedia page](https://en.wikipedia.org/wiki/Lionel_Messi) 2. scrape the statistics table reporting the goals scored in club 3. plot the evolution of the number of goals in the national championship and in the Champion's League -- .block[{The only thing you need is...} .fs-18[ 1. `rvest` - `read_html` to get the webpage content - `html_elements` to get the HTML elements with CSS selectors - `html_table` to extract the table 2. basic data management ] ] --- # Exercise 2 1. select all paragraphs containing the name .color1[Messi] twice 2. extract these paragraphs and highlight the name of Messi with `<strong>` 3. write an HTML document containing only these paragraphs 4. add the following style in the headers: ```css <style> p { border: 1pt solid; border-radius: 5px; box-shadow: 12px 12px 5px 2px rgba(0, 0, 255, 0.2); margin: 2em; padding: 1em; } </style> ``` --- # Exercise 3 1. gather information on your next neighbor only using Google queries 2. keep only relevant information 3. compile that information in an HTML document 4. write a function automatizing the process and apply it to another neighbor