# Data analysis II
## Webscraping, introduction
### Laurent Bergé
### University of Bordeaux, BxSE
### 09/12/2021

---

# What is webscraping?

- webscraping is the art of collecting data from web pages

- **anything** you see when browsing the internet *is* data

- any data in a web page can be collected

---

# Why do that?

Sometimes that's the only way to get the information you want!

.block[{To consider}
Web scraping is time-consuming and costly in terms of resources (.color2[both for you and the server you're scraping]). Think hard about alternative solutions first!.footnote[{`\\(\\star\\)`} One alternative is simply to ask the owner of the data: a dedicated API is often provided.]]

---

# Three types of task in webscraping

1. organizing the web scraping (.color2[only for large tasks])

2. getting the data from the web (.color2[actual web scraping])

3. formatting the data

???

Steps 1 and 3 are often overlooked but extremely important

---
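
# Three types of task in webscraping: a Python sketch

The three tasks can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `FAKE_WEB` is a hypothetical in-memory stand-in for real web pages (so nothing is actually downloaded), and `fetch` and `extract_title` are made-up helper names.

```python
import csv
import io
import re

# Hypothetical stand-in for the web: URL -> HTML (no network access needed)
FAKE_WEB = {
    "https://example.org/page1": "<html><title>Page one</title></html>",
    "https://example.org/page2": "<html><title>Page two</title></html>",
}


def fetch(url):
    """Task 2: get the raw data (here, simply looked up in FAKE_WEB)."""
    return FAKE_WEB[url]


def extract_title(html):
    """Task 3: turn raw HTML into a clean value."""
    match = re.search(r"<title>(.*?)</title>", html)
    return match.group(1) if match else ""


# Task 1: organize the work (what to fetch, and what is already done)
todo = list(FAKE_WEB)
raw_pages = {}
for url in todo:
    if url not in raw_pages:  # skip pages that were already collected
        raw_pages[url] = fetch(url)

# Task 3: format the collected data as a table (CSV text)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["url", "title"])
for url, html in raw_pages.items():
    writer.writerow([url, extract_title(html)])
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the raw pages separate from the formatted table means the formatting step can be redone without querying the server again.

---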

# Typologies of web scraping tasks

![](images/web_typologies.png)

---

# Objective

- get you going for ambitious web scraping projects (.color1[the fun ones])

.center.strong1[So you'll need to have a correct understanding of .underline[how the web works]!]

---

# Outline

- how static web pages work

- practice with R

- handling large projects

- how dynamic web pages work

- practice with Python and Selenium

---

# How does the web work?

---

# The HTTP protocol

![](images/http_infography.png)

???

example: you query wikipedia => messi

client -> first gets the IP of the server via DNS (cached or via DNS servers) -> 
  sends an HTTP GET request (HTTP = HyperText Transfer Protocol) -> ISP -> routers -> server -> response ->
  all the way back

if in HTTPS -> first a handshake (server sends its certificate) -> then encryption + decryption, same route
encryption = extra run time

IP addresses! they uniquely identify you on the network

---

# Example of HTTP GET request

```{}
GET / HTTP/1.1
Host: developer.mozilla.org
User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
```

???

talk about the User-Agent field => you can write whatever you want here, but you can also make yourself look like a regular browser

---
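
# Example of HTTP GET request: a Python sketch

To make the structure of the request concrete, here is a minimal sketch that assembles the text of a GET request from a URL. Nothing is sent over the network. The function name `build_get_request` and the default User-Agent string are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit


def build_get_request(url, user_agent="my-scraper/0.1"):
    """Return the text of an HTTP/1.1 GET request for `url` (nothing is sent)."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path means the root of the site
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",           # request line: method, target, version
        f"Host: {parts.netloc}",          # which site on the server we want
        f"User-Agent: {user_agent}",      # free text: who we claim to be
        "Accept-Language: en-us",
        "Accept-Encoding: gzip, deflate",
        "Connection: Keep-Alive",
    ]
    # Header lines end with CRLF; an empty line marks the end of the headers
    return "\r\n".join(lines) + "\r\n\r\n"


request_text = build_get_request("https://developer.mozilla.org/")
print(request_text)
```

In practice an HTTP library builds this text for you. The sketch only shows that a request is plain text made of a request line followed by `Name: value` headers.

---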

# Server sends back a response

```{}
HTTP/1.1 200 OK
Date: Sat, 09 Oct 2010 14:28:02 GMT
Server: Apache
Last-Modified: Tue, 01 Dec 2009 20:18:22 GMT
ETag: "51142bc1-7449-479b075b2891b"
Accept-Ranges: bytes
Content-Length: 29769
Content-Type: text/html

<!DOCTYPE html... (here come the 29769 bytes of the requested web page)
```

???
famous codes: 200 (OK) / 404 (not found) / 403 (forbidden)

---
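
# Server sends back a response: a Python sketch

The response has the same shape as the request: a status line, headers, an empty line, then the body. The sketch below splits a shortened, hard-coded copy of the response above into these parts. `parse_http_response` is a hypothetical helper written for this slide, not a library function.

```python
def parse_http_response(raw):
    """Split a raw HTTP response into (status code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")  # empty line = end of headers
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")  # split on the first colon only
        headers[name.strip()] = value.strip()
    return int(code), reason, headers, body


RAW_RESPONSE = "\r\n".join([
    "HTTP/1.1 200 OK",
    "Date: Sat, 09 Oct 2010 14:28:02 GMT",
    "Server: Apache",
    "Content-Type: text/html",
    "",
    "<!DOCTYPE html><html><body>Hello</body></html>",
])

code, reason, headers, body = parse_http_response(RAW_RESPONSE)
print(code, reason)
print(headers["Content-Type"])
```

When scraping, the status code is the first thing to check: only a 200 response carries the page you asked for.

---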

# What's a webpage?

- a web page is just code that is interpreted by your browser

- the language in which the content is written is .bold1[HTML]

- HTML is just about .color1[content]!

- <svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M0 352a160 160 0 0 0 160 160h64a160 160 0 0 0 160-160V224H0zM176 0h-16A160 160 0 0 0 0 160v32h176zm48 0h-16v192h176v-32A160 160 0 0 0 224 0z"/></svg> let's inspect [the Wikipedia page on web scraping](https://en.wikipedia.org/wiki/Web_scraping)

???

we look at the wikipedia page
then inspect the html and find out the content

---

# How does HTML work?

- this is a markup language

- the content of each HTML element is enclosed in tags

- tags can have attributes

- only the content matters: consecutive whitespace (spaces, line breaks) is collapsed into a single space

---
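
# How does HTML work? Whitespace in Python

Because browsers collapse runs of whitespace when displaying ordinary text, text extracted from raw HTML often contains extra spaces and line breaks that never appear on screen. A minimal sketch of the same normalization, using a made-up example string:

```python
import re

# Text as it might appear in the HTML source (hypothetical example)
html_text = "Hello,    web\n      scraping   world!"

# Replace every run of whitespace with a single space, then trim the ends
collapsed = re.sub(r"\s+", " ", html_text).strip()
print(collapsed)
```

---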

# How does HTML work? Tags

.source[[mozilla](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)]

![](images/html_moz_tags.JPG)

---

# How does HTML work? Attributes

.source[[mozilla](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)]

![](images/html_moze_attributes.JPG)

---
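
# How does HTML work? Reading tags and attributes in Python

Tags, attributes and content can also be seen from a program's point of view. The sketch below uses the `html.parser` module from Python's standard library to log what it meets while reading a small snippet. The snippet and the class name `TagLogger` are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record opening tags (with attributes), text content and closing tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.events.append(("text", data))


parser = TagLogger()
parser.feed('<p class="editor-note">My cat is <strong>very</strong> grumpy.</p>')
parser.close()

for event in parser.events:
    print(event)
```

The `strong` element is nested inside the `p` element: its start and end events both occur between the start and the end of `p`.

---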

# Empty tags

Some tags don't need closing tags:

- like `<img>` or `<br>`

---
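
# Empty tags: a Python check

A quick way to see which tags are never closed is to count opening and closing tags separately. A minimal sketch with Python's standard `html.parser` module, where the HTML snippet and the class name `StartEndCounter` are made up for illustration:

```python
from html.parser import HTMLParser


class StartEndCounter(HTMLParser):
    """Keep separate lists of the opening and closing tags encountered."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


counter = StartEndCounter()
counter.feed('<p>line one<br>line two</p><img src="cat.png" alt="a cat">')
counter.close()

print("opened:", counter.opened)
print("closed:", counter.closed)
```

Only `p` appears in both lists: `br` and `img` are opened but never closed.

---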

# Most common tags

- `h1`-`h6`: headings (`h1`-`h4` being the most used)
- `p`: paragraph
- `a`: link
- `img`: image
- `strong`: to emphasize text
- `div`: generic box (.color2[this may be the most popular])

---

# HTML is just about boxes

![](images/html_box_model.JPG)

---

# HTML: Example

- Let's create a personal web page containing:

  - a brief description of who you are
  - stuff that you like, with at least one link
  - stuff that you don't like
  - a quote
  - an image

???

Do it in VS Code, it's easier

---

# Why is our web page boring?

- HTML is only about content, .color1[not about style]

- HTML is nothing without its best friend .bold1[CSS]

- CSS is .it1[only about style]

---

# How does CSS work?

- CSS is a language indicating (to the browser) how to style your .color1[HTML] elements

![](images/css_moz.JPG)

.center.strong1[You can do a lot with CSS!]

---

<p class="codepen" data-height="600" data-default-tab="result" data-slug-hash="YBNMdw" data-user="cobra_winfrey" 
style="height: 600px; box-sizing: border-box; display: flex; align-items: center; 
justify-content: center; border: 2px solid; margin: 1em 0; padding: 1em;">
  <span>See the Pen <a href="https://codepen.io/cobra_winfrey/pen/YBNMdw">
  Stay Positive</a> by Adam Kuhn (<a href="https://codepen.io/cobra_winfrey">@cobra_winfrey</a>)
  on <a href="https://codepen.io">CodePen</a>.</span>
</p>
<script async src="https://cpwebassets.codepen.io/assets/embed/ei.js"></script>

---

# CSS: Example

- use CSS to:

  - increase the [font-size](https://developer.mozilla.org/fr/docs/Web/CSS/font-size) of the paragraphs and set the [font-family](https://developer.mozilla.org/fr/docs/Web/CSS/font-family) to sans-serif
  - change the [background-color](https://developer.mozilla.org/fr/docs/Web/CSS/background-color) to .bg-color-Linen[Linen].footnote[{star}There are [140 predefined colors](https://www.w3schools.com/tags/ref_colornames.asp) in HTML.]
  - add a [border-radius](https://developer.mozilla.org/fr/docs/Web/CSS/border-radius) and a [box-shadow](https://developer.mozilla.org/fr/docs/Web/CSS/box-shadow) to the image
  
  
---

# CSS: Can we do more?

I would like to have:

- the first paragraph in italic
- the stuff that I like in green (.color-ForestGreen[**ForestGreen**])
- the stuff that I dislike in red (.color-Crimson[**Crimson**]) 
- the things that I *really*, **really** like or dislike in bold

.block[A: Not yet, because we first need to select specific elements precisely! In short, we need **selectors**.]

---

# CSS selectors

---

# CSS selectors

- .bold1[CSS selectors] indicate precisely which HTML element you want to style

- typically, HTML tags will contain .color1[attributes] in order to be found via CSS selectors

- the main attribute used for this purpose is `class`.footnote[The `id` attribute is usually less useful in webscraping.]

---

# CSS selectors: Most common ways to select HTML elements

.footer[See a list in [w3schools](https://www.w3schools.com/cssref/css_selectors.asp), and the [test page](https://www.w3schools.com/cssref/trysel.asp).]

```{}
p             : all "p" tags
p span        : all "span" contained in "p" tags
p, a          : all "p" and "a" tags
#id1          : all elements with id equal to id1
.class1       : all elements of class "class1"
p.class1      : all "p" elements of class "class1"
p.class1 span : all "span" in "p" tags of class "class1"
p > span      : all "span" that are direct children of p
h1 + p        : all "p" that directly follow an "h1" (adjacent sibling)
h1 ~ p        : all "p" that follow an "h1" (siblings placed after)
[id]          : all elements with an existing "id" attribute
[class^=my]   : all elements whose class attribute starts with "my"
p[class*=low] : all "p" elements whose class attribute contains the string "low"
etc!
```

???

LATER: two columns. Left the HTML, right the CSS selector.
As I go along the different selectors, I highlight the HTML element that is selected

---
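
# CSS selectors: a toy matcher in Python

To see what a selector does from the scraper's side, here is a toy sketch that finds elements matching the simplest selectors of the list (`p`, `#id1`, `.class1`, `p.class1`). It is an illustration under strong assumptions, not a real selector engine: combinators (`p span`, `>`, `+`, `~`) and attribute selectors are not handled, and the names `parse_simple_selector`, `SimpleSelect`, `select` and the `PAGE` snippet are all made up for this slide.

```python
import re
from html.parser import HTMLParser


def parse_simple_selector(selector):
    """Split a simple selector such as 'p.class1' into (tag, id, classes)."""
    tag = re.match(r"[\w-]+", selector)            # leading tag name, if any
    ids = re.findall(r"#([\w-]+)", selector)       # '#id1'    -> ['id1']
    classes = re.findall(r"\.([\w-]+)", selector)  # '.class1' -> ['class1']
    return (tag.group(0) if tag else None, ids[0] if ids else None, set(classes))


class SimpleSelect(HTMLParser):
    """Collect the opening tags matched by one simple selector."""

    def __init__(self, selector):
        super().__init__()
        self.want_tag, self.want_id, self.want_classes = parse_simple_selector(selector)
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element can have several classes, separated by spaces
        element_classes = set((attrs.get("class") or "").split())
        if self.want_tag is not None and tag != self.want_tag:
            return
        if self.want_id is not None and attrs.get("id") != self.want_id:
            return
        if not self.want_classes <= element_classes:
            return
        self.found.append((tag, attrs))


def select(html, selector):
    """Return the (tag, attributes) pairs of the elements matching `selector`."""
    parser = SimpleSelect(selector)
    parser.feed(html)
    parser.close()
    return parser.found


PAGE = """
<h1 id="id1">Title</h1>
<p class="class1 low">First <span>inner</span></p>
<p>Second</p>
<a class="class1" href="#">A link</a>
"""

for selector in ["p", "#id1", ".class1", "p.class1"]:
    print(selector, "->", [tag for tag, _ in select(PAGE, selector)])
```

For actual scraping, a dedicated library with a complete selector engine is the better choice. The sketch only shows that a selector is a filter on tag names and attributes.

---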

# Exercise 1

```default
<h2>Who am I?</h2>