class: center, middle, inverse, title-slide, clear # Modelling and quantifying mortality and longevity risk ## Tutorial session 3 <html> <div style="float:left"></div> <hr align='center' color='#106EB6' size=1px width=97%> </html> ### Katrien Antonio, Torsten Kleinow, Frank van Berkum, Michel Vellekoop and Jens Robben ### <a href="https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/">35th International Summer School of the Swiss Association of Actuaries </a> | June 3-7, 2024 ### <a> </a> --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: introduction # Introduction ### Workshop <img align="left" src="img/gitfront.png" width="35" hspace="10"/> https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/ The workshop's landing page on GitFront, where we upload slide decks, R code and data sets. ### Us <img align="left" src="img/Katrien.jpg" width="160"/ hspace="0"> <img align="left" src="img/Michel.jpg" width="160" hspace="45"/> <img align="left" src="img/Torsten.jpg" width="160" hspace="0"/> <img align="left" src="img/Frank.jpg" width="160" hspace="45"/> <img align="left" src="img/Jens.png" width="160" hspace="0"/> <img align="left" src="img/Empty.png" width="100" height="160" hspace="0"/> [Katrien Antonio](https://katrienantonio.github.io) [
](mailto:katrien.antonio@kuleuven.be) [Michel Vellekoop](https://www.uva.nl/profiel/v/e/m.h.vellekoop/m.h.vellekoop.html) [
](mailto:m.h.vellekoop@uva.nl) [Torsten Kleinow](https://www.uva.nl/profiel/k/l/t.kleinow/t.kleinow.html) [
](mailto:t.kleinow@uva.nl) [Frank van Berkum](https://www.uva.nl/profiel/b/e/f.vanberkum/f.vanberkum.html) [
](mailto:F.vanBerkum@uva.nl) [Jens Robben](https://jensrobben.github.io) [
](mailto:jens.robben@kuleuven.be) --- class: inverse, center, middle name: goals # Goals of tutorial session 3 <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Goals of tutorial session 3 .pull-left[ In this tutorial session you will learn how to: * download and visualize the NUTS 3 regions in Switzerland * download and illustrate weekly mortality and weather data for Switzerland * calibrate a weekly mortality baseline model * use a machine learning model to associate deviations from the mortality baseline with weather data * construct in-sample mortality predictions. The tutorial is a simplified version of the material presented in Module 7 on the analysis of fine-grained mortality data. ] .pull-right[ Outline of the workshop: * [NUTS 3 regions in Switzerland](#nuts3) * [Download and import mortality data](#mortalitydata) * [Download and import weather data](#weatherdata) * [Weekly mortality baseline model](#mortalitybaseline) * [Modelling deviations from the mortality baseline](#mortalitydeviations) * [Results](#results) The full working paper is available on arXiv: [https://arxiv.org/abs/2405.18020](https://arxiv.org/abs/2405.18020). Corresponding R code will be made available on GitHub: [https://github.com/jensrobben/EnvVar-Mortality-Europe](https://github.com/jensrobben/EnvVar-Mortality-Europe). ] --- class: inverse, center, middle name: nuts3 # NUTS 3 regions in Switzerland <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # NUTS 3 regions in Switzerland We focus on mortality data in NUTS 3 regions in Switzerland. NUTS 3 is the most detailed level in the Nomenclature of Territorial Units for Statistics (see [here](https://en.wikipedia.org/wiki/Nomenclature_of_Territorial_Units_for_Statistics)). We read the NUTS level shapefile using the `read_sf` function from the {sf} package. This shapefile contains geospatial data for all NUTS regions in Europe. 
We have pre-downloaded it from [Eurostat](https://ec.europa.eu/eurostat/web/gisco/geodata/statistical-units/territorial-units-statistics) and stored it in our GitHub repository. ```r shapef <- read_sf('../data/shapefile/NUTS_RG_01M_2021_4326.shp') ``` We then filter the shapefile to include only NUTS 3 level regions (`LEVL_CODE == 3`) in Switzerland (`CNTR_CODE == 'CH'`) using the `filter` function from the {dplyr} package, and arrange the data by NUTS 3 code. ```r shapef <- shapef %>% dplyr::filter(LEVL_CODE == 3, CNTR_CODE == "CH") %>% arrange(NUTS_ID) ``` We extract the `NAME_LATN` attribute from the shapefile and find that the Swiss NUTS 3 regions correspond to the 26 Swiss cantons. ```r head(shapef$NAME_LATN) ## [1] "Vaud" "Valais" "Genève" "Bern" "Freiburg" "Solothurn" ``` --- class: clear The `geometry` attribute stores the geographic data for Swiss cantons. Let's examine the geometry of the first canton in the shapefile: ```r shapef[1,]$geometry ## Geometry set for 1 feature ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: 6.064003 ymin: 46.18696 xmax: 7.248677 ymax: 46.98672 ## Geodetic CRS: WGS 84 ``` - The geometry type `MULTIPOLYGON` indicates that the shapefile is made up of multiple polygons outlining the boundaries of the Swiss cantons. - The dimension `XY` indicates that the polygons are defined in two dimensions (longitude and latitude). - The bounding box (`xmin`,`ymin`, `xmax`, `ymax`) reflects the rectangular area of (long, lat) coordinates encompassing the canton. - The geodetic CRS `WGS 84` is the coordinate reference system (CRS) used for this shapefile. --- class: clear We can visualize the Swiss NUTS 3 regions (cantons) using `ggplot`. The key function for plotting geographical data is the layer `geom_sf` from the {ggplot2} package. This function uses the geometry column from our shapefile. The `fill = NUTS_ID` aesthetic fills each NUTS 3 region with a different colour. 
```r ggplot(shapef) + geom_sf(aes(fill = NUTS_ID, geometry = geometry)) ``` The above code provides a basic plot. We can enhance it by adding a location icon for Lausanne and specifying a custom color palette. ```r colours <- distinctColorPalette(k = 26) load.fontawesome() location <- fontawesome(c('fa-map-marker')) ggplot(shapef) + geom_sf(aes(fill = NUTS_ID, geometry = geometry), col = 'gray70') + geom_text(aes(x = 6.633597, y = 46.519962, label = location), family = 'fontawesome-webfont', vjust = 0.05, data = shapef[1,], size = 7) + scale_fill_manual(name = 'Regions', values = colours) + theme_bw(base_size = 15) + xlab('') + ylab('') + ggtitle("NUTS 3 regions (cantons) Switzerland") ``` --- class: clear <img src="Tutorial-3_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> --- class: inverse, center, middle name: mortalitydata # Download and import mortality data <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Download and import mortality data We do a live download using the function `get_eurostat` from the {eurostat} package to retrieve Swiss death counts by week (from 2000 on), sex, 5-year age group and NUTS 3 region from Eurostat. ```r # Download weekly death counts from Eurostat (approx. 1 min) dxtw.raw <- get_eurostat(id = 'demo_r_mweek3', time_format = 'raw', cache = FALSE, filters = list(geo = shapef$NUTS_ID)) ``` - `id` refers to the unique identifier of the data set we download (see [here](https://ec.europa.eu/eurostat/databrowser/view/demo_r_mweek3/default/table?lang=en)) - `time_format = 'raw'` specifies we want to preserve the original date object - `cache = FALSE` avoids caching and speeds up the running process - `filters` limits the data to the NUTS 3 Swiss regions. If you encounter issues with the live download, we have pre-downloaded this weekly mortality data set, and made it available in the GitHub repository. 
You can load this R object as follows: ```r dxtw.raw <- readRDS('../data/eurostat/dxtw_CH_NUTS3.rds') ``` We set appropriate column names using the `colnames` function: ```r colnames(dxtw.raw) <- c('Freq', 'Unit', 'Sex', 'Age', 'Region', 'Time', 'Deaths') ``` --- class: clear We visualize an extract of the downloaded Eurostat data: ```r head(dxtw.raw) ``` <br> <table class="table" style="font-size: 25px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Freq </th> <th style="text-align:left;"> Unit </th> <th style="text-align:left;"> Sex </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Deaths </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W01 </td> <td style="text-align:right;"> 156 </td> </tr> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W02 </td> <td style="text-align:right;"> 148 </td> </tr> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W03 </td> <td style="text-align:right;"> 120 </td> </tr> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W04 </td> <td style="text-align:right;"> 126 </td> </tr> <tr> <td style="text-align:left;"> W 
</td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W05 </td> <td style="text-align:right;"> 113 </td> </tr> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W06 </td> <td style="text-align:right;"> 124 </td> </tr> </tbody> </table> --- class: clear .pull-left[ ```r dxtw.CH <- dxtw.raw %>% dplyr::filter(Age == 'TOTAL', Sex == 'T') %>% mutate(ISOYear = as.integer(substr(Time, 1, 4)), ISOWeek = as.integer(substr(Time, 7, 8))) ``` ] .pull-right[ We filter the mortality data set `dxtw.raw` on the total age class (`Age == 'TOTAL'`) and unisex data (`Sex == 'T'`) using the `filter` function from the {dplyr} package. Using the function `mutate`, we create two new columns, `ISOYear` and `ISOWeek`. To this end, we extract the year (digits 1-4) and week (digits 7-8) from the `Time` column in `dxtw.raw` using the function `substr`. ] .pull-left[ ```r dxtw.CH <- dxtw.CH %>% dplyr::filter(ISOYear <= 2019) %>% select(c('ISOYear', 'ISOWeek', 'Region', 'Deaths')) ``` ] .pull-right[ To exclude the COVID-19 pandemic, we filter the data up to ISO-year 2019. We retain only the relevant columns `ISOYear`, `ISOWeek`, `Region`, and `Deaths` using the `select` function from {dplyr}. ] .pull-left[ ```r # Add date wdate <- ISOweek2date(sprintf("%d-W%02d-%d", dxtw.CH$ISOYear, dxtw.CH$ISOWeek, 1)) dxtw.CH <- dxtw.CH %>% mutate('Date' = wdate) ``` ] .pull-right[ Next, we use the `ISOweek2date` function from the {ISOweek} package to create a new column `Date`, representing the start date of each ISO-week. This concludes the construction of a data set with weekly deaths for Switzerland, stored in `dxtw.CH`. On to the exposures! ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Visualize the evolution of the weekly death counts in the NUTS 3 region that includes Lausanne. <br> <br> ```r # Select region encompassing Lausanne region <- ... # Filter the data for the specified region dxtw.VD <- ... %>% filter(...) # Create the plot ggplot(...) + geom_line(aes(x = ..., y = ...), colour = RCLRbg) + xlab('Date') + ylab('Death counts') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Lausanne is the capital city of the canton of Vaud. The NUTS 3 code for the canton of Vaud is CH011, which can be found in the `shapef` object. ```r # Select region encompassing Lausanne region <- `'CH011'` # Filter the data for the specified region dxtw.VD <- `dxtw.CH` %>% filter(`Region == region`) # Create the plot ggplot(`dxtw.VD`) + geom_line(aes(x = `Date`, y = `Deaths`), colour = RCLRbg) + xlab('Date') + ylab('Death counts') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Lausanne is the capital city of the canton Vaud. The NUTS 3 code for the canton Vaud is CH011, which can be found in the `shapef` object. <img src="Tutorial-3_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" /> ] --- # Exposures .pull-left[ ```r Pxt.raw <- get_eurostat(id = 'demo_r_pjanaggr3', time_format = 'raw', cache = FALSE, filters = list(geo = shapef$NUTS_ID)) colnames(Pxt.raw) <- c('Freq', 'Unit', 'Sex', 'Age', 'Region', 'Time', 'Pop') ``` ] .pull-right[ We perform a live download (again) using the `get_eurostat` function to retrieve population counts on January 1st, categorized by age group, sex, and NUTS 3 region. The function's parameters are similar to those used for downloading weekly death counts, with the main difference being the identifier: `demo_r_pjanaggr3` (see [here](https://ec.europa.eu/eurostat/databrowser/product/view/demo_r_pjanaggr3?lang=en)). We specify the column names using the function `colnames`. ] .pull-left[ ```r Pxt.raw <- readRDS('../data/eurostat/Pxt_CH_NUTS3.rds') ``` ] .pull-right[ If you encounter issues with the live download, we have pre-downloaded the population count data set. You can load this R object using the `readRDS` function. ] .pull-left[ ```r unique(Pxt.raw$Age) ## [1] "TOTAL" "Y_LT15" "Y15-64" "Y_GE65" "UNK" unique(Pxt.raw$Sex) ## [1] "T" "M" "F" ``` ] .pull-right[ We check the levels of the `Age` and `Sex` covariates in the data set. 
] --- class: clear We display a sample of the data set `Pxt.raw`, which contains the population counts for Swiss regions and specific age groups: ```r head(Pxt.raw) ``` <br> <table class="table" style="font-size: 25px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Freq </th> <th style="text-align:left;"> Unit </th> <th style="text-align:left;"> Sex </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1990 </td> <td style="text-align:right;"> 571973 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1991 </td> <td style="text-align:right;"> 577524 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1992 </td> <td style="text-align:right;"> 587283 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:right;"> 593007 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 
1994 </td> <td style="text-align:right;"> 596736 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1995 </td> <td style="text-align:right;"> 602099 </td> </tr> </tbody> </table> --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Using a map of Switzerland, display the total population count in the year 2019, across the Swiss cantons. Consider both genders and all age groups. Identify the three cantons with the highest population. ```r # Filter data for the year 2019, total age group, unisex data pop.CH.2019 <- ... %>% dplyr::filter(Time == ..., Age == ..., Sex == ...) # Join population data to shapefile (consult ?left_join) shapef.pop <- shapef %>% left_join(..., by = c(... = ...)) # Plot ggplot(...) + geom_sf(aes(fill = ..., geometry = ...), col = 'gray70') + scale_fill_viridis_c(option = 'rocket', direction = -1) + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Zurich, Bern, and Vaud are the three most populated Swiss cantons, in that order. Below are the {ggplot} instructions used to visualize the data. We use the `left_join` function from the {dplyr} package to merge the population data set `pop.CH.2019` with the shapefile `shapef` by region. Since variable names differ in both data sets, we specify `by = c('NUTS_ID' = 'Region')` to indicate their correspondence. ```r # Filter data for the year 2019, total age group, unisex data pop.CH.2019 <- `Pxt.raw` %>% dplyr::filter(Time == `2019`, Age == `'TOTAL'`, Sex == `'T'`) # Join population data to shapefile (consult ?left_join) shapef.pop <- shapef %>% left_join(`pop.CH.2019`, by = c(`'NUTS_ID'` = `'Region'`)) # Plot ggplot(`shapef.pop`) + geom_sf(aes(fill = `Pop`, geometry = `geometry`), col = 'gray70') + scale_fill_viridis_c(option = 'rocket', direction = -1) + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Zurich, Bern, and Vaud are the three most populated Swiss cantons, in that order. Below are the {ggplot} instructions used to visualize the data. We use the `left_join` function from the {dplyr} package to merge the population data set `pop.CH.2019` with the shapefile `shapef` by region. Since variable names differ in both data sets, we specify `by = c('NUTS_ID' = 'Region')` to indicate their correspondence. <img src="Tutorial-3_files/figure-html/unnamed-chunk-27-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ ```r Pxt.CH <- Pxt.raw %>% dplyr::filter(Age == 'TOTAL', Sex == 'T') %>% mutate(Time = as.integer( as.character(Time))) ``` ] .pull-right[ We extract data from the `Pxt.raw` data set related to the total age group (`Age == 'TOTAL'`) and unisex population (`Sex == 'T'`) using the `filter` function from {dplyr}. Additionally, we convert the `Time` column from its original factor to an integer type. ] .pull-left[ ```r Extw.CH <- Pxt.CH %>% group_by(Region) %>% reframe(Time, 'Expo' = c((Pop[-length(Time)] + Pop[-1])/2, Pop[length(Time)])/52.18) %>% dplyr::filter(Time <= 2019, Time >= 2000) ``` ] .pull-right[ We define the annual exposure in each year `\(t\)`, region `\(r\)` as a mid-year population estimate and assume a constant weekly exposure within that year: `\begin{align*} E_{t}^{(r)} &= \dfrac{P_t^{(r)} + P_{t+1}^{(r)}}{2} \\ E_{t,w}^{(r)} &= \dfrac{E_t^{(r)}}{52.18}, \end{align*}` with `\(52.18\)` the average number of weeks in a year. ] .pull-left[ ```r Df <- dxtw.CH %>% left_join(Extw.CH, by = c('ISOYear' = 'Time', 'Region' = 'Region')) ``` ] .pull-right[ We merge the death counts `dxtw.CH` with the exposure estimates `Extw.CH` based on the year and NUTS 3 region using the `left_join` function from {dplyr}. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Calculate the weekly mortality rates `\(q_{t,w}^{(r)}\)` using the exposure estimates for the canton Zurich. Visualize these as a function of time. <br> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We first filter the mortality data set `Df` (with regional, weekly death counts and exposures) to focus only on data from the canton of Zurich (CH040). Then, we compute the mortality rates `\(q_{t,w}^{(r)}\)` as `\(1 - \exp(-d_{t,w}^{(r)}/E_{t,w}^{(r)})\)`. Lastly, we visualize the results using {ggplot}. ```r # Filter on canton Zurich (CH040) Df.ZH <- Df %>% dplyr::filter(Region == 'CH040') # Construct mortality rates Df.ZH$qtw <- 1 - exp(-Df.ZH$Deaths/Df.ZH$Expo) # Plot ggplot(Df.ZH) + geom_line(aes(x = Date, y = qtw), col = RCLRbg) + ylab(bquote(q['t,w'])) + ggtitle('Zurich') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We first filter the mortality data set `Df` (with regional, weekly death counts and exposures) to focus only on data from the canton of Zurich (CH040). Then, we compute the mortality rates `\(q_{t,w}^{(r)}\)` as `\(1 - \exp(-d_{t,w}^{(r)}/E_{t,w}^{(r)})\)`. Lastly, we visualize the results using {ggplot}. <img src="Tutorial-3_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle name: weatherdata # Download and import weather data <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Download and import weather data In the R script [eobs_data.R](https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/raw/scripts/eobs_data.R), we do a live download from the Copernicus Climate Data Store to retrieve daily weather factors throughout the years 2000-2019 (WARNING! - high computational time). These weather factors include the daily levels of the maximum temperature (`tx`), average temperature (`tg`), minimum temperature (`tn`), relative humidity (`hu`), total precipitation amount (`rr`), and average wind speed (`fg`). The resulting data sets are stored for you in the GitHub repository. 
Let's read them into our environment: ```r tn_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_tn_daily.rds') tg_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_tg_daily.rds') tx_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_tx_daily.rds') fg_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_fg_daily.rds') hu_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_hu_daily.rds') rr_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_rr_daily.rds') ``` We examine the structure of one of these datasets: ```r head(tx_NUTS3_daily) ``` --- class: clear <br> <br> <table class="table" style="font-size: 25px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:right;"> ISOYear </th> <th style="text-align:right;"> ISOWeek </th> <th style="text-align:left;"> Region </th> <th style="text-align:right;"> tx </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 2000-01-01 </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 1.865263 </td> </tr> <tr> <td style="text-align:left;"> 2000-01-02 </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 3.497631 </td> </tr> <tr> <td style="text-align:left;"> 2000-01-03 </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 3.433158 </td> </tr> <tr> <td style="text-align:left;"> 2000-01-04 </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 4.652895 </td> </tr> <tr> <td style="text-align:left;"> 2000-01-05 </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 5.644210 
</td> </tr> <tr> <td style="text-align:left;"> 2000-01-06 </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 4.355263 </td> </tr> </tbody> </table> --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Illustrate the evolution of the weather factors in the NUTS 3 region of Lausanne (Vaud) as a function of time using {ggplot}. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We provide the {ggplot} instructions to visualize the evolution of the minimum temperature in the NUTS 3 region of Lausanne (CH011). We start by selecting the relevant region (CH011). Then, we filter the daily minimum temperature data set `tn_NUTS3_daily` for this region. Finally, we create the plot. ```r region <- 'CH011' # canton Vaud tn_VD_daily <- tn_NUTS3_daily %>% dplyr::filter(Region == region) ggplot(tn_VD_daily) + geom_line(aes(x = Date, y = tn), colour = RCLRbg) + xlab('Date') + ylab('Minimum temperature') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Minimum temperature. <img src="Tutorial-3_files/figure-html/unnamed-chunk-37-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Average temperature. <img src="Tutorial-3_files/figure-html/unnamed-chunk-38-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Maximum temperature. <img src="Tutorial-3_files/figure-html/unnamed-chunk-39-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Wind speed. <img src="Tutorial-3_files/figure-html/unnamed-chunk-40-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Humidity. <img src="Tutorial-3_files/figure-html/unnamed-chunk-41-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Rainfall. <img src="Tutorial-3_files/figure-html/unnamed-chunk-42-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Visualize the region-specific maximum temperature on the map of Switzerland on August 1, 2003, and the rainfall on January 1, 2000. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: To visualize the maximum temperature data on August 1, 2003, on the map of Switzerland, we first filter the data set `tx_NUTS3_daily` for the specified date. Then, we join this data with the shapefile `shapef` based on the region. Lastly, we plot the results using `geom_sf` from {ggplot2} and apply the viridis color palette. ```r tx.shp <- tx_NUTS3_daily %>% dplyr::filter(Date == '2003-08-01') %>% left_join(shapef, by = c('Region' = 'NUTS_ID')) ggplot(tx.shp) + geom_sf(aes(geometry = geometry, fill = tx)) + scale_fill_viridis_c(option = 'rocket', direction = -1) + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: To visualize the maximum temperature data on August 1, 2003, on the map of Switzerland, we first filter the data set `tx_NUTS3_daily` for the specified date. Then, we join this data with the shapefile `shapef` based on the region. Lastly, we plot the results using `geom_sf` from {ggplot2} and apply the viridis color palette. <img src="Tutorial-3_files/figure-html/unnamed-chunk-44-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We repeat the procedure for the rainfall on January 1, 2000, but use a different colour palette. <img src="Tutorial-3_files/figure-html/unnamed-chunk-45-1.svg" style="display: block; margin: auto;" /> ] --- # From daily to weekly weather data .pull-left[ ```r Df.weather <- plyr::join_all(list(tn_NUTS3_daily, tg_NUTS3_daily, tx_NUTS3_daily, fg_NUTS3_daily, hu_NUTS3_daily, rr_NUTS3_daily), by = c('Date', 'ISOYear', 'ISOWeek', 'Region')) ``` ] .pull-right[ We merge all the daily weather data sets (`tn_NUTS3_daily`, `tg_NUTS3_daily`, `tx_NUTS3_daily`, `fg_NUTS3_daily`, `hu_NUTS3_daily`, `rr_NUTS3_daily`) into one large data frame `Df.weather` using the `join_all` function from the {plyr} package. The merging is done based on common columns: `Date`, `ISOYear`, `ISOWeek`, and `Region`. ] .pull-left[ ```r Df.weather <- Df.weather %>% group_by(ISOYear, ISOWeek, Region) %>% reframe('tn' = mean(tn, na.rm = TRUE), 'tg' = mean(tg, na.rm = TRUE), 'tx' = mean(tx, na.rm = TRUE), 'hu' = mean(hu, na.rm = TRUE), 'fg' = mean(fg, na.rm = TRUE), 'rr' = mean(rr, na.rm = TRUE)) ``` ] .pull-right[ To align with the weekly time scale of the mortality statistics, we convert the daily weather data in `Df.weather` to weekly data. This is done by grouping the data by `ISOYear`, `ISOWeek`, and `Region`, and then calculating the weekly average for each weather variable (`tn`, `tg`, `tx`, `hu`, `fg`, `rr`). We exclude any missing daily values in the averaging process using the argument `na.rm = TRUE`. ] .pull-left[ ```r Df <- Df %>% left_join(Df.weather) ``` ] .pull-right[ We join the mortality data set `Df` with the weather data set `Df.weather` based on their common columns (the default).
] --- class: inverse, center, middle name: mortalitybaseline # Weekly mortality baseline model <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Model specifications To model the weekly mortality baseline trend, we account for the seasonality observed in the region-specific weekly death counts by means of sine-cosine Fourier terms, defined as: $$ \left[ \sin\left(2\pi \dfrac{n}{T} x\right), \cos\left(2\pi \dfrac{n}{T}x\right)\right]_{n\in\mathbb{N}},$$ with `\(n\)` the frequency or number of cycles and `\(T\)` the period of the sinusoidal terms. We visualize the Fourier terms with a frequency of `\(n=1\)` and a period of 52.18 (annual cycle). <br> <img src="Tutorial-3_files/figure-html/unnamed-chunk-50-1.svg" style="display: block; margin: auto;" /> --- class: clear We use the Serfling model specification to construct a baseline mortality model for the weekly death counts `\(D_{t,w}^{(r)}\)` in a specific region `\(r\)`. This model assumes that the weekly number of deaths follows a Poisson distribution, with mean equal to the weekly exposure `\(E_{t,w}^{(r)}\)` times the weekly force of mortality `\(\mu_{t,w}^{(r)}\)`: `\begin{align*} D_{t,w}^{(r)} &\sim \text{Poisson}\left(E_{t,w}^{(r)}\: \mu_{t,w}^{(r)}\right). \end{align*}` Specifically: `\begin{align*} \log \mu_{t,w}^{(r)} &= \: \beta_0^{(r)} + \beta_1^{(r)} t + \beta_2^{(r)} \sin\left(\frac{2\pi w}{52.18}\right) + \beta_3^{(r)} \cos\left(\frac{2\pi w}{52.18}\right) + \:\:\:\beta_4^{(r)} \sin\left(\frac{2\pi w}{26.09}\right) + \beta_5^{(r)} \cos\left(\frac{2\pi w}{26.09}\right), \end{align*}` which includes an intercept, an annual time trend, and Fourier terms to capture both annual and semi-annual frequencies. You can determine the number of seasonal terms in each region using AIC or BIC (not covered in this tutorial).
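--- class: clear

To make the Serfling specification concrete, here is a minimal simulation sketch. All coefficient values and the exposure below are hypothetical choices for illustration only (not calibrated estimates), and the time trend `\(\beta_1^{(r)} t\)` is omitted since we consider a single year:

```r
# Minimal sketch: evaluate the Serfling linear predictor over the weeks of
# one year and simulate Poisson death counts.
# NOTE: beta values and the exposure Etw are hypothetical, for illustration.
set.seed(1)
weeks <- 1:52
beta  <- c(int = -7.5, s52 = 0.08, c52 = 0.12, s26 = 0.02, c26 = 0.03)
Etw   <- 10000   # assumed constant weekly exposure

# Linear predictor: intercept + annual and semi-annual Fourier terms
log.mu <- beta[['int']] +
  beta[['s52']] * sin(2*pi*weeks/52.18) + beta[['c52']] * cos(2*pi*weeks/52.18) +
  beta[['s26']] * sin(4*pi*weeks/52.18) + beta[['c26']] * cos(4*pi*weeks/52.18)

# Weekly death counts: D ~ Poisson(E * mu)
D <- rpois(length(weeks), lambda = Etw * exp(log.mu))
```

Calibration then amounts to estimating the `\(\beta\)` coefficients with a Poisson GLM, using the log-exposure as offset.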
We compute these Fourier terms and add them to our mortality dataset: ```r Df <- Df %>% mutate('fsin52' = sin(2*pi*ISOWeek/52.1775), 'fcos52' = cos(2*pi*ISOWeek/52.1775), 'fsin26' = sin(4*pi*ISOWeek/52.1775), 'fcos26' = cos(4*pi*ISOWeek/52.1775)) ``` --- # Model calibration .pull-left[ ```r formula <- Deaths ~ ISOYear + fsin52 + fcos52 + fsin26 + fcos26 ``` ] .pull-right[ We construct the formula specification that links the response variable `Deaths` to the covariates `ISOYear` and the Fourier terms `fsin52`, `fcos52`, `fsin26`, and `fcos26`. ] .pull-left[ ```r for(r in shapef$NUTS_ID){ Dfr <- Df %>% dplyr::filter(Region == r) fit.r <- glm(formula, data = Dfr, offset = log(Dfr$Expo), family = poisson(link = 'log')) Df[which(Df$Region == r),'bDeaths'] <- fit.r$fitted.values } ``` ] .pull-right[ For each Swiss canton `\(r\)`, we filter the mortality data set to include only data for that region, saving it as `Dfr`. We then fit a Generalized Linear Model (GLM) on `Dfr`, using the logarithm of the exposure variable, `log(Dfr$Expo)`, as the `offset`. We specify the Poisson distribution with a log-link using the `family` argument. Finally, we add the estimated deaths to the mortality data set `Df` in a new column `bDeaths`. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Calculate and illustrate the estimated baseline and observed weekly mortality rates for the cantons of Vaud and Sankt Gallen. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We start with the canton of Vaud (CH011). First, we filter the mortality data set `Df` on data for Vaud. Then we compute the observed and estimated weekly mortality rates and plot the results. ```r # Filter on canton Vaud (CH011) Df.VD <- Df %>% dplyr::filter(Region == 'CH011') # Observed and estimated mortality rates Df.VD$qtw.obs <- 1 - exp(-Df.VD$Deaths/Df.VD$Expo) Df.VD$qtw.est <- 1 - exp(-Df.VD$bDeaths/Df.VD$Expo) # Plot ggplot(Df.VD) + geom_line(aes(x = Date, y = qtw.obs), col = 'gray80') + geom_line(aes(x = Date, y = qtw.est), col = RCLRbg, linewidth = 0.8) + ylab(bquote(q['t,w'])) + ggtitle('Vaud') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We start with the canton of Vaud (CH011). First, we filter the mortality data set `Df` on data for the Region of Vaud. Then we compute the observed and estimated weekly mortality rates and plot the results. <img src="Tutorial-3_files/figure-html/unnamed-chunk-55-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We repeat the procedure for Sankt Gallen (CH055). <img src="Tutorial-3_files/figure-html/unnamed-chunk-56-1.svg" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle name: mortalitydeviations # Modelling deviations from the mortality baseline <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Model specifications We associate deviations from the weekly mortality baseline model with region-specific weather features. We work with the following model assumptions: `\begin{align*} D_{t,w}^{(r)} &\sim \text{Poisson}\left(\hat{b}^{(r)}_{t,w} \: \phi_{t,w}^{(r)} \right) \\ \phi_{t,w}^{(r)} &= f\left(\boldsymbol{c}_{t,w}^{(r)},\: l^1\left(\boldsymbol{c}_{t,w}^{(r)}\right), \: l^2\left(\boldsymbol{c}_{t,w}^{(r)}\right), ..., \:l^s\left(\boldsymbol{c}_{t,w}^{(r)}\right)\right). \end{align*}` Here: - `\(D_{t,w}^{(r)}\)` represents the observed deaths in region `\(r\)` during week `\(w\)` of year `\(t\)` - `\(\hat{b}^{(r)}_{t,w}\)` denotes the estimated baseline deaths based on the Serfling model in region `\(r\)`, week `\(w\)` and year `\(t\)` - `\(\boldsymbol{c}_{t,w}^{(r)}\)` denotes the set of weather features - `\(l^j(\boldsymbol{c}_{t,w}^{(r)})\)` denotes the set of weather features, lagged by `\(j\)` weeks - `\(f(\cdot)\)` is a machine learning model that can capture possibly non-linear associations between target and inputs, as well as interaction effects between the input features. In this tutorial we work with an XGBoost model. --- # Feature construction .pull-left[ ```r vars <- c('tn', 'tg', 'tx', 'hu', 'fg', 'rr') ``` ] .pull-right[ We select the weather features of interest and save them in the variable `vars`.
] .pull-left[ ```r list.lagdf <- list() for (v in vars){ list.lagdf[[which(vars == v)]] <- Df %>% group_by(Region) %>% reframe(Date, !!paste0(v,'_l1') := lag(!!sym(v), 1), !!paste0(v,'_l2') := lag(!!sym(v), 2)) } df.lag <- plyr::join_all(list.lagdf, by = c('Region', 'Date')) ``` ```r Df <- Df %>% left_join(df.lag, by = c('Region', 'Date')) %>% na.omit() ``` ] .pull-right[ For each feature in `vars`, we group the data frame `Df` (containing mortality and weather data) by `Region`. We then use the `reframe` function from {dplyr} to create new columns with one-week and two-week lagged values for each weather feature by applying the `lag` function from {dplyr}. These lagged data frames are stored in the list `list.lagdf`. Lastly, we merge all the lagged data frames into a single data frame `df.lag` using the `join_all` function from {plyr}. Note the use of `!!` (injection operator) to force an early evaluation of the `paste0` expression (check `help("!!")`). We merge the lagged weather features with the main data frame `Df` based on the columns `Region` and `Date` and remove missing observations using `na.omit`. ] --- class: clear .pull-left[ ```r vars.l1 <- paste0(vars, '_l1') vars.l2 <- paste0(vars, '_l2') vars.xgb <- c(vars, vars.l1, vars.l2) ``` ] .pull-right[ We now create the input feature set to calibrate an XGBoost model. This set `vars.xgb` stores the original weather features and their lagged values (one-week and two-week lags). ] .pull-left[ ```r xgb.Df <- xgb.DMatrix( data = as.matrix(Df %>% select(all_of(vars.xgb))), label = as.matrix(Df %>% select(Deaths)) ) ``` ] .pull-right[ We convert the data frame `Df` into an `xgb.DMatrix` object, which is required for the XGBoost model implementation in the package {xgboost}. The `data` argument includes the input features, and the `label` argument includes the response variable, which is the number of observed deaths.
] .pull-left[ ```r setinfo(xgb.Df, "base_margin", log(Df$bDeaths)) ## [1] TRUE ``` ] .pull-right[ We set the logarithm of the baseline number of deaths as an offset using the `setinfo` function from the {xgboost} package. In this way, the model explains deviations from the mortality baseline using the selected weather features. ] --- # Model calibration .pull-left[ ```r xgbcv <- xgb.cv( params = list(eta = 0.01, max_depth = 5, base_score = 0, objective = 'count:poisson'), data = xgb.Df, nfold = 10, early_stopping_rounds = 50, nrounds = 1000) ``` ```r xgbfit <- xgb.train( params = list(eta = 0.01, max_depth = 5, base_score = 0, objective = 'count:poisson'), data = xgb.Df, nrounds = xgbcv$best_iteration) ``` ] .pull-right[ Using the `xgb.cv` function from the {xgboost} package, we perform 10-fold cross-validation (`nfold = 10`) to tune the number of boosting iterations. To limit computational time, we fix the learning rate `eta` at 0.01 and the maximum tree depth `max_depth` at 5. The initial prediction score `base_score` is set to 0 here, because we already include the baseline number of deaths as the initial prediction (via the offset). We use Poisson regression by setting the `objective` to `'count:poisson'`. The `early_stopping_rounds` argument indicates that training will stop if performance does not improve for 50 rounds. The maximum number of boosting rounds is set to 1000 with `nrounds`. We fit the XGBoost model on the entire training data `xgb.Df` using the optimal number of boosting rounds `xgbcv$best_iteration`. ] --- class: inverse, center, middle name: results # Results <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: We retrieve the estimated deaths from the calibrated machine learning model using the `predict` function. We add these to our mortality data set `Df` as a new column `xgbDeaths`. ```r Df$xgbDeaths <- predict(object = xgbfit, newdata = xgb.Df) ``` You can now visualize the observed deaths (in gray) alongside the deaths estimated by the baseline model (in blue) and the machine learning model (in red pink) for the cantons of Vaud and Zurich. Complete the following {ggplot} code: ```r # Filter on the Region of Vaud Df.VD <- Df %>% ... # Plot ggplot(...) + geom_line(aes(x = ..., y = ..., col = 'Observed')) + geom_line(aes(x = ..., y = ..., col = 'XGBoost')) + geom_line(aes(x = ..., y = ..., col = 'Baseline')) + scale_colour_manual(values = c('gray80', RCLRbg, red_pink), breaks = c('Observed', 'Baseline', 'XGBoost'), name = '') + xlab('Time') + ylab('Deaths') + theme_bw(base_size = 15) + ggtitle('Vaud') ``` ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We extract the mortality data set restricted to the region Vaud (`CH011`) using the `filter` function from the {dplyr} package. We plot the observed deaths (`Deaths`) in gray, the deaths estimated by the baseline model (`bDeaths`) in blue, and the deaths estimated by the machine learning model (`xgbDeaths`) in red pink. ```r # Filter on the Region of Vaud Df.VD <- Df %>% `dplyr::filter(Region == 'CH011')` # Plot ggplot(`Df.VD`) + geom_line(aes(x = `Date`, y = `Deaths`, col = 'Observed')) + geom_line(aes(x = `Date`, y = `xgbDeaths`, col = 'XGBoost')) + geom_line(aes(x = `Date`, y = `bDeaths`, col = 'Baseline')) + scale_colour_manual(values = c('gray80', RCLRbg, red_pink), breaks = c('Observed', 'Baseline', 'XGBoost'), name = '') + xlab('Time') + ylab('Deaths') + theme_bw(base_size = 15) + ggtitle('Vaud') ``` ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: <img src="Tutorial-3_files/figure-html/unnamed-chunk-69-1.svg" style="display: block; margin: auto;" /> ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We repeat this procedure for the canton of Zurich (CH040). <img src="Tutorial-3_files/figure-html/unnamed-chunk-70-1.svg" style="display: block; margin: auto;" /> ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We zoom in on the years 2002-2004 and visualize the minimum, average and maximum temperature in Vaud. <img src="Tutorial-3_files/figure-html/unnamed-chunk-71-1.svg" style="display: block; margin: auto;" /> ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Zoom in on the years 2001-2005: [European summer heat wave of 2003](https://idlcc.fc.ul.pt/pdf/Garcia_Herrera_heatwave_2010.pdf) <img src="Tutorial-3_files/figure-html/unnamed-chunk-72-1.svg" style="display: block; margin: auto;" /> ] --- # Thanks! <img src="img/xaringan.png" class="title-hex" width="50" height="50" align="right"> <br> <br> <br> <br> Slides created with the R package [xaringan](https://github.com/yihui/xaringan). <br> <br> <br> Course material available via <br> <img align="left" src="img/gitfront.png" width="35" hspace="10"/> https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/ <style type="text/css"> .pull-right ~ * { clear: unset; } .pull-right + * { clear: both; } </style> <style type="text/css"> .inverse .remark-slide-number { display: none; } </style>