class: center, middle, inverse, title-slide, clear # Modelling and quantifying mortality and longevity risk ## Tutorial session 3 <html> <div style="float:left"></div> <hr align='center' color='#106EB6' size=1px width=97%> </html> ### Katrien Antonio, Torsten Kleinow, Frank van Berkum, Michel Vellekoop and Jens Robben ### <a href="https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/">35th International Summer School of the Swiss Association of Actuaries </a> | June 3-7, 2024 ### <a> </a> --- class: inverse, center, middle name: prologue # Prologue <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- name: introduction # Introduction ### Workshop <img align="left" src="img/gitfront.png" width="35" hspace="10"/> https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/ The workshop's landing page on GitFront, where we upload slide decks, R code and data sets. ### Us <img align="left" src="img/Katrien.jpg" width="160"/ hspace="0"> <img align="left" src="img/Michel.jpg" width="160" hspace="45"/> <img align="left" src="img/Torsten.jpg" width="160" hspace="0"/> <img align="left" src="img/Frank.jpg" width="160" hspace="45"/> <img align="left" src="img/Jens.png" width="160" hspace="0"/> <img align="left" src="img/Empty.png" width="100" height="160" hspace="0"/> [Katrien Antonio](https://katrienantonio.github.io) [
](mailto:katrien.antonio@kuleuven.be) [Michel Vellekoop](https://www.uva.nl/profiel/v/e/m.h.vellekoop/m.h.vellekoop.html) [
](mailto:m.h.vellekoop@uva.nl) [Torsten Kleinow](https://www.uva.nl/profiel/k/l/t.kleinow/t.kleinow.html) [
](mailto:t.kleinow@uva.nl) [Frank van Berkum](https://www.uva.nl/profiel/b/e/f.vanberkum/f.vanberkum.html) [
](mailto:F.vanBerkum@uva.nl) [Jens Robben](https://jensrobben.github.io) [
](mailto:jens.robben@kuleuven.be) --- class: inverse, center, middle name: goals # Goals of tutorial session 3 <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Goals of tutorial session 3 .pull-left[ In this tutorial session you will learn how to: * download and visualize the NUTS 3 regions in Switzerland * download and illustrate weekly mortality and weather data for Switzerland * calibrate a weekly mortality baseline model * use a machine learning model to associate deviations from the mortality baseline with weather data * construct in-sample mortality predictions. The tutorial is a simplified version of the material presented in Module 7 on the analysis of fine-grained mortality data. ] .pull-right[ Outline of the workshop: * [NUTS 3 regions in Switzerland](#nuts3) * [Download and import mortality data](#mortalitydata) * [Download and import weather data](#weatherdata) * [Weekly mortality baseline model](#mortalitybaseline) * [Modelling deviations from the mortality baseline](#mortalitydeviations) * [Results](#results) The full working paper is available on arXiv: [https://arxiv.org/abs/2405.18020](https://arxiv.org/abs/2405.18020). Corresponding R code will be made available on GitHub: [https://github.com/jensrobben/EnvVar-Mortality-Europe](https://github.com/jensrobben/EnvVar-Mortality-Europe). ] --- class: inverse, center, middle name: nuts3 # NUTS 3 regions in Switzerland <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # NUTS 3 regions in Switzerland We focus on mortality data in NUTS 3 regions in Switzerland. NUTS 3 is the most detailed level in the Nomenclature of Territorial Units for Statistics (see [here](https://en.wikipedia.org/wiki/Nomenclature_of_Territorial_Units_for_Statistics)). We read the NUTS level shapefile using the `read_sf` function from the {sf} package. This shapefile contains geospatial data for all NUTS regions in Europe. 
We have pre-downloaded it from [Eurostat](https://ec.europa.eu/eurostat/web/gisco/geodata/statistical-units/territorial-units-statistics) and stored it in our GitHub repository. ```r shapef <- read_sf('../data/shapefile/NUTS_RG_01M_2021_4326.shp') ``` We then filter the shapefile to include only NUTS 3 level regions (`LEVL_CODE == 3`) in Switzerland (`CNTR_CODE == 'CH'`) using the `filter` function from the {dplyr} package, and arrange the data by NUTS 3 code. ```r shapef <- shapef %>% dplyr::filter(LEVL_CODE == 3, CNTR_CODE == "CH") %>% arrange(NUTS_ID) ``` We extract the `NAME_LATN` attribute from the shapefile and find that the Swiss NUTS 3 regions correspond to the 26 Swiss cantons. ```r head(shapef$NAME_LATN) ## [1] "Vaud" "Valais" "Genève" "Bern" "Freiburg" "Solothurn" ``` --- class: clear The `geometry` attribute stores the geographic data for Swiss cantons. Let's examine the geometry of the first canton in the shapefile: ```r shapef[1,]$geometry ## Geometry set for 1 feature ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: 6.064003 ymin: 46.18696 xmax: 7.248677 ymax: 46.98672 ## Geodetic CRS: WGS 84 ``` - The geometry type `MULTIPOLYGON` indicates that the shapefile is made up of multiple polygons outlining the boundaries of the Swiss cantons. - The dimension `XY` indicates that the polygons are defined in two dimensions (longitude and latitude). - The bounding box (`xmin`,`ymin`, `xmax`, `ymax`) reflects the rectangular area of (long, lat) coordinates encompassing the canton. - The geodetic CRS `WGS 84` is the coordinate reference system (CRS) used for this shapefile. --- class: clear We can visualize the Swiss NUTS 3 regions (cantons) using `ggplot`. The key function for plotting geographical data is the layer `geom_sf` from the {ggplot2} package. This function uses the geometry column from our shapefile. The `fill = NUTS_ID` aesthetic fills each NUTS 3 region with a different colour. 
```r ggplot(shapef) + geom_sf(aes(fill = NUTS_ID, geometry = geometry)) ``` The above code provides a basic plot. We can enhance it by adding a location icon for Lausanne and specifying a custom color palette. ```r colours <- distinctColorPalette(k = 26) load.fontawesome() location <- fontawesome(c('fa-map-marker')) ggplot(shapef) + geom_sf(aes(fill = NUTS_ID, geometry = geometry), col = 'gray70') + geom_text(aes(x = 6.633597, y = 46.519962, label = location), family = 'fontawesome-webfont', vjust = 0.05, data = shapef[1,], size = 7) + scale_fill_manual(name = 'Regions', values = colours) + theme_bw(base_size = 15) + xlab('') + ylab('') + ggtitle("NUTS 3 regions (cantons) Switzerland") ``` --- class: clear <img src="Tutorial-3_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> --- class: inverse, center, middle name: mortalitydata # Download and import mortality data <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Download and import mortality data We do a live download using the function `get_eurostat` from the {eurostat} package to retrieve Swiss death counts by week (from 2000 on), sex, 5-year age group and NUTS 3 region from Eurostat. ```r # Download weekly death counts from Eurostat (approx. 1 min) dxtw.raw <- get_eurostat(id = 'demo_r_mweek3', time_format = 'raw', cache = FALSE, filters = list(geo = shapef$NUTS_ID)) ``` - `id` refers to the unique identifier of the data set we download (see [here](https://ec.europa.eu/eurostat/databrowser/view/demo_r_mweek3/default/table?lang=en)) - `time_format = 'raw'` specifies we want to preserve the original date object - `cache = FALSE` avoids caching and speeds up the running process - `filters` limits the data to the NUTS 3 Swiss regions. If you encounter issues with the live download, we have pre-downloaded this weekly mortality data set, and made it available in the GitHub repository. 
You can load this R object as follows: ```r dxtw.raw <- readRDS('../data/eurostat/dxtw_CH_NUTS3.rds') ``` We set appropriate column names using the `colnames` function: ```r colnames(dxtw.raw) <- c('Freq', 'Unit', 'Sex', 'Age', 'Region', 'Time', 'Deaths') ``` --- class: clear We visualize an extract of the downloaded Eurostat data: ```r head(dxtw.raw) ``` <br> <table class="table" style="font-size: 25px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Freq </th> <th style="text-align:left;"> Unit </th> <th style="text-align:left;"> Sex </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Deaths </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W01 </td> <td style="text-align:right;"> 156 </td> </tr> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W02 </td> <td style="text-align:right;"> 148 </td> </tr> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W03 </td> <td style="text-align:right;"> 120 </td> </tr> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W04 </td> <td style="text-align:right;"> 126 </td> </tr> <tr> <td style="text-align:left;"> W 
</td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W05 </td> <td style="text-align:right;"> 113 </td> </tr> <tr> <td style="text-align:left;"> W </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 2000-W06 </td> <td style="text-align:right;"> 124 </td> </tr> </tbody> </table> --- class: clear .pull-left[ ```r dxtw.CH <- dxtw.raw %>% dplyr::filter(Age == 'TOTAL', Sex == 'T') %>% mutate(ISOYear = as.integer(substr(Time, 1, 4)), ISOWeek = as.integer(substr(Time, 7, 8))) ``` ] .pull-right[ We filter the mortality data set `dxtw.raw` on the total age class (`Age == 'TOTAL'`) and unisex data (`Sex == 'T'`) using the `filter` function from the {dplyr} package. Using the function `mutate`, we create two new columns, `ISOYear` and `ISOWeek`. To this end, we extract the year (digits 1-4) and week (digits 7-8) from the `Time` column in `dxtw.raw` using the function `substr`. ] .pull-left[ ```r dxtw.CH <- dxtw.CH %>% dplyr::filter(ISOYear <= 2019) %>% select(c('ISOYear', 'ISOWeek', 'Region', 'Deaths')) ``` ] .pull-right[ To exclude the COVID-19 pandemic, we filter the data up to ISO-year 2019. We retain only the relevant columns `ISOYear`, `ISOWeek`, `Region`, and `Deaths` using the `select` function from {dplyr}. ] .pull-left[ ```r # Add date wdate <- ISOweek2date(sprintf("%d-W%02d-%d", dxtw.CH$ISOYear, dxtw.CH$ISOWeek, 1)) dxtw.CH <- dxtw.CH %>% mutate('Date' = wdate) ``` ] .pull-right[ Next, we use the `ISOweek2date` function from the {ISOweek} package to create a new column `Date`, representing the start date of each ISO-week. This concludes the construction of a data set with weekly deaths for Switzerland, stored in `dxtw.CH`. On to the exposures! ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Visualize the evolution of the weekly death counts in the NUTS 3 region that includes Lausanne. <br> <br> ```r # Select region encompassing Lausanne region <- ... # Filter the data for the specified region dxtw.VD <- ... %>% filter(...) # Create the plot ggplot(...) + geom_line(aes(x = ..., y = ...), colour = RCLRbg) + xlab('Date') + ylab('Death counts') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Lausanne is the capital city of the canton of Vaud. The NUTS 3 code for the canton of Vaud is CH011, which can be found in the `shapef` object. ```r # Select region encompassing Lausanne region <- `'CH011'` # Filter the data for the specified region dxtw.VD <- `dxtw.CH` %>% filter(`Region == region`) # Create the plot ggplot(`dxtw.VD`) + geom_line(aes(x = `Date`, y = `Deaths`), colour = RCLRbg) + xlab('Date') + ylab('Death counts') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Lausanne is the capital city of the canton Vaud. The NUTS 3 code for the canton Vaud is CH011, which can be found in the `shapef` object. <img src="Tutorial-3_files/figure-html/unnamed-chunk-19-1.svg" style="display: block; margin: auto;" /> ] --- # Exposures .pull-left[ ```r Pxt.raw <- get_eurostat(id = 'demo_r_pjanaggr3', time_format = 'raw', cache = FALSE, filters = list(geo = shapef$NUTS_ID)) colnames(Pxt.raw) <- c('Freq', 'Unit', 'Sex', 'Age', 'Region', 'Time', 'Pop') ``` ] .pull-right[ We perform a live download (again) using the `get_eurostat` function to retrieve population counts on January 1st, categorized by age group, sex, and NUTS 3 region. The function's parameters are similar to those used for downloading weekly death counts, with the main difference being the identifier: `demo_r_pjanaggr3` (see [here](https://ec.europa.eu/eurostat/databrowser/product/view/demo_r_pjanaggr3?lang=en)). We specify the column names using the function `colnames`. ] .pull-left[ ```r Pxt.raw <- readRDS('../data/eurostat/Pxt_CH_NUTS3.rds') ``` ] .pull-right[ If you encounter issues with the live download, we have pre-downloaded the population count data set. You can load this R object using the `readRDS` function. ] .pull-left[ ```r unique(Pxt.raw$Age) ## [1] "TOTAL" "Y_LT15" "Y15-64" "Y_GE65" "UNK" unique(Pxt.raw$Sex) ## [1] "T" "M" "F" ``` ] .pull-right[ We check the levels of the `Age` and `Sex` covariates in the data set. 
] --- class: clear We display a sample of the data set `Pxt.raw`, which contains the population counts for Swiss regions and specific age groups: ```r head(Pxt.raw) ``` <br> <table class="table" style="font-size: 25px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Freq </th> <th style="text-align:left;"> Unit </th> <th style="text-align:left;"> Sex </th> <th style="text-align:left;"> Age </th> <th style="text-align:left;"> Region </th> <th style="text-align:left;"> Time </th> <th style="text-align:right;"> Pop </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1990 </td> <td style="text-align:right;"> 571973 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1991 </td> <td style="text-align:right;"> 577524 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1992 </td> <td style="text-align:right;"> 587283 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:right;"> 593007 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 
1994 </td> <td style="text-align:right;"> 596736 </td> </tr> <tr> <td style="text-align:left;"> A </td> <td style="text-align:left;"> NR </td> <td style="text-align:left;"> T </td> <td style="text-align:left;"> TOTAL </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:left;"> 1995 </td> <td style="text-align:right;"> 602099 </td> </tr> </tbody> </table> --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Using a map of Switzerland, display the total population count in the year 2019, across the Swiss cantons. Consider both genders and all age groups. Identify the three cantons with the highest population. ```r # Filter data for the year 2019, total age group, unisex data pop.CH.2019 <- ... %>% dplyr::filter(Time == ..., Age == ..., Sex == ...) # Join population data to shapefile (consult ?left_join) shapef.pop <- shapef %>% left_join(..., by = c(... = ...)) # Plot ggplot(...) + geom_sf(aes(fill = ..., geometry = ...), col = 'gray70') + scale_fill_viridis_c(option = 'rocket', direction = -1) + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Zurich, Bern, and Vaud are the three most populated Swiss cantons, in that order. Below are the {ggplot} instructions used to visualize the data. We use the `left_join` function from the {dplyr} package to merge the population data set `pop.CH.2019` with the shapefile `shapef` by region. Since variable names differ in both data sets, we specify `by = c('NUTS_ID' = 'Region')` to indicate their correspondence. ```r # Filter data for the year 2019, total age group, unisex data pop.CH.2019 <- `Pxt.raw` %>% dplyr::filter(Time == `2019`, Age == `'TOTAL'`, Sex == `'T'`) # Join population data to shapefile (consult ?left_join) shapef.pop <- shapef %>% left_join(`pop.CH.2019`, by = c(`'NUTS_ID'` = `'Region'`)) # Plot ggplot(`shapef.pop`) + geom_sf(aes(fill = `Pop`, geometry = `geometry`), col = 'gray70') + scale_fill_viridis_c(option = 'rocket', direction = -1) + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Zurich, Bern, and Vaud are the three most populated Swiss cantons, in that order. Below are the {ggplot} instructions used to visualize the data. We use the `left_join` function from the {dplyr} package to merge the population data set `pop.CH.2019` with the shapefile `shapef` by region. Since variable names differ in both data sets, we specify `by = c('NUTS_ID' = 'Region')` to indicate their correspondence. <img src="Tutorial-3_files/figure-html/unnamed-chunk-27-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .pull-left[ ```r Pxt.CH <- Pxt.raw %>% dplyr::filter(Age == 'TOTAL', Sex == 'T') %>% mutate(Time = as.integer( as.character(Time))) ``` ] .pull-right[ We extract data from the `Pxt.raw` data set related to the total age group (`Age == 'TOTAL'`) and unisex population (`Sex == 'T'`) using the `filter` function from {dplyr}. Additionally, we convert the `Time` column from its original factor to an integer type. ] .pull-left[ ```r Extw.CH <- Pxt.CH %>% group_by(Region) %>% reframe(Time, 'Expo' = c((Pop[-length(Time)] + Pop[-1])/2, Pop[length(Time)])/52.18) %>% dplyr::filter(Time <= 2019, Time >= 2000) ``` ] .pull-right[ We define the annual exposure in each year `\(t\)`, region `\(r\)` as a mid-year population estimate and assume a constant weekly exposure within that year: `\begin{align*} E_{t}^{(r)} &= \dfrac{P_t^{(r)} + P_{t+1}^{(r)}}{2} \\ E_{t,w}^{(r)} &= \dfrac{E_t^{(r)}}{52.18}, \end{align*}` with `\(52.18\)` the average number of weeks in a year. ] .pull-left[ ```r Df <- dxtw.CH %>% left_join(Extw.CH, by = c('ISOYear' = 'Time', 'Region' = 'Region')) ``` ] .pull-right[ We merge the death counts `dxtw.CH` with the exposure estimates `Extw.CH` based on the year and NUTS 3 region using the `left_join` function from {dplyr}. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Calculate the weekly mortality rates `\(q_{t,w}^{(r)}\)` using the exposure estimates for the canton Zurich. Visualize these as a function of time. <br> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We first filter the mortality data set `Df` (with regional, weekly death counts and exposures) to focus only on data from the canton of Zurich (CH040). Then, we compute the mortality rates `\(q_{t,w}^{(r)}\)` as `\(1 - \exp(-d_{t,w}^{(r)}/E_{t,w}^{(r)})\)`. Lastly, we visualize the results using {ggplot}. ```r # Filter on canton Zurich (CH040) Df.ZH <- Df %>% dplyr::filter(Region == 'CH040') # Construct mortality rates Df.ZH$qtw <- 1 - exp(-Df.ZH$Deaths/Df.ZH$Expo) # Plot ggplot(Df.ZH) + geom_line(aes(x = Date, y = qtw), col = RCLRbg) + ylab(bquote(q['t,w'])) + ggtitle('Zurich') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We first filter the mortality data set `Df` (with regional, weekly death counts and exposures) to focus only on data from the canton of Zurich (CH040). Then, we compute the mortality rates `\(q_{t,w}^{(r)}\)` as `\(1 - \exp(-d_{t,w}^{(r)}/E_{t,w}^{(r)})\)`. Lastly, we visualize the results using {ggplot}. <img src="Tutorial-3_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle name: weatherdata # Download and import weather data <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Download and import weather data In the R script [eobs_data.R](https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/raw/scripts/eobs_data.R), we do a live download from the Copernicus Climate Data Store to retrieve daily weather factors throughout the years 2000-2019 (WARNING! - high computational time). These weather factors include the daily levels of the maximum temperature (`tx`), average temperature (`tg`), minimum temperature (`tn`), relative humidity (`hu`), total precipitation amount (`rr`), and average wind speed (`fg`). The resulting data sets are stored for you in the GitHub repository. 
Let's read them into our environment: ```r tn_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_tn_daily.rds') tg_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_tg_daily.rds') tx_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_tx_daily.rds') fg_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_fg_daily.rds') hu_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_hu_daily.rds') rr_NUTS3_daily <- readRDS('../data/eobs/CH_NUTS3_rr_daily.rds') ``` We examine the structure of one of these datasets: ```r head(tx_NUTS3_daily) ``` --- class: clear <br> <br> <table class="table" style="font-size: 25px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Date </th> <th style="text-align:right;"> ISOYear </th> <th style="text-align:right;"> ISOWeek </th> <th style="text-align:left;"> Region </th> <th style="text-align:right;"> tx </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 2000-01-01 </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 1.865263 </td> </tr> <tr> <td style="text-align:left;"> 2000-01-02 </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 52 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 3.497631 </td> </tr> <tr> <td style="text-align:left;"> 2000-01-03 </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 3.433158 </td> </tr> <tr> <td style="text-align:left;"> 2000-01-04 </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 4.652895 </td> </tr> <tr> <td style="text-align:left;"> 2000-01-05 </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 5.644210 
</td> </tr> <tr> <td style="text-align:left;"> 2000-01-06 </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> CH011 </td> <td style="text-align:right;"> 4.355263 </td> </tr> </tbody> </table> --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Illustrate the evolution of the weather factors in the NUTS 3 region of Lausanne (Vaud) as a function of time using {ggplot}. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We provide the {ggplot} instructions to visualize the evolution of the minimum temperature in the NUTS 3 region of Lausanne (CH011). We start by selecting the relevant region (CH011). Then, we filter the daily minimum temperature data set `tn_NUTS3_daily` for this region. Finally, we create the plot. ```r region <- 'CH011' # canton Vaud tn_VD_daily <- tn_NUTS3_daily %>% dplyr::filter(Region == region) ggplot(tn_VD_daily) + geom_line(aes(x = Date, y = tn), colour = RCLRbg) + xlab('Date') + ylab('Minimum temperature') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Minimum temperature. <img src="Tutorial-3_files/figure-html/unnamed-chunk-37-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Average temperature. <img src="Tutorial-3_files/figure-html/unnamed-chunk-38-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Maximum temperature. <img src="Tutorial-3_files/figure-html/unnamed-chunk-39-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Wind speed. <img src="Tutorial-3_files/figure-html/unnamed-chunk-40-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Humidity. <img src="Tutorial-3_files/figure-html/unnamed-chunk-41-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Rainfall. <img src="Tutorial-3_files/figure-html/unnamed-chunk-42-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Visualize the region-specific maximum temperature on the map of Switzerland on August 1, 2003, and the rainfall on January 1, 2000. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: To visualize the maximum temperature data on August 1, 2003, on the map of Switzerland, we first filter the data set `tx_NUTS3_daily` for the specified date. Then, we join this data with the shapefile `shapef` based on the region. Lastly, we plot the results using `geom_sf` from {ggplot2} and apply the viridis color palette. ```r tx.shp <- tx_NUTS3_daily %>% dplyr::filter(Date == '2003-08-01') %>% left_join(shapef, by = c('Region' = 'NUTS_ID')) ggplot(tx.shp) + geom_sf(aes(geometry = geometry, fill = tx)) + scale_fill_viridis_c(option = 'rocket', direction = -1) + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: To visualize the maximum temperature data on August 1, 2003, on the map of Switzerland, we first filter the data set `tx_NUTS3_daily` for the specified date. Then, we join this data with the shapefile `shapef` based on the region. Lastly, we plot the results using `geom_sf` from {ggplot2} and apply the viridis color palette. <img src="Tutorial-3_files/figure-html/unnamed-chunk-44-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We repeat the procedure for the rainfall on January 1, 2000, but use a different colour palette. <img src="Tutorial-3_files/figure-html/unnamed-chunk-45-1.svg" style="display: block; margin: auto;" /> ] --- # From daily to weekly weather data .pull-left[ ```r Df.weather <- plyr::join_all(list(tn_NUTS3_daily, tg_NUTS3_daily, tx_NUTS3_daily, fg_NUTS3_daily, hu_NUTS3_daily, rr_NUTS3_daily), by = c('Date', 'ISOYear', 'ISOWeek', 'Region')) ``` ] .pull-right[ We merge all the daily weather data sets (`tn_NUTS3_daily`, `tg_NUTS3_daily`, `tx_NUTS3_daily`, `fg_NUTS3_daily`, `hu_NUTS3_daily`, `rr_NUTS3_daily`) into one large data frame `Df.weather` using the `join_all` function from the {plyr} package. The merging is done based on common columns: `Date`, `ISOYear`, `ISOWeek`, and `Region`. ] .pull-left[ ```r Df.weather <- Df.weather %>% group_by(ISOYear, ISOWeek, Region) %>% reframe('tn' = mean(tn, na.rm = TRUE), 'tg' = mean(tg, na.rm = TRUE), 'tx' = mean(tx, na.rm = TRUE), 'hu' = mean(hu, na.rm = TRUE), 'fg' = mean(fg, na.rm = TRUE), 'rr' = mean(rr, na.rm = TRUE)) ``` ] .pull-right[ To align with the weekly time scale of the mortality statistics, we convert the daily weather data in `Df.weather` to weekly data. This is done by grouping the data by `ISOYear`, `ISOWeek`, and `Region`, and then calculating the weekly average for each weather variable (`tn`, `tg`, `tx`, `hu`, `fg`, `rr`). We exclude any missing daily values in the averaging process using the argument `na.rm = TRUE`. ] .pull-left[ ```r Df <- Df %>% left_join(Df.weather) ``` ] .pull-right[ We join the mortality data set `Df` with the weather data set `Df.weather` based on their common columns (the default).
] --- class: inverse, center, middle name: mortalitybaseline # Weekly mortality baseline model <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Model specifications To model the weekly mortality baseline trend, we account for the seasonality observed in the region-specific weekly death counts by means of sine-cosine Fourier terms, defined as: $$ \left[ \sin\left(2\pi \dfrac{n}{T} x\right), \cos\left(2\pi \dfrac{n}{T}x\right)\right]_{n\in\mathbb{N}},$$ with `\(n\)` the frequency or number of cycles and `\(T\)` the period of the sinusoidal terms. We visualize the Fourier terms with a frequency of `\(n=1\)` and a period of 52.18 (annual cycle). <br> <img src="Tutorial-3_files/figure-html/unnamed-chunk-50-1.svg" style="display: block; margin: auto;" /> --- class: clear We use the Serfling model specification to construct a baseline mortality model for the weekly death counts `\(D_{t,w}^{(r)}\)` in a specific region `\(r\)`. This model assumes that the weekly number of deaths follows a Poisson distribution, with mean equal to the weekly exposure `\(E_{t,w}^{(r)}\)` times the weekly force of mortality `\(\mu_{t,w}^{(r)}\)`: `\begin{align*} D_{t,w}^{(r)} &\sim \text{Poisson}\left(E_{t,w}^{(r)}\: \mu_{t,w}^{(r)}\right). \end{align*}` Specifically: `\begin{align*} \log \mu_{t,w}^{(r)} &= \: \beta_0^{(r)} + \beta_1^{(r)} t + \beta_2^{(r)} \sin\left(\frac{2\pi w}{52.18}\right) + \beta_3^{(r)} \cos\left(\frac{2\pi w}{52.18}\right) + \:\:\:\beta_4^{(r)} \sin\left(\frac{2\pi w}{26.09}\right) + \beta_5^{(r)} \cos\left(\frac{2\pi w}{26.09}\right), \end{align*}` which includes an intercept, an annual time trend, and Fourier terms to capture both annual and semi-annual frequencies. You can determine the number of seasonal terms in each region using AIC or BIC (not covered in this tutorial).
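--- class: clear

To make the Serfling specification concrete, here is a minimal simulation sketch. All coefficient values and the exposure below are hypothetical choices for illustration only (not calibrated estimates), and the time trend `\(\beta_1^{(r)} t\)` is omitted since we consider a single year:

```r
# Minimal sketch: evaluate the Serfling linear predictor over the weeks of
# one year and simulate Poisson death counts.
# NOTE: beta values and the exposure Etw are hypothetical, for illustration.
set.seed(1)
weeks <- 1:52
beta  <- c(int = -7.5, s52 = 0.08, c52 = 0.12, s26 = 0.02, c26 = 0.03)
Etw   <- 10000   # assumed constant weekly exposure

# Linear predictor: intercept + annual and semi-annual Fourier terms
log.mu <- beta[['int']] +
  beta[['s52']] * sin(2*pi*weeks/52.18) + beta[['c52']] * cos(2*pi*weeks/52.18) +
  beta[['s26']] * sin(4*pi*weeks/52.18) + beta[['c26']] * cos(4*pi*weeks/52.18)

# Weekly death counts: D ~ Poisson(E * mu)
D <- rpois(length(weeks), lambda = Etw * exp(log.mu))
```

Calibration then amounts to estimating the `\(\beta\)` coefficients with a Poisson GLM, using the log-exposure as offset.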
We compute these Fourier terms and add them to our mortality dataset: ```r Df <- Df %>% mutate('fsin52' = sin(2*pi*ISOWeek/52.1775), 'fcos52' = cos(2*pi*ISOWeek/52.1775), 'fsin26' = sin(4*pi*ISOWeek/52.1775), 'fcos26' = cos(4*pi*ISOWeek/52.1775)) ``` --- # Model calibration .pull-left[ ```r formula <- Deaths ~ ISOYear + fsin52 + fcos52 + fsin26 + fcos26 ``` ] .pull-right[ We construct the formula specification that links the response variable `Deaths` to the covariates `ISOYear` and the Fourier terms `fsin52`, `fcos52`, `fsin26`, and `fcos26`. ] .pull-left[ ```r for(r in shapef$NUTS_ID){ Dfr <- Df %>% dplyr::filter(Region == r) fit.r <- glm(formula, data = Dfr, offset = log(Dfr$Expo), family = poisson(link = 'log')) Df[which(Df$Region == r),'bDeaths'] <- fit.r$fitted.values } ``` ] .pull-right[ For each Swiss canton `\(r\)`, we filter the mortality data set to include only data for that region, saving it as `Dfr`. We then fit a Generalized Linear Model (GLM) on `Dfr`, using the logarithm of the exposure variable, `log(Dfr$Expo)`, as the `offset`. We specify the Poisson distribution with a log-link using the `family` argument. Finally, we add the estimated deaths to the mortality data set `Df` in a new column `bDeaths`. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: Calculate and illustrate the estimated baseline and observed weekly mortality rates for the cantons of Vaud and Sankt Gallen. ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We start with the canton of Vaud (CH011). First, we filter the mortality data set `Df` on data for Vaud. Then we compute the observed and estimated weekly mortality rates and plot the results. ```r # Filter on canton Vaud (CH011) Df.VD <- Df %>% dplyr::filter(Region == 'CH011') # Observed and estimated mortality rates Df.VD$qtw.obs <- 1 - exp(-Df.VD$Deaths/Df.VD$Expo) Df.VD$qtw.est <- 1 - exp(-Df.VD$bDeaths/Df.VD$Expo) # Plot ggplot(Df.VD) + geom_line(aes(x = Date, y = qtw.obs), col = 'gray80') + geom_line(aes(x = Date, y = qtw.est), col = RCLRbg, linewidth = 0.8) + ylab(bquote(q['t,w'])) + ggtitle('Vaud') + theme_bw(base_size = 15) ``` ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We start with the canton of Vaud (CH011). First, we filter the mortality data set `Df` on data for the Region of Vaud. Then we compute the observed and estimated weekly mortality rates and plot the results. <img src="Tutorial-3_files/figure-html/unnamed-chunk-55-1.svg" style="display: block; margin: auto;" /> ] --- class: clear .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We repeat the procedure for Sankt Gallen (CH055). <img src="Tutorial-3_files/figure-html/unnamed-chunk-56-1.svg" style="display: block; margin: auto;" /> ] --- class: inverse, center, middle name: mortalitydeviations # Modelling deviations from the mortality baseline <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # Model specifications We associate deviations from the weekly mortality baseline model with region-specific weather features. We work with the following model assumptions: `\begin{align*} D_{t,w}^{(r)} &\sim \text{Poisson}\left(\hat{b}^{(r)}_{t,w} \: \phi_{t,w}^{(r)} \right) \\ \phi_{t,w}^{(r)} &= f\left(\boldsymbol{c}_{t,w}^{(r)},\: l^1\left(\boldsymbol{c}_{t,w}^{(r)}\right), \: l^2\left(\boldsymbol{c}_{t,w}^{(r)}\right), ..., \:l^s\left(\boldsymbol{c}_{t,w}^{(r)}\right)\right). \end{align*}` Here: - `\(D_{t,w}^{(r)}\)` represents the observed deaths in region `\(r\)` during week `\(w\)` of year `\(t\)` - `\(\hat{b}^{(r)}_{t,w}\)` denotes the estimated baseline deaths based on the Serfling model in region `\(r\)`, week `\(w\)` and year `\(t\)` - `\(\boldsymbol{c}_{t,w}^{(r)}\)` denotes the set of weather features - `\(l^j(\boldsymbol{c}_{t,w}^{(r)})\)` denotes the set of weather features, lagged by `\(j\)` weeks - `\(f(\cdot)\)` is a machine learning model that can capture possibly non-linear associations between target and inputs, as well as interaction effects between the input features. In this tutorial we work with an XGBoost model. --- # Feature construction .pull-left[ ```r vars <- c('tn', 'tg', 'tx', 'hu', 'fg', 'rr') ``` ] .pull-right[ We select the weather features of interest and save them in the variable `vars`.
] .pull-left[ ```r list.lagdf <- list() for (v in vars){ list.lagdf[[which(vars == v)]] <- Df %>% group_by(Region) %>% reframe(Date, !!paste0(v,'_l1') := lag(!!sym(v), 1), !!paste0(v,'_l2') := lag(!!sym(v), 2)) } df.lag <- plyr::join_all(list.lagdf, by = c('Region', 'Date')) ``` ```r Df <- Df %>% left_join(df.lag, by = c('Region', 'Date')) %>% na.omit() ``` ] .pull-right[ For each feature in `vars`, we group the data frame `Df` (containing mortality and weather data) by `Region`. We then use the `reframe` function from {dplyr} to create new columns with one-week and two-week lagged values for each weather feature by applying the `lag` function from {dplyr}. These lagged data frames are stored in the list `list.lagdf`. Lastly, we merge all the lagged data frames into a single data frame `df.lag` using the `join_all` function from {plyr}. Note the use of `!!` (injection operator) to force an early evaluation of the `paste0` expression (check `help("!!")`). We merge the lagged weather features with the main data frame `Df` based on the columns `Region` and `Date` and remove missing observations using `na.omit`. ] --- class: clear .pull-left[ ```r vars.l1 <- paste0(vars, '_l1') vars.l2 <- paste0(vars, '_l2') vars.xgb <- c(vars, vars.l1, vars.l2) ``` ] .pull-right[ We now create the input feature set to calibrate an XGBoost model. This set `vars.xgb` stores the original weather features and their lagged values (one-week and two-week lags). ] .pull-left[ ```r xgb.Df <- xgb.DMatrix( data = as.matrix(Df %>% select(all_of(vars.xgb))), label = as.matrix(Df %>% select(Deaths)) ) ``` ] .pull-right[ We convert the data frame `Df` into an `xgb.DMatrix` object, which is required for the XGBoost model implementation in the package {xgboost}. The `data` argument includes the input features, and the `label` argument includes the response variable, which is the number of observed deaths.
] .pull-left[ ```r setinfo(xgb.Df, "base_margin", log(Df$bDeaths)) ## [1] TRUE ``` ] .pull-right[ We set the logarithm of the baseline number of deaths as an offset using the `setinfo` function from the {xgboost} package. In this way, the model explains deviations from the mortality baseline using the selected weather features. ] --- # Model calibration .pull-left[ ```r xgbcv <- xgb.cv( params = list(eta = 0.01, max_depth = 5, base_score = 0, objective = 'count:poisson'), data = xgb.Df, nfold = 10, early_stopping_rounds = 50, nrounds = 1000) ``` ```r xgbfit <- xgb.train( params = list(eta = 0.01, max_depth = 5, base_score = 0, objective = 'count:poisson'), data = xgb.Df, nrounds = xgbcv$best_iteration) ``` ] .pull-right[ Using the `xgb.cv` function from the {xgboost} package, we perform 10-fold cross-validation (`nfold = 10`) to tune the number of boosting iterations. To limit computational time, we fix the learning rate `eta` at 0.01 and the maximum tree depth `max_depth` at 5. The initial prediction score `base_score` is set to 0 here, because we already include the baseline number of deaths as the initial prediction (via the offset). We use Poisson regression by setting the `objective` to `'count:poisson'`. The `early_stopping_rounds` argument indicates that training will stop if performance does not improve for 50 rounds. The maximum number of boosting rounds is set to 1000 with `nrounds`. We fit the XGBoost model on the entire training data `xgb.Df` using the optimal number of boosting rounds `xgbcv$best_iteration`. ] --- class: inverse, center, middle name: results # Results <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">Q</span>**: We retrieve the estimated deaths from the calibrated machine learning model using the `predict` function. We add these to our mortality data set `Df` as a new column `xgbDeaths`. ```r Df$xgbDeaths <- predict(object = xgbfit, newdata = xgb.Df) ``` You can now visualize the observed deaths (in gray) alongside the deaths estimated by the baseline model (in blue) and the machine learning model (in red pink) for the cantons of Vaud and Zurich. Complete the following {ggplot} code: ```r # Filter on the Region of Vaud Df.VD <- Df %>% ... # Plot ggplot(...) + geom_line(aes(x = ..., y = ..., col = 'Observed')) + geom_line(aes(x = ..., y = ..., col = 'XGBoost')) + geom_line(aes(x = ..., y = ..., col = 'Baseline')) + scale_colour_manual(values = c('gray80', RCLRbg, red_pink), breaks = c('Observed', 'Baseline', 'XGBoost'), name = '') + xlab('Time') + ylab('Deaths') + theme_bw(base_size = 15) + ggtitle('Vaud') ``` ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We extract the mortality data set restricted to the region Vaud (`CH011`) using the `filter` function from the {dplyr} package. We plot the observed deaths (`Deaths`) in gray, the deaths estimated by the baseline model (`bDeaths`) in blue, and the deaths estimated by the machine learning model (`xgbDeaths`) in red pink. ```r # Filter on the Region of Vaud Df.VD <- Df %>% `dplyr::filter(Region == 'CH011')` # Plot ggplot(`Df.VD`) + geom_line(aes(x = `Date`, y = `Deaths`, col = 'Observed')) + geom_line(aes(x = `Date`, y = `xgbDeaths`, col = 'XGBoost')) + geom_line(aes(x = `Date`, y = `bDeaths`, col = 'Baseline')) + scale_colour_manual(values = c('gray80', RCLRbg, red_pink), breaks = c('Observed', 'Baseline', 'XGBoost'), name = '') + xlab('Time') + ylab('Deaths') + theme_bw(base_size = 15) + ggtitle('Vaud') ``` ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: <img src="Tutorial-3_files/figure-html/unnamed-chunk-69-1.svg" style="display: block; margin: auto;" /> ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We repeat this procedure for the canton of Zurich (CH040). <img src="Tutorial-3_files/figure-html/unnamed-chunk-70-1.svg" style="display: block; margin: auto;" /> ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: We zoom in on the years 2002-2004 and visualize the minimum, average and maximum temperature in Vaud. <img src="Tutorial-3_files/figure-html/unnamed-chunk-71-1.svg" style="display: block; margin: auto;" /> ] --- # In-sample fit .left-column[ <br>
## Your turn ] .right-column[ **<span style="color:red">A</span>**: Zoom in on the years 2001-2005: [European summer heat wave of 2003](https://idlcc.fc.ul.pt/pdf/Garcia_Herrera_heatwave_2010.pdf) <img src="Tutorial-3_files/figure-html/unnamed-chunk-72-1.svg" style="display: block; margin: auto;" /> ] --- # Thanks! <img src="img/xaringan.png" class="title-hex" width="50" height="50" align="right"> <br> <br> <br> <br> Slides created with the R package [xaringan](https://github.com/yihui/xaringan). <br> <br> <br> Course material available via <br> <img align="left" src="img/gitfront.png" width="35" hspace="10"/> https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/ <style type="text/css"> .pull-right ~ * { clear: unset; } .pull-right + * { clear: both; } </style> <style type="text/css"> .inverse .remark-slide-number { display: none; } </style>