class: center, middle, inverse, title-slide, clear # Modelling and quantifying mortality and longevity risk ## Prework <html> <div style="float:left"></div> <hr align='center' color='#106EB6' size=1px width=97%> </html> ### Katrien Antonio, Torsten Kleinow, Frank van Berkum, Michel Vellekoop and Jens Robben ### <a href="https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/">35th International Summer School of the Swiss Association of Actuaries </a> | June 3-7, 2024 ### <a> </a> --- name: introduction # Introduction ### Workshop <img align="left" src="img/gitfront.png" width="35" hspace="10"/> https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/ The workshop's landing page on GitFront, where we upload slide decks, R code and data sets. ### Us <img align="left" src="img/Katrien.jpg" width="160"/ hspace="0"> <img align="left" src="img/Michel.jpg" width="160" hspace="45"/> <img align="left" src="img/Torsten.jpg" width="160" hspace="0"/> <img align="left" src="img/Frank.jpg" width="160" hspace="45"/> <img align="left" src="img/Jens.png" width="160" hspace="0"/> <img align="left" src="img/Empty.png" width="100" height="160" hspace="0"/> [Katrien Antonio](https://katrienantonio.github.io) [
](mailto:katrien.antonio@kuleuven.be) [Michel Vellekoop](https://www.uva.nl/profiel/v/e/m.h.vellekoop/m.h.vellekoop.html) [
](mailto:m.h.vellekoop@uva.nl) [Torsten Kleinow](https://www.uva.nl/profiel/k/l/t.kleinow/t.kleinow.html) [
](mailto:t.kleinow@uva.nl) [Frank van Berkum](https://www.uva.nl/profiel/b/e/f.vanberkum/f.vanberkum.html) [
](mailto:F.vanBerkum@uva.nl) [Jens Robben](https://jensrobben.github.io) [
](mailto:jens.robben@kuleuven.be) --- name: checklist # Checklist
Do you have a fairly recent version of R on your machine (see [here](https://cran.r-project.org/bin/windows/base/))? ```r version$version.string ```
Do you have a fairly recent version of the RStudio IDE on your machine (see [here](https://posit.co/downloads/))? ```r RStudio.Version()$version ## Requires an interactive session but should return something like "[1] ‘1.2.5001’" ```
Do you have the R packages listed in the software requirements installed on your machine?
(Alternatively) Do you have an account on [posit cloud](https://login.posit.cloud/) and access to the workspace dedicated to the Summer School, if you prefer to run the tutorial instructions in the cloud? --- class: inverse, center, middle name: universe # What's out there - the R universe <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # What is R? > <font size="+2"> <p align="justify">The R environment is an integrated suite of software facilities for data manipulation, calculation and graphical display.</p></font> </br> A brief history: - R is a dialect of the S language. - R was written by .hi-RCLRbg[R]obert Gentleman and .hi-RCLRbg[R]oss Ihaka in 1992. - The R source code was first released in 1995. - In 1998, the Comprehensive R Archive Network [CRAN](http://CRAN.R-project.org/) was established. - The first official release, R version 1.0.0, dates to 2000-02-29. Currently R 4.4.0 (April, 2024). - R is open source via the [GNU General Public License](https://en.wikipedia.org/wiki/GNU_General_Public_License). --- # Explore the R architecture - R is like a car's engine (source: [https://moderndive.com/1-getting-started.html](https://moderndive.com/1-getting-started.html)) - RStudio is like a car's dashboard, an integrated development environment (IDE) for R. R: Engine | RStudio: Dashboard :-------------------------:|:-------------------------: <img src="img/engine.jpg" alt="Drawing" style="height: 300px;"/> | <img src="img/dashboard.jpg" alt="Drawing" style="height: 300px;"/> --- # How do I code in R? Keep in mind: - unlike other software like Excel, STATA, or SAS, R is an interpreted language - no point and click in R! - .hi-RCLRbg[you have to program in R]! R .hi-RCLRbg[packages] extend the functionality of R by providing additional functions, and can be downloaded for free from the internet. R: A new phone | R Packages: Apps you can download :-------------------------:|:-------------------------: <img src="img/iphone.jpg" alt="Drawing" style="height: 150px;"/> | <img src="img/apps.jpg" alt="Drawing" style="height: 150px;"/> --- # How to install and load an R package? .pull-left[ Install the {ggplot2} package for data visualisation ```r install.packages("ggplot2") ``` Load the installed package ```r library(ggplot2) ``` And give it a try ```r head(diamonds) ggplot(diamonds, aes(clarity, fill = cut)) + geom_bar() + theme_bw() ``` Packages are developed and maintained by R users worldwide. They are shared with the R community through CRAN: now 20 770 packages online (on May 25, 2024)! ] .pull-right[ <img src="Prework_files/figure-html/try_ggplot_plot-1.svg" width="90%" style="display: block; margin: auto;" /> ] --- # Why R and RStudio? ### Data science positivism - Next to Python, R has become the *de facto* language for data science, with an extensive *machine learning toolbox*. - See: [The Popularity of Data Science Software](http://r4stats.com/articles/popularity/) - R is open-source with a very active community of users spanning academia and industry. ### Bridge to actuarial science, econometrics and other tools - R has all of the statistics and econometrics support, and is adaptable as a “glue” language to other programming languages and APIs. - Widely used in actuarial undergraduate programs. - Advantages during the Summer School: allows to use a variety of visualization tools, data handling functions, statistical and machine learning functionalities, and a toolbox for simulations. ### Disclaimer + Read more - It's also the language that we know best. - If you want to read more: [R-vs-Python](https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/), [when to use Python or R](https://www.datacamp.com/community/blog/when-to-use-python-or-r) or [Hadley Wickham on the future of R](https://qz.com/1661487/hadley-wickham-on-the-future-of-r-python-and-the-tidyverse/) --- class: clear, center, middle background-image: url("img/tidyverse2.1.png") background-size: cover background-size: 65% background-position: center --- # Welcome to the tidyverse! ><p align="justify">The .hi-RCLRbg[tidyverse] is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. </p> <center> <img src="img/tidyverse_wide.png" width="750"/> </center> More on: [tidyverse](https://www.tidyverse.org). Install the packages with `install.packages("tidyverse")`. Then run `library(tidyverse)` to load the core tidyverse. --- # Principles of tidy data Three interrelated rules from the [R for data science](https://r4ds.had.co.nz/) book by Garrett Grolemund and Hadley Wickham: 1. Each variable must have its own column. 2. Each observation must have its own row. 3. Each value must have its own cell. <center> <img src="img/tidy_data.png" width="750"/> </center> .footnote[This figure is taken from Chapter 12 on Tidy data in [R for data science](https://r4ds.had.co.nz/).] --- # Workflow of a data scientist Here is a model of the .hi-pink[tools needed in a typical data science project]: > <p align="justify"> Together, tidying and transforming are called <b>wrangling</b>, because getting your data in a form that’s natural to work with often feels like a fight! </p> > <p align="justify"> Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Models are a fundamentally <b>mathematical or computational tool</b>, so they generally scale well. But every model makes <b>assumptions</b>, and by its very nature a model cannot question its own assumptions. That means <b>a model cannot fundamentally surprise you</b>.</p> <center> <img src="img/data_science_pipeline.png" width="600"/> </center> .footnote[Figure and quote taken from Chapter 1 in [R for data science](https://r4ds.had.co.nz/).] --- class: inverse, center, middle name: wrangling # Data wrangling and visualisation <html><div style='float:left'></div><hr color='#FAFAFA' size=1px width=796px></html> --- # A tibble instead of a data.frame <img src="img/tibble.png" class="title-hex"> Within the tidyverse `tibble` is a modern take on a `data.frame`: - keep the features that have stood the test of time - drop the features that used to be convenient but are now frustrating. You can use: - `tibble()` to create a new tibble - `as_tibble()` transforms an object (e.g., a data frame) into a tibble. Quick example: explore the differences! ```r mtcars # install.packages("tidyverse") library(tidyverse) as_tibble(mtcars) ``` --- # Chains with the pipe operator <img src="img/pipe.png" class="title-hex"> In R, the pipe operator is `%>%`. It takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”; with this operator it becomes easy to chain a sequence of calculations. For example, when you have an input data and want to call functions `foo` and `bar` in sequence, you can write `data %>% foo %>% bar`. A first example: - take the `diamonds` data (from the {ggplot2} package) - then subset ```r diamonds %>% filter(cut == "Ideal") ``` Some excellent blog posts about this operator: [Pipes in R tutorial for beginners](https://www.datacamp.com/community/tutorials/pipe-r-tutorial) and [how to write this in base R](https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec). --- # Data manipulation verbs <img src="img/dplyr.png" class="title-hex"> The {dplyr} package holds many useful data manipulation verbs: - `mutate()` adds new variables that are functions of existing variables - `select()` picks variables based on their names - `filter()` picks cases based on their values - `summarise()` reduces multiple values down to a single summary - `arrange()` changes the ordering of the rows. These all combine naturally with `group_by()` which allows you to perform any operation “by group”. A first example: ```r diamonds %>% mutate(price_per_carat = price/carat) %>% filter(price_per_carat > 1500) ``` or ```r diamonds %>% group_by(cut) %>% summarize(price = mean(price), carat = mean(carat)) ``` --- name: yourturn-tidyverse class: clear .left-column[ <!-- Add icon library --> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css"> ## <i class="fa fa-edit"></i> <br> Your turn ] .right-column[ To get warmed up, let's do some .hi-RCLRbg[basic explorations] of the {tidyverse} instructions. The idea is to get some feel for these functions. .hi-pink[Q]: you will work through the following exploratory steps. 1.1. Create a data frame (with `data.frame(.)`) or tibble (with `tibble(.)`) `df` with two variables `x` and `y`. Enter some values for these variables. 1.2. Create a new variable `z` that is the sum of `x` and `y`. Use `base` R instructions and then use the pipe operator and `mutate(.)`. 2.1. Create a new data vector `v` with some entries, use `c(.)`. 2.2. Try the following instructions: ```r round(mean(x), 2) mean(x) %>% round(2) x %>% mean %>% round(2) ``` ] --- class: clear .pull-left[ First, you put together the `data.frame` ```r df <- data.frame(x = c(0, 1), y = c(0, 1)) df ``` or the `tibble` ```r df <- tibble(x = c(0, 1), y = c(0, 1)) df ``` Next, you create a new variable ```r df$z <- df$x + df$y df ``` or with `mutate(.)` ```r df %>% mutate(z = x+y) df ``` ] .pull-right[ You create a vector `x` with some entries ```r x <- c(0.157, 0.135, 0.359) ``` and then you evaluate ```r round(mean(x), 2) mean(x) %>% round(2) x %>% mean %>% round(2) ``` These implementations all lead to the same result: ``` ## [1] 0.22 ``` Which one do you find most intuitive? ] --- # Plots with ggplot2 <img src="img/ggplot2.png" class="title-hex"> The aim of the {ggplot2} package is to create elegant data visualisations using the .hi-pink[grammar of graphics]. Here are the basic steps: - begin a plot with the function `ggplot()` creating a coordinate system that you can add layers to - the first argument of `ggplot()` is the dataset to use in the graph A first example ```r library(ggplot2) ggplot(data = mpg) ggplot(mpg) ``` creates an empty graph. You will now add layers to this graph! --- # Plots with ggplot2 <img src="img/ggplot2.png" class="title-hex"> You complete your graph by adding one or more .hi-pink[layers] to `ggplot()`. For example: - `geom_point()` adds a layer of points to your plot, which creates a scatterplot - `geom_smooth()` adds a smooth line - `geom_bar` a bar plot and many more, see [ggplot2 documentation](https://ggplot2.tidyverse.org/). Each `geom` function in `ggplot2` takes an aesthetic mapping argument: - maps variables in your dataset to visual properties - always paired with `aes()` and the `\(x\)` and `\(y\)` arguments of `aes()` specify which variables to map to the `\(x\)` and `\(y\)` axes. --- class: clear .pull-left[ ```r library(ggplot2) ggplot(mpg, `aes(displ, hwy, colour = class)`) + `geom_point()` + `theme_bw()` ``` <img src="Prework_files/figure-html/unnamed-chunk-17-1.svg" width="90%" style="display: block; margin: auto;" /> ] .pull-right[ Extend the empty graph now with (here: global) aesthetic mapping argument `aes(displ, hwy, colour = class)`. This implies: `displ` on the x-axis, `hwy` on the y-axis and `class` to differentiate the color of the plotting symbol. With `geom_point` you add a layer of points to the empty graph. `theme_bw()` changes the `ggtheme` to a simple black-and-white theme. ] --- # What else is there? Recall ><p align="justify">The .hi-RCLRbg[tidyverse] is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. </p> There are .hi-RCLRbg[(multiple) alternative ways] to do what the packages and functions in the tidyverse do. For instance: - base R - the {data.table} package You can read more about comparisons on e.g. [how to write this in base R](https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec) or [Base R, the tidyverse, and data.table: a comparison of R dialects to wrangle your data](https://wetlandscapes.com/blog/a-comparison-of-r-dialects/). --- # Thanks! <img src="img/xaringan.png" class="title-hex" width="50" height="50" align="right"> <br> <br> <br> <br> Slides created with the R package [xaringan](https://github.com/yihui/xaringan). <br> <br> <br> Course material available via <br> <img align="left" src="img/gitfront.png" width="35" hspace="10"/> https://gitfront.io/r/jensrobben/mUZ8CcDz5x1L/Lausanne-summer-school/