survutils/README.Rmd at master · tinyheero/survutils · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
---
output:
  github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/"
)
```

# survutils

[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/survutils)](https://cran.r-project.org/package=survutils)
[![Travis-CI Build Status](https://travis-ci.org/tinyheero/survutils.svg?branch=master)](https://travis-ci.org/tinyheero/survutils)
[![Downloads per month](http://cranlogs.r-pkg.org/badges/survutils)](https://cran.rstudio.com/web/packages/survutils)
[![Total downloads](https://cranlogs.r-pkg.org/badges/grand-total/survutils)](https://cran.rstudio.com/web/packages/survutils)
[![Anaconda-Server Badge](https://anaconda.org/fongchun/r-survutils/badges/version.svg)](https://anaconda.org/fongchun/r-survutils)
[![Say Thanks!](https://img.shields.io/badge/Say%20Thanks-!-1EAEDB.svg)](https://saythanks.io/to/tinyheero)


This package uses [functional programming principles](http://adv-r.had.co.nz/Functional-programming.html) to iteratively run Cox regression and plot its results. The results are reported in [tidy data frames](http://vita.had.co.nz/papers/tidy-data.pdf). Additional utility functions are available for working with other aspects of survival analysis such as survival curves, C-statistics, etc. It has the following features (grouped by major topics):

Cox Regression

* `get_cox_res`: Run univariate or multivariate cox regression.
* `iter_get_cox_res`: Wrapper over `get_cox_res` to faciliate ease of multiple `get_cox_res` runs. Internally, this makes use of `purrr:map` to iterate over a list of features.
* `plot_cox_res`: Generates a forest plot of the univariate or multivariate cox regression results from `get_cox_res`.

Kaplan Meier Estimates/Curves

* `get_surv_prob`: Calculates the survival probability at specified times from a survival curve.
* `get_nrisk_tbl`: Provides a number at risk table as [typically seen in publications](https://mcfromnz.wordpress.com/2011/11/06/kaplan-meier-survival-plot-with-at-risk-table/).
* `get_logrank_res`: Runs a log-rank test.

Other

* `get_c_stat`: Calculate C-statistics.

# How to Install

To get the released version from CRAN:

```{r, eval = FALSE}
install.packages("survutils")
```

You can also get survutils through conda:

```{bash, eval = FALSE}
conda install -c fongchun r-survutils
```

To install the latest developmental version from github:

```{r, eval = FALSE}
devtools::install_github("tinyheero/survutils")
```

# Cox Regression

`survutils` provides a nice wrapper function `get_cox_res` that allows you to quickly run an univariate or multivariate cox regression on a set of data. The input data is a data.frame for instance (taking the colon dataset from the `survival` R package as the example):

```{r, message = FALSE}
library("survival")
library("knitr")
library("survutils")
library("dplyr")

head(colon) %>%
	select(age, obstruct, time, status, rx) %>%
	kable()
```

The relevant columns are:

* `age` and `obstruct`: These are the features we want to regress on.
* `time`: Time to event.
* `status`: Event status (1 for event; 0 for non-event).
* `rx`: Different treatment groups.

Then to run `get_cox_res`:

```{r}
endpoint <- "time"
endpoint.code <- "status"

features <- c("age", "obstruct")
cox.res.df <- get_cox_res(colon, endpoint, endpoint.code, features)
kable(cox.res.df)
```

This runs a multivariate cox regression on the entire set of data. We can plot the results using `plot_cox_res`:

```{r, get_cox_res_example}
plot_cox_res(cox.res.df)
```

This gives us a forest plot with the hazard ratio and confidence evidence for each feature. If we are interested in running cox regression within each treatment group, we can make use of the `group` parameter.

```{r}
group <- "rx"
cox.res.df <- get_cox_res(colon, endpoint, endpoint.code, features, group)
kable(cox.res.df)
```

Notice how the output data.frame now has cox regression results for each treatment group (i.e. Obs, Lev, Lev+5FU). We can use the `plot_cox_res` function again and pass in a `facet.formula` to plot these results very easily:

```{r, get_cox_res_group_example}
plot_cox_res(cox.res.df,
             facet.formula = ". ~ group")
```

This will facet the groups (per column) so that we can visualize the cox regression results for each treatment group. The formula is the format for `ggplot2::facet_grid` with the full [documentation listed here](http://docs.ggplot2.org/current/facet_grid.html). In short, the left hand side of the formula indicates what you want to facet by row. The right hand side of the formula indicates what you want to facet by column. By specifically `. ~ group`, we are indicating we do not want to facet by row (this is indicated by the `.`) and we want to facet the `group` variable by column.

We could have facetted by row too very easily:

```{r, get_cox_res_group_example_facet_row}
plot_cox_res(cox.res.df,
             facet.formula = "group ~ .")
```

There are also other options (see `?plot_cox_res` for full options) such as the ability to add colors:

```{r, get_cox_res_group_colors_example, message = FALSE}
cox.res.df %>%
  mutate(sig_flag = p.value < 0.05) %>%
  plot_cox_res(facet.formula = ". ~ group", color.col = "sig_flag")
```

# Running Cox Regression Multiple Times

One useful function is the `iter_get_cox_res` which allows you to easily run the `get_cox_res` function multiple times without needing to setup a for loop yourself. This is useful in situations where you might need to perform multiple pairwise multivariate Cox regression analysis to test the independence of a novel prognostic biomarker to existing biomarkers.

The input to the `iter_get_cox_res` function is the same as `get_cox_res` with the only exception being the features parameter which takes a list of vectors. Each element in the list indicates the features you want to perform Cox regression on:

```{r}
features <- list(c("age", "obstruct"),
                 c("age"))

iter_get_cox_res.df <-
  iter_get_cox_res(colon, endpoint, endpoint.code, features)
```

The output is a data frame with a `iter_num` column indicating a separate Cox regression result from `get_cox_res`:

```{r}
kable(iter_get_cox_res.df, caption = "Iterative Cox Regression Results")
```

One could plot then the multiple Cox regression with facet by row as follows:

```{r iter-cox-res-example}
plot_cox_res(iter_get_cox_res.df,
             facet.formula = "iter_num ~ .", facet.scales = "free_y")
```

By default, all features will appear in each facet. The `facet.scales` parameter drops features on the y-axes that are not part of the specific Cox regression.

You can even combine this with the group parameter:

```{r}
iter_get_cox_res.group.df <-
  iter_get_cox_res(colon, endpoint, endpoint.code, features,
                   group = "rx")

kable(iter_get_cox_res.group.df, caption = "Iterative Cox Regression Results with Groups")
```

```{r iter-cox-res-group-example}
plot_cox_res(iter_get_cox_res.group.df,
             facet.formula = "iter_num ~ group", facet.scales = "free_y")
```

# Kaplan Meier Estimates/Curves

If you have generated a Kaplan-meier, there are several functions you can use to retrieve important statistics. For example, the `get_surv_prob` function is used for retrieving survival probability at certain times. Here is an example of how to generate survival probabilities for just the "Obs" arm at times 100, 200, and 300:

```{r}
library("dplyr")
library("survival")
library("survutils")

times <- c(100, 200, 300)

colon %>%
  filter(rx == "Obs") %>%
  survfit(Surv(time, status) ~ 1, data = .) %>%
  get_surv_prob(times)
```

Here is a small trick you can employ to get the survival probability that for both arms simultaneously:

```{r}
library("purrr")
library("reshape2")

surv.prob.res <-
  colon %>%
  split(.$rx) %>%
  map(~ survfit(Surv(time, status) ~ 1, data = .)) %>%
  map(get_surv_prob, times)

surv.prob.res.df <- as_data_frame(surv.prob.res)
colnames(surv.prob.res.df) <- names(surv.prob.res)
surv.prob.res.df <-
  surv.prob.res.df %>%
  mutate(surv_prob_time = times)

surv.prob.res.df %>%
  melt(id.vars = "surv_prob_time", value.name = "surv_prob",
       variable.name = "group") %>%
  kable()
```

You can also retrieve a number at risks table using the `get_nrisk_tbl` function. Here we will use it to get the number at risk at time 100, 200, and 300:

```{r}
survfit(Surv(time, status) ~ rx, data = colon) %>%
  get_nrisk_tbl(timeby = 100) %>%
  filter(time %in% c(100, 200, 300)) %>%
  kable()
```

# R Session

```{r}
devtools::session_info()
```