The Kaplan-Meier Estimatior in R
The Kaplan–Meier estimator, often known as the product limit estimator, is a non-parametric statistic used to estimate the survival function using lifetime data. In this post, we will see how to estimate the survival function in this method using R.
Packages & Data
For fitting a Kaplan-Meier curve, the survival
package is required -
library(survival)
For this example, I will use the addicts dataset -
remotes::install_github("lbraglia/suanselete3")
data(addicts, package = 'suanselete3')
Here, the package suanselete3
is an unofficial companion to the textbook “Survival Analysis - A Self-Learning Text” by D.G. Kleinbaum and M. Klein (3rd Ed., 2012) including all the accompanying datasets. I have loaded the addicts data set from this package.
Description of the data (event=dropped out) -
- ID – Patient ID
- CLINIC – Indicates which methadone treatment clinic the patient attended (coded 1 or 2)
- STATUS – Indicates whether the patient dropped out of the clinic (coded 1) or was censored (coded 0)
- SURVT – The time (in days) until the patient dropped out of the clinic or was censored
- PRISON – Indicates whether the patient had a prison record (coded 1) or not (coded 0)
- DOSE – A continuous variable for the patient’s maximum methadone dose (mg/day)
Data Preparation
Let’s have an idea about the data set -
head(addicts)
# id clinic status survt prison dose
# 1 1 1 1 428 0 50
# 2 2 1 1 275 1 55
# 3 3 1 1 262 0 55
# 4 4 1 1 183 0 30
# 5 5 1 1 259 1 65
# 6 6 1 1 714 0 55
Check the class of the data set imported in R -
class(addicts)
# [1] "data.frame"
It is a data frame. To work with this, firstly, we need to create a Surv
object which will be used as a response variable -
Y <- Surv(time = addicts$survt, event = addicts$status == 1)
Y
# [1] 428 275 262 183 259 714 438 796+ 892 393 161+ 836
# [13] 523 612 212 399 771 514 512 624 209 341 299 826+
# [25] 262 566+ 368 302 602+ 652 293 564+ 394 755 591 787+
# [37] 739 550 837 612 581+ 523 504 785 774 560 160 482
# [49] 518 683 147 563 646 899 857 180 452 760 496 258
# [61] 181 386 439+ 563+ 337 613+ 192 405+ 667 905+ 247 821
# [73] 821 517+ 346+ 294 244 95 376 212 96 532 522 679
# [85] 408+ 840+ 148+ 168 489 541+ 205 475+ 237 517 749 150
# [97] 465 708 713+ 146+ 450 555+ 460 53+ 122 35 532+ 684+
# [109] 769+ 591+ 769+ 609+ 932+ 932+ 587+ 26 72+ 641+ 367+ 633+
# [121] 661 232 13 563+ 969+ 1052+ 944+ 881+ 190 79 884+ 170
# [133] 286 358+ 326+ 769+ 161 564+ 268 611+ 322 1076+ 2+ 788+
# [145] 575+ 109 730+ 790+ 456+ 231 143 86+ 1021+ 684+ 878 216
# [157] 808+ 268 222+ 683+ 496+ 389 126 17 350 531+ 317+ 461+
# [169] 37 167 358 49 457 127 7 29 62 150+ 223 129+
# [181] 204+ 129 581 176 30 41 543+ 210+ 193 434 367 348
# [193] 28+ 337+ 175+ 149 546 84 283+ 533 207 216 28+ 67
# [205] 62+ 111+ 257 136 342+ 41 531+ 98+ 145 50 53+ 103+
# [217] 2+ 157 75 19 35 394+ 117 175 180 314 480+ 325+
# [229] 280 204 366 531+ 59 33 540 551+ 90 47
Here the time variable survt
has been assigned with the plus sign (+) indicating censored and not censored otherwise (based on status
variable).
Model
In the following code, Y~1 requests an intercept only model -
kmfit1 = survfit(Y~1)
kmfit1
# Call: survfit(formula = Y ~ 1)
#
# n events median 0.95LCL 0.95UCL
# 238 150 504 399 560
The output contains descriptive information on the number of records, the number at risk at time 0, the number of events, and the median estimated survival time with a 95% confidence interval.
The summary function shows the survival estimates -
summary(kmfit1)
# Call: survfit(formula = Y ~ 1)
#
# time n.risk n.event survival std.err lower 95% CI upper 95% CI
# 7 236 1 0.996 0.00423 0.9875 1.000
# 13 235 1 0.992 0.00597 0.9799 1.000
# 17 234 1 0.987 0.00729 0.9731 1.000
# 19 233 1 0.983 0.00840 0.9667 1.000
# 26 232 1 0.979 0.00937 0.9606 0.997
# 29 229 1 0.975 0.01026 0.9546 0.995
# 30 228 1 0.970 0.01107 0.9488 0.992
# 33 227 1 0.966 0.01182 0.9431 0.989
# 35 226 2 0.957 0.01317 0.9320 0.984
# 37 224 1 0.953 0.01379 0.9265 0.981
# 41 223 2 0.945 0.01493 0.9158 0.974
# 47 221 1 0.940 0.01546 0.9105 0.971
# 49 220 1 0.936 0.01597 0.9053 0.968
# 50 219 1 0.932 0.01646 0.9001 0.965
# 59 216 1 0.927 0.01694 0.8949 0.961
# 62 215 1 0.923 0.01740 0.8897 0.958
# 67 213 1 0.919 0.01785 0.8845 0.954
# 75 211 1 0.914 0.01829 0.8793 0.951
# 79 210 1 0.910 0.01871 0.8742 0.948
# 84 209 1 0.906 0.01913 0.8691 0.944
# 90 207 1 0.901 0.01953 0.8639 0.940
# 95 206 1 0.897 0.01992 0.8588 0.937
# 96 205 1 0.893 0.02029 0.8537 0.933
# 109 202 1 0.888 0.02067 0.8486 0.930
# 117 200 1 0.884 0.02104 0.8435 0.926
# 122 199 1 0.879 0.02140 0.8384 0.922
# 126 198 1 0.875 0.02174 0.8333 0.919
# 127 197 1 0.870 0.02208 0.8282 0.915
# 129 196 1 0.866 0.02241 0.8232 0.911
# 136 194 1 0.862 0.02274 0.8181 0.907
# 143 193 1 0.857 0.02305 0.8131 0.903
# 145 192 1 0.853 0.02336 0.8080 0.900
# 147 190 1 0.848 0.02366 0.8030 0.896
# 149 188 1 0.844 0.02396 0.7979 0.892
# 150 187 1 0.839 0.02426 0.7929 0.888
# 157 185 1 0.835 0.02455 0.7878 0.884
# 160 184 1 0.830 0.02483 0.7828 0.880
# 161 183 1 0.826 0.02510 0.7777 0.876
# 167 181 1 0.821 0.02538 0.7727 0.872
# 168 180 1 0.816 0.02564 0.7676 0.868
# 170 179 1 0.812 0.02590 0.7626 0.864
# 175 178 1 0.807 0.02615 0.7576 0.860
# 176 176 1 0.803 0.02640 0.7526 0.856
# 180 175 2 0.794 0.02689 0.7425 0.848
# 181 173 1 0.789 0.02712 0.7375 0.844
# 183 172 1 0.784 0.02735 0.7325 0.840
# 190 171 1 0.780 0.02757 0.7275 0.836
# 192 170 1 0.775 0.02779 0.7226 0.832
# 193 169 1 0.771 0.02800 0.7176 0.827
# 204 168 1 0.766 0.02821 0.7127 0.823
# 205 166 1 0.761 0.02841 0.7077 0.819
# 207 165 1 0.757 0.02861 0.7027 0.815
# 209 164 1 0.752 0.02881 0.6978 0.811
# 212 162 2 0.743 0.02919 0.6878 0.802
# 216 160 2 0.734 0.02955 0.6779 0.794
# 223 157 1 0.729 0.02973 0.6729 0.790
# 231 156 1 0.724 0.02991 0.6679 0.785
# 232 155 1 0.720 0.03008 0.6630 0.781
# 237 154 1 0.715 0.03024 0.6580 0.777
# 244 153 1 0.710 0.03040 0.6531 0.772
# 247 152 1 0.706 0.03056 0.6481 0.768
# 257 151 1 0.701 0.03071 0.6432 0.764
# 258 150 1 0.696 0.03086 0.6383 0.759
# 259 149 1 0.692 0.03101 0.6333 0.755
# 262 148 2 0.682 0.03128 0.6235 0.746
# 268 146 2 0.673 0.03154 0.6138 0.738
# 275 144 1 0.668 0.03167 0.6089 0.733
# 280 143 1 0.663 0.03179 0.6040 0.729
# 286 141 1 0.659 0.03191 0.5991 0.724
# 293 140 1 0.654 0.03203 0.5942 0.720
# 294 139 1 0.649 0.03214 0.5893 0.716
# 299 138 1 0.645 0.03225 0.5844 0.711
# 302 137 1 0.640 0.03236 0.5796 0.707
# 314 136 1 0.635 0.03246 0.5747 0.702
# 322 134 1 0.631 0.03256 0.5698 0.698
# 337 131 1 0.626 0.03267 0.5648 0.693
# 341 129 1 0.621 0.03277 0.5598 0.689
# 348 126 1 0.616 0.03288 0.5547 0.684
# 350 125 1 0.611 0.03298 0.5496 0.679
# 358 124 1 0.606 0.03308 0.5446 0.675
# 366 122 1 0.601 0.03318 0.5395 0.670
# 367 121 1 0.596 0.03328 0.5343 0.665
# 368 119 1 0.591 0.03338 0.5292 0.660
# 376 118 1 0.586 0.03347 0.5241 0.656
# 386 117 1 0.581 0.03355 0.5189 0.651
# 389 116 1 0.576 0.03364 0.5138 0.646
# 393 115 1 0.571 0.03371 0.5087 0.641
# 394 114 1 0.566 0.03379 0.5036 0.636
# 399 112 1 0.561 0.03386 0.4984 0.631
# 428 109 1 0.556 0.03394 0.4932 0.627
# 434 108 1 0.551 0.03401 0.4879 0.622
# 438 107 1 0.546 0.03408 0.4827 0.617
# 450 105 1 0.540 0.03415 0.4774 0.612
# 452 104 1 0.535 0.03422 0.4722 0.607
# 457 102 1 0.530 0.03428 0.4668 0.602
# 460 101 1 0.525 0.03434 0.4615 0.597
# 465 99 1 0.519 0.03440 0.4562 0.591
# 482 96 1 0.514 0.03447 0.4507 0.586
# 489 95 1 0.509 0.03453 0.4452 0.581
# 496 94 1 0.503 0.03458 0.4398 0.576
# 504 92 1 0.498 0.03463 0.4342 0.570
# 512 91 1 0.492 0.03468 0.4287 0.565
# 514 90 1 0.487 0.03473 0.4232 0.560
# 517 89 1 0.481 0.03476 0.4178 0.554
# 518 87 1 0.476 0.03480 0.4122 0.549
# 522 86 1 0.470 0.03483 0.4067 0.544
# 523 85 2 0.459 0.03488 0.3956 0.533
# 532 80 1 0.453 0.03491 0.3899 0.527
# 533 78 1 0.448 0.03495 0.3841 0.522
# 540 77 1 0.442 0.03497 0.3783 0.516
# 546 74 1 0.436 0.03501 0.3723 0.510
# 550 73 1 0.430 0.03503 0.3664 0.504
# 560 70 1 0.424 0.03507 0.3603 0.498
# 563 69 1 0.418 0.03509 0.3542 0.492
# 581 62 1 0.411 0.03517 0.3474 0.486
# 591 59 1 0.404 0.03525 0.3404 0.479
# 612 54 2 0.389 0.03550 0.3252 0.465
# 624 51 1 0.381 0.03561 0.3175 0.458
# 646 48 1 0.373 0.03574 0.3095 0.450
# 652 47 1 0.365 0.03586 0.3015 0.443
# 661 46 1 0.357 0.03595 0.2935 0.435
# 667 45 1 0.350 0.03601 0.2856 0.428
# 679 44 1 0.342 0.03606 0.2777 0.420
# 683 43 1 0.334 0.03609 0.2699 0.412
# 708 39 1 0.325 0.03616 0.2614 0.404
# 714 37 1 0.316 0.03623 0.2527 0.396
# 739 35 1 0.307 0.03631 0.2437 0.387
# 749 34 1 0.298 0.03635 0.2348 0.379
# 755 33 1 0.289 0.03635 0.2260 0.370
# 760 32 1 0.280 0.03632 0.2173 0.361
# 771 28 1 0.270 0.03638 0.2075 0.352
# 774 27 1 0.260 0.03638 0.1978 0.342
# 785 26 1 0.250 0.03633 0.1882 0.332
# 821 20 2 0.225 0.03675 0.1635 0.310
# 836 17 1 0.212 0.03690 0.1506 0.298
# 837 16 1 0.199 0.03689 0.1380 0.286
# 857 14 1 0.184 0.03688 0.1246 0.273
# 878 13 1 0.170 0.03667 0.1116 0.260
# 892 10 1 0.153 0.03675 0.0958 0.245
# 899 9 1 0.136 0.03639 0.0807 0.230
For specified survival time, the times
argument can be used. For example, to produce, survival estimate for survival time 30 -
summary(kmfit1, times = 30)
# Call: survfit(formula = Y ~ 1)
#
# time n.risk n.event survival std.err lower 95% CI upper 95% CI
# 30 228 7 0.97 0.0111 0.949 0.992
To get survival estimates at different survival times, say in interval of 100 -
summary(kmfit1,times=seq(0, max(addicts$survt), by=100))
# Call: survfit(formula = Y ~ 1)
#
# time n.risk n.event survival std.err lower 95% CI upper 95% CI
# 0 238 0 1.000 0.0000 1.0000 1.000
# 100 203 25 0.893 0.0203 0.8537 0.933
# 200 168 27 0.771 0.0280 0.7176 0.827
# 300 137 27 0.645 0.0323 0.5844 0.711
# 400 111 17 0.561 0.0339 0.4984 0.631
# 500 92 11 0.503 0.0346 0.4398 0.576
# 600 57 17 0.404 0.0353 0.3404 0.479
# 700 39 9 0.334 0.0361 0.2699 0.412
# 800 21 9 0.250 0.0363 0.1882 0.332
# 900 8 8 0.136 0.0364 0.0807 0.230
# 1000 3 0 0.136 0.0364 0.0807 0.230
To plot the output -
plot(kmfit1, frame.plot=FALSE,
xlab = "Survival Time in Days",
ylab = "Survival Probabilities")
Stratify by Other Variable
To stratify the survival experience by CLINIC variable we need to fit a different model including the variable CLINIC instead of 1 -
kmfit2 = survfit(Y ~ clinic, data = addicts)
kmfit2
# Call: survfit(formula = Y ~ clinic, data = addicts)
#
# n events median 0.95LCL 0.95UCL
# clinic=1 163 122 428 348 514
# clinic=2 75 28 NA 661 NA
Let’s see the survival estimates at a time interval of 100 -
summary(kmfit2,times=seq(0,max(addicts$survt),by=100))
# Call: survfit(formula = Y ~ clinic, data = addicts)
#
# clinic=1
# time n.risk n.event survival std.err lower 95% CI upper 95% CI
# 0 163 0 1.0000 0.0000 1.00000 1.000
# 100 137 20 0.8746 0.0262 0.82467 0.928
# 200 110 20 0.7420 0.0353 0.67601 0.814
# 300 87 20 0.6046 0.0399 0.53120 0.688
# 400 68 14 0.5025 0.0415 0.42741 0.591
# 500 53 9 0.4319 0.0418 0.35719 0.522
# 600 30 16 0.2951 0.0403 0.22570 0.386
# 700 20 8 0.2113 0.0383 0.14818 0.301
# 800 10 8 0.1268 0.0326 0.07660 0.210
# 900 1 7 0.0181 0.0172 0.00283 0.116
#
# clinic=2
# time n.risk n.event survival std.err lower 95% CI upper 95% CI
# 0 75 0 1.000 0.0000 1.000 1.000
# 100 66 5 0.932 0.0294 0.876 0.991
# 200 58 7 0.832 0.0442 0.750 0.924
# 300 50 7 0.730 0.0530 0.633 0.842
# 400 43 3 0.685 0.0558 0.584 0.804
# 500 39 2 0.653 0.0577 0.549 0.776
# 600 27 1 0.634 0.0590 0.528 0.761
# 700 19 1 0.606 0.0625 0.495 0.742
# 800 11 1 0.575 0.0669 0.457 0.722
# 900 7 1 0.517 0.0812 0.380 0.703
# 1000 3 0 0.517 0.0812 0.380 0.703
Notice for CLINIC=1, survival times stopped at 900 rather than 1000, because no subject was at risk on day 1000.
We can plot the survival curve for the two clinics -
plot(kmfit2, frame.plot=FALSE, lwd = 2,
lty = c("solid","dashed"),
col=c("#DC7633","#2471A3"),
xlab = "Survival Time in Days",
ylab = "Survival Probabilities")
legend("topright", c("Clinic 1", "Clinic 2"),
lty = c("solid","dashed"), col=c("#DC7633","#2471A3"))
The plot indicates that subjects from CLINIC=2 have a higher rate of survival than subjects from CLINIC=1.
Reference:
- Kleinbaum, D. G., & Klein, M. (2012). Survival analysis: a self-learning text (Vol. 3). New York: Springer.