The Kaplan-Meier Estimatior in R


The Kaplan–Meier estimator, often known as the product limit estimator, is a non-parametric statistic used to estimate the survival function using lifetime data. In this post, we will see how to estimate the survival function in this method using R.

Packages & Data

For fitting a Kaplan-Meier curve, the survival package is required -

library(survival)

For this example, I will use the addicts dataset -

remotes::install_github("lbraglia/suanselete3")
data(addicts, package = 'suanselete3')

Here, the package suanselete3 is an unofficial companion to the textbook “Survival Analysis - A Self-Learning Text” by D.G. Kleinbaum and M. Klein (3rd Ed., 2012) including all the accompanying datasets. I have loaded the addicts data set from this package.

Description of the data (event=dropped out) -

  1. ID – Patient ID
  2. CLINIC – Indicates which methadone treatment clinic the patient attended (coded 1 or 2)
  3. STATUS – Indicates whether the patient dropped out of the clinic (coded 1) or was censored (coded 0)
  4. SURVT – The time (in days) until the patient dropped out of the clinic or was censored
  5. PRISON – Indicates whether the patient had a prison record (coded 1) or not (coded 0)
  6. DOSE – A continuous variable for the patient’s maximum methadone dose (mg/day)

Data Preparation

Let’s have an idea about the data set -

head(addicts)
#   id clinic status survt prison dose
# 1  1      1      1   428      0   50
# 2  2      1      1   275      1   55
# 3  3      1      1   262      0   55
# 4  4      1      1   183      0   30
# 5  5      1      1   259      1   65
# 6  6      1      1   714      0   55

Check the class of the data set imported in R -

class(addicts)
# [1] "data.frame"

It is a data frame. To work with this, firstly, we need to create a Surv object which will be used as a response variable -

Y <- Surv(time = addicts$survt, event = addicts$status == 1)
Y
#   [1]  428   275   262   183   259   714   438   796+  892   393   161+  836 
#  [13]  523   612   212   399   771   514   512   624   209   341   299   826+
#  [25]  262   566+  368   302   602+  652   293   564+  394   755   591   787+
#  [37]  739   550   837   612   581+  523   504   785   774   560   160   482 
#  [49]  518   683   147   563   646   899   857   180   452   760   496   258 
#  [61]  181   386   439+  563+  337   613+  192   405+  667   905+  247   821 
#  [73]  821   517+  346+  294   244    95   376   212    96   532   522   679 
#  [85]  408+  840+  148+  168   489   541+  205   475+  237   517   749   150 
#  [97]  465   708   713+  146+  450   555+  460    53+  122    35   532+  684+
# [109]  769+  591+  769+  609+  932+  932+  587+   26    72+  641+  367+  633+
# [121]  661   232    13   563+  969+ 1052+  944+  881+  190    79   884+  170 
# [133]  286   358+  326+  769+  161   564+  268   611+  322  1076+    2+  788+
# [145]  575+  109   730+  790+  456+  231   143    86+ 1021+  684+  878   216 
# [157]  808+  268   222+  683+  496+  389   126    17   350   531+  317+  461+
# [169]   37   167   358    49   457   127     7    29    62   150+  223   129+
# [181]  204+  129   581   176    30    41   543+  210+  193   434   367   348 
# [193]   28+  337+  175+  149   546    84   283+  533   207   216    28+   67 
# [205]   62+  111+  257   136   342+   41   531+   98+  145    50    53+  103+
# [217]    2+  157    75    19    35   394+  117   175   180   314   480+  325+
# [229]  280   204   366   531+   59    33   540   551+   90    47

Here the time variable survt has been assigned with the plus sign (+) indicating censored and not censored otherwise (based on status variable).

Model

In the following code, Y~1 requests an intercept only model -

kmfit1 = survfit(Y~1)
kmfit1
# Call: survfit(formula = Y ~ 1)
# 
#       n  events  median 0.95LCL 0.95UCL 
#     238     150     504     399     560

The output contains descriptive information on the number of records, the number at risk at time 0, the number of events, and the median estimated survival time with a 95% confidence interval.

The summary function shows the survival estimates -

summary(kmfit1)
# Call: survfit(formula = Y ~ 1)
# 
#  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#     7    236       1    0.996 0.00423       0.9875        1.000
#    13    235       1    0.992 0.00597       0.9799        1.000
#    17    234       1    0.987 0.00729       0.9731        1.000
#    19    233       1    0.983 0.00840       0.9667        1.000
#    26    232       1    0.979 0.00937       0.9606        0.997
#    29    229       1    0.975 0.01026       0.9546        0.995
#    30    228       1    0.970 0.01107       0.9488        0.992
#    33    227       1    0.966 0.01182       0.9431        0.989
#    35    226       2    0.957 0.01317       0.9320        0.984
#    37    224       1    0.953 0.01379       0.9265        0.981
#    41    223       2    0.945 0.01493       0.9158        0.974
#    47    221       1    0.940 0.01546       0.9105        0.971
#    49    220       1    0.936 0.01597       0.9053        0.968
#    50    219       1    0.932 0.01646       0.9001        0.965
#    59    216       1    0.927 0.01694       0.8949        0.961
#    62    215       1    0.923 0.01740       0.8897        0.958
#    67    213       1    0.919 0.01785       0.8845        0.954
#    75    211       1    0.914 0.01829       0.8793        0.951
#    79    210       1    0.910 0.01871       0.8742        0.948
#    84    209       1    0.906 0.01913       0.8691        0.944
#    90    207       1    0.901 0.01953       0.8639        0.940
#    95    206       1    0.897 0.01992       0.8588        0.937
#    96    205       1    0.893 0.02029       0.8537        0.933
#   109    202       1    0.888 0.02067       0.8486        0.930
#   117    200       1    0.884 0.02104       0.8435        0.926
#   122    199       1    0.879 0.02140       0.8384        0.922
#   126    198       1    0.875 0.02174       0.8333        0.919
#   127    197       1    0.870 0.02208       0.8282        0.915
#   129    196       1    0.866 0.02241       0.8232        0.911
#   136    194       1    0.862 0.02274       0.8181        0.907
#   143    193       1    0.857 0.02305       0.8131        0.903
#   145    192       1    0.853 0.02336       0.8080        0.900
#   147    190       1    0.848 0.02366       0.8030        0.896
#   149    188       1    0.844 0.02396       0.7979        0.892
#   150    187       1    0.839 0.02426       0.7929        0.888
#   157    185       1    0.835 0.02455       0.7878        0.884
#   160    184       1    0.830 0.02483       0.7828        0.880
#   161    183       1    0.826 0.02510       0.7777        0.876
#   167    181       1    0.821 0.02538       0.7727        0.872
#   168    180       1    0.816 0.02564       0.7676        0.868
#   170    179       1    0.812 0.02590       0.7626        0.864
#   175    178       1    0.807 0.02615       0.7576        0.860
#   176    176       1    0.803 0.02640       0.7526        0.856
#   180    175       2    0.794 0.02689       0.7425        0.848
#   181    173       1    0.789 0.02712       0.7375        0.844
#   183    172       1    0.784 0.02735       0.7325        0.840
#   190    171       1    0.780 0.02757       0.7275        0.836
#   192    170       1    0.775 0.02779       0.7226        0.832
#   193    169       1    0.771 0.02800       0.7176        0.827
#   204    168       1    0.766 0.02821       0.7127        0.823
#   205    166       1    0.761 0.02841       0.7077        0.819
#   207    165       1    0.757 0.02861       0.7027        0.815
#   209    164       1    0.752 0.02881       0.6978        0.811
#   212    162       2    0.743 0.02919       0.6878        0.802
#   216    160       2    0.734 0.02955       0.6779        0.794
#   223    157       1    0.729 0.02973       0.6729        0.790
#   231    156       1    0.724 0.02991       0.6679        0.785
#   232    155       1    0.720 0.03008       0.6630        0.781
#   237    154       1    0.715 0.03024       0.6580        0.777
#   244    153       1    0.710 0.03040       0.6531        0.772
#   247    152       1    0.706 0.03056       0.6481        0.768
#   257    151       1    0.701 0.03071       0.6432        0.764
#   258    150       1    0.696 0.03086       0.6383        0.759
#   259    149       1    0.692 0.03101       0.6333        0.755
#   262    148       2    0.682 0.03128       0.6235        0.746
#   268    146       2    0.673 0.03154       0.6138        0.738
#   275    144       1    0.668 0.03167       0.6089        0.733
#   280    143       1    0.663 0.03179       0.6040        0.729
#   286    141       1    0.659 0.03191       0.5991        0.724
#   293    140       1    0.654 0.03203       0.5942        0.720
#   294    139       1    0.649 0.03214       0.5893        0.716
#   299    138       1    0.645 0.03225       0.5844        0.711
#   302    137       1    0.640 0.03236       0.5796        0.707
#   314    136       1    0.635 0.03246       0.5747        0.702
#   322    134       1    0.631 0.03256       0.5698        0.698
#   337    131       1    0.626 0.03267       0.5648        0.693
#   341    129       1    0.621 0.03277       0.5598        0.689
#   348    126       1    0.616 0.03288       0.5547        0.684
#   350    125       1    0.611 0.03298       0.5496        0.679
#   358    124       1    0.606 0.03308       0.5446        0.675
#   366    122       1    0.601 0.03318       0.5395        0.670
#   367    121       1    0.596 0.03328       0.5343        0.665
#   368    119       1    0.591 0.03338       0.5292        0.660
#   376    118       1    0.586 0.03347       0.5241        0.656
#   386    117       1    0.581 0.03355       0.5189        0.651
#   389    116       1    0.576 0.03364       0.5138        0.646
#   393    115       1    0.571 0.03371       0.5087        0.641
#   394    114       1    0.566 0.03379       0.5036        0.636
#   399    112       1    0.561 0.03386       0.4984        0.631
#   428    109       1    0.556 0.03394       0.4932        0.627
#   434    108       1    0.551 0.03401       0.4879        0.622
#   438    107       1    0.546 0.03408       0.4827        0.617
#   450    105       1    0.540 0.03415       0.4774        0.612
#   452    104       1    0.535 0.03422       0.4722        0.607
#   457    102       1    0.530 0.03428       0.4668        0.602
#   460    101       1    0.525 0.03434       0.4615        0.597
#   465     99       1    0.519 0.03440       0.4562        0.591
#   482     96       1    0.514 0.03447       0.4507        0.586
#   489     95       1    0.509 0.03453       0.4452        0.581
#   496     94       1    0.503 0.03458       0.4398        0.576
#   504     92       1    0.498 0.03463       0.4342        0.570
#   512     91       1    0.492 0.03468       0.4287        0.565
#   514     90       1    0.487 0.03473       0.4232        0.560
#   517     89       1    0.481 0.03476       0.4178        0.554
#   518     87       1    0.476 0.03480       0.4122        0.549
#   522     86       1    0.470 0.03483       0.4067        0.544
#   523     85       2    0.459 0.03488       0.3956        0.533
#   532     80       1    0.453 0.03491       0.3899        0.527
#   533     78       1    0.448 0.03495       0.3841        0.522
#   540     77       1    0.442 0.03497       0.3783        0.516
#   546     74       1    0.436 0.03501       0.3723        0.510
#   550     73       1    0.430 0.03503       0.3664        0.504
#   560     70       1    0.424 0.03507       0.3603        0.498
#   563     69       1    0.418 0.03509       0.3542        0.492
#   581     62       1    0.411 0.03517       0.3474        0.486
#   591     59       1    0.404 0.03525       0.3404        0.479
#   612     54       2    0.389 0.03550       0.3252        0.465
#   624     51       1    0.381 0.03561       0.3175        0.458
#   646     48       1    0.373 0.03574       0.3095        0.450
#   652     47       1    0.365 0.03586       0.3015        0.443
#   661     46       1    0.357 0.03595       0.2935        0.435
#   667     45       1    0.350 0.03601       0.2856        0.428
#   679     44       1    0.342 0.03606       0.2777        0.420
#   683     43       1    0.334 0.03609       0.2699        0.412
#   708     39       1    0.325 0.03616       0.2614        0.404
#   714     37       1    0.316 0.03623       0.2527        0.396
#   739     35       1    0.307 0.03631       0.2437        0.387
#   749     34       1    0.298 0.03635       0.2348        0.379
#   755     33       1    0.289 0.03635       0.2260        0.370
#   760     32       1    0.280 0.03632       0.2173        0.361
#   771     28       1    0.270 0.03638       0.2075        0.352
#   774     27       1    0.260 0.03638       0.1978        0.342
#   785     26       1    0.250 0.03633       0.1882        0.332
#   821     20       2    0.225 0.03675       0.1635        0.310
#   836     17       1    0.212 0.03690       0.1506        0.298
#   837     16       1    0.199 0.03689       0.1380        0.286
#   857     14       1    0.184 0.03688       0.1246        0.273
#   878     13       1    0.170 0.03667       0.1116        0.260
#   892     10       1    0.153 0.03675       0.0958        0.245
#   899      9       1    0.136 0.03639       0.0807        0.230

For specified survival time, the times argument can be used. For example, to produce, survival estimate for survival time 30 -

summary(kmfit1, times = 30)
# Call: survfit(formula = Y ~ 1)
# 
#  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#    30    228       7     0.97  0.0111        0.949        0.992

To get survival estimates at different survival times, say in interval of 100 -

summary(kmfit1,times=seq(0, max(addicts$survt), by=100))
# Call: survfit(formula = Y ~ 1)
# 
#  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#     0    238       0    1.000  0.0000       1.0000        1.000
#   100    203      25    0.893  0.0203       0.8537        0.933
#   200    168      27    0.771  0.0280       0.7176        0.827
#   300    137      27    0.645  0.0323       0.5844        0.711
#   400    111      17    0.561  0.0339       0.4984        0.631
#   500     92      11    0.503  0.0346       0.4398        0.576
#   600     57      17    0.404  0.0353       0.3404        0.479
#   700     39       9    0.334  0.0361       0.2699        0.412
#   800     21       9    0.250  0.0363       0.1882        0.332
#   900      8       8    0.136  0.0364       0.0807        0.230
#  1000      3       0    0.136  0.0364       0.0807        0.230

To plot the output -

plot(kmfit1, frame.plot=FALSE,
     xlab = "Survival Time in Days", 
     ylab = "Survival Probabilities")

Stratify by Other Variable

To stratify the survival experience by CLINIC variable we need to fit a different model including the variable CLINIC instead of 1 -

kmfit2 = survfit(Y ~ clinic, data = addicts)
kmfit2
# Call: survfit(formula = Y ~ clinic, data = addicts)
# 
#            n events median 0.95LCL 0.95UCL
# clinic=1 163    122    428     348     514
# clinic=2  75     28     NA     661      NA

Let’s see the survival estimates at a time interval of 100 -

summary(kmfit2,times=seq(0,max(addicts$survt),by=100))
# Call: survfit(formula = Y ~ clinic, data = addicts)
# 
#                 clinic=1 
#  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#     0    163       0   1.0000  0.0000      1.00000        1.000
#   100    137      20   0.8746  0.0262      0.82467        0.928
#   200    110      20   0.7420  0.0353      0.67601        0.814
#   300     87      20   0.6046  0.0399      0.53120        0.688
#   400     68      14   0.5025  0.0415      0.42741        0.591
#   500     53       9   0.4319  0.0418      0.35719        0.522
#   600     30      16   0.2951  0.0403      0.22570        0.386
#   700     20       8   0.2113  0.0383      0.14818        0.301
#   800     10       8   0.1268  0.0326      0.07660        0.210
#   900      1       7   0.0181  0.0172      0.00283        0.116
# 
#                 clinic=2 
#  time n.risk n.event survival std.err lower 95% CI upper 95% CI
#     0     75       0    1.000  0.0000        1.000        1.000
#   100     66       5    0.932  0.0294        0.876        0.991
#   200     58       7    0.832  0.0442        0.750        0.924
#   300     50       7    0.730  0.0530        0.633        0.842
#   400     43       3    0.685  0.0558        0.584        0.804
#   500     39       2    0.653  0.0577        0.549        0.776
#   600     27       1    0.634  0.0590        0.528        0.761
#   700     19       1    0.606  0.0625        0.495        0.742
#   800     11       1    0.575  0.0669        0.457        0.722
#   900      7       1    0.517  0.0812        0.380        0.703
#  1000      3       0    0.517  0.0812        0.380        0.703

Notice for CLINIC=1, survival times stopped at 900 rather than 1000, because no subject was at risk on day 1000.

We can plot the survival curve for the two clinics -

plot(kmfit2, frame.plot=FALSE, lwd = 2, 
     lty = c("solid","dashed"), 
     col=c("#DC7633","#2471A3"),
     xlab = "Survival Time in Days", 
     ylab = "Survival Probabilities")
legend("topright", c("Clinic 1", "Clinic 2"), 
       lty = c("solid","dashed"), col=c("#DC7633","#2471A3"))

The plot indicates that subjects from CLINIC=2 have a higher rate of survival than subjects from CLINIC=1.

Reference:

  1. Kleinbaum, D. G., & Klein, M. (2012). Survival analysis: a self-learning text (Vol. 3). New York: Springer.
Md Ahsanul Islam
Md Ahsanul Islam
Freelance Data Analysis and R Programmer

Statistics graduate student currently researching on econometrics