<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://lingying.me/feed.xml" rel="self" type="application/atom+xml" /><link href="https://lingying.me/" rel="alternate" type="text/html" /><updated>2026-05-21T13:14:48+00:00</updated><id>https://lingying.me/feed.xml</id><title type="html">Homepage</title><subtitle>Data Science Undergraduate</subtitle><author><name>Ying LING</name><email>spiritsswin@gmail.com</email></author><entry><title type="html">Survival Analysis Report — Telco Customer Churn</title><link href="https://lingying.me/posts/2026/04/survival-analysis-telco-churn/" rel="alternate" type="text/html" title="Survival Analysis Report — Telco Customer Churn" /><published>2026-04-18T00:00:00+00:00</published><updated>2026-04-18T00:00:00+00:00</updated><id>https://lingying.me/posts/2026/04/survival-analysis-telco-churn</id><content type="html" xml:base="https://lingying.me/posts/2026/04/survival-analysis-telco-churn/"><![CDATA[<h2 id="1-data-preparation">1. Data Preparation</h2>

<h3 id="11-dataset">1.1 Dataset</h3>

<table>
  <thead>
    <tr>
      <th><strong>Item</strong></th>
      <th><strong>Value</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Source</td>
      <td><a href="https://github.com/IBM/telco-customer-churn-on-icp4d">IBM Telco Customer Churn</a></td>
    </tr>
    <tr>
      <td>Original rows (Bronze)</td>
      <td><strong>7,043</strong></td>
    </tr>
    <tr>
      <td>Columns</td>
      <td>21</td>
    </tr>
  </tbody>
</table>

<h3 id="12-filtering-for-survival-analysis">1.2 Filtering for Survival Analysis</h3>

<p>Two filters were applied to the raw (Bronze) data:</p>

<ol>
  <li><strong>Contract = “Month-to-month”</strong> only</li>
  <li><strong>InternetService ≠ “No”</strong> (internet subscribers only)</li>
</ol>
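
<p>The two filters can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline used for the report: the sample rows and the helper name <code>to_silver</code> are hypothetical, and only the column names <code>Contract</code> and <code>InternetService</code> come from the filters listed above.</p>

```python
# Hypothetical sample rows; the real Bronze table has 7,043 rows and 21 columns.
bronze = [
    {"customerID": "A", "Contract": "Month-to-month", "InternetService": "DSL"},
    {"customerID": "B", "Contract": "Two year", "InternetService": "Fiber optic"},
    {"customerID": "C", "Contract": "Month-to-month", "InternetService": "No"},
    {"customerID": "D", "Contract": "Month-to-month", "InternetService": "Fiber optic"},
]


def to_silver(rows):
    """Keep month-to-month customers who subscribe to an internet service."""
    return [
        r for r in rows
        if r["Contract"] == "Month-to-month" and r["InternetService"] != "No"
    ]


silver = to_silver(bronze)
print([r["customerID"] for r in silver])  # ['A', 'D']
```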

<table>
  <thead>
    <tr>
      <th>Stage</th>
      <th>Rows</th>
      <th>% of Original</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Bronze (raw)</td>
      <td>7,043</td>
      <td>100.0%</td>
    </tr>
    <tr>
      <td>Silver (filtered)</td>
      <td><strong>3,351</strong></td>
      <td>47.6%</td>
    </tr>
  </tbody>
</table>

<h3 id="13-churn-distribution-silver-table">1.3 Churn Distribution (Silver Table)</h3>

<table>
  <thead>
    <tr>
      <th><strong>Churn</strong></th>
      <th><strong>Count</strong></th>
      <th><strong>Percentage</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0 (Retained)</td>
      <td>1,795</td>
      <td>53.6%</td>
    </tr>
    <tr>
      <td>1 (Churned)</td>
      <td>1,556</td>
      <td>46.4%</td>
    </tr>
    <tr>
      <td><strong>Total</strong></td>
      <td><strong>3,351</strong></td>
      <td> </td>
    </tr>
  </tbody>
</table>

<h3 id="14-contract-distribution-full-dataset">1.4 Contract Distribution (Full Dataset)</h3>

<table>
  <thead>
    <tr>
      <th><strong>Contract</strong></th>
      <th><strong>Count</strong></th>
      <th><strong>Percentage</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Month-to-month</td>
      <td>3,875</td>
      <td>55.0%</td>
    </tr>
    <tr>
      <td>Two year</td>
      <td>1,695</td>
      <td>24.1%</td>
    </tr>
    <tr>
      <td>One year</td>
      <td>1,473</td>
      <td>20.9%</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="2-kaplan-meier-estimator">2. Kaplan-Meier Estimator</h2>

<h3 id="21-what-is-kaplan-meier">2.1 What is Kaplan-Meier?</h3>

<p>Kaplan-Meier is a <strong>non-parametric</strong> method that estimates the survival function S(t) — the probability that a customer survives beyond time t. It properly accounts for censored observations.</p>
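
<p>As a rough sketch of how the estimator handles censoring, the function below computes S(t) from scratch: at each churn time the running survival is multiplied by (1 − churned / at risk), and censored customers only leave the risk set. The function name <code>kaplan_meier</code> and the toy data are illustrative assumptions, not taken from the report.</p>

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct churn time.

    durations: observed tenure (for example, in months)
    events: 1 if churn was observed, 0 if the customer is censored
    """
    churned = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving = Counter(durations)  # everyone who leaves the risk set at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = churned.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]  # churned and censored customers both drop out
    return curve


# Toy data (not from the report): five customers, two of them censored.
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
for t, s in kaplan_meier(durations, events):
    print(t, round(s, 4))
```

<p>The customer censored at month 3 still counts as at risk for the month-3 churn event, but is removed before month 5, which is what “accounting for censoring” means in practice.</p>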

<h3 id="22-population-level-survival-curve">2.2 Population-Level Survival Curve</h3>

<p><strong>Median survival time: 34.0 months</strong></p>

<table>
  <thead>
    <tr>
      <th><strong>Time Point</strong></th>
      <th><strong>Survival Probability</strong></th>
      <th><strong>Interpretation</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>6 months</td>
      <td>0.7803</td>
      <td>78.0% survive at least 6 months</td>
    </tr>
    <tr>
      <td>12 months</td>
      <td>0.6950</td>
      <td>69.5% survive at least 1 year</td>
    </tr>
    <tr>
      <td>24 months</td>
      <td>0.5753</td>
      <td>57.5% survive at least 2 years</td>
    </tr>
    <tr>
      <td>34 months</td>
      <td>0.5000</td>
      <td><strong>Median</strong> — half have churned</td>
    </tr>
    <tr>
      <td>48 months</td>
      <td>0.3872</td>
      <td>38.7% survive at least 4 years</td>
    </tr>
    <tr>
      <td>60 months</td>
      <td>0.2890</td>
      <td>28.9% survive at least 5 years</td>
    </tr>
  </tbody>
</table>

<h3 id="23-covariate-level-analysis-with-log-rank-test">2.3 Covariate-Level Analysis with Log-Rank Test</h3>

<p>The log-rank test determines whether survival curves for different groups are statistically distinguishable. <strong>Null hypothesis (H₀):</strong> the groups have the same survival distribution.</p>
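
<p>For intuition, a two-group version of the test can be written in a few lines: at every churn time it compares the observed churn count in one group with the count expected under H₀, then refers the standardized difference to a chi-square distribution with one degree of freedom. This is a simplified sketch with toy data and a hypothetical function name; variables with more than two levels need the multi-group form of the test, which is not shown here.</p>

```python
import math


def logrank_two_groups(dur_a, ev_a, dur_b, ev_b):
    """Two-sample log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(dur_a, ev_a) if e} | {t for t, e in zip(dur_b, ev_b) if e}
    )
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in dur_a if x >= t)  # still at risk in group A
        n_b = sum(1 for x in dur_b if x >= t)  # still at risk in group B
        d_a = sum(1 for x, e in zip(dur_a, ev_a) if x == t and e)
        d_b = sum(1 for x, e in zip(dur_b, ev_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected churn in A if H0 holds
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    stat = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square tail, 1 degree of freedom
    return stat, p_value


# Toy data (not from the report): group A churns early, group B later.
stat, p = logrank_two_groups(
    [1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
    [6, 7, 8, 9, 10], [1, 0, 1, 0, 1],
)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```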

<h4 id="results-for-all-15-categorical-variables">Results for all 15 categorical variables:</h4>

<table>
  <thead>
    <tr>
      <th><strong>Variable</strong></th>
      <th><strong>Levels</strong></th>
      <th><strong>Overall p-value</strong></th>
      <th><strong>Significant (p &lt; 0.05)?</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>onlineSecurity</strong></td>
      <td>3</td>
      <td>&lt; 0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>onlineBackup</strong></td>
      <td>3</td>
      <td>&lt; 0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>deviceProtection</strong></td>
      <td>3</td>
      <td>&lt; 0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>techSupport</strong></td>
      <td>3</td>
      <td>&lt; 0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>partner</strong></td>
      <td>2</td>
      <td>&lt; 0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>dependents</strong></td>
      <td>2</td>
      <td>&lt; 0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>internetService</strong></td>
      <td>2</td>
      <td>0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>paymentMethod</strong></td>
      <td>4</td>
      <td>&lt; 0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>multipleLines</strong></td>
      <td>3</td>
      <td>&lt; 0.000001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>streamingMovies</strong></td>
      <td>2</td>
      <td>0.000023</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>streamingTV</strong></td>
      <td>2</td>
      <td>0.000322</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>paperlessBilling</strong></td>
      <td>2</td>
      <td>0.003876</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td>gender</td>
      <td>2</td>
      <td>0.153317</td>
      <td>❌ No</td>
    </tr>
    <tr>
      <td>phoneService</td>
      <td>2</td>
      <td>0.194432</td>
      <td>❌ No</td>
    </tr>
    <tr>
      <td>seniorCitizen</td>
      <td>2</td>
      <td>0.723174</td>
      <td>❌ No</td>
    </tr>
  </tbody>
</table>

<h4 id="key-findings">Key findings:</h4>

<ul>
  <li>
    <p><strong>12 out of 15 variables</strong> show statistically significant differences.</p>
  </li>
  <li>
    <p><strong>paymentMethod IS significant</strong> (overall p &lt; 0.000001).</p>
  </li>
  <li>
    <p><strong>Service-related features</strong> (onlineSecurity, onlineBackup, deviceProtection, techSupport) are among the most significant, each with p &lt; 0.000001.</p>
  </li>
</ul>

<h3 id="24-dsl-subscriber-survival-probabilities">2.4 DSL Subscriber Survival Probabilities</h3>

<table>
  <thead>
    <tr>
      <th><strong>Month</strong></th>
      <th><strong>Survival Probability</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>1.0000</td>
    </tr>
    <tr>
      <td>3</td>
      <td>0.8347</td>
    </tr>
    <tr>
      <td>6</td>
      <td>0.7839</td>
    </tr>
    <tr>
      <td>9</td>
      <td>0.7508</td>
    </tr>
    <tr>
      <td>12</td>
      <td>0.7270</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="3-cox-proportional-hazards-model">3. Cox Proportional Hazards Model</h2>

<h3 id="31-what-is-cox-ph">3.1 What is Cox PH?</h3>

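<p>Cox Proportional Hazards is a <strong>semi-parametric</strong> regression model: the baseline hazard h₀(t) is left unspecified, and each covariate multiplies it by a hazard ratio HR = exp(β):</p>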

\[h(t|X) = h_0(t) \times e^{\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}\]

<ul>
  <li>
    <p><strong>HR &lt; 1</strong> → protective (reduces churn risk)</p>
  </li>
  <li>
    <p><strong>HR &gt; 1</strong> → risk factor (increases churn risk)</p>
  </li>
</ul>

<h3 id="32-feature-encoding">3.2 Feature Encoding</h3>

<table>
  <thead>
    <tr>
      <th><strong>Original Variable</strong></th>
      <th><strong>Kept Column</strong></th>
      <th><strong>Dropped (baseline)</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>dependents</td>
      <td>dependents_Yes</td>
      <td>dependents_No</td>
    </tr>
    <tr>
      <td>internetService</td>
      <td>internetService_DSL</td>
      <td>internetService_Fiber optic</td>
    </tr>
    <tr>
      <td>onlineBackup</td>
      <td>onlineBackup_Yes</td>
      <td>onlineBackup_No</td>
    </tr>
    <tr>
      <td>techSupport</td>
      <td>techSupport_Yes</td>
      <td>techSupport_No</td>
    </tr>
    <tr>
      <td>paperlessBilling</td>
      <td>paperlessBilling_Yes</td>
      <td>paperlessBilling_No</td>
    </tr>
  </tbody>
</table>

<h3 id="33-model-results">3.3 Model Results</h3>

<table>
  <thead>
    <tr>
      <th><strong>Covariate</strong></th>
      <th><strong>Coef (β)</strong></th>
      <th><strong>Hazard Ratio exp(β)</strong></th>
      <th><strong>p-value</strong></th>
      <th><strong>95% CI</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>onlineBackup_Yes</strong></td>
      <td>-0.7766</td>
      <td><strong>0.4600</strong></td>
      <td>&lt; 0.001</td>
      <td>[0.4096, 0.5165]</td>
    </tr>
    <tr>
      <td><strong>techSupport_Yes</strong></td>
      <td>-0.6392</td>
      <td><strong>0.5277</strong></td>
      <td>&lt; 0.001</td>
      <td>[0.4553, 0.6117]</td>
    </tr>
    <tr>
      <td><strong>dependents_Yes</strong></td>
      <td>-0.3287</td>
      <td><strong>0.7199</strong></td>
      <td>&lt; 0.001</td>
      <td>[0.6265, 0.8272]</td>
    </tr>
    <tr>
      <td><strong>internetService_DSL</strong></td>
      <td>-0.2173</td>
      <td><strong>0.8047</strong></td>
      <td>0.0002</td>
      <td>[0.7167, 0.9034]</td>
    </tr>
    <tr>
      <td><strong>Concordance Index: 0.6409</strong></td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
    </tr>
  </tbody>
</table>

<h3 id="34-interpretation">3.4 Interpretation</h3>

<ul>
  <li>
    <p><strong>Online Backup (HR = 0.460):</strong> 54% lower hazard of churning.</p>
  </li>
  <li>
    <p><strong>Tech Support (HR = 0.528):</strong> 47.2% lower hazard.</p>
  </li>
  <li>
    <p><strong>DSL Internet (HR = 0.805):</strong> 19.5% lower hazard compared to Fiber Optic.</p>
  </li>
</ul>
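
<p>The percentages above follow directly from the coefficients in section 3.3: HR = exp(β), and the change in hazard is (HR − 1) × 100%. The short sketch below reproduces that arithmetic. The helper <code>relative_hazard</code> is a hypothetical name; it assumes the Cox model's multiplicative form, under which the hazard ratios of several covariates combine by multiplication.</p>

```python
import math

# Coefficients copied from the model results table in section 3.3.
coefs = {
    "onlineBackup_Yes": -0.7766,
    "techSupport_Yes": -0.6392,
    "dependents_Yes": -0.3287,
    "internetService_DSL": -0.2173,
}

for name, beta in coefs.items():
    hr = math.exp(beta)
    change = (hr - 1) * 100  # negative means lower churn hazard
    print(f"{name}: HR = {hr:.4f}, hazard change = {change:+.1f}%")


def relative_hazard(profile):
    """Hazard relative to an all-baseline customer: exp(sum of beta * x)."""
    return math.exp(sum(coefs[name] * x for name, x in profile.items()))


# A customer with both online backup and tech support, other covariates at baseline.
print(round(relative_hazard({"onlineBackup_Yes": 1, "techSupport_Yes": 1}), 4))
```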

<h3 id="35-proportional-hazards-assumption-check">3.5 Proportional Hazards Assumption Check</h3>

<table>
  <thead>
    <tr>
      <th><strong>Variable</strong></th>
      <th><strong>p-value</strong></th>
      <th><strong>PH Assumption Violated?</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>internetService_DSL</strong></td>
      <td>&lt; 0.0001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>onlineBackup_Yes</strong></td>
      <td>&lt; 0.0001</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>techSupport_Yes</strong></td>
      <td>0.0002</td>
      <td>✅ Yes</td>
    </tr>
    <tr>
      <td><strong>dependents_Yes</strong></td>
      <td>&gt; 0.05</td>
      <td>❌ No</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="4-accelerated-failure-time-aft-model">4. Accelerated Failure Time (AFT) Model</h2>

<h3 id="41-what-is-aft">4.1 What is AFT?</h3>

<p>AFT models how covariates “accelerate” or “decelerate” the time to event: each covariate multiplies a baseline survival time T₀ by exp(β):</p>

\[T = T_0 \times e^{\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}\]

<ul>
  <li>
    <p><strong>exp(β) &gt; 1</strong> → time to churn is <strong>longer</strong> (protective)</p>
  </li>
  <li>
    <p><strong>exp(β) &lt; 1</strong> → time to churn is <strong>shorter</strong> (risk factor)</p>
  </li>
</ul>
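
<p>To make the time-ratio reading concrete, the sketch below applies T = T₀ × exp(β) to a single covariate. The coefficient is the onlineSecurity_Yes value reported in section 4.4; the 10-month baseline is a made-up number chosen only to show the scaling, and <code>scaled_time</code> is a hypothetical helper.</p>

```python
import math


def scaled_time(baseline_months, beta):
    """Apply the AFT relation T = T0 * exp(beta) for one covariate set to 1."""
    return baseline_months * math.exp(beta)


beta_online_security = 0.8616  # from the AFT coefficient table (section 4.4)
baseline_months = 10.0         # hypothetical baseline time to churn

ratio = math.exp(beta_online_security)
print(f"time ratio = {ratio:.2f}")
print(f"time to churn = {scaled_time(baseline_months, beta_online_security):.1f} months")
```

<p>A negative coefficient would give a ratio below 1 and therefore a shorter time to churn.</p>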

<h3 id="42-feature-encoding">4.2 Feature Encoding</h3>

<p>9 covariates: partner, multipleLines, internetService_DSL, onlineSecurity, onlineBackup, deviceProtection, techSupport, paymentMethod_Bank, paymentMethod_Credit.</p>

<h3 id="43-model-results">4.3 Model Results</h3>

<p><strong>Median survival time: 135.51 months</strong></p>

<table>
  <thead>
    <tr>
      <th><strong>Metric</strong></th>
      <th><strong>Cox PH</strong></th>
      <th><strong>AFT (Log-Logistic)</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Concordance</td>
      <td>0.6409</td>
      <td><strong>0.7306</strong></td>
    </tr>
  </tbody>
</table>
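
<p>Concordance measures how often a model ranks pairs of customers correctly: among comparable pairs, the customer who churned earlier should have the higher predicted risk. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The function below is a simplified sketch of this idea (it skips tied durations and is not the implementation used to produce the figures above); the name <code>concordance_index</code> and the toy inputs are assumptions.</p>

```python
from itertools import combinations


def concordance_index(durations, events, risk_scores):
    """Fraction of comparable pairs ranked correctly by risk score."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(durations)), 2):
        if durations[i] == durations[j]:
            continue  # tied times are skipped in this simplified sketch
        first, second = (i, j) if durations[j] > durations[i] else (j, i)
        if not events[first]:
            continue  # earlier time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Toy data (not from the report): higher risk score means earlier expected churn.
print(concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0]))  # 1.0
```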

<h3 id="44-aft-coefficients">4.4 AFT Coefficients</h3>

<table>
  <thead>
    <tr>
      <th><strong>Covariate</strong></th>
      <th><strong>Coef (β)</strong></th>
      <th><strong>exp(β)</strong></th>
      <th><strong>p-value</strong></th>
      <th><strong>Interpretation</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>onlineSecurity_Yes</td>
      <td>0.8616</td>
      <td>2.3669</td>
      <td>&lt; 0.001</td>
      <td>2.37× longer survival</td>
    </tr>
    <tr>
      <td>onlineBackup_Yes</td>
      <td>0.8128</td>
      <td>2.2542</td>
      <td>&lt; 0.001</td>
      <td>2.25× longer survival</td>
    </tr>
    <tr>
      <td>paymentMethod_Credit</td>
      <td>0.7990</td>
      <td>2.2234</td>
      <td>&lt; 0.001</td>
      <td>2.22× longer survival</td>
    </tr>
    <tr>
      <td>techSupport_Yes</td>
      <td>0.6893</td>
      <td>1.9923</td>
      <td>&lt; 0.001</td>
      <td>1.99× longer survival</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="5-customer-lifetime-value-clv">5. Customer Lifetime Value (CLV)</h2>

<h3 id="51-methodology">5.1 Methodology</h3>

\[\text{Expected Profit}_m = S(m) \times \text{Monthly Revenue}\]

\[\text{NPV}_m = \frac{\text{Expected Profit}_m}{(1 + \text{Monthly IRR})^m}\]
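
<p>The two formulas translate into a short loop: weight the monthly revenue by the survival probability, discount it back to the present, and accumulate. In the sketch below the survival probabilities, the $30 monthly revenue, and the 10% annual rate (divided by 12 to obtain the monthly rate) are illustrative assumptions rather than inputs confirmed by the report, and <code>clv_table</code> is a hypothetical helper name.</p>

```python
def clv_table(survival, monthly_revenue, annual_rate):
    """Build CLV rows; survival[m - 1] is S(m) for month m = 1, 2, ..."""
    monthly_rate = annual_rate / 12  # assumed conversion to a monthly rate
    rows = []
    cumulative = 0.0
    for m, s in enumerate(survival, start=1):
        expected_profit = s * monthly_revenue
        npv = expected_profit / (1 + monthly_rate) ** m
        cumulative += npv
        rows.append((m, s, expected_profit, npv, cumulative))
    return rows


# Illustrative inputs only: three months of made-up survival probabilities.
table = clv_table([0.98, 0.96, 0.95], monthly_revenue=30.0, annual_rate=0.10)
for m, s, profit, npv, cum in table:
    print(f"month {m}: S={s:.2f} profit=${profit:.2f} NPV=${npv:.2f} cumulative=${cum:.2f}")
```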

<h3 id="52-customer-profile">5.2 Customer Profile</h3>

<p>The profiled customer has dependents, Fiber Optic internet, online backup, and tech support.</p>

<h3 id="53-clv-table">5.3 CLV Table</h3>

<table>
  <thead>
    <tr>
      <th><strong>Month</strong></th>
      <th><strong>Survival Prob</strong></th>
      <th><strong>Expected Profit</strong></th>
      <th><strong>NPV</strong></th>
      <th><strong>Cumulative NPV</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>0.9830</td>
      <td>$29.49</td>
      <td>$29.24</td>
      <td>$29.24</td>
    </tr>
    <tr>
      <td>6</td>
      <td>0.9416</td>
      <td>$28.25</td>
      <td>$26.90</td>
      <td>$168.13</td>
    </tr>
    <tr>
      <td>12</td>
      <td>0.9073</td>
      <td>$27.22</td>
      <td>$24.64</td>
      <td><strong>$319.76</strong></td>
    </tr>
    <tr>
      <td>24</td>
      <td>0.8583</td>
      <td>$25.75</td>
      <td>$21.12</td>
      <td><strong>$591.61</strong></td>
    </tr>
    <tr>
      <td>36</td>
      <td>0.8118</td>
      <td>$24.35</td>
      <td>$18.06</td>
      <td><strong>$824.71</strong></td>
    </tr>
  </tbody>
</table>

<h3 id="54-profile-comparison-36-month-cumulative-npv">5.4 Profile Comparison (36-Month Cumulative NPV)</h3>

<table>
  <thead>
    <tr>
      <th><strong>Profile</strong></th>
      <th><strong>36-Month Cumulative NPV</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DSL + TechSupport</td>
      <td><strong>$1,004.91</strong></td>
    </tr>
    <tr>
      <td>Fiber + TechSupport</td>
      <td>$907.05</td>
    </tr>
    <tr>
      <td>Fiber, No TechSupport</td>
      <td><strong>$596.69</strong></td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="6-summary--key-takeaways">6. Summary &amp; Key Takeaways</h2>

<h3 id="61-method-comparison">6.1 Method Comparison</h3>

<table>
  <thead>
    <tr>
      <th><strong>Method</strong></th>
      <th><strong>Concordance</strong></th>
      <th><strong>Key Assumption Met?</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cox PH</td>
      <td>0.6409</td>
      <td>❌ 3/4 violated</td>
    </tr>
    <tr>
      <td>AFT</td>
      <td><strong>0.7306</strong></td>
      <td>Partially</td>
    </tr>
  </tbody>
</table>

<h3 id="62-most-important-findings">6.2 Most Important Findings</h3>

<ol>
  <li>
    <p><strong>Median churn time is 34 months</strong> for the target segment (month-to-month internet subscribers).</p>
  </li>
  <li>
    <p><strong>Online Backup is the strongest protective factor</strong> (HR = 0.460).</p>
  </li>
  <li>
    <p><strong>The AFT model outperforms Cox PH</strong> (concordance 0.73 vs 0.64).</p>
  </li>
  <li>
    <p><strong>Gender, phoneService, and seniorCitizen show no statistically significant difference in survival</strong> (log-rank p &gt; 0.05).</p>
  </li>
</ol>

<h3 id="63-business-recommendations">6.3 Business Recommendations</h3>

<ul>
  <li>
    <p><strong>Prioritize Online Backup and Tech Support</strong> adoption, the two strongest protective factors in the Cox model (HR = 0.460 and 0.528).</p>
  </li>
  <li>
    <p><strong>Monitor Fiber Optic customers</strong> closely: DSL subscribers have a 19.5% lower churn hazard than Fiber Optic subscribers.</p>
  </li>
  <li>
    <p><strong>Use the AFT model</strong> for prediction, since it has the higher concordance (0.7306 vs 0.6409).</p>
  </li>
</ul>

<hr />

<h2 id="appendix-corrections">Appendix: Corrections</h2>

<table>
  <thead>
    <tr>
      <th><strong>Issue</strong></th>
      <th><strong>Original</strong></th>
      <th><strong>Corrected</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>paymentMethod significance</td>
      <td>Not significant</td>
      <td><strong>Significant</strong> (p &lt; 0.000001)</td>
    </tr>
    <tr>
      <td>AFT interpretation</td>
      <td>exp(β)&gt;1 = faster churn</td>
      <td><strong>exp(β)&gt;1 = longer survival</strong></td>
    </tr>
    <tr>
      <td>CLV 12m NPV</td>
      <td>$292.68</td>
      <td><strong>$319.76</strong></td>
    </tr>
    <tr>
      <td>CLV 36m NPV</td>
      <td>$799.97</td>
      <td><strong>$824.71</strong></td>
    </tr>
  </tbody>
</table>]]></content><author><name>Ying LING</name><email>spiritsswin@gmail.com</email></author><category term="survival analysis" /><category term="data science" /><category term="customer churn" /><summary type="html"><![CDATA[A comprehensive survival analysis of telco customer churn using Kaplan-Meier estimator, Cox PH model, and AFT model.]]></summary></entry></feed>