Telecom-Prediction

Predicting telecom customer churn on the 51k-row Cell2Cell dataset using a classic logistic baseline and tuned neural networks (dense and wide-&-deep) with focal loss and decision-threshold optimization. Notebook-first workflow; all artifacts saved under data/.

Project highlights: a transparent data-processing workflow, reproducible model training and optimization, and exported result artifacts. For a fuller ML portfolio, jump to the "More ML projects" section.

Repository layout

  • 00-eda.ipynb — quick EDA of raw data.
  • 01-preprocess-dataset.ipynb — feature engineering, encoding, scaling, splits; saves processed data & pipelines.
  • 02-baseline-model.ipynb — class-balanced logistic regression baseline.
  • 03-hyperparameter-tuning.ipynb — KerasTuner Bayesian search for dense + wide&deep architectures (with focal loss).
  • 04-train-deep-learning-models.ipynb — retrain best models to best epochs.
  • 05-analysis.ipynb — test-set evaluation, threshold optimization, plots & metrics export.
  • utils.py — helpers for loading/saving artifacts.
  • data/ — raw dataset (cell2celltrain.csv), processed splits, pipelines, tuned hyperparameters, trained models (*.keras), and plots/CSVs with metrics.
  • Telecom_Prediction_Report1.pdf, Telecom_Prediction_Report2.pdf — slide-style reports of the findings.

Data

  • Source: Cell2Cell churn dataset (data/cell2celltrain.csv, 51,047 rows, 58 columns) — public version as seen on Kaggle/IBM; problem type is binary classification on Churn (Yes/No → 1/0).
  • Feature engineering: derives InactiveSubs = UniqueSubs - ActiveSubs, HandsetDiff = Handsets - HandsetModels, cleans negative numeric values to NaN, parses handset price to numeric + HandsetPrice_Unknown flag, removes constants/highly redundant columns (CustomerID, CallForwardingCalls, HandsetModels, etc.). Data dictionary and column notes are documented in the reports/notebooks (per ACTL5111/3143 spec).
  • Encoding: ordinal for CreditRating/IncomeGroup, one-hot for nominal features, target-mean encoding for high-cardinality ServiceArea, median imputation + scaling for numerics.
  • Splits: stratified into train/val/test (60/20/20) saved as X_*_base.csv (for linear models) and X_*_deep.csv (for NNs) with labels y_*.csv.
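The feature-engineering and split steps above can be sketched roughly as follows. Column names (UniqueSubs, ActiveSubs, Handsets, HandsetModels, HandsetPrice, Churn) come from the Cell2Cell data; the exact notebook code and the order of the cleaning steps may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["InactiveSubs"] = df["UniqueSubs"] - df["ActiveSubs"]
    df["HandsetDiff"] = df["Handsets"] - df["HandsetModels"]
    # Parse handset price; non-numeric entries (e.g. "Unknown") become NaN,
    # and a flag column records where the price was missing.
    price = pd.to_numeric(df["HandsetPrice"], errors="coerce")
    df["HandsetPrice_Unknown"] = price.isna().astype(int)
    df["HandsetPrice"] = price
    # Treat impossible negative values in numeric columns as missing.
    num_cols = df.select_dtypes(include=np.number).columns
    df[num_cols] = df[num_cols].where(df[num_cols] >= 0)
    return df

def stratified_splits(X, y, seed=42):
    # 60/20/20 train/val/test, stratified on the churn label.
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return X_tr, X_val, X_te, y_tr, y_val, y_te
```

Stratifying both splits keeps the churn rate consistent across train, validation, and test sets, which matters for the threshold tuning described later.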

Data dictionary

  • Draft table lives in data_dictionary_draft.md (column, description, datatype; target marked).
  • Reports still contain column notes; this README links to the table so markers/interviewers can find it quickly.

Modeling workflow

  1. 00-eda.ipynb: sanity checks on class balance and feature distributions.
  2. 01-preprocess-dataset.ipynb: build and persist preprocessing pipelines (baseline.pkl, deep.pkl) and processed splits.
  3. 02-baseline-model.ipynb: logistic regression with class_weight='balanced'; saves logistic_baseline_model.pkl.
  4. 03-hyperparameter-tuning.ipynb: Bayesian search (KerasTuner) over depth/width/regularization/optimizer; uses custom FocalLoss(alpha=0.75) to handle imbalance. Best hyperparameters saved to data/best_hyperparameters_dense.json and data/best_hyperparameters_widedeep.json.
  5. 04-train-deep-learning-models.ipynb: retrains dense NN and wide-&-deep NN to best epochs (data/best_epoch_dense.json ~39, best_epoch_widedeep.json ~42); exports models as final_model_dense.keras, final_model_widedeep.keras, etc., plus training curves.
  6. 05-analysis.ipynb: compares default 0.5 threshold vs. optimized thresholds (maximize F1), exports metrics/plots (data/final_model_metrics*.csv, roc_curve_comparison.png, cm_*.png, threshold_optimization_comparison.png).
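The custom FocalLoss(alpha=0.75) used in step 4 down-weights easy, well-classified examples so the minority churn class contributes more of the gradient. Below is a minimal NumPy reference of binary focal loss for illustration; gamma=2.0 is an assumed default, and the notebook's actual Keras loss class may differ in detail.

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.75, gamma=2.0, eps=1e-7):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t), averaged.

    p_t is the predicted probability of the true class; alpha_t weights
    positives by alpha and negatives by (1 - alpha).
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With gamma=0 and alpha=0.5 this reduces to (half of) standard binary cross-entropy; raising gamma shrinks the loss on confident correct predictions so hard examples dominate training.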

Quickstart

Prereqs: Python 3.10+, pip, virtualenv recommended.

python3 -m venv .venv && source .venv/bin/activate
pip install -U pandas numpy scikit-learn seaborn matplotlib joblib tensorflow keras-tuner nbformat

Run notebooks in order (00 → 05) to fully reproduce preprocessing, tuning, training, and evaluation (mirrors the ACTL5111/3143 “Run All” checklist). If you only need inference with the best models:

  1. Activate the env above.
  2. Load deep.pkl to preprocess new data.
  3. Load final_model_widedeep.keras or final_model_dense.keras and apply the tuned decision threshold from the Results section.
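A hypothetical sketch of that inference path, with the artifact loading kept outside the function so it can be swapped for any pipeline/model pair; the default threshold of 0.19 is the wide&deep value from the Results section.

```python
import numpy as np

def predict_churn(model, pipeline, df, threshold=0.19):
    """Preprocess raw rows with a saved pipeline, score with a trained model,
    and apply the tuned decision threshold. Returns (probabilities, labels)."""
    X = pipeline.transform(df)
    proba = np.ravel(model.predict(X))
    return proba, (proba >= threshold).astype(int)
```

To wire it up you would load the saved artifacts first, e.g. `pipeline = joblib.load("data/deep.pkl")` and `model = keras.models.load_model("data/final_model_widedeep.keras", compile=False)`; passing `compile=False` avoids needing the custom focal loss object just to run inference.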

Note: If sharing code without the dataset, place cell2celltrain.csv in data/ before running notebooks. Reports should include the required “Generative AI usage” appendix per the course spec.

Results (test set)

  • Default threshold (0.5): logistic baseline F1 0.44 / ROC-AUC 0.61; dense NN F1 0.14 / ROC-AUC 0.64; wide&deep F1 0.15 / ROC-AUC 0.64 (data/final_model_metrics.csv).
  • Optimized thresholds (maximize F1):
    • Baseline @ 0.43 → F1 0.44, Recall 0.54, ROC-AUC 0.61.
    • Dense NN @ 0.20 → F1 0.48, Recall 0.84, ROC-AUC 0.64.
    • Wide&Deep @ 0.19 → F1 0.48, Recall 0.85, ROC-AUC 0.64 (data/final_model_metrics_with_opt_threshold.csv).
  • Visuals: ROC/PR comparisons (roc_curve_comparison.png, pr_curve_comparison.png), confusion matrices (cm_*.png, optimized versions), training curves (training_curve_dense*.png, training_curve_widedeep*.png), threshold effect chart (threshold_optimization_comparison.png).
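The threshold optimization reported above can be sketched as a simple grid search that maximizes F1 on held-out predictions; the notebook's exact search procedure may differ.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, proba, grid=None):
    """Sweep candidate decision thresholds and return (threshold, F1)
    for the one maximizing F1 on the given labels/probabilities."""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else grid
    scores = [f1_score(y_true, (proba >= t).astype(int)) for t in grid]
    i = int(np.argmax(scores))
    return float(grid[i]), float(scores[i])
```

This explains why the neural nets improve so much under optimized thresholds: focal-loss-trained models output conservative probabilities, so the F1-optimal cutoff sits well below 0.5 (around 0.19-0.20 here).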

More ML projects

If you’re reviewing my broader ML work with fuller READMEs and production-minded pipelines, please check my GitHub profile: https://github.com/audrey9212 (see pinned repositories for the most relevant case studies).
