Comparing Random Forest in Python vs R (Kaggle Experiment)

Dataset

Preparation

To keep things fair, I wrote 2 versions of the script:

  1. Initial version → Python uses TfidfVectorizer, R uses tm::DocumentTermMatrix (the results differ because the feature treatment isn't exactly the same).
  2. Updated version → Python uses TfidfVectorizer(max_features=5000, stop_words="english") and R uses create_vocabulary + prune_vocabulary(vocab_term_max=5000) → so the preparation is identical, both with 5000 TF-IDF features.

Modeling

Metrics recorded
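
The metrics in the results table (Accuracy, F1, ROC AUC, train time, RAM usage) could be collected on the Python side roughly like this. This is a sketch under assumptions: the data is synthetic, the hyperparameters are placeholders, and tracemalloc is just one possible way to estimate memory; the post does not say how RAM usage was actually measured.

```python
import time
import tracemalloc

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): not the dataset used in this post.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Placeholder hyperparameters; the ones used in the experiment are not stated.
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Time the fit and track peak memory allocated while training.
tracemalloc.start()
start = time.perf_counter()
model.fit(X_train, y_train)
train_time = time.perf_counter() - start
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

results = {
    "accuracy": accuracy_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
    "train_time_s": train_time,
    "peak_ram_gb": peak_bytes / 1024**3,
}
print(results)
```

Note that tracemalloc only sees allocations made through Python's allocator, so the number can differ from what an OS-level tool reports for the whole process.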

Results

Test  Language  Model         Accuracy  F1     ROC AUC  Train Time (s)  RAM Usage (GB)
1     Python    RF (sklearn)  0.997     0.997  0.997    52.44           0.06
1     R         RF (ranger)   0.998     0.999  0.999    97.23           0.05
2     Python    RF (sklearn)  tbu
2     R         RF (ranger)   tbu

Conclusion