causal_statのノート

R, Tex, データサイエンスに関するノート

CART 1

Data Science

CART Classification and Regression Tree

Binary tree model

Classifier

Learning set

${\mathcal L}=\{(x_i, j_i): i=1,2,\ldots,n\}$

${\mathcal X}$ the set of measurements

${\mathcal C}$ the set of classes

d: ${\mathcal X}\mapsto {\mathcal C}$ classifier (clasification rule)

Criterion for which variable to use for splits

Gini index

$\sum_{i\neq j} p(i|t)p(j|t)$

をimpurity measure に使う。

CARTの手順をおおまかに次の2ステップ

tree を育てる
育てたtreeをover fittingあるいは under fitting を避けるためにpruning を行う。

CARTは $y$ が連続変数の場合はregressuin treeといい、

Splitのルールは

$Q=\frac{1}{n}\sum_{i=1}^n(y_i-\bar{y})^2$ に基づく。

$y$ がカテゴリー変数のときにはClassication tree と呼ばれ

$Q=\sum_{i\neq j} \hat{p}_i \hat{p}_j$ に基づく(Gini係数という)。

ここで $\hat{p}_k$ はノード $A$ におけるクラス $k$ のobservations の割合。

$\hat{p}_k =\frac{1}{n_A}\sum_{y_i\in R_A} I(y_i=k)$

一つのノードについて $x_j$ とthreshold $s$ によって2つのchild node を

作成し、左のchild node の基準関数 $Q_L$ 右のchilde nodeの基準関数を $Q_R$

とし、

$Q_s=n_LQ_L+n_R Q_R$ が最小になる $j, s$ を求めそれが、split

になる。つまり

$\min_{j,s}(n_LQ_L+n_R Q_R)$ を求めてsplit を決める。

Rreferences

Classification and regression trees, Breiman, Freedman, Olshen, Stone

http://www.math.chalmers.se/Stat/Grundutb/GU/MSG500/A17/CARTlecture.pdf

次の解説論文の中でCARTについても説明しているが非常にわかりやすい。Random forestに結び付けて説明しているのでなお良い。

Random Forests, Adele Cutler, D. Richard Cutler and John R. Stevens