{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: 信息熵信息增益率基尼系数原理和pandas实战\n", "date: 2018-09-18 19:17:55\n", "tags: [pandas]\n", "toc: true\n", "xiongzhang: true\n", "xiongzhang_images: [main.jpg]\n", "\n", "---\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "信息增益(Info Gain), 信息增益率(Info Gain Ratio)和 基尼指数(Gini Index)都是在做决策树时试用的分支选择策略, 下面我们试用numpy来实现这几个指标, 以便加深对这些指标的理解。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下面我们来引入用到的库:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "# 进行数据整理\n", "import pandas as pd\n", "# 数据\n", "import io\n", "# 验证结果\n", "from scipy import stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 案例和数据集\n", "\n", "学习决策树常用的一个数据集就是\"机器学习(周志华)》 西瓜数据集3.0\", 下面我把数据粘贴出来, 你们就不必下载了, 直接硬编码到python种使用:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
色泽根蒂敲声纹理脐部触感密度含糖率好瓜
编号
1青绿蜷缩浊响清晰凹陷硬滑0.6970.460
2乌黑蜷缩沉闷清晰凹陷硬滑0.7740.376
3乌黑蜷缩浊响清晰凹陷硬滑0.6340.264
4青绿蜷缩沉闷清晰凹陷硬滑0.6080.318
5浅白蜷缩浊响清晰凹陷硬滑0.5560.215
6青绿稍蜷浊响清晰稍凹软粘0.4030.237
7乌黑稍蜷浊响稍糊稍凹软粘0.4810.149
8乌黑稍蜷浊响清晰稍凹硬滑0.4370.211
9乌黑稍蜷沉闷稍糊稍凹硬滑0.6660.091
10青绿硬挺清脆清晰平坦软粘0.2430.267
11浅白硬挺清脆模糊平坦硬滑0.2450.057
12浅白蜷缩浊响模糊平坦软粘0.3430.099
13青绿稍蜷浊响稍糊凹陷硬滑0.6390.161
14浅白稍蜷沉闷稍糊凹陷硬滑0.6570.198
15乌黑稍蜷浊响清晰稍凹软粘0.3600.370
16浅白蜷缩浊响模糊平坦硬滑0.5930.042
17青绿蜷缩沉闷稍糊稍凹硬滑0.7190.103
\n", "
" ], "text/plain": [ " 色泽 根蒂 敲声 纹理 脐部 触感 密度 含糖率 好瓜\n", "编号 \n", "1 青绿 蜷缩 浊响 清晰 凹陷 硬滑 0.697 0.460 是 \n", "2 乌黑 蜷缩 沉闷 清晰 凹陷 硬滑 0.774 0.376 是 \n", "3 乌黑 蜷缩 浊响 清晰 凹陷 硬滑 0.634 0.264 是 \n", "4 青绿 蜷缩 沉闷 清晰 凹陷 硬滑 0.608 0.318 是 \n", "5 浅白 蜷缩 浊响 清晰 凹陷 硬滑 0.556 0.215 是 \n", "6 青绿 稍蜷 浊响 清晰 稍凹 软粘 0.403 0.237 是 \n", "7 乌黑 稍蜷 浊响 稍糊 稍凹 软粘 0.481 0.149 是 \n", "8 乌黑 稍蜷 浊响 清晰 稍凹 硬滑 0.437 0.211 是 \n", "9 乌黑 稍蜷 沉闷 稍糊 稍凹 硬滑 0.666 0.091 否 \n", "10 青绿 硬挺 清脆 清晰 平坦 软粘 0.243 0.267 否 \n", "11 浅白 硬挺 清脆 模糊 平坦 硬滑 0.245 0.057 否 \n", "12 浅白 蜷缩 浊响 模糊 平坦 软粘 0.343 0.099 否 \n", "13 青绿 稍蜷 浊响 稍糊 凹陷 硬滑 0.639 0.161 否 \n", "14 浅白 稍蜷 沉闷 稍糊 凹陷 硬滑 0.657 0.198 否 \n", "15 乌黑 稍蜷 浊响 清晰 稍凹 软粘 0.360 0.370 否 \n", "16 浅白 蜷缩 浊响 模糊 平坦 硬滑 0.593 0.042 否 \n", "17 青绿 蜷缩 沉闷 稍糊 稍凹 硬滑 0.719 0.103 否" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_str = output = io.StringIO('''编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜\n", "1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是 \n", "2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,是 \n", "3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是 \n", "4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是 \n", "5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是 \n", "6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是 \n", "7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是 \n", "8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是 \n", "9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否 \n", "10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否 \n", "11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否 \n", "12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否 \n", "13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否 \n", "14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否 \n", "15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否 \n", "16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否 \n", "17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否''')\n", "\n", "data = pd.read_csv(data_str)\n", "data.set_index('编号', inplace=True)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "该数据集通常是使用前几个西瓜属性预测最后一个指标\"好瓜\"." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 信息熵怎么算\n", "\n", "要计算信息增益的话, 首先要知道怎么算信息熵, 它的公式是:\n", "\n", "$$\n", "H(X) = \\sum_{i \\in X} p_i log p_i\n", "$$\n", "\n", "假如我们要计算的是西瓜的属性\"脐部\"的信息熵, 解读公式的话, 就是:\n", "\n", "- n=脐部的种类数(3)\n", "- i=每一类脐部\n", "- $p_i$某种脐部在数据集种的概率\n", "\n", "代码实现:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "好瓜的信息熵:\n" ] }, { "data": { "text/plain": [ "0.8760918930634294" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def entropy(data, att_name):\n", " '''\n", " data: 数据集\n", " att_name: 属性名\n", " '''\n", " levels = data[att_name].unique()\n", " # 信息熵\n", " ent = 0\n", " for lv in levels:\n", " pi = sum(data[att_name]==lv) / data.shape[0]\n", " ent += pi*np.log(pi)\n", " return -ent\n", "print('好瓜的信息熵:')\n", "entropy(data, '好瓜')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "使用scipy内置的`stats.entropy`验证我们的函数:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "def entropy_scipy(data, att_name):\n", " n = data.shape[0]\n", " values = data[att_name].value_counts()\n", " return stats.entropy(values/n)\n", "\n", "assert entropy(data, '好瓜')==entropy_scipy(data, '好瓜')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 条件信息熵怎么算\n", "\n", "已知条件Y, 求X的概率就是P(X|Y), 知道的信息越多,随机事件的不确定性就越小。\n", "书中定义:在Y条件X的条件熵:(二元模型)\n", "\n", "引自维基百科:\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "因为我们已经有了交叉熵的公式, 所以我们其实可以直接算出$H(Y|X=x)$, 所以上面的公式我们其实只用第一行就够了:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "触感条件下, 好瓜的信息熵:\n" ] }, { "data": { "text/plain": [ "0.8462465738213797" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def conditional_entropy(data, xname, yname):\n", " xs = data[xname].unique()\n", " ys = data[yname].unique()\n", " p_x = data[xname].value_counts() / data.shape[0]\n", " ce = 0\n", " for x in xs:\n", " ce += p_x[x]*entropy(data[data[xname]==x], yname)\n", " return ce\n", "\n", "print('触感条件下, 好瓜的信息熵:')\n", "conditional_entropy(data, '触感', '好瓜')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 信息增益\n", "\n", "理论上我们可以证明$H(Y)>=H(Y|X)$, 而X的信息增益就是他俩之间的差值:\n", "\n", "$$\n", "Gain(Y, X) = H(Y) - H(Y|X)\n", "$$\n", "\n", "它的意义就是比较了X信息的引入导致的信息混乱程度的降低。\n", "\n", "用代码实现的话就是:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "触感的引入导致的信息增益:\n" ] }, { "data": { "text/plain": [ "0.029845319242049695" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def gain(data, xname, yname):\n", " en = entropy(data, yname)\n", " ce = conditional_entropy(data, xname, yname)\n", " return en - ce\n", "\n", "print('触感的引入导致的信息增益:')\n", "gain(data, '触感', '好瓜')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 属性的信息增益与属性的类别数的关系\n", "\n", "\n", "属性的类别越多, 根据该属性就可以把数据切分的更细, 这样往往导致信息的混乱程度降低, 所以类别多的属性的信息增益较大, 我们可以用代码实验一下:\n" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.00011925580490779186\n", "0.0018253756129741339\n", "0.605797499372304\n" ] } ], "source": [ "data['testCol'] = 0\n", "data.iloc[10:, -1] = 1\n", "print(gain(data, 'testCol', '触感'))\n", "data.iloc[14:, -1] = 2\n", "print(gain(data, 'testCol', '触感'))\n", "print(gain(data, '密度', '触感'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 信息增益率\n", "\n", "信息增益率的提出是为了规避信息增益种属性类别数目造成的影响, 它的计算公式是:\n", "\n", "$$\n", "GainRatio(Y, X) = {Gain(Y, X) \\over SplitInfo(X)}\n", "$$\n", "\n", "其中:\n", "\n", "$$\n", "SplitInfo(X)= \\sum_{i \\in X} p_i log p_i = H(X)\n", "$$" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "信息增益率:\n" ] }, { "data": { "text/plain": [ "0.04926616447405919" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def gain_ratio(data, xname, yname):\n", " g = gain(data, xname, yname)\n", " si = entropy(data, xname)\n", " return g / si\n", "print('信息增益率:')\n", "gain_ratio(data, '触感', '好瓜')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }