{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "title: pandas文本数据转整数分类编码的最佳实践\n",
    "date: 2018-09-18 20:17:55\n",
    "tags: [pandas]\n",
    "toc: true\n",
    "xiongzhang: true\n",
    "xiongzhang_images: [main.jpg]\n",
    "\n",
    "---\n",
    "<span></span>\n",
    "<!-- more -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 问题描述"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在许多实际的数据处理工作中，数据集通常包含分类变量。这些变量通常存储为表示各种特征的文本值。一些示例包括颜色（“红色”，“黄色”，“蓝色”），尺寸（“小”，“中”，“大”）或地理名称（州或国家）。无论使用何种值，挑战在于确定如何在分析中使用此数据。许多机器学习算法可以支持分类值而无需进一步操作，但还有许多算法不支持。因此，分析师面临的挑战是如何将这些文本属性转换为数值以便进一步处理。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与数据科学世界的许多其他方面一样，关于如何解决这个问题没有单一的答案。每种方法都需要权衡，并对分析结果产生潜在影响。幸运的是，pandas和scikit-learn的python工具提供了几种方法，可用于将分类数据转换为合适的数值。本文将对一些常见的（以及一些更复杂的）方法进行汇总，希望它能帮助其他人将这些技术应用于他们的现实世界问题。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据集\n",
    "\n",
    "在本文中，我在UCI机器学习库中找到一个好的数据集。这个特定的汽车数据集包括分类值和连续值的组合，并且作为相对容易理解的有用示例。由于在决定如何编码各种分类值时，领域知识是一个重要方面 - 这个数据集是一个很好的个案研究。\n",
    "\n",
    "在我们开始编码各种值之前，我们需要载入数据并进行一些小的清理。幸运的是，pandas使这简单明了："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>symboling</th>\n",
       "      <th>normalized_losses</th>\n",
       "      <th>make</th>\n",
       "      <th>fuel_type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num_doors</th>\n",
       "      <th>body_style</th>\n",
       "      <th>drive_wheels</th>\n",
       "      <th>engine_location</th>\n",
       "      <th>wheel_base</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>NaN</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>88.6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>94.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2</td>\n",
       "      <td>164.0</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>164.0</td>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>99.4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   symboling  normalized_losses         make fuel_type aspiration num_doors  \\\n",
       "0          3                NaN  alfa-romero       gas        std       two   \n",
       "1          3                NaN  alfa-romero       gas        std       two   \n",
       "2          1                NaN  alfa-romero       gas        std       two   \n",
       "3          2              164.0         audi       gas        std      four   \n",
       "4          2              164.0         audi       gas        std      four   \n",
       "\n",
       "    body_style drive_wheels engine_location  wheel_base  \n",
       "0  convertible          rwd           front        88.6  \n",
       "1  convertible          rwd           front        88.6  \n",
       "2    hatchback          rwd           front        94.5  \n",
       "3        sedan          fwd           front        99.8  \n",
       "4        sedan          4wd           front        99.4  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# 定义数据的列名称, 因为这个数据集没有包含列名称\n",
    "headers = [\"symboling\", \"normalized_losses\", \"make\", \"fuel_type\", \"aspiration\",\n",
    "           \"num_doors\", \"body_style\", \"drive_wheels\", \"engine_location\",\n",
    "           \"wheel_base\", \"length\", \"width\", \"height\", \"curb_weight\",\n",
    "           \"engine_type\", \"num_cylinders\", \"engine_size\", \"fuel_system\",\n",
    "           \"bore\", \"stroke\", \"compression_ratio\", \"horsepower\", \"peak_rpm\",\n",
    "           \"city_mpg\", \"highway_mpg\", \"price\"]\n",
    "\n",
    "# 读取在线的数据集, 并将?转换为缺失NaN\n",
    "df = pd.read_csv(\"http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data\",\n",
    "                  header=None, names=headers, na_values=\"?\" )\n",
    "df.head()[df.columns[:10]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "看一下所有列的数据类型:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "symboling              int64\n",
       "normalized_losses    float64\n",
       "make                  object\n",
       "fuel_type             object\n",
       "aspiration            object\n",
       "num_doors             object\n",
       "body_style            object\n",
       "drive_wheels          object\n",
       "engine_location       object\n",
       "wheel_base           float64\n",
       "length               float64\n",
       "width                float64\n",
       "height               float64\n",
       "curb_weight            int64\n",
       "engine_type           object\n",
       "num_cylinders         object\n",
       "engine_size            int64\n",
       "fuel_system           object\n",
       "bore                 float64\n",
       "stroke               float64\n",
       "compression_ratio    float64\n",
       "horsepower           float64\n",
       "peak_rpm             float64\n",
       "city_mpg               int64\n",
       "highway_mpg            int64\n",
       "price                float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为我们只关心文本数据, 所以我们选出类型为\"object\"的列, 而pandas提供了`select_dtypes`方法可以快速达到目的:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>make</th>\n",
       "      <th>fuel_type</th>\n",
       "      <th>aspiration</th>\n",
       "      <th>num_doors</th>\n",
       "      <th>body_style</th>\n",
       "      <th>drive_wheels</th>\n",
       "      <th>engine_location</th>\n",
       "      <th>engine_type</th>\n",
       "      <th>num_cylinders</th>\n",
       "      <th>fuel_system</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>dohc</td>\n",
       "      <td>four</td>\n",
       "      <td>mpfi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>convertible</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>dohc</td>\n",
       "      <td>four</td>\n",
       "      <td>mpfi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>alfa-romero</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>two</td>\n",
       "      <td>hatchback</td>\n",
       "      <td>rwd</td>\n",
       "      <td>front</td>\n",
       "      <td>ohcv</td>\n",
       "      <td>six</td>\n",
       "      <td>mpfi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>fwd</td>\n",
       "      <td>front</td>\n",
       "      <td>ohc</td>\n",
       "      <td>four</td>\n",
       "      <td>mpfi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>audi</td>\n",
       "      <td>gas</td>\n",
       "      <td>std</td>\n",
       "      <td>four</td>\n",
       "      <td>sedan</td>\n",
       "      <td>4wd</td>\n",
       "      <td>front</td>\n",
       "      <td>ohc</td>\n",
       "      <td>five</td>\n",
       "      <td>mpfi</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          make fuel_type aspiration num_doors   body_style drive_wheels  \\\n",
       "0  alfa-romero       gas        std       two  convertible          rwd   \n",
       "1  alfa-romero       gas        std       two  convertible          rwd   \n",
       "2  alfa-romero       gas        std       two    hatchback          rwd   \n",
       "3         audi       gas        std      four        sedan          fwd   \n",
       "4         audi       gas        std      four        sedan          4wd   \n",
       "\n",
       "  engine_location engine_type num_cylinders fuel_system  \n",
       "0           front        dohc          four        mpfi  \n",
       "1           front        dohc          four        mpfi  \n",
       "2           front        ohcv           six        mpfi  \n",
       "3           front         ohc          four        mpfi  \n",
       "4           front         ohc          five        mpfi  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = df.select_dtypes('object').copy()\n",
    "df2.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "因为数据集种包括缺失数据, 这会增加后续处理的难度, 我们为了简单起见, 将缺失值删除即可:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 方案Ⅰ:替换字符串"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "最简单的方式就是, 查找列中所有的字符串, 然后给不同的字符串一个编号, 然后用编号替换字符串:\n",
    "\n",
    "- 使用`vlaue_counts`获取所有的字符串:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sedan          94\n",
       "hatchback      70\n",
       "wagon          25\n",
       "hardtop         8\n",
       "convertible     6\n",
       "Name: body_style, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "col = 'body_style'\n",
    "\n",
    "strs = df2[col].value_counts()\n",
    "strs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 将所有字符串映射为数字:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sedan': 0, 'hatchback': 1, 'wagon': 2, 'hardtop': 3, 'convertible': 4}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "value_map = dict((v, i) for i,v in enumerate(strs.index))\n",
    "value_map"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 使用`replace`方法替换字符串"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    4\n",
       "1    4\n",
       "2    1\n",
       "3    0\n",
       "4    0\n",
       "Name: body_style, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2.replace({col:value_map})[col].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你会看到, 不仅仅字符串被替换, 而且series的数据类型变成了int64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 方案Ⅱ:标签编码\n",
    "\n",
    "编码分类值的另一种方法是使用称为标签编码的技术。标签编码只是将列中的每个值转换为数字。例如，body_style列包含5个不同的值。我们可以选择像这样编码：\n",
    "\n",
    "- convertible -> 0\n",
    "- hardtop -> 1\n",
    "- hatchback -> 2\n",
    "- sedan -> 3\n",
    "- wagon -> 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 首先你可以将列的数据格式转换为`category`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    convertible\n",
       "1    convertible\n",
       "2      hatchback\n",
       "3          sedan\n",
       "4          sedan\n",
       "Name: body_style, dtype: category\n",
       "Categories (5, object): [convertible, hardtop, hatchback, sedan, wagon]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bs = df2['body_style'].astype('category')\n",
    "bs.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 然后你只需要使用标签的编码作为真正的数据就可以了:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0\n",
       "1    0\n",
       "2    2\n",
       "3    3\n",
       "4    3\n",
       "dtype: int8"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bs.cat.codes.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 方案三: 转换成哑变量, 或者叫one-hot编码\n",
    "\n",
    "标签编码的优点是它很简单，但它的缺点是数值可能被算法“误解”。例如，0的值显然小于4的值，但这是否真的与现实生活中的数据集相对应？在我们的计算中，旅行车的重量是否比敞篷车重4倍？在这个例子中，我不这么认为。所以我们需要将数据转换为哑变量(onehot), 在pandas中, 这个转变只需要一行代码:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>drive_wheels_4wd</th>\n",
       "      <th>drive_wheels_fwd</th>\n",
       "      <th>drive_wheels_rwd</th>\n",
       "      <th>body_style_convertible</th>\n",
       "      <th>body_style_hardtop</th>\n",
       "      <th>body_style_hatchback</th>\n",
       "      <th>body_style_sedan</th>\n",
       "      <th>body_style_wagon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   drive_wheels_4wd  drive_wheels_fwd  drive_wheels_rwd  \\\n",
       "0                 0                 0                 1   \n",
       "1                 0                 0                 1   \n",
       "2                 0                 0                 1   \n",
       "3                 0                 1                 0   \n",
       "4                 1                 0                 0   \n",
       "\n",
       "   body_style_convertible  body_style_hardtop  body_style_hatchback  \\\n",
       "0                       1                   0                     0   \n",
       "1                       1                   0                     0   \n",
       "2                       0                   0                     1   \n",
       "3                       0                   0                     0   \n",
       "4                       0                   0                     0   \n",
       "\n",
       "   body_style_sedan  body_style_wagon  \n",
       "0                 0                 0  \n",
       "1                 0                 0  \n",
       "2                 0                 0  \n",
       "3                 1                 0  \n",
       "4                 1                 0  "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.get_dummies(df[['drive_wheels', 'body_style']]).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 方案四: 自定义二分类\n",
    "\n",
    "根据数据集，您可以使用标签编码和one-hot来创建满足进一步分析需求的二分类列\n",
    "\n",
    "在此特定数据集中，有一个名为engine_type的列包含几个不同的值："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ohc      146\n",
       "ohcf      15\n",
       "ohcv      13\n",
       "dohc      12\n",
       "l         12\n",
       "rotor      4\n",
       "dohcv      1\n",
       "Name: engine_type, dtype: int64"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2['engine_type'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了便于讨论，我们可能关心的是发动机是否是顶置凸轮（OHC）。换句话说，OHC的各种版本对于该分析都是相同的。如果是这种情况，那么我们可以使用str accessor创建一个新列，指示汽车是否有OHC引擎。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    187\n",
       "0     16\n",
       "Name: engine_type, dtype: int64"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2[\"engine_type\"].str.contains(\"ohc\").map(int).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scikit-Learn\n",
    "\n",
    "除了pandas方法，scikit-learn还提供类似的功能。就个人而言，我发现使用pandas有点简单，但我认为重要的是要知道如何在scikit-learn中执行这些过程。\n",
    "\n",
    "例如，如果我们想对汽车的品牌进行标签编码，我们需要实例化LabelEncoder对象并fit_transform数据："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>convertible</th>\n",
       "      <th>hardtop</th>\n",
       "      <th>hatchback</th>\n",
       "      <th>sedan</th>\n",
       "      <th>wagon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   convertible  hardtop  hatchback  sedan  wagon\n",
       "0            1        0          0      0      0\n",
       "1            1        0          0      0      0\n",
       "2            0        0          1      0      0\n",
       "3            0        0          0      1      0\n",
       "4            0        0          0      1      0"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelBinarizer\n",
    "\n",
    "lb_style = LabelBinarizer()\n",
    "lb_results = lb_style.fit_transform(df2[\"body_style\"])\n",
    "pd.DataFrame(lb_results, columns=lb_style.classes_).head()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}