{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"title: Best practices for encoding text data as integer categories with pandas\n",
"date: 2018-09-18 20:17:55\n",
"tags: [pandas]\n",
"toc: true\n",
"xiongzhang: true\n",
"xiongzhang_images: [main.jpg]\n",
"\n",
"---\n",
"\n",
""
]
},
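{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Problem description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In many practical data processing tasks, data sets contain categorical variables. These variables are typically stored as text values that represent various traits. Some examples include color (\"Red\", \"Yellow\", \"Blue\"), size (\"Small\", \"Medium\", \"Large\") or geographic designations (state or country). Regardless of what the values are used for, the challenge is determining how to use this data in the analysis. Many machine learning algorithms can support categorical values without further manipulation, but many others cannot. The analyst therefore has to work out how to turn these text attributes into numerical values for further processing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As with many other aspects of the data science world, there is no single answer on how to approach this problem. Each approach has trade-offs and a potential impact on the outcome of the analysis. Fortunately, the Python tools pandas and scikit-learn provide several approaches for transforming categorical data into suitable numeric values. This article surveys some of the common (and a few more complex) approaches in the hope that it will help others apply these techniques to their real-world problems."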
]
},
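{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data set\n",
"\n",
"For this article, I found a good data set in the UCI Machine Learning Repository. This particular automobile data set includes a mix of categorical and continuous values and serves as a useful example that is relatively easy to understand. Since domain knowledge is an important part of deciding how to encode the various categorical values, this data set makes for a good case study.\n",
"\n",
"Before we start encoding the various values, we need to load the data and do some minor cleanup. Fortunately, pandas makes this straightforward:"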
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" symboling normalized_losses make fuel_type aspiration num_doors \\\n",
"0 3 NaN alfa-romero gas std two \n",
"1 3 NaN alfa-romero gas std two \n",
"2 1 NaN alfa-romero gas std two \n",
"3 2 164.0 audi gas std four \n",
"4 2 164.0 audi gas std four \n",
"\n",
" body_style drive_wheels engine_location wheel_base \n",
"0 convertible rwd front 88.6 \n",
"1 convertible rwd front 88.6 \n",
"2 hatchback rwd front 94.5 \n",
"3 sedan fwd front 99.8 \n",
"4 sedan 4wd front 99.4 "
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Define the column names, because this data set does not include a header row\n",
"headers = [\"symboling\", \"normalized_losses\", \"make\", \"fuel_type\", \"aspiration\",\n",
" \"num_doors\", \"body_style\", \"drive_wheels\", \"engine_location\",\n",
" \"wheel_base\", \"length\", \"width\", \"height\", \"curb_weight\",\n",
" \"engine_type\", \"num_cylinders\", \"engine_size\", \"fuel_system\",\n",
" \"bore\", \"stroke\", \"compression_ratio\", \"horsepower\", \"peak_rpm\",\n",
" \"city_mpg\", \"highway_mpg\", \"price\"]\n",
"\n",
"# Read the data set from the web, converting \"?\" to missing values (NaN)\n",
"df = pd.read_csv(\"http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data\",\n",
" header=None, names=headers, na_values=\"?\" )\n",
"df.head()[df.columns[:10]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take a look at the data types of all the columns:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"symboling int64\n",
"normalized_losses float64\n",
"make object\n",
"fuel_type object\n",
"aspiration object\n",
"num_doors object\n",
"body_style object\n",
"drive_wheels object\n",
"engine_location object\n",
"wheel_base float64\n",
"length float64\n",
"width float64\n",
"height float64\n",
"curb_weight int64\n",
"engine_type object\n",
"num_cylinders object\n",
"engine_size int64\n",
"fuel_system object\n",
"bore float64\n",
"stroke float64\n",
"compression_ratio float64\n",
"horsepower float64\n",
"peak_rpm float64\n",
"city_mpg int64\n",
"highway_mpg int64\n",
"price float64\n",
"dtype: object"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we only care about the text data, we select the columns of type \"object\". pandas provides the `select_dtypes` method for exactly this purpose:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" make fuel_type aspiration num_doors body_style drive_wheels \\\n",
"0 alfa-romero gas std two convertible rwd \n",
"1 alfa-romero gas std two convertible rwd \n",
"2 alfa-romero gas std two hatchback rwd \n",
"3 audi gas std four sedan fwd \n",
"4 audi gas std four sedan 4wd \n",
"\n",
" engine_location engine_type num_cylinders fuel_system \n",
"0 front dohc four mpfi \n",
"1 front dohc four mpfi \n",
"2 front ohcv six mpfi \n",
"3 front ohc four mpfi \n",
"4 front ohc five mpfi "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.select_dtypes('object').copy()\n",
"df2.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data set contains missing values, which would complicate the later steps. To keep things simple, we just drop the rows that contain them:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"df2.dropna(inplace=True)"
]
},
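{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option I: Replace strings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The simplest approach is to find all the distinct strings in a column, assign each one a number, and then replace the strings with those numbers:\n",
"\n",
"- Use `value_counts` to get all the distinct strings:"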
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"sedan 94\n",
"hatchback 70\n",
"wagon 25\n",
"hardtop 8\n",
"convertible 6\n",
"Name: body_style, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"col = 'body_style'\n",
"\n",
"strs = df2[col].value_counts()\n",
"strs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Map each string to a number:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'sedan': 0, 'hatchback': 1, 'wagon': 2, 'hardtop': 3, 'convertible': 4}"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"value_map = dict((v, i) for i,v in enumerate(strs.index))\n",
"value_map"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Use the `replace` method to substitute the strings:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 4\n",
"1 4\n",
"2 1\n",
"3 0\n",
"4 0\n",
"Name: body_style, dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2.replace({col:value_map})[col].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, not only have the strings been replaced, but the dtype of the Series has also become int64."
]
},
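{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the same idea using `Series.map` instead of `replace`. The toy `styles` series is a hypothetical stand-in for `df2[col]`, so the numbers it yields are illustrative only. One difference worth knowing: `map` turns any value that is missing from the dictionary into NaN, whereas `replace` leaves such a value unchanged.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for the body_style column (values are illustrative)\n",
"styles = pd.Series([\"convertible\", \"hatchback\", \"sedan\", \"sedan\", \"wagon\"])\n",
"\n",
"# Number the distinct strings in value_counts order, as in the cells above\n",
"value_map = {v: i for i, v in enumerate(styles.value_counts().index)}\n",
"\n",
"# Series.map looks each value up in the dict; values absent from the dict become NaN\n",
"encoded = styles.map(value_map)\n",
"print(encoded.tolist())\n",
"```\n",
"\n",
"Every distinct string has an entry in `value_map`, so no NaN appears here. With a hand-written dictionary, a NaN in the result is a quick sign that a category was overlooked."
]
},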
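{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option II: Label encoding\n",
"\n",
"Another approach to encoding categorical values is a technique called label encoding. Label encoding simply converts each value in a column to a number. For example, the body_style column contains 5 different values. We could choose to encode it like this:\n",
"\n",
"- convertible -> 0\n",
"- hardtop -> 1\n",
"- hatchback -> 2\n",
"- sedan -> 3\n",
"- wagon -> 4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- First, convert the column's dtype to `category`:"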
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 convertible\n",
"1 convertible\n",
"2 hatchback\n",
"3 sedan\n",
"4 sedan\n",
"Name: body_style, dtype: category\n",
"Categories (5, object): [convertible, hardtop, hatchback, sedan, wagon]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bs = df2['body_style'].astype('category')\n",
"bs.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Then simply use the category codes as the actual data:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 0\n",
"2 2\n",
"3 3\n",
"4 3\n",
"dtype: int8"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bs.cat.codes.head()"
]
},
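{
"cell_type": "markdown",
"metadata": {},
"source": [
"The integer codes are only useful if they can be traced back to their labels. Here is a short sketch, using a hypothetical toy series in place of `df2['body_style']`, of how the mapping can be read off the `cat` accessor and reversed with `pd.Categorical.from_codes`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for the body_style column (values are illustrative)\n",
"bs = pd.Series([\"convertible\", \"hatchback\", \"sedan\", \"sedan\"]).astype(\"category\")\n",
"\n",
"# Categories are sorted lexically by default, and each code indexes into that list\n",
"code_to_label = dict(enumerate(bs.cat.categories))\n",
"print(code_to_label)  # {0: 'convertible', 1: 'hatchback', 2: 'sedan'}\n",
"\n",
"# Decoding: rebuild the original labels from the integer codes\n",
"decoded = pd.Categorical.from_codes(bs.cat.codes, categories=bs.cat.categories)\n",
"print(list(decoded))\n",
"```\n",
"\n",
"Keeping `code_to_label` alongside the encoded data makes it possible to report results in terms of the original category names."
]
},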
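{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option III: Dummy variables, also known as one-hot encoding\n",
"\n",
"Label encoding has the advantage of being simple, but its drawback is that the numeric values can be \"misinterpreted\" by algorithms. For example, the value 0 is obviously less than the value 4, but does that ordering really correspond to anything in the real-world data? Does a wagon (4) carry four times the weight of a hardtop (1) in our calculation? In this example, I don't think so. So we need to convert the data to dummy variables (one-hot). In pandas, this conversion takes a single line of code:"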
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" drive_wheels_4wd drive_wheels_fwd drive_wheels_rwd \\\n",
"0 0 0 1 \n",
"1 0 0 1 \n",
"2 0 0 1 \n",
"3 0 1 0 \n",
"4 1 0 0 \n",
"\n",
" body_style_convertible body_style_hardtop body_style_hatchback \\\n",
"0 1 0 0 \n",
"1 1 0 0 \n",
"2 0 0 1 \n",
"3 0 0 0 \n",
"4 0 0 0 \n",
"\n",
" body_style_sedan body_style_wagon \n",
"0 0 0 \n",
"1 0 0 \n",
"2 0 0 \n",
"3 1 0 \n",
"4 1 0 "
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.get_dummies(df[['drive_wheels', 'body_style']]).head()"
]
},
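{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Option IV: Custom binary encoding\n",
"\n",
"Depending on the data set, you can combine label encoding and one-hot encoding to create a binary column that meets the needs of further analysis.\n",
"\n",
"In this particular data set, there is a column called engine_type that contains several different values:"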
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ohc 146\n",
"ohcf 15\n",
"ohcv 13\n",
"dohc 12\n",
"l 12\n",
"rotor 4\n",
"dohcv 1\n",
"Name: engine_type, dtype: int64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2['engine_type'].value_counts()"
]
},
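{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the sake of discussion, maybe all we care about is whether or not the engine is an overhead cam (OHC). In other words, the various versions of OHC are all the same for this analysis. If that is the case, we can use the `str` accessor to build an indicator of whether or not each car has an OHC engine."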
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 187\n",
"0 16\n",
"Name: engine_type, dtype: int64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2[\"engine_type\"].str.contains(\"ohc\").map(int).value_counts()"
]
},
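{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Scikit-Learn\n",
"\n",
"In addition to the pandas approach, scikit-learn provides similar functionality. Personally, I find pandas a little simpler to use, but I think it is important to know how to carry out these steps in scikit-learn as well.\n",
"\n",
"For example, to one-hot encode the body style of the cars, we instantiate a LabelBinarizer object and call fit_transform on the data:"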
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" convertible hardtop hatchback sedan wagon\n",
"0 1 0 0 0 0\n",
"1 1 0 0 0 0\n",
"2 0 0 1 0 0\n",
"3 0 0 0 1 0\n",
"4 0 0 0 1 0"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelBinarizer\n",
"\n",
"lb_style = LabelBinarizer()\n",
"lb_results = lb_style.fit_transform(df2[\"body_style\"])\n",
"pd.DataFrame(lb_results, columns=lb_style.classes_).head()"
]
}
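,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For plain label encoding, scikit-learn provides `LabelEncoder`. The sketch below applies it to a hypothetical toy list of car makes rather than the real `make` column. The encoder sorts the classes, so the integer codes follow alphabetical order.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# Toy stand-in for the make column (values are illustrative)\n",
"makes = pd.Series([\"audi\", \"bmw\", \"audi\", \"volvo\"])\n",
"\n",
"le = LabelEncoder()\n",
"codes = le.fit_transform(makes)  # learn the classes, then encode\n",
"print(codes)        # [0 1 0 2]\n",
"print(le.classes_)  # ['audi' 'bmw' 'volvo']\n",
"\n",
"# inverse_transform maps the integer codes back to the original strings\n",
"print(le.inverse_transform(codes))\n",
"```\n",
"\n",
"As with `cat.codes`, the resulting integers imply an ordering that the data may not have, so this form of encoding is best suited to targets or to models that do not treat the codes as magnitudes."
]
}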
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}