{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: pandas数据分析100道练习题-第三部分\n", "date: 2018-08-16 18:17:55\n", "tags: [python, pandas]\n", "toc: true\n", "xiongzhang: true\n", "xiongzhang_images: [main.jpg]\n", "\n", "---\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "这篇文章收集了网友们使用pandas进行数据分析时经常遇到的问题, 这些问题也可以检验你使用pandas的熟练程度, 所以他们更像是一个学习教材, 掌握这些技能, 可以使你数据数据分析的工作事半功倍。\n", "\n", "- 第一部分pandas练习题请访问: [pandas数据分析100道练习题-第一部分](http://mlln.cn/2018/08/13/pandas%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90100%E9%81%93%E7%BB%83%E4%B9%A0%E9%A2%98-%E7%AC%AC%E4%B8%80%E9%83%A8%E5%88%86/)\n", "- 第二部分pandas练习题请访问: [pandas数据分析100道练习题-第二部分](http://mlln.cn/2018/08/15/pandas数据分析100道练习题-第二部分/)\n", "\n", "下面是第三部分:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 检查dataframe是否有缺失值" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "缺失: False\n", "缺失: True\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.DataFrame({\n", " 'a':[1.2,2,3,4],\n", " 'b':list('abcd')\n", "})\n", "\n", "print('缺失:', df.isnull().values.any())\n", "df.iat[0,0] = np.nan\n", "print('缺失:', df.isnull().values.any())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 统计dataframe中每列缺失值的数量" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 1\n", "b 0\n", "dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.apply(lambda x: x.isnull().sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### dataframe用每列的平均值取代缺失值" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Min.PriceMax.Price
012.918.8
129.238.7
225.932.3
3NaN44.6
4NaNNaN
\n", "
" ], "text/plain": [ " Min.Price Max.Price\n", "0 12.9 18.8\n", "1 29.2 38.7\n", "2 25.9 32.3\n", "3 NaN 44.6\n", "4 NaN NaN" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')\n", "df[['Min.Price', 'Max.Price']].head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Min.PriceMax.Price
012.90000018.800000
129.20000038.700000
225.90000032.300000
317.11860544.600000
417.11860521.459091
\n", "
" ], "text/plain": [ " Min.Price Max.Price\n", "0 12.900000 18.800000\n", "1 29.200000 38.700000\n", "2 25.900000 32.300000\n", "3 17.118605 44.600000\n", "4 17.118605 21.459091" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 仅使用['Min.Price', 'Max.Price']这两列演示\n", "df[['Min.Price', 'Max.Price']].apply(lambda x: x.fillna(x.mean())).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 从dataframe中获取某一列, 并返回一个dataframe" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Manufacturer
0Acura
1NaN
2Audi
3Audi
4BMW
\n", "
" ], "text/plain": [ " Manufacturer\n", "0 Acura\n", "1 NaN\n", "2 Audi\n", "3 Audi\n", "4 BMW" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 只要传入一个list作为切片\n", "df[['Manufacturer']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### dataframe如何改变列的顺序" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
001234
156789
21011121314
31516171819
\n", "
" ], "text/plain": [ " a b c d e\n", "0 0 1 2 3 4\n", "1 5 6 7 8 9\n", "2 10 11 12 13 14\n", "3 15 16 17 18 19" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.arange(20).reshape(-1, 5), columns=list('abcde'))\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cbdae
021304
176859
21211131014
31716181519
\n", "
" ], "text/plain": [ " c b d a e\n", "0 2 1 3 0 4\n", "1 7 6 8 5 9\n", "2 12 11 13 10 14\n", "3 17 16 18 15 19" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[list('cbdae')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 设置dataframe输出的行数和列数" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
001234
156789
21011121314
31516171819
\n", "
" ], "text/plain": [ " a b c d e\n", "0 0 1 2 3 4\n", "1 5 6 7 8 9\n", "2 10 11 12 13 14\n", "3 15 16 17 18 19" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 设置之前\n", "df" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ab...de
001...34
156...89
21011...1314
31516...1819
\n", "

4 rows × 5 columns

\n", "
" ], "text/plain": [ " a b ... d e\n", "0 0 1 ... 3 4\n", "1 5 6 ... 8 9\n", "2 10 11 ... 13 14\n", "3 15 16 ... 18 19\n", "\n", "[4 rows x 5 columns]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option('display.max_columns', 4)\n", "pd.set_option('display.max_rows', 4)\n", "# 设置之后\n", "df" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_columns', 10)\n", "pd.set_option('display.max_rows', 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 设置dataframe输出时不使用科学记数法" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random
04.285027e-07
11.580650e-05
22.049461e-01
36.873938e-06
\n", "
" ], "text/plain": [ " random\n", "0 4.285027e-07\n", "1 1.580650e-05\n", "2 2.049461e-01\n", "3 6.873938e-06" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(np.random.random(4)**10, columns=['random'])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random
00.0000
10.9207
20.0209
30.0000
\n", "
" ], "text/plain": [ " random\n", "0 0.0000\n", "1 0.9207\n", "2 0.0209\n", "3 0.0000" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option('display.float_format', lambda x: '%.4f' % x)\n", "pd.DataFrame(np.random.random(4)**10, columns=['random'])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random
00.004970
10.018763
20.000058
30.252180
\n", "
" ], "text/plain": [ " random\n", "0 0.004970\n", "1 0.018763\n", "2 0.000058\n", "3 0.252180" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 恢复默认值\n", "pd.set_option('display.float_format', None)\n", "pd.DataFrame(np.random.random(4)**10, columns=['random'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 设置dataframe输出百分比数据" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random
00.343280
10.700357
20.885307
30.932272
\n", "
" ], "text/plain": [ " random\n", "0 0.343280\n", "1 0.700357\n", "2 0.885307\n", "3 0.932272" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.random(4), columns=['random'])\n", "df" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
random
034.33%
170.04%
288.53%
393.23%
" ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.style.format({'random':'{0:.2%}'.format})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> 知识点:\n", "\n", "> 使用百分号可以自动将数据转换为百分比格式:\n", "> `'{0:.2%}'.format(0.555555)`\n", "\n", "更多字符串格式化问题,可以看这篇专题: [python3字符串format最佳实践](http://mlln.cn/2018/07/19/python3%E5%AD%97%E7%AC%A6%E4%B8%B2format%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 使用多个列创建唯一索引(index)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\n", " 'https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv', \n", " usecols=[0,1,2,3,5])" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n" ] } ], "source": [ "df[['Manufacturer', 'Model', 'Type']] = df[['Manufacturer', 'Model', 'Type']].fillna('missing')\n", "df.index = df.Manufacturer + '_' + df.Model + '_' + df.Type\n", "print(df.index.is_unique)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 获取第n大的数所在行" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 21\n", "1 17\n", "2 19\n", "3 3\n", "4 2\n", "5 3\n", "6 10\n", "7 5\n", "8 25\n", "9 12\n", "Name: a, dtype: int32" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(\n", " np.random.randint(1, 30, 30).reshape(10,-1), \n", " columns=list('abc'))\n", "df['a']\n" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 4\n", "1 3\n", "2 5\n", "3 7\n", "4 6\n", "5 9\n", "6 1\n", "7 2\n", "8 0\n", "9 8\n", "Name: a, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 使用行号排序\n", "df['a'].argsort()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4 2\n", "3 3\n", "5 3\n", "7 5\n", "6 10\n", "9 12\n", "1 17\n", "2 19\n", "0 21\n", "8 25\n", "Name: a, dtype: int32" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 可以这样看到原始数据的排序\n", "\n", "df['a'][df['a'].argsort()]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = 5\n", "df['a'].argsort()[::-1][n]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### dataframe获取行之和大于100的数据, 并返回最后的两行" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
1339313515
1430222531
\n", "
" ], "text/plain": [ " 0 1 2 3\n", "13 39 31 35 15\n", "14 30 22 25 31" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))\n", "rowsums = df.apply(np.sum, axis=1)\n", "last_two_rows = df.iloc[np.where(rowsums > 100)[0][-2:], :]\n", "last_two_rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 如何从系列或数据框列中查找和限制异常值\n", "\n", "用相应的5%分位数和95%分位数值替换低于5%分位数和大于95%分位数的所有值" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.05 %ile: 0.016049294076965887 | 0.95 %ile: 63.876672220183934\n" ] } ], "source": [ "# Input\n", "ser = pd.Series(np.logspace(-2, 2, 30))\n", "\n", "# Solution\n", "def cap_outliers(ser, low_perc, high_perc):\n", " low, high = ser.quantile([low_perc, high_perc])\n", " print(low_perc, '%ile: ', low, '|', high_perc, '%ile: ', high)\n", " ser[ser < low] = low\n", " ser[ser > high] = high\n", " return(ser)\n", "\n", "capped_ser = cap_outliers(ser, .05, .95)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 如何在删除负值后将dataframe重新整形为最大可能的正方形\n", "\n", "\n", "将df重塑为最大可能的正方形,并删除负值。如果需要,删除最小值。结果中正数的顺序应保持与原始顺序相同。" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 0 1 2 3 4 5 6 7 8 9\n", "0 12 -6 -15 14 33 -9 -14 46 7 6\n", "1 23 -5 13 22 -10 35 -8 8 -20 34\n", "2 19 49 18 -1 5 22 38 -5 27 34\n", "3 -12 47 21 9 45 18 3 -10 27 42\n", "4 48 32 41 -14 5 40 33 -3 10 11\n", "5 17 -2 13 47 28 5 24 26 -7 10\n", "6 49 21 14 -16 12 -17 3 43 8 -10\n", "7 17 -19 -14 2 -8 -10 1 25 -9 48\n", "8 25 46 -7 36 -13 18 42 31 49 37\n", "9 2 25 47 28 9 49 10 44 -15 12\n" ] } ], "source": [ "df = pd.DataFrame(np.random.randint(-20, 50, 100).reshape(10,-1))\n", "print(df)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([12., 14., 33., 46., 7., 6., 23., 13., 22., 35., 8., 34., 19.,\n", " 49., 18., 5., 22., 38., 27., 34., 47., 21., 9., 45., 18., 3.,\n", " 27., 42., 48., 32., 41., 5., 40., 33., 10., 11., 17., 13., 47.,\n", " 28., 5., 24., 26., 10., 49., 21., 14., 12., 3., 43., 8., 17.,\n", " 2., 1., 25., 48., 25., 46., 36., 18., 42., 31., 49., 37., 2.,\n", " 25., 47., 28., 9., 49., 10., 44., 12.])" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 步骤1:删除负数\n", "\n", "arr = df[df > 0].values.flatten()\n", "arr_qualified = arr[~np.isnan(arr)]\n", "arr_qualified" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 步骤2: 计算正方形的边长\n", "\n", "n = int(np.floor(arr_qualified.shape[0]**.5))\n", "n" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[12. 14. 33. 46. 7. 23. 13. 22.]\n", " [35. 8. 34. 19. 49. 18. 22. 38.]\n", " [27. 34. 47. 21. 9. 45. 18. 27.]\n", " [42. 48. 32. 41. 40. 33. 10. 11.]\n", " [17. 13. 47. 28. 24. 26. 10. 49.]\n", " [21. 14. 12. 43. 8. 17. 25. 48.]\n", " [25. 46. 36. 18. 42. 31. 49. 37.]\n", " [25. 47. 28. 9. 49. 10. 44. 12.]]\n" ] } ], "source": [ "# 步骤3: 整形为要求的正方形\n", "top_indexes = np.argsort(arr_qualified)[::-1]\n", "output = np.take(arr_qualified, sorted(top_indexes[:n**2])).reshape(n, -1)\n", "print(output)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 交换dataframe的两行\n", "\n", "把第一行和第二行数据交换" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
001234
156789
21011121314
31516171819
42021222324
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "0 0 1 2 3 4\n", "1 5 6 7 8 9\n", "2 10 11 12 13 14\n", "3 15 16 17 18 19\n", "4 20 21 22 23 24" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.arange(25).reshape(5, -1))\n", "df" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
001234
11011121314
256789
31516171819
42021222324
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "0 0 1 2 3 4\n", "1 10 11 12 13 14\n", "2 5 6 7 8 9\n", "3 15 16 17 18 19\n", "4 20 21 22 23 24" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a, b = df.iloc[1, :].copy(), df.iloc[2, :].copy()\n", "df.iloc[1, :], df.iloc[2, :] = b, a\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### dataframe行倒序排序" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
42021222324
31516171819
256789
11011121314
001234
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "4 20 21 22 23 24\n", "3 15 16 17 18 19\n", "2 5 6 7 8 9\n", "1 10 11 12 13 14\n", "0 0 1 2 3 4" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[::-1, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 对分类数据进行one-hot编码\n", "\n", "经常用于逻辑回归" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
abcde
001234
156789
21011121314
31516171819
42021222324
\n", "
" ], "text/plain": [ " a b c d e\n", "0 0 1 2 3 4\n", "1 5 6 7 8 9\n", "2 10 11 12 13 14\n", "3 15 16 17 18 19\n", "4 20 21 22 23 24" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.arange(25).reshape(5,-1), columns=list('abcde'))\n", "df" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
05101520bcde
0100001234
1010006789
20010011121314
30001016171819
40000121222324
\n", "
" ], "text/plain": [ " 0 5 10 15 20 b c d e\n", "0 1 0 0 0 0 1 2 3 4\n", "1 0 1 0 0 0 6 7 8 9\n", "2 0 0 1 0 0 11 12 13 14\n", "3 0 0 0 1 0 16 17 18 19\n", "4 0 0 0 0 1 21 22 23 24" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([pd.get_dummies(df['a']), df[list('bcde')]], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 哪个列包含每行的最大值\n", "\n", "求行最大值所在的列" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "d:\\mysites\\deeplearning.ai-master\\.env\\lib\\site-packages\\numpy\\core\\fromnumeric.py:52: FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'\n", "will be corrected to return the positional maximum in the future.\n", "Use 'series.values.argmax' to get the position of the maximum now.\n", " return getattr(obj, method)(*args, **kwds)\n" ] }, { "data": { "text/plain": [ "0 3\n", "1 1\n", "2 3\n", "3 0\n", "4 2\n", "5 0\n", "6 0\n", "7 2\n", "8 2\n", "9 0\n", "dtype: int64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(np.random.randint(1,100, 40).reshape(10, -1))\n", "df.apply(np.argmax, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 计算 每行的最近行(使用欧几里得距离)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{0: 7, 1: 7, 2: 6, 3: 5, 4: 8, 5: 4, 6: 2, 7: 0, 8: 4, 9: 3}\n" ] } ], "source": [ "nearest = {}\n", "for i, row in df.iterrows():\n", " c = ((df - row)**2).sum(axis = 1).argsort()\n", " for j in c:\n", " if j != i:\n", " break\n", " nearest[i] = j\n", "print(nearest)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "今天的教程就到此为止了, 希望大家关注我的小站mlln.cn, 后面还会有关于pandas系列的练习题, 希望这些工作能帮助你学习pandas. 如果你有什么意见或者建议, 请在微博@该用户已经被封, 并附带本文链接。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }