100 pandas Data Analysis Exercises - Part 4

xxxspy 2018-08-23 17:54:31


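This article collects problems that people frequently run into when analyzing data with pandas. The problems also test how fluent you are with pandas, so they work well as study material: mastering these skills can make your data analysis work considerably more efficient.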

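For Part 1 of the pandas exercises, see: 100 pandas Data Analysis Exercises - Part 1
For Part 2 of the pandas exercises, see: 100 pandas Data Analysis Exercises - Part 2
For Part 3 of the pandas exercises, see: 100 pandas Data Analysis Exercises - Part 3
For Part 4 of the pandas exercises, see: 100 pandas Data Analysis Exercises - Part 4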

Part 4 follows:

How to compute the maximum correlation of each column with the other columns

import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.random.randint(1,100, 80).reshape(8, -1), 
    columns=list('pqrstuvwxy'),
    index=list('abcdefgh')
)

abs_corrmat = np.abs(df.corr())
print(abs_corrmat)
max_corr = abs_corrmat.apply(lambda x: sorted(x)[-2])
print('Maximum Correlation possible for each column: ', np.round(max_corr.tolist(), 2))

Output (stream):
p q r s t u v \
p 1.000000 0.268096 0.000000 0.552086 0.147951 0.229566 0.312353
q 0.268096 1.000000 0.881994 0.169709 0.124291 0.542839 0.351897
r 0.000000 0.881994 1.000000 0.254703 0.014796 0.335214 0.331702
s 0.552086 0.169709 0.254703 1.000000 0.373359 0.355978 0.042473
t 0.147951 0.124291 0.014796 0.373359 1.000000 0.564365 0.001794
u 0.229566 0.542839 0.335214 0.355978 0.564365 1.000000 0.179641
v 0.312353 0.351897 0.331702 0.042473 0.001794 0.179641 1.000000
w 0.697658 0.343943 0.566769 0.424458 0.014227 0.489756 0.274991
x 0.254656 0.431052 0.539917 0.434953 0.368824 0.275014 0.056530
y 0.106323 0.121851 0.179469 0.236219 0.228056 0.141275 0.468257

w x y
p 0.697658 0.254656 0.106323
q 0.343943 0.431052 0.121851
r 0.566769 0.539917 0.179469
s 0.424458 0.434953 0.236219
t 0.014227 0.368824 0.228056
u 0.489756 0.275014 0.141275
v 0.274991 0.056530 0.468257
w 1.000000 0.065845 0.124048
x 0.065845 1.000000 0.364810
y 0.124048 0.364810 1.000000
Maximum Correlation possible for each column: [0.7 0.88 0.88 0.55 0.56 0.56 0.47 0.7 0.54 0.47]
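
The `sorted(x)[-2]` trick above works because each column's correlation with itself (1.0) is always the largest entry. As a minimal alternative sketch, the diagonal can be masked explicitly, which also makes it easy to report which column each maximum comes from. The seed, the 8x4 shape, and the names `off_diag` and `partner` are illustrative assumptions, not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Illustrative data: seeded so the example is reproducible (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 100, size=(8, 4)), columns=list("pqrs"))

abs_corr = df.corr().abs()

# Hide each column's correlation with itself, then take the column-wise maximum.
off_diag = abs_corr.mask(np.eye(len(abs_corr), dtype=bool))
max_corr = off_diag.max()      # strongest absolute correlation per column
partner = off_diag.idxmax()    # the column that produces it

print(pd.DataFrame({"max_abs_corr": max_corr.round(2), "with": partner}))
```

Both approaches should give the same maxima; the masked version simply avoids relying on the position of the self-correlation in a sorted list.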

Compute the ratio of the minimum to the maximum of each row

df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution 1
min_by_max = df.apply(lambda x: np.min(x)/np.max(x), axis=1)
min_by_max

Output (plain):
0 0.074468
1 0.013514
2 0.101010
3 0.457447
4 0.040404
5 0.081633
6 0.024096
7 0.163265
dtype: float64
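
The same ratio can probably be computed without `apply`, using the row-wise `min` and `max` reductions directly. The sketch below uses a small hand-checkable frame (illustrative data, not the random frame from the exercise).

```python
import pandas as pd

# Small fixed frame so the result is easy to check by hand (illustrative data).
df = pd.DataFrame([[2, 4, 8], [5, 10, 20], [3, 3, 3]])

# Row-wise minimum divided by row-wise maximum, without apply.
min_by_max = df.min(axis=1) / df.max(axis=1)
print(min_by_max.tolist())  # [0.25, 0.25, 1.0]
```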

Find the second-largest value in each row

Create a new column 'penultimate' that holds the second-largest value of each row of df.

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution
out = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)
df['penultimate'] = out
print(df)

Output (stream):
0 1 2 3 4 5 6 7 8 9 penultimate
0 89 42 65 63 4 24 41 72 79 66 79
1 76 28 17 53 42 21 93 81 5 39 81
2 66 92 18 93 99 74 71 85 84 42 93
3 59 53 72 13 1 88 95 92 70 68 92
4 97 56 64 76 78 36 80 10 94 14 94
5 48 92 39 42 1 26 32 7 48 90 90
6 17 4 70 22 44 52 39 84 67 52 70
7 38 44 46 12 24 23 28 85 87 82 85
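
Because the solution calls `unique()`, it returns the second-largest distinct value. The sketch below, on an illustrative frame with a repeated maximum, contrasts that with `nlargest(2)`, which counts duplicates separately. Which behaviour is wanted depends on the task; the variable names here are assumptions for illustration.

```python
import pandas as pd

# Illustrative rows; the first one has a repeated maximum.
df = pd.DataFrame([[5, 9, 9, 1], [3, 7, 2, 6]])

# Second-largest distinct value (the approach used above).
distinct = df.apply(lambda x: x.sort_values().unique()[-2], axis=1)

# Second-largest entry when duplicates count separately.
with_ties = df.apply(lambda x: x.nlargest(2).iloc[-1], axis=1)

print(distinct.tolist())   # [5, 6]
print(with_ties.tolist())  # [9, 6]
```

Note that the `unique()[-2]` version raises an `IndexError` for a row whose values are all identical.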

How to normalize all columns in a dataframe

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

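# Solution Q1: standardize each column (subtract the mean, divide by the standard deviation)
out1 = df.apply(lambda x: ((x - x.mean())/x.std()).round(2))
print('Solution Q1\n',out1)

# Solution Q2: rescale each column to the range [0, 1]
# Note: with this formula the column maximum maps to 0 and the minimum to 1
out2 = df.apply(lambda x: ((x.max() - x)/(x.max() - x.min())).round(2))
print('Solution Q2\n', out2)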

Output (stream):
Solution Q1
0 1 2 3 4 5 6 7 8 9
0 0.25 1.05 1.21 0.39 -0.57 0.34 -0.42 -0.79 -1.18 0.90
1 -0.11 -0.12 1.24 0.07 1.15 -1.18 1.16 0.53 1.10 -0.32
2 -0.78 1.53 -1.54 -1.24 1.20 1.72 0.27 1.00 0.42 0.06
3 -0.04 0.81 -0.41 -1.14 -0.99 -0.40 -1.05 -1.06 -1.45 -1.28
4 -0.75 -1.05 -1.12 -0.05 1.20 -1.14 -1.22 -0.62 -0.09 0.15
5 -0.53 -0.63 0.43 1.48 -0.69 -0.40 -0.16 -1.10 -0.16 1.07
6 2.31 -0.60 0.01 1.17 -0.91 0.94 1.65 0.63 -0.09 0.95
7 -0.36 -1.02 0.18 -0.68 -0.40 0.12 -0.23 1.41 1.44 -1.54
Solution Q2
0 1 2 3 4 5 6 7 8 9
0 0.67 0.19 0.01 0.40 0.81 0.48 0.72 0.88 0.91 0.06
1 0.78 0.64 0.00 0.52 0.02 1.00 0.17 0.35 0.12 0.53
2 1.00 0.00 1.00 1.00 0.00 0.00 0.48 0.16 0.35 0.39
3 0.76 0.28 0.59 0.97 1.00 0.73 0.94 0.99 1.00 0.90
4 0.99 1.00 0.85 0.56 0.00 0.99 1.00 0.81 0.53 0.35
5 0.92 0.84 0.29 0.00 0.87 0.73 0.63 1.00 0.55 0.00
6 0.00 0.83 0.44 0.11 0.97 0.27 0.00 0.31 0.53 0.05
7 0.86 0.99 0.38 0.79 0.73 0.55 0.66 0.00 0.00 1.00
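
In the Q2 output, the largest value in each column becomes 0 and the smallest becomes 1. If the more conventional orientation is wanted (minimum to 0, maximum to 1), a minimal sketch is to subtract the column minimum instead. The frame and the names `scaled` and `inverted` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data with easy-to-check ranges.
df = pd.DataFrame({"a": [1, 2, 3, 5], "b": [10, 20, 40, 30]})

col_range = df.max() - df.min()
scaled = (df - df.min()) / col_range      # minimum -> 0, maximum -> 1
inverted = (df.max() - df) / col_range    # the Q2 formula: maximum -> 0, minimum -> 1

print(scaled.round(2))
```

The two results are complements of each other: `scaled + inverted` is 1 everywhere.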

How to compute the correlation between each row and the next one?

# Input
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, -1))

# Solution: correlate each row with the row that follows it
[df.iloc[i].corr(df.iloc[i+1]).round(2) for i in range(len(df) - 1)]

Output (plain):
[-0.51, 0.11, -0.05, 0.02, 0.46, -0.69, -0.23]
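
An alternative sketch is to transpose the frame, compute the full row-by-row correlation matrix once, and read off the entries just above the main diagonal. This does more work than the loop (it correlates every pair of rows), so it is mainly useful when the full matrix is needed anyway. The seed and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative seeded data with the same 8x10 shape as the exercise.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 100, size=(8, 10)))

# Correlate rows with each other by transposing, then take the entries
# just above the main diagonal: (row 0, row 1), (row 1, row 2), ...
row_corr = df.T.corr().to_numpy()
adjacent = np.diag(row_corr, k=1)

print(adjacent.round(2))
```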

How to replace the values on both diagonals of a dataframe with 0

df = pd.DataFrame(np.random.randint(1,100, 100).reshape(10, -1))

# Solution
for i in range(df.shape[0]):
    df.iat[i, i] = 0
    df.iat[df.shape[0]-i-1, i] = 0
df

Output (html):

	0	1	2	3	4	5	6	7	8	9
0	0	71	61	38	97	22	93	36	47	0
1	11	0	11	86	46	20	60	60	0	38
2	71	88	0	11	92	25	98	0	17	69
3	27	57	2	0	49	83	0	27	3	94
4	1	33	61	52	0	0	50	71	96	29
5	2	52	32	90	0	0	59	53	15	52
6	1	41	90	0	42	52	0	14	17	39
7	42	87	0	51	54	84	29	0	94	99
8	29	0	64	64	7	99	47	39	0	62
9	0	71	1	20	27	54	37	99	31	0
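
The loop above can also be expressed with NumPy's `np.fill_diagonal`. The sketch below works on a copy of the values so the original frame stays untouched; the 5x5 frame is illustrative data, and `arr` and `out` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Illustrative 5x5 frame holding 1..25.
df = pd.DataFrame(np.arange(1, 26).reshape(5, 5))

arr = df.to_numpy().copy()           # work on a copy, leave df untouched
np.fill_diagonal(arr, 0)             # main diagonal
np.fill_diagonal(np.fliplr(arr), 0)  # anti-diagonal; fliplr returns a view of arr

out = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(out)
```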

Get the data of a particular group after grouping a dataframe

df = pd.DataFrame({'col1': ['apple', 'banana', 'orange'] * 3,
                   'col2': np.random.rand(9),
                   'col3': np.random.randint(0, 15, 9)})

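# Group by a single column name; a scalar key keeps get_group('apple') valid in recent pandas versions
df_grouped = df.groupby('col1')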

# Solution 1
df_grouped.get_group('apple')

Output (html):

	col1	col2	col3
0	apple	0.861407	5
3	apple	0.407644	14
6	apple	0.974718	11

Get the nth-largest value within a group after grouping

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'taste': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})

n=2
# Solution
df_grpd = df['taste'].groupby(df.fruit)
df_grpd.get_group('banana').sort_values().iloc[-n]

Output (plain):
0.31660034951387783
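
To get the nth-largest value for every group at once rather than for one group, one possible sketch applies `nlargest` per group. The fixed `taste` values below are illustrative so the result can be checked by hand.

```python
import pandas as pd

# Illustrative data with fixed values instead of random ones.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "orange"] * 3,
    "taste": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
})

n = 2
# For every group, keep the n largest values and take the last of them.
# A group with fewer than n rows would return its smallest value instead.
nth_largest = df.groupby("fruit")["taste"].apply(lambda s: s.nlargest(n).iloc[-1])
print(nth_largest.to_dict())  # {'apple': 0.4, 'banana': 0.5, 'orange': 0.6}
```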

Get the mean of each group after grouping, keeping the grouping column as a regular column rather than the index

df = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                   'rating': np.random.rand(9),
                   'price': np.random.randint(0, 15, 9)})

# Solution
out = df.groupby('fruit', as_index=False)['price'].mean()
print(out)

Output (stream):
fruit price
0 apple 6
1 banana 7
2 orange 5
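
As far as the result is concerned, `as_index=False` should be equivalent to grouping normally and then calling `reset_index()`. The sketch below compares the two on illustrative fixed prices.

```python
import pandas as pd

# Illustrative data with fixed prices.
df = pd.DataFrame({
    "fruit": ["apple", "banana", "orange"] * 3,
    "price": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Grouping column kept as a regular column from the start.
flat = df.groupby("fruit", as_index=False)["price"].mean()

# Grouping column first becomes the index, then is moved back to a column.
via_reset = df.groupby("fruit")["price"].mean().reset_index()

print(flat)
```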

Merge two dataframes on two columns, keeping only the rows present in both dataframes

df1 = pd.DataFrame({'fruit': ['apple', 'banana', 'orange'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.random.randint(0, 15, 9)})

df2 = pd.DataFrame({'pazham': ['apple', 'orange', 'pine'] * 2,
                    'kilo': ['high', 'low'] * 3,
                    'price': np.random.randint(0, 15, 6)})

# Solution
pd.merge(df1, df2, how='inner', left_on=['fruit', 'weight'], right_on=['pazham', 'kilo'], suffixes=['_left', '_right'])

Output (html):

	fruit	weight	price_left	pazham	kilo	price_right
0	apple	high	14	apple	high	4
1	apple	high	9	apple	high	4
2	apple	high	10	apple	high	4
3	orange	low	7	orange	low	11
4	orange	low	8	orange	low	11
5	orange	low	14	orange	low	11

How to remove from a dataframe the rows that are present in another dataframe

df1 = pd.DataFrame({'fruit': ['apple', 'orange', 'banana'] * 3,
                    'weight': ['high', 'medium', 'low'] * 3,
                    'price': np.arange(9)})

df2 = pd.DataFrame({'fruit': ['apple', 'orange', 'pine'] * 2,
                    'weight': ['high', 'medium'] * 3,
                    'price': np.arange(6)})


# Solution
# isin(df2) aligns on index and column labels, so rows are compared position by position
print(df1[~df1.isin(df2).all(axis=1)])

df1.isin(df2)

Output (stream):
fruit weight price
2 banana low 2
3 apple high 3
4 orange medium 4
5 banana low 5
6 apple high 6
7 orange medium 7
8 banana low 8

Output (html):

	fruit	weight	price
0	True	True	True
1	True	True	True
2	False	False	True
3	True	False	True
4	True	False	True
5	False	False	True
6	False	False	False
7	False	False	False
8	False	False	False
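
Because `DataFrame.isin` with a dataframe argument matches on index and column labels, the solution above only removes a row when the same row sits at the same index in both frames. If rows should be removed regardless of where they appear in the second frame, one possible sketch is an anti-join built with `merge(..., indicator=True)`. The small frames below are illustrative, and `only_in_df1` is a hypothetical name.

```python
import pandas as pd

df1 = pd.DataFrame({"fruit": ["apple", "orange", "banana"], "price": [1, 2, 3]})
# Same rows as the first two of df1, but in a different order.
df2 = pd.DataFrame({"fruit": ["orange", "apple"], "price": [2, 1]})

# Position-by-position comparison: nothing lines up, so nothing is removed.
positional = df1[~df1.isin(df2).all(axis=1)]

# Anti-join: keep the rows of df1 that have no match anywhere in df2.
merged = df1.merge(df2.drop_duplicates(), on=list(df1.columns),
                   how="left", indicator=True)
only_in_df1 = merged.loc[merged["_merge"] == "left_only", df1.columns]

print(len(positional))                # 3
print(only_in_df1["fruit"].tolist())  # ['banana']
```

Note that the merge result carries a fresh integer index rather than the original index of `df1`.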

How to get the positions where the values of two columns match

df = pd.DataFrame({'fruit1': np.random.choice(['apple', 'orange', 'banana'], 10),
                    'fruit2': np.random.choice(['apple', 'orange', 'banana'], 10)})

# Solution
np.where(df.fruit1 == df.fruit2)

Output (plain):
(array([1, 2, 5, 6, 7, 8], dtype=int64),)
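
`np.where` returns integer positions, which coincide with the index labels only when the frame has the default 0..n-1 index. The sketch below uses an illustrative frame with string labels to show how to obtain either positions or labels; the index values `w`..`z` are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a non-default index.
df = pd.DataFrame(
    {"fruit1": ["apple", "orange", "banana", "apple"],
     "fruit2": ["apple", "banana", "banana", "orange"]},
    index=list("wxyz"),
)

match = (df["fruit1"] == df["fruit2"]).to_numpy()
positions = np.flatnonzero(match)   # integer positions of matching rows
labels = df.index[match].tolist()   # index labels of matching rows

print(positions.tolist())  # [0, 2]
print(labels)              # ['w', 'y']
```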

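How to shift a time series forward or backward by a number of time steps

Create new columns that are lagged or leading versions of existing columns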

df = pd.DataFrame(np.random.randint(1, 100, 20).reshape(-1, 4), columns = list('abcd'))

# Solution
df['a_lag1'] = df['a'].shift(1)
df['b_lead1'] = df['b'].shift(-1)
print(df)

Output (stream):
a b c d a_lag1 b_lead1
0 90 49 33 17 NaN 11.0
1 84 11 34 16 90.0 66.0
2 78 66 63 6 84.0 34.0
3 84 34 53 15 78.0 30.0
4 12 30 44 22 84.0 NaN
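
In the output above, the shifted columns become floats because the vacated position is filled with NaN. If a placeholder value is acceptable, `shift` accepts a `fill_value` argument that keeps the integer dtype. The sketch below reuses the values of column `a` from the output; choosing 0 as the filler is an assumption for illustration.

```python
import pandas as pd

# Column 'a' as shown in the output above.
df = pd.DataFrame({"a": [90, 84, 78, 84, 12]})

lag_nan = df["a"].shift(1)                   # NaN in the first row, dtype becomes float64
lag_filled = df["a"].shift(1, fill_value=0)  # explicit filler, integer dtype preserved

print(lag_filled.tolist())  # [0, 90, 84, 78, 84]
```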

Get the value counts across the entire dataframe

df = pd.DataFrame(np.random.randint(1, 10, 20).reshape(-1, 4), columns = list('abcd'))
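# Solution
# Flatten all values into a single Series, then count them
# (the top-level pd.value_counts is deprecated in recent pandas versions)
pd.Series(df.values.ravel()).value_counts()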

Output (plain):
2 4
8 3
5 3
1 3
4 2
3 2
9 1
7 1
6 1
dtype: int64
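
The same counts can be obtained directly in NumPy with `np.unique(..., return_counts=True)`. The sketch below uses a tiny illustrative frame and wraps the result in a Series sorted by count; the names `values` and `counts` are just local variables.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame: 1 appears twice, 2 three times, 3 once.
df = pd.DataFrame([[1, 2, 2], [3, 2, 1]], columns=list("abc"))

values, counts = np.unique(df.to_numpy(), return_counts=True)
value_counts = pd.Series(counts, index=values).sort_values(ascending=False)
print(value_counts)
```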

Splitting a string column

df = pd.DataFrame(["STD, City    State",
"33, Kolkata    West Bengal",
"44, Chennai    Tamil Nadu",
"40, Hyderabad    Telengana",
"80, Bangalore    Karnataka"], columns=['row'])

# Solution
df.row.str.split(',|\t', expand=True)

Output (html):

	0	1
0	STD	City State
1	33	Kolkata West Bengal
2	44	Chennai Tamil Nadu
3	40	Hyderabad Telengana
4	80	Bangalore Karnataka
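
In the output above, the city and state end up in one column, which suggests the input strings contain runs of spaces rather than tab characters. Assuming that is the case, a minimal sketch is to split on either a comma or a run of two or more whitespace characters, then promote the first row to the header. The regular expression and the three-row sample are assumptions for illustration.

```python
import pandas as pd

# Same kind of rows as above, assuming city and state are separated by several spaces.
df = pd.DataFrame(["STD, City    State",
                   "33, Kolkata    West Bengal",
                   "44, Chennai    Tamil Nadu"], columns=["row"])

# Split on a comma plus any following whitespace, or on a run of 2+ whitespace characters.
parts = df["row"].str.split(r",\s*|\s{2,}", expand=True)

# Promote the first row to column names.
out = parts.iloc[1:].reset_index(drop=True)
out.columns = parts.iloc[0].tolist()
print(out)
```

Single spaces inside names such as "West Bengal" are left alone because the pattern requires at least two whitespace characters.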

Note
This article was converted from a Jupyter notebook; you can download the notebook here
