100 pandas Data Analysis Exercises - Part 1

xxxspy 2018-08-13 18:17:55


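This article collects problems that people frequently run into when analyzing data with pandas. The same problems are a good test of how fluent you are with pandas, so they work well as study material: mastering these techniques makes data analysis work considerably more efficient.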

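How to import pandas and check its version

import pandas as pd
print(pd.__version__)
# show_versions() prints its report itself and returns None,
# so wrapping it in print() also emits the trailing "None" seen below
print(pd.show_versions(as_json=True))

Output (stream):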
0.23.0
{'system': {'commit': None, 'python': '3.6.4.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 158 Stepping 9, GenuineIntel', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None', 'LOCALE': 'None.None'}, 'dependencies': {'pandas': '0.23.0', 'pytest': None, 'pip': '9.0.1', 'setuptools': '28.8.0', 'Cython': None, 'numpy': '1.14.3', 'scipy': '1.1.0', 'pyarrow': None, 'xarray': None, 'IPython': '6.4.0', 'sphinx': None, 'patsy': None, 'dateutil': '2.7.3', 'pytz': '2018.4', 'blosc': None, 'bottleneck': None, 'tables': None, 'numexpr': None, 'feather': None, 'matplotlib': '2.2.2', 'openpyxl': None, 'xlrd': '1.1.0', 'xlwt': '1.3.0', 'xlsxwriter': None, 'lxml': None, 'bs4': None, 'html5lib': '0.9999999', 'sqlalchemy': None, 'pymysql': None, 'psycopg2': None, 'jinja2': '2.10', 's3fs': None, 'fastparquet': None, 'pandas_gbq': None, 'pandas_datareader': None}}
None

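Converting a list, NumPy array, or dict to a pd.Series

import numpy as np
# Note: 'e' and 'd' are swapped in this string, which is why the outputs below read a, b, c, e, d
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

# Solution
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())

Output (stream):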
a 0
b 1
c 2
e 3
d 4
dtype: int64
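
As a small supplementary sketch (the sample values and the name `scores` are made up for illustration), the three inputs differ mainly in where the index comes from: a list or array gets a default integer index unless `index=` is passed, while a dict's keys become the index.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from the original notebook
letters = ['a', 'b', 'c']
numbers = np.arange(3)

from_list = pd.Series(letters)                  # default integer index 0, 1, 2
from_array = pd.Series(numbers, index=letters)  # explicit index supplied
from_dict = pd.Series({'a': 0, 'b': 1, 'c': 2}, name='scores')  # keys become the index

print(from_list.index.tolist())
print(from_array.index.tolist())
print(from_dict.index.tolist())
```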

Converting a Series index into a DataFrame column

ser3.head()

Output (plain):
a 0
b 1
c 2
e 3
d 4
dtype: int64

df = ser3.to_frame().reset_index()
print(df.head())

Output (stream):
index 0
0 a 0
1 b 1
2 c 2
3 e 3
4 d 4
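
The default column names `index` and `0` are not very descriptive. One possible way to name them during the conversion is sketched below; the sample data and the names `letter` and `value` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data, not from the original notebook
ser = pd.Series({'a': 0, 'b': 1, 'c': 2})

# rename_axis names the index; reset_index(name=...) names the value column
df = ser.rename_axis('letter').reset_index(name='value')
print(df)
```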

Combining multiple Series into one DataFrame

df = pd.DataFrame({'col1': ser1, 'col2': ser2})
print(df.head())

Output (stream):
col1 col2
0 a 0
1 b 1
2 c 2
3 e 3
4 d 4

Combining multiple Series into a DataFrame, aligned by index

# Use partial slices so the effect of the alignment is easy to see
s1 = ser1[:16]
s2 = ser2[14:]
s1

Output (plain):
0 a
1 b
2 c
3 e
4 d
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
dtype: object

pd.concat([s1, s2], axis=1)

Output (html):

	0	1
0	a	NaN
1	b	NaN
2	c	NaN
3	e	NaN
4	d	NaN
5	f	NaN
6	g	NaN
7	h	NaN
8	i	NaN
9	j	NaN
10	k	NaN
11	l	NaN
12	m	NaN
13	n	NaN
14	o	14.0
15	p	15.0
16	NaN	16.0
17	NaN	17.0
18	NaN	18.0
19	NaN	19.0
20	NaN	20.0
21	NaN	21.0
22	NaN	22.0
23	NaN	23.0
24	NaN	24.0
25	NaN	25.0
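
A minimal sketch of two options that are often useful with column-wise `pd.concat`: `keys=` to name the resulting columns and `join='inner'` to keep only index labels present in every input. The sample Series and column names here are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data with partially overlapping indexes
letters = pd.Series(list('abcd'))                    # index 0..3
numbers = pd.Series([20, 30, 40], index=[2, 3, 4])   # index 2..4

# Default outer join: union of the indexes, NaN where a Series has no value
outer = pd.concat([letters, numbers], axis=1, keys=['letter', 'number'])

# Inner join: only labels found in both Series (2 and 3)
inner = pd.concat([letters, numbers], axis=1, keys=['letter', 'number'], join='inner')

print(outer)
print(inner)
```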

Concatenating two Series end to end

pd.concat([s1, s2], axis=0)

Output (plain):
0 a
1 b
2 c
3 e
4 d
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
dtype: object
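
Note that the result above keeps the original labels, so 14 and 15 each appear twice. If a fresh 0..n-1 index is preferred, `ignore_index=True` is one option. The sketch below uses small made-up Series to show the difference.

```python
import pandas as pd

# Hypothetical sample data whose indexes overlap at label 2
s1 = pd.Series(['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=[2, 3])

stacked = pd.concat([s1, s2], axis=0)                        # keeps labels 0, 1, 2, 2, 3
renumbered = pd.concat([s1, s2], axis=0, ignore_index=True)  # relabels 0..4

print(stacked.index.is_unique)
print(renumbered.index.tolist())
```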

Finding elements that are in Series A but not in Series B

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
ser1[~ser1.isin(ser2)]

Output (plain):
0 1
1 2
2 3
dtype: int64
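
An alternative sketch using NumPy's set routines: `np.setdiff1d` returns the sorted unique values found in the first input but not the second. Unlike the boolean mask above, it returns a plain array and drops the original index.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Sorted unique values of ser1 that do not occur in ser2
only_in_ser1 = np.setdiff1d(ser1.values, ser2.values)
print(only_in_ser1)
```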

Union of two Series

np.union1d(ser1, ser2)

Output (plain):
array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)

Intersection of two Series

np.intersect1d(ser1, ser2)

Output (plain):
array([4, 5], dtype=int64)

Elements not shared by two Series

u = pd.Series(np.union1d(ser1, ser2))
i = pd.Series(np.intersect1d(ser1, ser2))
u[~u.isin(i)]

Output (plain):
0 1
1 2
2 3
5 6
6 7
7 8
dtype: int64
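
The union-minus-intersection computed above is the symmetric difference. As a shorter sketch, NumPy provides `np.setxor1d` for this directly; it returns a sorted array rather than an indexed Series.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

# Values that appear in exactly one of the two inputs
not_shared = np.setxor1d(ser1.values, ser2.values)
print(not_shared)
```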

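How to get the minimum, 25th percentile, median, 75th percentile, and maximum of a Series?

ser = pd.Series(np.random.normal(10, 5, 25))
# Note: this RandomState is created after the data and is never used, so it does not
# seed anything and the values below change from run to run
np.random.RandomState(100)
np.percentile(ser, q=[0, 25, 50, 75, 100])

Output (plain):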
array([-1.2740299 , 5.82920931, 8.64214184, 10.8035798 , 18.08081406])
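
For reproducible numbers, one option is to draw the data from a seeded generator. The sketch below (the variable names are my own) also shows the pandas equivalent, `Series.quantile`, which takes fractions in [0, 1] instead of percentages.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(100)           # seeded generator, so reruns give the same data
ser = pd.Series(rng.normal(10, 5, 25))

by_numpy = np.percentile(ser.values, q=[0, 25, 50, 75, 100])
by_pandas = ser.quantile([0, 0.25, 0.5, 0.75, 1.0])   # same statistics, indexed by fraction

print(by_pandas)
```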

How to get frequency counts of the unique items in a Series?

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

ser.value_counts()

Output (plain):
g 5
b 5
e 4
f 4
d 4
a 4
c 2
h 2
dtype: int64
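
If relative frequencies are more useful than raw counts, `value_counts(normalize=True)` returns proportions that sum to 1. A minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical sample data: 'a' twice, 'b' three times, 'c' once
ser = pd.Series(list('aabbbc'))

freq = ser.value_counts(normalize=True)   # proportions, most frequent first
print(freq)
```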

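Elements with the top-2 counts in a Series

v_cnt = ser.value_counts()
# value_counts() on the counts ranks the count values by how often they occur;
# for this data its first two entries (4 and 5) happen to be the two largest counts
cnt_cnt = v_cnt.value_counts().index[:2]
cnt_cnt

Output (plain):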
Int64Index([4, 5], dtype='int64')

index = v_cnt[v_cnt.isin(cnt_cnt)].index
index

Output (plain):
Index(['g', 'b', 'e', 'f', 'd', 'a'], dtype='object')
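
If the goal is strictly "items whose count is one of the two largest distinct counts", a sketch that sorts the distinct counts explicitly may be more robust than relying on how often each count occurs. The data and names below are illustrative assumptions, not the notebook's.

```python
import pandas as pd

# Hypothetical sample data: a and b occur 3 times, c twice, d once
ser = pd.Series(list('aaabbbccd'))

v_cnt = ser.value_counts()
# Two largest distinct count values, in descending order
top_counts = sorted(v_cnt.unique(), reverse=True)[:2]
# Every item whose count equals one of those values
top_items = v_cnt[v_cnt.isin(top_counts)].index.tolist()

print(top_counts, top_items)
```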

How to bin a numeric Series into 10 equal-sized groups

ser = pd.Series(np.random.random(20))
ser.head()

Output (plain):
0 0.888218
1 0.938604
2 0.859850
3 0.434301
4 0.851859
dtype: float64

groups = pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1],
        labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'])
groups.head()

Output (plain):
0 8th
1 10th
2 7th
3 4th
4 6th
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]
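
As a shorter variant, `pd.qcut` also accepts an integer number of quantiles, and `labels=False` returns integer bin codes instead of named categories. The sketch below uses 20 distinct values (an assumption made so that each decile holds exactly two items).

```python
import pandas as pd

# Hypothetical sample data: 20 distinct values
ser = pd.Series(range(20))

codes = pd.qcut(ser, q=10, labels=False)   # integer bin codes 0..9
sizes = codes.value_counts()               # number of items per bin

print(sizes.sort_index())
```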

How to convert a NumPy array into a DataFrame of a given shape

ser = pd.Series(np.random.randint(1, 10, 35))
df = pd.DataFrame(ser.values.reshape(7,5))
df

Output (html):

	0	1	2	3	4
0	5	1	2	3	6
1	6	7	8	5	6
2	8	7	3	6	4
3	8	5	4	7	2
4	2	8	3	4	1
5	6	8	7	1	9
6	6	5	9	9	1
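
When only one dimension is known, `reshape` accepts `-1` for the other and infers it from the array length. A small sketch with deterministic data and made-up column names:

```python
import numpy as np
import pandas as pd

arr = np.arange(35)                       # 35 values: 0..34

# -1 lets NumPy infer the row count (35 / 5 = 7)
df = pd.DataFrame(arr.reshape(-1, 5), columns=list('abcde'))
print(df.shape)
```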

How to find the positions of the multiples of 2 in a Series

ser = pd.Series(np.random.randint(1, 10, 7))
ser

Output (plain):
0 8
1 9
2 8
3 5
4 3
5 6
6 3
dtype: int32

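# .values passes the underlying boolean array; some newer pandas versions may reject a Series here
np.argwhere((ser % 2 == 0).values)

Output (plain):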
array([[0],
[2],
[5]], dtype=int64)
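
`np.argwhere` returns one row per match, which is why the result above is a column of single-element rows. If a flat array of positions is easier to work with, `np.flatnonzero` is one option. The sketch reuses the values printed above.

```python
import numpy as np
import pandas as pd

# Same values as the sample output above
ser = pd.Series([8, 9, 8, 5, 3, 6, 3])

# Flat array of positions where the condition holds
positions = np.flatnonzero((ser % 2 == 0).values)
print(positions)
```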

How to extract items at given positions from a Series

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]
ser.take(pos)

Output (plain):
0 a
4 e
8 i
14 o
20 u
dtype: object
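
Positional indexing with `iloc` gives the same result as `take` and may read more naturally alongside other indexing code. A minimal sketch:

```python
import pandas as pd

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]

by_iloc = ser.iloc[pos]   # position-based selection, same items as ser.take(pos)
print(''.join(by_iloc))
```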

Getting the positions of elements

aims = list('adhz')
[pd.Index(ser).get_loc(i) for i in aims]

Output (plain):
[0, 3, 7, 25]
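
A vectorized sketch of the same lookup: `Index.get_indexer` takes all targets at once and returns -1 for values that are missing, whereas `get_loc` raises a `KeyError`. It assumes the values in the Series are unique.

```python
import pandas as pd

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
aims = list('adhz')

idx = pd.Index(ser)                   # values must be unique for get_indexer
positions = idx.get_indexer(aims)     # one call for all targets
missing = idx.get_indexer(['?'])      # absent value is reported as -1

print(positions, missing)
```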

How to compute the mean squared error between a truth Series and a predicted Series

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
np.mean((truth-pred)**2)

Output (plain):
0.18571414723876128
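
Because the example above uses random noise, its result changes on every run. A deterministic sketch makes the formula MSE = mean((truth - pred)^2) easy to check by hand: with a constant error of 0.5, the MSE is 0.25 and its square root (RMSE) is 0.5.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a constant prediction error of 0.5
truth = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
pred = truth + 0.5

mse = ((truth - pred) ** 2).mean()   # mean of squared errors
rmse = np.sqrt(mse)                  # same units as the data

print(mse, rmse)
```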

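How to capitalize the first character of each element in a Series

ser = pd.Series(['how', 'to', 'kick', 'ass?'])
# str.title() capitalizes the first letter of every word; each element here is a single word
ser.map(lambda x: x.title())

Output (plain):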
0 How
1 To
2 Kick
3 Ass?
dtype: object
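
The `.str` accessor offers vectorized versions of these string methods. The sketch below, on made-up data, also shows where `title` and `capitalize` differ: for multi-word strings, `title` capitalizes every word while `capitalize` only touches the first character.

```python
import pandas as pd

# Hypothetical sample data containing a multi-word element
ser = pd.Series(['how to', 'kick'])

titled = ser.str.title()             # every word capitalized
capitalized = ser.str.capitalize()   # only the first character of each element

print(titled.tolist(), capitalized.tolist())
```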

How to count the characters in each word of a Series

ser.map(lambda x: len(x))

Output (plain):
0 3
1 2
2 4
3 4
dtype: int64
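
A sketch of the vectorized alternative, `Series.str.len()`. One practical difference from mapping `len` is that missing values propagate as NaN instead of raising a `TypeError`. The second Series is made-up data to show that.

```python
import pandas as pd

ser = pd.Series(['how', 'to', 'kick', 'ass?'])
lengths = ser.str.len()

# Hypothetical data with a missing entry; str.len() yields NaN for it
with_missing = pd.Series(['ab', None]).str.len()

print(lengths.tolist())
print(with_missing.isna().tolist())
```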

How to compute differences of time-series data

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# First-order difference
ser.diff()

Output (plain):
0 NaN
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 6.0
7 8.0
dtype: float64

# Second-order difference
ser.diff().diff()

Output (plain):
0 NaN
1 NaN
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 2.0
dtype: float64
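
One point that is easy to mix up: the second-order difference `ser.diff().diff()` is not the same as `ser.diff(periods=2)`, which subtracts the value two steps back. The sketch below contrasts the two on the same data and cross-checks the second-order result with `np.diff(..., n=2)`.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

second_order = ser.diff().diff()             # difference of the differences
lag_two = ser.diff(periods=2)                # x[t] - x[t-2]
by_numpy = np.diff(ser.values, n=2)          # second-order difference without the NaN padding

print(second_order.dropna().tolist())
print(lag_two.dropna().tolist())
```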

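Note
This article was converted from a Jupyter notebook; you can download the notebook here
For statistical consulting, add QQ 2726725926 or WeChat mllncn; SPSS statistical consulting is a paid service
You can ask me questions for free on Weibo at @mlln-cn
Please remember my websites: mlln.cn or jupyter.cn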
