pandas Data Analysis: 100 Exercises, Part 1

August 13, 2018

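Contents

1. How to import pandas and check its version
2. Convert a list, NumPy array, or dict to a pd.Series
3. Convert a series' index into a dataframe column
4. Combine multiple series into one dataframe
5. Combine multiple series into a dataframe, aligned on index
6. Concatenate two series end to end
7. Find elements that are in series A but not in series B
8. Union of two series
9. Intersection of two series
10. Elements not common to both series
11. How to get the minimum, 25th percentile, median, 75th percentile, and maximum of a series?
12. How to get frequency counts of the unique items in a series?
13. Elements with the top 2 counts in a series
14. How to bin a numeric series into 10 equal-sized groups
15. How to convert a NumPy array into a dataframe of a given shape
16. How to find the positions of the multiples of 2 in a series
17. How to extract items at given positions from a series
18. Get the positions of elements
19. How to compute the mean squared error between a truth series and a predicted series
20. How to convert the first character of each element in a series to uppercase
21. How to count the characters in each word of a series
22. How to compute differences of time series data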

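This article collects problems that people frequently run into when doing data analysis with pandas. The problems also work as a check on how fluent you are with pandas, so they read more like study material; mastering these techniques can make your data analysis work considerably more efficient.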

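How to import pandas and check its version

import pandas as pd
print(pd.__version__)
# show_versions() prints its report itself and returns None,
# so the outer print() adds the trailing "None" seen in the output
print(pd.show_versions(as_json=True))

Output (stream):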
0.23.0
{'system': {'commit': None, 'python': '3.6.4.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 158 Stepping 9, GenuineIntel', 'byteorder': 'little', 'LC_ALL': 'None', 'LANG': 'None', 'LOCALE': 'None.None'}, 'dependencies': {'pandas': '0.23.0', 'pytest': None, 'pip': '9.0.1', 'setuptools': '28.8.0', 'Cython': None, 'numpy': '1.14.3', 'scipy': '1.1.0', 'pyarrow': None, 'xarray': None, 'IPython': '6.4.0', 'sphinx': None, 'patsy': None, 'dateutil': '2.7.3', 'pytz': '2018.4', 'blosc': None, 'bottleneck': None, 'tables': None, 'numexpr': None, 'feather': None, 'matplotlib': '2.2.2', 'openpyxl': None, 'xlrd': '1.1.0', 'xlwt': '1.3.0', 'xlsxwriter': None, 'lxml': None, 'bs4': None, 'html5lib': '0.9999999', 'sqlalchemy': None, 'pymysql': None, 'psycopg2': None, 'jinja2': '2.10', 's3fs': None, 'fastparquet': None, 'pandas_gbq': None, 'pandas_datareader': None}}
None

Convert a list, NumPy array, or dict to a pd.Series

import numpy as np
mylist = list('abcedfghijklmnopqrstuvwxyz')
myarr = np.arange(26)
mydict = dict(zip(mylist, myarr))

# Solution
ser1 = pd.Series(mylist)
ser2 = pd.Series(myarr)
ser3 = pd.Series(mydict)
print(ser3.head())

Output (stream):
a 0
b 1
c 2
e 3
d 4
dtype: int64
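
As a small illustration of how the three inputs differ, the sketch below (the names `letters`, `numbers`, and the three `from_*` variables are made up for this example) shows that a list or array gets a default integer index, while a dict's keys become the index.

```python
import numpy as np
import pandas as pd

letters = list('abc')
numbers = np.arange(3)

from_list = pd.Series(letters)                      # default integer index 0..2
from_array = pd.Series(numbers, name='n')           # name is optional metadata
from_dict = pd.Series(dict(zip(letters, numbers)))  # dict keys become the index

print(from_list.index.tolist())
print(from_dict.index.tolist())
```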

Convert a series' index into a dataframe column

ser3.head()

Output (plain):
a 0
b 1
c 2
e 3
d 4
dtype: int64

df = ser3.to_frame().reset_index()
print(df.head())

Output (stream):
index 0
0 a 0
1 b 1
2 c 2
3 e 3
4 d 4
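
The default column names `index` and `0` are not very descriptive. One possible way to name both columns in the same step is sketched below; the names `letter` and `value` are chosen only for this example.

```python
import pandas as pd

ser = pd.Series([0, 1, 2], index=list('abc'))

# rename_axis names the index; reset_index(name=...) names the value column
df = ser.rename_axis('letter').reset_index(name='value')
print(df)
```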

Combine multiple series into one dataframe

df = pd.DataFrame({'col1': ser1, 'col2': ser2})
print(df.head())

Output (stream):
col1 col2
0 a 0
1 b 1
2 c 2
3 e 3
4 d 4

Combine multiple series into a dataframe, aligned on index

# Take only part of each series so the effect of the alignment is easy to see
s1 = ser1[:16]
s2 = ser2[14:]
s1

Output (plain):
0 a
1 b
2 c
3 e
4 d
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
dtype: object

pd.concat([s1, s2], axis=1)

Output (html):

	0	1
0	a	NaN
1	b	NaN
2	c	NaN
3	e	NaN
4	d	NaN
5	f	NaN
6	g	NaN
7	h	NaN
8	i	NaN
9	j	NaN
10	k	NaN
11	l	NaN
12	m	NaN
13	n	NaN
14	o	14.0
15	p	15.0
16	NaN	16.0
17	NaN	17.0
18	NaN	18.0
19	NaN	19.0
20	NaN	20.0
21	NaN	21.0
22	NaN	22.0
23	NaN	23.0
24	NaN	24.0
25	NaN	25.0
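
The NaN cells come from the default outer join on the index. If only the labels present in both series are wanted, `join='inner'` can be passed instead. A minimal sketch with made-up data (`left` and `right` are illustrative names):

```python
import pandas as pd

left = pd.Series(list('abcd'))                     # index 0..3
right = pd.Series([20, 30, 40], index=[2, 3, 4])

outer = pd.concat([left, right], axis=1)                 # union of labels, NaN where missing
inner = pd.concat([left, right], axis=1, join='inner')   # only labels 2 and 3

print(outer)
print(inner)
```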

Concatenate two series end to end

pd.concat([s1, s2], axis=0)

Output (plain):
0 a
1 b
2 c
3 e
4 d
5 f
6 g
7 h
8 i
9 j
10 k
11 l
12 m
13 n
14 o
15 p
14 14
15 15
16 16
17 17
18 18
19 19
20 20
21 21
22 22
23 23
24 24
25 25
dtype: object
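
Note that the labels 14 and 15 appear twice in the result above, because each series keeps its own index. If a fresh 0..n-1 index is preferred, `ignore_index=True` is one option. A small sketch with made-up data:

```python
import pandas as pd

s1 = pd.Series(list('abc'))              # index 0, 1, 2
s2 = pd.Series([10, 20], index=[1, 2])   # overlaps labels 1 and 2

stacked = pd.concat([s1, s2], axis=0)                         # keeps original labels
renumbered = pd.concat([s1, s2], axis=0, ignore_index=True)   # relabels 0..4

print(stacked.index.tolist())
print(renumbered.index.tolist())
```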

Find elements that are in series A but not in series B

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])
ser1[~ser1.isin(ser2)]

Output (plain):
0 1
1 2
2 3
dtype: int64

Union of two series

np.union1d(ser1, ser2)

Output (plain):
array([1, 2, 3, 4, 5, 6, 7, 8], dtype=int64)

Intersection of two series

np.intersect1d(ser1, ser2)

Output (plain):
array([4, 5], dtype=int64)

Elements not common to both series

u = pd.Series(np.union1d(ser1, ser2))
i = pd.Series(np.intersect1d(ser1, ser2))
u[~u.isin(i)]

Output (plain):
0 1
1 2
2 3
5 6
6 7
7 8
dtype: int64
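
NumPy also has dedicated set routines for the difference and the symmetric difference, which may be a shorter alternative when a sorted array of unique values is acceptable as the result. A sketch using the same two series:

```python
import numpy as np
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

only_in_first = np.setdiff1d(ser1, ser2)   # in ser1 but not in ser2
not_common = np.setxor1d(ser1, ser2)       # in exactly one of the two

print(only_in_first)
print(not_common)
```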

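How to get the minimum, 25th percentile, median, 75th percentile, and maximum of a series?

ser = pd.Series(np.random.normal(10, 5, 25))
# Note: this call builds a separate generator and discards it; it does not seed
# np.random.normal above, so the values below change from run to run
np.random.RandomState(100)
np.percentile(ser, q=[0, 25, 50, 75, 100])

Output (plain):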
array([-1.2740299 , 5.82920931, 8.64214184, 10.8035798 , 18.08081406])
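
For a reproducible variant, the random numbers can be drawn from the seeded generator itself. The sketch below assumes that approach (the name `rng` is illustrative) and also shows `Series.quantile`, which by default uses the same linear interpolation as `np.percentile`.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(100)           # seeded generator, used for the draw
ser = pd.Series(rng.normal(10, 5, 25))

with_numpy = np.percentile(ser, q=[0, 25, 50, 75, 100])
with_pandas = ser.quantile([0, 0.25, 0.5, 0.75, 1])   # fractions instead of percents

print(with_numpy)
print(with_pandas)
```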

How to get frequency counts of the unique items in a series?

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))

ser.value_counts()

Output (plain):
g 5
b 5
e 4
f 4
d 4
a 4
c 2
h 2
dtype: int64
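
When relative frequencies are more useful than raw counts, `value_counts` accepts `normalize=True`. A small sketch on fixed data so the result is predictable:

```python
import pandas as pd

ser = pd.Series(list('aabbbc'))

counts = ser.value_counts()               # absolute counts
freq = ser.value_counts(normalize=True)   # shares that sum to 1

print(counts)
print(freq)
```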

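Elements with the top 2 counts in a series

v_cnt = ser.value_counts()
# Caveat: value_counts() on the counts ranks count values by how often they occur,
# so index[:2] is the two most common count values, not necessarily the two largest
cnt_cnt = v_cnt.value_counts().index[:2]
cnt_cnt

Output (plain):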
Int64Index([4, 5], dtype='int64')

index = v_cnt[v_cnt.isin(cnt_cnt)].index
index

Output (plain):
Index(['g', 'b', 'e', 'f', 'd', 'a'], dtype='object')
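
If the goal is the items whose counts are the two largest distinct values, one possible alternative is to sort the distinct counts and keep the last two. This is a sketch under that reading of the exercise, with made-up data in which the most common count values (1 and 2) are not the two largest counts (5 and 2).

```python
import numpy as np
import pandas as pd

# counts: a=5, e=2, f=2, b=1, c=1, d=1
ser = pd.Series(list('aaaaabcdeeff'))
v_cnt = ser.value_counts()

# two largest distinct count values
top_counts = np.sort(v_cnt.unique())[-2:]
top_items = v_cnt[v_cnt.isin(top_counts)].index

print(top_counts)
print(sorted(top_items))
```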

How to bin a numeric series into 10 equal-sized groups

ser = pd.Series(np.random.random(20))
ser.head()

Output (plain):
0 0.888218
1 0.938604
2 0.859850
3 0.434301
4 0.851859
dtype: float64

groups = pd.qcut(ser, q=[0, .10, .20, .3, .4, .5, .6, .7, .8, .9, 1],
                 labels=['1st', '2nd', '3rd', '4th', '5th', '6th', '7th', '8th', '9th', '10th'])
groups.head()

Output (plain):
0 8th
1 10th
2 7th
3 4th
4 6th
dtype: category
Categories (10, object): [1st < 2nd < 3rd < 4th ... 7th < 8th < 9th < 10th]
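
`pd.qcut` also accepts an integer number of quantiles, which avoids spelling out the edges. The sketch below assumes a seeded generator (`rng` is an illustrative name) and uses `labels=False` to get integer bin codes, then checks that every group has the same size.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
ser = pd.Series(rng.random_sample(20))

# q=10 asks for deciles; labels=False returns bin codes 0..9
groups = pd.qcut(ser, q=10, labels=False)
sizes = groups.value_counts().sort_index()

print(sizes)
```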

How to convert a NumPy array into a dataframe of a given shape

ser = pd.Series(np.random.randint(1, 10, 35))
df = pd.DataFrame(ser.values.reshape(7,5))
df

Output (html):

	0	1	2	3	4
0	5	1	2	3	6
1	6	7	8	5	6
2	8	7	3	6	4
3	8	5	4	7	2
4	2	8	3	4	1
5	6	8	7	1	9
6	6	5	9	9	1
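
Two small variations that may be handy: naming the columns at construction time, and passing `-1` so NumPy infers one dimension from the array size. A sketch on fixed data (the column names are arbitrary):

```python
import numpy as np
import pandas as pd

values = np.arange(35)

# explicit shape plus column names
df = pd.DataFrame(values.reshape(7, 5), columns=list('abcde'))

# -1 lets NumPy work out the number of rows (35 / 5 = 7)
auto = pd.DataFrame(values.reshape(-1, 5))

print(df.head())
```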

How to find the positions of the multiples of 2 in a series

ser = pd.Series(np.random.randint(1, 10, 7))
ser

Output (plain):
0 8
1 9
2 8
3 5
4 3
5 6
6 3
dtype: int32

np.argwhere(ser % 2==0)

Output (plain):
array([[0],
[2],
[5]], dtype=int64)
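
Passing a Series directly to `np.argwhere` has not behaved the same way across pandas versions, so converting the boolean mask to a plain array first is a cautious alternative. The sketch below uses `np.flatnonzero`, which returns a flat array of positions rather than a column of one-element rows.

```python
import numpy as np
import pandas as pd

ser = pd.Series([8, 9, 8, 5, 3, 6, 3])

mask = (ser % 2 == 0).to_numpy()     # plain boolean array
positions = np.flatnonzero(mask)     # positions where the mask is True

print(positions)
```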

How to extract items at given positions from a series

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]
ser.take(pos)

Output (plain):
0 a
4 e
8 i
14 o
20 u
dtype: object
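
`take` works by position, not by label, which only becomes visible when the index is not the default 0..n-1. The sketch below uses a made-up index to show this, and that `iloc` gives the same result.

```python
import pandas as pd

ser = pd.Series(list('abcde'), index=[10, 20, 30, 40, 50])
pos = [0, 2, 4]

by_take = ser.take(pos)    # positional selection
by_iloc = ser.iloc[pos]    # same selection via iloc

print(by_take)
```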

Get the positions of elements

aims = list('adhz')
[pd.Index(ser).get_loc(i) for i in aims]

Output (plain):
[0, 3, 7, 25]
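
`get_loc` raises a `KeyError` for a value that is not present. When some targets may be missing, `Index.get_indexer` is a vectorized alternative that returns -1 for them; it requires the values to be unique. A sketch with one deliberately missing target (`'?'`):

```python
import pandas as pd

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
aims = list('adhz') + ['?']    # '?' does not occur in ser

# one call for all targets; -1 marks values that were not found
positions = pd.Index(ser).get_indexer(aims)

print(positions)
```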

How to compute the mean squared error between a truth series and a predicted series

truth = pd.Series(range(10))
pred = pd.Series(range(10)) + np.random.random(10)
np.mean((truth-pred)**2)

Output (plain):
0.18571414723876128
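
To make the formula easy to check by hand, here is a sketch on fixed, made-up values: the only nonzero error is 2, so the squared errors are 0, 0, 4, 0 and their mean is 1.

```python
import pandas as pd

truth = pd.Series([1.0, 2.0, 3.0, 4.0])
pred = pd.Series([1.0, 2.0, 5.0, 4.0])

# mean of the element-wise squared differences
mse = ((truth - pred) ** 2).mean()

print(mse)
```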

How to convert the first character of each element in a series to uppercase

ser = pd.Series(['how', 'to', 'kick', 'ass?'])
ser.map(lambda x: x.title())

Output (plain):
0 How
1 To
2 Kick
3 Ass?
dtype: object
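
For single lowercase words `title()` does the job, but it also capitalizes every later word and lowercases the remaining letters. The sketch below, on made-up strings, compares it with `str.capitalize()` and with a slice-based version that leaves the rest of the string untouched.

```python
import pandas as pd

ser = pd.Series(['hello world', 'data FRAME'])

titled = ser.str.title()             # every word capitalized, rest lowercased
capitalized = ser.str.capitalize()   # first character upper, rest lowercased
first_only = ser.map(lambda x: x[:1].upper() + x[1:])  # rest left as is

print(titled.tolist())
print(capitalized.tolist())
print(first_only.tolist())
```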

How to count the characters in each word of a series

ser.map(lambda x: len(x))

Output (plain):
0 3
1 2
2 4
3 4
dtype: int64
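
The `.str` accessor offers the same operation as `str.len()`, with the difference that missing values pass through as NaN instead of raising a `TypeError`. A sketch with one missing entry:

```python
import pandas as pd

ser = pd.Series(['how', 'to', None])

# missing entries stay missing, so the result is a float series
lengths = ser.str.len()

print(lengths)
```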

How to compute differences of time series data

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

# First-order difference
ser.diff()

Output (plain):
0 NaN
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 6.0
7 8.0
dtype: float64

# Second-order difference
ser.diff().diff()

Output (plain):
0 NaN
1 NaN
2 1.0
3 1.0
4 1.0
5 1.0
6 0.0
7 2.0
dtype: float64
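
It is easy to confuse the second-order difference with a lag-2 difference. The sketch below, on the same data, shows that `diff().diff()` and `diff(periods=2)` give different results: the first differences the differences, the second subtracts the value two steps back.

```python
import pandas as pd

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

second_order = ser.diff().diff()   # difference of consecutive differences
lag_two = ser.diff(periods=2)      # ser[i] - ser[i - 2]

print(second_order.tolist())
print(lag_two.tolist())
```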

Note
This article was converted from a Jupyter notebook.

#python #pandas