100 pandas Data Analysis Exercises: Part 2

xxxspy 2018-08-15 18:17:55

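This post collects questions that people frequently run into when analyzing data with pandas. The questions also test how fluent you are with pandas, so they work well as study material: mastering these skills will make your data analysis work far more efficient. For the first set of exercises, see: 100 pandas Data Analysis Exercises: Part 1. The second part follows: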

How to convert a Series of date strings to datetimes

import pandas as pd
ser = pd.Series(['01 Jan 2010', 
                '02-02-2011', 
                 '20120303', 
                 '2013/04/04', 
                 '2014-05-05', 
                 '2015-06-06T12:20'])

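# Let pandas infer the format of each string.
# On pandas 2.0 and later, mixed-format input like this needs
# pd.to_datetime(ser, format='mixed') instead.
pd.to_datetime(ser)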

Output (plain):
0 2010-01-01 00:00:00
1 2011-02-02 00:00:00
2 2012-03-03 00:00:00
3 2013-04-04 00:00:00
4 2014-05-05 00:00:00
5 2015-06-06 12:20:00
dtype: datetime64[ns]
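When every string shares one layout, passing an explicit format avoids ambiguous guesses such as day-first versus month-first. The following is a minimal sketch using made-up sample data (not from the exercise above); `errors='coerce'` turns entries that cannot be parsed into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data: ISO year-month-day strings plus one bad entry
same_format = pd.Series(['2010-01-01', '2011-02-02', 'not a date'])

# format= fixes the expected layout; errors='coerce' maps failures to NaT
parsed = pd.to_datetime(same_format, format='%Y-%m-%d', errors='coerce')
print(parsed)
```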

How to extract the year/month/day/hour/minute/second from a datetime Series

date = pd.Series(['01 Jan 2010', 
                '02-02-2011', 
                 '20120303', 
                 '2013/04/04', 
                 '2014-05-05', 
                 '2015-06-06T12:20'])
# On pandas 2.0 and later, use pd.to_datetime(date, format='mixed') for mixed formats
date = pd.to_datetime(date)
date.dt.year

Output (plain):
0 2010
1 2011
2 2012
3 2013
4 2014
5 2015
dtype: int64

date.dt.month

Output (plain):
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64

date.dt.day

Output (plain):
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64

date.dt.hour

Output (plain):
0 0
1 0
2 0
3 0
4 0
5 12
dtype: int64
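The minute and second components are read through the same `.dt` accessor. A small sketch with made-up timestamps (an assumption, chosen so that the seconds are non-zero); `dt.day_name()` is included as one more commonly used accessor method.

```python
import pandas as pd

# Hypothetical timestamps that include minutes and seconds
stamps = pd.to_datetime(pd.Series(['2015-06-06T12:20:45', '2014-05-05T00:00:00']))

minutes = stamps.dt.minute       # minute of the hour
seconds = stamps.dt.second       # second of the minute
weekdays = stamps.dt.day_name()  # weekday name as a string

print(minutes.tolist(), seconds.tolist(), weekdays.tolist())
```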

Find the words in a Series that contain at least two vowels

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

def count(x):
    # Count vowels case-insensitively so the 'A' in 'Apple' is included
    aims = 'aeiou'
    c = 0
    for i in x.lower():
        if i in aims:
            c += 1
    return c

counts = ser.map(count)
ser[counts >= 2]

Output (plain):
0 Apple
1 Orange
4 Money
dtype: object
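The same count can be done without a helper function by using the vectorized string method `Series.str.count` with a regular expression. This is a sketch of an alternative, not the original author's approach; note that case-insensitive matching also counts the capital 'A' in 'Apple'.

```python
import re
import pandas as pd

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

# Count matches of any vowel in each word, ignoring case
vowel_counts = ser.str.count('[aeiou]', flags=re.IGNORECASE)

# Keep the words with at least two vowels
result = ser[vowel_counts >= 2]
print(result)
```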

How to filter the valid email addresses in a Series

emails = pd.Series(['buying books at amazom.com', 
                    'rameses@egypt.com', 
                    'matt@t.co',
                    'narendra@modi.com'])

import re
pattern ='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}'
valid = emails.str.findall(pattern, flags=re.IGNORECASE)
[x[0] for x in valid if len(x)]

Output (plain):
['rameses@egypt.com', 'matt@t.co', 'narendra@modi.com']
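If the goal is to keep a filtered Series (with its original index) rather than a Python list, a boolean mask is one option. The sketch below anchors the same pattern with `^` and `$` so that the whole string must be an email address; that anchoring is an added assumption and is stricter than `findall`, which also accepts addresses embedded in longer text.

```python
import pandas as pd

emails = pd.Series(['buying books at amazom.com',
                    'rameses@egypt.com',
                    'matt@t.co',
                    'narendra@modi.com'])

# Same character classes as above, anchored to match the entire string
pattern = r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$'

mask = emails.str.match(pattern)  # boolean Series, True where the string is an email
valid = emails[mask]
print(valid)
```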

Group Series A by Series B, then compute the mean of each group

import numpy as np
fruit = pd.Series(np.random.choice(['apple', 'banana', 'carrot'], 10))
weights = pd.Series(np.linspace(1, 10, 10))

weights.groupby(fruit).mean()

Output (plain):
apple 9.00
banana 4.75
carrot 3.00
dtype: float64
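Because the example above draws the fruit labels at random, its output changes on every run. A reproducible sketch with fixed, made-up values shows the same grouping more concretely: the two Series are aligned by index, and the values of one are averaged per label of the other.

```python
import pandas as pd

# Hypothetical fixed data so the result is reproducible
fruit = pd.Series(['apple', 'banana', 'apple', 'carrot', 'banana', 'apple'])
weights = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Group the weights by the fruit label found at the same index position
means = weights.groupby(fruit).mean()
print(means)
```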

How to compute the Euclidean distance between two Series

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

sum((p - q)**2)**.5

Output (plain):
18.16590212458495
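The Euclidean distance is the 2-norm of the difference vector, so NumPy's `np.linalg.norm` gives the same number as the manual formula. A short sketch for comparison:

```python
import numpy as np
import pandas as pd

p = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q = pd.Series([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# 2-norm of the element-wise difference
dist = np.linalg.norm((p - q).to_numpy())
print(dist)
```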

How to find all local maxima (peaks) in a numeric Series

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])
dd = np.diff(np.sign(np.diff(ser)))
peak_locs = np.where(dd == -2)[0] + 1
peak_locs

Output (plain):
array([1, 5, 7], dtype=int64)
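A peak can also be described directly as a value that is larger than both of its neighbors. The sketch below expresses that with `shift`; it is an alternative formulation, and like the approach above it ignores the two endpoints and flat plateaus.

```python
import pandas as pd

ser = pd.Series([2, 10, 3, 4, 9, 10, 2, 7, 3])

# Strictly greater than the previous value and the next value.
# Comparisons against the NaN produced by shift are False, so endpoints drop out.
is_peak = (ser > ser.shift(1)) & (ser > ser.shift(-1))
peaks = ser[is_peak]
print(peaks.index.tolist())
```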

How to create a time series of 10 Saturdays starting from '2000-01-02'

pd.Series(np.random.randint(1,10,10), 
          pd.date_range('2000-01-02', 
                        periods=10, 
                        freq='W-SAT'))

Output (plain):
2000-01-08 5
2000-01-15 4
2000-01-22 2
2000-01-29 1
2000-02-05 4
2000-02-12 8
2000-02-19 1
2000-02-26 6
2000-03-04 6
2000-03-11 2
Freq: W-SAT, dtype: int32
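Note that 2000-01-02 is a Sunday, so with `freq='W-SAT'` the range rolls forward and the first entry is 2000-01-08. A quick check of the index on its own:

```python
import pandas as pd

idx = pd.date_range('2000-01-02', periods=10, freq='W-SAT')

# dayofweek uses Monday=0 ... Sunday=6, so Saturday is 5
all_saturdays = bool((idx.dayofweek == 5).all())
print(idx[0], all_saturdays)
```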

How to fill in the missing dates of a time series

ser = pd.Series([1,10,3,np.nan], index=pd.to_datetime(['2000-01-01',
                                                       '2000-01-03', 
                                                       '2000-01-06', 
                                                       '2000-01-08']))
# Fill each missing date with the value from the previous date
ser.resample('D').ffill()
# To fill with the value from the next date instead, use the bfill method

Output (plain):
2000-01-01 1.0
2000-01-02 1.0
2000-01-03 10.0
2000-01-04 10.0
2000-01-05 10.0
2000-01-06 3.0
2000-01-07 3.0
2000-01-08 NaN
Freq: D, dtype: float64
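For comparison, here is a sketch of backward filling and of inserting the missing dates without filling them. It uses a shortened, made-up series without the trailing NaN so that the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical series with gaps on Jan 2, Jan 4 and Jan 5
ser = pd.Series([1, 10, 3], index=pd.to_datetime(['2000-01-01',
                                                   '2000-01-03',
                                                   '2000-01-06']))

back_filled = ser.resample('D').bfill()  # each gap takes the next known value
with_gaps = ser.resample('D').asfreq()   # gaps are inserted as NaN

print(back_filled)
print(with_gaps)
```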

How to compute the autocorrelation of a Series

ser = pd.Series(np.arange(20) + np.random.normal(1, 10, 20))
autocorrelations = [ser.autocorr(i).round(2) for i in range(11)]

autocorrelations

Output (plain):
[1.0, 0.38, 0.12, 0.17, 0.44, 0.48, 0.25, -0.31, -0.1, 0.65, 0.05]
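`Series.autocorr(lag)` is the Pearson correlation between the Series and a copy of itself shifted by `lag` positions. The sketch below checks that equivalence; the seed is an arbitrary choice made only to keep the run reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
ser = pd.Series(np.arange(20) + rng.normal(1, 10, 20))

lag = 3
by_method = ser.autocorr(lag)
by_hand = ser.corr(ser.shift(lag))  # correlate with the lagged copy

print(by_method, by_hand)
```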

Read every n-th row when reading a CSV file

# Generate a CSV file for testing
fpath = 'testt.csv'
df = pd.DataFrame({'a': range(100), 
                   'b':np.random.choice(['apple', 'banana', 'carrot'], 100)})
df.to_csv(fpath, index=None)

# Read every 20th line of the CSV (line 0 is the header)
import csv

with open(fpath, 'r') as f:
    reader = csv.reader(f)
    out = []
    for i, row in enumerate(reader):
        if i%20 ==0:
            out.append(row)
pd.DataFrame(out[1:], columns=out[0])

Output (html):

	a	b
0	19	banana
1	39	carrot
2	59	banana
3	79	banana
4	99	apple
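`pd.read_csv` can do the same selection itself: `skiprows` accepts a function that receives the line number and returns True for lines to skip. The sketch below reads from an in-memory string instead of a file, which is an assumption made only to keep the example self-contained.

```python
import io
import pandas as pd

# Build the CSV text in memory (header line plus 100 data lines)
csv_text = pd.DataFrame({'a': range(100)}).to_csv(index=False)

# Keep line 0 (the header) and every 20th line after it
every_20th = pd.read_csv(io.StringIO(csv_text),
                         skiprows=lambda i: i % 20 != 0)
print(every_20th)
```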

Transform data while reading a CSV file

pd.read_csv(fpath, 
            converters={
                'a':lambda x: 'low' if int(x) < 50 else 'high'
            }).head()

Output (html):

	a	b
0	low	carrot
1	low	carrot
2	low	banana
3	low	apple
4	low	apple
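A converter replaces the column, so the original numbers are gone after loading. If both are needed, one alternative is to read the numbers as they are and derive the label afterwards. This is a sketch with made-up in-memory data; the column name `level` is a hypothetical choice.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical CSV content
csv_text = 'a,b\n10,apple\n75,banana\n49,carrot\n50,apple\n'
df = pd.read_csv(io.StringIO(csv_text))

# Derive the label in a new column and keep the numeric column
df['level'] = np.where(df['a'] < 50, 'low', 'high')
print(df)
```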

Read only certain columns from a CSV file

pd.read_csv(fpath, usecols=['a']).head()

Output (html):

	a
0	0
1	1
2	2
3	3
4	4
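Besides a list of names, `usecols` also accepts column positions or a function that is called with each column name. A sketch on made-up in-memory data:

```python
import io
import pandas as pd

# Hypothetical CSV content with three columns
csv_text = 'a,b,c\n1,apple,0.5\n2,banana,1.5\n'

by_name = pd.read_csv(io.StringIO(csv_text), usecols=['a', 'c'])
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0])
by_rule = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name != 'b')

print(by_name.columns.tolist(), by_position.columns.tolist(), by_rule.columns.tolist())
```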

Get the data type of each column in a DataFrame

df=pd.DataFrame(
    {
        'a':range(100),
        'b':np.random.rand(100),
        'c':[1,2,3,4]*25,
        'd':['apple', 'banana', 'carrot']*33 + ['apple']
    }
)

df.dtypes

Output (plain):
a int64
b float64
c int64
d object
dtype: object
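Once the dtypes are known, `select_dtypes` can pick columns by type. A sketch that keeps only the numeric columns of the same kind of DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': range(100),
    'b': np.random.rand(100),
    'c': [1, 2, 3, 4] * 25,
    'd': ['apple', 'banana', 'carrot'] * 33 + ['apple'],
})

# Keep integer and float columns, drop the string column
numeric = df.select_dtypes(include='number')
print(numeric.columns.tolist())
```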

Get the number of rows and columns of a DataFrame

df.shape

Output (plain):
(100, 4)

Get basic descriptive statistics for each column of a DataFrame

df.describe()

Output (html):

	a	b	c
count	100.000000	100.000000	100.000000
mean	49.500000	0.515885	2.500000
std	29.011492	0.281679	1.123666
min	0.000000	0.000605	1.000000
25%	24.750000	0.280289	1.750000
50%	49.500000	0.545348	2.500000
75%	74.250000	0.736113	3.250000
max	99.000000	0.992075	4.000000
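By default `describe` summarizes only the numeric columns, which is why column d is missing from the table. Calling `describe` on a string column gives a different summary: count, number of unique values, most frequent value, and its frequency. A sketch on the same kind of data:

```python
import pandas as pd

d = pd.Series(['apple', 'banana', 'carrot'] * 33 + ['apple'], name='d')

# Summary for a non-numeric column
stats = d.describe()
print(stats)
```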

Find the row that holds the maximum value of column a

df.loc[df.a==np.max(df.a)]

Output (html):

	a	b	c	d
99	99	0.598169	4	apple
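When only one row is wanted, `idxmax` returns the index label of the first maximum, which can be passed to `loc`. A sketch on made-up data; unlike the boolean mask, this returns a single row even if the maximum occurs more than once.

```python
import pandas as pd

# Hypothetical small DataFrame
df = pd.DataFrame({'a': [3, 9, 4], 'b': ['x', 'y', 'z']})

label = df['a'].idxmax()  # index label of the first maximum
row = df.loc[label]       # that row, returned as a Series
print(label, row['b'])
```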

Get the row positions of the maximum value in column c

np.where(df.c==np.max(df.c))

Output (plain):
(array([ 3, 7, 11, 15, 19, 23, 27, 31, 35, 39, 43, 47, 51, 55, 59, 63, 67,
71, 75, 79, 83, 87, 91, 95, 99], dtype=int64),)
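`np.where` with a single argument returns a tuple of arrays, one per dimension. For a one-dimensional column, `np.flatnonzero` returns the positions as a plain array. A sketch on a column built the same way as c:

```python
import numpy as np
import pandas as pd

c = pd.Series([1, 2, 3, 4] * 25)

# Positions where the column equals its maximum
positions = np.flatnonzero((c == c.max()).to_numpy())
print(positions)
```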

Read a single value from a DataFrame by row and column position

# iat takes integer positions: (row position, column position)
for row, col in [(4, 0), (4, 2), (0, 0), (33, 3)]:
    print(f'Value at row {row}, column {col}: {df.iat[row, col]}')

Output (stream):
Value at row 4, column 0: 4
Value at row 4, column 2: 1
Value at row 0, column 0: 0
Value at row 33, column 3: apple

Read a single value from a DataFrame by index label and column name

index = 0
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 2
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 4
col = 'd'
print(f'index={index}, col={col} : {df.at[index, col]}')
index = 5
col = 'c'
print(f'index={index}, col={col} : {df.at[index, col]}')

Output (stream):
index=0, col=d : apple
index=2, col=d : carrot
index=4, col=d : banana
index=5, col=c : 2
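With the default integer index, labels and positions happen to coincide, which hides the difference between `at` and `iat`. A sketch with a made-up string index makes it visible: `at` takes labels, `iat` takes integer positions.

```python
import pandas as pd

# Hypothetical DataFrame with string index labels
df = pd.DataFrame({'v': [10, 20, 30]}, index=['x', 'y', 'z'])

by_label = df.at['y', 'v']   # index label and column name
by_position = df.iat[1, 0]   # row position and column position
print(by_label, by_position)
```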

Rename a column in a DataFrame

df.rename(columns={'d':'fruit'}).head()

Output (html):

	a	b	c	fruit
0	0	0.406456	1	apple
1	1	0.607407	2	banana
2	2	0.197953	3	carrot
3	3	0.279180	4	apple
4	4	0.193107	1	banana
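Keep in mind that `rename` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name to be kept. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical small DataFrame
df = pd.DataFrame({'a': [1, 2], 'd': ['apple', 'banana']})

renamed = df.rename(columns={'d': 'fruit'})

print(renamed.columns.tolist())  # new names
print(df.columns.tolist())       # original is unchanged
```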

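That's all for today's tutorial. Please follow my site, mlln.cn; more exercises in this pandas series are on the way. I hope this work helps you learn pandas, or get through pandas questions in an interview.

Note
This post was converted from a Jupyter notebook; you can download the notebook here.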
