Python For Data Analysis - Chapter 8, Section 3
Chapter 8, Section 3 of "Python For Data Analysis" mainly discusses reshaping data and covers the stack, unstack, and pivot functions.
19.1 Creating a hierarchical index with set_index
In Section 1 of Chapter 17 of this Python tutorial site, the hierarchical index of a DataFrame was produced by giving rows identical labels; the last section of Chapter 17, on removing a hierarchical index, mentioned the set_index function and gave an example. This section again uses set_index to build a DataFrame with a hierarchical index, and then examines the reshaping functions stack and unstack that operate on such hierarchically indexed DataFrames.
import pandas as pd
import numpy as np
tfboys = [['Donald Trump', 'English',65],['Donald Trump','Math',63],['Donald Trump','Politics',56],
['George.W.Bush', 'English',75],['George.W.Bush', 'Math',43],['George.W.Bush', 'Politics',36],
['William Clinton', 'English',55],['William Clinton', 'Math',58],['William Clinton', 'Politics',45]]
#print tfboys
df = pd.DataFrame(tfboys,columns = ["name", "course", "score"])
print "-" * 40
print df, "# df"
print "-" * 40
df.set_index(["name", "course"], inplace = True)
print df, "# df multiInde"
print "-" * 40
print df.index, "# df index"
print "-" * 40
Output:
----------------------------------------
name course score
0 Donald Trump English 65
1 Donald Trump Math 63
2 Donald Trump Politics 56
3 George.W.Bush English 75
4 George.W.Bush Math 43
5 George.W.Bush Politics 36
6 William Clinton English 55
7 William Clinton Math 58
8 William Clinton Politics 45 # df
----------------------------------------
score
name course
Donald Trump English 65
Math 63
Politics 56
George.W.Bush English 75
Math 43
Politics 36
William Clinton English 55
Math 58
Politics 45 # df multiIndex
----------------------------------------
MultiIndex(levels=[[u'Donald Trump', u'George.W.Bush', u'William Clinton'], [u'English', u'Math', u'Politics']],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
names=[u'name', u'course'])# df index
----------------------------------------
The set_index function can make multiple columns the index (the row labels) at once, which produces a multi-level row index.
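The operation can be reversed with reset_index, which moves the index levels back out into ordinary columns. A minimal sketch using a shortened version of the data above:

import pandas as pd
tfboys = [['Donald Trump', 'English', 65], ['Donald Trump', 'Math', 63]]
df = pd.DataFrame(tfboys, columns = ["name", "course", "score"])
df.set_index(["name", "course"], inplace = True)   # two columns become a two-level MultiIndex
print df.reset_index(), "# reset_index moves the index levels back into columns"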
19.2 The stack and unstack functions
- The stack function pivots a DataFrame's columns into rows.
import pandas as pd
import numpy as np
tfboys = [['Donald Trump', 'English',65],['Donald Trump','Math',63],['Donald Trump','Politics',56],
['George.W.Bush', 'English',75],['George.W.Bush', 'Math',43],['George.W.Bush', 'Politics',36],
['William Clinton', 'English',55],['William Clinton', 'Math',58],['William Clinton', 'Politics',45]]
df = pd.DataFrame(tfboys,columns = ["name", "course", "score"])
print "-" * 40
print df, "# df"
print "-" * 40
df.set_index("name", inplace = True)
print df, "# df set_index"
print "-" * 40
print df.index, "# df.index"
print "-" * 40
print df.stack(), "# df stack()"
print "-" * 40
print df.stack().index, "# df stack() index"
print "-" * 40
Output:
----------------------------------------
name course score
0 Donald Trump English 65
1 Donald Trump Math 63
2 Donald Trump Politics 56
3 George.W.Bush English 75
4 George.W.Bush Math 43
5 George.W.Bush Politics 36
6 William Clinton English 55
7 William Clinton Math 58
8 William Clinton Politics 45 # df
----------------------------------------
course score
name
Donald Trump English 65
Donald Trump Math 63
Donald Trump Politics 56
George.W.Bush English 75
George.W.Bush Math 43
George.W.Bush Politics 36
William Clinton English 55
William Clinton Math 58
William Clinton Politics 45 # df set_index
----------------------------------------
Index([u'Donald Trump', u'Donald Trump', u'Donald Trump', u'George.W.Bush',
u'George.W.Bush', u'George.W.Bush', u'William Clinton',
u'William Clinton', u'William Clinton'],
dtype='object', name=u'name') # df.index
----------------------------------------
name
Donald Trump course English
score 65
course Math
score 63
course Politics
score 56
George.W.Bush course English
score 75
course Math
score 43
course Politics
score 36
William Clinton course English
score 55
course Math
score 58
course Politics
score 45
dtype: object # df stack()
----------------------------------------
MultiIndex(levels=[[u'Donald Trump', u'George.W.Bush', u'William Clinton'], [u'course', u'score']],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'name', None]) # df stack() index
----------------------------------------
From the stack output above you can see that stack also produces a hierarchical, multi-level row index. The output of the print df.stack().index statement:
----------------------------------------
MultiIndex(levels=[[u'Donald Trump', u'George.W.Bush', u'William Clinton'], [u'course', u'score']],
labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'name', None]) # df stack() index
----------------------------------------
also shows that the result is a hierarchical, multi-level index.
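Since the stacked result is a Series with a two-level MultiIndex, values can be selected with full index tuples or by the outer label alone. A small sketch, again with a shortened version of the same data:

import pandas as pd
tfboys = [['Donald Trump', 'English', 65], ['Donald Trump', 'Math', 63],
          ['George.W.Bush', 'English', 75]]
df = pd.DataFrame(tfboys, columns = ["name", "course", "score"])
df.set_index("name", inplace = True)
stacked = df.stack()
# full (outer label, inner label) tuple; duplicate outer labels give back a small Series
print stacked[("Donald Trump", "course")], "# the course values recorded for Donald Trump"
# partial indexing on the outer level alone
print stacked["George.W.Bush"], "# course and score entries for George.W.Bush"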
- The unstack function pivots the rows of a hierarchically indexed DataFrame back into columns; it does the opposite of stack.
import pandas as pd
import numpy as np
tfboys = [['Donald Trump', 'English',65],['Donald Trump','Math',63],['Donald Trump','Politics',56],
['George.W.Bush', 'English',75],['George.W.Bush', 'Math',43],['George.W.Bush', 'Politics',36],
['William Clinton', 'English',55],['William Clinton', 'Math',58],['William Clinton', 'Politics',45]]
df = pd.DataFrame(tfboys,columns = ["name", "course", "score"])
print "-" * 40
df.set_index("name", inplace = True)
df.columns.name = "info"
print df, "# df after set_index"
print "-" * 40
ret = df.stack()
print ret, "# df stack()"
print "-" * 40
print ret.unstack(), "# df stack() unstack()"
print "-" * 40
Output:
----------------------------------------
info course score
name
Donald Trump English 65
...
William Clinton Politics 45 # df after set_index
----------------------------------------
name info
Donald Trump course English
score 65
course Math
score 63
course Politics
score 56
George.W.Bush course English
score 75
course Math
score 43
course Politics
score 36
William Clinton course English
score 55
course Math
score 58
course Politics
score 45
dtype: object # df stack()
----------------------------------------
Traceback (most recent call last):
File "pda1901.py", line 15, in <module>
print ret.unstack(), "# df stack() unstack()"
File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 2899, in unstack
return unstack(self, level, fill_value)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/reshape.py", line 501, in unstack
constructor=obj._constructor_expanddim)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/reshape.py", line 137, in __init__
self._make_selectors()
File "/usr/local/lib/python2.7/dist-packages/pandas/core/reshape/reshape.py", line 175, in _make_selectors
raise ValueError('Index contains duplicate entries, '
ValueError: Index contains duplicate entries, cannot reshape
Running unstack raised the error ValueError: Index contains duplicate entries, cannot reshape, which says the data cannot be reshaped because the index contains duplicates. Where are the duplicates?
name info
Donald Trump course English
score 65
course Math
score 63
course Politics
score 56
George.W.Bush course English
Under the inner level info, every name has course and score three times. The cause is df.set_index("name", inplace = True): with the name column as the row index, Donald Trump has scores for three courses, so after stacking, index pairs such as (Donald Trump, course) each appear three times and unstack cannot decide which value belongs in which cell. What to do? Drop name as the index and keep the default integer index, which is unique.
import pandas as pd
import numpy as np
tfboys = [['Donald Trump', 'English',65],['Donald Trump','Math',63],['Donald Trump','Politics',56],
['George.W.Bush', 'English',75],['George.W.Bush', 'Math',43],['George.W.Bush', 'Politics',36],
['William Clinton', 'English',55],['William Clinton', 'Math',58],['William Clinton', 'Politics',45]]
df = pd.DataFrame(tfboys,columns = ["name", "course", "score"])
print "-" * 40
#df.set_index("name", inplace = True)
df.columns.name = "info"
print df, "# df after set_index"
print "-" * 40
ret = df.stack()
print ret, "# df stack()"
print "-" * 40
print ret.unstack(), "# df stack() unstack()"
print "-" * 40
Output:
----------------------------------------
info name course score
0 Donald Trump English 65
1 Donald Trump Math 63
2 Donald Trump Politics 56
3 George.W.Bush English 75
4 George.W.Bush Math 43
5 George.W.Bush Politics 36
6 William Clinton English 55
7 William Clinton Math 58
8 William Clinton Politics 45 # df without set_index
----------------------------------------
info
0 name Donald Trump
course English
score 65
1 name Donald Trump
course Math
score 63
2 name Donald Trump
course Politics
score 56
3 name George.W.Bush
course English
score 75
4 name George.W.Bush
course Math
score 43
5 name George.W.Bush
course Politics
score 36
6 name William Clinton
course English
score 55
7 name William Clinton
course Math
score 58
8 name William Clinton
course Politics
score 45
dtype: object # df stack()
----------------------------------------
info name course score
0 Donald Trump English 65
1 Donald Trump Math 63
2 Donald Trump Politics 56
3 George.W.Bush English 75
4 George.W.Bush Math 43
5 George.W.Bush Politics 36
6 William Clinton English 55
7 William Clinton Math 58
8 William Clinton Politics 45 # df stack() unstack()
----------------------------------------
In other words, a DataFrame that has been stacked into a hierarchical index can only be reshaped back with unstack if the index entries, the outer label combined with the inner level (here the info level produced by stack()), are unique.
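For comparison, here is a minimal sketch (on a shortened version of the score data) where unstack does succeed: with both name and course in the index, as in 19.1, every (name, course) pair is unique, so the inner level can be unstacked.

import pandas as pd
tfboys = [['Donald Trump', 'English', 65], ['Donald Trump', 'Math', 63],
          ['George.W.Bush', 'English', 75], ['George.W.Bush', 'Math', 43]]
df = pd.DataFrame(tfboys, columns = ["name", "course", "score"])
df.set_index(["name", "course"], inplace = True)
# every (name, course) pair occurs exactly once, so the inner level can be pivoted to columns
print df.unstack("course"), "# one row per name, one score column per course"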
19.3 The pivot function
The pivot function makes it easy to build a pivot table with a chosen layout. Pivot tables are so called because their layout can be rearranged dynamically, so the data can be analyzed in different ways by reorganizing the rows, column labels, and page fields. Every time the layout changes, the pivot table immediately recomputes the data for the new arrangement; and if the underlying data changes, the pivot table can be refreshed.
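Before turning to the stock data, a minimal sketch of pivot on the toy score data from earlier shows the basic call: index, columns, and values name the three roles a pivot table needs.

import pandas as pd
scores = [['Donald Trump', 'English', 65], ['Donald Trump', 'Math', 63],
          ['George.W.Bush', 'English', 75], ['George.W.Bush', 'Math', 43]]
df = pd.DataFrame(scores, columns = ["name", "course", "score"])
# rows labelled by name, one column per course, cells filled from score
print df.pivot(index = "name", columns = "course", values = "score"), "# pivot"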
19.3.1 JD stock price data
To demonstrate what pivot does, first download JD's stock trading data from Nasdaq: open the quote page and click the download link at the bottom to get the historical trading records, and then we will look at what pivot can do. This data set was already encountered in the earlier NumPy chapters.
- URL of the JD trading data page on the Nasdaq website
- Screenshot of the JD trading data page
- The downloaded file: HistoricalQuotes is in CSV format, as shown in the figure.
- Reading the data file: HistoricalQuotes.csv is saved in the same directory as the code below. Since the downloaded file is a .csv file, it can be read into a DataFrame with functions covered earlier; here the read_table function is used.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
print "-" * 60
Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
The output shows that the data file has several columns recording the trading date, closing price, volume, opening price, daily high, and daily low.
19.3.2 A pivot table of the JD stock data
The previous subsection downloaded the data file and read it into a DataFrame. The task of this subsection is to reshape and rearrange the data into a layout convenient for analysis. For now pivot is left aside, to see whether the functions already covered can achieve what it does.
1). Reversing the data order
Open HistoricalQuotes.csv yourself and you will see that the most recent trading date is on the first row and the oldest is on the last. We want the rows ordered chronologically rather than in reverse. Since df currently has an increasing integer row index, sort_index can be used to reverse the whole data set.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
df.sort_index(ascending=False, inplace=True)
print "-" * 60
print df[:10], "# after sort on index"
print "-" * 60
Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
date close volume open high low
60 2018/12/11 20.91 13234290.0 20.95 21.3400 20.82
59 2018/12/12 21.89 19641470.0 21.30 22.3150 21.27
58 2018/12/13 22.71 25465660.0 22.00 23.3500 22.00
57 2018/12/14 22.17 14866570.0 22.04 22.9400 22.01
56 2018/12/17 21.35 16522610.0 22.04 22.2600 21.17
55 2018/12/18 21.04 14253720.0 21.47 21.8369 20.91
54 2018/12/19 20.17 15254470.0 20.95 21.4850 20.00
53 2018/12/20 19.91 17855170.0 20.09 20.2100 19.64
52 2018/12/21 21.08 49463160.0 20.09 21.9900 19.52
51 2018/12/24 19.75 28037350.0 21.49 21.5100 19.26 # after sort on index
------------------------------------------------------------
The rows in df are now reversed with respect to date, so the first record in df is the earliest trading date, which is probably the order most readers expect.
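Sorting on the integer index only works because the file happens to arrive in strict reverse-chronological order. A sketch of a more direct alternative, assuming the zero-padded YYYY/MM/DD date strings shown above (so that string order matches chronological order), is to sort on the date column itself:

import pandas as pd
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
# "2019/03/11"-style strings sort the same way as the dates they encode
df.sort_values("date", ascending = True, inplace = True)
print df[:5], "# sorted chronologically by the date column"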
2). Using the trading date as the row index
If you do not want integers as the labels for each record, the set_index function can designate a column as the row index of the resulting DataFrame.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
df.sort_index(ascending=False, inplace=True)
print "-" * 60
print df[:10], "# after sort on index"
print "-" * 60
print df.set_index("date")[:10], "# after set index on date"
print "-" * 60
Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
date close volume open high low
60 2018/12/11 20.91 13234290.0 20.95 21.3400 20.82
59 2018/12/12 21.89 19641470.0 21.30 22.3150 21.27
58 2018/12/13 22.71 25465660.0 22.00 23.3500 22.00
57 2018/12/14 22.17 14866570.0 22.04 22.9400 22.01
56 2018/12/17 21.35 16522610.0 22.04 22.2600 21.17
55 2018/12/18 21.04 14253720.0 21.47 21.8369 20.91
54 2018/12/19 20.17 15254470.0 20.95 21.4850 20.00
53 2018/12/20 19.91 17855170.0 20.09 20.2100 19.64
52 2018/12/21 21.08 49463160.0 20.09 21.9900 19.52
51 2018/12/24 19.75 28037350.0 21.49 21.5100 19.26 # after sort on index
------------------------------------------------------------
close volume open high low
date
2018/12/11 20.91 13234290.0 20.95 21.3400 20.82
2018/12/12 21.89 19641470.0 21.30 22.3150 21.27
2018/12/13 22.71 25465660.0 22.00 23.3500 22.00
2018/12/14 22.17 14866570.0 22.04 22.9400 22.01
2018/12/17 21.35 16522610.0 22.04 22.2600 21.17
2018/12/18 21.04 14253720.0 21.47 21.8369 20.91
2018/12/19 20.17 15254470.0 20.95 21.4850 20.00
2018/12/20 19.91 17855170.0 20.09 20.2100 19.64
2018/12/21 21.08 49463160.0 20.09 21.9900 19.52
2018/12/24 19.75 28037350.0 21.49 21.5100 19.26 # after set index on date
------------------------------------------------------------
This may already be what some data analysts want when examining the data.
3). Building the week-open and week-close dates
At this point there is a row for every trading day. If we only care about JD when the market opens the week on Monday and closes it on Friday, the date column has to be filtered so that only Mondays and Fridays remain. How can that be done?
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
df.sort_index(ascending=False, inplace=True)
print "-" * 60
print df[:10], "# after sort on index"
print "-" * 60
st = pd.Timestamp(df.iloc[0, 0])
date_mon = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-MON')
date_fri = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-FRI')
exchange_day = date_mon.union(date_fri)
print exchange_day, "# exchange_day"
sel_day = exchange_day.strftime("%Y/%m/%d")
print "-" * 60
print sel_day, "# sel_day"
print "-" * 60
Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
date close volume open high low
60 2018/12/11 20.91 13234290.0 20.95 21.3400 20.82
59 2018/12/12 21.89 19641470.0 21.30 22.3150 21.27
58 2018/12/13 22.71 25465660.0 22.00 23.3500 22.00
57 2018/12/14 22.17 14866570.0 22.04 22.9400 22.01
56 2018/12/17 21.35 16522610.0 22.04 22.2600 21.17
55 2018/12/18 21.04 14253720.0 21.47 21.8369 20.91
54 2018/12/19 20.17 15254470.0 20.95 21.4850 20.00
53 2018/12/20 19.91 17855170.0 20.09 20.2100 19.64
52 2018/12/21 21.08 49463160.0 20.09 21.9900 19.52
51 2018/12/24 19.75 28037350.0 21.49 21.5100 19.26 # after sort on index
------------------------------------------------------------
DatetimeIndex(['2018-12-14', '2018-12-17', '2018-12-21', '2018-12-24',
'2018-12-28', '2018-12-31', '2019-01-04', '2019-01-07',
'2019-01-11', '2019-01-14', '2019-01-18', '2019-01-21',
'2019-01-25', '2019-01-28', '2019-02-01', '2019-02-04',
'2019-02-08', '2019-02-11', '2019-02-15', '2019-02-18',
'2019-02-22', '2019-02-25', '2019-03-01', '2019-03-04',
'2019-03-08', '2019-03-11'],
dtype='datetime64[ns]', freq=None) # exchange_day
------------------------------------------------------------
Index([u'2018/12/14', u'2018/12/17', u'2018/12/21', u'2018/12/24',
u'2018/12/28', u'2018/12/31', u'2019/01/04', u'2019/01/07',
u'2019/01/11', u'2019/01/14', u'2019/01/18', u'2019/01/21',
u'2019/01/25', u'2019/01/28', u'2019/02/01', u'2019/02/04',
u'2019/02/08', u'2019/02/11', u'2019/02/15', u'2019/02/18',
u'2019/02/22', u'2019/02/25', u'2019/03/01', u'2019/03/04',
u'2019/03/08', u'2019/03/11'],
dtype='object') #sel_day
------------------------------------------------------------
The variable exchange_day holds the Monday and Friday dates. The date_range function used here was introduced in the chapter on creating time series and in the two sections on date_range parameters. Because date_range produces a DatetimeIndex while the DataFrame's row index consists of date strings, it cannot be used directly with loc to pick out the rows indexed by those dates; therefore exchange_day (a DatetimeIndex) is converted with strftime into sel_day (a plain Index of strings).
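An alternative sketch, assuming the YYYY/MM/DD string format shown above, is to convert the DataFrame's index to real timestamps with pd.to_datetime instead of converting exchange_day to strings; then the DatetimeIndex can be used with loc directly.

import pandas as pd
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
df.set_index("date", inplace = True)
# parse the "2019/03/11"-style string index into a real DatetimeIndex
df.index = pd.to_datetime(df.index, format = "%Y/%m/%d")
st = df.index.min()
date_mon = pd.date_range(start = st, end = st + DateOffset(months = 3), freq = 'W-MON')
date_fri = pd.date_range(start = st, end = st + DateOffset(months = 3), freq = 'W-FRI')
exchange_day = date_mon.union(date_fri)
# both sides are DatetimeIndex objects now, so no strftime conversion is needed;
# intersection() also drops the holidays that have no trading row
print df.loc[df.index.intersection(exchange_day)], "# Mondays and Fridays"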
4). Extracting a few columns
The JD trading data file HistoricalQuotes.csv has several columns. If only a few of them matter, say the volume, the daily high, and the daily low, the columns parameter of reindex can be used to extract and restrict the columns in the output.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
df.sort_index(ascending=False, inplace=True)
print "-" * 60
print df[:10], "# after sort on index"
print "-" * 60
st = pd.Timestamp(df.iloc[0, 0])
date_mon = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-MON')
date_fri = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-FRI')
exchange_day = date_mon.union(date_fri)
#print exchange_day, "# exchange_day"
sel_day = exchange_day.strftime("%Y/%m/%d")
#print "-" * 60
#print sel_day, "sel_day"
#print "-" * 60
cols = pd.Index(['date', 'volume', 'high','low'], name='item')
print cols, "# index on columns"
print "-" * 60
dd = df.reindex(columns = cols)
print dd[:10], "# reindex to get some columns"
print "-" * 60
Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
date close volume open high low
60 2018/12/11 20.91 13234290.0 20.95 21.3400 20.82
59 2018/12/12 21.89 19641470.0 21.30 22.3150 21.27
58 2018/12/13 22.71 25465660.0 22.00 23.3500 22.00
57 2018/12/14 22.17 14866570.0 22.04 22.9400 22.01
56 2018/12/17 21.35 16522610.0 22.04 22.2600 21.17
55 2018/12/18 21.04 14253720.0 21.47 21.8369 20.91
54 2018/12/19 20.17 15254470.0 20.95 21.4850 20.00
53 2018/12/20 19.91 17855170.0 20.09 20.2100 19.64
52 2018/12/21 21.08 49463160.0 20.09 21.9900 19.52
51 2018/12/24 19.75 28037350.0 21.49 21.5100 19.26 # after sort on index
------------------------------------------------------------
Index([u'date', u'volume', u'high', u'low'], dtype='object', name=u'item') # index on columns
------------------------------------------------------------
item date volume high low
60 2018/12/11 13234290.0 21.3400 20.82
59 2018/12/12 19641470.0 22.3150 21.27
58 2018/12/13 25465660.0 23.3500 22.00
57 2018/12/14 14866570.0 22.9400 22.01
56 2018/12/17 16522610.0 22.2600 21.17
55 2018/12/18 14253720.0 21.8369 20.91
54 2018/12/19 15254470.0 21.4850 20.00
53 2018/12/20 17855170.0 20.2100 19.64
52 2018/12/21 49463160.0 21.9900 19.52
51 2018/12/24 28037350.0 21.5100 19.26 # reindex to get some columns
------------------------------------------------------------
The output of the dd = df.reindex(columns = cols) statement shows that only four columns of the original file were kept: date, volume, high, and low.
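Plain label-based column selection would give a similar frame; the reindex route is used here mainly because pd.Index(..., name='item') also names the column axis. A minimal sketch of the alternative:

import pandas as pd
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
dd = df[["date", "volume", "high", "low"]]   # keep only these four columns
dd.columns.name = "item"                     # same axis name that pd.Index(..., name='item') supplied
print dd[:5], "# selected columns"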
5). Making date the index
Now we want the date column as the row index instead of the default integer index.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
df.sort_index(ascending=False, inplace=True)
print "-" * 60
print df[:10], "# after sort on index"
print "-" * 60
st = pd.Timestamp(df.iloc[0, 0])
date_mon = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-MON')
date_fri = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-FRI')
exchange_day = date_mon.union(date_fri)
#print exchange_day, "# exchange_day"
sel_day = exchange_day.strftime("%Y/%m/%d")
#print "-" * 60
#print sel_day, "sel_day"
#print "-" * 60
cols = pd.Index(['date', 'volume', 'high','low'], name='item')
#print cols, "# index on columns"
#print "-" * 60
dd = df.reindex(columns = cols)
print dd[:10], "# reindex to get some columns"
print "-" * 60
dd = df.reindex(columns = cols).set_index("date")
print dd[:10], "# reindex set_index ond date"
print "-" * 60
Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
date close volume open high low
60 2018/12/11 20.91 13234290.0 20.95 21.3400 20.82
59 2018/12/12 21.89 19641470.0 21.30 22.3150 21.27
58 2018/12/13 22.71 25465660.0 22.00 23.3500 22.00
57 2018/12/14 22.17 14866570.0 22.04 22.9400 22.01
56 2018/12/17 21.35 16522610.0 22.04 22.2600 21.17
55 2018/12/18 21.04 14253720.0 21.47 21.8369 20.91
54 2018/12/19 20.17 15254470.0 20.95 21.4850 20.00
53 2018/12/20 19.91 17855170.0 20.09 20.2100 19.64
52 2018/12/21 21.08 49463160.0 20.09 21.9900 19.52
51 2018/12/24 19.75 28037350.0 21.49 21.5100 19.26 # after sort on index
------------------------------------------------------------
item date volume high low
60 2018/12/11 13234290.0 21.3400 20.82
59 2018/12/12 19641470.0 22.3150 21.27
58 2018/12/13 25465660.0 23.3500 22.00
57 2018/12/14 14866570.0 22.9400 22.01
56 2018/12/17 16522610.0 22.2600 21.17
55 2018/12/18 14253720.0 21.8369 20.91
54 2018/12/19 15254470.0 21.4850 20.00
53 2018/12/20 17855170.0 20.2100 19.64
52 2018/12/21 49463160.0 21.9900 19.52
51 2018/12/24 28037350.0 21.5100 19.26 # reindex to get some columns
------------------------------------------------------------
item volume high low
date
2018/12/11 13234290.0 21.3400 20.82
2018/12/12 19641470.0 22.3150 21.27
2018/12/13 25465660.0 23.3500 22.00
2018/12/14 14866570.0 22.9400 22.01
2018/12/17 16522610.0 22.2600 21.17
2018/12/18 14253720.0 21.8369 20.91
2018/12/19 15254470.0 21.4850 20.00
2018/12/20 17855170.0 20.2100 19.64
2018/12/21 49463160.0 21.9900 19.52
2018/12/24 28037350.0 21.5100 19.26 # reindex set_index on date
------------------------------------------------------------
6). Extracting the Monday-open and Friday-close trading data
Now the sel_day sequence of Mondays and Fridays built in 3). Building the week-open and week-close dates can be used to extract only the Monday and Friday trading records.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
df.sort_index(ascending=False, inplace=True)
print "-" * 60
print df[:10], "# after sort on index"
print "-" * 60
st = pd.Timestamp(df.iloc[0, 0])
date_mon = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-MON')
date_fri = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-FRI')
exchange_day = date_mon.union(date_fri)
#print exchange_day, "# exchange_day"
sel_day = exchange_day.strftime("%Y/%m/%d")
#print "-" * 60
#print sel_day, "sel_day"
#print "-" * 60
cols = pd.Index(['date', 'volume', 'high','low'], name='item')
#print cols, "# index on columns"
#print "-" * 60
dd = df.reindex(columns = cols)
print dd[:10], "# reindex to get some columns"
print "-" * 60
dd = df.reindex(columns = cols).set_index("date").loc[sel_day]
print dd, "# select Monday and Friday"
print "-" * 60
Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
date close volume open high low
60 2018/12/11 20.91 13234290.0 20.95 21.3400 20.82
59 2018/12/12 21.89 19641470.0 21.30 22.3150 21.27
58 2018/12/13 22.71 25465660.0 22.00 23.3500 22.00
57 2018/12/14 22.17 14866570.0 22.04 22.9400 22.01
56 2018/12/17 21.35 16522610.0 22.04 22.2600 21.17
55 2018/12/18 21.04 14253720.0 21.47 21.8369 20.91
54 2018/12/19 20.17 15254470.0 20.95 21.4850 20.00
53 2018/12/20 19.91 17855170.0 20.09 20.2100 19.64
52 2018/12/21 21.08 49463160.0 20.09 21.9900 19.52
51 2018/12/24 19.75 28037350.0 21.49 21.5100 19.26 # after sort on index
------------------------------------------------------------
item date volume high low
60 2018/12/11 13234290.0 21.3400 20.82
59 2018/12/12 19641470.0 22.3150 21.27
58 2018/12/13 25465660.0 23.3500 22.00
57 2018/12/14 14866570.0 22.9400 22.01
56 2018/12/17 16522610.0 22.2600 21.17
55 2018/12/18 14253720.0 21.8369 20.91
54 2018/12/19 15254470.0 21.4850 20.00
53 2018/12/20 17855170.0 20.2100 19.64
52 2018/12/21 49463160.0 21.9900 19.52
51 2018/12/24 28037350.0 21.5100 19.26 # reindex to get some columns
------------------------------------------------------------
item volume high low
2018/12/14 14866570.0 22.9400 22.0100
2018/12/17 16522610.0 22.2600 21.1700
2018/12/21 49463160.0 21.9900 19.5200
2018/12/24 28037350.0 21.5100 19.2600
2018/12/28 16629620.0 22.0000 21.1500
2018/12/31 19407160.0 22.4800 20.7700
2019/01/04 25132510.0 22.6800 20.7501
2019/01/07 18843420.0 22.8500 22.2600
2019/01/11 10017400.0 23.7500 22.6900
2019/01/14 14120930.0 22.3000 21.5300
2019/01/18 23260730.0 23.8500 22.3000
2019/01/21 NaN NaN NaN
2019/01/25 28015970.0 23.9300 22.4700
2019/01/28 13884910.0 24.1399 23.0600
2019/02/01 11928890.0 25.3100 24.3100
2019/02/04 9046348.0 24.5400 24.1600
2019/02/08 8239639.0 24.2800 23.6000
2019/02/11 9025630.0 24.7300 24.1500
2019/02/15 11468160.0 24.5700 23.9100
2019/02/18 NaN NaN NaN
2019/02/22 27084030.0 25.9600 24.4800
2019/02/25 29263690.0 27.3800 26.0400
2019/03/01 36774330.0 29.3900 27.9700
2019/03/04 23431540.0 29.2000 28.1400
2019/03/08 28540480.0 27.2500 26.3700
2019/03/11 21376730.0 27.8900 27.1600 # select Monday and Friday
------------------------------------------------------------
7). Viewing the data day by day: the stack function can create a hierarchical index that meets this need.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
df.sort_index(ascending=False, inplace=True)
#print "-" * 60
#print df[:10], "# after sort on index"
#print "-" * 60
st = pd.Timestamp(df.iloc[0, 0])
date_mon = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-MON')
date_fri = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-FRI')
exchange_day = date_mon.union(date_fri)
#print exchange_day, "# exchange_day"
sel_day = exchange_day.strftime("%Y/%m/%d")
#print "-" * 60
#print sel_day, "sel_day"
#print "-" * 60
cols = pd.Index(['date', 'volume', 'high','low'], name='item')
#print cols, "# index on columns"
#print "-" * 60
dd = df.reindex(columns = cols)
#print dd[:10], "# reindex to get some columns"
#print "-" * 60
dd = df.reindex(columns = cols).set_index("date").loc[sel_day]
#print dd, "# select Monday and Friday"
print "-" * 60
print dd.stack(), "# stack"
The final output lists, for each date, that day's volume, high, and low. Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
item
2018/12/14 volume 1.486657e+07
high 2.294000e+01
low 2.201000e+01
2018/12/17 volume 1.652261e+07
high 2.226000e+01
low 2.117000e+01
...<middle rows omitted>...
2019/03/08 volume 2.854048e+07
high 2.725000e+01
low 2.637000e+01
2019/03/11 volume 2.137673e+07
high 2.789000e+01
low 2.716000e+01
Length: 72, dtype: float64 # stack
19.3.3 The pivot function
This subsection uses pivot to produce the same pivot-table result as the previous subsection.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset
df = pd.read_table("HistoricalQuotes.csv", sep = ",")
print "-" * 60
print df[:10], "# after read_table"
df.sort_index(ascending=False, inplace=True)
st = pd.Timestamp(df.iloc[0, 0])
date_mon = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-MON')
date_fri = pd.date_range(start= st,end = st + DateOffset(months=3),freq='W-FRI')
exchange_day = date_mon.union(date_fri)
sel_day = exchange_day.strftime("%Y/%m/%d")
cols = pd.Index(['date', 'volume', 'high','low'], name='item')
dd = df.reindex(columns = cols).set_index("date").stack()
print "-" * 60
print dd[:10], "# stack"
print "-" * 60
dd = dd.reset_index()
print dd[:10], "# reset index"
print "-" * 60
dd = dd.rename(columns={0: 'value'})
print dd[:10], "# rename"
p = dd.pivot('date', 'item', 'value')
print "-" * 60
print p.loc[sel_day], "# pivot"
print "-" * 60
Output:
------------------------------------------------------------
date close volume open high low
0 2019/03/11 27.62 21376730.0 27.46 27.890 27.16
1 2019/03/08 27.05 28540480.0 26.71 27.250 26.37
2 2019/03/07 27.84 20908040.0 28.50 28.645 27.60
3 2019/03/06 29.03 20238530.0 29.25 29.660 28.72
4 2019/03/05 29.35 19324880.0 28.88 29.700 28.70
5 2019/03/04 28.89 23431540.0 29.04 29.200 28.14
6 2019/03/01 28.73 36774330.0 28.03 29.390 27.97
7 2019/02/28 27.71 66542550.0 28.58 28.620 26.58
8 2019/02/27 25.97 22903520.0 26.42 26.900 25.87
9 2019/02/26 26.59 20260590.0 25.98 26.820 25.66 # after read_table
------------------------------------------------------------
date item
2018/12/11 volume 1.323429e+07
high 2.134000e+01
low 2.082000e+01
2018/12/12 volume 1.964147e+07
high 2.231500e+01
low 2.127000e+01
2018/12/13 volume 2.546566e+07
high 2.335000e+01
low 2.200000e+01
2018/12/14 volume 1.486657e+07
dtype: float64 # stack
------------------------------------------------------------
date item 0
0 2018/12/11 volume 1.323429e+07
1 2018/12/11 high 2.134000e+01
2 2018/12/11 low 2.082000e+01
3 2018/12/12 volume 1.964147e+07
4 2018/12/12 high 2.231500e+01
5 2018/12/12 low 2.127000e+01
6 2018/12/13 volume 2.546566e+07
7 2018/12/13 high 2.335000e+01
8 2018/12/13 low 2.200000e+01
9 2018/12/14 volume 1.486657e+07 # reset index
------------------------------------------------------------
date item value
0 2018/12/11 volume 1.323429e+07
1 2018/12/11 high 2.134000e+01
2 2018/12/11 low 2.082000e+01
3 2018/12/12 volume 1.964147e+07
4 2018/12/12 high 2.231500e+01
5 2018/12/12 low 2.127000e+01
6 2018/12/13 volume 2.546566e+07
7 2018/12/13 high 2.335000e+01
8 2018/12/13 low 2.200000e+01
9 2018/12/14 volume 1.486657e+07 # rename
------------------------------------------------------------
item high low volume
2018/12/14 22.9400 22.0100 14866570.0
2018/12/17 22.2600 21.1700 16522610.0
2018/12/21 21.9900 19.5200 49463160.0
2018/12/24 21.5100 19.2600 28037350.0
2018/12/28 22.0000 21.1500 16629620.0
2018/12/31 22.4800 20.7700 19407160.0
2019/01/04 22.6800 20.7501 25132510.0
2019/01/07 22.8500 22.2600 18843420.0
2019/01/11 23.7500 22.6900 10017400.0
2019/01/14 22.3000 21.5300 14120930.0
2019/01/18 23.8500 22.3000 23260730.0
2019/01/21 NaN NaN NaN
2019/01/25 23.9300 22.4700 28015970.0
2019/01/28 24.1399 23.0600 13884910.0
2019/02/01 25.3100 24.3100 11928890.0
2019/02/04 24.5400 24.1600 9046348.0
2019/02/08 24.2800 23.6000 8239639.0
2019/02/11 24.7300 24.1500 9025630.0
2019/02/15 24.5700 23.9100 11468160.0
2019/02/18 NaN NaN NaN
2019/02/22 25.9600 24.4800 27084030.0
2019/02/25 27.3800 26.0400 29263690.0
2019/03/01 29.3900 27.9700 36774330.0
2019/03/04 29.2000 28.1400 23431540.0
2019/03/08 27.2500 26.3700 28540480.0
2019/03/11 27.8900 27.1600 21376730.0 # pivot
------------------------------------------------------------
The difference between the two subsections is that the previous one, 6). Extracting the Monday-open and Friday-close trading data, produced the result
------------------------------------------------------------
item volume high low
2018/12/14 14866570.0 22.9400 22.0100
....
2019/03/11 21376730.0 27.8900 27.1600 # select Monday and Friday
------------------------------------------------------------
by setting the date column as the row index, whereas this subsection reaches the same table by stacking into a hierarchical index, flattening it with reset_index back to an integer row index, and then calling pivot.
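The two routes are in fact close relatives: pivot is essentially shorthand for setting the two label columns as a MultiIndex and then unstacking the inner level. A sketch of that equivalence, reusing the dd and sel_day variables from the block above:

# equivalent to dd.pivot('date', 'item', 'value'):
p2 = dd.set_index(["date", "item"])["value"].unstack("item")
print p2.loc[sel_day], "# same Monday/Friday table as the pivot result"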