Pandas

Author: D_Major | Published: 2019-04-10 20:38

    1. Reading a CSV file with pandas

    import pandas as pd
    
    df = pd.read_csv(filename)    # filename: path to the CSV file
    print(df.head())    # first 5 rows
    print(df.tail())    # last 5 rows
    print(df.tail(n))    # last n rows
    print(df[10:21])   # slice: rows 10 through 20
    print(df['Close'].max())    # highest closing price
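    The calls above can be tried without a real file. The sketch below is a minimal example that assumes a small made-up price table (the values and the csv_text name are hypothetical) and reads it from memory with io.StringIO instead of from disk.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real price file.
csv_text = """Date,Open,Close
2010-01-04,10.0,10.5
2010-01-05,10.5,11.2
2010-01-06,11.2,10.9
2010-01-07,10.9,11.8
2010-01-08,11.8,11.5
2010-01-11,11.5,12.1
"""

df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(2)        # first 2 rows
last_rows = df.tail(3)         # last 3 rows
middle_rows = df[1:4]          # rows at positions 1, 2 and 3 (end is exclusive)
max_close = df['Close'].max()  # highest value in the Close column

print(first_rows)
print(last_rows)
print(middle_rows)
print(max_close)
```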
    

    2. Plotting with pandas

    """Plot High prices for IBM"""
    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    def test_run():
        df = pd.read_csv("data/IBM.csv")
        df['High'].plot()
        # df[['Close', 'Adj Close']].plot()  # plot two columns
        plt.xlabel('Time')
        plt.ylabel('High')
        plt.title('High prices for IBM')
        plt.show()  # must be called to show plots
    
    
    if __name__ == "__main__":
        test_run()
    

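    The rows in this CSV run from the newest date at the top to the oldest at the bottom, so they are read in reverse chronological order. Plotted against the default integer index, the curve is therefore mirrored in time (here it appears to trend downward), and the row order needs to be reversed before plotting.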
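    One way to do the reversal is sketched below. It assumes a tiny made-up table whose newest date comes first (the values are hypothetical), and shows two options: sorting by a parsed date index, or simply flipping the row order.

```python
import io
import pandas as pd

# Hypothetical data with the newest date in the first row.
csv_text = """Date,High
2010-01-06,12.0
2010-01-05,11.0
2010-01-04,10.0
"""

df = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

df_sorted = df.sort_index()   # order rows by date, oldest first
df_flipped = df.iloc[::-1]    # or just reverse the existing row order

print(df_sorted)
```

    sort_index() is the safer choice when the file order is not known in advance; iloc[::-1] only helps when the rows are already strictly newest-first.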

    3. Building and joining DataFrames with df.join()

    • Create a date range to use as the index
    start_date = '2010-01-22'
    end_date = '2010-01-26'
    dates = pd.date_range(start_date, end_date)
    df1 = pd.DataFrame(index=dates)  # without index=, the default integer index 0, 1, 2, ... is used
    

    date_range() returns a DatetimeIndex, a sequence of timestamps [2010-01-22, ..., 2010-01-26]:
    In: dates[0]
    Out: 2010-01-22 00:00:00 # the time-of-day part can be ignored for daily stock data
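    A short check of what date_range() produces, as a sketch using the same two dates as above. It also shows that the range covers every calendar day, weekends included, which is why the join later produces NaN rows.

```python
import pandas as pd

dates = pd.date_range('2010-01-22', '2010-01-26')
df1 = pd.DataFrame(index=dates)  # no columns yet, only the date index

print(type(dates).__name__)   # the index type
print(dates[0])               # first timestamp in the range
print(list(dates.dayofweek))  # Monday=0 ... Sunday=6
print(df1.shape)
```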

    Joining the DataFrames:

    dfSPY = pd.read_csv("data/SPY.csv", index_col = "Date", 
                        parse_dates = True, usecols = ['Date', 'Adj Close'],
                        na_values = ['nan'])
    df1 = df1.join(dfSPY)
    

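    join() performs a left join by default: every row of df1 is kept, rows whose index also appears in dfSPY receive its values, and the remaining rows are filled with NaN.
    When SPY.csv is read without further options it gets a default integer index, so the index column has to be named with index_col, and parse_dates=True makes pandas parse that column as datetimes. na_values=['nan'] tells pandas to treat the string 'nan' in the file as a missing value rather than as ordinary text (pandas already recognises 'nan' by default, so this only makes the intent explicit).
    After the join, dropna() removes the rows of df1 that contain NaN: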

    df1 = df1.dropna()
    

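    The two steps can also be done in one statement by passing how='inner', which keeps only the intersection of the two indexes. The author observed that the dates then came out in reverse order, which may depend on the pandas version.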

    df1 = df1.join(dfSPY, how = 'inner')
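    The two approaches can be compared side by side. The sketch below assumes a made-up SPY table with three trading days (the prices and the spy_csv name are hypothetical). Because the row order of the inner join is the point in question, the example sorts the result explicitly rather than relying on it.

```python
import io
import pandas as pd

# Hypothetical SPY data, newest date first, with an extra column we do not need.
spy_csv = """Date,Adj Close,Volume
2010-01-26,109.77,1000
2010-01-25,110.21,1100
2010-01-22,109.21,1200
"""

dates = pd.date_range('2010-01-22', '2010-01-26')
df1 = pd.DataFrame(index=dates)

dfSPY = pd.read_csv(io.StringIO(spy_csv), index_col='Date',
                    parse_dates=True, usecols=['Date', 'Adj Close'],
                    na_values=['nan'])

left_joined = df1.join(dfSPY)          # keeps all 5 dates, NaN on the weekend
left_then_drop = left_joined.dropna()  # removes the NaN rows

inner = df1.join(dfSPY, how='inner')   # only dates present in both frames
inner_sorted = inner.sort_index()      # make the ascending order explicit

print(left_joined)
print(inner_sorted)
```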
    

    Appendix: the how parameter of df.join():
    how : {'left', 'right', 'outer', 'inner'}, default 'left'
    How to handle the operation of the two objects.
    left: use the calling frame's index (or column if on is specified).
    right: use other's index (the frame passed as the argument).
    outer: form the union of the calling frame's index (or column if on is specified) with other's index, and sort it lexicographically.
    inner: form the intersection of the calling frame's index (or column if on is specified) with other's index, preserving the order of the calling frame's index.
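    The four options are easiest to tell apart by which index labels survive. A minimal sketch with two small made-up frames (the names left and right and their values are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
right = pd.DataFrame({'b': [10, 20, 30]}, index=['y', 'z', 'w'])

# Join the same two frames with each value of how and collect the resulting labels.
labels = {how: set(left.join(right, how=how).index)
          for how in ['left', 'right', 'outer', 'inner']}

for how, kept in labels.items():
    print(how, sorted(kept))
```

    Labels that exist on only one side get NaN in the columns coming from the other side, except with how='inner', where such labels are dropped.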

    When several CSV files are read in a loop, their 'Adj Close' columns would clash, so each one is renamed to the ticker symbol of its file:

    df_temp = df_temp.rename(columns={'Adj Close': symbol})
    df = df.join(df_temp)
    
    [Figure: renaming DataFrame columns]
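    The clash and the fix can both be reproduced in a few lines. This sketch assumes two made-up in-memory CSV tables keyed by ticker (the csv_by_symbol dictionary and its prices are hypothetical stand-ins for files on disk).

```python
import io
import pandas as pd

# Hypothetical per-symbol CSV contents standing in for files on disk.
csv_by_symbol = {
    'SPY': "Date,Adj Close\n2010-01-22,109.21\n2010-01-25,109.77\n",
    'IBM': "Date,Adj Close\n2010-01-22,125.50\n2010-01-25,126.12\n",
}

def read_prices(text):
    """Parse one CSV string into a frame indexed by date."""
    return pd.read_csv(io.StringIO(text), index_col='Date', parse_dates=True)

dates = pd.date_range('2010-01-22', '2010-01-25')

# Without renaming, the second join fails because both frames have 'Adj Close'.
clash_detected = False
try:
    pd.DataFrame(index=dates).join(read_prices(csv_by_symbol['SPY'])) \
                             .join(read_prices(csv_by_symbol['IBM']))
except ValueError:
    clash_detected = True

# With renaming, each symbol gets its own column.
df = pd.DataFrame(index=dates)
for symbol, text in csv_by_symbol.items():
    df_temp = read_prices(text).rename(columns={'Adj Close': symbol})
    df = df.join(df_temp)

print(clash_detected)
print(df)
```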

    When SPY is used as the reference, the dates on which it did not trade have to be dropped:

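    # TODO: when dropping the rows where SPY is NaN, keep how='left'; with how='inner'
    # the author saw the date order reversed, whereas dropna(subset=...) keeps it ascending
    df = df.join(df_temp, how='left')
    if symbol == 'SPY':
        df = df.dropna(subset=['SPY'])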
    

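    The New York Stock Exchange has about 252 trading days a year. SPY trades on every day the market is open, that is, every day except weekends and holidays, so it is commonly used as the date reference.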
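    The difference between dropna(subset=['SPY']) and a plain dropna() matters once other symbols are present. A minimal sketch, assuming a made-up frame in which IBM has one gap on a day when SPY did trade (all prices are hypothetical):

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2010-01-22', '2010-01-26')
df = pd.DataFrame({
    'SPY': [109.21, np.nan, np.nan, 109.77, 109.31],  # NaN on the weekend
    'IBM': [125.50, np.nan, np.nan, np.nan, 125.75],  # also missing on Monday
}, index=dates)

spy_days = df.dropna(subset=['SPY'])  # drop only the dates SPY did not trade
all_complete = df.dropna()            # drop every row that has any NaN

print(spy_days)
print(all_complete)
```

    With subset=['SPY'] the Monday row is kept, so the gap in IBM stays visible instead of silently disappearing.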

    4. Slicing a DataFrame

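    • Row slicing (.ix was removed in pandas 1.0; use .loc for label-based slicing)
    df.loc['2010-01-01' : '2010-01-31']
    
    • Column slicing
    df['GOOG']  # a single column
    df[['IBM', 'GLD']]  # several columns: pass a list
    
    • Slicing rows and columns together
    df.loc['2010-01-01' : '2010-01-31', ['IBM', 'GLD']]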
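    The slices can be checked on a small synthetic frame. The sketch below assumes made-up prices (simple counters, not real data) and uses .loc, whose label slices include both end points.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2010-01-01', '2010-02-28')
count = len(dates)

# Hypothetical prices: plain counters so the slices are easy to inspect.
df = pd.DataFrame({
    'GOOG': np.arange(count, dtype=float),
    'IBM': np.arange(count, dtype=float) + 100,
    'GLD': np.arange(count, dtype=float) + 200,
}, index=dates)

january = df.loc['2010-01-01':'2010-01-31']                 # rows by date label
goog = df['GOOG']                                           # one column, a Series
ibm_gld = df[['IBM', 'GLD']]                                # several columns, a DataFrame
window = df.loc['2010-01-10':'2010-01-20', ['IBM', 'GLD']]  # rows and columns together

print(january.shape, window.shape)
```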
    

    5. Charts

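    • Normalizing the data:
    df1 = df1 / df1.iloc[0]  # divide by the first row, so every curve starts at 1.0
    

    The purpose of normalization is to put all stocks on the same scale so that their relative movements stand out.
    Note: df1[0] would look up a column labelled 0, so the first row has to be selected by position:
    df = df / df.iloc[0]
    or, more explicitly:
    df = df / df.iloc[0, :]
    This vectorized division runs in compiled (C-level) code, so it is faster than looping over every row and column of the table in Python.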
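    To make the comparison concrete, the sketch below normalizes a small made-up frame both ways (the prices are hypothetical). The explicit double loop is shown only to illustrate what the one-line division replaces.

```python
import pandas as pd

df = pd.DataFrame({
    'SPY': [100.0, 110.0, 105.0],
    'IBM': [50.0, 55.0, 60.0],
}, index=pd.date_range('2010-01-04', periods=3))

# Vectorized: divide every row by the first row in a single operation.
normalized = df / df.iloc[0]

# Equivalent Python-level double loop, cell by cell.
looped = df.copy()
for col_pos, col in enumerate(df.columns):
    for row_pos in range(len(df)):
        looped.iloc[row_pos, col_pos] = df[col].iloc[row_pos] / df[col].iloc[0]

print(normalized)
```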

    • Plotting the DataFrame with matplotlib.pyplot:
    import matplotlib.pyplot as plt
    def plot_data(df, title="Stock prices"):
        """Plot stock prices with a custom title and meaningful axis labels."""
        ax = df.plot(title=title, fontsize=12)
        ax.set_xlabel("Date")
        ax.set_ylabel("Price")
        plt.show()
    

    Note: set_xlabel() is a method of the object returned by df.plot() (a matplotlib Axes), which can be thought of as a handle to the chart.
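    The returned handle can be inspected directly. This is a minimal sketch with made-up normalized prices; it assumes the non-interactive "Agg" backend so that it runs without a display, and it closes the figure instead of calling plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical normalized prices.
df = pd.DataFrame({
    'SPY': [1.00, 1.02, 1.01],
    'IBM': [1.00, 0.99, 1.03],
}, index=pd.date_range('2010-01-04', periods=3))

ax = df.plot(title="Stock prices", fontsize=12)  # returns a matplotlib Axes
ax.set_xlabel("Date")
ax.set_ylabel("Price")

print(type(ax).__name__, ax.get_xlabel(), ax.get_ylabel())
plt.close(ax.get_figure())  # in an interactive session, call plt.show() instead
```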

    The complete code:

    """Slice and plot"""
    
    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    
    def plot_selected(df, columns, start_index, end_index):
        """Plot the desired columns over index values in the given range."""
        plot_data(df.loc[start_index : end_index, columns], title="Selected data")
    
    def symbol_to_path(symbol, base_dir="data"):
        """Return CSV file path given ticker symbol."""
        return os.path.join(base_dir, "{}.csv".format(str(symbol)))
    
    def get_data(symbols, dates):
        """Read stock data (adjusted close) for given symbols from CSV files."""
        df = pd.DataFrame(index=dates)
        if 'SPY' not in symbols:  # add SPY for reference, if absent
            symbols.insert(0, 'SPY')
    
        for symbol in symbols:
            df_temp = pd.read_csv(symbol_to_path(symbol), index_col='Date',
                    parse_dates=True, usecols=['Date', 'Adj Close'], na_values=['nan'])
            df_temp = df_temp.rename(columns={'Adj Close': symbol})
            df = df.join(df_temp)
            if symbol == 'SPY':  # drop dates SPY did not trade
                df = df.dropna(subset=["SPY"])
    
        return df
    
    def plot_data(df, title="Stock prices"):
        """Plot stock prices with a custom title and meaningful axis labels."""
        ax = df.plot(title=title, fontsize=12)
        ax.set_xlabel("Date")
        ax.set_ylabel("Price")
        plt.show()
    
    def normalize_data(df):
        """Normalize stock prices using the first row of the dataframe"""
        return df / df.iloc[0, :]
    
    def test_run():
        # Define a date range
        dates = pd.date_range('2010-01-01', '2010-12-31')
        # Choose stock symbols to read
        symbols = ['GOOG', 'IBM', 'GLD']  # SPY will be added in get_data()
        # Get stock data
        df = get_data(symbols, dates)
        # Normalize stock prices
        df = normalize_data(df)  
        # Slice and plot
        plot_selected(df, ['SPY', 'IBM'], '2010-03-01', '2010-04-01')
    
    if __name__ == "__main__":
        test_run()
    


          Article link: https://www.haomeiwen.com/subject/omxaiqtx.html