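- On Win32 and Linux, the file-listing commands (dir, ls) sort file names with digits before letters by default. Win32 is case-insensitive, so the ordering of letters needs no further comment. On Linux, letters are not ordered as in the ASCII table: the upper- and lower-case forms of the same letter are grouped together, with the uppercase letter immediately after the lowercase one. Python's os module, however, returns different results on the two platforms.
Continued from the previous post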
4. Parsing data
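Unzip the ZIP archive into a suitable path. In PyCharm, create a new Project and select the Python interpreter that ships with Anaconda. The script and its output are as follows: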
import pandas as pd
import os
import time
from datetime import datetime

path = "/home/sum/share/Ubuntu_DeepLearning/intraQuarter"  # cd path & pwd

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    # os.walk() does not guarantee any order on Linux, so sort the directories
    stock_list = sorted([x[0] for x in os.walk(statspath)])
    #print(stock_list)
    # stock_list[0] is statspath itself; the rest are the per-ticker directories
    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        #print(each_file)
        #time.sleep(15)
        if len(each_file) > 0:
            for file in each_file:
                # file names have the form YYYYmmddHHMMSS.html
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                print(date_stamp, unix_time)
                #time.sleep(15)

Key_Stats()
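- This part obtains the file paths and file names; each file name is a timestamp accurate to the second.
- Because of the file-ordering issue in the os module mentioned in the reference, the list is passed through sorted() before continuing.
Run output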
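The two ideas above (sorting the directory listing, then turning each file name into a timestamp) can be tried in isolation. The following is a minimal sketch, not the tutorial's code: the helper name `parse_snapshot_names` and the two file names are made up for illustration, and the files are created in a temporary directory so that the real data set is not needed.

```python
import os
import tempfile
from datetime import datetime

def parse_snapshot_names(directory):
    """Return (file name, datetime) pairs for a directory, in sorted order."""
    parsed = []
    # os.listdir() returns names in arbitrary order, so sort them explicitly
    for name in sorted(os.listdir(directory)):
        parsed.append((name, datetime.strptime(name, '%Y%m%d%H%M%S.html')))
    return parsed

# Hypothetical file names, created empty in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['20050301120000.html', '20040130190102.html']:
        open(os.path.join(tmp, name), 'w').close()
    snapshots = parse_snapshot_names(tmp)

for name, stamp in snapshots:
    print(name, stamp)
```

Because the names are fixed-width digit strings, sorting them as text also sorts them chronologically.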
5. More Parsing
import pandas as pd
import os
import time
from datetime import datetime

path = "/home/sum/share/Ubuntu_DeepLearning/intraQuarter"  # cd path & pwd

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = sorted([x[0] for x in os.walk(statspath)])  # in Linux use sorted() func
    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        #print(each_file)
        ticker = each_dir.split("/")[-1]  # in Linux use '/'
        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                #print(date_stamp, unix_time)
                full_file_path = each_dir+'/'+file
                #print(full_file_path)
                # read the page and close the file handle afterwards
                with open(full_file_path, 'r') as f:
                    source = f.read()
                # the label is followed by </td> or </th>, possibly with a \n,
                # so split on ':' only and then split a second time
                value = source.split(gather+':')
                if 1 < len(value):
                    value = value[1].split('<td class="yfnc_tabledata1">')[1].split('</td>')[0]
                else:
                    value = 'NoValue'
                print(ticker+":",value)
                #time.sleep(15)

Key_Stats()
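- This part extracts each company's ticker and its total debt-to-equity ratio.
- Because the label is followed by either a </th> or a </td> tag, only ":" is appended to gather when splitting;
- The field to collect may be missing from a file, so an if-else check was added;
- Debugging showed that some files have a line break between gather and the actual value, so the value is extracted with two successive split calls.
Company tickers and total debt-to-equity ratios
The data here is extracted with split and fixed strings; for a more general approach, see Regular Expressions.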
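As a rough illustration of the regular-expression alternative, the sketch below pulls the same field out with `re.search`. It is only one possible pattern, not the tutorial's method: the function name `extract_value` and the `sample` HTML fragment are assumptions made up for this example, modelled on the `yfnc_tabledata1` cell used in the script.

```python
import re

def extract_value(source, gather="Total Debt/Equity (mrq)"):
    """Return the table cell that follows the `gather` label, or None if absent."""
    # re.escape() keeps the parentheses and slash in the label literal;
    # re.DOTALL lets '.' match the optional line break between label and value
    pattern = re.escape(gather) + r':.*?<td class="yfnc_tabledata1">(.*?)</td>'
    match = re.search(pattern, source, flags=re.DOTALL)
    return match.group(1) if match else None

# Hypothetical fragment with a line break between the label and the value.
sample = ('<td class="yfnc_tablehead1">Total Debt/Equity (mrq):</td>\n'
          '<td class="yfnc_tabledata1">12.34</td>')

print(extract_value(sample))
print(extract_value('<html>no such field</html>'))
```

A single pattern covers both the </td>/</th> variation and the optional line break, and a missing field yields None instead of requiring a separate length check.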
6. Structuring data with Pandas
Use pandas to store the data (datetime, unixtime, ticker, value) in a .csv file; rows whose value is 'N/A' or 'NoValue' are skipped.
import pandas as pd
import os
import time
from datetime import datetime

path = "/home/sum/share/Ubuntu_DeepLearning/intraQuarter"  # cd path & pwd

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = sorted([x[0] for x in os.walk(statspath)])  # in Linux use sorted() func
    # DataFrame.append() was removed in pandas 2.0, so collect the rows in a list first
    rows = []
    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        ticker = each_dir.split("/")[-1]
        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                with open(full_file_path, 'r') as f:
                    source = f.read()
                try:
                    # the label is followed by </td> or </th>, possibly with a \n,
                    # so split on ':' only and then split a second time
                    value = source.split(gather+':')
                    if 1 < len(value):
                        value = value[1].split('<td class="yfnc_tabledata1">')[1].split('</td>')[0]
                    else:
                        value = 'NoValue'
                    print(ticker+":",value)
                    # float() raises ValueError for 'N/A' or 'NoValue', which skips the row
                    rows.append({'Date':date_stamp, 'Unix':unix_time, 'Ticker':ticker, 'DE Ratio':float(value)})
                except Exception as e:
                    pass
    df = pd.DataFrame(rows, columns=['Date','Unix','Ticker','DE Ratio'])
    save = gather.replace(' ','').replace('(','').replace(')','').replace('/','')+('.6.csv')
    print(save)
    df.to_csv(save)

Key_Stats()
Contents of the .csv file
Structuring the data with Pandas makes later processing more efficient.
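The skip-and-store step can be seen on its own with a few hand-written values. This is a minimal sketch under stated assumptions: the tickers and numbers in `raw` are made up, the helper `to_float_or_none` is hypothetical, and the CSV is written to an in-memory buffer rather than to a file.

```python
import io
import pandas as pd

def to_float_or_none(text):
    """Convert a scraped string to float; return None for 'N/A', 'NoValue', etc."""
    try:
        return float(text)
    except ValueError:
        return None

# Made-up (ticker, scraped text) pairs for illustration only.
raw = [('aaa', '12.34'), ('bbb', 'N/A'), ('ccc', 'NoValue'), ('ddd', '0.5')]

records = []
for ticker, text in raw:
    value = to_float_or_none(text)
    if value is not None:  # skip rows without a usable number
        records.append({'Ticker': ticker, 'DE Ratio': value})

# Build the DataFrame once from the collected rows, then serialize it as CSV.
df = pd.DataFrame(records, columns=['Ticker', 'DE Ratio'])
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Collecting plain dictionaries and constructing the DataFrame at the end avoids growing a DataFrame row by row.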
7. Getting more data and meshing data sets
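The goal of working with labeled data is classification. For investing, we only distinguish whether a stock:
- Outperforms the market (1)
- Underperforms the market (0)
A finer-grained classification might be: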
- Significantly Outperform(2)
- Outperform(1)
- Match (say within 0.5% or something)(0)
- Under-perform(-1)
- Significantly Under-perform(-2)
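The five-class scheme above could be expressed as a small labeling function. This is only a sketch: the 0.5% "match" band comes from the list above, while the 5% threshold for "significantly" and the function name `label_performance` are assumptions chosen for illustration.

```python
def label_performance(stock_change, market_change,
                      match_band=0.5, significant_band=5.0):
    """Map percentage changes of a stock and the market to a label in -2..2.

    match_band follows the "within 0.5%" idea in the text;
    significant_band is an assumed cut-off, not taken from the tutorial.
    """
    diff = stock_change - market_change
    if abs(diff) <= match_band:
        return 0                      # Match
    if diff > 0:
        return 2 if diff >= significant_band else 1
    return -2 if diff <= -significant_band else -1

# Percentage changes below are made-up inputs.
for stock, market in [(10.0, 2.0), (3.0, 2.0), (2.2, 2.0), (1.0, 2.0), (-5.0, 2.0)]:
    print(stock, market, label_performance(stock, market))
```

The binary scheme is the special case that only looks at the sign of the difference.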
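Yahoo Finance does provide some of this data, but to practice merging two data sources we take the S&P 500 data from Quandl: search for it and download the data from 2000 onward in CSV format. Because the Quandl website has changed since the tutorial was recorded, enter the address used in the video directly in the URL bar to download the S&P 500 Index data set; it can also be downloaded from my Baidu cloud drive, covering January 3, 2000 to March 22, 2016.
- S&P 500 Index: all companies covered by the S&P 500 are listed on major US exchanges such as the New York Stock Exchange and Nasdaq. Compared with the Dow Jones index, the S&P 500 contains more companies, so its risk is more diversified and it reflects broader market changes.
- Downloading data from Quandl requires an account; you can sign in with GitHub (which seems to be unavailable lately), Gmail, or LinkedIn.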
import pandas as pd
import os
import time
from datetime import datetime

path = "/home/sum/share/Ubuntu_DeepLearning/intraQuarter"  # cd path & pwd

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    # read the data sets
    statspath = path+'/_KeyStats'
    stock_list = sorted([x[0] for x in os.walk(statspath)])  # in Linux use sorted() func
    # DataFrame.append() was removed in pandas 2.0, so collect the rows in a list first
    rows = []
    # DataFrame.from_csv() was removed in pandas 1.0; read_csv() with these arguments replaces it
    sp500_df = pd.read_csv("YAHOO-INDEX_GSPC.csv", index_col=0, parse_dates=True)
    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        ticker = each_dir.split("/")[-1]
        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                with open(full_file_path, 'r') as f:
                    source = f.read()
                try:
                    # the label is followed by </td> or </th>, possibly with a \n,
                    # so split on ':' only and then split a second time
                    value = source.split(gather+':')
                    if 1 < len(value):
                        value = value[1].split('<td class="yfnc_tabledata1">')[1].split('</td>')[0]
                    else:
                        value = 'NoValue'
                    try:
                        sp500_date = datetime.fromtimestamp(unix_time).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row["Adjusted Close"].iloc[0])
                    except IndexError:
                        # no row for that date: go back 3 days (259200 seconds)
                        sp500_date = datetime.fromtimestamp(unix_time-259200).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row["Adjusted Close"].iloc[0])
                    # The reason for the try and except here is that some of our stock data may have been pulled on a weekend day.
                    # If we hunt for a weekend day's value of the S&P 500, that date simply won't exist in the dataset.
                    stock_price = float(source.split('</small><big><b>')[1].split('</b></big>')[0])
                    print("ticker:",ticker,"sp500_date:",sp500_date,"stock_price:",stock_price,"sp500_value:",sp500_value)
                    # the stock_price is missing from some files
                    rows.append({'Date':date_stamp,
                                 'Unix':unix_time,
                                 'Ticker':ticker,
                                 'DE Ratio':float(value),
                                 'Price':stock_price,
                                 'SP500':sp500_value})
                except Exception as e:
                    pass
    df = pd.DataFrame(rows, columns=['Date','Unix','Ticker','DE Ratio','Price','SP500'])
    save = gather.replace(' ','').replace('(','').replace(')','').replace('/','')+('.7.csv')
    print(save)
    df.to_csv(save)

Key_Stats()
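The nested try/except block is needed because there is no S&P 500 value on weekends, so 3 days (259200 seconds) are subtracted from the timestamp;
Compared with TotalDebtEquitymrq.6.csv, the newly generated TotalDebtEquitymrq.7.csv is missing some rows. Debugging showed that most of them are missing because the HTML file from Yahoo Finance contains no stock_price for that day.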
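A variant of the weekend fallback is to step back one day at a time until a trading day is found, instead of subtracting a fixed 3 days. The sketch below is an alternative under stated assumptions, not the tutorial's approach: the helper name `close_on_or_before`, the `max_back` limit, and the two index values are made up; only the "Adjusted Close" column name follows the script above.

```python
from datetime import datetime, timedelta
import pandas as pd

def close_on_or_before(index_df, day, column="Adjusted Close", max_back=3):
    """Return (date string, value) for `day` or the nearest earlier date present."""
    for back in range(max_back + 1):
        key = (day - timedelta(days=back)).strftime('%Y-%m-%d')
        if key in index_df.index:
            return key, float(index_df.loc[key, column])
    raise KeyError('no index value within %d days before %s' % (max_back, day))

# Made-up index values for a Friday and the following Monday.
sp500 = pd.DataFrame({'Adjusted Close': [100.0, 101.0]},
                     index=['2016-03-18', '2016-03-21'])

# 2016-03-20 is a Sunday, so the lookup falls back to Friday 2016-03-18.
found_date, found_value = close_on_or_before(sp500, datetime(2016, 3, 20))
print(found_date, found_value)
```

Stepping back day by day returns the most recent trading day, whereas a flat 3-day offset lands on Wednesday for a Saturday and on Thursday for a Sunday.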