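[Tip] Merging DataFrames in Julia: the time difference between appending directly and preallocating
1. Merging directly with append!(df1, df2).
Each call appends a one-row DataFrame, and this is repeated 20 million times.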
using DataFrames
function main()
    a = 1.0
    b = 2.0
    c = 3.0
    d = 4.0
    e = 5.0
    g = 6.0
    df = DataFrame([[a],[b],[c],[d],[e],[g]],[:a,:b,:c,:d,:e,:g])
    df1 = DataFrame([[a],[b],[c],[d],[e],[g]],[:a,:b,:c,:d,:e,:g])
    for i in 1:20000000
        append!(df,df1)
    end
    return df |> size
end
@time main()
Elapsed: 12.765717 seconds (260.16 M allocations: 3.903 GiB, 5.02% gc time)
2. Create a (20 million, 6) array first, fill the matrix in a for loop, then convert the matrix to a DataFrame.
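function main1()
    a = 1.0
    b = 2.0
    c = 3.0
    d = 4.0
    e = 5.0
    g = 6.0
    myAry = Array{Float64,2}(undef,20000000,6)   # allocate the whole matrix once
    for i in 1:20000000
        myAry[i,1] = a
        myAry[i,2] = b
        myAry[i,3] = c
        myAry[i,4] = d
        myAry[i,5] = e
        myAry[i,6] = g
    end
    # Build the DataFrame and name its columns in one step. The original used
    # convert(DataFrame, myAry) followed by names!(df, ...); both were removed
    # in later DataFrames.jl releases.
    df = DataFrame(myAry, [:a,:b,:c,:d,:e,:g])
    return df |> size
end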
@time main1()
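Elapsed: 1.137518 seconds (102.15 k allocations: 1.794 GiB, 18.47% gc time)
Conclusion: the two approaches differ by roughly 10x (about 12.8 s versus 1.1 s). Filling a preallocated matrix first is much faster. The catch is that the data size must be known in advance; otherwise the best you can do is preallocate an estimated size.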
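When the row count is only an estimate, one possible approach is to allocate the estimated size, grow the buffer if the estimate turns out too small, and trim the unused tail at the end. The sketch below is an illustration under those assumptions; `fill_with_estimate` and its parameters are hypothetical names, not part of DataFrames.jl.

```julia
# Hypothetical helper: collect rows into a matrix whose final row count
# is not known up front. `estimate` is only a guess and must be >= 1.
function fill_with_estimate(rows, ncols::Int, estimate::Int)
    @assert estimate >= 1
    buf = Matrix{Float64}(undef, estimate, ncols)
    n = 0
    for row in rows
        n += 1
        if n > size(buf, 1)
            # The estimate was too small: double the capacity and copy once.
            bigger = Matrix{Float64}(undef, 2 * size(buf, 1), ncols)
            bigger[1:n-1, :] = buf[1:n-1, :]
            buf = bigger
        end
        for j in 1:ncols
            buf[n, j] = row[j]
        end
    end
    return buf[1:n, :]   # trim the unused tail
end

rows = [(1.0, 2.0, 3.0) for _ in 1:10]
m = fill_with_estimate(rows, 3, 4)   # the estimate of 4 rows is too small on purpose
println(size(m))
```

Doubling keeps the number of reallocations small even when the estimate is far off, at the cost of some temporarily unused memory.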
[Tip] How to get the name of a variable
macro Name(arg)
    string(arg)
end
variablex = 6
a = @Name(variablex)
a |> println
@Name(a) |> println
Output:
variablex
a
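The same idea can be extended to return the value together with the name. The macro below is a small sketch; `@NameAndValue` is a hypothetical name introduced only for illustration.

```julia
# Hypothetical macro: returns the variable's name paired with its value.
macro NameAndValue(arg)
    # string(arg) is evaluated at macro-expansion time; esc(arg) makes the
    # variable resolve in the caller's scope when the expanded code runs.
    return :( $(string(arg)) => $(esc(arg)) )
end

variablex = 6
p = @NameAndValue(variablex)
println(p)
```

Without `esc`, the variable would be looked up in the module where the macro is defined rather than where it is called.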
[Tip] Testing for missing and NaN values: ismissing(), !ismissing(), isnan(), !isnan()
ddd=[NaN,1,2,3,4,5,NaN,missing]
ddd |> display
ddd[ddd.|>!ismissing] |> display # drop missing
ddd[ddd.|>!ismissing] |> begin # drop missing
    ary->ary[ary.|> !isnan] # drop NaN
end |>
display
ddd[ddd.|>!ismissing] |> # drop missing
ary->ary[(ary .|> isnan).| (ary .> 0)] |> # keep values that are NaN or greater than 0
display
Output:
8-element Array{Union{Missing, Float64},1}:
NaN
1.0
2.0
3.0
4.0
5.0
NaN
missing
7-element Array{Union{Missing, Float64},1}:
NaN
1.0
2.0
3.0
4.0
5.0
NaN
5-element Array{Union{Missing, Float64},1}:
1.0
2.0
3.0
4.0
5.0
7-element Array{Union{Missing, Float64},1}:
NaN
1.0
2.0
3.0
4.0
5.0
NaN
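Note that the filtered arrays above still have element type Union{Missing, Float64}. If a plain Float64 vector is wanted, Base's `skipmissing` and `filter` are one alternative; the following is a minimal sketch using the same data.

```julia
ddd = [NaN, 1, 2, 3, 4, 5, NaN, missing]

# skipmissing gives a lazy iterator without the missing entries;
# collect materializes it as a Vector{Float64}.
no_missing = collect(skipmissing(ddd))

# filter with a negated predicate then drops the NaN entries.
clean = filter(!isnan, no_missing)

println(eltype(no_missing))
println(clean)
```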
[Tip] Inspecting a DataFrame column and excluding the rows where that column is missing.
using DataFrames
# Build the DataFrame
name = [missing,"王五","金三","赵四"]
age = [1,2,3,4]
df = DataFrame(name = name,age = age)
df |> display
# Drop the rows whose name is missing (dropmissing(df, :name) selects the same rows)
df_new = df[df.name .|> name -> !ismissing(name),:]
df_new |> display
Result:
Before:
4 rows × 2 columns
   name     age
   String⍰  Int64
1  missing  1
2  王五     2
3  金三     3
4  赵四     4
After:
3 rows × 2 columns
   name     age
   String⍰  Int64
1  王五     2
2  金三     3
3  赵四     4
[Tip] Sorting a DataFrame: sort(df, [:col]) sorts in ascending order by default
using DataFrames
mydf = DataFrame()
mydf.a = [-x + 10 for x in 1:10]
mydf.b = [x for x in 201:210]
mydf.c = [x for x in 301:310]
mydf.d = [x for x in 401:410]
mydf.time = ["9:00:00","9:05;00","9:10;00","9:15;00","9:20;00","9:25;00","9:30;00","9:35;00","9:40;00","9:45;00"]
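display("Generated df:")
mydf |> display
display("df sorted by column a (ascending)")
df = sort(mydf,[:a])
df |> display
display("The 5 largest values of a:")
last(df,5) |> display # without display, a script would not print this result
display("Sorted ascending by the time (String) column")
df = sort(mydf,[:time])
df |> display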
Output
"Generated df:"
10×5 DataFrame
│ Row │ a │ b │ c │ d │ time │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼───────┼───────┼─────────┤
│ 1 │ 9 │ 201 │ 301 │ 401 │ 9:00:00 │
│ 2 │ 8 │ 202 │ 302 │ 402 │ 9:05;00 │
│ 3 │ 7 │ 203 │ 303 │ 403 │ 9:10;00 │
│ 4 │ 6 │ 204 │ 304 │ 404 │ 9:15;00 │
│ 5 │ 5 │ 205 │ 305 │ 405 │ 9:20;00 │
│ 6 │ 4 │ 206 │ 306 │ 406 │ 9:25;00 │
│ 7 │ 3 │ 207 │ 307 │ 407 │ 9:30;00 │
│ 8 │ 2 │ 208 │ 308 │ 408 │ 9:35;00 │
│ 9 │ 1 │ 209 │ 309 │ 409 │ 9:40;00 │
│ 10 │ 0 │ 210 │ 310 │ 410 │ 9:45;00 │
"df sorted by column a (ascending)"
10×5 DataFrame
│ Row │ a │ b │ c │ d │ time │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼───────┼───────┼─────────┤
│ 1 │ 0 │ 210 │ 310 │ 410 │ 9:45;00 │
│ 2 │ 1 │ 209 │ 309 │ 409 │ 9:40;00 │
│ 3 │ 2 │ 208 │ 308 │ 408 │ 9:35;00 │
│ 4 │ 3 │ 207 │ 307 │ 407 │ 9:30;00 │
│ 5 │ 4 │ 206 │ 306 │ 406 │ 9:25;00 │
│ 6 │ 5 │ 205 │ 305 │ 405 │ 9:20;00 │
│ 7 │ 6 │ 204 │ 304 │ 404 │ 9:15;00 │
│ 8 │ 7 │ 203 │ 303 │ 403 │ 9:10;00 │
│ 9 │ 8 │ 202 │ 302 │ 402 │ 9:05;00 │
│ 10 │ 9 │ 201 │ 301 │ 401 │ 9:00:00 │
"The 5 largest values of a:"
5×5 DataFrame
│ Row │ a │ b │ c │ d │ time │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼───────┼───────┼─────────┤
│ 1 │ 5 │ 205 │ 305 │ 405 │ 9:20;00 │
│ 2 │ 6 │ 204 │ 304 │ 404 │ 9:15;00 │
│ 3 │ 7 │ 203 │ 303 │ 403 │ 9:10;00 │
│ 4 │ 8 │ 202 │ 302 │ 402 │ 9:05;00 │
│ 5 │ 9 │ 201 │ 301 │ 401 │ 9:00:00 │
"Sorted ascending by the time (String) column"
10×5 DataFrame
│ Row │ a │ b │ c │ d │ time │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼───────┼───────┼─────────┤
│ 1 │ 9 │ 201 │ 301 │ 401 │ 9:00:00 │
│ 2 │ 8 │ 202 │ 302 │ 402 │ 9:05;00 │
│ 3 │ 7 │ 203 │ 303 │ 403 │ 9:10;00 │
│ 4 │ 6 │ 204 │ 304 │ 404 │ 9:15;00 │
│ 5 │ 5 │ 205 │ 305 │ 405 │ 9:20;00 │
│ 6 │ 4 │ 206 │ 306 │ 406 │ 9:25;00 │
│ 7 │ 3 │ 207 │ 307 │ 407 │ 9:30;00 │
│ 8 │ 2 │ 208 │ 308 │ 408 │ 9:35;00 │
│ 9 │ 1 │ 209 │ 309 │ 409 │ 9:40;00 │
│ 10 │ 0 │ 210 │ 310 │ 410 │ 9:45;00 │
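Sorting the time column as String compares the text character by character. That happens to give chronological order here because every value starts with "9:", but it would place "10:00:00" before "9:00:00". A minimal sketch of the difference, using the standard-library Dates module and made-up sample times:

```julia
using Dates

times = ["10:00:00", "9:05:00", "9:00:00"]   # sample values, not from the table above

# Plain string sort: '1' sorts before '9', so "10:00:00" comes first.
lexical = sort(times)

# Sorting by the parsed Time value gives chronological order.
fmt = dateformat"H:MM:SS"
chronological = sort(times, by = s -> Time(s, fmt))

println(lexical)
println(chronological)
```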
[Tip] Operator equivalents of in and not in, and their broadcast forms: ∈, ∉, .∈, .∉
# Example 1
"a" in ["b"] # false
"a" in ["a"] # true
"a" ∈ ["a"] # true
"a" ∉ ["b"] # true
# Example 2: filtering an array
ary = ["a","b","c","d"]
sub_ary = ["a","b"]
ary[ary .∈ Ref(sub_ary)] # ["a", "b"]  Note: Ref cannot be omitted
ary[ary .|> item -> item ∈ sub_ary] # ["a", "b"]  equivalent form
ary[ary .∉ Ref(sub_ary)] # ["c", "d"]
ary[ary .|> item -> item ∉ sub_ary] # ["c", "d"]  equivalent form
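Why Ref is needed in Example 2: broadcasting would otherwise try to pair the elements of both arrays one to one. The sketch below shows the mask that Ref produces, two other spellings that should behave the same way, and the failure when Ref is left out.

```julia
ary = ["a", "b", "c", "d"]
sub_ary = ["a", "b"]

# Ref(sub_ary) makes broadcasting treat sub_ary as a single value,
# so every element of ary is tested against the whole collection.
mask = ary .∈ Ref(sub_ary)

# Two other spellings: a one-element tuple, or the single-argument
# method in(collection), which returns a predicate function.
mask_tuple = ary .∈ (sub_ary,)
mask_pred = in(sub_ary).(ary)

# Without Ref, broadcasting tries to pair the 4 elements of ary with the
# 2 elements of sub_ary and fails because the lengths differ.
failed = try
    ary .∈ sub_ary
    false
catch err
    err isa DimensionMismatch
end

println(mask)
```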
# Example 3: filtering a DataFrame
using DataFrames
using DataFramesMeta
mydf = DataFrame()
mydf.a = [x for x in 1:10]
mydf.b = [x for x in 201:210]
mydf.c = [x for x in 301:310]
mydf.d = [x for x in 401:410]
mydf.时间 = ["9:00:00","9:05;00","9:10;00","9:15;00","9:20;00","9:25;00","9:30;00","9:35;00","9:40;00","9:45;00"]
mydf |> println
timeSet = ["9:00:00","11:30:00","15:00:00"]
@linq mydf |> where(:时间 .∈ Ref(timeSet)) |> println # keep only the rows whose time is 9:00, 11:30 or 15:00; Ref cannot be omitted
"""
# ====Output====
10×5 DataFrame
│ Row │ a │ b │ c │ d │ 时间 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 201 │ 301 │ 401 │ 9:00:00 │
│ 2 │ 2 │ 202 │ 302 │ 402 │ 9:05;00 │
│ 3 │ 3 │ 203 │ 303 │ 403 │ 9:10;00 │
│ 4 │ 4 │ 204 │ 304 │ 404 │ 9:15;00 │
│ 5 │ 5 │ 205 │ 305 │ 405 │ 9:20;00 │
│ 6 │ 6 │ 206 │ 306 │ 406 │ 9:25;00 │
│ 7 │ 7 │ 207 │ 307 │ 407 │ 9:30;00 │
│ 8 │ 8 │ 208 │ 308 │ 408 │ 9:35;00 │
│ 9 │ 9 │ 209 │ 309 │ 409 │ 9:40;00 │
│ 10 │ 10 │ 210 │ 310 │ 410 │ 9:45;00 │
1×5 DataFrame
│ Row │ a │ b │ c │ d │ 时间 │
│ │ Int64 │ Int64 │ Int64 │ Int64 │ String │
├─────┼───────┼───────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 201 │ 301 │ 401 │ 9:00:00 │
"""
[Tip] Iterating over and sorting a dictionary
keys(dict) -- the keys
values(dict) -- the values
collect(dict) -- a vector of key => value pairs
sort(pairs) -- sort the key => value pairs
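The calls listed above can also be combined to sort by value instead of by key. A minimal sketch with made-up data (the `scores` dictionary is an assumption for illustration):

```julia
scores = Dict("b" => 2, "a" => 3, "c" => 1)   # example data

by_key = sort(collect(scores))                 # pairs compare by key first
by_value = sort(collect(scores), by = last)    # last(pair) is the value
by_value_desc = sort(collect(scores), by = last, rev = true)

println(by_key)
println(by_value)
```

A Dict has no defined order, so the sorted result is a Vector of pairs rather than a new Dict.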
# Build the dictionary
mydict = Dict([k=>"value:$(v)" for (k,v) in zip(1:10,11:20)])
# Iterating over a dictionary, method 1
for (key,value) in mydict
    println("$(key):$(value)")
end
"""
7:value:17
4:value:14
9:value:19
10:value:20
2:value:12
3:value:13
5:value:15
8:value:18
6:value:16
1:value:11
"""
# Iterating over a dictionary, method 2
mydict |> keys |> println # the keys
#[7, 4, 9, 10, 2, 3, 5, 8, 6, 1]
mydict |> values |> println # the values
#["value:17", "value:14", "value:19", "value:20", "value:12", "value:13", "value:15", "value:18", "value:16", "value:11"]
# Sorting a dictionary: collect the pairs first and then sort them, or use an ordered dictionary directly. Note that collect is required
mydict |> collect |> sort .|> println
"""
===Output===
1 => "value:11"
2 => "value:12"
3 => "value:13"
4 => "value:14"
5 => "value:15"
6 => "value:16"
7 => "value:17"
8 => "value:18"
9 => "value:19"
10 => "value:20"
"""
shift(Array, n): shift the values of an array by n positions
shift(myary::Array,n::Int)
"""
Purpose: move the elements of an array forward or backward
Arguments:
myary: the array to shift
n: number of positions to shift. Positive: elements move from right to left (toward the front). Negative: elements move from left to right (toward the back)
Returns: a new array
"""
function shift(myary::Array,n::Int)
    #result = zeros(Float64,length(myary))
    result = Array{Any,1}(undef, length(myary)) # works for any element type
    for i in 1:length(myary)
        if 1 <= n+i <= length(myary)
            result[i] = myary[n+i]
        else
            result[i] = NaN # vacated positions are filled with NaN
        end
    end
    return result
end
# ====Test code====
ary1 = [1,2,3,4,5,6,7,8,9,10]
ary2 = ["1","2","3","4","5"]
shift(ary1,1) |> println # Any[2, 3, 4, 5, 6, 7, 8, 9, 10, NaN]
shift(ary2,-1) |> println # Any[NaN, "1", "2", "3", "4"]
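Because shift fills the gaps with NaN, the result has element type Any even for string input. One possible alternative, sketched below, fills the gaps with `missing` so the result keeps a concrete element type. `shift_missing` is a hypothetical name, and the sketch assumes an ordinary 1-based vector.

```julia
# Hypothetical variant of shift: vacated slots hold `missing`, and the
# element type is Union{Missing,T} rather than Any.
function shift_missing(v::AbstractVector{T}, n::Int) where {T}
    result = Vector{Union{Missing,T}}(missing, length(v))
    for i in 1:length(v)
        src = i + n
        if 1 <= src <= length(v)
            result[i] = v[src]
        end
    end
    return result
end

shift_missing([1, 2, 3, 4, 5], 1) |> println
shift_missing(["1", "2", "3"], -1) |> println
```

The result then works with functions such as `skipmissing` and `ismissing`.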
Evaluating a function over a rolling window of a series
rolling_func(array::Array,n::Int,func::Function)::Array
using Statistics
"""
Purpose: take the values inside each rolling window and evaluate them with the given func
Arguments:
array: the input Array
n: window length
func: the operation applied to each window (for example mean or std)
====Test code====
myary = [1,2,3,4,5,6,7,8,9]
println(myary)
println(rolling_func(myary,2,mean))
println(rolling_func(myary,2,std))
"""
function rolling_func(array::Array,n::Int,func::Function)::Array
    @assert 1 <= n <= length(array)
    result::Array = zeros(Float64,length(array))
    #result = Array{Any,1}(undef, length(array)) # works for any element type
    for i in 1:length(array)
        if i >= n
            #array[i-n+1:i] |> func |> (x)-> result[i] = x
            result[i] = array[i-n+1:i] |> func
        else
            result[i] = NaN # not enough data yet for a full window
        end
    end
    return result
end
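Each slice `array[i-n+1:i]` in the function above allocates a copy of the window. A variant that passes a view instead is sketched below; `rolling_view` is a hypothetical name, and the sketch assumes a 1-based vector of real numbers and a func that returns a number.

```julia
using Statistics

# Hypothetical variant: same windowing rule, but each window is a view,
# so no temporary array is copied per step.
function rolling_view(v::AbstractVector{<:Real}, n::Int, func::Function)
    @assert 1 <= n <= length(v)
    result = fill(NaN, length(v))   # the first n-1 positions stay NaN
    for i in n:length(v)
        result[i] = func(view(v, i-n+1:i))
    end
    return result
end

r = rolling_view([1, 2, 3, 4, 5, 6, 7, 8, 9], 2, mean)
println(r)
```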
Reading a YAML config file into a Dict
read_yaml_config(file_path::String)::Dict
import YAML
# Read a YAML config file -> Dict
function read_yaml_config(file_path::String)::Dict
    data::Dict = Dict()
    try
        # the do-block closes the file even if YAML.load throws
        data = open(file_path) do io
            YAML.load(io)
        end
        #println(typeof(data))
        #println(data)
    catch e
        println("Call failed: read_yaml_config(), error: ",e)
    end
    return data
end
# """
# ====Test code====
file_path = "D:/pythonWorkSpace/lianghuafxi/mod/winCelue05/config/2020年度每日选股列表.yaml"
read_yaml_config(file_path)
@time read_yaml_config(file_path)
# Performance seems poor; reading is rather slow
# """
Reading k-line (candlestick) data
read_xt_kline(kind::String,code::String,kline_directory::String)::DataFrame
Path:
D:\pythonWorkSpace\lianghuafxi\mod\data2020\stock\price_000001.txt
Contents of the txt file:
timetag|open|high|low|close|volumn|amount
20160104|9.51|9.53|8.87|8.95|563497.0|660373404.00
20160105|8.90|9.15|8.80|9.01|663269.0|755529717.00
20160106|9.02|9.14|9.00|9.12|515706.0|591700436.00
20160107|9.02|9.02|8.60|8.62|174761.0|194869000.00
20160108|8.85|8.92|8.59|8.77|747527.0|831332251.00
20160111|8.67|8.74|8.41|8.47|732013.0|800683139.00
20160112|8.53|8.60|8.37|8.52|561642.0|605972403.00
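To make the file layout concrete, the sketch below parses the first two sample rows with only Base and the Dates standard library. It illustrates the format and the Int-like timetag to Date conversion; it is not a replacement for the CSV-based reader, and the inlined `sample` string is an assumption made so the snippet runs on its own.

```julia
using Dates

# First lines of the sample file, inlined so the sketch is self-contained.
sample = """
timetag|open|high|low|close|volumn|amount
20160104|9.51|9.53|8.87|8.95|563497.0|660373404.00
20160105|8.90|9.15|8.80|9.01|663269.0|755529717.00
"""

lines = split(strip(sample), '\n')
header = split(lines[1], '|')

# One NamedTuple per data row: the date is parsed, the prices become Float64.
rows = map(lines[2:end]) do line
    f = split(line, '|')
    (day = Date(f[1], dateformat"yyyymmdd"),
     open = parse(Float64, f[2]),
     high = parse(Float64, f[3]),
     low = parse(Float64, f[4]),
     close = parse(Float64, f[5]))
end

println(rows[1])
```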
using CSV
using DataFrames
using Dates
# Read k-line data
function read_xt_kline(kind::String,code::String,kline_directory::String)::DataFrame
    if kind == "股票" # stock
        kind = "stock"
    elseif kind == "指数" # index
        kind = "index"
    end
    df = DataFrame() # returned empty if reading fails
    file = "$kline_directory/$kind/price_$code.txt"
    try
        df = DataFrame(CSV.File(file; delim = '|')) # the original called CSV.read(file, delim = '|')
        df = df[:,[:timetag,:open,:high,:low,:close]] # select the needed columns
        rename!(df, :timetag => :day) # rename in place (the original used names!, since removed from DataFrames.jl)
        df.day = df.day .|> string .|> (x) -> Date(x,dateformat"yyyymmdd") # Int -> String -> Date
        df[!,:stockID] .= code # add a column holding the stock code
        #display(df)
    catch e
        println("Call failed: read_xt_kline(), ",file," error: ",e)
    end
    return df
end
"""
# Test code
kind = "股票"
stock = "600352"
kline_directory = "D:/pythonWorkSpace/lianghuafxi/mod/data2020" # location of the k-line files
df = read_xt_kline(kind,stock,kline_directory)
@time read_xt_kline(kind,stock,kline_directory)
println("ok")
"""
Reading k-line data and returning only the trading-day series
read_xt_kline_as_daPan(kind::String,code::String,kline_directory::String)::Array
Usage: get the trading-day timeline when backtesting
using Dates
using CSV
using DataFrames
# Read k-line data and return only the trading-day series
function read_xt_kline_as_daPan(kind::String,code::String,kline_directory::String)::Array
    # kind: 股票 (stock), 指数 (index), 期货 (futures), 基金 (fund)
    # Reads the 迅投 (xt) k-line data to use as the market's trading-day timeline
    # Returns: the trading days as a vector of Date (empty if the file cannot be read)
    # Build the file path from the kind
    if kind == "股票"
        kind = "stock"
    elseif kind == "指数"
        kind = "index"
    end
    # Strip the market suffix if the code has one
    code = replace(code,".SH"=>"")
    code = replace(code,".SZ"=>"")
    days = Date[] # the original returned df.day, which fails when the read fails
    file = "$kline_directory/$kind/price_$code.txt"
    try # delisted stocks have no k-line file, which raises an error
        df = DataFrame(CSV.File(file; delim = '|')) # the original called CSV.read(file, delim = '|')
        days = df.timetag .|> string .|> (x) -> Date(x,dateformat"yyyymmdd") # Int -> String -> Date
    catch e
        println("*****Failed to read the k-line data for $code, please check****** $e")
    end
    return days
end
# Test code
kind = "指数"
stock = "000001"
kline_directory = "D:/pythonWorkSpace/lianghuafxi/mod/data2020" # location of the k-line files
dt = read_xt_kline_as_daPan(kind,stock,kline_directory)
@time read_xt_kline_as_daPan(kind,stock,kline_directory)
dt |> display
println("ok")