Julia lang programming notes

Author: 昵称违法 | Published 2020-05-10 11:57

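    [Key point] Combining DataFrames in Julia: time difference between appending directly and preallocating

    1. Appending directly with append!(df1,df2).

    Each call appends a one-row DataFrame, and the append is repeated 20 million times.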

    using DataFrames
    function main()
        a =1.0
        b =2.0
        c =3.0
        d =4.0
        e =5.0
        g =6.0
        df = DataFrame([[a],[b],[c],[d],[e],[g]],[:a,:b,:c,:d,:e,:g])
        df1 = DataFrame([[a],[b],[c],[d],[e],[g]],[:a,:b,:c,:d,:e,:g])
        for i in 1:20000000
            append!(df,df1)
        end
        return df |> size
    end
    @time main()
    

    Elapsed: 12.765717 seconds (260.16 M allocations: 3.903 GiB, 5.02% gc time)

    2. Create a (20 million, 6) array first, fill the matrix in a for loop, then convert the matrix to a DataFrame.
    function main1()
        a =1.0
        b =2.0
        c =3.0
        d =4.0
        e =5.0
        g =6.0
        df = DataFrame()
        myAry = Array{Float64,2}(undef,20000000,6)    
        for i in 1:20000000
            myAry[i,1] = a
            myAry[i,2] = b
            myAry[i,3] = c
            myAry[i,4] = d
            myAry[i,5] = e
            myAry[i,6] = g        
        end
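        # convert(DataFrame, myAry) and names! are no longer available in DataFrames 1.x;
        # the constructor accepts the matrix and the column names directly.
        df = DataFrame(myAry, [:a,:b,:c,:d,:e,:g])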
        return df |> size
    end
    
    @time main1()
    

    Elapsed: 1.137518 seconds (102.15 k allocations: 1.794 GiB, 18.47% gc time)

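    Conclusion: the two approaches differ in speed by about 10x. Filling a matrix whose storage is preallocated is much faster. This presupposes that the data size is known in advance; otherwise, the only option is to preallocate an estimated size.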
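When the size is only an estimate, one possible pattern is to grow the buffer whenever it fills up and trim it at the end. The following is a minimal sketch on a plain matrix rather than a DataFrame; `fill_rows` and `estimated_rows` are hypothetical names that do not appear in the original notes.

```julia
# Fill a preallocated buffer whose capacity is only an estimate, then trim it.
# `fill_rows` and `estimated_rows` are hypothetical names for this sketch.
function fill_rows(nrows::Int; estimated_rows::Int = 1_000)
    ncols = 6
    buf = Matrix{Float64}(undef, estimated_rows, ncols)
    used = 0
    for i in 1:nrows
        if used == size(buf, 1)
            # The estimate was too small: double the capacity and copy the old rows.
            bigger = Matrix{Float64}(undef, max(1, 2 * size(buf, 1)), ncols)
            bigger[1:used, :] = buf
            buf = bigger
        end
        used += 1
        buf[used, :] .= (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    end
    return buf[1:used, :]   # keep only the rows that were actually written
end

result = fill_rows(2_500)
println(size(result))
```

Doubling the capacity keeps the number of reallocations small even when the estimate is far too low.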

    [Key point] How to get the name of a variable as a string

    macro Name(arg)
        string(arg)
    end
    
    variablex  = 6
    
    a = @Name(variablex)
    a |> println
    
    @Name(a) |> println
    

    Output:
    variablex
    a
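This works because a macro receives the unevaluated expression instead of its value. As a sketch of the same idea, the hypothetical macro `@named` (not part of the original notes) returns both the source text and the evaluated value as a pair.

```julia
# Hypothetical macro `@named`: returns "source text" => value for the expression it is given.
macro named(ex)
    # string(ex) is computed at macro-expansion time from the unevaluated expression;
    # esc(ex) makes the expression evaluate in the caller's scope at run time.
    return :($(string(ex)) => $(esc(ex)))
end

variablex = 6
p = @named(variablex)
q = @named(variablex + 1)
println(p)
println(q)
```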

    [Key point] Testing for missing and NaN values: ismissing(), !ismissing(), isnan(), !isnan()

    ddd=[NaN,1,2,3,4,5,NaN,missing]
    ddd |> display
    
    ddd[ddd.|>!ismissing] |> display  # drop missing
    
    ddd[ddd.|>!ismissing] |> begin    # drop missing
        ary->ary[ary.|> !isnan]       # drop NaN
        end |> 
        display
    
    ddd[ddd.|>!ismissing] |>                        # drop missing
        ary->ary[(ary .|> isnan).| (ary .> 0)] |>   # keep values that are NaN or greater than 0
        display
    

    Output:

    8-element Array{Union{Missing, Float64},1}:
     NaN       
       1.0     
       2.0     
       3.0     
       4.0     
       5.0     
     NaN       
        missing
    
    7-element Array{Union{Missing, Float64},1}:
     NaN  
       1.0
       2.0
       3.0
       4.0
       5.0
     NaN  
    
    5-element Array{Union{Missing, Float64},1}:
     1.0
     2.0
     3.0
     4.0
     5.0
    
    7-element Array{Union{Missing, Float64},1}:
     NaN  
       1.0
       2.0
       3.0
       4.0
       5.0
     NaN  
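As an alternative to boolean-mask indexing, Base's `skipmissing` and `filter` express the same three selections. This is a sketch rather than the author's method. One difference is that `collect(skipmissing(...))` narrows the element type to `Float64`, whereas the outputs above keep `Union{Missing, Float64}`.

```julia
ddd = [NaN, 1, 2, 3, 4, 5, NaN, missing]

# skipmissing is a lazy iterator over the non-missing elements;
# collect materializes it as a Vector{Float64}.
no_missing = collect(skipmissing(ddd))

# filter with a negated predicate also drops the NaN entries.
clean = filter(!isnan, no_missing)

# Keep the values that are NaN or positive.
nan_or_pos = filter(x -> isnan(x) || x > 0, no_missing)

println(clean)
```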
    

    [Key point] In a DataFrame, inspect a column and drop the rows where that column is missing.

    using DataFrames
    # build the DataFrame
    name = [missing,"王五","金三","赵四"]
    age = [1,2,3,4]
    df = DataFrame(name = name,age = age)
    df |> display
    
    # drop the rows whose name is missing
    df_new = df[df.name .|> name -> !ismissing(name),:] 
    df_new |> display
    

    Result:

    Before:
    
    4 rows × 2 columns
    
    name    age
    String⍰ Int64
    1   missing 1
    2   王五  2
    3   金三  3
    4   赵四  4
    
    
    After:
    
    3 rows × 2 columns
    
    name    age
    String⍰ Int64
    1   王五  2
    2   金三  3
    3   赵四  4
    

    [Key point] Sorting a DataFrame: sort(df,[:col]) sorts in ascending order by default

    using DataFrames
    mydf = DataFrame()
    mydf.a = [-x + 10 for x in 1:10]
    mydf.b = [x for x in 201:210]
    mydf.c = [x for x in 301:310]
    mydf.d = [x for x in 401:410]
    mydf.time = ["9:00:00","9:05;00","9:10;00","9:15;00","9:20;00","9:25;00","9:30;00","9:35;00","9:40;00","9:45;00"]
    
    display("Generated df:")
    mydf |> display
    
    display("df sorted by column a (ascending):")
    df = sort(mydf,[:a])
    df |> display
    
    display("The 5 rows with the largest a:")
    last(df,5) |> display
    
    display("Sorted by the time (String) column, ascending:")
    df = sort(mydf,[:time])
    df |> display
    

    Result:

    "Generated df:"
    10×5 DataFrame
    │ Row │ a     │ b     │ c     │ d     │ time    │
    │     │ Int64 │ Int64 │ Int64 │ Int64 │ String  │
    ├─────┼───────┼───────┼───────┼───────┼─────────┤
    │ 1   │ 9     │ 201   │ 301   │ 401   │ 9:00:00 │
    │ 2   │ 8     │ 202   │ 302   │ 402   │ 9:05;00 │
    │ 3   │ 7     │ 203   │ 303   │ 403   │ 9:10;00 │
    │ 4   │ 6     │ 204   │ 304   │ 404   │ 9:15;00 │
    │ 5   │ 5     │ 205   │ 305   │ 405   │ 9:20;00 │
    │ 6   │ 4     │ 206   │ 306   │ 406   │ 9:25;00 │
    │ 7   │ 3     │ 207   │ 307   │ 407   │ 9:30;00 │
    │ 8   │ 2     │ 208   │ 308   │ 408   │ 9:35;00 │
    │ 9   │ 1     │ 209   │ 309   │ 409   │ 9:40;00 │
    │ 10  │ 0     │ 210   │ 310   │ 410   │ 9:45;00 │
    df sorted by column a (ascending):
    10×5 DataFrame
    │ Row │ a     │ b     │ c     │ d     │ time    │
    │     │ Int64 │ Int64 │ Int64 │ Int64 │ String  │
    ├─────┼───────┼───────┼───────┼───────┼─────────┤
    │ 1   │ 0     │ 210   │ 310   │ 410   │ 9:45;00 │
    │ 2   │ 1     │ 209   │ 309   │ 409   │ 9:40;00 │
    │ 3   │ 2     │ 208   │ 308   │ 408   │ 9:35;00 │
    │ 4   │ 3     │ 207   │ 307   │ 407   │ 9:30;00 │
    │ 5   │ 4     │ 206   │ 306   │ 406   │ 9:25;00 │
    │ 6   │ 5     │ 205   │ 305   │ 405   │ 9:20;00 │
    │ 7   │ 6     │ 204   │ 304   │ 404   │ 9:15;00 │
    │ 8   │ 7     │ 203   │ 303   │ 403   │ 9:10;00 │
    │ 9   │ 8     │ 202   │ 302   │ 402   │ 9:05;00 │
    │ 10  │ 9     │ 201   │ 301   │ 401   │ 9:00:00 │
    The 5 rows with the largest a:
    5×5 DataFrame
    │ Row │ a     │ b     │ c     │ d     │ time    │
    │     │ Int64 │ Int64 │ Int64 │ Int64 │ String  │
    ├─────┼───────┼───────┼───────┼───────┼─────────┤
    │ 1   │ 5     │ 205   │ 305   │ 405   │ 9:20;00 │
    │ 2   │ 6     │ 204   │ 304   │ 404   │ 9:15;00 │
    │ 3   │ 7     │ 203   │ 303   │ 403   │ 9:10;00 │
    │ 4   │ 8     │ 202   │ 302   │ 402   │ 9:05;00 │
    │ 5   │ 9     │ 201   │ 301   │ 401   │ 9:00:00 │
    Sorted by the time (String) column, ascending:
    10×5 DataFrame
    │ Row │ a     │ b     │ c     │ d     │ time    │
    │     │ Int64 │ Int64 │ Int64 │ Int64 │ String  │
    ├─────┼───────┼───────┼───────┼───────┼─────────┤
    │ 1   │ 9     │ 201   │ 301   │ 401   │ 9:00:00 │
    │ 2   │ 8     │ 202   │ 302   │ 402   │ 9:05;00 │
    │ 3   │ 7     │ 203   │ 303   │ 403   │ 9:10;00 │
    │ 4   │ 6     │ 204   │ 304   │ 404   │ 9:15;00 │
    │ 5   │ 5     │ 205   │ 305   │ 405   │ 9:20;00 │
    │ 6   │ 4     │ 206   │ 306   │ 406   │ 9:25;00 │
    │ 7   │ 3     │ 207   │ 307   │ 407   │ 9:30;00 │
    │ 8   │ 2     │ 208   │ 308   │ 408   │ 9:35;00 │
    │ 9   │ 1     │ 209   │ 309   │ 409   │ 9:40;00 │
    │ 10  │ 0     │ 210   │ 310   │ 410   │ 9:45;00 │
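Sorting a String column compares text character by character. That happens to give chronological order above because every hour has one digit. The sketch below uses hypothetical sample times (not from the original data) to show what changes once a two-digit hour appears, and how parsing with the `Dates` standard library restores chronological order.

```julia
using Dates

# Hypothetical sample that mixes one-digit and two-digit hours.
times = ["9:30:00", "10:00:00", "9:00:00"]

# A plain string sort compares character by character, so "10:..." sorts before "9:...".
lexicographic = sort(times)

# Sorting by the parsed Time value gives chronological order.
chronological = sort(times, by = t -> Time(t, dateformat"H:MM:SS"))

println(lexicographic)
println(chronological)
```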
    
    

    [Key point] Operator equivalents of in and not in, and their broadcast forms [∈, ∉, .∈, .∉]

    # Example 1
    "a" in ["b"]  # false
    "a" in ["a"]  # true
    "a" ∈ ["a"]  # true
    "a" ∉  ["b"]  # true
    
    # Example 2: filtering an array
    ary = ["a","b","c","d"]
    sub_ary = ["a","b"]
    
    ary[ary .∈ Ref(sub_ary)]               #["a", "b"]   note: Ref cannot be omitted
    ary[ary .|> item -> item ∈ sub_ary]    #["a", "b"]   equivalent form
    
    ary[ary .∉ Ref(sub_ary)]               #["c", "d"]
    ary[ary .|> item -> item ∉ sub_ary]    #["c", "d"]   equivalent form
    
    # Example 3: filtering a DataFrame
    using DataFrames
    using DataFramesMeta
    
    mydf = DataFrame()
    mydf.a = [x for x in 1:10]
    mydf.b = [x for x in 201:210]
    mydf.c = [x for x in 301:310]
    mydf.d = [x for x in 401:410]
    mydf.时间 = ["9:00:00","9:05;00","9:10;00","9:15;00","9:20;00","9:25;00","9:30;00","9:35;00","9:40;00","9:45;00"]
    mydf |> println
    
    timeSet = ["9:00:00","11:30:00","15:00;00"]
    @linq mydf |> where(:时间 .∈ Ref(timeSet)) |> println  # keep only the rows whose 时间 is in timeSet (use .∉ to exclude them instead); Ref cannot be omitted
    
    """
    # ==== Output ====
    10×5 DataFrame
    │ Row │ a     │ b     │ c     │ d     │ 时间    │
    │     │ Int64 │ Int64 │ Int64 │ Int64 │ String  │
    ├─────┼───────┼───────┼───────┼───────┼─────────┤
    │ 1   │ 1     │ 201   │ 301   │ 401   │ 9:00:00 │
    │ 2   │ 2     │ 202   │ 302   │ 402   │ 9:05;00 │
    │ 3   │ 3     │ 203   │ 303   │ 403   │ 9:10;00 │
    │ 4   │ 4     │ 204   │ 304   │ 404   │ 9:15;00 │
    │ 5   │ 5     │ 205   │ 305   │ 405   │ 9:20;00 │
    │ 6   │ 6     │ 206   │ 306   │ 406   │ 9:25;00 │
    │ 7   │ 7     │ 207   │ 307   │ 407   │ 9:30;00 │
    │ 8   │ 8     │ 208   │ 308   │ 408   │ 9:35;00 │
    │ 9   │ 9     │ 209   │ 309   │ 409   │ 9:40;00 │
    │ 10  │ 10    │ 210   │ 310   │ 410   │ 9:45;00 │
    
    1×5 DataFrame
    │ Row │ a     │ b     │ c     │ d     │ 时间    │
    │     │ Int64 │ Int64 │ Int64 │ Int64 │ String  │
    ├─────┼───────┼───────┼───────┼───────┼─────────┤
    │ 1   │ 1     │ 201   │ 301   │ 401   │ 9:00:00 │
    """
    
    
    
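The reason `Ref` is needed: broadcasting pairs elements of its arguments, and `Ref(sub_ary)` marks the collection as a single value so that every element is tested against the whole collection. A small sketch with Base only; the `pairwise` example is an added illustration, not part of the original notes.

```julia
ary = ["a", "b", "c", "d"]
sub_ary = ["a", "b"]

# Ref(sub_ary) makes broadcasting treat sub_ary as one value, so each element
# of ary is tested against the whole collection.
mask = ary .∈ Ref(sub_ary)

# in(collection) returns a one-argument predicate, which works directly with filter.
kept = filter(in(sub_ary), ary)
dropped = filter(!in(sub_ary), ary)

# Without Ref, the two sides are paired element by element:
# 1 is tested against [1, 9], and 2 is tested against [3, 4].
pairwise = [1, 2] .∈ [[1, 9], [3, 4]]

println(kept, " ", dropped)
```

The `filter(in(sub_ary), ary)` form avoids building the intermediate mask.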

    [Key point] Iterating over and sorting a dictionary

    keys(dict) -- the keys
    values(dict) -- the values
    collect(dict) -- the key => value pairs as a vector
    sort(pairs) -- sort the pairs

    # build the dictionary
    mydict = Dict([k=>"value:$(v)" for (k,v) in zip(1:10,11:20)])
    
    # Dictionary iteration 1
    for (key,value) in mydict
        println("$(key):$(value)")
    end
    
    """
    7:value:17
    4:value:14
    9:value:19
    10:value:20
    2:value:12
    3:value:13
    5:value:15
    8:value:18
    6:value:16
    1:value:11
    """
    
    # Dictionary iteration 2
    mydict |> keys |> println    # the keys
    
    
    #[7, 4, 9, 10, 2, 3, 5, 8, 6, 1]
    
    mydict |> values |> println  # the values
    
    #["value:17", "value:14", "value:19", "value:20", "value:12", "value:13", "value:15", "value:18", "value:16", "value:11"]
    
    
    # Sorting a dictionary: collect the pairs first and then sort them, or use an ordered dictionary directly; note that collect is required
    mydict |> collect |> sort .|> println 
    
    """
    === Output ===
    1 => "value:11"
    2 => "value:12"
    3 => "value:13"
    4 => "value:14"
    5 => "value:15"
    6 => "value:16"
    7 => "value:17"
    8 => "value:18"
    9 => "value:19"
    10 => "value:20"
    """
    
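Sorting the collected pairs orders them by key. Two related variations, sketched with Base only: sorting just the keys, and sorting the pairs by value through the `by` keyword.

```julia
mydict = Dict(k => "value:$(v)" for (k, v) in zip(1:10, 11:20))

# Sort only the keys; values can then be looked up in key order.
sorted_keys = sort(collect(keys(mydict)))

# Sort the pairs by value instead of by key; `last` picks the value of each pair.
# The values are strings, so this is a lexicographic comparison.
by_value_desc = sort(collect(mydict), by = last, rev = true)

for k in sorted_keys
    println(k, " => ", mydict[k])
end
```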

    shift(Array,n): move the values of an array by n positions

    shift(myary::Array,n::Int)

    """
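    Purpose: shift the elements of an array toward the front or toward the back
    Arguments:
         myary: the array to shift
         n: number of positions to shift. Positive: elements move from right to left (toward the front). Negative: elements move from left to right (toward the back)
         Returns: a new array; vacated positions are filled with NaN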
    function shift(myary::Array,n::Int)
        #result = zeros(Float64,length(myary))
        result = Array{Any,1}(undef, length(myary)) # Any element type, so multiple input types work
        for i in 1:length(myary)
            if 1 <= n+i <= length(myary)
                result[i] = myary[n+i]   
            else
                result[i] = NaN
            end
        end
        return result
    end
    
    # ==== Test code ====
    ary1 = [1,2,3,4,5,6,7,8,9,10]
    ary2 = ["1","2","3","4","5"]
    shift(ary1,1) |> println    # Any[2, 3, 4, 5, 6, 7, 8, 9, 10, NaN]
    shift(ary2,-1) |> println   # Any[NaN, "1", "2", "3", "4"]
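Storing `NaN` in an `Any` array mixes a Float64 into results such as the string array above. One possible alternative, shown here as a sketch, is to fill vacated positions with `missing` and keep a concrete element type. `shift_missing` is a hypothetical name, not a function from the original notes.

```julia
# Hypothetical variant `shift_missing`: same indexing rule as shift, but the result
# is typed Vector{Union{Missing, T}} and vacated positions hold `missing`.
function shift_missing(v::Vector{T}, n::Int) where {T}
    result = Vector{Union{Missing, T}}(missing, length(v))
    for i in eachindex(result)
        j = i + n
        if 1 <= j <= length(v)
            result[i] = v[j]
        end
    end
    return result
end

println(shift_missing([1, 2, 3, 4, 5], 1))
println(shift_missing(["1", "2", "3"], -1))
```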
    

    Evaluating a function over a rolling window of a series

    rolling_func(array::Array,n::Int,func::Function)::Array

    using Statistics
    """
    Purpose: take the values inside each rolling window and evaluate them with the given func
    Arguments:
      array: the input Array
      n: window length (number of elements)
      func: the operation applied to each window (for example mean or std)
    
    ==== Test code ====
    myary = [1,2,3,4,5,6,7,8,9]
    println(myary)
    println(rolling_func(myary,2,mean))
    println(rolling_func(myary,2,std))
    """
    function rolling_func(array::Array,n::Int,func::Function)::Array
        @assert 1 <= n <= length(array)
        result::Array = zeros(Float64,length(array))
        #result = Array{Any,1}(undef, length(array)) # works for multiple element types
        for i in 1:length(array)
            if i >= n                      
                #array[i-n+1:i] |> func |> (x)-> result[i] = x
                result[i] = array[i-n+1:i] |> func
            else
                result[i] = NaN
            end    
        end
        return result
    end
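For illustration, the same windowing rule can be written with views so that each window is not copied. This is a sketch under that assumption; `rolling_view` is a hypothetical name, not part of the original notes.

```julia
using Statistics

# Hypothetical variant `rolling_view`: same windowing rule as rolling_func, but each
# window is a view into the input, so no per-window copy is allocated.
function rolling_view(v::Vector{<:Real}, n::Int, func::Function)
    @assert 1 <= n <= length(v)
    result = fill(NaN, length(v))      # positions before the first full window stay NaN
    for i in n:length(v)
        result[i] = func(@view v[i-n+1:i])
    end
    return result
end

myary = [1, 2, 3, 4, 5, 6, 7, 8, 9]
means = rolling_view(myary, 2, mean)
stds = rolling_view(myary, 2, std)
println(means)
```

With a window of 2, the first entry stays `NaN` and every later entry is the mean of two neighbouring values.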
    

    Reading a YAML config file into a Dict

    read_yaml_config(file_path::String)::Dict

    import YAML
    # Read a YAML config file into a Dict
    function read_yaml_config(file_path::String)::Dict
        data::Dict = Dict()
        try
            # the do-block closes the file even if YAML.load throws
            data = open(file_path) do io
                YAML.load(io)
            end
            #println(typeof(data))
            #println(data)
        catch e
            println("read_yaml_config() failed, error: ",e)
        end
        return data
    end
    
    # """
    # ==== Test code ====
    file_path = "D:/pythonWorkSpace/lianghuafxi/mod/winCelue05/config/2020年度每日选股列表.yaml"
    read_yaml_config(file_path)
    @time read_yaml_config(file_path)
    # performance seems poor; reading is fairly slow
    # """
    

    Reading K-line (candlestick) data

    read_xt_kline(kind::String,code::String,kline_directory::String)::DataFrame

    Path:
    D:\pythonWorkSpace\lianghuafxi\mod\data2020\stock\price_000001.txt
    
    Contents of the txt file:
    
    timetag|open|high|low|close|volumn|amount
    20160104|9.51|9.53|8.87|8.95|563497.0|660373404.00
    20160105|8.90|9.15|8.80|9.01|663269.0|755529717.00
    20160106|9.02|9.14|9.00|9.12|515706.0|591700436.00
    20160107|9.02|9.02|8.60|8.62|174761.0|194869000.00
    20160108|8.85|8.92|8.59|8.77|747527.0|831332251.00
    20160111|8.67|8.74|8.41|8.47|732013.0|800683139.00
    20160112|8.53|8.60|8.37|8.52|561642.0|605972403.00
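To make the file layout concrete, the sketch below parses two of the sample rows with Base and the `Dates` standard library only. It is an illustration of the format, not a replacement for CSV.jl; the names `sample`, `rows`, `day_col` and `close_col` are introduced here for the example.

```julia
using Dates

# Header line plus two rows copied from the sample file contents.
sample = """
timetag|open|high|low|close|volumn|amount
20160104|9.51|9.53|8.87|8.95|563497.0|660373404.00
20160105|8.90|9.15|8.80|9.01|663269.0|755529717.00
"""

lines = split(strip(sample), '\n')
header = Symbol.(split(lines[1], '|'))

# Column positions are looked up from the header instead of being hard-coded.
day_col = findfirst(==(:timetag), header)
close_col = findfirst(==(:close), header)

# Parse each data row into a named tuple (day, close).
rows = map(lines[2:end]) do line
    fields = split(line, '|')
    (day = Date(fields[day_col], dateformat"yyyymmdd"),
     close = parse(Float64, fields[close_col]))
end

println(rows)
```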
    
    using CSV
    using DataFrames
    using Dates
    
    # Read K-line data
    function read_xt_kline(kind::String,code::String,kline_directory::String)::DataFrame
        if kind == "股票"
            kind = "stock"
        elseif kind == "指数"
            kind = "index"
        end       
        
        df = DataFrame()
        file = "$kline_directory/$kind/price_$code.txt" 
        try          
            df = CSV.read(file, DataFrame; delim = '|')                          # recent CSV.jl versions take the sink type as the second argument
            df = df[:,[:timetag,:open,:high,:low,:close]]                        # select the columns of interest
            rename!(df, [:day, :open, :high, :low, :close])                      # rename the columns; the trailing ! marks an in-place operation
            df.day = df.day .|> string .|> (x) -> Date(x,dateformat"yyyymmdd")   # Int -> string -> Date
            df[!,:stockID]  .= code                                              # add a column holding the stock code
            #display(df)
        catch e
            println("read_xt_kline() failed: ",file," error: ",e)       
        end    
        return df
    end
    
    """
    # Test code
    kind = "股票"
    stock = "600352"
    kline_directory = "D:/pythonWorkSpace/lianghuafxi/mod/data2020"  # location of the K-line files
    df = read_xt_kline(kind,stock,kline_directory)
    @time read_xt_kline(kind,stock,kline_directory)
    println("ok")
    """
    

    Reading K-line data and returning only the series of trading days

    read_xt_kline_as_daPan(kind::String,code::String,kline_directory::String)::Array
    Use: obtain the trading-day timeline for backtesting

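    using Dates
    using CSV
    using DataFrames
    
    # Read K-line data and return only the series of trading days
    function read_xt_kline_as_daPan(kind::String,code::String,kline_directory::String)::Array
        # kind: "股票" (stock) or "指数" (index); futures and funds are not mapped here.
        # Reads the XunTou K-line file and uses it as the market's trading-day timeline.
        # Returns: a vector of Date values (empty if the file cannot be read).
        
        # Map the kind to the directory name used in the file path
        if kind =="股票"
            kind ="stock"
        elseif kind =="指数"
            kind = "index"
        end        
         
        # Strip the market suffix, if present
        code = replace(code,".SH"=>"")
        code = replace(code,".SZ"=>"")   
       
        df = DataFrame()
        file = "$kline_directory/$kind/price_$code.txt" 
        try  # delisted stocks have no K-line file, so reading raises an error
            df = CSV.read(file, DataFrame; delim = '|')                          # recent CSV.jl versions take the sink type as the second argument
            df = df[:,[:timetag,:open,:high,:low,:close]]                        # select the columns of interest
            rename!(df, [:day, :open, :high, :low, :close])                      # rename the columns; the trailing ! marks an in-place operation
            df.day = df.day .|> string .|> (x) -> Date(x,dateformat"yyyymmdd")   # Int -> string -> Date 
        catch e
            println("***** failed to read the K-line file for {$code}, please check ****** {$e}")  
            return Date[]   # df has no day column at this point
        end
        return df.day  
    end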
    
    
    # Test code
    kind = "指数"
    stock = "000001"
    kline_directory = "D:/pythonWorkSpace/lianghuafxi/mod/data2020"  # location of the K-line files
    dt = read_xt_kline_as_daPan(kind,stock,kline_directory)
    @time read_xt_kline_as_daPan(kind,stock,kline_directory)
    dt |> display
    println("ok")
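The two `replace` calls in the function remove ".SH" or ".SZ" wherever they occur in the code string. If only a trailing suffix should be removed, an anchored regular expression does it in one call. A sketch; `strip_market_suffix` is a hypothetical helper name.

```julia
# Hypothetical helper `strip_market_suffix`: removes a trailing ".SH" or ".SZ" only.
strip_market_suffix(code::AbstractString) = replace(code, r"\.(SH|SZ)$" => "")

codes = ["600352.SH", "000001.SZ", "000001"]
cleaned = strip_market_suffix.(codes)
println(cleaned)
```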
    
    
    

    Article link: https://www.haomeiwen.com/subject/jazdnhtx.html