A Practical Guide to Feature Engineering with tf.feature_column

Author: xiaogp | Published 2020-06-12 16:30

    Overview of tf.feature_column feature processing

    tf.feature_column mainly targets categorical and continuous features. Broadly speaking, it covers four scenarios:

    (1) Discretizing continuous features: bucketing, hashing, one-hot and multi-hot encoding
    (2) Mapping high-dimensional categorical features to low-dimensional dense vectors
    (3) Normalizing continuous features
    (4) Crossing and combining categorical features

    (Figure: feature_column.png — overview of the tf.feature_column APIs)

    Representation strategies for categorical features

    tf.feature_column provides four interfaces for handling categorical variables: direct one-hot of integer IDs, one-hot via a vocabulary list, one-hot via hash bucketing, and embeddings that turn categorical features into continuous ones. The first three all turn string/integer features into 0/1 one-hot vectors; the last, embedding_column, maps that one-hot result further into a low-dimensional dense vector. The interfaces are:

    tf.feature_column.categorical_column_with_identity
    tf.feature_column.categorical_column_with_vocabulary_list
    tf.feature_column.categorical_column_with_hash_bucket
    tf.feature_column.embedding_column


    1. Mapping integer-valued features directly to categorical features: tf.feature_column.categorical_column_with_identity

    If a categorical column is already encoded as consecutive integers starting from 0, it can be mapped directly to a categorical variable. You specify the number of possible values up front; out-of-range values are filled with the default value. This suits features that are already integer-ID encoded with a fairly small range of IDs. If each input value is a list of multiple elements, this produces a multi-hot encoding, but every list must have the same number of elements.

    key: feature name
    num_buckets: number of distinct values of the categorical feature
    default_value=None: fill value for out-of-range values

    # Discretize column col1
    one_categorical_feature = tf.feature_column.categorical_column_with_identity("col1", 10)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
    features = {"col1": [[4], [1], [0]]}  # 必须是大于等于0的数,不能是负数
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    with tf.Session() as sess:
        print(net.eval())
    # [[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
    #  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
    #  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
    
    # Test multi-hot
    one_categorical_feature = tf.feature_column.categorical_column_with_identity("col1", 10)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
    features = {"col1": [[4, 2], [1, 0], [0, 1]]}  # 列表中包含多个元素
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    with tf.Session() as sess:
        print(net.eval())
    # [[0. 0. 1. 0. 1. 0. 0. 0. 0. 0.]
    #  [1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
    #  [1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
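As an illustration only (not TF's actual implementation), what categorical_column_with_identity plus indicator_column computes can be sketched in plain Python: each integer ID indexes directly into a zero vector of length num_buckets, and an ID repeated within one row accumulates a count.

```python
def identity_indicator(rows, num_buckets):
    """One-/multi-hot encode rows of integer IDs in [0, num_buckets)."""
    out = []
    for ids in rows:
        vec = [0.0] * num_buckets
        for i in ids:
            vec[i] += 1.0  # a repeated ID in the same row accumulates its count
        out.append(vec)
    return out

print(identity_indicator([[4], [1], [0]], 10))
```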
    

    2. Mapping string categorical features via a vocabulary list: tf.feature_column.categorical_column_with_vocabulary_list

    If a categorical column holds strings (integers also work) with only a few distinct values, use this interface and simply define the possible values. When a new value appears, it can be assigned a new index or mapped onto an existing one. Vocabulary mapping also supports multi-hot.

    key: feature name
    vocabulary_list: list of all vocabulary entries
    dtype=None
    default_value=-1: handles out-of-vocabulary values by mapping them onto the index of an existing entry; since indices start at 0, the default of -1 effectively ignores them
    num_oov_buckets=0: handles out-of-vocabulary values by mapping them to extra indices; sets how many extra indices to reserve

    # Discretize column col1
    from tensorflow.python.ops import lookup_ops
    
    # num_oov_buckets adds 2 extra columns to catch possible new values; with the 4 vocabulary one-hot columns there are 6 columns in total
    one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
                'col1', vocabulary_list=['a', 'x', 'ca', '我'], num_oov_buckets=2)
    # the one-hot index order follows ['a', 'x', 'ca', '我'], so once vocabulary_list's order is fixed the result is always the same
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  
    
    features = {'col1': [['a'], ['我'], ['a'], ['ca']]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    with tf.Session() as sess:
        sess.run(lookup_ops.tables_initializer())
        print(net.eval())
    # [[1. 0. 0. 0. 0. 0.]
    #  [0. 0. 0. 1. 0. 0.]
    #  [1. 0. 0. 0. 0. 0.]
    #  [0. 0. 1. 0. 0. 0.]]
    

    If num_oov_buckets=0 (the default), new values are simply ignored:

    one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
                'col1', vocabulary_list=['a', 'x', 'ca', '我'], num_oov_buckets=0)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  
    
    features = {'col1': [['a'], ['他'], ['a'], ['ca']]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    with tf.Session() as sess:
        sess.run(lookup_ops.tables_initializer())
        print(net.eval())
    # [[1. 0. 0. 0.]
    # [0. 0. 0. 0.]
    #  [1. 0. 0. 0.]
    #  [0. 0. 1. 0.]]
    

    Another way to handle new values is default_value, which maps them onto an existing entry, e.g. one serving as an "other" category:

    one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
                'col1', vocabulary_list=['a', 'x', 'ca', '我'], default_value=1)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  
    
    features = {'col1': [['a'], ['他'], ['a'], ['ca']]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    with tf.Session() as sess:
        sess.run(lookup_ops.tables_initializer())
        print(net.eval())
    # [[1. 0. 0. 0.]
    #  [0. 1. 0. 0.]
    #  [1. 0. 0. 0.]
    #  [0. 0. 1. 0.]]
    

    Testing multi-hot: when the same element appears multiple times in one row, its count exceeds 1

    from tensorflow.python.ops import lookup_ops
    
    one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
                'col1', vocabulary_list=['a', 'x', 'ca', '我'], num_oov_buckets=2)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  
    
    features = {'col1': [['a', 'a'], ['我', 'a'], ['a', 'ca'], ['ca', 'x']]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    with tf.Session() as sess:
        sess.run(lookup_ops.tables_initializer())
        print(net.eval())
    # [[2. 0. 0. 0. 0. 0.]
    #  [1. 0. 0. 1. 0. 0.]
    #  [1. 0. 1. 0. 0. 0.]
    #  [0. 1. 1. 0. 0. 0.]]
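The vocabulary lookup with default_value and num_oov_buckets can be sketched in plain Python as below. Note that Python's built-in hash() is only a stand-in for the Fingerprint64 hash TF actually uses, so OOV bucket assignments will not match TF's.

```python
def vocab_lookup(value, vocabulary, num_oov_buckets=0, default_value=-1):
    """Map a value to an index, mimicking categorical_column_with_vocabulary_list."""
    # in-vocabulary values get their position in vocabulary_list
    if value in vocabulary:
        return vocabulary.index(value)
    # OOV values hash into the extra index range, if any
    if num_oov_buckets > 0:
        return len(vocabulary) + hash(value) % num_oov_buckets
    # otherwise fall back to default_value; -1 drops the value from the one-hot
    return default_value
```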
    

    3. Hashing string categorical features into categorical features: tf.feature_column.categorical_column_with_hash_bucket

    If the column is a string categorical variable (integers are also supported) with many distinct values, such as IDs, use this interface.

    key: feature name
    hash_bucket_size: an integer of at least 2; the number of buckets
    dtype=tf.string: input feature type; strings and integers are supported

    Hash bucketing reduces a high-cardinality categorical feature to a lower-dimensional categorical representation.

    one_categorical_feature = tf.feature_column.categorical_column_with_hash_bucket(
                'col1', hash_bucket_size=3)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
    features = {'col1': [['a'], ['x'], ['a'], ['b'], ['d'], ['h'], ['c'], ['k']]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    with tf.Session() as sess:    
        print(net.eval())
    # [[1. 0. 0.]
    #  [0. 0. 1.]
    #  [1. 0. 0.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]
    #  [0. 1. 0.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]]
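Conceptually, hash bucketing is just a string hash taken modulo hash_bucket_size. The sketch below uses SHA-256 as a stand-in hash (TF uses Fingerprint64, so the actual bucket IDs will differ from TF's output above):

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Deterministically assign a string to one of hash_bucket_size buckets."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size
```

Different strings can land in the same bucket (a collision), which is the price paid for the dimensionality reduction.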
    

    If the input is integer-valued, set dtype=tf.int32; internally the integers are first converted to strings and then hashed

    one_categorical_feature = tf.feature_column.categorical_column_with_hash_bucket(
                'col1', hash_bucket_size=3, dtype=tf.int32)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
    features = {'col1': [[1], [2]]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    with tf.Session() as sess:    
        print(net.eval())
    # [[1. 0. 0.]
    #  [0. 1. 0.]]
    

    4. Embedding after discretization: tf.feature_column.embedding_column

    After the integer, vocabulary, or hash mappings above, indicator_column produces the one-hot encoding directly; going one step further, embedding_column looks the one-hot matrix up in a randomly initialized embedding table to produce a dense embedding vector. By default the embedding is trained along with the model, i.e. trainable=True. For multi-hot inputs, embedding supports the combiners mean, sqrtn, and sum.

    1. Embedding after hashing

    # hash first
    one_categorical_feature = tf.feature_column.categorical_column_with_hash_bucket(
                'col1', hash_bucket_size=3, dtype=tf.string)
    # after hashing, embed the hashed column
    embedding_cols = tf.feature_column.embedding_column(one_categorical_feature, dimension=3)
    # compare against one-hot directly after hashing
    one_hot_cols = tf.feature_column.indicator_column(one_categorical_feature) 
    # the feature key col1 must match one_categorical_feature
    features = {"col1": [['a'], ['x'], ['a'], ['ca']]}
    # inspect the embedding and one-hot results separately
    net = tf.feature_column.input_layer(features, embedding_cols)
    net2 = tf.feature_column.input_layer(features, one_hot_cols)
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(net.eval())
        print(net2.eval())
    # [[-0.34977508  0.6984099   0.3818953 ]
    #  [-0.15186837  0.8362309   0.03452474]
    #  [-0.34977508  0.6984099   0.3818953 ]
    #  [-0.15186837  0.8362309   0.03452474]]
    # [[1. 0. 0.]
    #  [0. 0. 1.]
    #  [1. 0. 0.]
    #  [0. 0. 1.]]
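What embedding_column does on top of the one-hot result is equivalent to multiplying the one-hot vector by an embedding matrix, which simply selects one row. A minimal sketch with a made-up 3-bucket, 3-dimensional table (in TF this table is randomly initialized and then trained):

```python
# a made-up embedding table; the values are illustrative only
embedding_matrix = [
    [-0.35, 0.70, 0.38],   # row for bucket 0
    [ 0.12, -0.40, 0.91],  # row for bucket 1
    [-0.15, 0.84, 0.03],   # row for bucket 2
]

def embed(bucket_id):
    # multiplying a one-hot vector by the matrix just selects this row
    return embedding_matrix[bucket_id]
```

This is why inputs that hash to the same bucket (e.g. 'a' appearing twice above) get identical embedding vectors.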
    

    2. Embedding after vocabulary one-hot

    one_categorical_feature = tf.feature_column.categorical_column_with_vocabulary_list(
                'col1', vocabulary_list=['a', 'x', 'ca', '我'], num_oov_buckets=2)
    embedding_cols = tf.feature_column.embedding_column(one_categorical_feature, dimension=3)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)  
    
    features = {'col1': [['a'], ['我'], ['a'], ['ca']]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    net2 = tf.feature_column.input_layer(features, embedding_cols)
    
    with tf.Session() as sess:
        sess.run(lookup_ops.tables_initializer())
        sess.run(tf.global_variables_initializer())
        print(net.eval())
        print(net2.eval())
    

    3. Embedding after direct integer one-hot

    one_categorical_feature = tf.feature_column.categorical_column_with_identity("col1", 10)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
    embedding_cols = tf.feature_column.embedding_column(one_categorical_feature, dimension=3)
    
    features = {"col1": [[4], [1], [0]]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    net2 = tf.feature_column.input_layer(features, embedding_cols)
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(net.eval())
        print(net2.eval())
    # [[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
    #  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
    #  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
    # [[-0.25737718  0.7426282  -0.12589012]
    #  [ 0.4047859   1.0059739   0.38896823]
    #  [ 0.07904201 -0.10592438 -0.10732928]]
    

    Multi-hot embedding

    one_categorical_feature = tf.feature_column.categorical_column_with_identity("col1", 10)
    one_categorical_feature_show = tf.feature_column.indicator_column(one_categorical_feature)
    embedding_cols = tf.feature_column.embedding_column(one_categorical_feature, dimension=3)  # the default combiner is mean
    
    features = {"col1": [[4, 2], [1, 0], [0, 3]]}
    net = tf.feature_column.input_layer(features, one_categorical_feature_show)
    net2 = tf.feature_column.input_layer(features, embedding_cols)
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(net.eval())
        print(net2.eval())
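The three combiners can be sketched in plain Python as follows. This is a simplified illustration for equal-length ID lists: sum adds the looked-up rows, mean divides by the number of IDs, and sqrtn divides by the square root of that number.

```python
import math

def embed_multihot(ids, table, combiner="mean"):
    """Combine the embedding rows of a multi-hot input (simplified sketch)."""
    rows = [table[i] for i in ids]
    summed = [sum(col) for col in zip(*rows)]  # element-wise sum of the rows
    if combiner == "sum":
        return summed
    if combiner == "mean":
        return [v / len(rows) for v in summed]
    if combiner == "sqrtn":
        return [v / math.sqrt(len(rows)) for v in summed]
    raise ValueError("unknown combiner: " + combiner)
```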
    

    Representation strategies for continuous features

    tf.feature_column has two interfaces for continuous variables: mapping continuous values directly to continuous features, and bucketizing continuous values into categorical features. The interfaces are:

    tf.feature_column.numeric_column
    tf.feature_column.bucketized_column


    1. Mapping continuous values directly to continuous features: tf.feature_column.numeric_column

    one_continuous_feature = tf.feature_column.numeric_column("col1")
    feature = {"col1": [[1.], [5.]]}
    net = tf.feature_column.input_layer(feature, one_continuous_feature)
    with tf.Session() as sess:
        print(sess.run(net))
    # [[1.]
    #  [5.]]
    normalizer_fn can be used to normalize a continuous column; pass a custom function:
    one_continuous_feature = tf.feature_column.numeric_column("col1", normalizer_fn=lambda x: (x - 1.0) / 4.0)
    feature = {"col1": [[1.], [5.]]}
    net = tf.feature_column.input_layer(feature, one_continuous_feature)
    with tf.Session() as sess:
        print(sess.run(net))
    # [[0.]
    #  [1.]]
    

    2. Bucketizing continuous features into categorical features: tf.feature_column.bucketized_column

    one_continuous_feature = tf.feature_column.numeric_column("col1")
    # no tf.feature_column.indicator_column needed: bucketized_column maps directly to one-hot
    bucket = tf.feature_column.bucketized_column(one_continuous_feature, boundaries=[3, 8])  # <3, >=3 and <8, >=8
    features = {"col1": [[2], [7], [13]]}
    net = tf.feature_column.input_layer(features, [bucket])
    with tf.Session() as sess:
        print(net.eval())
    # 2 boundaries produce a 3-column one-hot
    # [[1. 0. 0.]
    # [0. 1. 0.]
    # [0. 0. 1.]]
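The boundary semantics above can be sketched with the standard-library bisect module: boundaries [3, 8] yield the buckets (-inf, 3), [3, 8), and [8, +inf), so a value equal to a boundary falls into the upper bucket.

```python
import bisect

def bucketize(value, boundaries):
    """Return the bucket index for value, given sorted boundaries."""
    # bisect_right puts a value equal to a boundary into the upper bucket
    return bisect.bisect_right(boundaries, value)

print([bucketize(v, [3, 8]) for v in [2, 7, 13]])  # → [0, 1, 2]
```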
    

    A bucketized continuous variable can be followed by an embedding

    one_continuous_feature = tf.feature_column.numeric_column("col1")
    bucket = tf.feature_column.bucketized_column(one_continuous_feature, boundaries=[3, 8])  # <3, >=3 and <8, >=8
    embedding_cols = tf.feature_column.embedding_column(bucket, dimension=3)
    
    features = {"col1": [[2], [7], [13]]}
    net = tf.feature_column.input_layer(features, [bucket])
    net2 = tf.feature_column.input_layer(features, embedding_cols)
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(net.eval())
        print(net2.eval())
    # [[1. 0. 0.]
    #  [0. 1. 0.]
    #  [0. 0. 1.]]
    # [[ 0.16381341 -1.0849875  -1.0715116 ]
    #  [ 0.02688654  0.4091867   0.4175648 ]
    #  [ 0.61752146 -0.605965   -0.04298744]]    
    

    Crossing categorical features

    tf.feature_column.crossed_column can cross and combine categorical features, increasing the representational power of the model's features

    # sex 0/1 encodes male/female and degree 0/1/2 encodes education levels; crossing the two gives 2 * 3 = 6 combinations
    feature_a = tf.feature_column.categorical_column_with_identity("sex", num_buckets=2)
    feature_b = tf.feature_column.categorical_column_with_identity("degree", num_buckets=3)
    # hash_bucket_size is required and must be an integer greater than 1
    feature_cross = tf.feature_column.crossed_column([feature_a, feature_b], hash_bucket_size=6)
    feature_cross_show = tf.feature_column.indicator_column(feature_cross)
    
    features = {"sex": [[1], [1], [0]], "degree": [[1], [0], [2]]}
    
    net = tf.feature_column.input_layer(features, feature_cross_show)
    with tf.Session() as sess:
        print(net.eval())
    
    # [[0. 0. 0. 1. 0. 0.]
    #  [0. 0. 0. 0. 0. 1.]
    #  [0. 0. 1. 0. 0. 0.]]
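Conceptually, crossed_column joins the co-occurring values and hashes the result into hash_bucket_size buckets. The sketch below uses SHA-256 and a hypothetical "_X_" separator as stand-ins (TF actually uses a FingerprintCat64-based scheme, so the real bucket IDs differ from the output above):

```python
import hashlib

def cross_bucket(values, hash_bucket_size):
    """Hash a tuple of co-occurring categorical values into one bucket."""
    # join the values from each crossed column into a single key
    key = "_X_".join(str(v) for v in values)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size
```

Because the cross is hashed, distinct (sex, degree) pairs can collide into the same bucket when hash_bucket_size is small.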
    

    Trying to cross a continuous feature with a categorical feature fails with an error

    feature_a = tf.feature_column.categorical_column_with_identity("sex", num_buckets=2)
    feature_b = tf.feature_column.categorical_column_with_identity("degree", num_buckets=3)
    feature_c = tf.feature_column.numeric_column("age")
    feature_cross = tf.feature_column.crossed_column([feature_b, feature_c], hash_bucket_size=6)
    feature_cross_show = tf.feature_column.indicator_column(feature_cross)
    
    features = {"sex": [[1], [1], [0]], "degree": [[1], [0], [2]], "age": [[1], [2], [3]]}
    
    net = tf.feature_column.input_layer(features, feature_cross_show)
    with tf.Session() as sess:
        print(net.eval())
    # Unsupported key type. All keys must be either string, or categorical column except HashedCategoricalColumn. Given: 
    # NumericColumn(key='age', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)
    

    Defining a mix of feature types

    When defining multiple features, put multiple keys in the features dict and pass a list of feature columns. The order of the columns in that list does not affect the combined result: the output columns are ordered by the string order of each column's name attribute

    feature_a = tf.feature_column.numeric_column("col1")
    feature_b = tf.feature_column.categorical_column_with_hash_bucket(
            "col2", hash_bucket_size=3)
    feature_c = tf.feature_column.embedding_column(
            feature_b, dimension=3)
    feature_d = tf.feature_column.indicator_column(feature_b)
    print(feature_a.name)
    print(feature_c.name)
    print(feature_d.name)
    features = {
            "col1": [[9], [10]],
            "col2": [["x"], ["yy"]]
            }
    net = tf.feature_column.input_layer(features, [feature_d, feature_c, feature_a])  # the order of this list does not matter
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(net.eval())
    
    # col1
    # col2_embedding
    # col2_indicator
    # [[ 9.          0.3664888   0.81272614 -0.45234588  0.          0.     1.        ]
    #  [10.          0.9593401  -0.18347593 -0.06236586  0.          1.     0.        ]]
    # col1 first, then col2_embedding, then col2_indicator: 7 feature columns in total
    

        Original link: https://www.haomeiwen.com/subject/byqntktx.html