
A Little Progress Every Day - tricks

Author: Klaas | Published 2016-02-25 21:04

    Since I am currently doing deep learning research, the language I mainly use is Python. While actually writing programs I keep running into small tricks, so I am noting them down here and will keep updating this post. If you have any questions, please leave a comment below or contact me at strikerklaas@gmail.com.


    one-hot vector

    One-hot vectors are very important in natural language processing; they are commonly used as inputs to neural networks and act as an index. So how do we build such a matrix in practice? Let's start with a small data set, say data labeled with two classes, 0 and 1.
    As its name indicates, a one-hot vector is a vector in which exactly one element is 1 and all the others are 0. Suppose we have a vocabulary of 4000 words for text generation; then each word should have its own unique 4000-dimensional one-hot vector. For different tasks there are different ways to build these vectors.

    • classification
      Suppose there are only 2 classes: 0 and 1. The two one-hot vectors are [1,0] and [0,1]. Now suppose we have six training samples whose labels are stored in an array like [0,1,0,1,1,0]. We first build an identity (eye) matrix and then use the label array as an index to select the row each sample belongs to, which produces a matrix containing all the samples.
    >>> import numpy as np
    >>> x = np.eye(2) # Two types of vectors
    >>> y = np.array([0,1,0,1,1,0]) # classes
    >>> x
    array([[ 1.,  0.],
           [ 0.,  1.]])
    >>> y
    array([0, 1, 0, 1, 1, 0])
    >>> x[y] # By indexing, we generate a matrix for learning
    array([[ 1.,  0.],
           [ 0.,  1.],
           [ 1.,  0.],
           [ 0.,  1.],
           [ 0.,  1.],
           [ 1.,  0.]])
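
    The same indexing trick scales to a word-level vocabulary. Below is a minimal sketch, assuming a hypothetical vocabulary of 4000 words (as in the text-generation example above) and some made-up word indices:
    >>> vocab_size = 4000                    # assumed vocabulary size
    >>> word_ids = np.array([12, 7, 3051])   # hypothetical indices of three words in a sequence
    >>> np.eye(vocab_size, dtype=np.float32)[word_ids].shape
    (3, 4000)
    Each row of the result is the one-hot vector of the corresponding word, so a whole sequence becomes a (sequence length, vocabulary size) matrix.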
    

    float32 (theano)

    The default floating-point data type is float64; however, data must be converted to float32 in order to be stored on the GPU.

    • convert to float32
    epsilon = np.float32(0.01)
    
    • use shared statement
    import numpy as np
    import theano
    import theano.tensor as T
    w = theano.shared(np.random.randn(input_dimension, output_dimension).astype('float32'), name='w')
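
    To double-check that a shared variable really holds float32 data, you can inspect its dtype; a quick sketch (the variable names follow the snippet above):
    >>> w.get_value().dtype   # should be float32 after the explicit cast above
    dtype('float32')
    >>> theano.config.floatX  # defaults to 'float64'; set THEANO_FLAGS=floatX=float32 for GPU work
    'float64'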
    

    MNIST dataset

    The MNIST dataset is a universally-used dataset for digit recognition; its characteristics can be summarized as follows:

    1. training set: 50,000 examples, validation set: 10,000, test set: 10,000
    2. 28 x 28 pixels (each training example is represented as a 1-dimensional array of length 784)
      Now, we begin by opening the dataset in Python and preparing it so that it can be used with GPU acceleration.
    import cPickle, gzip, numpy, theano
    import theano.tensor as T

    Load the dataset:

    f = gzip.open('mnist.pkl.gz', 'rb')
    train_set, valid_set, test_set = cPickle.load(f)
    f.close()

    Next, store the data in GPU memory:

    def share_dataset(data_xy):
        # store the data as Theano shared variables
        data_x, data_y = data_xy
        shared_x = theano.shared(numpy.asarray(data_x, dtype=theano.config.floatX))
        shared_y = theano.shared(numpy.asarray(data_y, dtype=theano.config.floatX))
        '''
        The following syntax also works:
        shared_x = theano.shared(data_x.astype('float32'))
        shared_y = theano.shared(data_y.astype('float32'))
        '''
        # Since the labels 'y' should be integers, not floats, we cast them
        return shared_x, T.cast(shared_y, 'int32')

    Now try it!

    test_set_x, test_set_y = share_dataset(test_set)
    valid_set_x, valid_set_y = share_dataset(valid_set)
    train_set_x, train_set_y = share_dataset(train_set)
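
    As a quick sanity check (a sketch assuming the standard mnist.pkl.gz split listed above), you can pull the pixel data back out of the shared variable and confirm the cast on the labels:
    >>> train_set_x.get_value(borrow=True).shape   # 50,000 examples, 784 pixels each
    (50000, 784)
    >>> train_set_y.dtype                          # the labels are now symbolic int32 values
    'int32'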

