Introduction to Data Science in

Author: Python休 | Published 2020-03-12 15:26

    These are the author's study notes from the Coursera course Introduction to Data Science in Python, provided for reference only.


    1. 50 Years of Data Science

        (1) Data Exploration and Preparation 

        (2) Data Representation and Transformation

        (3) Computing with Data

        (4) Data Modeling

        (5) Data Visualization and Presentation

        (6) Science about Data Science


    2. Functions

    def add_numbers(x,  y,  z = None, flag = False):

        if (flag):

            print('Flag is true!')

        if z is None:

            return x + y

        else:

            return x + y + z

    print(add_numbers(1, 2, flag=True))

    Assign function add_numbers to a variable a:

    a = add_numbers

    a(1, 2, flag=True)
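The following is a minimal, self-contained sketch of the same idea: default arguments make `z` and `flag` optional, and because a function is an ordinary object, a second name can refer to it. The variable names `two_args`, `three_args`, and `with_flag` are only illustrative.

```python
def add_numbers(x, y, z=None, flag=False):
    """Add two or three numbers; optionally announce the flag."""
    if flag:
        print('Flag is true!')
    if z is None:
        return x + y
    return x + y + z


# A function is an object, so another name can refer to the same function.
a = add_numbers

two_args = a(1, 2)               # z falls back to None
three_args = a(1, 2, 3)          # z is supplied positionally
with_flag = a(1, 2, flag=True)   # prints 'Flag is true!' before returning

print(two_args, three_args, with_flag)
```

Note that Python's boolean literals are capitalized (`True`, `False`); a lowercase `true` would raise a NameError.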


    3. Checking data types

    type('This is a string')

    -> str

    type(None)

    -> NoneType


    4. Tuple

    Tuples are an immutable data structure (cannot be altered).

    x = (1, 'a', 2, 'b')

    type(x)

    ->tuple


    5. List

    Lists are a mutable data structure.

    x = [1, 'a', 2, 'b']

    type(x)

    ->list


    6. Append

    Use append to add an object to the end of a list.

    x.append(3.3)

    print(x)

    ->[1, 'a', 2, 'b', 3.3]


    7. Loop through each item in the list

    for item in x:

        print(item)

    ->1

        a

        2

        b

        3.3


    8. Using the indexing operator to loop through each item in the list

    i = 0

    while i != len(x):

        print(x[i])

        i = i + 1

    ->1

        a

        2

        b

        3.3
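As an alternative to managing the counter by hand, the built-in `enumerate` yields each index together with its item. This is a small sketch rather than part of the original notes; the list `pairs` is only there to make the result easy to inspect.

```python
x = [1, 'a', 2, 'b', 3.3]

pairs = []
for i, item in enumerate(x):
    # i is the position, item is the element at that position
    pairs.append((i, item))
    print(i, item)
```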


    9. Basic list operations

    (1) Use + to concatenate lists

    [1, 2] + [3, 4]

    -> [1, 2, 3, 4]

    (2)Use * to repeat lists

    [1]*3

    ->[1, 1, 1]

    (3) Use the in operator to check if something is inside a list

    1 in [1, 2, 3]

    ->True


    10. Basic string operations

    (1) Use bracket notation to slice a string.

    x = 'This is a string'

    print(x[0])

    ->T

    print(x[0:1])

    ->T

    print(x[0:2])

    ->Th

    print(x[-1])  # the last element

    ->g

    print(x[-4:-2])  # start from the 4th element from the end and stop before the 2nd element from the end

    ->ri

    x[:3]  # This is a slice from the beginning of the string and stopping before the 3rd element.

    ->Thi

    x[3:] # this is a slice starting from the 4th element of the string and going all the way to the end.

    -> s is a string
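A few more slices on the same string, as a sketch of how `start`, `stop`, and an optional third `step` value combine. The variable names are illustrative only.

```python
x = 'This is a string'

first_word = x[:4]      # from the start, stopping before index 4
last_word = x[-6:]      # the last six characters
every_other = x[::2]    # every second character, starting at index 0
reversed_x = x[::-1]    # a negative step walks the string backwards

print(first_word, last_word, every_other, reversed_x, sep=' | ')
```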

    (2) The +, * and in operators also work on strings

    firstname = 'Christopher'

    lastname = 'Brooks'

    print(firstname + ' ' + lastname)

    ->Christopher Brooks

    print(firstname*3)

    ->ChristopherChristopherChristopher

    print('Chris' in firstname)

    ->True

    (3) Split returns a list of all the words in a string, or a list split on a specific character.

    firstname = 'Christopher Arthur Hansen Brooks'.split(' ')[0] 

    lastname = 'Christopher Arthur Hansen Brooks'.split(' ')[-1] 

    print(firstname)

    ->Christopher

    print(lastname)

    ->Brooks

    (4) Make sure you convert objects to strings before concatenating.

    'Chris' + 2

    ->TypeError (a str cannot be concatenated with an int)

    'Chris' + str(2)

    ->Chris2


    11. Dictionary

    (1)Dictionaries associate keys with values

    x = {'Christopher Brooks': 'brooksch@umich.edu', 'Bill Gates': 'billg@microsoft.com'}

    x['Christopher Brooks']

    ->brooksch@umich.edu

    x['Kevyn Collins-Thompson'] = None

    x['Kevyn Collins-Thompson']

    -> (no output, because the value is None)

    (2) Iterate over all of the keys:

    for name in x:

        print(x[name])

    ->brooksch@umich.edu

        billg@microsoft.com

        None

    (3) Iterate over all of the values:

    for email in x.values():

        print(email)

    ->brooksch@umich.edu

        billg@microsoft.com

        None

    (4) Iterate over all of the items in the dictionary:

    for name, email in x.items():

        print(name)

        print(email)

    ->Christopher Brooks

        brooksch@umich.edu

        Bill Gates

        billg@microsoft.com

        Kevyn Collins-Thompson

        None

    (5) Unpack a sequence into different variables:

    x = ('Christopher', 'Brooks', 'brooksch@umich.edu')

    fname, lname, email = x

    fname

    ->Christopher

    lname

    ->Brooks

    (6) Make sure the number of values you are unpacking matches the number of variables being assigned.

    x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')

    fname, lname, email = x

    ->ValueError (too many values to unpack)
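When the number of values is not known in advance, a starred target can absorb the remainder. The sketch below is an addition to the notes; the names `rest` and `message` are illustrative.

```python
x = ('Christopher', 'Brooks', 'brooksch@umich.edu', 'Ann Arbor')

# A starred name collects whatever is left over into a list.
fname, lname, *rest = x
print(fname, lname, rest)

# Without the star, four values cannot be unpacked into three names.
try:
    fname, lname, email = x
except ValueError as err:
    message = str(err)
    print('ValueError:', message)
```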


    12. More on Strings

    (1) Simple Samples

    print('Chris' + 2)

    ->TypeError (a str cannot be concatenated with an int)

    print('Chris' + str(2))

    ->Chris2

    (2) Python has a built-in method, str.format, for convenient string formatting.

    sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris' }

    sales_statement = '{} bought {} item(s) at a price of {} each for a total of {}'

    print(sales_statement.format(sales_record['person'], sales_record['num_items'], sales_record['price'], sales_record['num_items']*sales_record['price']))

    ->Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96
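The same statement can also be written with named fields, which avoids having to keep the argument order in sync with the placeholders. This is a sketch using only `str.format`; the `:.2f` format spec and the `total` name are choices made for this example.

```python
sales_record = {'price': 3.24, 'num_items': 4, 'person': 'Chris'}
total = sales_record['num_items'] * sales_record['price']

# Named placeholders are filled from keyword arguments;
# **sales_record expands the dictionary into keyword arguments.
template = ('{person} bought {num_items} item(s) at a price of {price} '
            'each for a total of {total:.2f}')
statement = template.format(total=total, **sales_record)
print(statement)
```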


    13. Reading and Writing CSV files

    (1) Import csv and read the file

    import csv

    %precision 2  # IPython magic: display floats with 2 decimal places

    with open('mpg.csv') as csvfile:

        mpg = list(csv.DictReader(csvfile)) # convert csvfile into a list whose elements are dictionaries

    mpg[:3]

    ->

    [OrderedDict([('', '1'),

                  ('manufacturer', 'audi'),

                  ('model', 'a4'),

                  ('displ', '1.8'),

                  ('year', '1999'),

                  ('cyl', '4'),

                  ('trans', 'auto(l5)'),

                  ('drv', 'f'),

                  ('cty', '18'),

                  ('hwy', '29'),

                  ('fl', 'p'),

                  ('class', 'compact')]),

    OrderedDict([('', '2'),

                  ('manufacturer', 'audi'),

                  ('model', 'a4'),

                  ('displ', '1.8'),

                  ('year', '1999'),

                  ('cyl', '4'),

                  ('trans', 'manual(m5)'),

                  ('drv', 'f'),

                  ('cty', '21'),

                  ('hwy', '29'),

                  ('fl', 'p'),

                  ('class', 'compact')]),

    OrderedDict([('', '3'),

                  ('manufacturer', 'audi'),

                  ('model', 'a4'),

                  ('displ', '2'),

                  ('year', '2008'),

                  ('cyl', '4'),

                  ('trans', 'manual(m6)'),

                  ('drv', 'f'),

                  ('cty', '20'),

                  ('hwy', '31'),

                  ('fl', 'p'),

                  ('class', 'compact')])]

    (2) Check the length of the list

    len(mpg)

    ->234

    (3)keys gives us the column names of our csv

    mpg[0].keys()

    ->odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

    (4) Find the average hwy fuel economy across all cars. All values in the dictionaries are strings, so we need to convert them to float.

    sum(float(d['hwy']) for d in mpg) / len(mpg)

    ->23.44

    (5) Use set to return the unique values for the number of cylinders the cars in our dataset have.

    cylinders = set(d['cyl'] for d in mpg)

    cylinders

    ->{'4', '5', '6', '8'}

    (6) Group the cars by number of cylinders and find the average cty mpg for each group.

    CtyMpgByCyl = []

    for c in cylinders:

        summpg = 0

        cyltypecount = 0

        for d in mpg:

                if d['cyl'] == c:

                    summpg += float(d['cty'])

                    cyltypecount += 1

        CtyMpgByCyl.append((c, summpg / cyltypecount))

    CtyMpgByCyl.sort(key = lambda x: x[0])

    CtyMpgByCyl

    ->[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]

    (7) Use set to return the unique values for the class types in our dataset

    vehicleclass = set(d['class'] for d in mpg)

    vehicleclass

    ->{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

    (8) How to find the average hwy mpg for each class of vehicle in our dataset.

    HwyMpgByClass = []

    for t in vehicleclass:

        summpg = 0

        vclasscount = 0

        for d in mpg:

                if d['class'] == t:

                        summpg += float(d['hwy'])

                        vclasscount += 1

        HwyMpgByClass.append((t, summpg / vclasscount))

    HwyMpgByClass.sort(key = lambda x: x[1])

    HwyMpgByClass

    ->

    [('pickup', 16.88),

    ('suv', 18.13),

    ('minivan', 22.36),

    ('2seater', 24.80),

    ('midsize', 27.29),

    ('subcompact', 28.14),

    ('compact', 28.30)]
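The nested loops above rescan the whole list once per group. A single pass with `collections.defaultdict` gives the same kind of result; the sketch below runs on a made-up three-row sample held in memory (not the real mpg.csv), so the numbers are only illustrative.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample with the same column names; NOT the real mpg.csv.
sample = io.StringIO(
    'class,hwy\n'
    'compact,29\n'
    'compact,31\n'
    'suv,20\n'
)
rows = list(csv.DictReader(sample))

# Accumulate a running sum and count per class in one pass.
totals = defaultdict(float)
counts = defaultdict(int)
for d in rows:
    totals[d['class']] += float(d['hwy'])   # CSV values are strings
    counts[d['class']] += 1

# Build (class, average) pairs and sort by the average.
hwy_by_class = sorted(
    ((c, totals[c] / counts[c]) for c in totals),
    key=lambda pair: pair[1],
)
print(hwy_by_class)
```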


    14. Dates and Times

    (1) Import the datetime and time modules

    import datetime as dt

    import time as tm

    (2) Time returns the current time in seconds since the Epoch

    tm.time()

    ->1583932727.90

    (3) Convert the timestamp to datetime

    dtnow = dt.datetime.fromtimestamp(tm.time())

    dtnow

    ->

    datetime.datetime(2020, 3, 11, 13, 18, 56, 990293)

    (4) Handy datetime attributes: get year, month, day, etc. from a datetime

    dtnow.year, dtnow.month, dtnow.day, dtnow.hour, dtnow.minute, dtnow.second

    ->(2020, 3, 11, 13, 18, 56)

    (5) Timedelta is a duration expressing the difference between two dates.

    delta = dt.timedelta(days = 100)

    delta

    ->datetime.timedelta(100)

    (6) date.today returns the current local date

    today = dt.date.today()

    today

    ->datetime.date(2020, 3, 11)

    (7) the date 100 days ago

    today - delta

    ->datetime.date(2019, 12, 2)

    (8) compare dates

    today > today - delta

    -> True
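Because `date.today()` changes from day to day, the arithmetic is easier to check with a fixed date. This sketch pins the date to the one shown in the output above; `start`, `earlier`, and `gap` are illustrative names.

```python
import datetime as dt

start = dt.date(2020, 3, 11)        # fixed date, taken from the output above
delta = dt.timedelta(days=100)

earlier = start - delta             # date minus timedelta gives a date
gap = start - earlier               # date minus date gives a timedelta

print(earlier, gap.days, start > earlier)
```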


    15. Objects and map()

    (1) an example of a class in python:

    class Person:

        department = 'School of Information'

        def set_name(self, new_name):

                self.name = new_name

        def set_location(self, new_location):

                self.location = new_location

    person = Person()

    person.set_name('Christopher Brooks')

    person.set_location('Ann Arbor, MI, USA')

    print('{} live in {} and work in the department {}'.format(person.name, person.location, person.department))

    (2) mapping the min function between two lists

    store1 = [10.00, 11.00, 12.34, 2.34]

    store2 = [9.00, 11.10, 12.34, 2.01]

    cheapest = map(min, store1, store2)

    cheapest

    -><map at 0x7f74034a8860>

    (3) iterate through the map object to see the values

    for item in cheapest:

        print(item)

    ->

    9.0

    11.0

    12.34

    2.01
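One detail worth noting: in Python 3, `map` returns a lazy iterator that can be consumed only once. The sketch below makes that visible by converting the same map object to a list twice; `first_pass` and `second_pass` are illustrative names.

```python
store1 = [10.00, 11.00, 12.34, 2.34]
store2 = [9.00, 11.10, 12.34, 2.01]

cheapest = map(min, store1, store2)   # nothing is computed yet

first_pass = list(cheapest)           # consumes the iterator
second_pass = list(cheapest)          # the iterator is now exhausted

print(first_pass)
print(second_pass)
```

If the values are needed more than once, store the result of `list(...)` rather than the map object itself.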


    16. Lambda and List Comprehensions

    (1) an example of lambda that takes in three parameters and adds the first two

    my_function = lambda a, b, c: a+b

    my_function(1, 2, 3)

    ->3

    (2) iterate from 0 to 999 and return the even numbers.

    my_list = []

    for number in range(0, 1000):

            if number % 2 == 0:

                    my_list.append(number)

    my_list

    ->[0, 2, 4,...]

    (3) Now the same thing but with list comprehension

    my_list = [number for number in range(0, 1000) if number % 2 == 0]

    my_list

    ->[0, 2, 4,...]
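To confirm that the loop and the comprehension really produce the same list, the two can be compared directly. The third variant, using the step argument of `range`, is an extra shown here only for comparison.

```python
# Explicit loop
loop_result = []
for number in range(0, 1000):
    if number % 2 == 0:
        loop_result.append(number)

# List comprehension with a filter condition
comprehension_result = [number for number in range(0, 1000) if number % 2 == 0]

# range with a step of 2 skips the odd numbers entirely
stepped = list(range(0, 1000, 2))

print(loop_result == comprehension_result == stepped, len(stepped))
```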


    17. Numpy

    (1) import package

    import numpy as np


    18. Creating arrays (from tuples or lists)

    (1) create a list and convert it to a numpy array

    mylist = [1, 2, 3]

    x = np.array(mylist)

    x

    ->array([1, 2, 3])

    (2) just pass in a list directly

    y = np.array([4, 5, 6])

    y

    ->array([4, 5, 6])

    (3) pass in a list of lists to create a multidimensional array

    m = np.array([[7, 8, 9], [10, 11, 12]])

    m

    ->

    array([[ 7, 8, 9],

          [10, 11, 12]])

    (4) use the shape attribute to find the dimensions of the array

    m.shape 

    ->(2,3)

    (5) arange returns evenly spaced values within a given interval

    n = np.arange(0, 30, 2)

    n

    ->array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

    (6) reshape returns an array with the same data with a new shape

    n = n.reshape(3, 5)

    n

    ->

    array([[ 0, 2, 4, 6, 8],

          [10, 12, 14, 16, 18],

          [20, 22, 24, 26, 28]])

    (7) linspace returns evenly spaced numbers over a specified interval

    o = np.linspace(0, 4, 9)

    o

    ->array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. ])

    (8) resize changes the shape and size of the array in place

    o.resize(3, 3)

    o

    ->

    array([[ 0. , 0.5, 1. ],

          [ 1.5,  2. ,  2.5],

          [ 3. ,  3.5,  4. ]])
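The difference between the two calls is easy to miss: `reshape` returns a new array object and leaves the original's shape alone, while the `resize` method modifies the array itself and returns `None`. A small sketch, with `reshaped` and `result` as illustrative names:

```python
import numpy as np

n = np.arange(0, 30, 2)
reshaped = n.reshape(3, 5)    # n itself still has shape (15,)

o = np.linspace(0, 4, 9)
result = o.resize(3, 3)       # o is changed in place; the call returns None

print(n.shape, reshaped.shape)
print(result, o.shape)
```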

    (9) ones returns a new array of given shape and type, filled with ones

    np.ones((3, 2))

    ->

    array([[ 1., 1.],

          [ 1.,  1.],

          [ 1.,  1.]])

    (10) zeros returns a new array of given shape and type, filled with zeros

    np.zeros((2,3))

    ->

    array([[ 0., 0., 0.],

          [ 0.,  0.,  0.]])

    (11) eye returns a 2D array with ones on the diagonal and zeros elsewhere

    np.eye(3)

    ->

    array([[ 1., 0., 0.],

          [ 0.,  1.,  0.],

          [ 0.,  0.,  1.]])

    (12) diag extracts a diagonal or constructs a diagonal array

    np.diag(y)

    ->

    array([[4, 0, 0],

          [0, 5, 0],

          [0, 0, 6]])

    (13) create an array from a repeated list

    np.array([1, 2, 3]*3)

    ->array([1, 2, 3, 1, 2, 3, 1, 2, 3])

    (14) repeat elements of an array using repeat

    np.repeat([1, 2, 3], 3)

    ->array([1, 1, 1, 2, 2, 2, 3, 3, 3])

    (15) combine arrays

    p = np.ones([2, 3], int)

    p

    ->

    array([[1, 1, 1],

          [1, 1, 1]])

    (16) use vstack to stack arrays in sequence vertically (row wise).

    np.vstack([p, 2*p])

    ->

    array([[1, 1, 1],

          [1, 1, 1],

          [2, 2, 2],

          [2, 2, 2]])

    (17) use hstack to stack arrays in sequence horizontally (column wise).

    np.hstack([p, 2*p])

    ->

    array([[1, 1, 1, 2, 2, 2],

          [1, 1, 1, 2, 2, 2]])
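Checking the shapes makes the direction of each stack explicit. The sketch also compares the results with `np.concatenate` along axis 0 and axis 1, which gives the same arrays for 2D inputs like these; `v`, `h`, `same_v`, and `same_h` are illustrative names.

```python
import numpy as np

p = np.ones([2, 3], int)

v = np.vstack([p, 2 * p])     # rows are appended: shape (4, 3)
h = np.hstack([p, 2 * p])     # columns are appended: shape (2, 6)

# For 2D arrays these match concatenation along axis 0 and axis 1.
same_v = np.concatenate([p, 2 * p], axis=0)
same_h = np.concatenate([p, 2 * p], axis=1)

print(v.shape, h.shape)
```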


    19. Operations

    (1) element wise + - * /

    print(x+y)

    print(x-y)

    ->

    [5 7 9]

    [-3 -3 -3]

    print(x*y)

    print(x/y)

    ->

    [ 4 10 18]

    [ 0.25  0.4  0.5 ]

    print(x**2)

    ->[1 4 9]

    (2) Dot Product

    x.dot(y) # x1y1+x2y2+x3y3

    ->32
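The dot product is the sum of the element-wise products, which can be verified by hand. In this sketch `manual` and `via_operator` are illustrative names; the `@` operator is the matrix-multiplication operator, which for two 1D arrays also yields the dot product.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

manual = sum(a * b for a, b in zip(x, y))   # 1*4 + 2*5 + 3*6
via_method = x.dot(y)
via_operator = x @ y

print(manual, via_method, via_operator)
```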

    (3) build a 2D array from y and y**2

    z = np.array([y, y**2])

    print(z)

    print(len(z)) #number of rows of array

    ->

    [[ 4 5 6]

    [16 25 36]]

    2

    (4) transpose array

    z

    ->

    [[ 4 5 6]

    [16 25 36]]

    z.T

    ->

    array([[ 4, 16],

          [ 5, 25],

          [ 6, 36]])

    (5) use .dtype to see the data type of the elements in the array

    z.dtype

    ->dtype('int64')

    (6) use .astype to cast to a specific type 

    z = z.astype('f')

    z.dtype

    ->dtype('float32')

    (7) math functions 

    a = np.array([-4, -2, 1, 3, 5])

    a.sum()

    ->3

    a.max()

    ->5

    a.min()

    ->-4

    a.mean()

    ->0.6

    a.std()

    ->3.2619012860600183

    a.argmax()

    ->4

    a.argmin()

    ->0

    (8) indexing / slicing

    s = np.arange(13)**2

    s

    ->array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144])

    (9)use bracket notation to get the value at a specific index

    s[0], s[4], s[-1]

    ->(0, 16, 144)

    (10) use : to indicate a range: array[start:stop]

    s[1:5]

    ->array([ 1, 4, 9, 16])

    (11) use negatives to count from the back

    s[-4:]

    ->array([ 81, 100, 121, 144])

    (12) A second : can be used to indicate the step size: array[start : stop : stepsize]

    Here we start at the 5th element from the end and count backwards by 2 until the beginning of the array is reached.

    s[-5::-2]

    ->array([64, 36, 16, 4, 0])

    (13) look at the multidimensional array

    r = np.arange(36)

    r.resize((6,6))

    r

    ->

    array([[ 0, 1, 2, 3, 4, 5],

          [ 6,  7,  8,  9, 10, 11],

          [12, 13, 14, 15, 16, 17],

          [18, 19, 20, 21, 22, 23],

          [24, 25, 26, 27, 28, 29],

          [30, 31, 32, 33, 34, 35]])

    (14) use bracket notation to slice

    r[2, 2]

    ->14

    (15) use : to select a range of rows or columns

    r[3, 3:6]

    ->array([21, 22, 23])

    (16) select all the rows up to (but not including) row 2, and all the columns up to (but not including) the last column.

    r[:2, :-1]

    ->

    array([[ 0, 1, 2, 3, 4],

          [ 6,  7,  8,  9, 10]])

    (17) a slice of last row, only every other element

    r[-1, ::2]

    ->array([30, 32, 34])

    (18) perform conditional indexing.

    r[r > 30]

    ->array([31, 32, 33, 34, 35])
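Conditional indexing works in two steps: the comparison first builds a boolean array of the same shape, and that array is then used as the index. A short sketch, with `mask` and `selected` as illustrative names:

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

mask = r > 30           # boolean array, same shape as r
selected = r[mask]      # 1D array of the elements where mask is True

print(mask.shape, int(mask.sum()))
print(selected)
```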

    (19) assigning all values in the array that are greater than 30 to the value of 30

    r[r > 30] = 30

    r

    ->

    array([[ 0, 1, 2, 3, 4, 5],

          [ 6,  7,  8,  9, 10, 11],

          [12, 13, 14, 15, 16, 17],

          [18, 19, 20, 21, 22, 23],

          [24, 25, 26, 27, 28, 29],

          [30, 30, 30, 30, 30, 30]])

    (20) copy and modify arrays

    r2 = r[:3, :3]

    r2

    ->

    array([[ 0, 1, 2],

          [ 6,  7,  8],

          [12, 13, 14]])

    (21) set this slice's values to zero ([:] selects the entire array)

    r2[:] = 0

    r2

    ->

    array([[0, 0, 0],

          [0, 0, 0],

          [0, 0, 0]])

    (22) r has also been changed

    r

    ->

    array([[ 0, 0, 0, 3, 4, 5],

          [ 0,  0,  0,  9, 10, 11],

          [ 0,  0,  0, 15, 16, 17],

          [18, 19, 20, 21, 22, 23],

          [24, 25, 26, 27, 28, 29],

          [30, 30, 30, 30, 30, 30]])

    (23) to avoid this, use .copy()

    r_copy = r.copy()

    r_copy

    ->

    array([[ 0, 0, 0, 3, 4, 5],

          [ 0,  0,  0,  9, 10, 11],

          [ 0,  0,  0, 15, 16, 17],

          [18, 19, 20, 21, 22, 23],

          [24, 25, 26, 27, 28, 29],

          [30, 30, 30, 30, 30, 30]])

    (24) now when r_copy is modified, r will not be changed

    r_copy[:] = 10

    print(r_copy, '\n')

    print(r)

    ->

    [[10 10 10 10 10 10]

    [10 10 10 10 10 10]

    [10 10 10 10 10 10]

    [10 10 10 10 10 10]

    [10 10 10 10 10 10]

    [10 10 10 10 10 10]]

    [[ 0  0  0  3  4  5]

    [ 0  0  0  9 10 11]

    [ 0  0  0 15 16 17]

    [18 19 20 21 22 23]

    [24 25 26 27 28 29]

    [30 30 30 30 30 30]]
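The behaviour above follows from basic slices being views onto the original data. `np.shares_memory` can be used to check this directly; the sketch below uses a fresh array, and the names `view` and `independent` are illustrative.

```python
import numpy as np

r = np.arange(36).reshape(6, 6)

view = r[:3, :3]                  # basic slice: a view onto r's data
independent = r[:3, :3].copy()    # explicit copy: separate data

print(np.shares_memory(r, view))         # the view shares r's buffer
print(np.shares_memory(r, independent))  # the copy does not

view[:] = 0                       # writes through to r
print(r[1, 1], independent[1, 1])
```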

    (25) create a new 4x3 array of random integers from 0 to 9

    test = np.random.randint(0, 10, (4,3))

    test

    ->

    array([[1, 8, 2],

          [6, 1, 5],

          [7, 8, 0],

          [7, 6, 2]])

    (26) iterate by row

    for row in test:

        print(row)

    ->

    [1 8 2] 

    [6 1 5]

    [7 8 0]

    [7 6 2]

    (27) iterate by index

    for i in range(len(test)):

            print(test[i])

    ->

    [1 8 2]

    [6 1 5]

    [7 8 0]

    [7 6 2]

    (28) iterate by row and index

    for i, row in enumerate(test):

            print('row', i, 'is', row)

    ->

    row 0 is [1 8 2]

    row 1 is [6 1 5]

    row 2 is [7 8 0]

    row 3 is [7 6 2]

    (29) use zip to iterate over multiple iterables

    test2 = test**2

    test2

    ->

    array([[ 1, 64, 4],

          [36,  1, 25],

          [49, 64,  0],

          [49, 36,  4]])

    for i, j in zip(test, test2):

            print(i, '+', j, '=', i+j)

    ->

    [1 8 2] + [ 1 64 4] = [ 2 72 6]

    [6 1 5] + [36  1 25] = [42  2 30]

    [7 8 0] + [49 64  0] = [56 72  0]

    [7 6 2] + [49 36  4] = [56 42  6]
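Iterating row by row with `zip` is useful for printing, but the same sums can be computed in one element-wise operation on the whole arrays. The sketch below uses the fixed values from the output above instead of new random numbers; `row_sums` and `vectorized` are illustrative names.

```python
import numpy as np

# Fixed values copied from the example output, so the result is reproducible.
test = np.array([[1, 8, 2],
                 [6, 1, 5],
                 [7, 8, 0],
                 [7, 6, 2]])
test2 = test ** 2

# Row-by-row addition using zip
row_sums = [i + j for i, j in zip(test, test2)]

# The same result from a single element-wise addition
vectorized = test + test2

print(np.array_equal(np.array(row_sums), vectorized))
print(vectorized)
```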
