54 数据分析案例4

作者: 7125messi | 来源:发表于2018-06-13 07:42 被阅读7次

54 数据分析案例4
spark第二天作业
萨恩斯：硕士案例研究中的数据分析方法
第四阶段爬虫整理
约练收获及复盘收获
随感
随感
随感
随感
随感

美国农业部食品数据库

这个数据是关于食物营养成分的。存储格式是JSON，看起来像这样

{
    "id": 21441, 
    "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY, Wing, meat and skin with breading", 
    "tags": ["KFC"], 
    "manufacturer": "Kentucky Fried Chicken", 
    "group": "Fast Foods", 
    "portions": [ 
        { "amount": 1, 
          "unit": "wing, with skin", 
          "grams": 68.0
        }
        ...
      ],
    "nutrients": [ 
      { "value": 20.8, 
        "units": "g", 
        "description": "Protein", 
        "group": "Composition" 
      },
      ...
     ]
}

每种食物都有一系列特征，其中有两个list，protions和nutrients。我们必须把这样的数据进行处理，方便之后的分析。

这里使用python内建的json模块

import pandas as pd
import numpy as np
import json
pd.options.display.max_rows = 10
db = json.load(open('../datasets/usda_food/database.json'))
len(db)

db[0].keys()

dict_keys(['manufacturer', 'description', 'group', 'id', 'tags', 'nutrients', 'portions'])

db[0]['nutrients'][0]

{'description': 'Protein',
 'group': 'Composition',
 'units': 'g',
 'value': 25.18}

nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients

description group   units   value
0   Protein Composition g   25.180
1   Total lipid (fat)   Composition g   29.200
2   Carbohydrate, by difference Composition g   3.060
3   Ash Other   g   3.280
4   Energy  Energy  kcal    376.000
... ... ... ... ...
157 Serine  Amino Acids g   1.472
158 Cholesterol Other   mg  93.000
159 Fatty acids, total saturated    Other   g   18.584
160 Fatty acids, total monounsaturated  Other   g   8.275
161 Fatty acids, total polyunsaturated  Other   g   0.830
162 rows × 4 columns

当把由字典组成的list转换为DataFrame的时候，我们可以吹创业提取的list部分。这里我们提取食品名，群（group），ID，制造商：

info_keys = ['description', 'group', 'id', 'manufacturer']
info = pd.DataFrame(db, columns=info_keys)
info[:5]

description group   id  manufacturer
0   Cheese, caraway Dairy and Egg Products  1008    
1   Cheese, cheddar Dairy and Egg Products  1009    
2   Cheese, edam    Dairy and Egg Products  1018    
3   Cheese, feta    Dairy and Egg Products  1019    
4   Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028

info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
description     6636 non-null object
group           6636 non-null object
id              6636 non-null int64
manufacturer    5195 non-null object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB

我们可以看到食物群的分布，使用value_counts:

pd.value_counts(info.group)[:10]

Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Legumes and Legume Products          365
Fast Foods                           365
Lamb, Veal, and Game Products        345
Sweets                               341
Pork Products                        328
Fruits and Fruit Juices              328
Name: group, dtype: int64

这里我们对所有的nutrient数据做一些分析，把每种食物的nutrient部分组合成一个大表格。首先，把每个食物的nutrient列表变为DataFrame，添加一列为id，然后把id添加到DataFrame中，接着使用concat联结到一起：

# 先创建一个空DataFrame用来保存最后的结果
# 这部分代码运行时间较长，请耐心等待
nutrients_all = pd.DataFrame()

for food in db:
    nutrients = pd.DataFrame(food['nutrients'])
    nutrients['id'] = food['id']
    nutrients_all = nutrients_all.append(nutrients, ignore_index=True)

作者在书中说了用concat联结在一起，但实际测试后，这个concat的方法非常耗时，用时几乎是append方法的两倍，所以上面的代码中使用了append方法。

一切正常的话出来的效果是这样的：

nutrients_all

description group   units   value   id
0   Protein Composition g   25.180  1008
1   Total lipid (fat)   Composition g   29.200  1008
2   Carbohydrate, by difference Composition g   3.060   1008
3   Ash Other   g   3.280   1008
4   Energy  Energy  kcal    376.000 1008
... ... ... ... ... ...
389350  Vitamin B-12, added Vitamins    mcg 0.000   43546
389351  Cholesterol Other   mg  0.000   43546
389352  Fatty acids, total saturated    Other   g   0.072   43546
389353  Fatty acids, total monounsaturated  Other   g   0.028   43546
389354  Fatty acids, total polyunsaturated  Other   g   0.041   43546
389355 rows × 5 columns

这个DataFrame中有一些重复的部分，看一下有多少重复的行：

nutrients_all.duplicated().sum() # number of duplicates

把重复的部分去掉：

nutrients_all = nutrients_all.drop_duplicates()
nutrients_all

description group   units   value   id
0   Protein Composition g   25.180  1008
1   Total lipid (fat)   Composition g   29.200  1008
2   Carbohydrate, by difference Composition g   3.060   1008
3   Ash Other   g   3.280   1008
4   Energy  Energy  kcal    376.000 1008
... ... ... ... ... ...
389350  Vitamin B-12, added Vitamins    mcg 0.000   43546
389351  Cholesterol Other   mg  0.000   43546
389352  Fatty acids, total saturated    Other   g   0.072   43546
389353  Fatty acids, total monounsaturated  Other   g   0.028   43546
389354  Fatty acids, total polyunsaturated  Other   g   0.041   43546
375176 rows × 5 columns

为了与info_keys中的group和descripton区别开，我们把列名更改一下：

col_mapping = {'description': 'food',
               'group': 'fgroup'}
info = info.rename(columns=col_mapping, copy=False)
info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
food            6636 non-null object
fgroup          6636 non-null object
id              6636 non-null int64
manufacturer    5195 non-null object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB

col_mapping = {'description' : 'nutrient',
               'group': 'nutgroup'}
nutrients_all = nutrients_all.rename(columns=col_mapping, copy=False)
nutrients_all

nutrient    nutgroup    units   value   id
0   Protein Composition g   25.180  1008
1   Total lipid (fat)   Composition g   29.200  1008
2   Carbohydrate, by difference Composition g   3.060   1008
3   Ash Other   g   3.280   1008
4   Energy  Energy  kcal    376.000 1008
... ... ... ... ... ...
389350  Vitamin B-12, added Vitamins    mcg 0.000   43546
389351  Cholesterol Other   mg  0.000   43546
389352  Fatty acids, total saturated    Other   g   0.072   43546
389353  Fatty acids, total monounsaturated  Other   g   0.028   43546
389354  Fatty acids, total polyunsaturated  Other   g   0.041   43546
375176 rows × 5 columns

上面所有步骤结束后，我们可以把info和nutrients_all合并（merge）：

ndata = pd.merge(nutrients_all, info, on='id', how='outer')
ndata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns (total 8 columns):
nutrient        375176 non-null object
nutgroup        375176 non-null object
units           375176 non-null object
value           375176 non-null float64
id              375176 non-null int64
food            375176 non-null object
fgroup          375176 non-null object
manufacturer    293054 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 25.8+ MB

ndata.iloc[30000]

nutrient                                       Glycine
nutgroup                                   Amino Acids
units                                                g
value                                             0.04
id                                                6158
food            Soup, tomato bisque, canned, condensed
fgroup                      Soups, Sauces, and Gravies
manufacturer                                          
Name: 30000, dtype: object

我们可以对食物群（food group）和营养类型（nutrient type）分组后，对中位数进行绘图：

result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
%matplotlib inline
result['Zinc, Zn'].sort_values().plot(kind='barh', figsize=(10, 8))
<matplotlib.axes._subplots.AxesSubplot at 0x109c80d68>

我们还可以找到每一种营养成分含量最多的食物是什么：

by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])

get_maximum = lambda x: x.loc[x.value.idxmax()]
get_minimum = lambda x: x.loc[x.value.idxmin()]

max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]

# make the food a little smaller
max_foods.food = max_foods.food.str[:50]

因为得到的DataFrame太大，这里只输出'Amino Acids'(氨基酸)的营养群（nutrient group）:

max_foods.loc['Amino Acids']['food']

nutrient
Alanine                          Gelatins, dry powder, unsweetened
Arginine                              Seeds, sesame flour, low-fat
Aspartic acid                                  Soy protein isolate
Cystine               Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                  Soy protein isolate
                                       ...                        
Serine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine        Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan        Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine         Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine           Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food, Length: 19, dtype: object

54 数据分析案例4
美国农业部食品数据库这个数据是关于食物营养成分的。存储格式是JSON，看起来像这样每种食物都有一系列特征，其中...
spark第二天作业
基站数据分析案例 [TOC] 本节任务场景解读练习数据维度分析教学目标使用案例练习Spark算子通过基站数据...
萨恩斯：硕士案例研究中的数据分析方法
硕士案例研究中的数据分析方法硕士案例数据分析往往与数据搜集同步进行。初步搜集数据的同时进行数据分析，之后会产生下...
第四阶段爬虫整理
爬虫概述爬虫案例案例1：爬取百度贴吧数据分析：GET方式爬取数据抓包：设计：实现：案例2：抓妹子图分析：...
约练收获及复盘收获
中原焦点团队---欧阳小兰焦点中26期分享261天20210827本周约练1次，来54观15咨36案例分析4BOX...
随感
中原焦点团队---欧阳小兰焦点中26期分享274天20210909本周约练0次，来54观15咨36案例分析4BOX...
随感
中原焦点团队---欧阳小兰焦点中26期分享270天20210905本周约练1次，来54观15咨36案例分析4BOX...
随感
中原焦点团队---欧阳小兰焦点中26期分享271天20210906本周约练0次，来54观15咨36案例分析4BOX...
随感
中原焦点团队---欧阳小兰焦点中26期分享273天20210908本周约练0次，来54观15咨36案例分析4BOX...
随感
中原焦点团队---欧阳小兰焦点中26期分享275天20210911本周约练0次，来54观15咨36案例分析4BOX...

54 数据分析案例4

美国农业部食品数据库

相关文章

54 数据分析案例4

spark第二天作业

萨恩斯：硕士案例研究中的数据分析方法

第四阶段爬虫整理

约练收获及复盘收获

随感

随感

随感

随感

随感

网友评论

延伸阅读

深度阅读

栏目导航

热点阅读

python