实例分析
14.4 USDA食品数据库
import json
db = json.load(open('datasets/usda_food/database.json'))
len(db)
6636
#数据中是字典形式,可以取关键词看数据内容
db[0].keys()
dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
import pandas as pd
nutrients = pd.DataFrame(db[0]['nutrients'])
print(nutrients[:7])
value units description group
0 25.18 g Protein Composition
1 29.20 g Total lipid (fat) Composition
2 3.06 g Carbohydrate, by difference Composition
3 3.28 g Ash Other
4 376.00 kcal Energy Energy
5 39.28 g Water Composition
6 1573.00 kJ Energy Energy
info_keys = ['description', 'group', 'id', 'manufacturer']#取出食物的名称、分类、编号以及制造商信息
info = pd.DataFrame(db, columns=info_keys)
print(info.info())
print('\n')
print(info[:5])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
description 6636 non-null object
group 6636 non-null object
id 6636 non-null int64
manufacturer 5195 non-null object
dtypes: int64(1), object(3)
memory usage: 129.7+ KB
None
description group id \
0 Cheese, caraway Dairy and Egg Products 1008
1 Cheese, cheddar Dairy and Egg Products 1009
2 Cheese, edam Dairy and Egg Products 1018
3 Cheese, feta Dairy and Egg Products 1019
4 Cheese, mozzarella, part skim milk Dairy and Egg Products 1028
manufacturer
0
1
2
3
4
print(pd.value_counts(info.group)[:10])
Vegetables and Vegetable Products 812
Beef Products 618
Baked Products 496
Breakfast Cereals 403
Fast Foods 365
Legumes and Legume Products 365
Lamb, Veal, and Game Products 345
Sweets 341
Pork Products 328
Fruits and Fruit Juices 328
Name: group, dtype: int64
书中说明现在需要把营养成分整合到一个大表中,需要添加一个ID列。
主要操作内容为:将各食物的营养成分列表转换为1个DataFrame,并添加1个
表示编号的列,然后将该DataFrame添加到1个列表中。
#参照了https://blog.csdn.net/yunini2/article/details/73604631 的做法。因为nutrients本身的数据量只有54个。所以需要对db数据表中每个数据进行提取。
#对db数据操作后,进行单个数据标签列追加。
nutrients=[]
for rec in db:
fnuts=pd.DataFrame(rec['nutrients'])
fnuts['id']=rec['id']
nutrients.append(fnuts)
nutrients=pd.concat(nutrients,ignore_index=True)
nutrients.duplicated().sum()
nutrients = nutrients.drop_duplicates()
#重命名
col_mapping = {'description' : 'food',
'group' : 'fgroup'}
info = info.rename(columns=col_mapping, copy=False)
info.info()
col_mapping = {'description' : 'nutrient',
'group' : 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
print(nutrients)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
food 6636 non-null object
fgroup 6636 non-null object
id 6636 non-null int64
manufacturer 5195 non-null object
dtypes: int64(1), object(3)
memory usage: 129.7+ KB
value units nutrient nutgroup id
0 25.180 g Protein Composition 1008
1 29.200 g Total lipid (fat) Composition 1008
2 3.060 g Carbohydrate, by difference Composition 1008
3 3.280 g Ash Other 1008
4 376.000 kcal Energy Energy 1008
... ... ... ... ... ...
389350 0.000 mcg Vitamin B-12, added Vitamins 43546
389351 0.000 mg Cholesterol Other 43546
389352 0.072 g Fatty acids, total saturated Other 43546
389353 0.028 g Fatty acids, total monounsaturated Other 43546
389354 0.041 g Fatty acids, total polyunsaturated Other 43546
[375176 rows x 5 columns]
#合并数据列
ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata.iloc[30000]
value 0.04
units g
nutrient Glycine
nutgroup Amino Acids
id 6158
food Soup, tomato bisque, canned, condensed
fgroup Soups, Sauces, and Gravies
manufacturer
Name: 30000, dtype: object
result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
result['Zinc, Zn'].sort_values().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x132fa3b0>
#各营养成分最为丰富的食物
by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])
get_maximum = lambda x: x.loc[x.value.idxmax()]
get_minimum = lambda x: x.loc[x.value.idxmin()]
max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]
max_foods.food = max_foods.food.str[:50]
max_foods.loc['Amino Acids']['food']
nutrient
Alanine Gelatins, dry powder, unsweetened
Arginine Seeds, sesame flour, low-fat
Aspartic acid Soy protein isolate
Cystine Seeds, cottonseed flour, low fat (glandless)
Glutamic acid Soy protein isolate
Glycine Gelatins, dry powder, unsweetened
Histidine Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA...
Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Lysine Seal, bearded (Oogruk), meat, dried (Alaska Na...
Methionine Fish, cod, Atlantic, dried and salted
Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Proline Gelatins, dry powder, unsweetened
Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Tryptophan Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE...
Name: food, dtype: object
说明:
放上参考链接,这个系列都是复现的这个链接中的内容。
放上原链接: https://www.jianshu.com/p/04d180d90a3f
作者在链接中放上了书籍,以及相关资源。因为平时杂七杂八的也学了一些,所以这次可能是对书中的部分内容的复现。也可能有我自己想到的内容,内容暂时都还不定。在此感谢原简书作者SeanCheney的分享。
网友评论