美文网首页
python pandas入门一

python pandas入门一

作者: 九月_1012 | 来源:发表于2019-04-10 15:29 被阅读0次

Creating DataFrames from scratch

import pandas as pd
data = {
   'apples': [3, 2, 0, 1], 
   'oranges': [0, 3, 7, 2]
}

purchases = pd.DataFrame(data)
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases
purchases.loc['June']

Reading data from CSVs

df = pd.read.csv('purchases.csv')
df
df=pd.read.csv('purchases.csv', index=0)
df

Reading data from JSON

df = pd.read_json('purchases.json')

Reading data from a SQL database

pip install pysqlite3
import sqlite3

con = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT * FROM purchases", con)

df = df.set_index('index')

Converting back to a CSV, JSON, or SQL

df.to_csv('new_purchases.csv')

df.to_json('new_purchases.json')

df.to_sql('new_purchases', con)

Most important DataFrame operations

movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")
movies_df.head( )
movies_df.tail(2)
movies_df.info()
movies_df.shape

Handling duplicates

temp_df = movies_df.append(movies_df)
temp_df = temp_df.drop_duplicates()

temp_df.drop_duplicates(inplace=True) #temp_df直接被改变。

temp_df = movies_df.append(movies_df)  # make a new copy

temp_df.drop_duplicates(inplace=True, keep=False)#keep去掉了存在重复的行,所以数据框变空了。

temp_df.shape

Column cleanup

movies_df.columns#打印列名

movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)

How to work with missing values

movies_df.isnull()
movies_df.isnull().sum()

movies_df.dropna()    #Removing null values 按行
movies_df.dropna(axis=1)    #Removing null values 按列,axis 来自numpy

Imputation

revenue = movies_df['revenue_millions']
revenue_mean = revenue.mean()
revenue.fillna(revenue_mean, inplace=True)
movies_df.isnull().sum()

movies_df.describe() #每列的统计值,数量、中位数、最大值、最小值等
movies_df['genre'].describe()
movies_df['genre'].value_counts().head(10)  #值的频率
movies_df.corr()   #列之间的相关性

DataFrame slicing, selecting, extracting

genre_col = movies_df['genre']   #数据类型是series
type(genre_col)

genre_col = movies_df[['genre']]   #数据类型是DataFrame
type(genre_col)

subset = movies_df[['genre', 'rating']]
subset.head()

以上主要基于列,基于行的话为有.loc,.iloc

prom = movies_df.loc["Prometheus"]
prom = movies_df.iloc[1]    #.iloc行的数字

movie_subset = movies_df.loc['Prometheus':'Sing']
movie_subset = movies_df.iloc[1:4]

slicing

Conditional selections

condition = (movies_df['director'] == "Ridley Scott") #true or false
condition.head()

movies_df[movies_df['director'] == "Ridley Scott"]  #filtered  false
movies_df[movies_df['rating'] >= 8.6].head(3)

movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()

movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]

Applying functions

def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"
          
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)
movies_df.head(2)

movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')

movies_df.head(2)

Brief Plotting

pip install matplotlib

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) # set font and plot size to be larger

"""For categorical variables utilize Bar Charts* and Boxplots.For continuous variables utilize Histograms, Scatterplots, Line graphs, and Boxplots."""

movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');

movies_df['rating'].plot(kind='hist', title='Rating');

movies_df['rating'].describe()

movies_df['rating'].plot(kind="box")

movies_df.boxplot(column='revenue_millions', by='rating_category')

Wrapping up

来源:https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
最后附上总的学习网址:
https://github.com/LearnDataSci/article-resources

相关文章

网友评论

      本文标题:python pandas入门一

      本文链接:https://www.haomeiwen.com/subject/kwaniqtx.html