EDA: Identifying Outliers

Author: IntoTheVoid | Published 2018-10-08 18:05
Visualizing single variables with histograms

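In the IPython shell, first call the .describe() method on the 'Existing Zoning Sqft' column to compute its summary statistics. You will notice an extremely large gap between the min and max values, so the plot needs to be adjusted accordingly. In a case like this, it is best to view the plot on a logarithmic scale. The keyword arguments logx=True and logy=True can be passed to .plot().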
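As a minimal sketch of the summary-statistics step, the snippet below calls .describe() on a small made-up column. The DataFrame and its values are hypothetical stand-ins for the real dataset, chosen only so that the gap between the median and the max is easy to see.

```python
import pandas as pd

# Hypothetical stand-in for the real data: mostly zeros plus one huge value
df = pd.DataFrame({'Existing Zoning Sqft': [0, 0, 0, 0, 1200, 2500, 2_000_000]})

# Summary statistics for the column (count, mean, std, min, quartiles, max)
stats = df['Existing Zoning Sqft'].describe()
print(stats)

# Compare the extremes with the median to see how skewed the column is
print(stats['min'], stats['50%'], stats['max'])
```

When the median sits at the minimum while the max is several orders of magnitude larger, a linear-scale histogram will squash almost everything into the first bin, which is why a log scale helps.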

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)

# Display the histogram
plt.show()

As you saw here, you still need to look at the summary statistics to understand your data better. You would expect a large number of counts on the left side of the plot because the 25th, 50th, and 75th percentiles all have a value of 0. The plot shows that there are barely any counts near the max value, which points to an outlier.
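The same reading can be checked numerically. The sketch below bins a made-up column with numpy.histogram and prints the count per bin; the values and the choice of 10 bins are assumptions for illustration, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values: mostly zeros and small numbers, plus one extreme value
values = pd.Series([0, 0, 0, 0, 1200, 2500, 2_000_000], name='Existing Zoning Sqft')

# Split the full range into 10 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=10)

# Print each bin's range and how many values fall in it
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f'{left:>12,.0f} to {right:>12,.0f}: {n}')
```

Nearly every value lands in the first bin and a single value sits alone in the last one, which is the pattern the histogram shows visually.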

Visualizing multiple variables with boxplots

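Histograms are a good way to visualize a single variable. To visualize multiple variables, boxplots are useful, especially when one of the variables is categorical.

Use a boxplot to compare the values of 'initial_cost' (a numeric variable) across the different values of 'Borough' (a categorical variable). The pandas .boxplot() method is a quick way to do this; you must specify the column and by parameters. Here, we visualize how 'initial_cost' varies across the categories of 'Borough'.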

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt

# Create the boxplot
df.boxplot(column='initial_cost',by='Borough', rot=90)

# Display the plot
plt.show()

You can see the 2 extreme outliers are in the borough of Manhattan. An initial guess could be that since land in Manhattan is extremely expensive, these outliers may be valid data points. Again, further investigation is needed to determine whether or not you can drop or keep those points in your data.
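One possible first step for that further investigation is to pull out the extreme rows and look at them directly. The sketch below uses a made-up DataFrame with only the two columns discussed here; the rows and values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the real data has many more columns and records
df = pd.DataFrame({
    'Borough': ['MANHATTAN', 'MANHATTAN', 'BROOKLYN', 'QUEENS', 'BRONX', 'MANHATTAN'],
    'initial_cost': [9_500_000.0, 8_000_000.0, 75_000.0, 40_000.0, 30_000.0, 120_000.0],
})

# The two rows with the largest initial_cost, with their borough
top2 = df.nlargest(2, 'initial_cost')
print(top2)

# Per-borough maximum, to see which group the extremes come from
borough_max = df.groupby('Borough')['initial_cost'].max()
print(borough_max)
```

Inspecting the full rows behind the extreme values, rather than only the plotted points, makes it easier to judge whether they are data-entry errors or plausible records.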

Visualizing multiple variables with scatter plots

When comparing two numeric columns, a scatter plot is the better choice.

# Create and display the second scatter plot
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

It seems like there is a strong correlation between 'initial_cost' and 'total_est_fee'. In addition, take note of the large number of points that have an 'initial_cost' of 0.
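Both observations can be backed with numbers. The sketch below computes the Pearson correlation with Series.corr() and counts the zero-cost rows; df_subset here is a small made-up frame, not the subset used for the plot.

```python
import pandas as pd

# Hypothetical stand-in for df_subset with the two numeric columns
df_subset = pd.DataFrame({
    'initial_cost': [0.0, 0.0, 0.0, 10_000.0, 50_000.0, 200_000.0],
    'total_est_fee': [100.0, 130.0, 100.0, 250.0, 700.0, 2_300.0],
})

# Pearson correlation between the two columns (the default for Series.corr)
corr = df_subset['initial_cost'].corr(df_subset['total_est_fee'])

# Number of rows whose initial_cost is exactly 0
n_zero = (df_subset['initial_cost'] == 0).sum()

print(corr, n_zero)
```

A correlation coefficient close to 1 supports the visual impression of a strong linear relationship, and the zero count shows how many points pile up at the left edge of the plot.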

    Article link: https://www.haomeiwen.com/subject/jrujaftx.html