We'll start with an overview of how machine learning models work and how they are used. This may feel basic if you've done statistical modeling or machine learning before. Don't worry, we will progress to building powerful models soon.
This micro-course will have you build models as you go through the following scenario:
Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth.
You ask your cousin how he's predicted real estate values in the past, and he says it is just intuition. But more questioning reveals that he's identified price patterns from houses he has seen in the past, and he uses those patterns to make predictions for new houses he is considering.
Machine learning works the same way. We'll start with a model called the Decision Tree. There are fancier models that give more accurate predictions. But decision trees are easy to understand, and they are the basic building block for some of the best models in data science.
For simplicity, we'll start with the simplest possible decision tree.
desicion-tree.png
It divides houses into only two categories. The predicted price for any house under consideration is the historical average price of houses in the same category.
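The two-category rule above can be sketched in a few lines of Python. This is only an illustration: the split on number of bedrooms, the threshold of 2, the prices, and the function names are all assumptions made up for the example, not values from the course.

```python
from statistics import mean

# Hypothetical training data: (bedrooms, sale price). All numbers are made up.
training_houses = [(1, 150_000), (2, 200_000), (3, 350_000), (4, 450_000)]


def group_of(bedrooms, threshold=2):
    """Return which of the two categories a house falls into."""
    return "more" if bedrooms > threshold else "fewer"


def average_price_by_group(houses, threshold=2):
    """Compute the historical average price within each category."""
    prices = {"fewer": [], "more": []}
    for bedrooms, price in houses:
        prices[group_of(bedrooms, threshold)].append(price)
    return {group: mean(values) for group, values in prices.items()}


averages = average_price_by_group(training_houses)


def predict_price(bedrooms):
    """Predict a price as the average of the category the house belongs to."""
    return averages[group_of(bedrooms)]
```

With this toy data, every house with more than two bedrooms gets the same prediction, which is exactly the limitation of a tree with a single split.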
We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.
The details of how the model is fit (e.g. how to split up the data) are complex enough that we will save them for later. After the model has been fit, you can apply it to new data to predict prices of additional homes.
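To make the fit-then-predict workflow concrete, here is a minimal sketch of fitting a one-split tree and applying it to a new house. The splitting rule used here (try each midpoint between feature values and keep the one with the lowest total squared error) is an assumption chosen for illustration, not necessarily how the course's models are fit. The data and the names `fit_stump` and `predict` are hypothetical.

```python
from statistics import mean


def fit_stump(rows):
    """Fit a one-split tree on (feature, price) rows.

    Assumed criterion: pick the threshold whose two groups have the
    lowest total squared error around their own average price.
    """
    values = sorted({x for x, _ in rows})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((y - left_mean) ** 2 for y in left)
        error += sum((y - right_mean) ** 2 for y in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, left_mean, right_mean = best
    return {"threshold": threshold, "left": left_mean, "right": right_mean}


def predict(model, x):
    """Apply the fitted model to a new house."""
    return model["left"] if x <= model["threshold"] else model["right"]


# Hypothetical training data: (bedrooms, sale price). All numbers are made up.
training_data = [(1, 150_000), (2, 200_000), (3, 350_000), (4, 450_000)]
model = fit_stump(training_data)
new_house_price = predict(model, 5)  # a house that was not in the training data
```

The important part is the separation of steps: `fit_stump` only ever sees the training data, and `predict` only ever uses what fitting stored in `model`.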
Improving the Decision Tree
Which of the following two decision trees is more likely to result from fitting the real estate training data?
decision-tree01.png
The decision tree on the left (Decision Tree 1) probably makes more sense, because it captures the reality that houses with more bedrooms tend to sell at higher prices than houses with fewer bedrooms. The biggest shortcoming of this model is that it doesn't capture most factors affecting home price, like number of bathrooms, lot size, location, etc.
You can capture more factors using a tree that has more "splits." These are called "deeper" trees. A decision tree that also considers the total size of each house's lot might look like this:
decision-tree02.png
You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.
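Tracing a house down to a leaf can be sketched as a short loop over a nested dictionary. The tree below is hypothetical: the features, thresholds, leaf prices, and key names (`below`, `above`, `price`) are made up for illustration and are not read from the figure.

```python
# Hypothetical two-level tree: split on bedrooms, then on lot size.
# Thresholds and leaf prices are made-up illustrative values.
tree = {
    "feature": "bedrooms",
    "threshold": 2,
    "below": {
        "feature": "lot_size",
        "threshold": 8_500,
        "below": {"price": 146_000},
        "above": {"price": 188_000},
    },
    "above": {
        "feature": "lot_size",
        "threshold": 11_500,
        "below": {"price": 170_000},
        "above": {"price": 233_000},
    },
}


def predict(node, house):
    """Follow the path matching the house's characteristics until a leaf."""
    while "price" not in node:  # a node holding a price is a leaf
        goes_above = house[node["feature"]] > node["threshold"]
        node = node["above"] if goes_above else node["below"]
    return node["price"]


house = {"bedrooms": 3, "lot_size": 12_000}
predicted = predict(tree, house)
```

Each pass through the loop is one split; a deeper tree simply means more passes before a leaf is reached.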
The splits and values at the leaves will be determined by the data, so it's time for you to check out the data you will be working with.
Continue
Let's get more specific. It's time to Examine Your Data.