Common types of machine learning are introduced along four dimensions: Output Space, Data Label, Protocol, and Input Space. See the detailed course slides.
Output Space
Binary Classification
- credit approve/disapprove
- email spam/non-spam
- patient sick/not sick
- ad profitable/not profitable
Binary classification is a core problem with many available tools, and it serves as a building block for other tools.
Multiclass Classification
- coin recognition
- written digits ⇒ 0, 1, ..., 9
- pictures ⇒ apple, orange, strawberry
- emails ⇒ spam, primary, social, promotion, update
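Regression
In a regression problem the output space is $\mathcal{Y} = \mathbb{R}$, or $\mathcal{Y} = [\text{lower}, \text{upper}] \subset \mathbb{R}$, which corresponds to bounded regression. Common examples: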
- patient features ⇒ how many days before recovery
- company data ⇒ stock price
- climate data ⇒ temperature
Regression is also a core problem, with many ‘statistical’ tools that serve as building blocks for other tools.
Structured Learning
- sentence ⇒ structure (class of each word), i.e. sequence labeling
- protein data ⇒ protein folding
- speech data ⇒ speech parse tree
Structured learning can be viewed as a huge multiclass classification problem (structure = hyperclass) without an ‘explicit’ class definition.
Data Label
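Learning types can be divided by whether the data labels $y_n$ exist, how many there are, and what form they take:
- supervised: all $y_n$
- unsupervised: no $y_n$
- semi-supervised: some $y_n$
- reinforcement: implicit $y_n$, given through the goodness of a chosen output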
Supervised Learning
Supervised learning: every $\mathbf{x}_n$ comes with its corresponding $y_n$.
Unsupervised Learning
Unsupervised learning: diverse, with possibly very different performance goals.
- clustering
- unsupervised multiclass classification
- e.g. articles ⇒ topics
- density estimation
- unsupervised bounded regression
- traffic reports with location ⇒ dangerous areas
- outlier detection
- extreme ‘unsupervised binary classification’
- e.g. Internet logs ⇒ intrusion alert
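As one illustration of the clustering item in the list above, here is a minimal sketch of one-dimensional k-means in plain Python. The function name `kmeans_1d`, the toy points, and the initial centers are all assumptions made for this example; they do not come from the course material, and k-means is only one of many clustering methods.

```python
def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points around the given initial centers (no labels used)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[k]
            for k, cluster in enumerate(clusters)
        ]
    assignments = [
        min(range(len(centers)), key=lambda k: abs(p - centers[k])) for p in points
    ]
    return centers, assignments


# Hypothetical toy data: two obvious groups, around 1 and around 5.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, assignments = kmeans_1d(points, centers=[0.0, 6.0])
```

No $y_n$ is ever supplied; the cluster index in `assignments` plays the role of an ‘unsupervised’ class.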
Semi-supervised Learning
Semi-supervised learning: leverage unlabeled data to avoid ‘expensive’ labeling.
- face images with a few labeled ⇒ face identifier (Facebook)
- medicine data with a few labeled ⇒ medicine effect predictor
See Semi-supervised learning for a detailed explanation.
Reinforcement Learning
Reinforcement: learn with ‘partial/implicit information’ (often sequentially).
- (customer, ad choice, ad click earning) ⇒ ad system
- (cards, strategy, winning amount) ⇒ black jack agent
Different Protocol
Different protocols correspond to different learning philosophies:
- batch: ‘duck feeding’
- online: passive sequential
- active: question asking (sequentially), i.e. query the $y_n$ of the chosen $\mathbf{x}_n$
In terms of the data each protocol learns from:
- batch: all known data
- online: sequential (passive) data
- active: strategically-observed data
Batch Learning
Batch supervised multiclass classification: learn from all known data.
- batch of (email, spam?) ⇒ spam filter
- batch of (patient, cancer) ⇒ cancer classifier
- batch of patient data ⇒ group of patients
Online Learning
Online: the hypothesis ‘improves’ through receiving data instances sequentially.
For example, an online spam filter sequentially:
- observes an email $\mathbf{x}_t$
- predicts its spam status with the current hypothesis $g_t(\mathbf{x}_t)$
- receives the ‘desired label’ $y_t$ from the user, and then updates $g_t$ with $(\mathbf{x}_t, y_t)$
PLA can be easily adapted to the online protocol.
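The observe/predict/update loop above can be sketched as a single pass of the perceptron update over a stream. This is a minimal sketch, not the course's reference implementation: the function name `online_pla`, the tie-breaking of `sign(0)` toward -1, and the toy stream are assumptions for illustration.

```python
def sign(value):
    """Return +1 for positive values and -1 otherwise (ties go to -1)."""
    return 1 if value > 0 else -1


def online_pla(stream, dim):
    """Process (x, y) pairs one at a time, with y in {-1, +1}."""
    w = [0.0] * (dim + 1)            # w[0] is the bias weight
    mistakes = 0
    for x, y in stream:
        x_aug = [1.0] + list(x)      # prepend the constant feature x_0 = 1
        # Predict with the current weights before seeing the label.
        prediction = sign(sum(wi * xi for wi, xi in zip(w, x_aug)))
        # Receive the desired label y; update only on a mistake.
        if prediction != y:
            w = [wi + y * xi for wi, xi in zip(w, x_aug)]
            mistakes += 1
    return w, mistakes


# Hypothetical stream of (feature vector, label); +1 stands for spam.
stream = [
    ((2.0, 1.0), 1),
    ((-1.0, -2.0), -1),
    ((1.5, 0.5), 1),
    ((-2.0, -1.0), -1),
]
weights, mistakes = online_pla(stream, dim=2)
```

Each example is seen once and then discarded, which is what distinguishes this from the batch setting where the algorithm may revisit all known data.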
Active Learning
Active: improve hypothesis with fewer labels (hopefully) by asking questions strategically
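A classic toy case shows how strategic queries can save labels: learning a one-dimensional threshold by binary search over a sorted unlabeled pool. The sketch below assumes that labels switch from -1 to +1 at most once along the pool; `active_threshold_search`, `pool`, and `oracle` are hypothetical names introduced for this example.

```python
def active_threshold_search(pool, oracle):
    """Return (index of the first +1 point, number of labels queried).

    Assumes pool is sorted and its labels switch from -1 to +1 at most once.
    """
    queries = 0
    lo, hi = 0, len(pool)              # the boundary index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        label = oracle(pool[mid])      # ask for the label of the chosen point
        queries += 1
        if label == 1:
            hi = mid                   # boundary is at mid or to its left
        else:
            lo = mid + 1               # boundary is strictly to the right
    return lo, queries


pool = list(range(100))                # hypothetical unlabeled pool


def oracle(x):
    """Hypothetical labeler: +1 at or above the hidden threshold 37."""
    return 1 if x >= 37 else -1


boundary, queries = active_threshold_search(pool, oracle)
```

Here the learner needs only a logarithmic number of labels, whereas a passive learner would in the worst case need labels for the whole pool to locate the same boundary.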
Different Input Space
Concrete Features
Concrete features: each dimension of $\mathcal{X} \subseteq \mathbb{R}^d$ represents a ‘sophisticated physical meaning’.
- (size, mass) for coin classification
- customer info for credit approval
- patient info for cancer diagnosis
- often including human intelligence on the task
These concrete features have clear meanings and are highly interpretable, and they are the ‘easy’ case for ML.
Raw Features
Raw features: often need human or machines to convert to concrete ones.
Examples include image pixels and speech signals.
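To make the raw-to-concrete conversion tangible, the sketch below turns a small grayscale image (a list of pixel rows) into two hand-designed features: average intensity and left-right asymmetry. Both feature choices and the function name `intensity_and_asymmetry` are assumptions picked for illustration; the text does not prescribe any particular conversion.

```python
def intensity_and_asymmetry(image):
    """Convert raw pixels into two concrete features.

    image: list of equal-length rows of pixel values.
    Returns (average intensity, average left-right asymmetry).
    """
    rows, cols = len(image), len(image[0])
    n_pixels = rows * cols
    intensity = sum(sum(row) for row in image) / n_pixels
    # Compare each pixel with its mirror image across the vertical axis.
    asymmetry = sum(
        abs(row[c] - row[cols - 1 - c]) for row in image for c in range(cols)
    ) / n_pixels
    return intensity, asymmetry


# Hypothetical 3x3 images: a centered vertical bar and a bar on the left edge.
centered = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
lopsided = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
features_centered = intensity_and_asymmetry(centered)
features_lopsided = intensity_and_asymmetry(lopsided)
```

The two images have the same intensity but different asymmetry, so a downstream learner sees a two-dimensional concrete input instead of nine raw pixels.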
Abstract Features
Abstract: again need feature conversion/extraction/construction.
- student ID in online tutoring system (KDDCup 2010)
- advertisement ID in online ad system
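An ID carries no physical meaning by itself, so some feature construction is needed before a numeric learner can use it. One simple option, shown here only as a sketch, is one-hot encoding; the helper `one_hot_encode` and the sample IDs are hypothetical and are not taken from the KDDCup 2010 data.

```python
def one_hot_encode(ids):
    """Map each distinct ID to a 0/1 indicator vector.

    Returns (vocabulary, vectors), where vocabulary maps ID -> column index.
    """
    vocabulary = {value: index for index, value in enumerate(sorted(set(ids)))}
    vectors = []
    for value in ids:
        vector = [0] * len(vocabulary)
        vector[vocabulary[value]] = 1   # switch on the column for this ID
        vectors.append(vector)
    return vocabulary, vectors


student_ids = ["s1126", "s5566", "s1126", "s6211"]  # hypothetical IDs
vocabulary, vectors = one_hot_encode(student_ids)
```

With many distinct IDs the vectors become very wide and sparse, which is one reason abstract features often call for further extraction or construction.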