@[toc]
Rule-Based Classifier
Classify records by using a collection of “if…then…” rules
Rule: (Condition) → y
where
- Condition is a conjunction of attributes
- y is the class label
- LHS: rule antecedent or condition
- RHS: rule consequent
Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
Rule Coverage and Accuracy
Coverage of a rule: fraction of all records that satisfy the antecedent of the rule
Accuracy of a rule: among the records that satisfy the antecedent, the fraction that also satisfy the consequent
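Both measures can be made concrete with a short sketch. The dictionary-based rule representation, the `covers()` helper, and the toy records below are illustrative assumptions, not something prescribed by the notes:

```python
# Sketch: coverage and accuracy of one rule on a toy dataset (illustrative data).

def covers(condition, record):
    """A rule covers a record if every attribute test in its condition is satisfied."""
    return all(record.get(attr) == val for attr, val in condition.items())

def coverage_and_accuracy(condition, rule_class, records):
    matched = [r for r in records if covers(condition, r)]
    coverage = len(matched) / len(records)                      # fraction satisfying the antecedent
    accuracy = (sum(r["class"] == rule_class for r in matched)  # fraction of those also matching the consequent
                / len(matched)) if matched else 0.0
    return coverage, accuracy

records = [
    {"Blood Type": "Warm", "Lay Eggs": "Yes", "class": "Birds"},
    {"Blood Type": "Warm", "Lay Eggs": "No",  "class": "Mammals"},
    {"Blood Type": "Cold", "Lay Eggs": "Yes", "class": "Reptiles"},
    {"Blood Type": "Warm", "Lay Eggs": "Yes", "class": "Birds"},
]
condition, label = {"Blood Type": "Warm", "Lay Eggs": "Yes"}, "Birds"
print(coverage_and_accuracy(condition, label, records))  # -> (0.5, 1.0)
```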
Characteristics of Rule-Based Classifier
Mutually exclusive rules
- Classifier contains mutually exclusive rules if the rules are independent of each other
- Every record is covered by at most one rule

Exhaustive rules
- Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
- Each record is covered by at least one rule
When rules are extracted directly from a decision tree (one rule per root-to-leaf path):
- Rules are mutually exclusive and exhaustive
- Rule set contains as much information as the tree
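Under the same assumed rule representation, both properties can be checked mechanically; this sketch simply counts how many rules fire for each record:

```python
# Sketch: test mutual exclusivity / exhaustiveness of a rule set over a dataset.

def covers(condition, record):
    return all(record.get(attr) == val for attr, val in condition.items())

def check_rule_set(rules, records):
    """rules: list of (condition dict, class label) pairs -- an assumed format."""
    fired = [sum(covers(cond, rec) for cond, _ in rules) for rec in records]
    mutually_exclusive = all(n <= 1 for n in fired)  # every record covered by at most one rule
    exhaustive = all(n >= 1 for n in fired)          # every record covered by at least one rule
    return mutually_exclusive, exhaustive
```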
Effect of Rule Simplification
Rules are no longer mutually exclusive
- A record may trigger more than one rule
- Ordered rule set – the highest-priority (most important) triggered rule decides the class
- Unordered rule set – use a voting scheme over the triggered rules (see the sketch after this list)
Rules are no longer exhaustive
- A record may not trigger any rule
- Use a default class
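For the unordered case, one plausible voting scheme (an assumption, not the only option) is a majority vote over the triggered rules, falling back to a default class when nothing fires:

```python
from collections import Counter

def covers(condition, record):
    return all(record.get(attr) == val for attr, val in condition.items())

def predict_unordered(rules, record, default_class):
    """Each triggered rule casts one vote for its class; ties broken by Counter order."""
    votes = Counter(cls for cond, cls in rules if covers(cond, record))
    if not votes:
        return default_class              # no rule fired -> default class
    return votes.most_common(1)[0][0]     # majority class among the triggered rules
```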
Ordered Rule Set
Rules are rank ordered according to their priority
- An ordered rule set is known as a decision list
When a test record is presented to the classifier
- It is assigned the class label of the highest-ranked rule it triggers
- If none of the rules fire, it is assigned to the default class (see the sketch below)
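A minimal decision-list sketch: rules are stored in priority order and the first rule that fires decides the class. The rule format and the example rule are illustrative assumptions:

```python
def covers(condition, record):
    return all(record.get(attr) == val for attr, val in condition.items())

def predict_ordered(decision_list, record, default_class):
    """Return the class of the highest-ranked (first) rule that covers the record."""
    for condition, cls in decision_list:   # assumed to be sorted by priority already
        if covers(condition, record):
            return cls
    return default_class                    # none of the rules fired

decision_list = [
    ({"Refund": "Yes"}, "Evade=No"),        # illustrative, echoing the earlier rule example
]
print(predict_ordered(decision_list, {"Refund": "No"}, "Evade=Yes"))  # -> 'Evade=Yes' (default)
```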
Rule Ordering Schemes
Rule-based ordering
- Individual rules are ranked based on their quality
Class-based ordering
- Rules that belong to the same class appear together
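A tiny sketch contrasting the two schemes, assuming each rule is stored with a precomputed quality score (e.g., its accuracy) and a class label; the data layout and the class-ranking criterion are assumptions:

```python
# Rule-based ordering: rank individual rules by their quality score.
def rule_based_order(rules):
    return sorted(rules, key=lambda r: r["quality"], reverse=True)

# Class-based ordering: rules of the same class appear together;
# the class ranking criterion (here: class name) is a placeholder assumption.
def class_based_order(rules):
    return sorted(rules, key=lambda r: (r["class"], -r["quality"]))
```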
Building Classification Rules
Direct Method
- Extract rules directly from data
- e.g., RIPPER, CN2, Holte's 1R
Indirect Method
- Extract rules from other classification models (e.g., decision trees, neural networks)
- e.g., C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat steps (2) and (3) until the stopping criterion is met (a skeleton of this loop is sketched below)
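A skeleton of this loop under the same assumed rule format; `learn_one_rule` is left as a parameter because the notes describe rule growing separately below, and the minimum-coverage stopping test is an assumption:

```python
def covers(condition, record):
    return all(record.get(attr) == val for attr, val in condition.items())

def sequential_covering(records, target_class, learn_one_rule, min_coverage=1):
    """Grow one rule at a time, remove the records it covers, and repeat."""
    rules, remaining = [], list(records)
    while any(r["class"] == target_class for r in remaining):
        condition, label = learn_one_rule(remaining, target_class)   # step 2: grow a rule
        covered = [r for r in remaining if covers(condition, r)]
        if len(covered) < min_coverage:                              # stopping criterion (assumed)
            break
        rules.append((condition, label))
        remaining = [r for r in remaining if r not in covered]       # step 3: remove covered records
    return rules
```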
Aspects of Sequential Covering
- Rule Growing
- Instance Elimination
- Rule Evaluation
- Stopping Criterion
- Rule Pruning
Rule Growing (Examples)
CN2 Algorithm:
- Start from an empty conjunct: {}
- Add conjuncts that minimize the entropy measure: {A}, {A,B}, … (the entropy computation is sketched after this list)
- Determine the rule consequent by taking the majority class of the instances covered by the rule
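A sketch of the entropy measure a CN2-style learner would minimize when comparing candidate conjuncts; it is computed over the class distribution of the records covered by the candidate rule:

```python
import math
from collections import Counter

def entropy_of_covered(covered_records):
    """Entropy of the class labels among records covered by a candidate rule."""
    if not covered_records:
        return 0.0
    counts = Counter(r["class"] for r in covered_records)
    total = len(covered_records)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```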
RIPPER Algorithm:
- Start from an empty rule: {} => class
- Add conjuncts that maximize FOIL's information gain measure:
  - R0: {} => class (initial rule)
  - R1: {A} => class (rule after adding conjunct)
  - $\text{Gain}(R_0, R_1) = t \times \left[ \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right]$
  - where
    - $t$: number of positive instances covered by both $R_0$ and $R_1$
    - $p_0$, $n_0$: number of positive / negative instances covered by $R_0$
    - $p_1$, $n_1$: number of positive / negative instances covered by $R_1$
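A direct translation of the gain formula above; because R1 is obtained from R0 by adding a conjunct, every positive instance covered by R1 is also covered by R0, so t equals p1:

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for specializing rule R0 into R1.

    p0, n0: positive / negative instances covered by R0
    p1, n1: positive / negative instances covered by R1
    """
    t = p1   # positives covered by both R0 and R1 (R1 specializes R0)
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Example: R0 covers 100 positives / 400 negatives, R1 covers 30 positives / 10 negatives.
print(foil_gain(100, 400, 30, 10))   # positive gain -> the added conjunct helps
```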
Instance Elimination
Why do we need to eliminate instances?
- Otherwise, the next rule is identical to the previous rule
Why do we remove positive instances?
- Ensure that the next rule is different
Why do we remove negative instances?
- Prevent underestimating accuracy of rule
Stopping Criterion and Rule Pruning
Stopping criterion
- Compute the gain
- If gain is not significant, discard the new rule
Rule Pruning
- Similar to post-pruning of decision trees
- Reduced Error Pruning:
- Remove one of the conjuncts in the rule
- Compare error rate on validation set before and after pruning
- If error improves, prune the conjunct
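The reduced error pruning loop above can be sketched for a single rule using the assumed condition-dictionary format; the treatment of a rule that covers nothing on the validation set is an arbitrary assumption:

```python
def covers(condition, record):
    return all(record.get(attr) == val for attr, val in condition.items())

def validation_error(condition, rule_class, validation):
    covered = [r for r in validation if covers(condition, r)]
    if not covered:
        return 1.0   # no coverage: treated as worst case (assumption)
    return sum(r["class"] != rule_class for r in covered) / len(covered)

def prune_rule(condition, rule_class, validation):
    """Drop a conjunct whenever its removal lowers validation error; repeat until no removal helps."""
    condition = dict(condition)
    improved = True
    while improved and len(condition) > 1:
        improved = False
        base = validation_error(condition, rule_class, validation)
        for attr in list(condition):
            trial = {a: v for a, v in condition.items() if a != attr}
            if validation_error(trial, rule_class, validation) < base:
                condition, improved = trial, True   # keep the pruned version
                break
    return condition
```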
Summary of Direct Method
- Grow a single rule
- Remove instances covered by the rule
- Prune the rule (if necessary)
- Add the rule to the current rule set
- Repeat
Indirect Method: C4.5rules
For each rule r: A → y extracted from the decision tree:
- Consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A
- Compare the pessimistic error rate for r against all r′ (see the sketch below)
- Prune if one of the r′ has a lower pessimistic error rate
- Repeat until we can no longer improve the generalization error
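The notes do not spell out how the pessimistic error rate is computed; a common choice, usually attributed to C4.5, is the upper bound of a binomial confidence interval on the observed error rate. The sketch below uses that bound, with the confidence parameter z as an assumption (z ≈ 0.67 roughly corresponds to C4.5's default 25% confidence factor):

```python
import math

def pessimistic_error(errors, n, z=0.674):
    """Upper confidence bound on the error rate of a rule covering n records with `errors` mistakes."""
    if n == 0:
        return 1.0                                  # uncovered rule: worst case (assumption)
    f = errors / n                                  # observed (optimistic) error rate
    numerator = f + z * z / (2 * n) + z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)

# Pruning keeps the simplified rule r' only if its pessimistic error is lower than that of r.
print(pessimistic_error(2, 20) < pessimistic_error(3, 20))   # -> True
```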
Instead of ordering the rules, order subsets of rules (class ordering)
- Each subset is a collection of rules with the same rule consequent (class)
- Compute description length of each subset
- Description length = L(error) + g × L(model) (similar to the description length used for decision trees)
- g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5)
Advantages of Rule-Based Classifiers
- As highly expressive as decision trees
- Easy to interpret
- Easy to generate
- Can classify new instances rapidly
- Performance comparable to decision trees