CH1 Data mining
Major data mining tasks
- Classification and regression
- classification predicts categorical attribute values
- regression predicts numerical attribute values
- Cluster analysis
Given a set of objects, each having a set of attributes, and a
similarity measure among them, find clusters (i.e., groups) such
that
- objects in one cluster are more similar to one another
- objects in separate clusters are less similar to one another
Unlike classification, clustering analyzes objects without
consulting a known class label
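A minimal sketch of the clustering idea above, assuming one-dimensional objects and absolute distance as the (dis)similarity measure. The notes do not name an algorithm; k-means is used here only as a familiar example, and the data, the starting centroids, and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around centroids; no class labels are consulted."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins the cluster of its nearest centroid.
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # hypothetical objects
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
print(centroids)
print(clusters)
```

Objects that end up in the same list are closer to one another than to objects in the other list, which is the property the definition asks for.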
- Association analysis
Given a transactional database, find the sets of objects that
frequently appear within the same transactions
Also called frequent pattern mining
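As an illustration of the definition above, the sketch below counts how often each item and each pair of items occurs together, and keeps those that reach a minimum count. It is a brute-force count rather than an optimized frequent pattern mining algorithm; the transactions, the threshold, and the function name `frequent_itemsets` are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional database: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        for size in range(1, max_size + 1):
            # Sorting gives every itemset one canonical tuple form.
            for itemset in combinations(sorted(transaction), size):
                counts[itemset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


result = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(result.items()):
    print(itemset, count)
```

Enumerating every subset grows quickly with transaction size, so this form only suits small examples.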
Various data repositories
- relational data
- data warehouses
- transactional data
- graph data
- sequence data
- time series
- spatial data
- text & multimedia data
CH2a Data preprocessing
Real-world data are often
- noisy
- inconsistent
- redundant
Data preprocessing tasks
- types of attributes
- Categorical
- nominal: provide enough information to distinguish one object from another
Example: zip codes, employee ID numbers, eye color, gender
- binary: assume only two values (e.g., yes/no, true/false, 0/1)
- ordinal: provide enough information to order objects
Example: grades, {good, better, best}
- Numeric (continuous)
- descriptive data summarization
gives the overall picture of the data
involves
- measuring the central tendency
- mean: sensitive to extreme values
- weighted mean
- trimmed mean: disregards the low and high extremes
- median: the middle value of an ordered set of observations; unlike the mean, it is not sensitive to extreme values
- mode: the value that occurs most frequently in the set
- midrange: average of the largest and smallest values in the data
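The measures of central tendency listed above can be compared on a small sample. This is a sketch under assumptions: the data are made up, with one deliberate extreme value (100), and `weighted_mean` and `trimmed_mean` are illustrative helper names, not part of any library.

```python
import statistics


def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after discarding the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [2, 4, 4, 5, 7, 9, 100]          # hypothetical sample with one extreme value

mean = statistics.mean(data)            # pulled upward by the value 100
median = statistics.median(data)        # middle value of the ordered data
mode = statistics.mode(data)            # most frequent value
midrange = (min(data) + max(data)) / 2
trimmed = trimmed_mean(data, k=1)       # drops 2 and 100 before averaging
weighted = weighted_mean([80, 90], weights=[1, 3])

print(mean, median, mode, midrange, trimmed, weighted)
```

On this sample the mean lands far above the median, while the trimmed mean stays close to it, which is the sensitivity to extreme values noted in the list.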
- measuring the dispersion
- range: difference between the largest and smallest values
- kth percentile: value xi with the property that k percent of
the data are smaller than xi (what percentile is the median?)
- quartiles: 25th percentile (denoted by Q1), 50th percentile,
and 75th percentile (denoted by Q3)
- interquartile range:
IQR = Q3 - Q1
- five number summary: consists of minimum, Q1, median, Q3,
maximum
- standard deviation σ: square root of the variance σ^2
- graphical display of descriptive summaries
- boxplots
- histograms
- scatter plots
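The dispersion measures above, which are also the quantities a boxplot draws, can be computed with Python's standard library. This is a sketch on made-up data. Two choices here are assumptions rather than something the notes specify: the "inclusive" quartile convention (tools differ on how quartiles are interpolated) and the population form of the variance.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # hypothetical sample

value_range = max(data) - min(data)

# quantiles(n=4) returns the three cut points Q1, Q2, Q3;
# Q2 is the 50th percentile, i.e., the median.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (min(data), q1, q2, q3, max(data))

# Population variance and its square root, the standard deviation.
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(value_range, iqr, five_number_summary, variance, std_dev)
```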
- Data cleaning
fill in missing values
e.g., Occupation = ""
smooth out noise, i.e., data containing errors or outliers
possible causes:
- faulty data collection instruments
- human or computer error at data entry
- errors in data transmission
outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
e.g., Salary = "-10"
correct inconsistencies in the data
e.g., Age = "42", Birthday = "03/07/2010"
e.g., discrepancy between duplicate records
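Two of the cleaning steps above, filling in missing values and flagging outliers, can be sketched as follows. The assumptions: missing values are represented as `None` and replaced by the median (the notes do not prescribe a fill strategy), outliers are values more than 1.5 x IQR below Q1 or above Q3, and quartiles use the "inclusive" convention. The salary figures and function names are hypothetical.

```python
import statistics


def iqr_fences(values):
    """Return the (low, high) limits beyond which a value counts as an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def clean(values):
    """Fill missing entries with the median and list the outliers."""
    present = [v for v in values if v is not None]
    fill_value = statistics.median(present)
    filled = [fill_value if v is None else v for v in values]
    low, high = iqr_fences(present)
    outliers = [v for v in filled if v < low or v > high]
    return filled, outliers


salaries = [40, 42, None, 45, 47, 50, 52, -10, 400]   # hypothetical, in thousands
filled, outliers = clean(salaries)
print(filled)
print(outliers)
```

Flagged values are only candidates for correction; whether to drop, cap, or re-collect them depends on the application.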
Given N tuples, are numerical attributes A and B correlated?
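One common way to answer this question is the Pearson correlation coefficient, sketched below. The notes do not name a specific measure, so treating Pearson's coefficient as the intended one is an assumption; the attribute values are made up.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric attributes."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sum of products of deviations from the means (covariance times N).
    co_deviation = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return co_deviation / (spread_a * spread_b)


A = [1, 2, 3, 4, 5]            # hypothetical attribute values for N = 5 tuples
B = [2, 4, 6, 8, 10]           # rises with A
C = [10, 8, 6, 4, 2]           # falls as A rises

r_ab = pearson(A, B)
r_ac = pearson(A, C)
print(r_ab, r_ac)
```

A value near +1 or -1 suggests that one attribute is largely predictable from the other; a value near 0 suggests no linear relationship.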

- Data integration
Data integration combines data from multiple sources into a coherent data store
- Entity identification problem
Do two objects from different data sources refer to the same entity?
Example: Is the record with customer_id = 234 (from one source) equivalent to the record with cust_num = 234 (from the other source)?
Metadata can help: for each attribute, look at the name, meaning, data type, range of permitted values, etc.
- Data value conflicts
For the same entity, attribute values from different sources may differ, e.g., weight measured in kilograms or pounds
- Data redundancy
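The integration issues above can be illustrated with two small sources. This sketch assumes that metadata has already established that the two key attributes identify the same entity and that one source records weight in pounds. The field names, records, and the rule that the first source wins on duplicates are all hypothetical.

```python
LB_TO_KG = 0.45359237   # kilograms per pound

# Hypothetical records from two sources with different attribute names and units.
source_a = [{"customer_id": 234, "weight_kg": 70.0}]
source_b = [
    {"cust_num": 234, "weight_lb": 154.0},
    {"cust_num": 301, "weight_lb": 200.0},
]


def integrate(source_a, source_b):
    """Merge both sources into one store keyed by customer id, with weight in kg."""
    merged = {
        r["customer_id"]: {"customer_id": r["customer_id"], "weight_kg": r["weight_kg"]}
        for r in source_a
    }
    for r in source_b:
        key = r["cust_num"]                               # entity identification
        weight_kg = round(r["weight_lb"] * LB_TO_KG, 1)   # resolve the unit conflict
        # Keep the existing record when the entity is already present (redundancy).
        merged.setdefault(key, {"customer_id": key, "weight_kg": weight_kg})
    return merged


merged = integrate(source_a, source_b)
print(merged)
```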
- Data transformation
(Goal: modify the data in order to improve data mining performance)
- Data reduction
attribute/feature construction
normalization: attribute values are scaled to fall within a smaller, specified range
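Min-max scaling is one way to carry out the normalization described above. The notes do not say which method is meant, so this choice is an assumption, and the income values and function name are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are equal; map them to the lower bound to avoid dividing by zero.
        return [new_min for _ in values]
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


incomes = [12000, 73600, 98000]          # hypothetical attribute values
normalized = min_max_normalize(incomes)
print(normalized)
```

The smallest value maps to the lower bound and the largest to the upper bound, so the relative order of the values is preserved.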