- motivation: to rank-order the documents matching a query by giving a score to each (query,document) pair
parametric and zone indexes
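- index and retrieve documents by metadata (e.g. author, title, date) as well as by content.
- parametric index vs zone index: a field takes its values from a fixed vocabulary (e.g. a set of dates), so it gets a parametric index; a zone holds free text, so its vocabulary is whatever terms appear in the text of that zone.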
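- weighted zone scoring: each zone gets a weight in [0, 1], the weights sum to 1, and the score of a (query, document) pair is the sum of the weights of the zones that match the query
- learning weights: fit the zone weights from training examples, i.e. (query, document) pairs labeled relevant or non-relevant
- the optimal weight g: with two zones weighted g and 1 - g, choose g to minimize the total squared error over the training examples
the weights then come from a machine learning algorithm rather than being set by hand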
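A minimal sketch of weighted zone scoring and of the closed-form g for the two-zone case. The zone names, the example weights, and the function names are illustrative assumptions. The counts `n10r`, `n10n`, `n01r`, `n01n` stand for training examples where only the first zone (10) or only the second zone (01) matches and the label is relevant (r) or non-relevant (n); setting the derivative of the total squared error to zero gives the ratio below.

```python
def weighted_zone_score(zone_matches, weights):
    """Sum the weights of the zones whose Boolean match is True.

    zone_matches: dict zone name -> bool (does the query match this zone?)
    weights:      dict zone name -> float, expected to sum to 1
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("zone weights must sum to 1")
    return sum(w for zone, w in weights.items() if zone_matches.get(zone, False))


def optimal_g(n10r, n10n, n01r, n01n):
    """Least-squares weight g for two zones scored as g*s1 + (1 - g)*s2.

    Only examples where exactly one zone matches influence g.
    """
    total = n10r + n10n + n01r + n01n
    if total == 0:
        raise ValueError("need at least one example where exactly one zone matches")
    return (n10r + n01n) / total


# Hypothetical three-zone document: the query matches title and body only.
weights = {"author": 0.2, "title": 0.3, "body": 0.5}
matches = {"author": False, "title": True, "body": True}
score = weighted_zone_score(matches, weights)

# Hypothetical training counts for a two-zone collection.
g = optimal_g(n10r=3, n10n=1, n01r=1, n01n=3)
```

Examples where both zones match or neither matches contribute the same error for every g, which is why they do not appear in the formula.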
term frequency and weighting
- intuition: scores relate to term frequency, but are all words equally important?
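- free text query: view each document as a set of term weights (bag of words model: word order is ignored, term counts are kept)
score = the sum, over the query terms, of each term's weight in the document
- inverse document frequency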
- tf-idf weighting
terms with lower document frequency get a higher weight: idf_t = log(N / df_t), and tf-idf_{t,d} = tf_{t,d} × idf_t
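A small sketch of tf-idf weighting over a toy collection. The four documents and the function name are made up for illustration, and base-10 logarithms are an assumption (any base only rescales the weights).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (per-document tf-idf weights, idf table) for a list of token lists."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    weights = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in docs
    ]
    return weights, idf


docs = [
    ["car", "car", "insurance"],
    ["car", "auto"],
    ["car", "best"],
    ["car", "insurance"],
]
weights, idf = tf_idf(docs)
```

Here "car" occurs in every document, so its idf (and every tf-idf weight for it) is zero, while the rarer "auto" gets the largest idf.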
the vector space for scoring
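- dot products: the similarity between two documents is the cosine of the angle between their vectors, i.e. the dot product of the length-normalized vectors
why not the magnitude of the vector difference? because of the effect of document length: two documents with very similar content can differ greatly in length, and so have a large vector difference.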
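To make the document-length point concrete, here is a sketch comparing cosine similarity with Euclidean distance on a document and a copy of it with every weight doubled. The sparse dict representation, the example terms, and the function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict term -> weight)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Magnitude of the vector difference between two sparse vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


doc = {"jealous": 1.0, "gossip": 2.0}
doc_doubled = {term: 2.0 * w for term, w in doc.items()}  # same content, twice as long

similarity = cosine(doc, doc_doubled)
distance = euclidean_distance(doc, doc_doubled)
```

The cosine stays at 1 (up to floating-point rounding) because the two vectors point in the same direction, while the Euclidean distance is clearly non-zero even though the relative term distribution is identical.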
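- queries as vectors: treat the query as a very short document and rank documents by their cosine similarity to it
computation is expensive: scoring a query naively means one high-dimensional dot product per document in the collection, followed by a sort to pick the top K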
computing vector scores
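A term-at-a-time sketch of cosine scoring over an inverted index, which only touches documents that contain at least one query term. The postings layout (term -> list of (doc_id, weight) pairs), the precomputed `doc_lengths` table, and the function name are assumptions for illustration, not a fixed API.

```python
import heapq
import math
from collections import defaultdict


def cosine_score(query_weights, postings, doc_lengths, k):
    """Return the k highest-scoring (doc_id, score) pairs.

    query_weights: dict term -> weight of the term in the query
    postings:      dict term -> list of (doc_id, weight of the term in the doc)
    doc_lengths:   dict doc_id -> Euclidean length of the document vector
    """
    scores = defaultdict(float)
    # Accumulate the dot product one query term at a time.
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] += w_td * w_tq
    # Normalize by document length; the query length is the same for every
    # document, so dropping it does not change the ranking.
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


postings = {
    "car": [(1, 1.0), (2, 2.0)],
    "insurance": [(1, 3.0)],
}
doc_lengths = {1: math.sqrt(10.0), 2: 2.0}
top = cosine_score({"car": 1.0, "insurance": 1.0}, postings, doc_lengths, k=2)
```

Using a heap for the final selection avoids sorting every accumulated score when only the top K are needed.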
Variant tf–idf functions
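Two commonly used variants, sketched under assumptions: sublinear tf scaling (1 + log tf) and maximum tf normalization (a + (1 - a) * tf / tf_max). The base-10 logarithm and the default smoothing value a = 0.4 are conventional choices assumed here, and the function names are illustrative.

```python
import math


def sublinear_tf(tf):
    """Dampen raw term frequency: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def max_tf_normalized(tf, tf_max, a=0.4):
    """Scale tf by the largest tf in the same document, smoothed by a."""
    return a + (1.0 - a) * tf / tf_max


damped = [sublinear_tf(tf) for tf in (0, 1, 10, 100)]
normalized = max_tf_normalized(tf=5, tf_max=10)
```

With sublinear scaling, a term occurring 100 times counts three times as much as a term occurring once, rather than 100 times as much.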
- Pivoted normalized document length
the relationship between document length and relevance
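A sketch of the pivoted normalization factor, assuming the usual linear form (1 - slope) * pivot + slope * norm. The slope of 0.75 and the pivot of 10 are arbitrary illustrative values; in practice both would be tuned on a collection.

```python
def pivoted_norm(norm, pivot, slope):
    """Rotate the normalization line around the pivot point.

    norm:  the document's original (e.g. cosine) normalization factor
    pivot: the norm value at which the factor is left unchanged
    slope: value below 1 that flattens the line around the pivot
    """
    return (1.0 - slope) * pivot + slope * norm


at_pivot = pivoted_norm(norm=10.0, pivot=10.0, slope=0.75)
long_doc = pivoted_norm(norm=20.0, pivot=10.0, slope=0.75)
short_doc = pivoted_norm(norm=5.0, pivot=10.0, slope=0.75)
```

With a slope below 1, documents longer than the pivot are divided by a smaller factor than before (so their scores rise) and shorter documents by a larger one, which counteracts the bias of plain cosine normalization toward short documents.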
linear model
machine learning techniques