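I had never worked on anything search-engine related before. Recently the literature and guideline search in Koudai (口袋) needed tuning, so I dove into Solr. To my surprise, there is very little material on Solr (let alone in Chinese), and the official site is quite dry, mostly a pile of examples. I happened to find the book Solr in Action (which uses Solr 4 for its examples), and after reading a few chapters I had learned a lot. So this biweekly sharing session is a collection of excerpts from Solr in Action that gave me an "aha" moment.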
Why do I need a search engine?
Search engines like Solr are optimized to handle data exhibiting four main characteristics:
- Text-centric
- Read-dominant
- Document-oriented
- Flexible schema
Text-centric
We think text-centric is more appropriate for describing the type of data Solr handles.
Of course, a search engine also supports non-text data such as dates and numbers, but its primary strength is handling text data based on natural language.
A search engine is mainly for searching large bodies of text.
Read-dominant
Think of read-dominant as meaning that documents are read far more often than they’re created or updated.
If you must update existing data in a search engine often, that could be an indication that a search engine might not be the best solution for your needs. Another NoSQL technology, like Cassandra, might be a better choice when you need fast random writes to existing data.
A search engine is read-dominant. If you need frequent updates, Solr is not a good choice.
Document-oriented
In a search engine, a document is a self-contained collection of fields, in which each field only holds data and doesn’t contain nested fields.
In general, you should store the minimal set of information for each document needed to satisfy search requirements.
The data structure of a search engine is document-oriented: each document contains a set of fields.
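To make the "no nested fields" point concrete, here is a minimal sketch that flattens a nested record into a single-level set of fields before indexing. The field names and the `flatten` helper are my own illustration, not something from the book or from Solr itself.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested structure: pull its entries up into the parent level.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical record with a nested "journal" object.
nested = {
    "id": "guide-1",
    "title": "Hypertension guideline",
    "journal": {"name": "Example Journal", "year": 2015},
}

# The flat version is what a search-engine document looks like:
# every field holds data only, never another set of fields.
doc = flatten(nested)
```

After flattening, `journal` is gone and its contents show up as `journal_name` and `journal_year`.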
Flexible schema
In a relational database, every row in a table has the same structure. In Solr, documents can have different fields.
Documents are not rigidly structured: different documents can be made up of completely different fields, provided each field is defined in managed-schema.
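A small sketch of what this means in practice: two documents with different field sets can both be accepted, as long as every field name is known to the schema. The `SCHEMA_FIELDS` set and the `undefined_fields` helper are hypothetical stand-ins for the schema check, not Solr code.

```python
# Hypothetical set of field names defined in the schema.
SCHEMA_FIELDS = {"id", "title", "abstract", "authors", "publisher", "year"}


def undefined_fields(doc):
    """Return the field names in `doc` that the schema does not define."""
    return sorted(set(doc) - SCHEMA_FIELDS)


# Two documents with different fields; both only use defined fields.
paper = {"id": "paper-1", "title": "A study", "abstract": "...", "authors": ["A", "B"]}
guideline = {"id": "guide-1", "title": "A guideline", "publisher": "Some society", "year": 2015}

# This one uses a field the schema has never heard of.
unknown = {"id": "paper-2", "impact_factor": 3.2}
```

Here `paper` and `guideline` pass the check even though they share only `id` and `title`, while `unknown` is flagged because of `impact_factor`.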
Don’t use a search engine to ...
- First, search engines are designed to return a small set of documents per query, usually 10 to 100.
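A search engine should only be used to return small result sets. If you request a very large result set in one go, the index lookup itself is still fairly fast, but rebuilding a large number of documents from the index is slow.
- Another use case in which you shouldn’t use a search engine is deep analytic tasks that require access to a large subset of the index.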
- Also, there’s no direct support in most search engines for document-level security, at least not in Solr.
Solr does not support document-level security checks.
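For the first point, the usual way to keep result sets small is to page through them with Solr's `start` and `rows` query parameters. Below is a minimal sketch that only builds the request URL; the host, port, and the core name `literature` are assumptions for illustration.

```python
from urllib.parse import urlencode

# Hypothetical Solr host and core name; adjust to your own deployment.
SOLR_SELECT = "http://localhost:8983/solr/literature/select"


def page_url(query, page, page_size=10):
    """Build a select URL that asks for one small page of results."""
    params = {
        "q": query,
        "start": page * page_size,  # offset of the first result to return
        "rows": page_size,          # number of documents to return
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


# Third page (page index 2) of 10 results each.
url = page_url("title:hypertension", page=2)
```

Each request then asks Solr to rebuild only `page_size` documents instead of the whole match set.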
What is Solr?
Information retrieval engine
Solr is built on Apache Lucene, a popular, Java-based, open source, information retrieval library.
In a nutshell, Solr uses Lucene to provide the core data structures for indexing documents and executing searches to find documents.
As you can see, Solr actually relies on Lucene for core operations such as building the index and executing searches.
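The core data structure behind this is the inverted index, which maps each term to the documents that contain it. The following is a toy sketch of the idea in plain Python; it is not Lucene's implementation, and the whitespace tokenizer and sample documents are my own simplifications.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Solr is built on Lucene",
    2: "Lucene is a Java library",
    3: "Cassandra handles fast writes",
}
index = build_index(docs)
```

A lookup such as `search(index, "java lucene")` only touches the posting sets for `java` and `lucene`, which is why finding matches is fast even when there are many documents.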
One key difference between a Lucene query and a database query is that in Lucene results are ranked by their relevance to a query, and database results can only be sorted by one or more of the table columns.
Lucene ranks search results with a fairly complex formula, influenced by /// factors, whereas a database can only do a simple sort on one or more columns.
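To show the difference between ranking and sorting, here is a toy relevance score based on term frequency and inverse document frequency. This is a deliberately simplified stand-in and not Lucene's actual scoring formula; the sample documents and the `score` and `ranked` helpers are made up for illustration.

```python
import math

# Hypothetical mini-corpus: document id -> text.
docs = {
    "a": "solr search solr index",
    "b": "database search",
    "c": "database rows and columns",
}


def score(query, doc_id):
    """Toy TF-IDF score: frequent-in-document, rare-in-corpus terms score higher."""
    tokens = docs[doc_id].split()
    total = 0.0
    for term in query.split():
        tf = tokens.count(term)
        df = sum(1 for text in docs.values() if term in text.split())
        if tf and df:
            total += tf * math.log(1 + len(docs) / df)
    return total


def ranked(query):
    """Return ids of matching documents, most relevant first."""
    scored = [(score(query, doc_id), doc_id) for doc_id in docs]
    scored.sort(key=lambda pair: -pair[0])
    return [doc_id for s, doc_id in scored if s > 0]


# Relevance order depends on the query ...
by_relevance = ranked("database search")
# ... while a database-style sort depends only on a column value.
by_column = sorted(docs)
```

With the query `database search`, document `b` comes first because it matches both terms, while the column sort always gives the same order regardless of what was asked.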
MapReduce is a programming model that distributes large-scale data-processing operations across a cluster of commodity servers by formulating an algorithm into two phases: map and reduce.
MapReduce was first proposed by Google, where it was used to index and search massive numbers of web pages. Similarly, Solr provides SolrCloud, which applies the MapReduce idea to searching large-scale data, greatly improving performance and service availability.
With Lucene, you need to write Java code to define fields and how to analyze those fields. Solr adds a simple, declarative way to define the structure of your index and how you want fields to be represented and analyzed: an XML-configuration document named schema.xml. Solr also provides copy and dynamic fields.
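OK, since Solr is built on Lucene, what is the difference between the two? Lucene is not very user-friendly: to use it directly you have to write tedious Java code to define fields, whereas Solr lets you configure fields in a simple XML file. Solr also provides copy and dynamic fields.
A copy field provides a combined field: content from several source fields is copied into one destination field, so a single field name can cover multiple fields.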
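Here is a rough emulation of both ideas in plain Python, just to show the behavior. In Solr these are declared in the schema, not written as code; the copy rules, the wildcard patterns, and both helper functions below are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical copy rules: (source field, destination field).
COPY_RULES = [("title", "text"), ("abstract", "text")]

# Hypothetical dynamic field patterns: any field name matching one is accepted.
DYNAMIC_PATTERNS = ["*_s", "*_i"]


def apply_copy_fields(doc):
    """Return a copy of `doc` with source values also added to the destination field."""
    out = dict(doc)
    for source, dest in COPY_RULES:
        if source in doc:
            existing = out.get(dest, [])
            if not isinstance(existing, list):
                existing = [existing]
            # The destination is modeled as a multi-valued list.
            out[dest] = existing + [doc[source]]
    return out


def matches_dynamic_field(name):
    """True if the field name matches one of the wildcard patterns."""
    return any(fnmatch(name, pattern) for pattern in DYNAMIC_PATTERNS)


doc = {"id": "paper-1", "title": "A study", "abstract": "We studied things."}
indexed = apply_copy_fields(doc)
```

Searching the single `text` field then covers both the title and the abstract, and a field like `journal_s` is accepted without being declared one by one.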