A Brief Look at Solr Configuration Files

Author: 阿懒土灵 | Published 2018-09-07 10:53


Following the previous post on installing Solr 7.4 on Linux, this post looks at two Solr configuration files: schema.xml and db-data-config.xml.

Let's start with schema.xml:

 <!-- If you remove this field, you must _also_ disable the update log in solrconfig.xml
      or Solr won't start. _version_ and update log are required for SolrCloud
   -->
   <field name="_version_" type="plong" indexed="true" stored="true"/>

   <!-- points to the root document of a block of nested documents. Required for nested
      document support, may be removed otherwise
   -->
   <field name="_root_" type="string" indexed="true" stored="false"/>

   <!-- Only remove the "id" field if you have a very good reason to. While not strictly
     required, it is highly recommended. A <uniqueKey> is present in almost all Solr 
     installations. See the <uniqueKey> declaration below where <uniqueKey> is set to "id".
   -->
   <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

<!-- Field to use to determine and enforce document uniqueness. 
      Unless this field is marked with required="false", it will be a required field
   -->
 <uniqueKey>id</uniqueKey>

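The field tag defines a field in a Solr core. Unless you have a specific reason to remove them, keep the three fields listed here. The id field is declared as the uniqueKey, so it uniquely identifies a Solr document, and documents are operated on through this id.

The type attribute specifies the field's type. Solr's schema already defines many field types, and you can also define your own. Common types include pint, pdate, string, and boolean.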

  <!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="pint" class="solr.IntPointField" docValues="true"/>
    <fieldType name="pfloat" class="solr.FloatPointField" docValues="true"/>
    <fieldType name="plong" class="solr.LongPointField" docValues="true"/>
    <fieldType name="pdouble" class="solr.DoublePointField" docValues="true"/>
    <!-- KD-tree versions of date fields -->
    <fieldType name="pdate" class="solr.DatePointField" docValues="true"/>

Less common or custom field types:

  <!-- A text field that only splits on whitespace for exact matching of words -->
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

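A field declared with type="text_ws" has its text split on whitespace into individual terms, and the document can then be found by searching for any of those terms.
This relies on the inverted index, which is the heart of Solr: each term is a key, the identifiers of the documents that contain it are the value, and a query looks up the key to find the documents.

If indexed is true, an index is built for the field, so it can be searched.

If stored is true, the field's original content is stored, so it can be returned in results.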
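The inverted index idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Lucene stores its index; the function names and sample documents are made up for the example, and the tokenizer simply splits on whitespace in the spirit of text_ws.

```python
from collections import defaultdict


def tokenize_ws(text):
    """Split on whitespace only (no lowercasing, no stemming)."""
    return text.split()


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize_ws(text):
            index[term].add(doc_id)
    return index


# Hypothetical sample documents: id -> field text
docs = {
    "1": "Solr schema basics",
    "2": "Solr data import",
}
index = build_inverted_index(docs)

print(sorted(index["Solr"]))    # ids of documents containing the term "Solr"
print(sorted(index["import"]))  # ids of documents containing the term "import"
```

Because the lookup is by exact term, a query for "solr" in lowercase finds nothing here, which is why analyzers usually add filters such as lowercasing.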

  <field name="title" type="string" indexed="false" stored="true" required="true" multiValued="false" />
   <field name="content" type="string" indexed="false" stored="true" required="true" multiValued="false"/>
<field name="text" type="text_hanlp" indexed="true" stored="false" required="true" multiValued="true"/>
   <copyField source="content" dest="text" />
   <copyField source="title" dest="text" />

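If multiValued is true, the field can hold more than one value per document. The text field in the example above receives the values of both content and title through copyField, so searching text effectively searches content and title.
The configuration above means the following: title and content are plain string fields that are stored but not indexed. The text field uses the custom text_hanlp type, so the values copied into it are segmented by HanLP; it combines title and content and is indexed for retrieval, but not stored.

The required attribute indicates whether the field must have a value.

Defining the custom text_hanlp type for Chinese word segmentation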
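The effect of the two copyField rules can be pictured with a small Python sketch. It only mimics the idea of copying source values into a multi-valued destination before indexing; apply_copy_fields and the sample document are hypothetical and are not part of Solr.

```python
def apply_copy_fields(doc, copy_fields):
    """Return a copy of doc with each source value appended to its destination field."""
    result = dict(doc)
    for source, dest in copy_fields:
        if source in doc:
            # Build a new list so the input document is never mutated.
            result[dest] = result.get(dest, []) + [doc[source]]
    return result


# Same order as the copyField declarations in the schema snippet
copy_fields = [("content", "text"), ("title", "text")]

doc = {"id": "1", "title": "Solr notes", "content": "schema.xml and db-data-config.xml"}
indexed_doc = apply_copy_fields(doc, copy_fields)

print(indexed_doc["text"])
```

The destination ends up with two values, one per source field, which is why text has to be declared with multiValued="true".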

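    <!-- text_hanlp field type: uses the HanLP tokenizer with index mode enabled. Stop words are filtered
         with Solr's built-in stop filter using "stopwords.txt" (empty by default).
         At query time, Solr's built-in synonym dictionary is also supported. -->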
    <fieldType name="text_hanlp" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- Uncomment to enable a synonym dictionary at index time -->
       <!-- <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

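The IK and mmseg4j segmenters commonly used with Solr are no longer maintained, so this post uses the HanLP segmenter, which is still being maintained.
For configuring HanLP segmentation, refer to the tutorial.
Put the two jars, hanlp-portable.jar and hanlp-lucene-plugin.jar, into ${tomcat}/webapps/solr/WEB-INF/lib.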

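Once this is configured, you can check the segmentation results in the Solr admin UI:

[Figure: Solr word segmentation]

Defining dynamic fields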

<dynamicField name="*_i"  type="pint"    indexed="true"  stored="true"/>

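dynamicField defines a dynamic field: any field whose name ends in _i is covered by this definition without being declared individually. schema.xml already ships with many dynamic fields that can be used as they are.

The db-data-config.xml configuration file
This file mainly configures the database connection and the mapping between database columns and Solr fields. It is used to build full and incremental indexes, and it is much simpler than schema.xml.
The main settings are shown below: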
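As a rough illustration of how a field name is resolved against dynamic field patterns, here is a simplified Python sketch. It checks explicitly declared fields first and then returns the first matching pattern; the pattern table is a hypothetical subset, and Solr's actual precedence rules when several patterns match are not modeled.

```python
from fnmatch import fnmatchcase

# Hypothetical subset of dynamic field patterns: (pattern, field type)
DYNAMIC_FIELDS = [("*_i", "pint"), ("*_s", "string")]


def resolve_type(field_name, explicit_fields, dynamic_fields=DYNAMIC_FIELDS):
    """Return the type for field_name, or None if nothing matches."""
    # Explicitly declared fields take priority over dynamic patterns.
    if field_name in explicit_fields:
        return explicit_fields[field_name]
    for pattern, field_type in dynamic_fields:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


explicit = {"id": "string"}
print(resolve_type("price_i", explicit))  # matched by the *_i pattern
print(resolve_type("id", explicit))       # declared explicitly
print(resolve_type("price", explicit))    # no declaration and no pattern
```

A document can therefore carry a field such as price_i without any schema change, while a name like price is rejected unless it is declared.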

<dataSource driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://127.0.0.1:3306/database?useUnicode=true&amp;characterEncoding=UTF-8"
            user="root"
            batchSize="-1"
            password="123456"/>

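dataSource defines the database connection. batchSize is set to -1 to avoid running out of memory while the import query runs: with the MySQL driver, this makes rows stream from the server instead of loading the whole result set into memory.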

<document>
        <entity dataSource="jdbcDataSource" name="core" pk="id"  
        query="select * from tableName" >
            <field column="id" name="id"></field>
            <field column="title" name="title"></field>
            <field column="content" name="content"></field>
            <field column="author" name="author"></field>
        </entity>
    </document>

This is a simple definition and is easy to read: column is the column returned by the database query, and name is the corresponding field defined in the schema.

<entity name="item" query="select * from item"
        deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
    <field column="NAME" name="name" />
</entity>

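deltaQuery is used to build the incremental index: it selects the primary keys of the rows modified since the last import.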
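To make the delta selection concrete, the sketch below runs the same kind of query against an in-memory SQLite table. It is only a stand-in for the MySQL table in the config: the table contents, the last_index_time value, and the changed_ids helper are all made up for the example, and the data import handler itself is not involved.

```python
import sqlite3

# In-memory stand-in for the "item" table referenced by the entity
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, last_modified TEXT)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?)",
    [
        (1, "old row", "2018-09-01 08:00:00"),
        (2, "changed row", "2018-09-07 10:00:00"),
    ],
)

# Hypothetical timestamp of the previous import
last_index_time = "2018-09-05 00:00:00"


def changed_ids(connection, since):
    """Return the ids of rows modified after the given timestamp."""
    rows = connection.execute(
        "SELECT id FROM item WHERE last_modified > ? ORDER BY id", (since,)
    )
    return [row[0] for row in rows]


delta_ids = changed_ids(conn, last_index_time)
print(delta_ids)  # only the row changed after the last import
```

Only the returned ids need to be re-read and re-indexed, which is what makes an incremental import cheaper than a full one.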

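Once the files are configured, restart Tomcat and open solr/index.html.

[Figure: Creating an index in Solr]

Select item 1, then at item 2 choose between a full import and an incremental import (the numbers refer to the screenshot). Checking clean clears the previous index, and checking commit commits the index once it has been built. Click execute to run the import.

The next post will cover integrating SolrJ into Spring Boot to work with Solr.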
