Using Solr + Hbase-solr (Hbase-indexer)


Author: 耗子在简书 | Published 2018-09-25 10:44, read 1,101 times

    Preface:
    This article grew out of a project that needed a secondary index on HBase. I followed several online tutorials, every one of which had pitfalls, so I decided to write up a reasonably complete guide. It deliberately skips material that experienced operators can fill in themselves; it is aimed at readers who are new to this particular stack but have some Linux and Hadoop background, not at complete beginners.

    Environment:
    OS: CentOS 6.7 x86_64
    JDK: jdk1.7.0_79
    hadoop-2.6.0+cdh5.4.1
    hbase-solr-1.5+cdh5.4.1 (hbase-indexer-1.5-cdh5.4.1)
    solr-4.10.3-cdh5.4.1
    zookeeper-3.4.5-cdh5.4.1
    hbase-1.0.0-cdh5.4.1

    Download page for the CDH packages used in this article:
    CDH 5.4.x Packaging and Tarball Information | 5.x | Cloudera Documentation

    I. Basic Environment Preparation

    1. A three-node Hadoop cluster, with server roles assigned as follows:
    Server role assignment (figure)

    First get the NameNode, DataNodes, ZooKeeper, JournalNodes, and ZKFC processes running. The specifics are standard Hadoop setup and not the focus of this article.

    2. Download the required CDH software:

    Download the tarballs from the page linked above. Note that the Hbase-solr tarball contains the entire project, but we only need its distribution artifact: extract the hbase-solr-1.5+cdh5.4.1 tarball and, under hbase-solr-1.5-cdh5.4.1/hbase-indexer-dist/target, locate hbase-indexer-1.5-cdh5.4.1.tar.gz, which we will use shortly.

    II. Deploying hbase-indexer

    Copy hbase-indexer-1.5-cdh5.4.1.tar.gz to node2 (a copy will be pushed to node3 later) and extract it:

    tar zxvf hbase-indexer-1.5-cdh5.4.1.tar.gz
    

    Edit the hbase-indexer configuration:

    vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-site.xml
    
    <?xml version="1.0"?>
    <configuration>
    <property>
      <name>hbaseindexer.zookeeper.connectstring</name>
      <!-- Adjust to match your actual ZooKeeper ensemble -->
      <value>node1:2181,node2:2181,node3:2181</value>
    </property>
    <property>
      <name>hbase.zookeeper.quorum</name>
      <!-- Adjust to match your actual ZooKeeper ensemble -->
      <value>node1,node2,node3</value>
    </property>
    </configuration>
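As a side note, the connect string is just a comma-separated list of host:port pairs. If you script your deployment across several environments, it can be generated instead of hand-edited. A minimal JDK-only sketch (the node names and port are the example values used throughout this article):

```java
public class ZkConnectString {
    // Join ZooKeeper host names into the "host:port,host:port,..." form
    // expected by hbaseindexer.zookeeper.connectstring.
    public static String build(int port, String... hosts) {
        StringBuilder sb = new StringBuilder();
        for (String h : hosts) {
            if (sb.length() > 0) sb.append(',');
            sb.append(h).append(':').append(port);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(2181, "node1", "node2", "node3"));
        // node1:2181,node2:2181,node3:2181
    }
}
```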
    

    Configure hbase-indexer-env.sh:

    vim hbase-indexer-1.5-cdh5.4.1/conf/hbase-indexer-env.sh
    

    Set JAVA_HOME:

    # Set environment variables here.
    
    # This script sets variables multiple times over the course of starting an hbase-indexer process,
    # so try to keep things idempotent unless you want to take an even deeper look
    # into the startup scripts (bin/hbase-indexer, etc.)
    
    # The java implementation to use.  Java 1.6 required.
    export JAVA_HOME=/usr/java/jdk1.7.0/
    # Adjust to your actual environment
    
    # Extra Java CLASSPATH elements.  Optional.
    # export HBASE_INDEXER_CLASSPATH=
    
    # The maximum amount of heap to use, in MB. Default is 1000.
    # export HBASE_INDEXER_HEAPSIZE=1000
    
    # Extra Java runtime options.
    # Below are what we set by default.  May only work with SUN JVM.
    # For more on why as well as other possible settings,
    # see http://wiki.apache.org/hadoop/PerformanceTuning
    export HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -XX:+UseConcMarkSweepGC"
    

    Use scp to copy the entire hbase-indexer-1.5-cdh5.4.1 directory to node3.

    III. Deploying HBase

    Extract the HBase tarball:

    tar zxvf hbase-1.0.0-cdh5.4.1.tar.gz
    

    Likewise, edit hbase-site.xml:

    vim hbase-1.0.0-cdh5.4.1/conf/hbase-site.xml
    

    Add the following inside the <configuration> tag:

       <property>
        <name>hbase.rootdir</name>
        <value>hdfs://node1:9000/hbase</value>
        <description>The directory shared by RegionServers</description>
      </property>
      <property>
        <name>hbase.master</name>
        <value>node1:60000</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
        <description>The mode the cluster will be in. Possible values are
          false: standalone and pseudo-distributed setups with managed Zookeeper
          true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
        </description>
      </property>
      <property>
        <name>hbase.replication</name>
        <value>true</value>
        <description>SEP is basically replication, so enable it</description>
      </property>
      <property>
        <name>replication.source.ratio</name>
        <value>1.0</value>
        <description>Source ratio of 100% makes sure that each SEP consumer is actually used (otherwise, some can sit idle, especially with small clusters)</description>
      </property>
      <property>
        <name>replication.source.nb.capacity</name>
        <value>1000</value>
        <description>Maximum number of hlog entries to replicate in one go. If this is large, and a consumer takes a while to process the events, the HBase rpc call will time out.</description>
      </property>
      <property>
        <name>replication.replicationsource.implementation</name>
        <value>com.ngdata.sep.impl.SepReplicationSource</value>
        <description>A custom replication source that fixes a few things and adds some functionality (doesn't interfere with normal replication usage).</description>
      </property>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>node1,node2,node3</value>
        <description>The directory shared by RegionServers</description>
      </property>
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <!-- This must point at the ZooKeeper ensemble's data directory; see dataDir in zoo.cfg -->
        <value>/home/HBasetest/zookeeperdata</value>
        <description>Property from ZooKeeper's config zoo.cfg.
          The directory where the snapshot is stored.
        </description>
      </property>
    

    Similarly, edit hbase-env.sh:

    vim hbase-1.0.0-cdh5.4.1/conf/hbase-env.sh
    

    Set JAVA_HOME and HBASE_HOME:

    # Set environment variables here.
    
    # This script sets variables multiple times over the course of starting an hbase process,
    # so try to keep things idempotent unless you want to take an even deeper look
    # into the startup scripts (bin/hbase, etc.)
    
    # The java implementation to use.  Java 1.7+ required.
    # export JAVA_HOME=/usr/java/jdk1.6.0/
    
    export JAVA_HOME=/opt/jdk1.7.0_79
    export HBASE_HOME=/home/HBasetest/hbase-1.0.0-cdh5.4.1
    # Adjust to your actual environment
    
    # Extra Java CLASSPATH elements.  Optional.
    # export HBASE_CLASSPATH=
    
    # The maximum amount of heap to use, in MB. Default is 1000.
    # export HBASE_HEAPSIZE=1000
    
    # Uncomment below if you intend to use off heap cache.
    # export HBASE_OFFHEAPSIZE=1000
    
    # For example, to allocate 8G of offheap, to 8G:
    # export HBASE_OFFHEAPSIZE=8G
    
    # Extra Java runtime options.
    # Below are what we set by default.  May only work with SUN JVM.
    # For more on why as well as other possible settings,
    # see http://wiki.apache.org/hadoop/PerformanceTuning
    export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
    

    Copy these four jars from the hbase-indexer-1.5-cdh5.4.1/lib directory into hbase-1.0.0-cdh5.4.1/lib/:

    hbase-sep-api-1.5-cdh5.4.1.jar
    hbase-sep-impl-1.5-hbase1.0-cdh5.4.1.jar
    hbase-sep-impl-common-1.5-cdh5.4.1.jar
    hbase-sep-tools-1.5-cdh5.4.1.jar
    

    Edit hbase-1.0.0-cdh5.4.1/conf/regionservers to contain:

    node2
    node3
    

    Then copy the hbase-1.0.0-cdh5.4.1 directory to node2 and node3.

    IV. Deploying Solr

    Simply extract the Solr tarball on node1.

    V. Running and Testing

    1. Start HBase

    On node1, run:

    ./hbase-1.0.0-cdh5.4.1/bin/start-hbase.sh 
    
    2. Start HBase-indexer

    On node2 and node3, run:

    ./hbase-indexer-1.5-cdh5.4.1/bin/hbase-indexer server 
    

    To keep it running in the background, use screen or nohup.

    3. Start Solr

    On node1, go into the example subdirectory under the Solr directory and run:

    java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkHost=node1:2181,node2:2181,node3:2181/solr -jar start.jar
    

    Again, use screen or nohup if you want it in the background.
    Solr's admin page is then available at http://node1:8983/solr/#/

    VI. Testing Data Indexing

    With the Hadoop cluster, HBase, HBase-Indexer, and Solr all running, first create a table in HBase.
    From the HBase install directory on any node, run:

    ./bin/hbase shell
    create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' } 
    

    On a node where HBase-Indexer is deployed, change into its directory; an indexer can be created from the sample configuration file under demo/:

    ./bin/hbase-indexer add-indexer -n myindexer -c ./demo/user_indexer.xml -cp solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1
    

    Create a field-definition file under hbase-indexer-1.5-cdh5.4.1/demo/ with the following content:

    <?xml version="1.0"?>
    <indexer table="indexdemo-user">
      <field name="firstname_s" value="info:firstname"/>
      <field name="lastname_s" value="info:lastname"/>
      <field name="age_i" value="info:age" type="int"/>
    </indexer>
    

    Save it as indexdemo-indexer.xml.
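The _s and _i suffixes in the field names above follow the dynamic-field convention of the stock Solr example schema: *_s is indexed as a string, *_i as an int. If you ever generate indexer definitions for many columns, that naming can be mechanized. A small illustrative sketch (the suffix table is an assumption based on the default Solr 4.x example schema; check your own schema.xml):

```java
import java.util.HashMap;
import java.util.Map;

public class SolrFieldName {
    // Dynamic-field suffixes as found in the stock Solr example schema;
    // verify against your own schema.xml before relying on these.
    private static final Map<String, String> SUFFIX = new HashMap<String, String>();
    static {
        SUFFIX.put("string", "_s");
        SUFFIX.put("int", "_i");
        SUFFIX.put("long", "_l");
        SUFFIX.put("boolean", "_b");
    }

    // e.g. of("firstname", "string") -> "firstname_s"
    public static String of(String column, String type) {
        String suffix = SUFFIX.get(type);
        if (suffix == null) {
            throw new IllegalArgumentException("no dynamic-field suffix for type: " + type);
        }
        return column + suffix;
    }

    public static void main(String[] args) {
        System.out.println(of("firstname", "string")); // firstname_s
        System.out.println(of("age", "int"));          // age_i
    }
}
```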

    Register the indexer instance.
    From the hbase-indexer-1.5-cdh5.4.1 directory, run:

    ./bin/hbase-indexer add-indexer -n myindexer -c demo/indexdemo-indexer.xml -cp \
    solr.zk=node1:2181,node2:2181,node3:2181/solr -cp solr.collection=collection1 -z node1,node2,node3
    

    Now prepare some test data. The project called for index tests on more than ten million records, so typing inserts by hand in the shell was unrealistic. HBase can also batch-execute a text file of shell commands, but that approach is still hopeless at the ten-million scale, so in the end I wrote a small Java program for fast bulk inserts.
    Create a new Java project in Eclipse and add everything under the HBase deployment's lib directory to the build path. The source code is as follows:

    package com.hbasetest.hbtest;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;

    public class DataInput {
        private static Configuration configuration;
        static {
            configuration = HBaseConfiguration.create();
            configuration.set("hbase.zookeeper.property.clientPort", "2181");
            configuration.set("hbase.zookeeper.quorum", "node1,node2,node3");
        }

        public static void main(String[] args) {
            try {
                HTable table = new HTable(configuration, "indexdemo-user");
                List<Put> putList = new ArrayList<Put>();
                for (int i = 0; i <= 14000000; i++) {
                    // Row key is the loop counter; each row gets two columns
                    // in the 'info' column family.
                    Put put = new Put(Integer.toString(i).getBytes());
                    put.add("info".getBytes(), "firstname".getBytes(),
                            ("Java.value.firstname" + i).getBytes());
                    put.add("info".getBytes(), "lastname".getBytes(),
                            ("Java.value.lastname" + i).getBytes());
                    putList.add(put);
                }
                // Send the whole batch in one call, then release the table.
                table.put(putList);
                table.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    

    This program queues every row in one large put list before a single bulk put. If the machine running it does not have enough memory, divide and conquer: use several smaller putLists and flush each one separately.
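To avoid holding all rows in memory at once, the inserts can be flushed in fixed-size batches. The partitioning itself needs nothing from HBase, so it can be sketched and tested with plain JDK classes (the batch size below is an arbitrary illustrative value):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPlanner {
    // Split [0, total) into consecutive half-open ranges of at most
    // batchSize rows; build and flush one putList per range so no more
    // than batchSize Puts are ever held in memory.
    public static List<int[]> ranges(int total, int batchSize) {
        List<int[]> out = new ArrayList<int[]>();
        for (int start = 0; start < total; start += batchSize) {
            out.add(new int[] { start, Math.min(start + batchSize, total) });
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] r : ranges(25, 10)) {
            System.out.println(r[0] + ".." + r[1]); // 0..10, 10..20, 20..25
        }
    }
}
```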

    The remaining retrieval tests are straightforward, so I will not walk through them here.
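For a quick spot check, you can query Solr over HTTP once the indexer has caught up. As a sketch, the select URL for one of the demo fields can be assembled like this (host, port, and collection follow this article's setup; the q syntax is standard Solr):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SolrQueryUrl {
    // Build a Solr select URL for a single field:value query.
    public static String selectUrl(String host, int port, String collection,
                                   String field, String value) {
        try {
            String q = URLEncoder.encode(field + ":" + value, "UTF-8");
            return "http://" + host + ":" + port + "/solr/" + collection
                    + "/select?q=" + q + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("node1", 8983, "collection1",
                "firstname_s", "Java.value.firstname100"));
    }
}
```

Opening the printed URL in a browser (or with curl) should return the matching document once indexing has completed.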
