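Introduction
This article summarizes how I set up a Nutch platform, and I hope it gives some guidance to anyone who is new to Nutch.
Environment
· Operating system: Ubuntu 18.04 LTS
· Software versions: Nutch 2.2.1, Solr 4.10.3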
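Platform architecture
As the title suggests, the platform can be divided into three parts: Nutch, the database, and the front end.
Nutch: the part to the left of Index in the architecture figure. It crawls and parses web pages and calls the database to store them.
Database: stores the crawled page data. Nutch 1.x is built on the Hadoop architecture and uses HDFS as its underlying storage, whereas 2.x uses Apache Gora, which lets Nutch work with HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, AvroStore, and other stores.
Front end: Tomcat is a free, open-source web application server; Solr is a search application.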
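Platform deployment
We start from a brand-new Ubuntu 18.04 server. First create a folder to hold the software the platform needs; you can put it wherever suits you. If you are not fully sure you can keep the paths in the rest of this tutorial consistent, follow the tutorial to the letter.
lemon@ubuntu:~$ mkdir ~/download/ #create a folder for downloaded files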
1. Install the JDK
step1. Download the Oracle JDK
step2. Extract
step3. Add it to the environment variables
The commands are as follows:
lemon@ubuntu:~$ cd ~/download/
lemon@ubuntu:~/download$ wget http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz
lemon@ubuntu:~/download$ tar vxf jdk-8u191-linux-x64.tar.gz
lemon@ubuntu:~/download$ ls #list the files in the current directory
jdk1.8.0_191 jdk-8u191-linux-x64.tar.gz
lemon@ubuntu:~/download$ sudo mv jdk1.8.0_191/ /usr/local/jdk1.8/ #move the jdk1.8.0_191 folder to /usr/local/ and rename it jdk1.8
lemon@ubuntu:~/download$ sudo vim /etc/profile #edit the environment variables
Append the following to the end of the file:
export JAVA_HOME=/usr/local/jdk1.8
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=.:${JAVA_HOME}/bin:$PATH
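Save the file, then reload the environment variables so they take effect:
lemon@ubuntu:~/download$ source /etc/profile #reload the environment variables
lemon@ubuntu:~$ java -version #run java -version; output like the following means the JDK is installed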
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
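If java -version does not print the expected output, a common cause is a JAVA_HOME that points at the wrong folder. The snippet below is a minimal sketch of such a check; the function name check_java_home is hypothetical, and it only inspects the file system.

```shell
#!/bin/sh
# check_java_home DIR: succeed if DIR/bin/java exists and is executable.
# The function name is hypothetical; nothing here is part of the JDK.
check_java_home() {
    if [ -z "$1" ]; then
        echo "JAVA_HOME is empty" >&2
        return 1
    fi
    if [ ! -x "$1/bin/java" ]; then
        echo "no executable found at $1/bin/java" >&2
        return 1
    fi
    echo "JAVA_HOME looks valid: $1"
}
```

After source /etc/profile, it could be called as check_java_home "$JAVA_HOME".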
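2. Install MySQL
step1. Install and configure MySQL
step2. Create the database and table
Because I chose to install the LAMP server set when installing Ubuntu, MySQL was already installed and only needed to be configured.
To test whether it is installed:
lemon@ubuntu:~$ mysql #run mysql; a message like the following means MySQL is already installed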
ERROR 1045 (28000): Access denied for user 'lemon'@'localhost' (using password: NO)
If it is not installed:
lemon@ubuntu:~$ sudo apt-get install mysql-server
lemon@ubuntu:~$ sudo apt install mysql-client
lemon@ubuntu:~$ sudo apt install libmysqlclient-dev
If it is already installed:
lemon@ubuntu:~$ sudo mysql_secure_installation
Either way you end up in the MySQL setup dialog. The settings I chose are as follows:
#1
VALIDATE PASSWORD PLUGIN can be used to test passwords...
Press y|Y for Yes, any other key for No: N (do not enable the weak-password check)
#2
Please set the password for root here...
New password: (set the root password)
Re-enter new password: (enter it again)
#3
By default, a MySQL installation has an anonymous user,
allowing anyone to log into MySQL without having to have
a user account created for them...
Remove anonymous users? (Press y|Y for Yes, any other key for No) : Y (remove the anonymous user)
#4
Normally, root should only be allowed to connect from
'localhost'. This ensures that someone cannot guess at
the root password from the network...
Disallow root login remotely? (Press y|Y for Yes, any other key for No) : Y (do not allow remote root login)
#5
By default, MySQL comes with a database named 'test' that
anyone can access...
Remove test database and access to it? (Press y|Y for Yes, any other key for No) : N
#6
Reloading the privilege tables will ensure that all changes
made so far will take effect immediately.
Reload privilege tables now? (Press y|Y for Yes, any other key for No) : Y (reload the privilege tables immediately)
All done!
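Next, log in to MySQL:
#After installation, recent MySQL versions do not let root log in with a password; log in through sudo and change the authentication method
lemon@ubuntu:~$ sudo mysql -uroot -p
Enter password: (leave empty)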
mysql>
mysql> UPDATE mysql.user SET authentication_string=PASSWORD('LEMON'), plugin='mysql_native_password' WHERE user='root';
mysql> FLUSH PRIVILEGES;
mysql> exit
lemon@ubuntu:~$ sudo service mysql restart
lemon@ubuntu:~$ mysql -u root -p
Enter password: (the password set in the previous step, i.e. the string inside PASSWORD(...))
mysql> CREATE DATABASE nutch;
mysql> USE nutch
mysql> CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
`batchId` varchar(767) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;
mysql> exit
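The steps above use the root account throughout. As an optional alternative, Nutch could connect through a dedicated account that can only touch the nutch database. The sketch below only prints the SQL for this; the function name nutch_user_sql and the user name and password passed to it are placeholders of my own, not part of the original setup.

```shell
#!/bin/sh
# nutch_user_sql USER PASSWORD: print SQL that creates a MySQL account
# restricted to the nutch database. USER and PASSWORD are illustrative
# placeholders; the function name is hypothetical. Values containing a
# single quote are not escaped, so keep them simple.
nutch_user_sql() {
    cat <<EOF
CREATE USER '$1'@'localhost' IDENTIFIED BY '$2';
GRANT ALL PRIVILEGES ON nutch.* TO '$1'@'localhost';
FLUSH PRIVILEGES;
EOF
}
```

The output can be reviewed first and then piped into the client, for example nutch_user_sql nutch 'ChangeMe' | mysql -u root -p. The same user name and password would then go into gora.properties in the Nutch section.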
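* By default, recent MySQL versions do not allow remote logins. To enable remote access, a few changes are needed:
lemon@ubuntu:~$ sudo vim /etc/mysql/mysql.conf.d/mysqld.cnf
#Comment out the line bind-address = 127.0.0.1, then restart the MySQL service
lemon@ubuntu:~$ sudo service mysql restart
After that, the database can be reached from other computers with tools such as Navicat.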
[Figure: Navicat]
3. Install Nutch
step1. Download Nutch
step2. Unzip
step3. Edit ivy.xml, gora.properties, and nutch-site.xml
step4. Build Nutch
step5. Configure the crawl
The commands are as follows:
lemon@ubuntu:~$ cd ~/download/
lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.zip
lemon@ubuntu:~/download$ unzip apache-nutch-2.2.1-src.zip
#If unzip is reported as not installed, install it first: sudo apt install unzip
lemon@ubuntu:~/download$ mkdir ~/software
lemon@ubuntu:~/download$ mv apache-nutch-2.2.1 ~/software/
Edit ivy.xml: (selects the database used by the storage layer)
lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/ivy/ivy.xml
Uncomment the following two lines:
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
Change
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
to
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
Edit gora.properties: (the database connection parameters)
lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/conf/gora.properties
Comment out the default database connection settings and add the following:
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxx(MySQL user name)
gora.sqlstore.jdbc.password=xxxx(MySQL password)
If the database is not on this machine, replace localhost with the address of the database host.
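Since a leftover default entry or a typo in gora.properties is easy to miss, it can help to print the values Nutch will actually read. The snippet below is a small sketch for that; the function name gora_prop is hypothetical, and it is a plain text filter rather than a full Java properties parser.

```shell
#!/bin/sh
# gora_prop FILE KEY: print the value of KEY from a Java-style properties
# file. Lines starting with '#' never match because the pattern is anchored
# at the key. Only the "KEY=" prefix is stripped, so values that themselves
# contain '=' (such as JDBC URLs) stay intact. If KEY appears several
# times, the last occurrence is printed. The function name is hypothetical.
gora_prop() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}
```

For example, gora_prop apache-nutch-2.2.1/conf/gora.properties gora.sqlstore.jdbc.url should print the MySQL URL configured above and nothing from the commented-out defaults.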
Edit nutch-site.xml: (Nutch configuration)
lemon@ubuntu:~/software$ vim apache-nutch-2.2.1/conf/nutch-site.xml
Add the following:
<configuration>
<property>
<name>http.agent.name</name>
<value>LemonSpider</value>
</property>
<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field. This allows selecting a non-English language as the default one to retrieve. It is a useful setting for search engines built for a certain national group.
</description>
</property>
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information is available</description>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>
<property>
<name>generate.batch.id</name>
<value>*</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-jsoup</value>
</property>
<property>
<name>http.robots.agents</name>
<value>LemonSpider,*</value>
</property>
</configuration>
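A missing property in nutch-site.xml usually only shows up as an error once the crawl starts. The sketch below checks beforehand that three of the properties from this section are present. Both function names are hypothetical, and the check is a plain text search, so it cannot tell an active property from one inside an XML comment.

```shell
#!/bin/sh
# has_property FILE NAME: succeed if FILE contains <name>NAME</name>.
# A plain-text check, not a real XML parse; the function name is hypothetical.
has_property() {
    grep -qF "<name>$2</name>" "$1"
}

# check_nutch_site FILE: report which of the listed properties are missing
# from FILE; succeed only if none are missing.
check_nutch_site() {
    missing=0
    for prop in http.agent.name storage.data.store.class plugin.includes; do
        if ! has_property "$1" "$prop"; then
            echo "missing property: $prop" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be run as check_nutch_site ~/software/apache-nutch-2.2.1/conf/nutch-site.xml before building.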
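Build Nutch
lemon@ubuntu:~$ cd ~/software/apache-nutch-2.2.1
lemon@ubuntu:~/software/apache-nutch-2.2.1$ ant
#If ant is reported as not installed, install it first: sudo apt install ant
#The build takes a long time; stay connected to the network while it runs
Crawl configuration
lemon@ubuntu:~$ cd ~/software/apache-nutch-2.2.1/runtime/local
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ mkdir -p urls
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ echo 'http://www.apache.org/' > urls/seed.txt #set the site to crawl
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ bin/nutch crawl urls -depth 3 -topN 5 #run the crawl
-depth is the crawl depth and -topN limits each round to the top N pages; see the official manual for the full list of options.
If an error is reported, check carefully that you followed every step above and that no characters were accidentally added or deleted while editing the files.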
Output of a successful run:
-finishing thread FetcherThread3, activeThreads=5
-finishing thread FetcherThread9, activeThreads=6
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread1, activeThreads=8
-finishing thread FetcherThread8, activeThreads=9
-finishing thread FetcherThread6, activeThreads=4
-finishing thread FetcherThread7, activeThreads=3
-finishing thread FetcherThread0, activeThreads=2
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=0
0/0 spinwaiting/active, 11 pages, 0 errors, 0.4 0 pages/s, 78 36 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
Parsing http://accumulo.apache.org/
Parsing http://activemq.apache.org/
Parsing http://airavata.apache.org/
Parsing http://allura.apache.org/
Parsing http://ambari.apache.org/
Parsing http://www.apache.org/
Parsing http://www.apache.org/foundation/sponsorship.html
Parsing http://www.apache.org/foundation/thanks.html
Parsing http://www.apache.org/licenses/
Parsing http://www.apache.org/licenses/LICENSE-2.0
Parsing http://www.apache.org/security/
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$
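The crawl command used above bundles several jobs into one call, which makes it hard to see where a failure happens. The sketch below prints (without running) the individual jobs that roughly correspond to it. This job list is an assumption based on the Nutch 2.x tutorial rather than something taken from this setup, and the function name crawl_plan is hypothetical, so check each command against the usage text printed by bin/nutch before relying on it.

```shell
#!/bin/sh
# crawl_plan SEED_DIR DEPTH TOPN: print the individual Nutch jobs that
# roughly correspond to "bin/nutch crawl SEED_DIR -depth DEPTH -topN TOPN".
# Nothing is executed. The function name is hypothetical and the job list
# is an assumption; verify it against the output of "bin/nutch".
crawl_plan() {
    echo "bin/nutch inject $1"
    round=1
    while [ "$round" -le "$2" ]; do
        echo "bin/nutch generate -topN $3"
        echo "bin/nutch fetch -all"
        echo "bin/nutch parse -all"
        echo "bin/nutch updatedb"
        round=$((round + 1))
    done
}
```

For the crawl in this tutorial, crawl_plan urls 3 5 prints one inject line followed by three rounds of generate, fetch, parse, and updatedb. Running the printed lines one at a time from runtime/local shows which stage fails.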
4. Install Tomcat
step1. Download Tomcat
step2. Extract
step3. Start
lemon@ubuntu:~$ cd download/
lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/tomcat/tomcat-8/v8.0.33/bin/apache-tomcat-8.0.33.tar.gz
lemon@ubuntu:~/download$ tar vxf apache-tomcat-8.0.33.tar.gz
lemon@ubuntu:~/download$ mv apache-tomcat-8.0.33 ~/software/
lemon@ubuntu:~/download$ cd ~/software/apache-tomcat-8.0.33/
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/startup.sh
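Now open localhost:8080 or 127.0.0.1:8080 in a browser on the server itself. Computers on the same LAN can reach it at the server's IP on port 8080; for example, this server's LAN IP is 114.212.167.106, so other machines on the LAN can open 114.212.167.106:8080.
If you see the following page, Tomcat is installed correctly:
[Figure: Tomcat page]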
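On a server without a desktop browser, the same check can be made from the command line. The following is a minimal sketch that assumes curl is installed; the function names http_status and tomcat_is_up are hypothetical.

```shell
#!/bin/sh
# http_status URL: print the HTTP status code returned for URL.
# curl prints 000 when it cannot connect. Requires curl.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# tomcat_is_up HOST PORT: succeed if the root page answers with 200.
# Both function names are hypothetical helpers for this tutorial.
tomcat_is_up() {
    [ "$(http_status "http://$1:$2/")" = "200" ]
}
```

For example: tomcat_is_up localhost 8080 && echo "Tomcat is running".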
5. Install Solr and integrate it with Tomcat
step1. Download Solr
step2. Unzip
step3. Create a solr folder under Tomcat's webapps directory
step4. Copy solr.war from solr-4.10.3/example/webapps/ into the solr folder created in step3 and extract it there
step5. Copy solr-4.10.3/example/solr (which contains the collection1 folder) into the Tomcat directory, then copy schema.xml from apache-nutch-2.2.1/conf/ into collection1/conf/
step6. Edit webapps/solr/WEB-INF/web.xml under the Tomcat folder
step7. Copy the jar files from solr-4.10.3/example/lib/ext/ into tomcat/webapps/solr/WEB-INF/lib/
step8. Create a classes folder under tomcat/webapps/solr/WEB-INF/ and copy log4j.properties from solr-4.10.3/example/resources into it
step9. Restart Tomcat
lemon@ubuntu:~$ cd ~/download/
lemon@ubuntu:~/download$ wget http://archive.apache.org/dist/lucene/solr/4.10.3/solr-4.10.3.zip
lemon@ubuntu:~/download$ unzip solr-4.10.3.zip
lemon@ubuntu:~/download$ mv solr-4.10.3 ../software/
lemon@ubuntu:~/download$ cd ../software/
lemon@ubuntu:~/software$ cd apache-tomcat-8.0.33/webapps/
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ mkdir solr
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp ~/software/solr-4.10.3/example/webapps/solr.war ./solr/
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ (cd solr && jar xvf solr.war) #extract the WAR inside the solr folder, where it was copied
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp -r ~/software/solr-4.10.3/example/solr ../
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ cp ~/software/apache-nutch-2.2.1/conf/schema.xml ../solr/collection1/conf/
lemon@ubuntu:~/software/apache-tomcat-8.0.33/webapps$ vim solr/WEB-INF/web.xml
Uncomment the following block and set the solr/home value:
<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>/home/lemon/software/apache-tomcat-8.0.33/solr</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ vim ~/software/apache-tomcat-8.0.33/solr/collection1/conf/solrconfig.xml #set dataDir as shown in the lines that follow
<!-- Data Directory
Used to specify an alternate directory to hold all index data
other than the default ./data under the Solr home. If
replication is in use, this should match the replication
configuration.
-->
<dataDir>${solr.data.dir:/home/lemon/software/apache-tomcat-8.0.33/solr/collection1/data}</dataDir>
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ cp ~/software/solr-4.10.3/example/lib/ext/* ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/lib/
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ mkdir ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/classes
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ cp ~/software/solr-4.10.3/example/resources/log4j.properties ~/software/apache-tomcat-8.0.33/webapps/solr/WEB-INF/classes
Finally, restart Tomcat.
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/shutdown.sh
lemon@ubuntu:~/software/apache-tomcat-8.0.33$ bin/startup.sh
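Once Tomcat is back up, Solr can also be checked from the command line. The sketch below assumes curl is installed and that the /admin/ping handler from the Solr 4.x example solrconfig.xml is still enabled for collection1; the function name solr_ping is hypothetical.

```shell
#!/bin/sh
# solr_ping BASE_URL CORE: succeed if the core's ping handler reports OK.
# Assumes the /admin/ping handler from the Solr 4.x example configuration
# is enabled. Requires curl; the function name is hypothetical.
solr_ping() {
    curl -s "$1/$2/admin/ping?wt=json" | grep -q '"status":"OK"'
}
```

For example: solr_ping http://127.0.0.1:8080/solr collection1 && echo "Solr is up". If it fails, the Tomcat log under logs/ is the first place to look.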
6. Use Solr to index the crawled data
lemon@ubuntu:~/software/apache-nutch-2.2.1$ cd ~/software/apache-nutch-2.2.1/runtime/local/
lemon@ubuntu:~/software/apache-nutch-2.2.1/runtime/local$ bin/nutch crawl -solr http://127.0.0.1:8080/solr/ -reindex
Search interface:
[Figure: search results]
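To confirm from the command line that documents reached the index, Solr can be asked for the total hit count. The sketch below assumes the default collection1 core and curl; the function names solr_count_url and parse_num_found are hypothetical, and the sed expression is a simple extraction for Solr's compact JSON output, not a general JSON parser.

```shell
#!/bin/sh
# solr_count_url BASE_URL CORE: build a query URL that matches every
# document but returns no rows, so only the hit count comes back.
solr_count_url() {
    echo "$1/$2/select?q=*:*&rows=0&wt=json"
}

# parse_num_found: read a Solr JSON response on stdin and print the
# value of numFound. Both function names are hypothetical.
parse_num_found() {
    sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p'
}
```

Usage: curl -s "$(solr_count_url http://127.0.0.1:8080/solr collection1)" | parse_num_found. A result of 0 after indexing suggests the index job did not write anything to Solr.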
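Conclusion
I ran into many pitfalls during this setup. This article is the complete procedure after working around them, so following it strictly should not cause problems. Because this is a demonstration, I did not use the more complex HBase as storage, though I plan to try it next.
If you hit an error, check that the versions match, that the paths are correct, and that the file edits were made correctly.
I have collected the errors I ran into during deployment into a separate troubleshooting article, which will be finished soon; I hope it will be useful to you.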