Environment: hadoop 2.7.7 + hbase 0.98 + nutch 2.3 + solr 4.9
Overall idea: Hadoop provides the underlying data storage; HBase sits on top of it as the NoSQL database; Nutch writes the crawled data into HBase and builds an index in Solr for display.
First, let's use the simple one-shot command:
# $1 $2 ... $n refer to the nth positional argument after the command
# directory holding the seed URLs to inject
SEEDDIR="$1"
# folder that stores the crawl data (URL status, fetched content, parsed data)
CRAWL_PATH="$2"
# if the crawl results should be indexed into Solr, the third argument must be the Solr URL
# if Solr indexing is not needed, comment out the next line, but adjust the rest of the script accordingly
SOLRURL="$3"
# crawl depth (number of levels of the breadth-first traversal)
# if you comment out the SOLRURL line above, remember to change $4 below to $3
LIMIT="$4"
These are the variable definitions at the top of the crawl script.
./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
SEEDDIR CRAWL_PATH SOLRURL LIMIT
You can think of crawl as a convenience wrapper script; now let's reproduce what it does step by step with the nutch command itself.
First, run ./nutch by itself to see which subcommands it accepts:
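Before going step by step, here is a dry-run sketch of what the crawl wrapper does per round. This is a hypothetical simplification, not the real script: "echo" stands in for actually invoking bin/nutch, and the flag values (topN 50000, 50 threads) are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical, simplified sketch of the crawl wrapper's main loop (dry run:
# echo prints the commands instead of running them).
SEEDDIR="urls/seed.txt"
CRAWL_ID="TestCrawl"
SOLRURL="http://localhost:8983/solr/"
LIMIT=2

echo "nutch inject $SEEDDIR -crawlId $CRAWL_ID"
for ((round = 1; round <= LIMIT; round++)); do
  batchId=$(date +%s)-$RANDOM                 # unique tag for this round's batch
  echo "nutch generate -topN 50000 -crawlId $CRAWL_ID -batchId $batchId"
  echo "nutch fetch $batchId -crawlId $CRAWL_ID -threads 50"
  echo "nutch parse $batchId -crawlId $CRAWL_ID"
  echo "nutch updatedb $batchId -crawlId $CRAWL_ID"
done
echo "nutch index -D solr.server.url=$SOLRURL -all -crawlId $CRAWL_ID"
```

Each round generates a fresh batch, fetches it, parses it, and folds the results back into the web table; the sections below walk through exactly these subcommands by hand.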
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
index run the plugin-based indexer on parsed batches
elasticindex run the elasticsearch indexer - DEPRECATED use the index command instead
solrindex run the solr indexer on parsed batches - DEPRECATED use the index command instead
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
clean remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
webapp run a local Nutch web application
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Since the official wiki is frankly hard to follow, the only workable approach is to imitate the shell code inside the crawl script step by step.
1. inject urls/ -crawlId mywebtable
urls/ is the directory holding the URLs to crawl.
mywebtable is the name of the table to create in HBase; Nutch appends _webpage to whatever name you give, so the actual table ends up as mywebtable_webpage.
root@pixel:/opt/nutch/runtime/local# ./bin/nutch inject urls/ -crawlId 1
On success the output looks like this:
InjectorJob: starting at 2019-05-18 16:30:25
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2019-05-18 16:30:32, elapsed: 00:00:07
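As a sanity check on the naming rule mentioned earlier (crawlId + _webpage), you can derive the expected table name in shell; confirming it on the live cluster with the hbase shell is left as a manual step:

```shell
# derive the HBase table name Nutch builds from the crawlId
crawlId="1"
table="${crawlId}_webpage"
echo "$table"    # prints: 1_webpage
# to confirm on a running cluster (manual step, needs hbase on PATH):
#   echo "list" | hbase shell | grep "$table"
```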
2. batchId=`date +%s`-$RANDOM
I couldn't say precisely what this was at first, but failed runs showed that it matters: it is the batch id, a unique tag that generate stamps onto the selected URLs so that the later fetch/parse/updatedb steps can address exactly that batch.
Printing it out shows it is just the current Unix timestamp plus a random number:
root@pixel:/opt/nutch/runtime/local# date
Sat May 18 16:34:20 CST 2019
root@pixel:/opt/nutch/runtime/local# echo `date +%s`-$RANDOM
1558168514-954
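So the expression just joins the current Unix timestamp (`date +%s`) to bash's built-in `$RANDOM` (an integer between 0 and 32767) with a dash. A quick shape check:

```shell
# build a batch id the same way the crawl script does
batchId=$(date +%s)-$RANDOM
echo "$batchId"
# current epoch timestamps are 10 digits; $RANDOM contributes 1-5 digits
[[ $batchId =~ ^[0-9]{10}-[0-9]{1,5}$ ]] && echo "shape looks right"
```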
3. bin/nutch generate
root@pixel:/opt/nutch/runtime/local# ./bin/nutch generate -topN 2 -crawlId '1'
On success the output looks like this:
GeneratorJob: starting at 2019-05-18 16:59:36
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 2
GeneratorJob: finished at 2019-05-18 16:59:42, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1558169976-1599953580 containing 2 URLs
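If you script these steps yourself, you can capture the batch id that generate prints and pass it to the later steps instead of using -all. Using the sample log line above (in a real run you would pipe the actual generate output through the same sed):

```shell
# sample log line copied from the generate run above
log='GeneratorJob: generated batch id: 1558169976-1599953580 containing 2 URLs'
# extract just the batch id
batchId=$(printf '%s\n' "$log" | sed -n 's/.*generated batch id: \([^ ]*\) .*/\1/p')
echo "$batchId"   # prints: 1558169976-1599953580
# then e.g.: ./bin/nutch fetch "$batchId" -crawlId '1' -threads 10
```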
4. fetch $batchId -crawlId "$CRAWL_ID" -threads 50
./bin/nutch fetch -all -crawlId '1' -threads 10
On success the fetcher downloads the generated URLs and exits without errors. (Note: the crawl script also runs a parse step between fetch and updatedb; if your fetcher is not configured to parse while fetching, run ./bin/nutch parse -all -crawlId '1' before the next step.)
5. updatedb $batchId -crawlId "$CRAWL_ID"
root@pixel:/opt/nutch/runtime/local# ./bin/nutch updatedb -all -crawlId "1"
On success the output looks like this:
DbUpdaterJob: starting at 2019-05-18 16:45:54
DbUpdaterJob: batchId: 1558168514-954
DbUpdaterJob: finished at 2019-05-18 16:45:59, time elapsed: 00:00:04
6. index -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
root@pixel:/opt/nutch/runtime/local# ./bin/nutch index -D solr.server.url=http://localhost:8983/solr/ -all -crawlId "1"
On success the output looks like this:
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
IndexingJob: done.
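To verify that documents actually landed in Solr, you can query the select handler. The sketch below only builds the query URL (q, wt and rows are standard Solr query parameters; the base URL is the one used above); fetching it needs a running Solr, so the curl call is a commented manual step:

```shell
# build a Solr select URL to inspect the indexed documents
SOLRURL="http://localhost:8983/solr"               # assumed base URL, no trailing slash
query="${SOLRURL}/select?q=*:*&wt=json&rows=5"
echo "$query"
# on a live setup (manual step):
#   curl -s "$query"
```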