Environment: hadoop 2.7.7 + hbase 0.98 + nutch 2.3 + solr 4.9
Overall idea: Hadoop provides the underlying data storage; HBase sits on top of it as the NoSQL database; Nutch writes the crawled data into HBase and builds an index in Solr for display.
First, let's use the simple one-shot command:
# $1 $2 ... $n refer to the nth positional argument after the command
# directory holding the seed URLs to inject
SEEDDIR="$1"
# folder that stores the crawl data (URL status, fetched content, parsed data)
CRAWL_PATH="$2"
# if the crawl results should be indexed into Solr, the third argument must be the Solr URL
# if Solr indexing is not needed, comment out the next line, but adjust the rest of the script accordingly
SOLRURL="$3"
# crawl depth (number of levels of the breadth-first traversal)
# if you comment out the SOLRURL line above, remember to change $4 below to $3
LIMIT="$4"
These are the variable definitions at the top of the crawl script.
./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/ 2
SEEDDIR CRAWL_PATH SOLRURL LIMIT
You can think of crawl as a convenience wrapper script; now let's reproduce what it does step by step with the nutch command itself.
First, run ./nutch by itself to see which subcommands it accepts:
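Before going step by step, here is a dry-run sketch of what the crawl wrapper does per round. This is a hypothetical simplification, not the real script: "echo" stands in for actually invoking bin/nutch, and the flag values (topN 50000, 50 threads) are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical, simplified sketch of the crawl wrapper's main loop (dry run:
# echo prints the commands instead of running them).
SEEDDIR="urls/seed.txt"
CRAWL_ID="TestCrawl"
SOLRURL="http://localhost:8983/solr/"
LIMIT=2

echo "nutch inject $SEEDDIR -crawlId $CRAWL_ID"
for ((round = 1; round <= LIMIT; round++)); do
  batchId=$(date +%s)-$RANDOM                 # unique tag for this round's batch
  echo "nutch generate -topN 50000 -crawlId $CRAWL_ID -batchId $batchId"
  echo "nutch fetch $batchId -crawlId $CRAWL_ID -threads 50"
  echo "nutch parse $batchId -crawlId $CRAWL_ID"
  echo "nutch updatedb $batchId -crawlId $CRAWL_ID"
done
echo "nutch index -D solr.server.url=$SOLRURL -all -crawlId $CRAWL_ID"
```

Each round generates a fresh batch, fetches it, parses it, and folds the results back into the web table; the sections below walk through exactly these subcommands by hand.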
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
index run the plugin-based indexer on parsed batches
elasticindex run the elasticsearch indexer - DEPRECATED use the index command instead
solrindex run the solr indexer on parsed batches - DEPRECATED use the index command instead
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
clean remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
webapp run a local Nutch web application
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Since the official wiki is frankly hard to follow, the only workable approach is to imitate the shell code inside the crawl script step by step.
1. inject urls/ -crawlId mywebtable
urls/ is the directory holding the URLs to crawl.
mywebtable is the name of the table to create in HBase; Nutch appends _webpage to whatever name you give, so the actual table ends up as mywebtable_webpage.
root@pixel:/opt/nutch/runtime/local# ./bin/nutch inject urls/ -crawlId 1
On success the output looks like this:
InjectorJob: starting at 2019-05-18 16:30:25
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2019-05-18 16:30:32, elapsed: 00:00:07
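As a sanity check on the naming rule mentioned earlier (crawlId + _webpage), you can derive the expected table name in shell; confirming it on the live cluster with the hbase shell is left as a manual step:

```shell
# derive the HBase table name Nutch builds from the crawlId
crawlId="1"
table="${crawlId}_webpage"
echo "$table"    # prints: 1_webpage
# to confirm on a running cluster (manual step, needs hbase on PATH):
#   echo "list" | hbase shell | grep "$table"
```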
2. batchId=`date +%s`-$RANDOM
I couldn't say precisely what this was at first, but failed runs showed that it matters: it is the batch id, a unique tag that generate stamps onto the selected URLs so that the later fetch/parse/updatedb steps can address exactly that batch.
Printing it out shows it is just the current Unix timestamp plus a random number:
root@pixel:/opt/nutch/runtime/local# date
Sat May 18 16:34:20 CST 2019
root@pixel:/opt/nutch/runtime/local# echo `date +%s`-$RANDOM
1558168514-954
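So the expression just joins the current Unix timestamp (`date +%s`) to bash's built-in `$RANDOM` (an integer between 0 and 32767) with a dash. A quick shape check:

```shell
# build a batch id the same way the crawl script does
batchId=$(date +%s)-$RANDOM
echo "$batchId"
# current epoch timestamps are 10 digits; $RANDOM contributes 1-5 digits
[[ $batchId =~ ^[0-9]{10}-[0-9]{1,5}$ ]] && echo "shape looks right"
```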
3. bin/nutch generate
root@pixel:/opt/nutch/runtime/local# ./bin/nutch generate -topN 2 -crawlId '1'
On success the output looks like this:
GeneratorJob: starting at 2019-05-18 16:59:36
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 2
GeneratorJob: finished at 2019-05-18 16:59:42, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1558169976-1599953580 containing 2 URLs
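If you script these steps yourself, you can capture the batch id that generate prints and pass it to the later steps instead of using -all. Using the sample log line above (in a real run you would pipe the actual generate output through the same sed):

```shell
# sample log line copied from the generate run above
log='GeneratorJob: generated batch id: 1558169976-1599953580 containing 2 URLs'
# extract just the batch id
batchId=$(printf '%s\n' "$log" | sed -n 's/.*generated batch id: \([^ ]*\) .*/\1/p')
echo "$batchId"   # prints: 1558169976-1599953580
# then e.g.: ./bin/nutch fetch "$batchId" -crawlId '1' -threads 10
```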
4. fetch $batchId -crawlId "$CRAWL_ID" -threads 50
./bin/nutch fetch -all -crawlId '1' -threads 10
On success the fetcher downloads the generated URLs and exits without errors. (Note: the crawl script also runs a parse step between fetch and updatedb; if your fetcher is not configured to parse while fetching, run ./bin/nutch parse -all -crawlId '1' before the next step.)
5. updatedb $batchId -crawlId "$CRAWL_ID"
root@pixel:/opt/nutch/runtime/local# ./bin/nutch updatedb -all -crawlId "1"
On success the output looks like this:
DbUpdaterJob: starting at 2019-05-18 16:45:54
DbUpdaterJob: batchId: 1558168514-954
DbUpdaterJob: finished at 2019-05-18 16:45:59, time elapsed: 00:00:04
6. index -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
root@pixel:/opt/nutch/runtime/local# ./bin/nutch index -D solr.server.url=http://localhost:8983/solr/ -all -crawlId "1"
On success the output looks like this:
IndexingJob: starting
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
IndexingJob: done.
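To verify that documents actually landed in Solr, you can query the select handler. The sketch below only builds the query URL (q, wt and rows are standard Solr query parameters; the base URL is the one used above); fetching it needs a running Solr, so the curl call is a commented manual step:

```shell
# build a Solr select URL to inspect the indexed documents
SOLRURL="http://localhost:8983/solr"               # assumed base URL, no trailing slash
query="${SOLRURL}/select?q=*:*&wt=json&rows=5"
echo "$query"
# on a live setup (manual step):
#   curl -s "$query"
```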