Building a crawler with Nutch 2.2.1 + MySQL on Ubuntu 15.10


Author: trieyouth | Published 2016-04-05 11:21

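    Introduction

    This tutorial is for anyone who has just heard of Nutch, is curious enough to try it, and has no idea where to start.

    Downloading the Nutch source

    Jianshu has nowhere to upload files, which is a little sad, so I had to fall back on <a href="http://download.csdn.net/detail/trieyouth/9480480">CSDN</a> (it only costs 2 C coins, an honest price).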

    Configuration before compiling Nutch

    • Enable MySQL support
        <!-- Edit ivy/ivy.xml -->
        <!-- Ivy is a dependency manager, much like Maven; here we add the SQL dependencies -->
        <!-- Uncomment these two lines -->
        <dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
        <dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
        <!-- Change this line -->
        <dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
        <!-- to -->
        <dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
        <!-- The reason, as stated in the comment that ships in ivy.xml: -->
        <!-- Uncomment this to use SQL as Gora backend. It should be noted that the 
        gora-sql 0.1.1-incubating artifact is NOT compatable with gora-core 0.3. Users should 
        downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
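
Before kicking off the long build, it may be worth confirming that the ivy/ivy.xml edits above actually took effect. The following is a minimal sketch, not part of Nutch: `active_deps` and `check_ivy_deps` are hypothetical helper names, and the check assumes each dependency sits on one line with its attributes in the same order as shown above.

```shell
# Print <dependency> lines that are NOT inside an XML comment.
active_deps() {
  awk '
    {
      line = $0
      out = ""
      while (length(line) > 0) {
        if (in_comment) {
          i = index(line, "-->")
          if (i == 0) { line = "" }
          else { line = substr(line, i + 3); in_comment = 0 }
        } else {
          i = index(line, "<!--")
          if (i == 0) { out = out line; line = "" }
          else { out = out substr(line, 1, i - 1); line = substr(line, i + 4); in_comment = 1 }
        }
      }
      if (out ~ /<dependency /) print out
    }
  ' "$1"
}

# Report whether the three dependencies from this tutorial are active.
# Returns 1 if any of them is missing or still commented out.
check_ivy_deps() {
  deps=$(active_deps "$1")
  status=0
  for pattern in 'name="mysql-connector-java"' \
                 'name="gora-sql"' \
                 'name="gora-core" rev="0.2.1"'; do
    if printf '%s\n' "$deps" | grep -qF -- "$pattern"; then
      echo "ok: $pattern"
    else
      echo "missing: $pattern"
      status=1
    fi
  done
  return $status
}
```

From the Nutch root directory this would be called as `check_ivy_deps ivy/ivy.xml`. A "missing" line usually means the dependency is still wrapped in a comment, or gora-core is still at rev 0.3.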
      
    • Configure the MySQL parameters
        # conf/gora.properties
        # Comment out the "Default SqlStore properties" section and add the MySQL properties
        # MySQL properties
        gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
        gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
        gora.sqlstore.jdbc.user=root
        gora.sqlstore.jdbc.password=password
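
To double-check which JDBC settings are actually active in conf/gora.properties, a small reader like the one below can help. This is only a sketch: `gora_prop` and `show_mysql_settings` are hypothetical helpers, and they handle only plain `key=value` lines that start in the first column (lines commented out with `#` are skipped).

```shell
# Print the value of one key from a .properties file.
# Everything after the first "=" is treated as the value, so URLs
# that contain "=" themselves come through intact.
gora_prop() {
  awk -v k="$1" 'index($0, k "=") == 1 { print substr($0, length(k) + 2); exit }' "$2"
}

# Show the driver, URL and user that Gora will use (the password is not printed).
show_mysql_settings() {
  file=${1:-conf/gora.properties}
  for key in gora.sqlstore.jdbc.driver gora.sqlstore.jdbc.url gora.sqlstore.jdbc.user; do
    echo "$key = $(gora_prop "$key" "$file")"
  done
}
```

Running `show_mysql_settings` from the Nutch root should echo the three values entered above. To confirm the credentials themselves, `mysql -u root -p -e "SELECT 1"` will prompt for the password and fail if MySQL rejects the login.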
    
    • Edit the Nutch settings
       <!-- Rename nutch-site.xml.template to nutch-site.xml -->
       <!-- Add the following to conf/nutch-site.xml -->
       <property>
           <name>http.agent.name</name>
           <value>LiuXun Nutch Spider</value>
       </property>
    
       <property>
           <name>http.accept.language</name>
           <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
           <description>Value of the "Accept-Language" request header field.
           This allows selecting non-English language as default one to retrieve.
           It is a useful setting for search engines built for certain national group.
           </description>
       </property>
    
       <property>
           <name>parser.character.encoding.default</name>
           <value>utf-8</value>
           <description>The character encoding to fall back to when no other information
           is available</description>
       </property>
    
       <property>
           <name>storage.data.store.class</name>
           <value>org.apache.gora.sql.store.SqlStore</value>
           <description>The Gora DataStore class for storing and retrieving data.
           Currently the following stores are available: ….
           </description>
       </property>
    
       <property>
           <name>generate.batch.id</name>
           <value>*</value>
       </property>
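
With five properties to add by hand, it is easy to leave one out. The sketch below lists which of the property names from this section are absent from the file. `site_prop_names` and `missing_site_props` are hypothetical helpers, and the check assumes each `<name>` element is on a single line; it does not look at the values or at XML comments.

```shell
# Print every property name found in a Nutch XML config file.
site_prop_names() {
  sed -n 's:.*<name>\(.*\)</name>.*:\1:p' "$1"
}

# Print the names from this tutorial that are absent from the file.
# No output means all five are present.
missing_site_props() {
  names=$(site_prop_names "$1")
  for want in http.agent.name \
              http.accept.language \
              parser.character.encoding.default \
              storage.data.store.class \
              generate.batch.id; do
    printf '%s\n' "$names" | grep -qxF -- "$want" || echo "$want"
  done
}
```

Called as `missing_site_props conf/nutch-site.xml` from the Nutch root, it prints one missing name per line.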
    

    Installing the Nutch build tool

    Download <a href="http://download.csdn.net/detail/trieyouth/9481370">ant</a> and add it to your PATH (it really is that simple).
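
For anyone unsure what "add it to your PATH" looks like in practice, here is a minimal sketch. The directory `/opt/apache-ant` is an assumption made for illustration; substitute wherever you unpacked the archive.

```shell
# ANT_HOME is a hypothetical install location; point it at your unpacked ant directory.
ANT_HOME=${ANT_HOME:-/opt/apache-ant}
export ANT_HOME

# Append ant's bin directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$ANT_HOME/bin:"*) ;;
  *) PATH="$PATH:$ANT_HOME/bin"; export PATH ;;
esac

# Report whether the shell can now find ant.
if command -v ant >/dev/null 2>&1; then
  ant -version || echo "ant was found but failed to run (is Java installed?)"
else
  echo "ant not found on PATH; check ANT_HOME=$ANT_HOME"
fi
```

To make the setting survive new terminal sessions, the same lines can go at the end of `~/.bashrc`. On Ubuntu, installing the `ant` package with apt-get is an alternative to the manual download.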

    Compiling Nutch

    • Configuration
      Put <a href="http://download.csdn.net/detail/trieyouth/9481385">sonar-ant-task-2.1.jar</a> in the Nutch root directory and edit build.xml:
      <!-- Define the Sonar task if this hasn't been done in a common script -->
     <taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
             <classpath path="${ant.library.dir}" />
             <classpath path="${mysql.library.dir}" />
             <classpath><fileset dir="." includes="sonar*.jar" /></classpath>
     </taskdef>
    
    • Build with ant
      Run the ant runtime command in the Nutch root directory, then settle in for a long, frankly scary wait while the dependencies download.
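
Because the build takes so long, it helps to fail fast on the obvious mistakes before starting it. The wrapper below is a sketch under stated assumptions: `build_nutch` is a hypothetical helper, and it assumes a successful `ant runtime` leaves the launcher script at `runtime/local/bin/nutch` under the source root.

```shell
# Run "ant runtime" in the given Nutch source directory (default: current directory).
build_nutch() {
  nutch_root=${1:-.}
  if [ ! -f "$nutch_root/build.xml" ]; then
    echo "no build.xml in $nutch_root; run this from the Nutch source root" >&2
    return 1
  fi
  if ! command -v ant >/dev/null 2>&1; then
    echo "ant is not on PATH" >&2
    return 1
  fi
  # Build in a subshell so the caller's working directory is unchanged.
  (cd "$nutch_root" && ant runtime) || return 1
  # Assumed output location of the launcher script after a successful build.
  if [ -f "$nutch_root/runtime/local/bin/nutch" ]; then
    echo "build ok: $nutch_root/runtime/local/bin/nutch"
  else
    echo "build finished but runtime/local/bin/nutch was not found" >&2
    return 1
  fi
}
```

Calling `build_nutch` with no argument from the Nutch root runs the same `ant runtime` command described above, with the checks wrapped around it.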

    Preview

    Next post: <a href="http://www.jianshu.com/p/6c8d59d1f920">Building a crawler platform with Nutch 2.2.1 + HBase 1.1.1 on Ubuntu 15.10 (a failed attempt)</a>
