美文网首页Go语言Go 语言学习专栏 程序员
『Go 语言学习专栏』-- 第七期

『Go 语言学习专栏』-- 第七期

作者: 谢小路 | 来源:发表于2018-05-14 07:11 被阅读259次
    golang-learning-seven.png 7.png

    大家好,我叫谢伟,是一名程序员。

    我们已经研究了:

    不管学习什么,如果没有得到快速入门的机会,会丧失学习的动力。进而失去深入研究一门技能的机会。这对初学者或者自学者来说,这一点非常的重要,不然的话,会重复的抓起沙子,而建设不了大厦,所以说自信心很重要。

    这节呢,使用之前学习的知识。完成一个小任务。

    作为程序员呢。我们在专注学习研究技术的同时,也需要关注一些技术的热点。那怎么才能关注技术热点,比如现在的技术人员在研究些什么、关注些什么?

    方法当然是上主流的技术社区,了解现在的技术人员在研究些什么东西。

    这里我们说的主流的技术社区,认为是 Github 。因为这个托管网站实在是存在太多值得你研究的东西、巨多开源的技术值得你去研究。

    Github 专门有一个链接指向当天最热门的项目。从这一个侧面,我们大概可以了解到热门的语言的一些热门项目。

    Github Trending

    还可以根据编程语言查看热门的项目:

    比如:

    语言 链接
    Python Github Trending Python
    Go Github Trending Go

    我们的目的是:抓取这些热门的项目的一些信息。(因为我发现,不管是Python 还是Go 爬虫似乎总能很好的激发学习者的兴趣?)

    github-trending.png github-trending-dev.png

    任务就是上面两张图里的内容:

    • 定义抓取字段
    • 获取网页信息
    • 解析网页信息
    • 任务调度
    • 函数主入口

    这里在提一点:初学者往往不太注重自己的项目的工程结构。什么意思呢?意思是说初学者往往注重在实现部分,认为实现了功能,整个工程就差不多结束了,就理所当然的认为自己的开发任务完成了。实际上在企业里的任务开发和你自己练手玩的项目很不一样,企业里的任务开发往往会根据需求变动,假如在学校里,你做一个项目,老师给你定下了一个任务,中途又改变了,待你代码差不多写好了,又更改了任务目标,看上去你肯定会抱怨老师,实际上这种情形在企业里开发是日常很常见的。所以,刚开始我就建议初学者或者自学者坚持一项好的工程组织结构,以后都在这个项目的组织结构上动态的调整(主体不变,内部细节调整)。事实上很多设计模式或者软件设计架构都是有一套固定的项目组织结构。这样保证项目可扩展性、低耦合等

    项目结构

    就爬虫项目,给你推荐下面一个工程目录:

    workspace
      download
          download.go
      engine
          engine.go
          object.go
      infra
         util.go
      main
         main.go
      parse
         github
              github_trending_parse.go
    

    解释下各个文件的含义:

    download
       download.go
    

    定位为:下载器
    download.go完成的是:获取网页信息

      engine
          engine.go
          object.go
    

    定位为:调度引擎
    engine.go 完成的是:爬虫任务的调度
    object.go 完成的是: 定义抓取的字段

      infra
         util.go
    

    定位为:基础设施

    util.go 完成的是:项目需要的一些辅助函数

      main
         main.go
    

    主函数入口。没什么好说的。

      parse
         github
              github_trending_parse.go
    

    定位为:解析器

    github_trending_parse.go 完成的是:解析github 网站的一些解析函数


    下载器

    // download.go
    var (
        ErrorNil       = errors.New("response is  nil")
        ErrorWrongCode = errors.New("http response code is wrong")
    )
    
    func Download(url string) (*goquery.Document, error) {
    
        var (
            resp *http.Response
            err  error
        )
    
        if resp, err = http.Get(url); err != nil {
            return nil, ErrorNil
        }
    
        defer resp.Body.Close()
    
        if resp.StatusCode != 200 {
            return nil, ErrorWrongCode
        }
    
        return goquery.NewDocumentFromReader(resp.Body)
    }
    
    
    
    • 注意函数命名
    • 注意错误处理机制:建议每个文件的开头定义一些错误信息

    解析器

    // github_trending_parse.go
    
    func ParseForGithub(document *goquery.Document) {
    
        document.Find("div.explore-content ol.repo-list li").Each(func(i int, selection *goquery.Selection) {
            RespName, _ := infra.HandleCommon(selection.Find("div h3 a").Text())
            URL, _ := infra.HandlerURL(selection.Find("div").Eq(0).Find("h3 a").AttrOr("href", "None"))
            Description, _ := infra.HandleCommon(selection.Find("div").Eq(2).Find("p").Text())
            Stars, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("a").Eq(0).Text())
            Fork, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("a").Eq(1).Text())
            TodayStars, _ := infra.HandleCommon(selection.Find("div").Eq(3).Find("span").Eq(1).Text())
    
            fmt.Println(RespName, URL, Description, Stars, Fork, TodayStars)
        })
    
    }
    
    func ParseForDevelopers(document *goquery.Document) {
    
        document.Find("div.explore-content ol li").Each(func(i int, selection *goquery.Selection) {
            DevName, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("h2 a").Text())
            Description, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("a span").Text())
            URL, _ := infra.HandleCommon(selection.Find("li div div").Eq(1).Find("h2 a").AttrOr("href", "None"))
    
            fmt.Println(DevName, Description, URL)
        })
    }
    
    
    
    • 一个解析函数解析:https://github.com/trending/developers
    • 一解析函数解析:https://github.com/trending

    调度器

    // enigin.go
    
    package engine
    
    import (
        "errors"
        "fmt"
        "go-example-for-live/seven_learning/download"
    
        "github.com/PuerkitoBio/goquery"
    )
    
    var (
        ErrorDocWrong = errors.New("document wrong")
    )
    
    type Trending struct {
    }
    
    func (t Trending) Run(request RequestForGithub) {
    
        var doc *goquery.Document
    
        doc, err := download.Download(request.URL)
        if err != nil {
            fmt.Println(ErrorDocWrong)
            return
        }
        if doc != nil {
            fmt.Println("Game start!")
            request.ParseFunc(doc)
        } else {
            fmt.Println("Game over!")
        }
    
    }
    
    
    
    • 负责串接:下载器和解析器,获取到抓取的字段
    package engine
    
    import "github.com/PuerkitoBio/goquery"
    
    type RequestForGithub struct {
        URL       string
        ParseFunc func(doc *goquery.Document)
    }
    
    type Repositories struct {
        RespName    string
        URL         string
        Stars       int
        Fork        int
        TodayStars  string
        Description string
    }
    
    type Developers struct {
        DevName     string
        Description string
        URL         string
    }
    
    
    
    • 定义三个结构体:

    1、称之为种子:包括URL 和 解析函数
    2、Developers 定义为https://github.com/trending/developers网页的抓取字段
    3、Repositories 定义为https://github.com/trending网页的抓取字段

    基础设施

    // util.go
    package infra
    
    import (
        "errors"
        "strings"
    )
    
    var (
        ErrorStringSpace = errors.New("string trim error")
    )
    
    func HandleCommon(oldString string) (string, error) {
        newReplacer := strings.NewReplacer("\n", "", "\t", "")
        return strings.TrimSpace(newReplacer.Replace(oldString)), nil
    }
    
    func HandlerURL(oldString string) (string, error) {
        return "https://github.com" + strings.TrimSpace(oldString), nil
    }
    
    
    

    即:一些字符串的处理函数,比如替换函数、拼接函数

    主函数入口

    // main.go
    
    package main
    
    import (
        "go-example-for-live/seven_learning/engine"
        "go-example-for-live/seven_learning/parse/github"
    )
    
    func main() {
    
        var simplerTest engine.Trending
    
        simplerTest.Run(
            engine.RequestForGithub{
                URL:       "https://github.com/trending",
                ParseFunc: github.ParseForGithub,
            },
        )
        simplerTest.Run(
            engine.RequestForGithub{
                URL:       "https://github.com/trending/developers",
                ParseFunc: github.ParseForDevelopers,
            },
        )
    
    }
    
    
    

    结果:

    Game start!
    xingshaocheng / architect-awesome https://github.com/xingshaocheng/architect-awesome 后端架构师技术图谱 13,220 3,150 1,528 stars today
    google / gvisor https://github.com/google/gvisor Container Runtime Sandbox 5,080 190 
    davideuler / architecture.of.internet-product https://github.com/davideuler/architecture.of.internet-product 互联网公司技术架构,微信/淘宝/微博/腾讯/阿里/美团点评/百度/Google/Facebook/Amazon/eBay的架构,欢迎PR补充 7,058 1,123 1,427 stars today
    kusti8 / proton-native https://github.com/kusti8/proton-native A React environment for cross platform native desktop apps 6,216 153 
    github / gh-ost https://github.com/github/gh-ost GitHub's Online Schema Migrations for MySQL 5,136 335 
    pytorch / ELF https://github.com/pytorch/ELF ELF: a platform for game research 1,525 218 
    cyanharlow / purecss-francine https://github.com/cyanharlow/purecss-francine HTML/CSS drawing in the style of an 18th-century oil painting. Hand-coded entirely in HTML & CSS. 4,035 169 
    sallar / github-contributions-chart https://github.com/sallar/github-contributions-chart Generate an image of all your Github contributions 2,228 60 
    RelaxedJS / ReLaXed https://github.com/RelaxedJS/ReLaXed Create PDF documents using web technologies 7,116 181 
    sindresorhus / ow https://github.com/sindresorhus/ow Function argument validation for humans 1,790 18 
    xx45 / dayjs https://github.com/xx45/dayjs Fast 2KB immutable date library alternative to Moment.js with the same modern API 9,310 269 
    sharkdp / bat https://github.com/sharkdp/bat A cat(1) clone with wings. 2,102 26 
    CyC2018 / Interview-Notebook https://github.com/CyC2018/Interview-Notebook  技术面试需要掌握的基础知识整理,欢迎编辑~ 21,845 5,763 256 stars today
    shimohq / chinese-programmer-wrong-pronunciation https://github.com/shimohq/chinese-programmer-wrong-pronunciation 中国程序员容易发音错误的单词 5,963 530 257 stars today
    binhnguyennus / awesome-scalability https://github.com/binhnguyennus/awesome-scalability High Scalability, High Availability, High Stability, High Performance, and High Intelligence Back-End Design Patterns 10,778 810 246 stars today
    nhnent / tui.calendar https://github.com/nhnent/tui.calendar A JavaScript calendar that everything you need. 4,759 192 
    YadiraF / PRNet https://github.com/YadiraF/PRNet The source code of 'Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network'. 1,296 108 
    hasura / skor https://github.com/hasura/skor Listen to postgres events and forward them as JSON payloads to a webhook 946 19 
    roytseng-tw / Detectron.pytorch https://github.com/roytseng-tw/Detectron.pytorch A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available. 926 114 
    iotexproject / iotex-core https://github.com/iotexproject/iotex-core Connecting the physical world, block by block. 654 58 
    layerJS / layerJS https://github.com/layerJS/layerJS layerJS: Javascript UI composition framework 1,039 26 
    AllThingsSmitty / css-protips https://github.com/AllThingsSmitty/css-protips A collection of tips to help take your CSS skills pro 11,921 785 188 stars today
    tabler / tabler https://github.com/tabler/tabler Tabler is free and open-source HTML Dashboard UI Kit built on Bootstrap 4 13,629 983 
    tmcw / big https://github.com/tmcw/big presentations for busy messy hackers 2,436 137 
    cgoldsby / LoginCritter https://github.com/cgoldsby/LoginCritter An animated avatar that responds to text field interactions 3,351 141 
    Game start!
    google                                                          (Google) (Google)                                      material-design-icons            material-design-icons        Material Design icons by Google /google
    davideuler                                                          (david l euler) (david l euler)                                      architecture.of.internet-product            architecture.of.internet-product        互联网公司技术架构,微信/淘宝/微博/腾讯/阿里/美团点评/百度/Google/Facebook/Amazon/eBay的架构,欢迎PR补充 /davideuler
    xingshaocheng architect-awesome            architect-awesome        后端架构师技术图谱 /xingshaocheng
    cyanharlow                                                          (Diana Smith) (Diana Smith)                                      purecss-francine            purecss-francine        HTML/CSS drawing in the style of an 18th-century oil painting. Hand-coded entirely in HTML & CSS. /cyanharlow
    kusti8                                                          (Gustav Hansen) (Gustav Hansen)                                      proton-native            proton-native        A React environment for cross platform native desktop apps /kusti8
    pytorch pytorch            pytorch        Tensors and Dynamic neural networks in Python with strong GPU acceleration /pytorch
    github                                                          (GitHub) (GitHub)                                      gitignore            gitignore        A collection of useful .gitignore templates /github
    sindresorhus                                                          (Sindre Sorhus) (Sindre Sorhus)                                      awesome            awesome         Curated list of awesome lists /sindresorhus
    sallar                                                          (Sallar Kaboli) (Sallar Kaboli)                                      github-contributions-chart            github-contributions-chart         Generate an image of all your Github contributions /sallar
    RelaxedJS                                                          (ReLaXed) (ReLaXed)                                      ReLaXed            ReLaXed        Create PDF documents using web technologies /RelaxedJS
    symfony                                                          (Symfony) (Symfony)                                      symfony            symfony        The Symfony PHP framework /symfony
    facebook                                                          (Facebook) (Facebook)                                      react            react        A declarative, efficient, and flexible JavaScript library for building user interfaces. /facebook
    Microsoft                                                          (Microsoft) (Microsoft)                                      vscode            vscode        Visual Studio Code /Microsoft
    xx45 dayjs            dayjs        Fast 2KB immutable date library alternative to Moment.js with the same modern API /xx45
    apache                                                          (The Apache Software Foundation) (The Apache Software Foundation)                                      incubator-echarts            incubator-echarts        A powerful, interactive charting and visualization library for browser /apache
    tensorflow tensorflow            tensorflow        Computation using data flow graphs for scalable machine learning /tensorflow
    sharkdp                                                          (David Peter) (David Peter)                                      fd            fd        A simple, fast and user-friendly alternative to 'find' /sharkdp
    vuejs                                                          (vuejs) (vuejs)                                      vue            vue         A progressive, incrementally-adoptable JavaScript framework for building UI on the web. /vuejs
    CyC2018 Interview-Notebook            Interview-Notebook         技术面试需要掌握的基础知识整理,欢迎编辑~ /CyC2018
    nhnent                                                          (NHN Entertainment) (NHN Entertainment)                                      tui.editor            tui.editor         Markdown WYSIWYG Editor. GFM Standard + Chart & UML Extensible. /nhnent
    binhnguyennus                                                          (Binh Nguyen) (Binh Nguyen)                                      awesome-scalability            awesome-scalability        High Scalability, High Availability, High Stability, High Performance, and High Intelligence Back-End Design Patterns /binhnguyennus
    shimohq                                                          (Shimo Docs) (Shimo Docs)                                      chinese-programmer-wrong-pronunciation            chinese-programmer-wrong-pronunciation        中国程序员容易发音错误的单词 /shimohq
    YadiraF PRNet            PRNet        The source code of 'Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network'. /YadiraF
    yyx990803                                                          (Evan You) (Evan You)                                      pod            pod        Git push deploy for Node.js /yyx990803
    roytseng-tw                                                          (Roy) (Roy)                                      Detectron.pytorch            Detectron.pytorch        A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available. /roytseng-tw
    
    
    

    需要强调的是这个项目的组织结构能够很好的进行扩展:比如说,我又想抓取其他网页。即重新再 parse 定义个新的解析器即可。其他可以复用。

    另外,最后抓取的字段并没有填充进定义的结构体内。

    再有,看上去这项目没什么值得提的,事实上,已经有人做了这个项目。每天抓取github trending 写入文件并托管在 github 上。有兴趣的可以看看别人的实现方式。

    josephyzhou/github-trending

    如果你自学者,接触不到企业级的项目,我建议你从 github 上寻找自己感兴趣的编程语言的项目重新写一遍。这样相当于,给自己出了一个题,而又有一份参考答案,能给自己一些反馈,同时不断的精进自己的技术。

    全文完。希望大家学的开心。

    相关文章

      网友评论

      • 7113a44bb3c1:有没有适合新手的github项目介绍?
        7113a44bb3c1:@谢小路 能否推荐一下
        谢小路:Go ? 有啊。
      • 候鸟_ywh:请问下你的 下载器是不是少了import啊?
        候鸟_ywh:@候鸟_ywh 没基础的前来观看
        谢小路:@候鸟_ywh 是是是,有点基础应该能自己搞出来吧?

      本文标题:『Go 语言学习专栏』-- 第七期

      本文链接:https://www.haomeiwen.com/subject/fapqdftx.html