Crawling

Author: Pope_Li | Published 2018-11-14 17:16

    Traversing a single domain

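    If the information you need is on a single page, scraping it is straightforward. If you instead have to follow links to reach it, you can borrow the search method of the "Six Degrees of Wikipedia" game. Many web pages today are connected through article links, so seemingly unrelated topics can be joined by a chain of related entries. As a first step, suppose we want to list the links on Wikipedia's Kevin_Bacon article.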
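The idea of joining unrelated topics through a chain of entries can be pictured as a search over a link graph. The following is a minimal sketch, not the method of the original post: it swaps in a plain breadth-first search, and both the `TOY_LINKS` graph and the `find_link_chain` helper are hypothetical names with invented edges, not data scraped from Wikipedia.

```python
from collections import deque

# Hypothetical toy link graph: page -> pages it links to.
# The edges are invented for illustration, not scraped from Wikipedia.
TOY_LINKS = {
    "Kevin_Bacon": ["Philadelphia", "Footloose_(1984_film)"],
    "Philadelphia": ["Pennsylvania"],
    "Footloose_(1984_film)": ["Kevin_Bacon"],
    "Pennsylvania": ["United_States"],
    "United_States": [],
}


def find_link_chain(links, start, target):
    """Return the shortest chain of pages from start to target, or None."""
    queue = deque([[start]])  # each queue item is a path beginning at start
    visited = {start}         # pages already queued, so loops are not revisited
    while queue:
        path = queue.popleft()
        page = path[-1]
        if page == target:
            return path
        for neighbor in links.get(page, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of links connects the two pages


chain = find_link_chain(TOY_LINKS, "Kevin_Bacon", "United_States")
print(" -> ".join(chain))
# prints: Kevin_Bacon -> Philadelphia -> Pennsylvania -> United_States
```

On a live site the graph is not known up front. Each page's outgoing links have to be collected from its HTML, as the listing that follows does for the Kevin_Bacon article.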

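    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Fetch the Kevin Bacon article and parse it with Python's built-in HTML parser
    html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
    bsObj = BeautifulSoup(html, "html.parser")
    # Print the href of every <a> tag that has one
    for link in bsObj.find_all("a"):
        if 'href' in link.attrs:
            print(link.attrs['href'])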
    
        /wiki/Wikipedia:Protection_policy#semi
        #mw-head
        #p-search
        /wiki/Kevin_Bacon_(disambiguation)
        /wiki/File:Kevin_Bacon_SDCC_2014.jpg
        /wiki/Philadelphia
        /wiki/Pennsylvania
        /wiki/Kyra_Sedgwick
        /wiki/Sosie_Bacon
        /wiki/Edmund_Bacon_(architect)
        /wiki/Michael_Bacon_(musician)
        http://baconbros.com/
        #cite_note-1
        #cite_note-actor-2
        /wiki/Footloose_(1984_film)
        /wiki/JFK_(film)
        /wiki/A_Few_Good_Men
        /wiki/Apollo_13_(film)
        /wiki/Mystic_River_(film)
        /wiki/Sleepers
        /wiki/The_Woodsman_(2004_film)
        /wiki/Fox_Broadcasting_Company
        /wiki/The_Following
        /wiki/HBO
        /wiki/Taking_Chance
        /wiki/Golden_Globe_Award
        /wiki/Screen_Actors_Guild_Award
        /wiki/Primetime_Emmy_Award
        /wiki/The_Guardian
        /wiki/Academy_Award
        #cite_note-3
        /wiki/Hollywood_Walk_of_Fame
        #cite_note-4
        /wiki/Social_networks
        /wiki/Six_Degrees_of_Kevin_Bacon
        /wiki/SixDegrees.org
        #cite_note-walk-5
        #Early_life_and_education
        #Acting_career
        #Early_work
        #1980s
        #1990s
        #2000s
        #2010s
        #Advertising_work
        #Personal_life
        #Six_Degrees_of_Kevin_Bacon
        #Music
        #Awards_and_nominations
        #Honors
        #Accolades
        #Filmography
        #See_also
        #References
        #External_links
        /wiki/Philadelphia
        #cite_note-actor-2
        #cite_note-actor-2
        /wiki/Edmund_Bacon_(architect)
        #cite_note-bacon-6
        /wiki/Pennsylvania_Governor%27s_School_for_the_Arts
        /wiki/Bucknell_University
        #cite_note-7
        /wiki/Glory_Van_Scott
        #cite_note-walk-5
        #cite_note-bacon-6
        /wiki/Circle_in_the_Square
        /wiki/Nancy_Mills
        /wiki/Cosmopolitan_(magazine)
        #cite_note-cosmo91-8
        /wiki/Fraternities_and_sororities
        /wiki/Animal_House
        #cite_note-bacon-6
        /wiki/Search_for_Tomorrow
        /wiki/Guiding_Light
        /wiki/Friday_the_13th_(1980_film)
        #cite_note-9
        /wiki/Phoenix_Theater
        /wiki/Flux
        /wiki/Second_Stage_Theatre
        #cite_note-bio-10
        /wiki/Obie_Award
        /wiki/Forty_Deuce
        #cite_note-kevin-11
        /wiki/Slab_Boys
        /wiki/Sean_Penn
        /wiki/Val_Kilmer
        /wiki/Barry_Levinson
        /wiki/Diner_(film)
        /wiki/Steve_Guttenberg
        /wiki/Daniel_Stern_(actor)
        /wiki/Mickey_Rourke
        /wiki/Tim_Daly
        /wiki/Ellen_Barkin
        #cite_note-12
        /wiki/Footloose_(1984_film)
        #cite_note-bio-10
        /wiki/James_Dean
        /wiki/Rebel_Without_a_Cause
        /wiki/Mickey_Rooney
        /wiki/Judy_Garland
        #cite_note-time84-13
        #cite_note-bacon-6
        #cite_note-14
        #cite_note-15
        /wiki/People_(American_magazine)
        /wiki/Typecasting_(acting)
        /wiki/John_Hughes_(filmmaker)
        /wiki/She%27s_Having_a_Baby
        #cite_note-bio-10
        /wiki/The_Big_Picture_(1989_film)
        #cite_note-16
        /wiki/Tremors_(film)
        #cite_note-17
        /wiki/Joel_Schumacher
        /wiki/Flatliners
        #cite_note-bio-10
        /wiki/Elizabeth_Perkins
        /wiki/He_Said,_She_Said_(film)
        #cite_note-bio-10
        /wiki/The_New_York_Times
        #cite_note-nyt94-18
        /wiki/Oliver_Stone
        /wiki/JFK_(film)
        #cite_note-19
        /wiki/A_Few_Good_Men_(film)
        #cite_note-20
        /wiki/Michael_Greif
        #cite_note-bio-10
        /wiki/Golden_Globe_Award
        /wiki/The_River_Wild
        #cite_note-bio-10
        /wiki/Meryl_Streep
        /wiki/Murder_in_the_First_(film)
        #cite_note-bio-10
        /wiki/Blockbuster_(entertainment)
        /wiki/Apollo_13_(film)
        #cite_note-21
        /wiki/Sleepers_(film)
        #cite_note-22
        /wiki/Picture_Perfect_(1997_film)
        #cite_note-bio-10
        /wiki/Losing_Chase
        #cite_note-austin-23
        /wiki/Digging_to_China
        #cite_note-bio-10
        /wiki/Payola
        /wiki/Telling_Lies_in_America_(film)
        #cite_note-bio-10
        /wiki/Wild_Things_(film)
        /wiki/Stir_of_Echoes
        /wiki/David_Koepp
        #cite_note-24
        /wiki/File:KevinBaconTakingChanceFeb09.jpg
        /wiki/File:KevinBaconTakingChanceFeb09.jpg
        /wiki/Taking_Chance
        /wiki/Paul_Verhoeven
        /wiki/Hollow_Man
        #cite_note-25
        /wiki/Colin_Firth
        /wiki/Rachel_Blanchard
        /wiki/M%C3%A9nage_%C3%A0_trois
        /wiki/Where_the_Truth_Lies
        #cite_note-26
        /wiki/Atom_Egoyan
        /wiki/MPAA
        /wiki/MPAA_film_rating_system
        #cite_note-27
        /wiki/Sean_Penn
        /wiki/Tim_Robbins
        /wiki/Clint_Eastwood
        /wiki/Mystic_River_(film)
        /wiki/Pedophile
        /wiki/The_Woodsman_(2004_film)
        #cite_note-28
        /wiki/HBO_Films
        /wiki/Taking_Chance
        /wiki/Michael_Strobl
        /wiki/Desert_Storm
        #cite_note-29
        /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
        /wiki/Matthew_Vaughn
        /wiki/X-Men:_First_Class
        #cite_note-30
        /wiki/Sebastian_Shaw_(comics)
        #cite_note-31
        /wiki/Dustin_Lance_Black
        /wiki/8_(play)
        /wiki/Perry_v._Brown
        /wiki/Proposition_8
        /wiki/Charles_J._Cooper
        #cite_note-8_the_play-32
        /wiki/Wilshire_Ebell_Theatre
        /wiki/American_Foundation_for_Equal_Rights
        #cite_note-8_play_video-33
        #cite_note-34
        /wiki/The_Following
        #cite_note-35
        /wiki/Saturn_Award_for_Best_Actor_on_Television
        #cite_note-36
        /wiki/Huffington_Post
        #cite_note-37
        /wiki/Tremors_(film)
        /wiki/Wikipedia:Citation_needed
        /wiki/Tremors_5:_Bloodline
        /wiki/EE_(telecommunications_company)
        /wiki/United_Kingdom
        #cite_note-38
        #cite_note-39
        /wiki/Egg_as_food
        #cite_note-40
        /wiki/Kyra_Sedgwick
        /wiki/PBS
        /wiki/Lanford_Wilson
        /wiki/Lemon_Sky
        #cite_note-cosmo91-8
        /wiki/Pyrates
        /wiki/Murder_in_the_First_(film)
        /wiki/The_Woodsman_(2004_film)
        /wiki/Loverboy_(2005_film)
        /wiki/Sosie_Bacon
        /wiki/Upper_West_Side
        /wiki/Manhattan
        #cite_note-41
        /wiki/Tracy_Pollan
        #cite_note-42
        #cite_note-43
        #cite_note-44
        /wiki/The_Times
        #cite_note-45
        #cite_note-46
        /wiki/Will.i.am
        /wiki/It%27s_a_New_Day_(Will.i.am_song)
        /wiki/Barack_Obama
        /wiki/Ponzi_scheme
        /wiki/Bernard_Madoff
        #cite_note-financialpost-47
        #cite_note-48
        /wiki/Finding_Your_Roots
        /wiki/Henry_Louis_Gates
        #cite_note-49
        #cite_note-50
        #cite_note-51
        /wiki/Six_Degrees_of_Kevin_Bacon
        /wiki/Trivia
        /wiki/Big_screen
        /wiki/Six_degrees_of_separation
        /wiki/Internet_meme
        /wiki/SixDegrees.org
        #cite_note-52
        /wiki/Bacon_number
        /wiki/Internet_Movie_Database
        #cite_note-53
        /wiki/Paul_Erd%C5%91s
        /wiki/Erd%C5%91s_number
        /wiki/Paul_Erd%C5%91s
        /wiki/Bacon_number
        /wiki/Erd%C5%91s_number
        /wiki/Erd%C5%91s%E2%80%93Bacon_number
        #cite_note-54
        /wiki/The_Bacon_Brothers
        /wiki/Michael_Bacon_(musician)
        /wiki/Music_album
        #cite_note-55
        /wiki/Wikipedia:Biographies_of_living_persons
        /wiki/Wikipedia:Citing_sources
        /wiki/Wikipedia:Verifiability
        /wiki/Wikipedia:Identifying_reliable_sources
        //www.google.com/search?as_eq=wikipedia&q=%22Kevin+Bacon%22&num=50
        //www.google.com/search?tbm=nws&q=%22Kevin+Bacon%22+-wikipedia
        //www.google.com/search?&q=%22Kevin+Bacon%22+site:news.google.com/newspapers&source=newspapers
        //www.google.com/search?tbs=bks:1&q=%22Kevin+Bacon%22+-wikipedia
        //scholar.google.com/scholar?q=%22Kevin+Bacon%22
        https://www.jstor.org/action/doBasicSearch?Query=%22Kevin+Bacon%22&acc=on&wc=on
        /wiki/Help:Maintenance_template_removal
        /wiki/File:Kevin_Bacon%27s_Star_Walk_of_Fame.jpg
        /wiki/File:Kevin_Bacon%27s_Star_Walk_of_Fame.jpg
        /wiki/Hollywood_Walk_of_Fame
        /wiki/Hollywood_Walk_of_Fame
        #cite_note-56
        /wiki/Denver_Film_Festival
        #cite_note-57
        /wiki/Phoenix_Film_Festival
        #cite_note-58
        /wiki/Santa_Barbara_International_Film_Festival
        #cite_note-59
        /wiki/Broadcast_Film_Critics_Association
        #cite_note-60
        /wiki/Seattle_International_Film_Festival
        #cite_note-61
        /wiki/Apollo_13_(film)
        /wiki/Mystic_River_(film)
        /wiki/Blockbuster_Entertainment_Awards
        /wiki/Blockbuster_Entertainment_Awards
        /wiki/Hollow_Man
        /wiki/Boston_Society_of_Film_Critics
        /wiki/Boston_Society_of_Film_Critics_Award_for_Best_Cast
        /wiki/Mystic_River_(film)
        /wiki/Bravo_Otto
        /wiki/Bravo_Otto
        /wiki/Footloose_(1984_film)
        /wiki/CableACE_Award
        /wiki/CableACE_Award
        /wiki/Losing_Chase
        /wiki/The_Woodsman_(2004_film)
        /wiki/Critics%27_Choice_Movie_Awards
        /wiki/Critics%27_Choice_Movie_Award_for_Best_Actor
        /wiki/Murder_in_the_First_(film)
        /wiki/Ghent_International_Film_Festival
        /wiki/Ghent_International_Film_Festival
        /wiki/The_Woodsman_(2004_film)
        /wiki/Giffoni_Film_Festival
        /wiki/Giffoni_Film_Festival
        /wiki/Digging_to_China
        /wiki/Gold_Derby_Awards
        /wiki/Gold_Derby_Awards
        /wiki/Mystic_River_(film)
        /wiki/Golden_Globe_Award
        /wiki/Golden_Globe_Award_for_Best_Supporting_Actor_%E2%80%93_Motion_Picture
        /wiki/The_River_Wild
        /wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
        /wiki/Taking_Chance
        /wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
        /wiki/I_Love_Dick_(TV_series)
        /wiki/Independent_Spirit_Awards
        /wiki/Independent_Spirit_Award_for_Best_Male_Lead
        /wiki/The_Woodsman_(2004_film)
        /wiki/Mystic_River_(film)
        /wiki/MTV_Movie_%26_TV_Awards
        /wiki/MTV_Movie_Award_for_Best_Villain
        /wiki/Hollow_Man
        /wiki/Taking_Chance
        /wiki/The_Following
        /wiki/E!_People%27s_Choice_Awards
        /wiki/E!_People%27s_Choice_Awards
        /wiki/The_Following
        /wiki/E!_People%27s_Choice_Awards
        /wiki/The_Following
        /wiki/Primetime_Emmy_Award
        /wiki/Primetime_Emmy_Award_for_Outstanding_Lead_Actor_in_a_Limited_Series_or_Movie
        /wiki/Taking_Chance
        /wiki/Satellite_Awards
        /wiki/Satellite_Award_for_Best_Actor_%E2%80%93_Motion_Picture
        /wiki/The_Woodsman_(2004_film)
        /wiki/Satellite_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
        /wiki/Taking_Chance
        /wiki/Saturn_Award
        /wiki/Saturn_Award_for_Best_Actor_on_Television
        /wiki/The_Following
        /wiki/Saturn_Award_for_Best_Actor_on_Television
        /wiki/The_Following
        /wiki/Scream_Awards
        /wiki/Scream_Awards
        /wiki/X-Men:_First_Class
        /wiki/Screen_Actors_Guild_Award
        /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Supporting_Role
        /wiki/Murder_in_the_First_(film)
        /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
        /wiki/Apollo_13_(film)
        /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
        /wiki/Mystic_River_(film)
        /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
        /wiki/Frost/Nixon_(film)
        /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
        /wiki/Taking_Chance
        /wiki/Teen_Choice_Awards
        /wiki/Teen_Choice_Award_for_Choice_Movie_Villain
        /wiki/Beauty_Shop
        /wiki/Teen_Choice_Award_for_Choice_Movie_Villain
        /wiki/X-Men:_First_Class
        /wiki/TV_Guide_Award
        /wiki/TV_Guide_Award
        /wiki/The_Following
        /wiki/Kevin_Bacon_filmography
        /wiki/List_of_actors_with_Hollywood_Walk_of_Fame_motion_picture_stars
        #cite_ref-1
        https://web.archive.org/web/20090113222205/http://www.newenglandancestors.org/research/services/articles_gbr78.asp
        http://www.newenglandancestors.org/research/services/articles_gbr78.asp
        #cite_ref-actor_2-0
        #cite_ref-actor_2-1
        #cite_ref-actor_2-2
        http://www.biography.com/people/kevin-bacon-9542173
        #cite_ref-3
        https://www.theguardian.com/film/filmblog/2009/feb/19/best-actors-never-nominated-for-oscars
        #cite_ref-4
        http://www.walkoffame.com/kevin-bacon
        #cite_ref-walk_5-0
        #cite_ref-walk_5-1
        https://web.archive.org/web/20141016202657/http://www.thebiographychannel.co.uk/biographies/kevin-bacon.html
        http://www.thebiographychannel.co.uk/biographies/kevin-bacon.html
        #cite_ref-bacon_6-0
        #cite_ref-bacon_6-1
        #cite_ref-bacon_6-2
        #cite_ref-bacon_6-3
        http://www.biography.com/news/kevin-bacon-biography-facts
        #cite_ref-7
        https://movies.yahoo.com/person/kevin-bacon/biography.html
        #cite_ref-cosmo91_8-0
        #cite_ref-cosmo91_8-1
        #cite_ref-9
        http://www.nydailynews.com/entertainment/happy-halloween-superstars-start-horror-flick-gallery-1.98345
        #cite_ref-bio_10-0
        #cite_ref-bio_10-1
        #cite_ref-bio_10-2
        #cite_ref-bio_10-3
        #cite_ref-bio_10-4
        #cite_ref-bio_10-5
        #cite_ref-bio_10-6
        #cite_ref-bio_10-7
        #cite_ref-bio_10-8
        #cite_ref-bio_10-9
        #cite_ref-bio_10-10
        https://www.pbs.org/wnet/finding-your-roots/profiles/kevin-bacon%C2%A0/
        #cite_ref-kevin_11-0
        http://www.tvguide.com/celebrities/kevin-bacon/bio/160550
        #cite_ref-12
        https://web.archive.org/web/20141021030336/http://news.moviefone.com/2012/03/02/diner-30th-anniversary/
        http://news.moviefone.com/2012/03/02/diner-30th-anniversary/
        #cite_ref-time84_13-0
        http://www.time.com/time/magazine/article/0,9171,950019,00.html
        #cite_ref-14
        http://www.huffingtonpost.com/2014/08/25/kevin-bacon-footloose_n_5710413.html
        #cite_ref-15
        https://web.archive.org/web/20090109152125/http://www.thebiographychannel.co.uk/biography_story/522%3A492/1/Kevin_Bacon.htm
        http://www.thebiographychannel.co.uk/biography_story/522:492/1/Kevin_Bacon.htm
        #cite_ref-16
        https://www.nytimes.com/1994/09/25/movies/a-second-wind-is-blowing-for-kevin-bacon.html
        #cite_ref-17
        https://www.nytimes.com/movie/review?res=9C0CE2DE1631F93AA25752C0A966958260
        #cite_ref-nyt94_18-0
        https://query.nytimes.com/gst/fullpage.html?res=9C07E6D91F3BF936A1575AC0A962958260
        #cite_ref-19
        http://www.jfk-online.com/jfkbacon.html
        #cite_ref-20
        http://www.tcm.com/this-month/article/143158%7C0/A-Few-Good-Men.html
        #cite_ref-21
        http://collider.com/kevin-bacon-commercials-footloose/
        #cite_ref-22
        http://www.rogerebert.com/reviews/sleepers-1996
        #cite_ref-austin_23-0
        http://www.austinchronicle.com/calendar/film/1997-02-07/283342/
        /wiki/The_Austin_Chronicle
        #cite_ref-24
        http://www.criminalelement.com/blogs/2013/09/under-the-raderhorror-movies-you-may-have-missed-stir-of-echoes
        #cite_ref-25
        http://www.rogerebert.com/reviews/hollow-man-2000
        #cite_ref-26
        https://web.archive.org/web/20141017080013/http://movies.about.com/od/wherethetruthlies/a/truthkb101305.htm
        http://movies.about.com/od/wherethetruthlies/a/truthkb101305.htm
        #cite_ref-27
        https://archive.is/20120604150801/http://jam.canoe.ca/Movies/2005/09/14/1216527.html
        http://jam.canoe.ca/Movies/2005/09/14/1216527.html
        #cite_ref-28
        http://www.latimes.com/entertainment/la-et-kevin-bacon-photo6-photo.html
        #cite_ref-29
        http://www.nydailynews.com/entertainment/tv-movies/kevin-bacon-chance-body-fallen-marine-home-article-1.392226
        #cite_ref-30
        https://web.archive.org/web/20100722010545/http://heatvision.hollywoodreporter.com/2010/07/winters-bone-star-cast-as-mystique-in-xmen-first-class.html
        http://heatvision.hollywoodreporter.com/2010/07/winters-bone-star-cast-as-mystique-in-xmen-first-class.html
        #cite_ref-31
        https://web.archive.org/web/20100720060214/http://www.forcesofgeek.com/2010/07/kevin-bacon-playing-sebastian-shaw-in-x.html
        http://www.forcesofgeek.com/2010/07/kevin-bacon-playing-sebastian-shaw-in-x.html
        #cite_ref-8_the_play_32-0
        http://www.accesshollywood.com/jesse-tyler-ferguson/glee-stars-touched-by-brad-pitt-and-george-clooneys-support-of-8_article_61543
        /wiki/Access_Hollywood
        #cite_ref-8_play_video_33-0
        https://www.youtube.com/watch?v=qlUG8F9uVgM
        #cite_ref-34
        http://www.pinknews.co.uk/2012/03/01/youtube-to-broadcast-proposition-8-play-live/
        #cite_ref-35
        http://www.fox.com/the-following/
        #cite_ref-36
        https://news.yahoo.com/blogs/trending-now/kevin-bacon-gives-millennials-a-history-lesson-about-the--80s-162525915.html
        #cite_ref-37
        http://www.huffingtonpost.com.au/entry/kevin-bacon-tremors-tv-reboot_us_5655b651e4b072e9d1c13a11
        #cite_ref-38
        http://www.campaignlive.co.uk/news/1294856/
        #cite_ref-39
        http://parade.condenast.com/269380/ashleighschmitz/kevin-bacon-reprises-his-most-iconic-film-roles-in-british-commercial/
        #cite_ref-40
        http://money.cnn.com/2015/03/13/media/kevin-bacon-eggs/index.html
        #cite_ref-41
        http://www.nydailynews.com/entertainment/tv-movies/kevin-bacon-loyalty-nyc-philly-origins-peace-bustling-city-article-1.147197
        #cite_ref-42
        http://www.people.com/people/archive/article/0,,20093025,00.html
        #cite_ref-43
        https://web.archive.org/web/20141023014658/https://www.au.org/media/church-and-state/archives/2008/05/two-thumbs-up.html
        http://www.au.org/media/church-and-state/archives/2008/05/two-thumbs-up.html
        #cite_ref-44
        https://www.washingtonpost.com/wp-dyn/content/article/2008/03/25/AR2008032503852.html
        #cite_ref-45
        #cite_ref-46
        http://www.foxnews.com/story/0,2933,343589,00.html
        #cite_ref-financialpost_47-0
        https://web.archive.org/web/20140314085857/http://economiccrisis.us/2009/06/may-god-spare-mercy-victim-tells-madoff/
        http://economiccrisis.us/2009/06/may-god-spare-mercy-victim-tells-madoff/
        #cite_ref-48
        #cite_ref-49
        http://www.huffingtonpost.com/megan-smolenyak-smolenyak/6-degrees-of-separation-k_b_900707.html
        #cite_ref-50
        https://web.archive.org/web/20130405182304/http://www.drawtheline.org/watch-stuff/
        http://www.drawtheline.org/watch-stuff
        #cite_ref-51
        http://www.drawtheline.org/sign-now/
        #cite_ref-52
        http://www.sixdegrees.org/
        #cite_ref-53
        http://www.webmonkey.com/2012/09/easter-egg-google-connects-the-dots-for-bacon-number-search/
        #cite_ref-54
        https://www.telegraph.co.uk/science/science-news/4768389/And-the-winner-tonight-is.html
        #cite_ref-55
        http://baconbros.com/
        #cite_ref-56
        http://www.walkoffame.com/kevin-bacon
        #cite_ref-57
        http://www.imdb.com/event/ev0000209/2004/1/
        #cite_ref-58
        http://www.imdb.com/event/ev0000536/2005/1/
        #cite_ref-59
        http://www.imdb.com/event/ev0000589/2005/1/
        #cite_ref-60
        http://www.criticschoice.com/movie-awards/critics%E2%80%99-choice-movie-awards-winners-archive/
        #cite_ref-61
        http://www.imdb.com/event/ev0000600/2015/1/
        https://commons.wikimedia.org/wiki/Category:Kevin_Bacon
        https://www.imdb.com/name/nm0000102/
        /wiki/IMDb
        https://www.ibdb.com/broadway-cast-staff/90569
        /wiki/Internet_Broadway_Database
        https://www.wikidata.org/wiki/Q3454165#P1220
        http://www.lortel.org/Archives/CreditableEntity/5597
        /wiki/Lortel_Archives
        https://www.allmovie.com/artist/p3164
        /wiki/AllMovie
        http://oracleofbacon.org
        /wiki/Template:Kevin_Bacon
        /w/index.php?title=Template_talk:Kevin_Bacon&action=edit&redlink=1
        //en.wikipedia.org/w/index.php?title=Template:Kevin_Bacon&action=edit
        /wiki/Losing_Chase
        /wiki/Loverboy_(2005_film)
        /wiki/The_Closer
        /wiki/Wild_Things_(film)
        /wiki/The_Woodsman_(2004_film)
        /wiki/Loverboy_(2005_film)
        /wiki/The_Following
        /wiki/Cop_Car_(film)
        /wiki/I_Love_Dick_(TV_series)
        /wiki/Kevin_Bacon_filmography
        /wiki/Template:Critics%27_Choice_Movie_Award_for_Best_Actor
        /wiki/Template_talk:Critics%27_Choice_Movie_Award_for_Best_Actor
        //en.wikipedia.org/w/index.php?title=Template:Critics%27_Choice_Movie_Award_for_Best_Actor&action=edit
        /wiki/Critics%27_Choice_Movie_Award_for_Best_Actor
        /wiki/Geoffrey_Rush
        /wiki/Jack_Nicholson
        /wiki/Ian_McKellen
        /wiki/Russell_Crowe
        /wiki/Russell_Crowe
        /wiki/Russell_Crowe
        /wiki/Daniel_Day-Lewis
        /wiki/Jack_Nicholson
        /wiki/Sean_Penn
        /wiki/Jamie_Foxx
        /wiki/Philip_Seymour_Hoffman
        /wiki/Forest_Whitaker
        /wiki/Daniel_Day-Lewis
        /wiki/Sean_Penn
        /wiki/Jeff_Bridges
        /wiki/Colin_Firth
        /wiki/George_Clooney
        /wiki/Daniel_Day-Lewis
        /wiki/Matthew_McConaughey
        /wiki/Michael_Keaton
        /wiki/Leonardo_DiCaprio
        /wiki/Casey_Affleck
        /wiki/Gary_Oldman
        /wiki/Template:GoldenGlobeBestActorTVMiniseriesFilm
        /wiki/Template_talk:GoldenGlobeBestActorTVMiniseriesFilm
        //en.wikipedia.org/w/index.php?title=Template:GoldenGlobeBestActorTVMiniseriesFilm&action=edit
        /wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
        /wiki/Mickey_Rooney
        /wiki/Anthony_Andrews
        /wiki/Richard_Chamberlain
        /wiki/Ted_Danson
        /wiki/Dustin_Hoffman
        /wiki/James_Woods
        /wiki/Randy_Quaid
        /wiki/Michael_Caine
        /wiki/Stacy_Keach
        /wiki/Robert_Duvall
        /wiki/James_Garner
        /wiki/Beau_Bridges
        /wiki/Robert_Duvall
        /wiki/James_Garner
        /wiki/Ra%C3%BAl_Juli%C3%A1
        /wiki/Gary_Sinise
        /wiki/Alan_Rickman
        /wiki/Ving_Rhames
        /wiki/Stanley_Tucci
        /wiki/Jack_Lemmon
        /wiki/Brian_Dennehy
        /wiki/James_Franco
        /wiki/Albert_Finney
        /wiki/Al_Pacino
        /wiki/Geoffrey_Rush
        /wiki/Jonathan_Rhys_Meyers
        /wiki/Bill_Nighy
        /wiki/Jim_Broadbent
        /wiki/Paul_Giamatti
        /wiki/Al_Pacino
        /wiki/Idris_Elba
        /wiki/Kevin_Costner
        /wiki/Michael_Douglas
        /wiki/Billy_Bob_Thornton
        /wiki/Oscar_Isaac
        /wiki/Tom_Hiddleston
        /wiki/Ewan_McGregor
        /wiki/Template:Saturn_Award_for_Best_Actor_on_Television
        /wiki/Template_talk:Saturn_Award_for_Best_Actor_on_Television
        //en.wikipedia.org/w/index.php?title=Template:Saturn_Award_for_Best_Actor_on_Television&action=edit
        /wiki/Saturn_Award_for_Best_Actor_on_Television
        /wiki/Kyle_Chandler
        /wiki/Steven_Weber_(actor)
        /wiki/Richard_Dean_Anderson
        /wiki/David_Boreanaz
        /wiki/Robert_Patrick
        /wiki/Ben_Browder
        /wiki/David_Boreanaz
        /wiki/David_Boreanaz
        /wiki/Ben_Browder
        /wiki/Matthew_Fox
        /wiki/Michael_C._Hall
        /wiki/Matthew_Fox
        /wiki/Edward_James_Olmos
        /wiki/Josh_Holloway
        /wiki/Stephen_Moyer
        /wiki/Bryan_Cranston
        /wiki/Bryan_Cranston
        /wiki/Mads_Mikkelsen
        /wiki/Hugh_Dancy
        /wiki/Andrew_Lincoln
        /wiki/Bruce_Campbell
        /wiki/Andrew_Lincoln
        /wiki/Kyle_MacLachlan
        /wiki/Template:ScreenActorsGuildAward_MaleTVMiniseriesMovie
        /wiki/Template_talk:ScreenActorsGuildAward_MaleTVMiniseriesMovie
        //en.wikipedia.org/w/index.php?title=Template:ScreenActorsGuildAward_MaleTVMiniseriesMovie&action=edit
        /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
        /wiki/Ra%C3%BAl_Juli%C3%A1
        /wiki/Gary_Sinise
        /wiki/Alan_Rickman
        /wiki/Gary_Sinise
        /wiki/Christopher_Reeve
        /wiki/Jack_Lemmon
        /wiki/Brian_Dennehy
        /wiki/Ben_Kingsley
        /wiki/William_H._Macy
        /wiki/Al_Pacino
        /wiki/Geoffrey_Rush
        /wiki/Paul_Newman
        /wiki/Jeremy_Irons
        /wiki/Kevin_Kline
        /wiki/Paul_Giamatti
        /wiki/Al_Pacino
        /wiki/Paul_Giamatti
        /wiki/Kevin_Costner
        /wiki/Michael_Douglas
        /wiki/Mark_Ruffalo
        /wiki/Idris_Elba
        /wiki/Bryan_Cranston
        /wiki/Alexander_Skarsg%C3%A5rd
        /wiki/Help:Authority_control
        https://www.wikidata.org/wiki/Q3454165
        /wiki/WorldCat
        https://www.worldcat.org/identities/containsVIAFID/39570812
        /wiki/Biblioteca_Nacional_de_Espa%C3%B1a
        http://catalogo.bne.es/uhtbin/authoritybrowse.cgi?action=display&authority_id=XX1298810
        /wiki/Biblioth%C3%A8que_nationale_de_France
        https://catalogue.bnf.fr/ark:/12148/cb139817766
        http://data.bnf.fr/ark:/12148/cb139817766
        /wiki/Integrated_Authority_File
        https://d-nb.info/gnd/124109659
        /wiki/International_Standard_Name_Identifier
        http://isni.org/isni/0000000121291300
        /wiki/Library_of_Congress_Control_Number
        https://id.loc.gov/authorities/names/n88034930
        /wiki/MusicBrainz
        https://musicbrainz.org/artist/cc0dbdfc-9b2c-4e31-8448-808412388406
        /wiki/National_Library_of_the_Czech_Republic
        https://aleph.nkp.cz/F/?func=find-c&local_base=aut&ccl_term=ica=xx0025279&CON_LNG=ENG
        /wiki/SNAC
        http://socialarchive.iath.virginia.edu/ark:/99166/w6w67gw2
        /wiki/Syst%C3%A8me_universitaire_de_documentation
        https://www.idref.fr/084292652
        /wiki/Virtual_International_Authority_File
        https://viaf.org/viaf/39570812
        https://en.wikipedia.org/w/index.php?title=Kevin_Bacon&oldid=867364577
        /wiki/Help:Category
        /wiki/Category:1958_births
        /wiki/Category:20th-century_American_male_actors
        /wiki/Category:21st-century_American_male_actors
        /wiki/Category:American_atheists
        /wiki/Category:American_male_film_actors
        /wiki/Category:American_male_soap_opera_actors
        /wiki/Category:American_male_television_actors
        /wiki/Category:American_male_voice_actors
        /wiki/Category:The_Bacon_Brothers_members
        /wiki/Category:Best_Miniseries_or_Television_Movie_Actor_Golden_Globe_winners
        /wiki/Category:Circle_in_the_Square_Theatre_School_alumni
        /wiki/Category:Living_people
        /wiki/Category:Male_actors_from_Philadelphia
        /wiki/Category:Obie_Award_recipients
        /wiki/Category:Outstanding_Performance_by_a_Cast_in_a_Motion_Picture_Screen_Actors_Guild_Award_winners
        /wiki/Category:Sedgwick_family
        /wiki/Category:Wikipedia_indefinitely_semi-protected_biographies_of_living_people
        /wiki/Category:Use_mdy_dates_from_October_2016
        /wiki/Category:Articles_with_hCards
        /wiki/Category:All_articles_with_unsourced_statements
        /wiki/Category:Articles_with_unsourced_statements_from_January_2016
        /wiki/Category:BLP_articles_lacking_sources_from_October_2017
        /wiki/Category:All_BLP_articles_lacking_sources
        /wiki/Category:Articles_with_IBDb_links
        /wiki/Category:Wikipedia_articles_with_BNE_identifiers
        /wiki/Category:Wikipedia_articles_with_BNF_identifiers
        /wiki/Category:Wikipedia_articles_with_GND_identifiers
        /wiki/Category:Wikipedia_articles_with_ISNI_identifiers
        /wiki/Category:Wikipedia_articles_with_LCCN_identifiers
        /wiki/Category:Wikipedia_articles_with_MusicBrainz_identifiers
        /wiki/Category:Wikipedia_articles_with_NKC_identifiers
        /wiki/Category:Wikipedia_articles_with_SNAC-ID_identifiers
        /wiki/Category:Wikipedia_articles_with_SUDOC_identifiers
        /wiki/Category:Wikipedia_articles_with_VIAF_identifiers
        /wiki/Special:MyTalk
        /wiki/Special:MyContributions
        /w/index.php?title=Special:CreateAccount&returnto=Kevin+Bacon
        /w/index.php?title=Special:UserLogin&returnto=Kevin+Bacon
        /wiki/Kevin_Bacon
        /wiki/Talk:Kevin_Bacon
        /wiki/Kevin_Bacon
        /w/index.php?title=Kevin_Bacon&action=edit
        /w/index.php?title=Kevin_Bacon&action=history
        /wiki/Main_Page
        /wiki/Main_Page
        /wiki/Portal:Contents
        /wiki/Portal:Featured_content
        /wiki/Portal:Current_events
        /wiki/Special:Random
        https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
        //shop.wikimedia.org
        /wiki/Help:Contents
        /wiki/Wikipedia:About
        /wiki/Wikipedia:Community_portal
        /wiki/Special:RecentChanges
        //en.wikipedia.org/wiki/Wikipedia:Contact_us
        /wiki/Special:WhatLinksHere/Kevin_Bacon
        /wiki/Special:RecentChangesLinked/Kevin_Bacon
        /wiki/Wikipedia:File_Upload_Wizard
        /wiki/Special:SpecialPages
        /w/index.php?title=Kevin_Bacon&oldid=867364577
        /w/index.php?title=Kevin_Bacon&action=info
        https://www.wikidata.org/wiki/Special:EntityPage/Q3454165
        /w/index.php?title=Special:CiteThisPage&page=Kevin_Bacon&id=867364577
        /w/index.php?title=Special:Book&bookcmd=book_creator&referer=Kevin+Bacon
        /w/index.php?title=Special:ElectronPdf&page=Kevin+Bacon&action=show-download-screen
        /w/index.php?title=Kevin_Bacon&printable=yes
        https://commons.wikimedia.org/wiki/Category:Kevin_Bacon
        https://af.wikipedia.org/wiki/Kevin_Bacon
        https://ar.wikipedia.org/wiki/%D9%83%D9%8A%D9%81%D9%8A%D9%86_%D8%A8%D9%8A%D9%83%D9%86
        https://an.wikipedia.org/wiki/Kevin_Bacon
        https://ast.wikipedia.org/wiki/Kevin_Bacon
        https://azb.wikipedia.org/wiki/%DA%A9%D9%88%DB%8C%D9%86_%D8%A8%DB%8C%DA%A9%D9%86
        https://zh-min-nan.wikipedia.org/wiki/Kevin_Bacon
        https://bi.wikipedia.org/wiki/Kevin_Bacon
        https://bg.wikipedia.org/wiki/%D0%9A%D0%B5%D0%B2%D0%B8%D0%BD_%D0%91%D0%B5%D0%B9%D0%BA%D1%8A%D0%BD
        https://bs.wikipedia.org/wiki/Kevin_Bacon
        https://ca.wikipedia.org/wiki/Kevin_Bacon
        https://cs.wikipedia.org/wiki/Kevin_Bacon
        https://da.wikipedia.org/wiki/Kevin_Bacon
        https://de.wikipedia.org/wiki/Kevin_Bacon
        https://el.wikipedia.org/wiki/%CE%9A%CE%AD%CE%B2%CE%B9%CE%BD_%CE%9C%CF%80%CE%AD%CE%B9%CE%BA%CE%BF%CE%BD
        https://eml.wikipedia.org/wiki/Kevin_Bacon
        https://es.wikipedia.org/wiki/Kevin_Bacon
        https://eu.wikipedia.org/wiki/Kevin_Bacon
        https://fa.wikipedia.org/wiki/%DA%A9%D9%88%DB%8C%D9%86_%D8%A8%DB%8C%DA%A9%D9%86
        https://fr.wikipedia.org/wiki/Kevin_Bacon
        https://ga.wikipedia.org/wiki/Kevin_Bacon
        https://gl.wikipedia.org/wiki/Kevin_Bacon
        https://ko.wikipedia.org/wiki/%EC%BC%80%EB%B9%88_%EB%B2%A0%EC%9D%B4%EC%BB%A8
        https://hy.wikipedia.org/wiki/%D5%94%D6%87%D5%AB%D5%B6_%D4%B2%D5%A5%D5%B5%D6%84%D5%B8%D5%B6
        https://hr.wikipedia.org/wiki/Kevin_Bacon
        https://io.wikipedia.org/wiki/Kevin_Bacon
        https://id.wikipedia.org/wiki/Kevin_Bacon
        https://it.wikipedia.org/wiki/Kevin_Bacon
        https://he.wikipedia.org/wiki/%D7%A7%D7%95%D7%95%D7%99%D7%9F_%D7%91%D7%99%D7%99%D7%A7%D7%95%D7%9F
        https://ka.wikipedia.org/wiki/%E1%83%99%E1%83%94%E1%83%95%E1%83%98%E1%83%9C_%E1%83%91%E1%83%94%E1%83%98%E1%83%99%E1%83%9D%E1%83%9C%E1%83%98
        https://kk.wikipedia.org/wiki/%D0%9A%D0%B5%D0%B2%D0%B8%D0%BD_%D0%91%D1%8D%D0%B9%D0%BA%D0%BE%D0%BD
        https://lv.wikipedia.org/wiki/Kevins_B%C4%93kons
        https://hu.wikipedia.org/wiki/Kevin_Bacon
        https://xmf.wikipedia.org/wiki/%E1%83%99%E1%83%94%E1%83%95%E1%83%98%E1%83%9C_%E1%83%91%E1%83%94%E1%83%98%E1%83%99%E1%83%9D%E1%83%9C%E1%83%98
        https://mn.wikipedia.org/wiki/%D0%9A%D0%B5%D0%B2%D0%B8%D0%BD_%D0%91%D1%8D%D0%B9%D0%BA%D0%BE%D0%BD
        https://nl.wikipedia.org/wiki/Kevin_Bacon
        https://ja.wikipedia.org/wiki/%E3%82%B1%E3%83%B4%E3%82%A3%E3%83%B3%E3%83%BB%E3%83%99%E3%83%BC%E3%82%B3%E3%83%B3
        https://no.wikipedia.org/wiki/Kevin_Bacon
        https://oc.wikipedia.org/wiki/Kevin_Bacon
        https://pl.wikipedia.org/wiki/Kevin_Bacon
        https://pt.wikipedia.org/wiki/Kevin_Bacon
        https://ro.wikipedia.org/wiki/Kevin_Bacon
        https://ru.wikipedia.org/wiki/%D0%91%D0%B5%D0%B9%D0%BA%D0%BE%D0%BD,_%D0%9A%D0%B5%D0%B2%D0%B8%D0%BD
        https://sco.wikipedia.org/wiki/Kevin_Bacon
        https://simple.wikipedia.org/wiki/Kevin_Bacon
        https://sk.wikipedia.org/wiki/Kevin_Bacon
        https://ckb.wikipedia.org/wiki/%DA%A9%DB%8E%DA%A4%D9%86_%D8%A8%DB%95%DB%8C%DA%A9%D9%86
        https://sr.wikipedia.org/wiki/%D0%9A%D0%B5%D0%B2%D0%B8%D0%BD_%D0%91%D0%B5%D1%98%D0%BA%D0%BE%D0%BD
        https://sh.wikipedia.org/wiki/Kevin_Bacon
        https://fi.wikipedia.org/wiki/Kevin_Bacon
        https://sv.wikipedia.org/wiki/Kevin_Bacon
        https://th.wikipedia.org/wiki/%E0%B9%80%E0%B8%84%E0%B8%A7%E0%B8%B4%E0%B8%99_%E0%B9%80%E0%B8%9A%E0%B8%84%E0%B8%AD%E0%B8%99
        https://tr.wikipedia.org/wiki/Kevin_Bacon
        https://uk.wikipedia.org/wiki/%D0%9A%D0%B5%D0%B2%D1%96%D0%BD_%D0%91%D0%B5%D0%B9%D0%BA%D0%BE%D0%BD
        https://vi.wikipedia.org/wiki/Kevin_Bacon
        https://zh.wikipedia.org/wiki/%E5%87%AF%E6%96%87%C2%B7%E8%B4%9D%E8%82%AF
        https://www.wikidata.org/wiki/Special:EntityPage/Q3454165#sitelinks-wikipedia
        //en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
        //creativecommons.org/licenses/by-sa/3.0/
        //foundation.wikimedia.org/wiki/Terms_of_Use
        //foundation.wikimedia.org/wiki/Privacy_policy
        //www.wikimediafoundation.org/
        https://foundation.wikimedia.org/wiki/Privacy_policy
        /wiki/Wikipedia:About
        /wiki/Wikipedia:General_disclaimer
        //en.wikipedia.org/wiki/Wikipedia:Contact_us
        https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute
        https://foundation.wikimedia.org/wiki/Cookie_statement
        //en.m.wikipedia.org/w/index.php?title=Kevin_Bacon&mobileaction=toggle_view_mobile
        https://wikimediafoundation.org/
        //www.mediawiki.org/
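
    The raw hrefs printed above come in several shapes: in-page fragments (#...), site-relative paths (/wiki/..., /w/index.php?...), protocol-relative URLs (//shop.wikimedia.org), and absolute URLs. As a rough sketch, not part of the original script, the standard library's urllib.parse can resolve each one against the page URL and tell same-host links from external ones. The helper name classify and the sample list are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://en.wikipedia.org/wiki/Kevin_Bacon"


def classify(href, page_url=PAGE_URL):
    """Resolve href against page_url and label it (hypothetical helper)."""
    absolute = urljoin(page_url, href)
    if href.startswith("#"):
        # Pure fragment: points back into the same page
        return "fragment", absolute
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return ("internal" if same_host else "external"), absolute


# A few hrefs taken from the output above
samples = [
    "#mw-head",
    "/wiki/Main_Page",
    "//shop.wikimedia.org",
    "/w/index.php?title=Kevin_Bacon&action=edit",
    "https://de.wikipedia.org/wiki/Kevin_Bacon",
]
for href in samples:
    kind, absolute = classify(href)
    print(f"{kind:9} {absolute}")
```

    Resolving links this way before filtering makes it easier to see which entries are page chrome and which lead to other articles.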
    

    Many of these links are of no use to us: the list is full of sidebar, header, and footer links. Let's look at the original page.


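    Figure 1.png

    The links that point to article pages share three traits:

    • They all sit inside the div whose id is bodyContent
    • Their URLs contain no colon
    • Their URLs all start with /wiki/

    We can refine the earlier link-listing code accordingly: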

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    
    html = urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon")
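    # No parser is named here, which is why the UserWarning below appears
    bsObj = BeautifulSoup(html)
    # Restrict the search to the article body
    body = bsObj.find("div", {'id': 'bodyContent'})
    # Keep only hrefs that start with /wiki/ and contain no colon
    article_links = body.findAll("a", href=re.compile(r"^(/wiki/)((?!:).)*$"))
    
    for link in article_links:
        if 'href' in link.attrs:
            print(link.attrs["href"])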
        
    
    D:\anconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
    
    The code that caused this warning is on line 193 of the file D:\anconda3\lib\runpy.py. To get rid of this warning, change code that looks like this:
    
     BeautifulSoup(YOUR_MARKUP})
    
    to this:
    
     BeautifulSoup(YOUR_MARKUP, "lxml")
    
      markup_type=markup_type))
    
    
    /wiki/Kevin_Bacon_(disambiguation)
    /wiki/Philadelphia
    /wiki/Pennsylvania
    /wiki/Kyra_Sedgwick
    /wiki/Sosie_Bacon
    /wiki/Edmund_Bacon_(architect)
    /wiki/Michael_Bacon_(musician)
    /wiki/Footloose_(1984_film)
    /wiki/JFK_(film)
    /wiki/A_Few_Good_Men
    /wiki/Apollo_13_(film)
    /wiki/Mystic_River_(film)
    /wiki/Sleepers
    /wiki/The_Woodsman_(2004_film)
    /wiki/Fox_Broadcasting_Company
    /wiki/The_Following
    /wiki/HBO
    /wiki/Taking_Chance
    /wiki/Golden_Globe_Award
    /wiki/Screen_Actors_Guild_Award
    /wiki/Primetime_Emmy_Award
    /wiki/The_Guardian
    /wiki/Academy_Award
    /wiki/Hollywood_Walk_of_Fame
    /wiki/Social_networks
    /wiki/Six_Degrees_of_Kevin_Bacon
    /wiki/SixDegrees.org
    /wiki/Philadelphia
    /wiki/Edmund_Bacon_(architect)
    /wiki/Pennsylvania_Governor%27s_School_for_the_Arts
    /wiki/Bucknell_University
    /wiki/Glory_Van_Scott
    /wiki/Circle_in_the_Square
    /wiki/Nancy_Mills
    /wiki/Cosmopolitan_(magazine)
    /wiki/Fraternities_and_sororities
    /wiki/Animal_House
    /wiki/Search_for_Tomorrow
    /wiki/Guiding_Light
    /wiki/Friday_the_13th_(1980_film)
    /wiki/Phoenix_Theater
    /wiki/Flux
    /wiki/Second_Stage_Theatre
    /wiki/Obie_Award
    /wiki/Forty_Deuce
    /wiki/Slab_Boys
    /wiki/Sean_Penn
    /wiki/Val_Kilmer
    /wiki/Barry_Levinson
    /wiki/Diner_(film)
    /wiki/Steve_Guttenberg
    /wiki/Daniel_Stern_(actor)
    /wiki/Mickey_Rourke
    /wiki/Tim_Daly
    /wiki/Ellen_Barkin
    /wiki/Footloose_(1984_film)
    /wiki/James_Dean
    /wiki/Rebel_Without_a_Cause
    /wiki/Mickey_Rooney
    /wiki/Judy_Garland
    /wiki/People_(American_magazine)
    /wiki/Typecasting_(acting)
    /wiki/John_Hughes_(filmmaker)
    /wiki/She%27s_Having_a_Baby
    /wiki/The_Big_Picture_(1989_film)
    /wiki/Tremors_(film)
    /wiki/Joel_Schumacher
    /wiki/Flatliners
    /wiki/Elizabeth_Perkins
    /wiki/He_Said,_She_Said_(film)
    /wiki/The_New_York_Times
    /wiki/Oliver_Stone
    /wiki/JFK_(film)
    /wiki/A_Few_Good_Men_(film)
    /wiki/Michael_Greif
    /wiki/Golden_Globe_Award
    /wiki/The_River_Wild
    /wiki/Meryl_Streep
    /wiki/Murder_in_the_First_(film)
    /wiki/Blockbuster_(entertainment)
    /wiki/Apollo_13_(film)
    /wiki/Sleepers_(film)
    /wiki/Picture_Perfect_(1997_film)
    /wiki/Losing_Chase
    /wiki/Digging_to_China
    /wiki/Payola
    /wiki/Telling_Lies_in_America_(film)
    /wiki/Wild_Things_(film)
    /wiki/Stir_of_Echoes
    /wiki/David_Koepp
    /wiki/Taking_Chance
    /wiki/Paul_Verhoeven
    /wiki/Hollow_Man
    /wiki/Colin_Firth
    /wiki/Rachel_Blanchard
    /wiki/M%C3%A9nage_%C3%A0_trois
    /wiki/Where_the_Truth_Lies
    /wiki/Atom_Egoyan
    /wiki/MPAA
    /wiki/MPAA_film_rating_system
    /wiki/Sean_Penn
    /wiki/Tim_Robbins
    /wiki/Clint_Eastwood
    /wiki/Mystic_River_(film)
    /wiki/Pedophile
    /wiki/The_Woodsman_(2004_film)
    /wiki/HBO_Films
    /wiki/Taking_Chance
    /wiki/Michael_Strobl
    /wiki/Desert_Storm
    /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
    /wiki/Matthew_Vaughn
    /wiki/Sebastian_Shaw_(comics)
    /wiki/Dustin_Lance_Black
    /wiki/8_(play)
    /wiki/Perry_v._Brown
    /wiki/Proposition_8
    /wiki/Charles_J._Cooper
    /wiki/Wilshire_Ebell_Theatre
    /wiki/American_Foundation_for_Equal_Rights
    /wiki/The_Following
    /wiki/Saturn_Award_for_Best_Actor_on_Television
    /wiki/Huffington_Post
    /wiki/Tremors_(film)
    /wiki/EE_(telecommunications_company)
    /wiki/United_Kingdom
    /wiki/Egg_as_food
    /wiki/Kyra_Sedgwick
    /wiki/PBS
    /wiki/Lanford_Wilson
    /wiki/Lemon_Sky
    /wiki/Pyrates
    /wiki/Murder_in_the_First_(film)
    /wiki/The_Woodsman_(2004_film)
    /wiki/Loverboy_(2005_film)
    /wiki/Sosie_Bacon
    /wiki/Upper_West_Side
    /wiki/Manhattan
    /wiki/Tracy_Pollan
    /wiki/The_Times
    /wiki/Will.i.am
    /wiki/It%27s_a_New_Day_(Will.i.am_song)
    /wiki/Barack_Obama
    /wiki/Ponzi_scheme
    /wiki/Bernard_Madoff
    /wiki/Finding_Your_Roots
    /wiki/Henry_Louis_Gates
    /wiki/Six_Degrees_of_Kevin_Bacon
    /wiki/Trivia
    /wiki/Big_screen
    /wiki/Six_degrees_of_separation
    /wiki/Internet_meme
    /wiki/SixDegrees.org
    /wiki/Bacon_number
    /wiki/Internet_Movie_Database
    /wiki/Paul_Erd%C5%91s
    /wiki/Erd%C5%91s_number
    /wiki/Paul_Erd%C5%91s
    /wiki/Bacon_number
    /wiki/Erd%C5%91s_number
    /wiki/Erd%C5%91s%E2%80%93Bacon_number
    /wiki/The_Bacon_Brothers
    /wiki/Michael_Bacon_(musician)
    /wiki/Music_album
    /wiki/Hollywood_Walk_of_Fame
    /wiki/Hollywood_Walk_of_Fame
    /wiki/Denver_Film_Festival
    /wiki/Phoenix_Film_Festival
    /wiki/Santa_Barbara_International_Film_Festival
    /wiki/Broadcast_Film_Critics_Association
    /wiki/Seattle_International_Film_Festival
    /wiki/Apollo_13_(film)
    /wiki/Mystic_River_(film)
    /wiki/Blockbuster_Entertainment_Awards
    /wiki/Blockbuster_Entertainment_Awards
    /wiki/Hollow_Man
    /wiki/Boston_Society_of_Film_Critics
    /wiki/Boston_Society_of_Film_Critics_Award_for_Best_Cast
    /wiki/Mystic_River_(film)
    /wiki/Bravo_Otto
    /wiki/Bravo_Otto
    /wiki/Footloose_(1984_film)
    /wiki/CableACE_Award
    /wiki/CableACE_Award
    /wiki/Losing_Chase
    /wiki/The_Woodsman_(2004_film)
    /wiki/Critics%27_Choice_Movie_Awards
    /wiki/Critics%27_Choice_Movie_Award_for_Best_Actor
    /wiki/Murder_in_the_First_(film)
    /wiki/Ghent_International_Film_Festival
    /wiki/Ghent_International_Film_Festival
    /wiki/The_Woodsman_(2004_film)
    /wiki/Giffoni_Film_Festival
    /wiki/Giffoni_Film_Festival
    /wiki/Digging_to_China
    /wiki/Gold_Derby_Awards
    /wiki/Gold_Derby_Awards
    /wiki/Mystic_River_(film)
    /wiki/Golden_Globe_Award
    /wiki/Golden_Globe_Award_for_Best_Supporting_Actor_%E2%80%93_Motion_Picture
    /wiki/The_River_Wild
    /wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
    /wiki/Taking_Chance
    /wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Television_Series_Musical_or_Comedy
    /wiki/I_Love_Dick_(TV_series)
    /wiki/Independent_Spirit_Awards
    /wiki/Independent_Spirit_Award_for_Best_Male_Lead
    /wiki/The_Woodsman_(2004_film)
    /wiki/Mystic_River_(film)
    /wiki/MTV_Movie_%26_TV_Awards
    /wiki/MTV_Movie_Award_for_Best_Villain
    /wiki/Hollow_Man
    /wiki/Taking_Chance
    /wiki/The_Following
    /wiki/E!_People%27s_Choice_Awards
    /wiki/E!_People%27s_Choice_Awards
    /wiki/The_Following
    /wiki/E!_People%27s_Choice_Awards
    /wiki/The_Following
    /wiki/Primetime_Emmy_Award
    /wiki/Primetime_Emmy_Award_for_Outstanding_Lead_Actor_in_a_Limited_Series_or_Movie
    /wiki/Taking_Chance
    /wiki/Satellite_Awards
    /wiki/Satellite_Award_for_Best_Actor_%E2%80%93_Motion_Picture
    /wiki/The_Woodsman_(2004_film)
    /wiki/Satellite_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
    /wiki/Taking_Chance
    /wiki/Saturn_Award
    /wiki/Saturn_Award_for_Best_Actor_on_Television
    /wiki/The_Following
    /wiki/Saturn_Award_for_Best_Actor_on_Television
    /wiki/The_Following
    /wiki/Scream_Awards
    /wiki/Scream_Awards
    /wiki/Screen_Actors_Guild_Award
    /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Supporting_Role
    /wiki/Murder_in_the_First_(film)
    /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
    /wiki/Apollo_13_(film)
    /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
    /wiki/Mystic_River_(film)
    /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Cast_in_a_Motion_Picture
    /wiki/Frost/Nixon_(film)
    /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
    /wiki/Taking_Chance
    /wiki/Teen_Choice_Awards
    /wiki/Teen_Choice_Award_for_Choice_Movie_Villain
    /wiki/Beauty_Shop
    /wiki/Teen_Choice_Award_for_Choice_Movie_Villain
    /wiki/TV_Guide_Award
    /wiki/TV_Guide_Award
    /wiki/The_Following
    /wiki/Kevin_Bacon_filmography
    /wiki/List_of_actors_with_Hollywood_Walk_of_Fame_motion_picture_stars
    /wiki/The_Austin_Chronicle
    /wiki/Access_Hollywood
    /wiki/IMDb
    /wiki/Internet_Broadway_Database
    /wiki/Lortel_Archives
    /wiki/AllMovie
    /wiki/Losing_Chase
    /wiki/Loverboy_(2005_film)
    /wiki/The_Closer
    /wiki/Wild_Things_(film)
    /wiki/The_Woodsman_(2004_film)
    /wiki/Loverboy_(2005_film)
    /wiki/The_Following
    /wiki/Cop_Car_(film)
    /wiki/I_Love_Dick_(TV_series)
    /wiki/Kevin_Bacon_filmography
    /wiki/Critics%27_Choice_Movie_Award_for_Best_Actor
    /wiki/Geoffrey_Rush
    /wiki/Jack_Nicholson
    /wiki/Ian_McKellen
    /wiki/Russell_Crowe
    /wiki/Russell_Crowe
    /wiki/Russell_Crowe
    /wiki/Daniel_Day-Lewis
    /wiki/Jack_Nicholson
    /wiki/Sean_Penn
    /wiki/Jamie_Foxx
    /wiki/Philip_Seymour_Hoffman
    /wiki/Forest_Whitaker
    /wiki/Daniel_Day-Lewis
    /wiki/Sean_Penn
    /wiki/Jeff_Bridges
    /wiki/Colin_Firth
    /wiki/George_Clooney
    /wiki/Daniel_Day-Lewis
    /wiki/Matthew_McConaughey
    /wiki/Michael_Keaton
    /wiki/Leonardo_DiCaprio
    /wiki/Casey_Affleck
    /wiki/Gary_Oldman
    /wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
    /wiki/Mickey_Rooney
    /wiki/Anthony_Andrews
    /wiki/Richard_Chamberlain
    /wiki/Ted_Danson
    /wiki/Dustin_Hoffman
    /wiki/James_Woods
    /wiki/Randy_Quaid
    /wiki/Michael_Caine
    /wiki/Stacy_Keach
    /wiki/Robert_Duvall
    /wiki/James_Garner
    /wiki/Beau_Bridges
    /wiki/Robert_Duvall
    /wiki/James_Garner
    /wiki/Ra%C3%BAl_Juli%C3%A1
    /wiki/Gary_Sinise
    /wiki/Alan_Rickman
    /wiki/Ving_Rhames
    /wiki/Stanley_Tucci
    /wiki/Jack_Lemmon
    /wiki/Brian_Dennehy
    /wiki/James_Franco
    /wiki/Albert_Finney
    /wiki/Al_Pacino
    /wiki/Geoffrey_Rush
    /wiki/Jonathan_Rhys_Meyers
    /wiki/Bill_Nighy
    /wiki/Jim_Broadbent
    /wiki/Paul_Giamatti
    /wiki/Al_Pacino
    /wiki/Idris_Elba
    /wiki/Kevin_Costner
    /wiki/Michael_Douglas
    /wiki/Billy_Bob_Thornton
    /wiki/Oscar_Isaac
    /wiki/Tom_Hiddleston
    /wiki/Ewan_McGregor
    /wiki/Saturn_Award_for_Best_Actor_on_Television
    /wiki/Kyle_Chandler
    /wiki/Steven_Weber_(actor)
    /wiki/Richard_Dean_Anderson
    /wiki/David_Boreanaz
    /wiki/Robert_Patrick
    /wiki/Ben_Browder
    /wiki/David_Boreanaz
    /wiki/David_Boreanaz
    /wiki/Ben_Browder
    /wiki/Matthew_Fox
    /wiki/Michael_C._Hall
    /wiki/Matthew_Fox
    /wiki/Edward_James_Olmos
    /wiki/Josh_Holloway
    /wiki/Stephen_Moyer
    /wiki/Bryan_Cranston
    /wiki/Bryan_Cranston
    /wiki/Mads_Mikkelsen
    /wiki/Hugh_Dancy
    /wiki/Andrew_Lincoln
    /wiki/Bruce_Campbell
    /wiki/Andrew_Lincoln
    /wiki/Kyle_MacLachlan
    /wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
    /wiki/Ra%C3%BAl_Juli%C3%A1
    /wiki/Gary_Sinise
    /wiki/Alan_Rickman
    /wiki/Gary_Sinise
    /wiki/Christopher_Reeve
    /wiki/Jack_Lemmon
    /wiki/Brian_Dennehy
    /wiki/Ben_Kingsley
    /wiki/William_H._Macy
    /wiki/Al_Pacino
    /wiki/Geoffrey_Rush
    /wiki/Paul_Newman
    /wiki/Jeremy_Irons
    /wiki/Kevin_Kline
    /wiki/Paul_Giamatti
    /wiki/Al_Pacino
    /wiki/Paul_Giamatti
    /wiki/Kevin_Costner
    /wiki/Michael_Douglas
    /wiki/Mark_Ruffalo
    /wiki/Idris_Elba
    /wiki/Bryan_Cranston
    /wiki/Alexander_Skarsg%C3%A5rd
    /wiki/WorldCat
    /wiki/Biblioteca_Nacional_de_Espa%C3%B1a
    /wiki/Biblioth%C3%A8que_nationale_de_France
    /wiki/Integrated_Authority_File
    /wiki/International_Standard_Name_Identifier
    /wiki/Library_of_Congress_Control_Number
    /wiki/MusicBrainz
    /wiki/National_Library_of_the_Czech_Republic
    /wiki/SNAC
    /wiki/Syst%C3%A8me_universitaire_de_documentation
    /wiki/Virtual_International_Authority_File
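
    The filter in the listing above is easiest to understand by trying the same regular expression on a handful of hrefs without touching the network. The sketch below is only an illustration: the sample list is hand-picked from the outputs shown earlier, and the order-preserving de-duplication at the end is an addition, not something the original script does.

```python
import re

# Same pattern as in the listing: starts with /wiki/, and no ':' anywhere after it
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Philadelphia",
    "/wiki/Special:Random",        # rejected: contains a colon
    "/wiki/Kyra_Sedgwick",
    "/w/index.php?title=Kevin_Bacon&action=edit",  # rejected: not under /wiki/
    "#cite_note-1",                # rejected: in-page fragment
    "/wiki/Philadelphia",          # duplicate, as in the real output
]

kept = [href for href in samples if ARTICLE_LINK.search(href)]
# dict.fromkeys keeps the first occurrence of each href and preserves order
unique = list(dict.fromkeys(kept))
print(kept)
print(unique)
```

    The negative lookahead (?!:) is checked before every character after /wiki/, which is what excludes namespaced pages such as Special: and Wikipedia:.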
    

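    Let's raise the bar and write a crawler that follows links instead of listing the links of one fixed page. The requirements are:

    • A function get_links that appends an article path to the site's base URL and returns a list of all the article links on that page
    • A main routine that calls get_links with a starting article, picks a random article link from the returned list, and calls get_links again, repeating until we stop the program or the new page has no article links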

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    import random
    
    # Seed from the operating system; passing a datetime object to
    # random.seed() raises TypeError on Python 3.11 and later
    random.seed()
    
    def get_links(articleUrl):
        """Return the article-link tags found on a Wikipedia article page."""
        html = urlopen("https://en.wikipedia.org" + articleUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        body = bsObj.find("div", {'id': 'bodyContent'})
        return body.findAll("a", href=re.compile(r"^(/wiki/)((?!:).)*$"))
    
    links = get_links('/wiki/Kevin_Bacon')
    while len(links) > 0:
        # Pick one of the article links at random and follow it
        newArticle = random.choice(links).attrs['href']
        print(newArticle)
        links = get_links(newArticle)
    

    This is just a simple way to hop at random from one article link to another.
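
    The loop above runs until a page has no article links, which on Wikipedia may take a very long time. One possible variation, sketched here under assumptions, caps the number of hops and takes the link-fetching function as a parameter so the walk can be tried offline. The names random_walk, fake_get_links, and FAKE_LINKS are hypothetical, and the tiny link graph stands in for live pages.

```python
import random

# Hypothetical miniature link graph standing in for live Wikipedia pages
FAKE_LINKS = {
    "/wiki/Kevin_Bacon": ["/wiki/Footloose_(1984_film)", "/wiki/Kyra_Sedgwick"],
    "/wiki/Footloose_(1984_film)": ["/wiki/Kevin_Bacon"],
    "/wiki/Kyra_Sedgwick": [],  # dead end: no article links
}


def fake_get_links(article_url):
    """Offline stand-in for get_links: returns hrefs instead of tags."""
    return FAKE_LINKS.get(article_url, [])


def random_walk(start, get_links, max_steps=10, rng=random):
    """Follow random links from start; stop at a dead end or after max_steps hops."""
    path = [start]
    links = get_links(start)
    while links and len(path) <= max_steps:
        next_article = rng.choice(links)
        path.append(next_article)
        links = get_links(next_article)
    return path


path = random_walk("/wiki/Kevin_Bacon", fake_get_links,
                   max_steps=5, rng=random.Random(0))
print(" -> ".join(path))
```

    Passing a seeded random.Random instance makes a walk reproducible, which helps when debugging a crawler.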

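    Crawling an entire site

    Crawling an entire site takes a lot of time and effort, and the volume of data is large enough that we need a database to store what we collect. So when is it worth crawling a whole site, and what is the data good for?

    • Generating a site map

    If you want to know every link on a site and arrange the linked pages according to the site's actual structure, a crawl lets you locate the site's content quickly.

    • Collecting data

    When you need content from a vertical domain, you can build a crawler that recursively traverses each site and collects only the data on those sites' pages.

    The usual approach to crawling a whole site is to start from the top-level page, collect all the links on it into a list, crawl each of the linked pages, and then repeat the process with the links found on each of those pages.

    To avoid crawling the same page twice, de-duplicating links is important. Putting the links in a set achieves this.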

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    
    pages = set()  # every href seen so far
    
    def get_links(pageUrl):
        global pages
        html = urlopen("https://en.wikipedia.org" + pageUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
            if 'href' in link.attrs:
                if link.attrs['href'] not in pages:
                    # We have found a new page
                    newPage = link.attrs['href']
                    print(newPage)
                    pages.add(newPage)
                    get_links(newPage)
    
    get_links("")
    

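    Note

    A recursive function fails once it recurses too deeply: Python's default recursion limit is 1000, after which a RecursionError is raised. When crawling, also make sure every URL is well-formed.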
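
    One common way around the recursion limit is to replace the recursive call with an explicit queue. The sketch below swaps the recursion for an iterative breadth-first traversal; it is an alternative technique, not the method used in the listing above. The names crawl and FAKE_SITE are hypothetical, and the link-fetching function is passed in as a parameter so the example runs without network access.

```python
from collections import deque

# Hypothetical miniature site: page -> links found on that page
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b", "/c"],
    "/b": ["/a"],
    "/c": [],
}


def crawl(start, get_links, max_pages=1000):
    """Visit pages breadth-first, each at most once, without recursion."""
    seen = {start}           # the set gives de-duplication
    queue = deque([start])   # pages discovered but not yet visited
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", lambda page: FAKE_SITE.get(page, []))
print(visited)
```

    Because the pending pages live in a deque instead of on the call stack, the depth of the site no longer matters; max_pages is an assumed safety cap that keeps a crawl from running indefinitely.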

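    Collecting data across an entire site

    A crawler that only hops from one page to the next is rather dull. To put it to real use we need to extract some data from the pages. Let's crawl 500px and grab the first photo from each section, such as today's popular, rising, and new. First, analyze the page: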

    01.png
    • Every title is in an <h1> tag, and each page has only one h1.
    02.png
    • The navigation bar is in <ul class='px_tabs'> --> <li> --> <a> tags
    03.png
    • The photos are in <div class='discovery_body_region'> --> <div> --> <a> --> <img>

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re
    
    pages = set()  # every href seen so far
    
    def get_links(pageUrl):
        global pages
        html = urlopen("https://www.500px.com" + pageUrl)
        bsObj = BeautifulSoup(html, "html.parser")
        tabs = bsObj.find("ul", {'class': 'px_tabs'})
        try:
            print(bsObj.h1.get_text())
            print(bsObj.find("div", {'class': 'discovery_body_region'}).find('a').findAll('img')[0].attrs['src'])
            print(tabs.find('li').find('a').attrs['href'])
        except (AttributeError, IndexError):
            # An expected element is missing from this page
            print("Nothing found")
        if tabs is None:
            return
        # findAll returns a list-like ResultSet, which has no find/findAll of
        # its own, so search the <ul> itself for its links
        for link in tabs.findAll("a", href=re.compile(r"^(/\w)")):
            if link.attrs['href'] not in pages:
                # We have found a new page
                newPage = link.attrs['href']
                print('-' * 20 + '\n' + newPage)
                pages.add(newPage)
                get_links(newPage)
    
    get_links("")
    
