Introduction to urllib (Python 2.7)


Author: dopami | Published: 2017-12-05 17:07

    https://docs.python.org/2/library/urllib.html

    20.5. urllib: Open arbitrary resources by URL

    Note

    The urllib module has been split into parts and renamed in Python 3 to urllib.request, urllib.parse, and urllib.error. The 2to3 tool will automatically adapt imports when converting your sources to Python 3. Also note that the urllib.request.urlopen() function in Python 3 is equivalent to urllib2.urlopen() and that urllib.urlopen() has been removed.

    This module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames. Some restrictions apply: it can only open URLs for reading, and no seek operations are available.

    See also

    The Requests package is recommended for a higher-level HTTP client interface.

    Changed in version 2.7.9: For HTTPS URIs, urllib performs all the necessary certificate and hostname checks by default.

    Warning

    For Python versions earlier than 2.7.9, urllib does not attempt to validate the server certificates of HTTPS URIs. Use at your own risk!

    20.5.1. High-level interface

    urllib.urlopen(url[, data[, proxies[, context]]])

    Open a network object denoted by a URL for reading. If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines); otherwise it opens a socket to a server somewhere on the network. If the connection cannot be made the IOError exception is raised. If all went well, a file-like object is returned. This supports the following methods: read(), readline(), readlines(), fileno(), close(), info(), getcode() and geturl(). It also has proper support for the iterator protocol. One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.

    Except for the info(), getcode() and geturl() methods, these methods have the same interface as for file objects; see section File Objects in this manual. (It is not a built-in file object, however, so it can’t be used at those few places where a true built-in file object is required.)

    The info() method returns an instance of the class mimetools.Message containing meta-information associated with the URL. When the method is HTTP, these headers are those returned by the server at the head of the retrieved HTML page (including Content-Length and Content-Type). When the method is FTP, a Content-Length header will be present if (as is now usual) the server passed back a file length in response to the FTP retrieval request. A Content-Type header will be present if the MIME type can be guessed. When the method is local-file, returned headers will include a Date representing the file’s last-modified time, a Content-Length giving file size, and a Content-Type containing a guess at the file’s type. See also the description of the mimetools module.

    The geturl() method returns the real URL of the page. In some cases, the HTTP server redirects a client to another URL. The urlopen() function handles this transparently, but in some cases the caller needs to know which URL the client was redirected to. The geturl() method can be used to get at this redirected URL.

    The getcode() method returns the HTTP status code that was sent with the response, or None if the URL is not an HTTP URL.
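
    The methods of the returned object can be tried without any network access by opening a local file through a file: URL. The following is only a sketch: the temporary file and its contents are made up for the illustration, and the try/except import is an assumption added so the same lines also run on Python 3, where these functions live in urllib.request.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url  # Python 2.7
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3 (assumption)

# Create a small local file so that no network access is needed.
fd, path = tempfile.mkstemp(suffix=".txt")
os.write(fd, b"hello urllib\n")
os.close(fd)

# A file: URL is opened as a local file rather than through a socket.
response = urlopen("file:" + pathname2url(path))
try:
    body = response.read()                    # raw bytes of the file
    headers = response.info()                 # meta-information (headers)
    content_length = headers["Content-Length"]
    status = response.getcode()               # None: not an HTTP URL
    final_url = response.geturl()             # URL the data came from
finally:
    response.close()
    os.remove(path)
```

    For an http: URL the same calls apply, except that getcode() then returns the numeric HTTP status and geturl() reflects any redirect that was followed.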

    If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

    The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy, or ftp_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter. For example (the '%' is the command prompt):

    % http_proxy="http://www.someproxy.com:3128"
    % export http_proxy
    % python
    ...

    The no_proxy environment variable can be used to specify hosts which shouldn’t be reached via proxy; if set, it should be a comma-separated list of hostname suffixes, optionally with :port appended, for example cern.ch,ncsa.uiuc.edu,some.host:8080.

    In a Windows environment, if no proxy environment variables are set, proxy settings are obtained from the registry’s Internet Settings section.

    In a Mac OS X environment, urlopen() will retrieve proxy information from the OS X System Configuration Framework, which can be managed with Network System Preferences panel.

    Alternatively, the optional proxies argument may be used to explicitly specify proxies. It must be a dictionary mapping scheme names to proxy URLs, where an empty dictionary causes no proxies to be used, and None (the default value) causes environmental proxy settings to be used as discussed above. For example:

    # Use http://www.someproxy.com:3128 for HTTP proxying
    proxies = {'http': 'http://www.someproxy.com:3128'}
    filehandle = urllib.urlopen(some_url, proxies=proxies)
    # Don't use any proxies
    filehandle = urllib.urlopen(some_url, proxies={})
    # Use proxies from environment - both versions are equivalent
    filehandle = urllib.urlopen(some_url, proxies=None)
    filehandle = urllib.urlopen(some_url)

    Proxies which require authentication for use are not currently supported; this is considered an implementation limitation.

    The context parameter may be set to an ssl.SSLContext instance to configure the SSL settings that are used if urlopen() makes an HTTPS connection.

    Changed in version 2.3: Added the proxies support.

    Changed in version 2.6: Added getcode() to returned object and support for the no_proxy environment variable.

    Changed in version 2.7.9: The context parameter was added. All the necessary certificate and hostname checks are done by default.

    Deprecated since version 2.6: The urlopen() function has been removed in Python 3 in favor of urllib2.urlopen().

    urllib.urlretrieve(url[, filename[, reporthook[, data]]])

    Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object exists, the object is not copied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached). Exceptions are the same as for urlopen().

    The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name). The third argument, if present, is a hook function that will be called once on establishment of the network connection and once after each block read thereafter. The hook will be passed three arguments: a count of blocks transferred so far, a block size in bytes, and the total size of the file. The third argument may be -1 on older FTP servers which do not return a file size in response to a retrieval request.
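
    A hook with the three-argument signature described above might look like the sketch below. The names make_reporthook and progress are hypothetical helpers invented for this illustration; only the argument order (block count, block size, total size) comes from the text.

```python
def make_reporthook(progress):
    """Return a hook that appends a completion percentage to `progress`.

    `progress` is any list supplied by the caller (hypothetical helper).
    """
    def reporthook(block_count, block_size, total_size):
        if total_size <= 0:
            # Size unknown, e.g. an older FTP server that reports -1.
            progress.append(None)
            return
        # The last block may be short, so cap the estimate at the total.
        transferred = min(block_count * block_size, total_size)
        progress.append(100.0 * transferred / total_size)
    return reporthook
```

    It would be passed as the third argument, for example urllib.urlretrieve(some_url, some_filename, make_reporthook(progress)), where some_url and some_filename are placeholders.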

    If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

    Changed in version 2.5: urlretrieve() will raise ContentTooShortError when it detects that the amount of data available was less than the expected amount (which is the size reported by a Content-Length header). This can occur, for example, when the download is interrupted.

    The Content-Length is treated as a lower bound: if there’s more data to read, urlretrieve() reads more data, but if less data is available, it raises the exception.

    You can still retrieve the downloaded data in this case; it is stored in the content attribute of the exception instance.

    If no Content-Length header was supplied, urlretrieve() cannot check the size of the data it has downloaded, and just returns it. In this case you just have to assume that the download was successful.
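
    One way to act on a truncated download is to catch the exception and keep whatever its content attribute holds. This is a sketch only: retrieve_or_partial and its retrieve parameter are hypothetical names introduced here (the parameter exists so the function can be exercised without a network), and the Python 3 import branch is an assumption added for portability.

```python
try:
    from urllib import urlretrieve, ContentTooShortError  # Python 2.7
except ImportError:
    from urllib.request import urlretrieve  # Python 3 (assumption)
    from urllib.error import ContentTooShortError


def retrieve_or_partial(url, filename=None, retrieve=urlretrieve):
    """Return (result, partial).

    On success, result is the (filename, headers) tuple and partial is None.
    If the download was cut short, result is None and partial is whatever
    the exception stored in its `content` attribute.
    """
    try:
        return retrieve(url, filename), None
    except ContentTooShortError as exc:
        return None, exc.content
```

    Any other failure still surfaces as IOError, the same as for urlopen().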

    urllib._urlopener

    The public functions urlopen() and urlretrieve() create an instance of the FancyURLopener class and use it to perform their requested actions. To override this functionality, programmers can create a subclass of URLopener or FancyURLopener, then assign an instance of that class to the urllib._urlopener variable before calling the desired function. For example, applications may want to specify a different User-Agent header than URLopener defines. This can be accomplished with the following code:

    import urllib

    class AppURLopener(urllib.FancyURLopener):
        version = "App/1.7"

    urllib._urlopener = AppURLopener()

    urllib.urlcleanup()

    Clear the cache that may have been built up by previous calls to urlretrieve().

    20.5.2. Utility functions

    urllib.quote(string[, safe])

    Replace special characters in string using the %xx escape. Letters, digits, and the characters '_.-' are never quoted. By default, this function is intended for quoting the path section of the URL. The optional safe parameter specifies additional characters that should not be quoted; its default value is '/'.

    Example: quote('/~connolly/') yields '/%7Econnolly/'.

    urllib.quote_plus(string[, safe])

    Like quote(), but also replaces spaces by plus signs, as required for quoting HTML form values when building up a query string to go into a URL. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.
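
    The difference between the two functions, and the effect of the safe argument, can be seen in a short sketch. The sample strings are made up, and the import fallback to urllib.parse is an assumption added so the lines also run on Python 3.

```python
try:
    from urllib import quote, quote_plus  # Python 2.7
except ImportError:
    from urllib.parse import quote, quote_plus  # Python 3 (assumption)

# '/' is safe by default, so only the space is escaped.
path = quote("/docs/my file.txt")

# An empty safe string means '/' is quoted as well.
strict = quote("/docs/my file.txt", "")

# quote_plus turns spaces into '+' and escapes literal '+' and '/'.
form_value = quote_plus("fish & chips+salt/vinegar")
```

    Here path is '/docs/my%20file.txt', strict is '%2Fdocs%2Fmy%20file.txt', and form_value is 'fish+%26+chips%2Bsalt%2Fvinegar'.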

    urllib.unquote(string)

    Replace %xx escapes by their single-character equivalent.

    Example: unquote('/%7Econnolly/') yields '/~connolly/'.

    urllib.unquote_plus(string)

    Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
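
    A brief sketch of the two decoders side by side; the input strings are invented for the illustration and the urllib.parse fallback is an assumption for Python 3.

```python
try:
    from urllib import unquote, unquote_plus, quote_plus  # Python 2.7
except ImportError:
    from urllib.parse import unquote, unquote_plus, quote_plus  # Python 3

# unquote only decodes %xx escapes; a '+' stays a '+'.
plain = unquote("/%7Econnolly/a+b")

# unquote_plus additionally maps '+' back to a space.
form_value = unquote_plus("fish+%26+chips%2Bsalt")

# quote_plus and unquote_plus are inverses for ordinary form text.
original = "name=J. Doe & co"
round_trip = unquote_plus(quote_plus(original))
```

    With these inputs plain is '/~connolly/a+b' and form_value is 'fish & chips+salt'.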

    urllib.urlencode(query[, doseq])

    Convert a mapping object or a sequence of two-element tuples to a “percent-encoded” string, suitable to pass to urlopen() above as the optional data argument. This is useful to pass a dictionary of form fields to a POST request. The resulting string is a series of key=value pairs separated by '&' characters, where both key and value are quoted using quote_plus() above. When a sequence of two-element tuples is used as the query argument, the first element of each tuple is a key and the second is a value. The value element in itself can be a sequence and in that case, if the optional parameter doseq evaluates to True, individual key=value pairs separated by '&' are generated for each element of the value sequence for the key. The order of parameters in the encoded string will match the order of parameter tuples in the sequence. The urlparse module provides the functions parse_qs() and parse_qsl() which are used to parse query strings into Python data structures.
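
    The effect of doseq and the round trip through parse_qs() can be sketched as follows. A list of tuples is used rather than a dictionary so that the parameter order is predictable; the field names are made up, and the Python 3 import branch is an assumption.

```python
try:
    from urllib import urlencode      # Python 2.7
    from urlparse import parse_qs
except ImportError:
    from urllib.parse import urlencode, parse_qs  # Python 3 (assumption)

# Order of the tuples is preserved in the encoded string.
simple = urlencode([("spam", 1), ("eggs", 2), ("bacon", 0)])

# With doseq true, each element of a sequence value gets its own pair.
multi = urlencode([("tag", ["py", "web"]), ("q", "a b")], doseq=True)

# parse_qs reverses the encoding; every value comes back as a list.
parsed = parse_qs(multi)
```

    Here simple is 'spam=1&eggs=2&bacon=0' and multi is 'tag=py&tag=web&q=a+b'. Without doseq, the list would be passed through str() and quoted as a single value.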

    urllib.pathname2url(path)

    Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value will already be quoted using the quote() function.

    urllib.url2pathname(path)

    Convert the path component path from a percent-encoded URL to the local syntax for a path. This does not accept a complete URL. This function uses unquote() to decode path.
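
    The two functions are meant to be used as a pair, which the sketch below illustrates with a made-up path. The exact URL form returned can differ between platforms and Python versions, so the example relies only on the round trip; the urllib.request fallback is an assumption for Python 3.

```python
import os

try:
    from urllib import pathname2url, url2pathname  # Python 2.7
except ImportError:
    from urllib.request import pathname2url, url2pathname  # Python 3

# A hypothetical local path containing a space.
local_path = os.path.join(os.sep, "tmp", "my file.txt")

# Quoted path component only, not a complete URL.
url_path = pathname2url(local_path)

# Converting back yields the local syntax again.
round_trip = url2pathname(url_path)
```

    A complete file: URL is formed by prefixing the scheme, as in "file:" + pathname2url(local_path).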

    urllib.getproxies()

    This helper function returns a dictionary of scheme to proxy server URL mappings. It scans the environment for variables named <scheme>_proxy, in a case-insensitive way, for all operating systems first, and when it cannot find it, looks for proxy information from Mac OS X System Configuration for Mac OS X and Windows Systems Registry for Windows. If both lowercase and uppercase environment variables exist (and disagree), lowercase is preferred.
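
    The environment lookup can be observed by setting a variable temporarily and calling getproxies(). In this sketch proxies_with_env is a hypothetical helper written for the illustration, the proxy URL is the placeholder used earlier in this document, and the Python 3 import branch is an assumption.

```python
import os

try:
    from urllib import getproxies  # Python 2.7
except ImportError:
    from urllib.request import getproxies  # Python 3 (assumption)


def proxies_with_env(name, value):
    """Return getproxies() as seen with os.environ[name] temporarily set."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        return getproxies()
    finally:
        # Restore the environment exactly as it was.
        if previous is None:
            del os.environ[name]
        else:
            os.environ[name] = previous


proxies = proxies_with_env("http_proxy", "http://www.someproxy.com:3128")
```

    The resulting dictionary maps the scheme name 'http' to the proxy URL, in the same shape accepted by the proxies argument of urlopen().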

    Note

    If the environment variable REQUEST_METHOD is set, which usually indicates your script is running in a CGI environment, the environment variable HTTP_PROXY (uppercase _PROXY) will be ignored. This is because that variable can be injected by a client using the “Proxy:” HTTP header. If you need to use an HTTP proxy in a CGI environment, either use ProxyHandler explicitly, or make sure the variable name is in lowercase (or at least the _proxy suffix).

    Note

    urllib also exposes certain utility functions like splittype, splithost and others parsing URL into various components. But it is recommended to use urlparse for parsing URLs rather than using these functions directly. Python 3 does not expose these helper functions from urllib.parse module.

    20.5.3. URL Opener objects

    class urllib.URLopener([proxies[, context[, **x509]]])

    Base class for opening and reading URLs. Unless you need to support opening objects using schemes other than http:, ftp:, or file:, you probably want to use FancyURLopener.

    By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number. Applications can define their own User-Agent header by subclassing URLopener or FancyURLopener and setting the class attribute version to an appropriate string value in the subclass definition.

    The optional proxies parameter should be a dictionary mapping scheme names to proxy URLs, where an empty dictionary turns proxies off completely. Its default value is None, in which case environmental proxy settings will be used if present, as discussed in the definition of urlopen(), above.

    The context parameter may be an ssl.SSLContext instance. If given, it defines the SSL settings the opener uses to make HTTPS connections.

    Additional keyword parameters, collected in x509, may be used for authentication of the client when using the https: scheme. The keywords key_file and cert_file are supported to provide an SSL key and certificate; both are needed to support client authentication.

    URLopener objects will raise an IOError exception if the server returns an error code.

    open(fullurl[, data])

    Open fullurl using the appropriate protocol. This method sets up cache and proxy information, then calls the appropriate open method with its input arguments. If the scheme is not recognized, open_unknown() is called. The data argument has the same meaning as the data argument of urlopen().

    open_unknown(fullurl[, data])

    Overridable interface to open unknown URL types.

    retrieve(url[, filename[, reporthook[, data]]])

    Retrieves the contents of url and places it in filename. The return value is a tuple consisting of a local filename and either a mimetools.Message object containing the response headers (for remote URLs) or None (for local URLs). The caller must then open and read the contents of filename. If filename is not given and the URL refers to a local file, the input filename is returned. If the URL is non-local and filename is not given, the filename is the output of tempfile.mktemp() with a suffix that matches the suffix of the last path component of the input URL. If reporthook is given, it must be a function accepting three numeric parameters. It will be called after each chunk of data is read from the network. reporthook is ignored for local URLs.

    If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urlencode() function below.

    version

    Variable that specifies the user agent of the opener object. To get urllib to tell servers that it is a particular user agent, set this in a subclass as a class variable or in the constructor before calling the base constructor.

    class urllib.FancyURLopener(...)

    FancyURLopener subclasses URLopener providing default handling for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x response codes listed above, the Location header is used to fetch the actual URL. For 401 response codes (authentication required), basic HTTP authentication is performed. For the 30x response codes, recursion is bounded by the value of the maxtries attribute, which defaults to 10.

    For all other response codes, the method http_error_default() is called which you can override in subclasses to handle the error appropriately.

    Note

    According to the letter of RFC 2616, 301 and 302 responses to POST requests must not be automatically redirected without confirmation by the user. In reality, browsers do allow automatic redirection of these responses, changing the POST to a GET, and urllib reproduces this behaviour.

    The parameters to the constructor are the same as those for URLopener.

    Note

    When performing basic authentication, a FancyURLopener instance calls its prompt_user_passwd() method. The default implementation asks the users for the required information on the controlling terminal. A subclass may override this method to support more appropriate behavior if needed.

    The FancyURLopener class offers one additional method that should be overloaded to provide the appropriate behavior:

    prompt_user_passwd(host, realm)

    Return information needed to authenticate the user at the given host in the specified security realm. The return value should be a tuple, (user, password), which can be used for basic authentication.

    The implementation prompts for this information on the terminal; an application should override this method to use an appropriate interaction model in the local environment.

    exception urllib.ContentTooShortError(msg[, content])

    This exception is raised when the urlretrieve() function detects that the amount of the downloaded data is less than the expected amount (given by the Content-Length header). The content attribute stores the downloaded (and supposedly truncated) data.

    New in version 2.5.

    20.5.4. urllib Restrictions

    Currently, only the following protocols are supported: HTTP (versions 0.9 and 1.0), FTP, and local files.

    The caching feature of urlretrieve() has been disabled until I find the time to hack proper processing of Expiration time headers.

    There should be a function to query whether a particular URL is in the cache.

    For backward compatibility, if a URL appears to point to a local file but the file can’t be opened, the URL is re-interpreted using the FTP protocol. This can sometimes cause confusing error messages.

    The urlopen() and urlretrieve() functions can cause arbitrarily long delays while waiting for a network connection to be set up. This means that it is difficult to build an interactive Web client using these functions without using threads.
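
    As a rough sketch of the thread-based approach hinted at above, a blocking fetch can be moved to a worker thread so that the caller waits only for a bounded time. The helper name fetch_with_deadline and its parameters are hypothetical; fetch stands for any callable that takes a URL, for instance one that wraps urlopen(url).read().

```python
import threading


def fetch_with_deadline(fetch, url, timeout):
    """Run fetch(url) in a worker thread and wait at most `timeout` seconds.

    Returns (finished, outcome). `outcome` holds either a "data" or an
    "error" entry once the worker is done; it is empty while still waiting.
    """
    outcome = {}

    def worker():
        try:
            outcome["data"] = fetch(url)
        except IOError as exc:
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.daemon = True  # a stuck fetch must not keep the interpreter alive
    thread.start()
    thread.join(timeout)
    return (not thread.is_alive()), outcome
```

    This does not cancel a stalled connection; it only lets the interactive part of a program carry on while the worker keeps waiting.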

    The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or (for example) HTML. The HTTP protocol provides type information in the reply header, which can be inspected by looking at the Content-Type header. If the returned data is HTML, you can use the module htmllib to parse it.

    The code handling the FTP protocol cannot differentiate between a file and a directory. This can lead to unexpected behavior when attempting to read a URL that points to a file that is not accessible. If the URL ends in a /, it is assumed to refer to a directory and will be handled accordingly. But if an attempt to read a file leads to a 550 error (meaning the URL cannot be found or is not accessible, often for permission reasons), then the path is treated as a directory in order to handle the case when a directory is specified by a URL but the trailing / has been left off. This can cause misleading results when you try to fetch a file whose read permissions make it inaccessible; the FTP code will try to read it, fail with a 550 error, and then perform a directory listing for the unreadable file. If fine-grained control is needed, consider using the ftplib module, subclassing FancyURLopener, or changing _urlopener to meet your needs.

    This module does not support the use of proxies which require authentication. This may be implemented in the future.

    Although the urllib module contains (undocumented) routines to parse and unparse URL strings, the recommended interface for URL manipulation is in module urlparse.

    20.5.5. Examples

    Here is an example session that uses the GET method to retrieve a URL containing parameters:

    >>> import urllib
    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
    >>> print f.read()

    The following example uses the POST method instead:

    >>> import urllib
    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
    >>> print f.read()

    The following example uses an explicitly specified HTTP proxy, overriding environment settings:

    >>> import urllib
    >>> proxies = {'http': 'http://proxy.example.com:8080/'}
    >>> opener = urllib.FancyURLopener(proxies)
    >>> f = opener.open("http://www.python.org")
    >>> f.read()

    The following example uses no proxies at all, overriding environment settings:

    >>> import urllib
    >>> opener = urllib.FancyURLopener({})
    >>> f = opener.open("http://www.python.org/")
    >>> f.read()
