[Learning Python: Web] Implementing a crawler in Python 3.1.2 -- C

2011/06/28
      Implementing Retriever: this class uses MyParser to download, store, and extract the hyperlinks inside a page.
      Three parts need to be implemented:
      1. Parse the URL and build a file name and path suitable for storing the page; this is done in the constructor (see the sketch after the splitext notes below).
      urllib.parse.urlparse(url): splits a URL into its six components.
  urllib.parse.urlparse(urlstring, default_scheme='', allow_fragments=True) Parse a URL into six components, returning a 6-tuple. This corresponds to the general structure of a URL: scheme://netloc/path;parameters?query#fragment. Each tuple item is a string, possibly empty. The components are not broken up in smaller parts (for example, the network location is a single string), and % escapes are not expanded. The delimiters as shown above are not part of the result, except for a leading slash in the path component, which is retained if present. For example:
  >>> from urllib.parse import urlparse
  >>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
  >>> o   # doctest: +NORMALIZE_WHITESPACE
  ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
              params='', query='', fragment='')
  >>> o.scheme
  'http'
  >>> o.port
  80
  >>> o.geturl()
  'http://www.cwi.nl:80/%7Eguido/Python.html'
      os.path.splitext(path): splits path into a (root, extension) pair.
  os.path.splitext(path) Split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period. Leading periods on the basename are ignored; splitext('.cshrc') returns ('.cshrc', '').
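      Putting the two calls together, the constructor could look like the sketch below. This is a minimal sketch, not the article's exact code: the class layout, the filename helper, and the deffile default are illustrative assumptions.

    import os
    from urllib.parse import urlparse

    class Retriever:
        def __init__(self, url):
            self.url = url
            self.file = self.filename(url)     # local path the page is saved to

        def filename(self, url, deffile='index.htm'):
            # Split the URL into its six parts; netloc + path gives a
            # relative file path such as 'www.example.com/docs/page.html'.
            parsed = urlparse(url)
            path = parsed.netloc + parsed.path
            root, ext = os.path.splitext(path)
            if ext == '':
                # No extension: the URL names a directory, so store the
                # page under a default file name inside it.
                path = os.path.join(path, deffile)
            # Create the directory tree for the file if it is missing.
            directory = os.path.dirname(path)
            if directory and not os.path.isdir(directory):
                os.makedirs(directory)
            return path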
      2. Download the page content. A URL that opens fine in a browser can still fail to download, so exception handling is needed.
      urllib.request.urlretrieve(self.url, self.file): downloads the page at self.url into self.file.
  urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None) Copy a network object denoted by a URL to a local file, if necessary. If the URL points to a local file, or a valid cached copy of the object exists, the object is not copied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object, possibly cached). Exceptions are the same as for urlopen().
  The second argument, if present, specifies the file location to copy to (if absent, the location will be a tempfile with a generated name). The third argument, if present, is a hook function that will be called once on establishment of the network connection and once after each block read thereafter. The hook will be passed three arguments; a count of blocks transferred so far, a block size in bytes, and the total size of the file. The third argument may be -1 on older FTP servers which do not return a file size in response to a retrieval request.
  If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see urllib.parse.urlencode().
  urlretrieve() will raise ContentTooShortError when it detects that the amount of data available was less than the expected amount (which is the size reported by a Content-Length header). This can occur, for example, when the download is interrupted.
  The Content-Length is treated as a lower bound: if there’s more data to read, urlretrieve reads more data, but if less data is available, it raises the exception.
  You can still retrieve the downloaded data in this case, it is stored in the content attribute of the exception instance.
  If no Content-Length header was supplied, urlretrieve can not check the size of the data it has downloaded, and just returns it. In this case you just have to assume that the download was successful.
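      A download method in that spirit might look like the sketch below (continuing the hypothetical Retriever class above; the error-string return convention is an illustrative assumption):

    import urllib.request
    from urllib.error import URLError

    def download(self):
        # Try to fetch self.url into self.file.  Pages that open in a
        # browser but cannot be retrieved surface here as exceptions.
        try:
            retval = urllib.request.urlretrieve(self.url, self.file)
        except (IOError, URLError):
            # Flag the URL as bad so the crawler can skip parsing it.
            retval = ('*** ERROR: invalid URL "%s"' % self.url,)
        return retval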
      user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
      header = {'User-Agent' : user_agent}
      request = urllib.request.Request(self.url, headers=header): creates a Request object.
  class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False) This class is an abstraction of a URL request.
  url should be a string containing a valid URL.
  data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.
  headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself.
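      Note that urlretrieve() accepts only a URL string, not a Request object, so to download with a spoofed User-Agent the page has to be fetched with urlopen() and written out manually. A minimal sketch, with 'http://www.example.com/' and 'page.html' standing in for self.url and self.file:

    import urllib.request

    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    header = {'User-Agent': user_agent}

    request = urllib.request.Request('http://www.example.com/', headers=header)
    response = urllib.request.urlopen(request)   # sends the spoofed header
    data = response.read()                       # raw bytes of the page
    response.close()

    with open('page.html', 'wb') as f:           # write the page to disk
        f.write(data)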