Preface

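This is my first attempt at a web crawler, so these notes focus on theory, with implementation as a supplement. They were pieced together from various sources and are not very systematically organized.
Reference links:
How Python crawlers work
Two parsing approaches and four crawler implementations in Python
Basic principles of web crawlers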

I. Tools

Two parsing libraries: BeautifulSoup, lxml
Two request libraries: urllib, requests

II. Crawler workflow

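How users obtain data from the web:
Method 1: the browser submits a request -> downloads the page code -> renders it as a page
Method 2: a program simulates a browser and sends the request (fetching the page code) -> extracts the useful data -> stores it in a database or a file
A crawler is method 2.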

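1.Send the request

Use an HTTP library to send a request to the target site, i.e. send a Request.
A Request contains: request headers, a request body, and so on.
Drawback of plain HTTP request libraries: they only fetch the raw response and do not execute JavaScript or apply CSS, so content that a page generates with scripts is missing.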

2.Get the response content

If the server responds normally, you get a Response.
A Response may contain: HTML, JSON, images, video, and so on.

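3.Parse the content

Parsing HTML: regular expressions (the re module), or third-party parsing libraries such as BeautifulSoup and pyquery
Parsing JSON: the json module
Parsing binary data: write it to a file opened in wb mode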
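
The three cases above can be sketched with the standard library alone. The HTML snippet, JSON string, and byte payload below are made-up sample data, not real responses. For real pages a parser such as BeautifulSoup is usually more robust than a regular expression.

```python
import json
import os
import re
import tempfile

# Made-up sample data standing in for real response bodies.
html_text = (
    '<ul><li class="list-item" data-subject="1">A</li>'
    '<li class="list-item" data-subject="2">B</li></ul>'
)
json_text = '{"title": "demo", "score": 8.5}'
binary_data = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature

# HTML: pull attribute values out with a regular expression.
subject_ids = re.findall(r'data-subject="(\d+)"', html_text)

# JSON: decode the text into Python objects with the json module.
record = json.loads(json_text)

# Binary: write the raw bytes to a file opened in "wb" mode.
out_path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(out_path, "wb") as f:
    f.write(binary_data)

print(subject_ids, record["title"], os.path.getsize(out_path))
```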

4.Save the data

Databases (MySQL, MongoDB, Redis)
Files
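
As a rough illustration of the "files" option, the sketch below writes scraped records to a CSV file and reads them back. The records and the file name movies.csv are invented for the example. Saving to a database would follow the same idea with that database's own client library.

```python
import csv
import os
import tempfile

# Invented records in the same shape a crawler might collect.
rows = [
    {"id": "1", "name": "Movie A"},
    {"id": "2", "name": "Movie B"},
]

# Write the records to a CSV file in a temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "movies.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to check what was stored.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```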

III. Request & Response

1.Request

1.1.Request method

The common ones are GET and POST.

1.2.Request URL

A URL (Uniform Resource Locator) identifies a unique resource on the internet, for example an image, a file, or a video.
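
To make the parts of a URL concrete, here is a small sketch using urllib.parse. The subject id 12345 is a placeholder, not a real movie id.

```python
from urllib.parse import parse_qs, urlsplit

# "12345" is a placeholder subject id used only for illustration.
url = "https://movie.douban.com/subject/12345/comments?start=0&limit=20"

parts = urlsplit(url)          # split into scheme, host, path, query, fragment
query = parse_qs(parts.query)  # decode the query string into a dict of lists

print(parts.scheme)   # protocol
print(parts.netloc)   # host
print(parts.path)     # path to the resource on that host
print(query)          # query parameters
```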

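1.3.Request headers

User-Agent: identifies the client (browser). If the request headers carry no User-Agent, the server may treat the request as coming from an illegitimate user.
Host: the domain name of the server being requested.
Cookie: cookies are used to keep login information.
Referer: the page the request came from.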
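
A minimal sketch of attaching such headers with urllib follows. The User-Agent string and the cookie value are example placeholders, and nothing is actually sent because urlopen is never called.

```python
from urllib import request

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",  # example browser-like value
    "Referer": "https://movie.douban.com/",
    "Cookie": "sessionid=example",                    # placeholder cookie
}

# Build the request object only; no network traffic happens here.
req = request.Request("https://movie.douban.com/nowplaying/hangzhou/", headers=headers)

# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(req.host)  # the value urllib will send as the Host header
```

Passing `req` to `request.urlopen` would then send the request with these headers.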

1.4.Request body

GET: the request body is empty
POST: the request body is form data
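
The difference can be seen with urllib without sending anything. In this sketch the URL and the form fields are hypothetical. With urllib, supplying `data` is what turns a request into a POST.

```python
from urllib import parse, request

url = "https://example.com/login"  # hypothetical endpoint

# GET: no body is attached.
get_req = request.Request(url)

# POST: form fields are URL-encoded and sent as bytes in the body.
form = parse.urlencode({"user": "alice", "pwd": "secret"}).encode("utf-8")
post_req = request.Request(url, data=form)

print(get_req.get_method(), get_req.data)
print(post_req.get_method(), post_req.data)
```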

2.Response

2.1 Response status codes

200: success
301: redirect (moved permanently)
404: resource not found
403: access forbidden
502: server-side error (bad gateway)
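
For reference, the standard library already knows the official phrase for each code. The helper `is_success` below is a hypothetical name introduced only for this sketch.

```python
from http import HTTPStatus

# Look up the standard reason phrase for each status code listed above.
phrases = {code: HTTPStatus(code).phrase for code in (200, 301, 403, 404, 502)}
for code, phrase in phrases.items():
    print(code, phrase)


def is_success(code):
    """Return True for 2xx status codes (hypothetical helper)."""
    return 200 <= code < 300


print(is_success(200), is_success(404))
```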

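2.2 Response headers

Set-Cookie: BDSVRTM=0; path=/ : may appear several times; it tells the browser to store the cookie.
Location: when the server's response headers include Location (with a redirect status such as 301), the browser goes on to request that other page.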
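
As a small sketch of what a client does with a Set-Cookie value, the standard http.cookies module can parse the example string from above into a name, a value, and its attributes.

```python
from http.cookies import SimpleCookie

# Parse the Set-Cookie value quoted above.
cookie = SimpleCookie()
cookie.load("BDSVRTM=0; path=/")

morsel = cookie["BDSVRTM"]
print(morsel.key, morsel.value)  # cookie name and value
print(morsel["path"])            # the path attribute the cookie applies to
```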

2.3 Preview

The response body, i.e. the page source, which may be:
JSON data, HTML, images, or other binary data

Next, I try writing some basic crawler code and keep notes on it.

```python
# Send the request and read the response content
from urllib import request

resp = request.urlopen('https://movie.douban.com/nowplaying/hangzhou/')  # http.client.HTTPResponse
html_data = resp.read().decode('utf-8')  # str; printing this gives the most readable output

# Parse the content
from bs4 import BeautifulSoup as bs

soup = bs(html_data, 'html.parser')  # bs4.BeautifulSoup
# find_all returns a bs4.element.ResultSet, which behaves like a list of Tags,
# so it has to be indexed with [0] before use.
nowplaying_movie = soup.find_all('div', id='nowplaying')
tmp = nowplaying_movie[0]  # bs4.element.Tag
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')  # ResultSet of Tags
nowplaying_list = []  # find_all only cuts out the matching fragments; the data is extracted here
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    nowplaying_dict['id'] = item['data-subject']
    for tag_img_item in item.find_all('img'):
        nowplaying_dict['name'] = tag_img_item['alt']
    nowplaying_list.append(nowplaying_dict)

requrl = 'https://movie.douban.com/subject/' + nowplaying_list[0]['id'] + '/comments' + '?' + 'start=0' + '&limit=20'

# The same three lines again: request, decode, parse
resp = request.urlopen(requrl)
html_data = resp.read().decode('utf-8')
soup = bs(html_data, 'html.parser')

# Assumption: the original notes never define commentdiv_lists; here each comment
# is taken to sit in a <div class="comment">.
commentdiv_lists = soup.find_all('div', class_='comment')
commentdiv_lists[0].find_all('span', class_='short')[0].string  # .string is roughly the text between the tags
```
