Python Web Crawler Projects: A Code Collection, and How to Crawl All 1,200 Python Books on the Web

1. Multi-process crawler

  For crawlers that handle large amounts of data or need heavier processing, you can use Python's multi-process or multi-thread mechanisms. Multi-process means the work is spread over several worker processes that the operating system can schedule on different CPUs; multi-thread means a single process contains several thread "sub-workers" cooperating on the job (in CPython only one of them executes Python bytecode at any instant because of the GIL). Python offers several modules for multi-process and multi-thread work; here the multiprocessing module is used to build the multi-process crawler. During testing I found that, because the site has an anti-crawler mechanism, the crawler reports errors when the number of URLs and processes gets too large.
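
A minimal sketch of that contrast (not the article's code): the same map-style call can be driven either by a process pool or by a thread pool, assuming a placeholder fetch(url) function.

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Hypothetical comparison sketch: process pool vs. thread pool
from multiprocessing import Pool                        # one worker per process
from multiprocessing.dummy import Pool as ThreadPool    # same API, but threads

def fetch(url):
 # placeholder for the real page-downloading logic
 return len(url)

if __name__ == "__main__":
 urls = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1, 11)]
 process_pool = Pool(processes=2)
 thread_pool = ThreadPool(4)
 print(process_pool.map(fetch, urls))   # CPU-bound work benefits from extra processes
 print(thread_pool.map(fetch, urls))    # I/O-bound crawling usually benefits from threads
 process_pool.close(); process_pool.join()
 thread_pool.close(); thread_pool.join()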

Python multi-threaded crawlers and several data-storage approaches (Python Crawler in Practice 2)

[1] - WeChat Official Accounts crawler. A crawler interface for WeChat official accounts based on Sogou WeChat Search; it can be extended into a general crawler based on Sogou search. The result is a list in which each item is a dict holding the details of one official account.

2. The code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import time 
import requests
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
 '''
 @params: url, fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None  # default, so the function never returns an unbound name
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke entries from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall(r'<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # assumption: the vote count is followed by the literal "好笑", mirroring the "评论" pattern below
 laugh_counts = re.findall(r'<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall(r'<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # assumption: the joke text is wrapped in a <span> inside div.content
 contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 time.sleep(1)
 return duanzi_list

def normal_scapper(url_lists):
 '''
 Driver function: crawl the pages one by one with the plain crawler
 '''
 begin_time = time.time()
 for url in url_lists:
  scrap_qiushi_info(url)
 end_time = time.time()
 print("The sequential crawl took %f seconds in total" % (end_time - begin_time))

def muti_process_scapper(url_lists,process_num=2):
 '''
 Driver function for the multi-process crawler, built on the multiprocessing module
 '''
 begin_time = time.time()
 pool = Pool(processes=process_num)
 pool.map(scrap_qiushi_info,url_lists)
 pool.close()  # no more tasks; let the workers finish
 pool.join()
 end_time = time.time()
 print("Crawling with %d processes took %s seconds" % (process_num,(end_time - begin_time)))

def main():
 '''
 Entry point: build the URL list with a list comprehension and call both crawlers
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 normal_scapper(url_lists)
 muti_process_scapper(url_lists,process_num=2)


if __name__ == "__main__":
 main()

3. Storing the crawled data in MongoDB

#!/usr/bin/python
# -*- coding: utf-8 -*-

import re
import time 
import json
import requests
import pymongo
from multiprocessing import Pool

duanzi_list = []

def get_web_html(url):
 '''
 @params: url, fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None  # default, so the function never returns an unbound name
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke entries from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall(r'<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # assumption: the vote count is followed by the literal "好笑", mirroring the "评论" pattern below
 laugh_counts = re.findall(r'<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall(r'<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # assumption: the joke text is wrapped in a <span> inside div.content
 contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_mongo(datas):
 '''
 @datas: the records to insert into mongoDB, each one a dict; they are inserted by
 iterating over the list, and insert_one() inserts a single document per call
 '''
 client = pymongo.MongoClient('localhost',27017)
 duanzi = client['duanzi_db']
 duanzi_info = duanzi['duanzi_info']
 for data in datas:
  duanzi_info.insert_one(data)

def query_data_from_mongo():
 '''
 Query the data stored in mongoDB
 '''
 client = pymongo.MongoClient('localhost',27017)['duanzi_db']['duanzi_info']
 for data in client.find():
  print(data)
 print("Found %d documents in total" % (client.find().count()))


def main():
 '''
 Entry point: build the URL list with a list comprehension and call the crawler
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_mongo(duanzi_list)

if __name__ == "__main__":
 main()
 #query_data_from_mongo()
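
A small optional variant (a sketch, assuming pymongo 3.x): instead of calling insert_one() in a loop, the whole batch can be sent in one round trip with insert_many().

# Hypothetical batch variant of write_into_mongo(), assuming pymongo 3.x
import pymongo

def write_into_mongo_batch(datas):
 collection = pymongo.MongoClient('localhost',27017)['duanzi_db']['duanzi_info']
 if datas:
  collection.insert_many(datas)  # one round trip for the whole list of dicts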

Earlier I wrote an article about the approach to crawling all the Python books on the market; it counts as a small hands-on project in our data-analysis series. Last time the code was not finished; this weekend I had time to complete it all and store the results in a database. Today I will walk through, step by step, how I crawled the data, cleaned it, and got around some of the anti-crawler measures, along with a few notes.

[2] - Douban Books crawler. Can crawl all the books under a Douban Books tag and store them in Excel ranked by rating, which makes it easy to filter and collect them, for example picking out highly rated books with more than 1,000 raters; books from different topics can be stored in different Excel sheets. It uses a User-Agent to pose as a browser and adds random delays to better mimic browser behavior and keep the crawler from being blocked.

4. Inserting into a MySQL database

  The data fetched by the crawler is inserted into the relational database MySQL for persistent storage. First the database and table have to be created in MySQL, as follows:

1. Create the database
MariaDB [(none)]> create database qiushi;
Query OK, 1 row affected (0.00 sec)

2. Use the database
MariaDB [(none)]> use qiushi;
Database changed

3. Create the table
MariaDB [qiushi]> create table qiushi_info(id int(32) unsigned primary key auto_increment,username varchar(64) not null,level int default 0,laugh_count int default 0,comment_count int default 0,content text default '')engine=InnoDB charset='UTF8';
Query OK, 0 rows affected, 1 warning (0.06 sec)

MariaDB [qiushi]> show create table qiushi_info;
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table       | Create Table                                                                                                                                                                                                                                                                                            |
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| qiushi_info | CREATE TABLE `qiushi_info` (
  `id` int(32) unsigned NOT NULL AUTO_INCREMENT,
  `username` varchar(64) NOT NULL,
  `level` int(11) DEFAULT '0',
  `laugh_count` int(11) DEFAULT '0',
  `comment_count` int(11) DEFAULT '0',
  `content` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

 The code that writes the data into MySQL is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import time 
import pymysql
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params: url, fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None  # default, so the function never returns an unbound name
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke entries from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall(r'<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # assumption: the vote count is followed by the literal "好笑", mirroring the "评论" pattern below
 laugh_counts = re.findall(r'<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall(r'<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # assumption: the joke text is wrapped in a <span> inside div.content
 contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_mysql(datas):
 '''
 @params: datas, write the crawled records into the MySQL database
 '''
 try:
  conn = pymysql.connect(host='localhost',port=3306,user='root',password='',db='qiushi',charset='utf8')
  cursor = conn.cursor(pymysql.cursors.DictCursor)
  for data in datas:
   data_list = (data['username'],int(data['level']),int(data['laugh_count']),int(data['comment_count']),data['content'])
   # parameterized query, so quotes inside the content cannot break the SQL
   sql = "INSERT INTO qiushi_info(username,level,laugh_count,comment_count,content) VALUES(%s,%s,%s,%s,%s)"
   cursor.execute(sql,data_list)
   conn.commit()
  cursor.close()
  conn.close()
 except Exception as e:
  print(e)


def main():
 '''
 Entry point: build the URL list with a list comprehension and call the crawler
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_mysql(duanzi_list)

if __name__ == "__main__":
 main()
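
An optional batch variant (a sketch, not the original code): cursor.executemany() keeps the SQL parameterized and sends the whole list with a single commit.

# Hypothetical batch insert, assuming the same qiushi_info table as above
import pymysql

def write_into_mysql_batch(datas):
 conn = pymysql.connect(host='localhost',port=3306,user='root',password='',db='qiushi',charset='utf8')
 sql = "INSERT INTO qiushi_info(username,level,laugh_count,comment_count,content) VALUES(%s,%s,%s,%s,%s)"
 rows = [(d['username'],int(d['level']),int(d['laugh_count']),int(d['comment_count']),d['content']) for d in datas]
 try:
  with conn.cursor() as cursor:
   cursor.executemany(sql,rows)  # one statement, many parameter tuples
  conn.commit()
 finally:
  conn.close()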

1) All the Python books on the market are on JD, Tmall and Douban, so I chose Douban to crawl. 2) Analyzing the site's structure is actually fairly simple: there is a main page that links to all the Python books, 1,388 in total (of which 100-odd are actually duplicates), and the pagination at the bottom of the page shows 93 pages in all.

[3] - Zhihu crawler. This project crawls Zhihu user information and the topology of user relationships; the crawler framework is scrapy and the data is stored in mongodb.

5. Writing the crawler data to a CSV file

  A CSV file is a comma-separated text format that can be read either as plain text or with Excel; it is a common way to store data. Here the crawled data is written into a CSV file.

The code that stores the data in a CSV file is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import csv
import time 
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params: url, fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None  # default, so the function never returns an unbound name
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke entries from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall(r'<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # assumption: the vote count is followed by the literal "好笑", mirroring the "评论" pattern below
 laugh_counts = re.findall(r'<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall(r'<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # assumption: the joke text is wrapped in a <span> inside div.content
 contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_csv(datas,filename):
 '''
 @datas: the records to write, a list of dictionaries
 @params: filename, path of the target CSV file
 '''
 with open(filename,'w+') as f:
  writer = csv.writer(f)
  writer.writerow(('username','level','laugh_count','comment_count','content'))
  for data in datas:
   writer.writerow((data['username'],data['level'],data['laugh_count'],data['comment_count'],data['content']))

def main():
 '''
 Entry point: build the URL list with a list comprehension and call the crawler
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_csv(duanzi_list,'/root/duanzi_info.csv')

if __name__ == "__main__":
 main()
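
A slightly different sketch of the same idea: csv.DictWriter can write the dictionaries directly, so the column order is declared only once.

# Hypothetical DictWriter variant of write_into_csv()
import csv

def write_into_csv_dict(datas,filename):
 fieldnames = ['username','level','laugh_count','comment_count','content']
 with open(filename,'w+') as f:
  writer = csv.DictWriter(f,fieldnames=fieldnames)
  writer.writeheader()
  writer.writerows(datas)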

3) These pages are static and the URLs follow a regular pattern, so it is easy to construct all of the URLs. 4) Crawl all the Python books and their URLs from each paging page; for example, the first page contains the "Learn Python the Hard Way" book, and we only need to extract the book title and its URL.

[4] - Bilibili user crawler. Fields grabbed: user id, nickname, gender, avatar, level, experience points, number of followers, birthday, location, registration time, signature, level and experience, and so on. After crawling, it generates a Bilibili user data report.

6. Writing the crawled data to a plain text file

#!/usr/bin/python
# -*- coding: utf-8 -*-
#blog:http://www.cnblogs.com/cloudlab/

import re
import csv
import time 
import requests

duanzi_list = []

def get_web_html(url):
 '''
 @params: url, fetch the HTML of the given web page
 '''
 headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
 response = None  # default, so the function never returns an unbound name
 try:
  req = requests.get(url,headers=headers)
  if req.status_code == 200:
   response = req.text.encode('utf8')
 except Exception as e:
  print(e)
 return response

def scrap_qiushi_info(url):
 '''
 @params: url, scrape the joke entries from one listing page
 '''
 html = get_web_html(url)
 usernames = re.findall(r'<h2>(.*?)</h2>',html,re.S|re.M)
 levels = re.findall(r'<div class="articleGender \w*Icon">(\d+)</div>',html,re.S|re.M)
 # assumption: the vote count is followed by the literal "好笑", mirroring the "评论" pattern below
 laugh_counts = re.findall(r'<i class="number">(\d+)</i> 好笑',html,re.S|re.M)
 comment_counts = re.findall(r'<i class="number">(\d+)</i> 评论',html,re.S|re.M)
 # assumption: the joke text is wrapped in a <span> inside div.content
 contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>',html,re.S|re.M)
 for username,level,laugh_count,comment_count,content in zip(usernames,levels,laugh_counts,comment_counts,contents):
  information = {
   "username": username.strip(),
   "level": level.strip(),
   "laugh_count": laugh_count.strip(),
   "comment_count": comment_count.strip(),
   "content": content.strip()
  }
  duanzi_list.append(information)
 return duanzi_list

def write_into_files(datas,filename):
 '''
 Write the crawled records to a plain text file
 @params: datas, the records to write
 @filename: path of the target text file
 '''
 print("Start writing the file..")
 with open(filename,'w+') as f:
  f.write("username" + "\t" + "level" + "\t" + "laugh_count" + "\t" + "comment_count" + "\t" + "content" + "\n")
  for data in datas:
   f.write(data['username'] + "\t" + \
    data['level'] + "\t" + \
    data['laugh_count'] + "\t" + \
    data['comment_count'] + "\t" + \
    data['content'] + "\n" + "\n"
   )

def main():
 '''
 Entry point: build the URL list with a list comprehension and call the crawler
 '''
 url_lists = ['https://www.qiushibaike.com/text/page/{}'.format(i) for i in range(1,11)]
 for url in url_lists:
  scrap_qiushi_info(url) 
  time.sleep(1)
 write_into_files(duanzi_list,'/root/duanzi.txt')

if __name__ == "__main__":
 main()

 


1) Above we already extracted all the Python books and their URLs from the 93 pages, 93*15, roughly 1,300+ books. First we deduplicate them, then we can keep them in memory in a dict, or save them to a CSV file (some readers may wonder why save them to a file when a dict is so convenient; I'll reveal the reason later).

[5] - Sina Weibo crawler. Mainly crawls Sina Weibo users' personal information, their posts, their followers and the accounts they follow. The code obtains Sina Weibo cookies to log in, and multi-account login can be used to avoid Weibo's anti-crawler measures. Mainly uses the scrapy crawler framework.

2) Next we analyze the characteristics of each book's page:

[6] - Novel-download distributed crawler. A distributed web crawler implemented with scrapy, redis, mongodb and graphite; the underlying storage is a mongodb cluster, distribution is implemented with redis, and crawler status display uses graphite. It mainly targets a single novel site.

As mentioned in the previous article, the fields we need to analyze are: author / publisher / translator / publication year / pages / price / ISBN / rating / number of raters.

[7] - CNKI (China National Knowledge Infrastructure) crawler. After setting the search conditions, run src/CnkiSpider.py to grab the data; the crawled data is stored under the /data directory, and the first row of each data file holds the field names.

Looking at the page source, the main information sits in two div blocks.

[8] - Lianjia crawler. Crawls Lianjia's historical second-hand housing transaction records for the Beijing area. Covers all the code from the Lianjia-crawler article, including the Lianjia simulated-login code.

3) Cleaning this part of the data is fairly painful, because not every book has a rating and review system, and not every book has an author, page count or price, so the extraction code has to do solid exception handling; some pages, for example, are laid out quite differently. The raw data collected contains many inconsistencies:

[9] - JD crawler. A scrapy-based crawler for the JD.com site; results are saved in CSV format.

  • The books' date formats come in every shape: some books' dates look like 'September 2007', 'October 22, 2007', '2017-9', '2017-8-25'.

  • The prices use inconsistent currencies: RMB, euros, Japanese yen and US dollars, for example CNY 49.00, 135, 19 €, JPY 4320, $ 176.00.
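
A hedged sketch of one way to normalize those mixed date strings (the article says datetime and calendar were used for the publication dates; the exact format list here is an assumption):

# Hypothetical date-normalization helper for the formats listed above
from datetime import datetime

DATE_FORMATS = ['%B %Y', '%B %d, %Y', '%Y-%m', '%Y-%m-%d']

def parse_pub_date(raw):
 raw = raw.strip()
 for fmt in DATE_FORMATS:
  try:
   return datetime.strptime(raw, fmt)
  except ValueError:
   continue
 return None  # leave unparseable dates for manual inspection

print(parse_pub_date('September 2007'))  # 2007-09-01 00:00:00
print(parse_pub_date('2017-8-25'))       # 2017-08-25 00:00:00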

[10] - QQ group crawler. Batch-crawls QQ group information, including group name, group number, member count, group owner, group description and so on, and finally generates XLS(X) / CSV result files.

1) Some readers asked me whether I used the scrapy framework or wrote it by hand. For this project I wrote it by hand; scrapy is actually a very powerful framework, and if I were crawling hundreds of thousands of records I would certainly reach for that heavy weapon.

[11] - WooYun crawler. A crawler and search tool for WooYun's public vulnerabilities and knowledge base. The list of all public vulnerabilities and the text of each vulnerability are stored in mongodb, roughly 2 GB of content; crawling the whole site's text and images for offline querying takes about 10 GB of space and 2 hours (on a 10M telecom connection); crawling the whole knowledge base takes about 500 MB. The vulnerability search uses Flask as the web server and Bootstrap for the front end.

2) I crawl with multiple threads: throw all the URLs into one queue, then start several threads that keep taking URLs from the queue and crawling them, over and over until every URL in the queue has been processed.
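
A minimal sketch of that "URL queue plus worker threads" pattern (the worker body and the Douban paging URL are placeholders, not the author's code):

# Hypothetical queue/worker sketch; works on Python 2 and 3
import threading
try:
 import Queue as queue   # Python 2
except ImportError:
 import queue             # Python 3

url_queue = queue.Queue()
for i in range(93):
 url_queue.put('https://book.douban.com/tag/Python?start={}'.format(i * 15))  # assumed URL pattern

def worker():
 while True:
  try:
   url = url_queue.get_nowait()
  except queue.Empty:
   break
  # ... fetch and parse `url` here ...
  url_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
 t.start()
for t in threads:
 t.join()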

Update 2016.9.11:

3) For storing the data, there are a couple of approaches:

[12] - Qunar crawler. Web crawling with Selenium, logging in through a proxy: crawls the Qunar site, uses selenium to simulate browser login, and performs the pagination actions. Proxies can be stored in a file that the program reads and uses. Supports multi-process crawling.

  • One is to write the crawled data straight into an SQL database, and every time a new URL arrives, query the database to see whether it is already there: if so, skip it; if not, crawl and process it.

  • The other is to write to a CSV file. Because multiple threads read and write it, a lock is absolutely required, otherwise several threads writing the same file at once will corrupt it (see the sketch after this list). A CSV file can also be converted into a database later, and saving to CSV has one more advantage: it can be loaded into pandas for very convenient processing and analysis.
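
A sketch of the lock the second bullet refers to (the file name is taken from the results section below; the rest is illustrative):

# One shared writer, guarded so concurrent threads never interleave rows
import csv
import threading

csv_lock = threading.Lock()
csv_file = open('python_books.csv','a')
csv_writer = csv.writer(csv_file)

def save_row(row):
 with csv_lock:  # only one thread writes at a time
  csv_writer.writerow(row)
  csv_file.flush()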

[13] - Flight-ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based flight-ticket crawler that currently combines the two major domestic ticket sites, Qunar and Ctrip.

1) Large websites generally have anti-crawler policies; even though this crawl only covers about 1,000 books, we still run into the anti-crawler problem.

2) As for anti-crawler policies, there are many ways around them: sometimes adding delays (especially with multi-threaded processing), sometimes using cookies, sometimes proxies; large-scale crawling definitely calls for a proxy pool. Here I used cookies plus delays, a rather crude approach.
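
A hedged sketch of that "cookie plus delay" idea: reuse one requests session (which keeps cookies between requests) and sleep a random interval before each page. The header value is a placeholder.

# Illustrative polite-request helper, not the author's exact code
import random
import time
import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

def polite_get(url):
 time.sleep(random.uniform(1, 3))  # random delay between requests
 return session.get(url, timeout=10)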

[14] - A content crawler for the Tianya forum client, based on requests, MySQLdb and torndb.

3) Resume from breakpoints: even though my data volume is not huge (on the order of a thousand records), I recommend adding resume support, because you never know what will go wrong while crawling. Even if you can crawl recursively, if you have crawled 800-odd records and the program dies before anything is saved, the next run has to start from scratch, which is maddening (sharp readers will have guessed that this is the reason for the hint I left above about saving to a file).
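
A minimal resume-from-breakpoint sketch (the file name and layout are illustrative): record every finished URL, and skip the ones already recorded on the next run.

# Hypothetical checkpoint helpers
import os

DONE_FILE = 'crawled_urls.txt'

def load_done():
 if not os.path.exists(DONE_FILE):
  return set()
 with open(DONE_FILE) as f:
  return set(line.strip() for line in f)

def mark_done(url):
 with open(DONE_FILE,'a') as f:
  f.write(url + '\n')

# usage: done = load_done()
#        for url in all_urls:
#            if url in done: continue
#            crawl(url); mark_done(url)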

1) The overall code architecture has not been fully refactored yet; at the moment it is split into the .py files listed below, which I will optimize and package further later on:

[15] - A collection of crawlers for Douban movies, books, groups, photo albums, products and so on.

[17] - Baidu MP3 full-site crawler; uses redis to support resuming from breakpoints. [18] - Taobao and Tmall crawler; can grab page information by search keyword and item id; the data is stored in …

  • spider_main: crawls the book links and covers from all 93 paging pages, with multi-threaded processing
  • book_html_parser: crawls the information of each individual book
  • url_manager: manages all the URLs
  • db_manager: handles database storage and queries
  • util: holds some global variables
  • verify: a small program I used for public testing of the code

[19] - A stock data (Shanghai/Shenzhen) crawler plus a stock-picking strategy testing framework. Grabs quote data for all stocks on the Shanghai and Shenzhen exchanges over a selected date range. Supports defining stock-picking strategies with expressions and supports multi-threaded processing. Saves data to JSON and CSV files.

2) Where the main crawl results are stored:

all_books_link.csv mainly stores the URLs and titles of the 1,200-odd books; python_books.csv mainly stores the detailed information of each book. 3) Libraries used — crawling: requests and BeautifulSoup; data cleaning: a large number of regular expressions and the collections module, with the datetime and calendar modules for the books' publication dates; multithreading: the threading and queue modules.


Conclusion: OK, that wraps up the crawler part of this whole-web analysis of Python books. We have touched on basically every technical point of the project. Crawling is a lot of fun, but there is still plenty to learn before you become really good at it: making a crawler fast and robust while getting past anti-crawler systems is not a simple matter.
Interested readers are welcome to try writing one themselves. I will put the source code on GitHub once the follow-up data-analysis posts are done; if you have any questions, feel free to leave a comment and discuss.

This list collects open-source code for all kinds of hands-on Python web crawler projects and is continuously updated; contributions are welcome.
