Scraping Douban Book Short Reviews with Scrapy and Exporting Them to Excel
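This article shows how to use the Scrapy framework to scrape short reviews for the books on the first two pages of the Douban Books Top 250 list (50 books in total) and save the results to an Excel file. Each record contains the following fields:
- Book title
- Username
- Short review text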
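The Top 250 list is paginated through a `start` query parameter with 25 books per page, so the first two pages correspond to `start=0` and `start=25`. As a small sketch, the list-page URLs could be generated rather than typed by hand; the helper name `top250_urls` is made up for this illustration.

```python
BOOKS_PER_PAGE = 25  # each Top 250 list page shows 25 books


def top250_urls(pages):
    """Return the list-page URLs for the first `pages` pages."""
    base = 'https://book.douban.com/top250?start={}'
    return [base.format(i * BOOKS_PER_PAGE) for i in range(pages)]


# Two pages cover the 50 books used in this article
print(top250_urls(2))
```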
Initializing the project structure
D:\USERDATA\python>scrapy startproject douban_comments
New Scrapy project 'douban_comments', using template directory '...', created in:
    D:\USERDATA\python\douban_comments
You can start your first spider with:
    cd douban_comments
    scrapy genspider example example.com
Defining the spider logic
Write the core scraping code in douban_comments/spiders/comment_spider.py:
import math
import re

import scrapy
from bs4 import BeautifulSoup

from ..items import CommentItem


class CommentSpider(scrapy.Spider):
    name = 'comments'
    allowed_domains = ['book.douban.com']
    # First two pages of the Top 250 list, 25 books per page
    start_urls = [
        'https://book.douban.com/top250?start=0',
        'https://book.douban.com/top250?start=25',
    ]

    def parse(self, response):
        """Collect the detail-page link of every book on a list page."""
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.find_all('a', class_='nbg'):
            detail_url = link.get('href')
            if detail_url:
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)

    def parse_detail(self, response):
        """Read the review count from a book page and request its review pages."""
        soup = BeautifulSoup(response.text, 'html.parser')
        comment_link = soup.find('a', href=re.compile(r'^https://book\.douban\.com/subject/.*/comments/'))
        if comment_link is None:
            return  # no short-review link on this page
        # The link text is expected to contain the total number of reviews
        match = re.search(r'\d+', comment_link.text)
        if match is None:
            return
        total = int(match.group())
        page_count = min(math.ceil(total / 20), 2)  # fetch at most 2 pages of reviews
        for p in range(1, page_count + 1):
            # response.url is the book's detail URL, assumed to end with '/'
            comment_page = f"{response.url}comments/hot?p={p}"
            yield scrapy.Request(url=comment_page, callback=self.parse_comments)

    def parse_comments(self, response):
        """Yield one CommentItem per short review on a review page."""
        soup = BeautifulSoup(response.text, 'html.parser')
        title_link = soup.find('a', href=re.compile(r'^https://book\.douban\.com/subject/'))
        title = title_link.text.strip() if title_link else ''
        for item in soup.find_all('li', class_='comment-item'):
            short = item.find('span', class_='short')
            if short is None:
                continue  # skip entries without review text
            record = CommentItem()
            record['book_title'] = title
            # The second profile link is taken as the username
            # (the first is presumably the avatar link)
            user_info = item.find_all('a', href=re.compile(r'^https://www\.douban\.com/people/'))
            record['username'] = user_info[1].text if len(user_info) > 1 else ''
            record['content'] = short.text.strip()
            yield record
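The page-count calculation in parse_detail can be checked in isolation. The sketch below assumes, as the spider does, 20 reviews per page and a cap of 2 pages. `review_page_urls` is a hypothetical helper, and the subject URL in the usage lines is a made-up placeholder rather than a real book.

```python
import math


def review_page_urls(detail_url, total, per_page=20, max_pages=2):
    """Build the review-page URLs for one book.

    detail_url is assumed to end with '/', as the spider's URL building requires.
    """
    page_count = min(math.ceil(total / per_page), max_pages)
    return [f"{detail_url}comments/hot?p={p}" for p in range(1, page_count + 1)]


# Hypothetical book URL, used only for illustration
url = 'https://book.douban.com/subject/0000000/'
print(review_page_urls(url, total=15))    # fewer than 20 reviews: one page
print(review_page_urls(url, total=1234))  # many reviews: capped at two pages
print(review_page_urls(url, total=0))     # no reviews: no pages requested
```

With the cap of 2 pages and 50 books, a full crawl requests at most 100 review pages on top of the 2 list pages and 50 detail pages.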
Configuring the data model
Edit douban_comments/items.py to define the fields to collect:
import scrapy


class CommentItem(scrapy.Item):
    book_title = scrapy.Field()
    username = scrapy.Field()
    content = scrapy.Field()
Setting up the output pipeline
Implement the Excel writer in douban_comments/pipelines.py:
import openpyxl


class ExcelPipeline:
    def __init__(self):
        # Create the workbook in memory and write the header row
        # (book title, username, short review)
        self.workbook = openpyxl.Workbook()
        self.sheet = self.workbook.active
        self.sheet.append(['书名', '用户名', '短评'])

    def process_item(self, item, spider):
        # Append one row per scraped review
        row_data = [item['book_title'], item['username'], item['content']]
        self.sheet.append(row_data)
        return item

    def close_spider(self, spider):
        # Write the file once, when the crawl ends
        self.workbook.save('douban_short_comments.xlsx')
        self.workbook.close()
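Review text is user-generated and may contain control characters, which openpyxl refuses to write to a cell. As a hedged sketch, a small helper could strip them before a row is appended. `clean_cell` is a hypothetical name, and the character ranges below are assumed to match the set openpyxl rejects, so verify them against the installed version.

```python
import re

# ASCII control characters other than tab, newline, and carriage return
# (assumed to be the characters openpyxl treats as illegal in cell text)
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def clean_cell(value):
    """Return value with control characters removed; non-strings pass through."""
    if isinstance(value, str):
        return _CONTROL_CHARS.sub('', value)
    return value


print(repr(clean_cell('good\x00 book\n')))
```

In process_item, each field could be passed through clean_cell before the row is handed to sheet.append.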
Updating the global settings
Edit douban_comments/settings.py to enable the pipeline and set a request delay:
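BOT_NAME = 'douban_comments'
SPIDER_MODULES = ['douban_comments.spiders']
NEWSPIDER_MODULE = 'douban_comments.spiders'
ROBOTSTXT_OBEY = False  # do not consult the site's robots.txt
DOWNLOAD_DELAY = 1  # base delay between requests, in seconds
# Enable the Excel pipeline; the number (0-1000) sets its order among pipelines
ITEM_PIPELINES = {
    'douban_comments.pipelines.ExcelPipeline': 300,
}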
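With Scrapy's default RANDOMIZE_DOWNLOAD_DELAY = True, a DOWNLOAD_DELAY of 1 makes each request wait a random interval between 0.5 and 1.5 seconds rather than exactly one second. If a gentler crawl is wanted, the fragment below is one possible addition to settings.py. The setting names are Scrapy's own, but the values are illustrative assumptions and not part of the original project.

```python
# Optional politeness settings for settings.py (values are illustrative)
DOWNLOAD_DELAY = 1                  # base delay in seconds
RANDOMIZE_DOWNLOAD_DELAY = True     # Scrapy default: wait 0.5x to 1.5x the base delay
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 1        # initial AutoThrottle delay in seconds
```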
Running the spider
scrapy crawl comments
When the crawl finishes, an Excel file named douban_short_comments.xlsx is created in the directory where the command was run.