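Scraping Python Job Listings from Lagou with Scrapy
Project Goals and Data Source Analysis
This project uses the Scrapy framework to collect job postings related to "Python" from Lagou (拉勾网), covering key fields such as job title, salary range, work location, experience requirements, and company size. Because the target site loads its content dynamically, the interaction between front end and back end has to be analysed before the data can be retrieved reliably.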
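Page Structure and Request Flow
After searching for "Python" in the browser, the initial URL is https://www.lagou.com/jobs/list_python. The page content is not rendered as static HTML; the job list is loaded asynchronously via Ajax.
Open the developer tools, switch to the Network panel, and filter by XHR requests to locate the core endpoint: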
POST https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false
This endpoint returns job summaries in JSON, including each posting's positionId and showId, which can be used to build the detail page link:
https://www.lagou.com/jobs/{positionId}.html?show={showId}
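The URL template above can be filled in with a small helper. This is a minimal sketch; the function name `build_detail_url` and the sample identifiers are hypothetical and only illustrate the substitution.

```python
from urllib.parse import quote


def build_detail_url(position_id, show_id):
    """Fill the detail-page URL template with one posting's identifiers."""
    # quote() guards against characters that are not URL-safe in show_id
    return f"https://www.lagou.com/jobs/{position_id}.html?show={quote(str(show_id))}"


# Hypothetical identifiers, for illustration only
print(build_detail_url(1234567, "abc123"))
```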
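A closer look at the POST parameters:
- first: boolean flag indicating whether this is the first request (true for page 1)
- pn: current page number
- kd: search keyword, e.g. "python"
- sid: session ID, which must be sent when requesting subsequent pages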
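The parameter list above could be turned into a form body roughly as follows. This is a sketch under the assumption that all values are sent as strings in a URL-encoded form; `build_form` is a hypothetical helper name and the `sid` value is made up.

```python
from urllib.parse import urlencode


def build_form(page, keyword="python", sid=None):
    """Build the POST form for one result page."""
    form = {
        "first": "true" if page == 1 else "false",  # only page 1 is a "first" request
        "pn": str(page),
        "kd": keyword,
    }
    if sid is not None:
        form["sid"] = sid  # carried on later pages once the server has issued it
    return form


print(urlencode(build_form(1)))
print(urlencode(build_form(2, sid="abc123")))
```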
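If this endpoint is requested directly, the server responds with an anti-scraping message: "操作太频繁,请稍后再试" ("Too many requests, please try again later"). The crawler therefore has to mimic the complete user flow: first visit a regular page to establish the session cookie, then send the POST request to fetch the data.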
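Outside Scrapy (whose cookie middleware carries cookies between requests automatically), the two-step flow can be sketched with the standard library alone. The helper names below are hypothetical, the User-Agent is a placeholder, and whether the site accepts such a client is not something this sketch can guarantee.

```python
import json
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

LIST_URL = "https://www.lagou.com/jobs/list_python"
AJAX_URL = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"


def make_session():
    """Return an opener that keeps cookies across requests, plus its cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]  # placeholder value
    return opener, jar


def build_ajax_request(keyword, page):
    """Build the POST request for one page of results (no network access here)."""
    body = urllib.parse.urlencode(
        {"first": "true" if page == 1 else "false", "pn": str(page), "kd": keyword}
    ).encode("utf-8")
    headers = {
        "Referer": LIST_URL,
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    }
    return urllib.request.Request(AJAX_URL, data=body, headers=headers, method="POST")


def fetch_first_page(keyword="python"):
    """Run both steps. This performs real network requests when called."""
    opener, _jar = make_session()
    opener.open(LIST_URL, timeout=10).read()  # step 1: obtain the session cookie
    with opener.open(build_ajax_request(keyword, 1), timeout=10) as resp:  # step 2
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.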
Detail Page Extraction Strategy
On the job detail page, the required fields are located with XPath. For example:
- Job title: //div[@class="position-content-l"]/div/h1/text()
- Salary: //dd[@class="job_request"]//span[1]/text()
- City: //dd[@class="job_request"]//span[2]/text()
- Experience: //dd[@class="job_request"]//span[3]/text()
- Education: //dd[@class="job_request"]//span[4]/text()
- Company name: //dl[@class="job_company"]//img/@alt
All of this data is present in the page DOM, so it can be parsed without any further API calls.
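To try the selectors without a live page, the same paths can be run against a small hand-written snippet. The sketch below swaps in the standard library's ElementTree (which supports only a subset of XPath and needs well-formed markup) instead of Scrapy's selectors, and the sample markup is an assumption about the page layout, not a copy of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup shaped like the paths listed above
SAMPLE = """
<body>
  <div class="position-content-l"><div><h1>Python Engineer</h1></div></div>
  <dd class="job_request">
    <h3>
      <span>15k-25k</span>
      <span>/Beijing /</span>
      <span>3-5 years /</span>
      <span>Bachelor /</span>
    </h3>
  </dd>
</body>
"""


def clean(text):
    """Trim whitespace and the '/' separators around a field value."""
    return text.strip().strip("/").strip()


root = ET.fromstring(SAMPLE)
title = root.find(".//div[@class='position-content-l']/div/h1").text
spans = [s.text or "" for s in root.findall(".//dd[@class='job_request']//span")]
fields = dict(zip(["salary", "city", "experience", "degree"], map(clean, spans)))
fields["job_title"] = title
print(fields)
```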
Scrapy Spider Implementation
Data model (items.py)
import scrapy


class JobItem(scrapy.Item):
    # Job fields
    job_url = scrapy.Field()
    job_title = scrapy.Field()
    salary = scrapy.Field()
    city = scrapy.Field()
    district = scrapy.Field()
    experience = scrapy.Field()
    degree = scrapy.Field()
    tags = scrapy.Field()
    publish_time = scrapy.Field()
    # Company fields
    company_name = scrapy.Field()
    company_nature = scrapy.Field()
    financing_stage = scrapy.Field()
    company_scale = scrapy.Field()
Main spider logic (spiders/lagou_spider.py)
import json

import scrapy

from ..items import JobItem


class LagouJobSpider(scrapy.Spider):
    name = 'lagou_job'
    allowed_domains = ['lagou.com']
    start_urls = ['https://www.lagou.com/jobs/list_Python']

    # Headers that mimic the browser's XHR request
    custom_headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://www.lagou.com/jobs/list_Python',
        'X-Requested-With': 'XMLHttpRequest'
    }

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own initialisation first
        super().__init__(*args, **kwargs)
        self.session_id = None
        self.page_count = 0

    def parse(self, response):
        # The first request only obtains the cookie; continue with the Ajax endpoint
        yield scrapy.FormRequest(
            url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false',
            formdata={'first': 'true', 'pn': '1', 'kd': 'python'},
            headers=self.custom_headers,
            callback=self.parse_ajax_response
        )

    def parse_ajax_response(self, response):
        try:
            data = json.loads(response.text)
            if not data.get('success'):
                self.logger.warning(f"Request failed: {data.get('msg')}")
                return
            content = data['content']
            page_no = content['pageNo']
            self.page_count = page_no
            self.logger.info(f"Crawling page {page_no}")

            # Store the session ID (showId) from the first response
            if not self.session_id:
                self.session_id = content.get('showId')

            # Iterate over the position IDs and request each detail page
            for pos_id in content['hrInfoMap'].keys():
                item = JobItem()
                item['job_url'] = f"https://www.lagou.com/jobs/{pos_id}.html?show={self.session_id}"
                yield scrapy.Request(
                    item['job_url'],
                    callback=self.parse_detail,
                    headers=self.custom_headers,
                    meta={'item': item}
                )

            # Pagination: stop after page 30
            if page_no < 30:
                yield scrapy.FormRequest(
                    url='https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false',
                    formdata={
                        'first': 'false',
                        'pn': str(page_no + 1),
                        'kd': 'python',
                        'sid': self.session_id
                    },
                    headers=self.custom_headers,
                    callback=self.parse_ajax_response
                )
        except Exception as e:
            self.logger.error(f"Parse error: {e}")

    def parse_detail(self, response):
        item = response.meta['item']

        # Basic job information
        info_block = response.xpath('//div[@class="position-content-l"]')
        item['job_title'] = info_block.xpath('.//h1/text()').get()
        request_spans = info_block.xpath('.//dd[@class="job_request"]//span/text()').getall()
        if len(request_spans) >= 4:
            # The spans carry "/" separators, so trim them along with whitespace
            item['salary'] = request_spans[0].strip()
            item['city'] = request_spans[1].strip().strip('/').strip()
            item['experience'] = request_spans[2].strip().strip('/').strip()
            item['degree'] = request_spans[3].strip().strip('/').strip()

        # Join the address text nodes and drop spaces and newlines
        item['district'] = ''.join(response.xpath('//div[@class="work_addr"]//text()').getall()).replace(' ', '').replace('\n', '').strip()
        item['tags'] = response.xpath('//ul[@class="position-label clearfix"]/li/text()').getall()
        raw_time = response.xpath('//p[@class="publish_time"]/text()').get()
        # Keep only the part before the separator; an &nbsp; entity arrives decoded as '\xa0'
        item['publish_time'] = raw_time.replace('\xa0', '&').split('&')[0].strip() if raw_time else None

        # Company information
        company_block = response.xpath('//dl[@class="job_company"]')
        item['company_name'] = company_block.xpath('.//img/@alt').get()
        features = company_block.xpath('.//li[@class="c_feature"]/h4[@class="c_feature_name"]/text()').getall()
        if len(features) >= 4:
            item['company_nature'] = features[0]
            item['financing_stage'] = features[1]
            item['company_scale'] = features[3]

        yield item
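The JSON handling in parse_ajax_response can be exercised offline with a hand-made payload. The sketch below assumes only the keys the spider itself reads (success, content.pageNo, content.showId, content.hrInfoMap); the sample values and the helper names `extract_detail_urls` and `next_page` are hypothetical.

```python
import json

# Hypothetical payload containing just the keys the spider reads
SAMPLE = json.dumps({
    "success": True,
    "content": {
        "pageNo": 1,
        "showId": "abc123",
        "hrInfoMap": {"1001": {}, "1002": {}},
    },
})


def extract_detail_urls(text):
    """Return one detail-page URL per position ID in the response."""
    data = json.loads(text)
    if not data.get("success"):
        return []
    content = data["content"]
    show_id = content.get("showId")
    return [
        f"https://www.lagou.com/jobs/{pos_id}.html?show={show_id}"
        for pos_id in content["hrInfoMap"]
    ]


def next_page(page_no, max_pages=30):
    """Return the next page number, or None once the page limit is reached."""
    return page_no + 1 if page_no < max_pages else None


print(extract_detail_urls(SAMPLE))
print(next_page(1), next_page(30))
```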
Middleware: random User-Agent (middlewares.py)
from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotatingUA(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        super().__init__(user_agent)
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # setdefault keeps a User-Agent that was set explicitly on the request
        request.headers.setdefault('User-Agent', self.ua.random)
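If depending on fake_useragent is undesirable, a small static pool is one possible alternative. The sketch below is not part of the original project; the list contents and the name `pick_user_agent` are illustrative assumptions.

```python
import random

# Illustrative pool; replace with User-Agent strings you actually want to send
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_user_agent(rng=random):
    """Choose one User-Agent from the pool; pass a seeded Random for repeatable tests."""
    return rng.choice(FALLBACK_USER_AGENTS)


print(pick_user_agent())
```

Inside the middleware, `self.ua.random` could then be replaced by a call to `pick_user_agent()`.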
Persistence pipeline (pipelines.py)
import json


class JsonLinesPipeline:
    def open_spider(self, spider):
        # One JSON object per line; UTF-8 so Chinese text is stored as-is
        self.file = open('lagou_jobs.jsonl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item
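The JSON Lines format written by the pipeline can be read back one line at a time. The following sketch round-trips two made-up records through an in-memory buffer instead of lagou_jobs.jsonl; the helper names and sample records are hypothetical.

```python
import io
import json


def write_jsonl(items, fh):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    for item in items:
        fh.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Parse each non-empty line as an independent JSON document."""
    return [json.loads(line) for line in fh if line.strip()]


# Made-up records, for illustration only
buf = io.StringIO()
write_jsonl(
    [
        {"job_title": "Python工程师", "city": "北京"},
        {"job_title": "Backend Developer", "city": "上海"},
    ],
    buf,
)
buf.seek(0)
rows = read_jsonl(buf)
print(rows)
```

Because every line is a complete document, a partially written file from an interrupted crawl still yields all records up to the last complete line.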
Project settings (settings.py)
# Anti-scraping settings
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.3
RANDOMIZE_DOWNLOAD_DELAY = True

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotatingUA': 100,
}

# Item pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.JsonLinesPipeline': 300,
}

# Log level
LOG_LEVEL = 'INFO'
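With RANDOMIZE_DOWNLOAD_DELAY enabled, Scrapy waits a random interval between 0.5 and 1.5 times DOWNLOAD_DELAY, so a value of 0.3 gives roughly 0.15 to 0.45 seconds per request. If that proves too aggressive, the built-in AutoThrottle and retry settings are one option. The values below are illustrative assumptions, not tuned for this site.

```python
# Optional additions to settings.py (illustrative values)

# Adapt the delay to observed latency; DOWNLOAD_DELAY remains the lower bound
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0

# Retry failed requests a limited number of times
RETRY_TIMES = 3

# Already Scrapy's default; stated explicitly because the crawl relies on the session cookie
COOKIES_ENABLED = True
```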
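Running the Spider and Results
Start the spider with:
scrapy crawl lagou_job
The program establishes the session, pages through the results, and parses each detail page automatically, writing the results line by line to lagou_jobs.jsonl. The site's anti-scraping measures usually block the crawler after 5 to 6 pages; combining the spider with an IP proxy pool can improve stability further.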
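A proxy pool can be wired in through a downloader middleware that sets request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honours. The sketch below is an assumption about how that might look: the class name, the PROXY_POOL setting, and the proxy addresses are all hypothetical.

```python
import random


class RotatingProxyMiddleware:
    """Downloader middleware sketch: assign a random proxy to every request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_POOL is a hypothetical custom setting holding proxy URLs
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let the request continue through the middleware chain


# Stand-in for a Scrapy request so the sketch runs without Scrapy installed
class FakeRequest:
    def __init__(self):
        self.meta = {}


mw = RotatingProxyMiddleware(["http://127.0.0.1:8080", "http://127.0.0.1:8081"])
req = FakeRequest()
mw.process_request(req, spider=None)
print(req.meta["proxy"])
```

To use it in the project, the class would be registered in DOWNLOADER_MIDDLEWARES alongside RotatingUA, with PROXY_POOL defined in settings.py.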