
Scraping Python Job Listings from Lagou with the Scrapy Framework

Guest Technology June 9, 2026 96

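Project Goal and Data Source Analysis

This project uses the Scrapy framework to collect Python-related job postings from Lagou (拉勾网), including key fields such as job title, salary range, work location, experience requirement, and company size. Because the target site loads its content dynamically, the interaction between front end and back end has to be analyzed before the data can be retrieved reliably.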

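Page Structure and Request Flow

After searching for "Python" in the browser, the initial URL is https://www.lagou.com/jobs/list_python. The page content is not rendered directly as static HTML; the job list is loaded asynchronously via Ajax.

Open the developer tools, switch to the Network panel, and filter by XHR requests to locate the core endpoint: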

POST https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false

This endpoint returns job summaries as JSON, including each posting's positionId and a showId, which can be used to build the detail page link:

https://www.lagou.com/jobs/{positionId}.html?show={showId}

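A closer look at the POST parameters:

first: boolean flag marking the first request (true for page 1)
pn: current page number
kd: search keyword, e.g. "python"
sid: session ID, which must be sent when requesting subsequent pages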
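
As a minimal sketch of how these four parameters fit together, the helper below builds the form body for a given page. The function name build_form_data is hypothetical, and the rule "first is true only on page 1" is taken from the parameter description above; form values are kept as strings because that is what an HTML form submission carries.

```python
from typing import Dict, Optional


def build_form_data(page: int, keyword: str, sid: Optional[str] = None) -> Dict[str, str]:
    """Build the POST body for one results page (all values are strings)."""
    form = {
        "first": "true" if page == 1 else "false",  # only page 1 is the first request
        "pn": str(page),
        "kd": keyword,
    }
    if sid:  # carried on later pages once a session ID is known
        form["sid"] = sid
    return form


print(build_form_data(1, "python"))
print(build_form_data(2, "python", sid="abc123"))  # "abc123" is a made-up session ID
```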

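Requesting this endpoint directly makes the server return an anti-crawling notice: "操作太频繁，请稍后再试" ("Too many requests, please try again later"). The crawler therefore has to reproduce the full user flow: first load the search page to establish the session cookies, then send the POST request to fetch the data.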
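
One way to react to that notice is to check each response body before parsing it. The sketch below assumes the JSON carries the success and msg keys that the spider code in this article reads; the helper name is_blocked and the sample bodies in the usage lines are hypothetical.

```python
import json

BLOCK_HINT = "操作太频繁"  # leading fragment of the throttling message quoted above


def is_blocked(body: str) -> bool:
    """Return True when a response looks like the throttling notice, not job data."""
    try:
        data = json.loads(body)
    except ValueError:
        return True  # not JSON at all: treat it as a blocked or error page
    if not isinstance(data, dict):
        return True
    if BLOCK_HINT in str(data.get("msg", "")):
        return True
    return not data.get("success", False)


print(is_blocked('{"success": false, "msg": "操作太频繁，请稍后再试"}'))
print(is_blocked('{"success": true, "content": {}}'))
```

A spider could use such a check to pause or re-establish the session instead of continuing to page through results.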

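Detail Page Extraction Strategy

On the job detail page, the required fields are located with XPath. For example:

Job title: //div[@class="position-content-l"]/div/h1/text()
Salary: //dd[@class="job_request"]//span[1]/text()
City: //dd[@class="job_request"]//span[2]/text()
Experience requirement: //dd[@class="job_request"]//span[3]/text()
Education requirement: //dd[@class="job_request"]//span[4]/text()
Company name: //dl[@class="job_company"]//img/@alt

All of this data is present in the page DOM, so it can be parsed without any additional API calls.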
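
To illustrate how these paths map onto fields, here is a small sketch that runs equivalent lookups against a hand-written fragment. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and needs well-formed markup, so it is a stand-in for Scrapy's selectors rather than a replacement. The sample markup and its values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for a detail page (hypothetical values).
SAMPLE = """
<html><body>
  <div class="position-content-l">
    <div><h1>Python Engineer</h1></div>
    <dd class="job_request">
      <span> 15k-25k </span><span>/Beijing /</span>
      <span>/3-5 years /</span><span>/Bachelor /</span>
    </dd>
  </div>
  <dl class="job_company"><img alt="Example Co." src="logo.png" /></dl>
</body></html>
"""


def extract_fields(markup: str) -> dict:
    root = ET.fromstring(markup.strip())
    spans = root.findall('.//dd[@class="job_request"]//span')
    # Trim whitespace and the "/" separators that surround each value.
    values = [(s.text or "").strip().strip("/").strip() for s in spans]
    fields = dict(zip(["salary", "city", "experience", "degree"], values))
    fields["job_title"] = root.findtext('.//div[@class="position-content-l"]/div/h1')
    img = root.find('.//dl[@class="job_company"]//img')
    fields["company_name"] = img.get("alt") if img is not None else None
    return fields


print(extract_fields(SAMPLE))
```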

Scrapy Spider Implementation

Defining the Data Model (items.py)

import scrapy

class JobItem(scrapy.Item):
    job_url = scrapy.Field()
    job_title = scrapy.Field()
    salary = scrapy.Field()
    city = scrapy.Field()
    district = scrapy.Field()
    experience = scrapy.Field()
    degree = scrapy.Field()
    tags = scrapy.Field()
    publish_time = scrapy.Field()
    company_name = scrapy.Field()
    company_nature = scrapy.Field()
    financing_stage = scrapy.Field()
    company_scale = scrapy.Field()

Main Spider Logic (spiders/lagou_spider.py)

import scrapy
import json
from ..items import JobItem

class LagouJobSpider(scrapy.Spider):
    name = 'lagou_job'
    allowed_domains = ['lagou.com']
    start_urls = ['https://www.lagou.com/jobs/list_Python']

    custom_headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://www.lagou.com/jobs/list_Python',
        'X-Requested-With': 'XMLHttpRequest'
    }

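    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own setup (name, start_urls, -a arguments)
        super().__init__(*args, **kwargs)
        self.session_id = None
        self.page_count = 0

    def parse(self, response):
        # The first request only obtains cookies; continue with the Ajax endpoint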
        yield scrapy.FormRequest(
            url='https://www.lagou.com/jobs/positionAjax.json',
            formdata={'first': 'true', 'pn': '1', 'kd': 'python'},
            headers=self.custom_headers,
            callback=self.parse_ajax_response
        )

    def parse_ajax_response(self, response):
        try:
            data = json.loads(response.text)
            if not data.get('success'):
                self.logger.warning(f"Request failed: {data.get('msg')}")
                return

            content = data['content']
            page_no = content['pageNo']
            self.page_count = page_no
            self.logger.info(f"Scraping page {page_no}")

            # Extract the session ID
            if not self.session_id:
                self.session_id = content.get('showId')

            # Iterate over the postings and request each detail page
            for pos_id in content['hrInfoMap'].keys():
                item = JobItem()
                item['job_url'] = f"https://www.lagou.com/jobs/{pos_id}.html?show={self.session_id}"
                yield scrapy.Request(
                    item['job_url'],
                    callback=self.parse_detail,
                    headers=self.custom_headers,
                    meta={'item': item}
                )

            # Pagination logic
            if page_no < 30:
                yield scrapy.FormRequest(
                    url='https://www.lagou.com/jobs/positionAjax.json',
                    formdata={
                        'first': 'false',
                        'pn': str(page_no + 1),
                        'kd': 'python',
                        'sid': self.session_id
                    },
                    headers=self.custom_headers,
                    callback=self.parse_ajax_response
                )
        except Exception as e:
            self.logger.error(f"Parsing error: {e}")

    def parse_detail(self, response):
        item = response.meta['item']

        # Basic information
        info_block = response.xpath('//div[@class="position-content-l"]')
        item['job_title'] = info_block.xpath('.//h1/text()').get()
        request_spans = info_block.xpath('.//dd[@class="job_request"]//span/text()').getall()
        if len(request_spans) >= 4:
            item['salary'] = request_spans[0].strip()
            item['city'] = request_spans[1].strip().strip('/')
            item['experience'] = request_spans[2].strip().strip('/')
            item['degree'] = request_spans[3].strip().strip('/')

        item['district'] = ''.join(response.xpath('//div[@class="work_addr"]//text()').getall()).replace(' ', '').replace('\n','').strip()

        item['tags'] = response.xpath('//ul[@class="position-label clearfix"]/li/text()').getall()
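        raw_time = response.xpath('//p[@class="publish_time"]/text()').get()
        # Entities such as &nbsp; arrive decoded as '\xa0', so split on that instead of '&'
        item['publish_time'] = raw_time.split('\xa0')[0].strip() if raw_time else None

        # Company information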
        company_block = response.xpath('//dl[@class="job_company"]')
        item['company_name'] = company_block.xpath('.//img/@alt').get()
        features = company_block.xpath('.//li[@class="c_feature"]/h4[@class="c_feature_name"]/text()').getall()
        if len(features) >= 4:
            item['company_nature'] = features[0]
            item['financing_stage'] = features[1]
            item['company_scale'] = features[3]

        yield item
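
The response handling in parse_ajax_response can be tried out offline. The sketch below mirrors how the spider reads pageNo, showId, and the keys of hrInfoMap, then builds the detail URLs. The payload is a hypothetical sample containing only those keys; a real response carries much more.

```python
import json
from typing import List, Tuple

# Hypothetical payload: only the keys the spider reads are included.
SAMPLE_BODY = json.dumps({
    "success": True,
    "content": {
        "pageNo": 1,
        "showId": "abc123",
        "hrInfoMap": {"1001": {}, "1002": {}},
    },
})


def detail_urls(body: str) -> Tuple[int, List[str]]:
    """Return (page number, detail-page URLs) for one Ajax response body."""
    content = json.loads(body)["content"]
    show_id = content["showId"]
    urls = [
        f"https://www.lagou.com/jobs/{pos_id}.html?show={show_id}"
        for pos_id in content["hrInfoMap"]  # keys are the position IDs
    ]
    return content["pageNo"], urls


page, urls = detail_urls(SAMPLE_BODY)
print(page, urls)
```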

Middleware Configuration: Random User-Agent (middlewares.py)

from fake_useragent import UserAgent
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotatingUA(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        super().__init__()
        self.ua = UserAgent()

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', self.ua.random)
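
Note that setdefault only fills in the header when the request does not already carry one. The sketch below shows that behavior with a plain dict and a small static pool, as an alternative for cases where the fake_useragent dependency is not wanted. The pool contents and the name apply_user_agent are assumptions made for illustration.

```python
import random

# Hypothetical fallback pool; swap in strings that suit your own traffic.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def apply_user_agent(headers: dict, pool=FALLBACK_USER_AGENTS, rng=random) -> dict:
    """Set User-Agent only if it is absent, like headers.setdefault in the middleware."""
    headers.setdefault("User-Agent", rng.choice(pool))
    return headers


print(apply_user_agent({}))
print(apply_user_agent({"User-Agent": "custom-agent/1.0"}))  # left untouched
```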

Data Persistence Pipeline (pipelines.py)

import json

class JsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open('lagou_jobs.jsonl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item
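
Because the pipeline writes one JSON object per line, the output can be read back record by record without loading the whole file. A minimal reader, assuming the lagou_jobs.jsonl layout produced above (the function name read_jsonl is hypothetical):

```python
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For example, `for job in read_jsonl('lagou_jobs.jsonl'): print(job.get('job_title'))` prints each stored title.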

Project Settings (settings.py)

# Anti-crawling related settings
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.3
RANDOMIZE_DOWNLOAD_DELAY = True

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotatingUA': 100,
}

# Item pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.JsonLinesPipeline': 300,
}

# Log level
LOG_LEVEL = 'INFO'
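
With RANDOMIZE_DOWNLOAD_DELAY enabled, Scrapy waits a random time between 0.5 and 1.5 times DOWNLOAD_DELAY between requests to the same site. The short calculation below shows what that means for the 0.3 second value used here; the helper name delay_range is made up for the example.

```python
from typing import Tuple


def delay_range(download_delay: float) -> Tuple[float, float]:
    """Bounds of the randomized wait when RANDOMIZE_DOWNLOAD_DELAY is on."""
    return 0.5 * download_delay, 1.5 * download_delay


low, high = delay_range(0.3)
print(f"{low:.2f}s to {high:.2f}s")  # prints "0.15s to 0.45s"
```

If blocks appear early, raising DOWNLOAD_DELAY widens this window proportionally.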

Running and Results

Start the spider with:

scrapy crawl lagou_job

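The program then runs the whole flow automatically: establishing the session, paging through the results, and parsing the detail pages, writing each result as one line to lagou_jobs.jsonl. Anti-crawling limits still apply (a block is typically triggered after 5 to 6 pages), but adding an IP proxy pool can improve stability further.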
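
As a rough sketch of the proxy pool idea, a downloader middleware can assign a proxy per request through request.meta['proxy'], which is the key Scrapy reads the proxy from. The class name, the round-robin policy, and the proxy addresses below are assumptions for illustration; the stand-in request object only exists so the example runs without Scrapy installed.

```python
import itertools
from types import SimpleNamespace


class RoundRobinProxyMiddleware:
    """Downloader-middleware sketch: hand out proxies in round-robin order."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def process_request(self, request, spider):
        request.meta["proxy"] = next(self._cycle)
        return None  # let Scrapy continue processing the request


# Hypothetical local proxy addresses, used only to exercise the class.
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]
mw = RoundRobinProxyMiddleware(PROXIES)
seen = []
for _ in range(3):
    req = SimpleNamespace(meta={})  # stand-in for scrapy.Request
    mw.process_request(req, spider=None)
    seen.append(req.meta["proxy"])
print(seen)
```

In a real project the class would be registered in DOWNLOADER_MIDDLEWARES and the proxy list loaded from settings or an external service.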

Tags: Scrapy, Python crawler, Ajax data scraping

Copyright © agingcoder.cn
