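Python offers a comprehensive and powerful tool ecosystem for web crawler development, and it handles everything from simple data collection tasks to complex distributed crawling systems. Beginners are advised to start with Requests and BeautifulSoup, then move on to frameworks such as Scrapy and to asynchronous programming once the basics are solid. Above all, keep the ethical and legal boundaries of crawling in mind and be a responsible network citizen. Crawling technology delivers real value only when it is used lawfully and in compliance with site rules.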

nfsto00908 · Published 2025-09-11 13:28:02

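A web crawler is a program that automatically collects data from the internet. With its rich library ecosystem and concise syntax, Python has become the language of choice for crawler development. This article walks through how to build efficient, compliant web crawlers in Python.

I. Crawler Basics and How Crawlers Work

A web crawler is essentially an automated program that mimics how a person browses web pages, but collects information more efficiently and more systematically. Its basic workflow is:

Send an HTTP request: issue a GET or POST request to the target server
Receive the response: accept the HTML, JSON, or XML data the server returns
Parse the content: extract the required information from the returned data
Store the data: save the extracted information to a file or database
Follow links (optional): discover and follow new links to continue crawling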
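
The five steps above can be sketched as a small crawl loop. This is a minimal sketch, not a production crawler: the names `crawl`, `LinkAndTitleParser`, and `fetch_with_urllib` are hypothetical and introduced only for illustration, and the standard-library `html.parser` stands in for the parsing libraries covered later. Passing `fetch` as a parameter is an assumption made so the loop can be exercised without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every <a href> value on one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_with_urllib(url):
    """Steps 1-2: send a GET request and return the decoded response body."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_with_urllib, max_pages=10):
    """Breadth-first crawl that returns {url: page title} for visited pages."""
    queue, seen, results = deque([start_url]), {start_url}, {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        page = fetch(url)                     # steps 1-2: request and response
        parser = LinkAndTitleParser()
        parser.feed(page)                     # step 3: parse
        results[url] = parser.title.strip()   # step 4: store (in memory here)
        for href in parser.links:             # step 5: follow links
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return results
```

A real crawler would usually also restrict the queue to the starting host and pause between requests; both are left out here to keep the five steps visible.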

II. The Python Crawler Technology Stack

1. Choosing a Request Library

Requests - a simple, easy-to-use HTTP library

python

import requests

response = requests.get('https://example.com', timeout=10)
print(response.status_code)  # 200
print(response.text)  # HTML content

urllib3 - a powerful HTTP client

python

import urllib3

http = urllib3.PoolManager()
response = http.request('GET', 'https://example.com')
print(response.data.decode('utf-8'))

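2. Comparing Parsing Libraries

BeautifulSoup - beginner-friendly and simple to parse with

python

from bs4 import BeautifulSoup

# html_content is assumed to hold HTML text fetched earlier, e.g. response.text
soup = BeautifulSoup(html_content, 'html.parser')
titles = soup.find_all('h1', class_='title')

lxml - excellent performance, with XPath support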

python

from lxml import html

tree = html.fromstring(html_content)
titles = tree.xpath('//h1[@class="title"]/text()')

3. Full Crawler Frameworks

Scrapy - a professional-grade crawler framework

bash

pip install scrapy
scrapy startproject myproject

III. Hands-On Crawler Examples

Example 1: A Basic Static-Page Crawler

python

import requests
from bs4 import BeautifulSoup
import csv
import time

def basic_crawler(url, output_file):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }

    try:
        # Send the request
        response = requests.get(url, headers=headers, timeout=15)
        response.encoding = 'utf-8'
        response.raise_for_status()

        # Parse the content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract data - assume we want every article title and link
        articles = []
        for item in soup.select('.article-list .item'):
            title_tag = item.select_one('.title')
            link_tag = item.select_one('a[href]')
            if title_tag is None or link_tag is None:
                continue  # skip items that lack a title or a link
            title = title_tag.get_text().strip()
            link = link_tag['href']
            articles.append({'title': title, 'link': link})

        # Save the data
        with open(output_file, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['title', 'link'])
            writer.writeheader()
            writer.writerows(articles)

        print(f"Successfully crawled {len(articles)} records")

        # Follow crawler etiquette: pause before any further request
        time.sleep(2)

    except Exception as e:
        print(f"Error while crawling: {e}")

# Run the crawler
basic_crawler('https://news.example.com', 'news_data.csv')

Example 2: Handling Dynamic Content (with Selenium)

python

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def dynamic_content_crawler(url):
    # Configure headless browser options
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')

    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # Wait until a specific element has loaded
        wait = WebDriverWait(driver, 10)
        element = wait.until(
            EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
        )

        # Get the page source after rendering
        page_source = driver.page_source

        # Parse with BeautifulSoup
        soup = BeautifulSoup(page_source, 'html.parser')
        # ... data extraction logic

    finally:
        driver.quit()

# Usage example
dynamic_content_crawler('https://example.com/dynamic-page')

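IV. Dealing with Anti-Crawler Measures

Modern websites often deploy a range of anti-crawler techniques. Common ways to handle them include:

User-Agent Rotation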

python

import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    # More User-Agent strings
]

headers = {'User-Agent': random.choice(user_agents)}

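IP Proxy Pool

python

import requests

# A single proxy configuration; a pool would hold several of these and rotate
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

requests.get('http://example.org', proxies=proxies, timeout=10)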
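
The snippet above configures one proxy, while the heading refers to a pool. A minimal sketch of round-robin rotation follows, assuming a list of proxy URLs is already available; the `ProxyPool` class and its method names are hypothetical, and the addresses used with it are placeholders, not working proxies.

```python
import itertools


class ProxyPool:
    """Round-robin pool of proxy URLs that skips proxies marked as failed."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("proxy_urls must not be empty")
        self._proxies = list(proxy_urls)
        self._failed = set()
        self._cycle = itertools.cycle(self._proxies)

    def next_proxies(self):
        """Return a requests-style proxies dict for the next healthy proxy."""
        # Look at each proxy at most once per call so the loop always ends.
        for _ in range(len(self._proxies)):
            candidate = next(self._cycle)
            if candidate not in self._failed:
                return {"http": candidate, "https": candidate}
        raise RuntimeError("no healthy proxies left in the pool")

    def mark_failed(self, proxy_url):
        """Exclude a proxy, for example after a connection error or timeout."""
        self._failed.add(proxy_url)
```

A request would then pass the rotated entry along, for instance `requests.get(url, proxies=pool.next_proxies(), timeout=10)`, and call `mark_failed` when that proxy raises a connection error.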

Request Rate Control

python

import time
import random

# Random delay to avoid a regular request pattern
time.sleep(random.uniform(1, 3))
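
A fixed or random sleep does not distinguish between hosts. One possible refinement, sketched here under stated assumptions, is to enforce a minimum interval per host. The `DomainThrottle` class is hypothetical; its `clock` and `sleep` parameters are assumptions added so the timing logic can be checked without actually waiting.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed, record the request time, return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # The request is considered to start once the delay has elapsed.
        self._last_request[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` immediately before each request keeps requests to one host at least `min_interval` seconds apart, while requests to different hosts are not delayed by each other.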

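V. Data Storage Options

1. File Storage

python

# CSV file
# data is assumed to be a list of (title, link, date) rows collected earlier
import csv

with open('data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Link', 'Date'])
    writer.writerows(data)

# JSON file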
import json

with open('data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=2)

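2. Database Storage

python

# SQLite database
import sqlite3

conn = sqlite3.connect('data.db')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS articles
             (id INTEGER PRIMARY KEY, title TEXT, content TEXT)''')
# title and content are assumed to hold values extracted earlier.
# Name the columns explicitly: the table has three columns and id is assigned automatically.
c.execute("INSERT INTO articles (title, content) VALUES (?, ?)", (title, content))
conn.commit()
conn.close()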
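
Crawlers often see the same page more than once, so a batch insert that skips duplicates can be useful. The following is a minimal sketch, assuming each record is a `(title, link)` pair and that the link identifies a record; the `save_articles` function and the `link` column with its `UNIQUE` constraint are assumptions for illustration and differ from the `content` column used above.

```python
import sqlite3


def save_articles(conn, articles):
    """Insert (title, link) pairs, skipping links that are already stored.

    Returns the number of rows that were actually added.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               link TEXT NOT NULL UNIQUE
           )"""
    )
    before = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR IGNORE INTO articles (title, link) VALUES (?, ?)",
            articles,
        )
    after = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
    return after - before
```

Using `executemany` with `?` placeholders keeps the values out of the SQL string, and `INSERT OR IGNORE` relies on the `UNIQUE` constraint to drop repeated links silently.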

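VI. Legal and Ethical Considerations

Crawler development must follow these principles:

Respect robots.txt: follow the site's crawler rules
Control access frequency: avoid placing a burden on the target site
Identify permitted content: crawl only data that is allowed to be publicly accessed
Copyright awareness: respect intellectual property and do not misuse crawled content
User privacy: do not collect, store, or distribute personal information

python

# Check robots.txt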
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_fetch = rp.can_fetch('MyBot', 'https://example.com/target-page')
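
The same parser can also work on robots.txt text that has already been downloaded, and it exposes any `Crawl-delay` directive. The sketch below assumes a made-up robots.txt body (`SAMPLE_ROBOTS_TXT`) and a hypothetical helper `check_url`; only `RobotFileParser.parse`, `can_fetch`, and `crawl_delay` come from the standard library.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def check_url(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for a URL given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return allowed, delay
```

A crawler could skip any URL for which `allowed` is false and use `delay`, when present, as the minimum pause between requests to that site.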

VII. Debugging and Error Handling

A robust crawler needs thorough error handling:

python

import requests

url = 'https://example.com'  # placeholder target URL

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()

except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as err:
    print(f"HTTP error: {err}")
except requests.exceptions.RequestException as err:
    print(f"Request exception: {err}")
except Exception as err:
    print(f"Other error: {err}")
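
Printing an error is often not enough for transient failures such as timeouts; retrying with a growing delay is a common complement. The following is a minimal sketch of exponential backoff with jitter, not a technique the original article prescribes. The `retry_with_backoff` helper and all of its parameters are hypothetical, and `sleep` is injectable only so the delays can be inspected without waiting.

```python
import random
import time


def retry_with_backoff(func, retryable=(Exception,), max_attempts=4,
                       base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on the listed exceptions.

    The wait before retry number n (starting at 0) is
    base_delay * 2**n seconds plus up to 0.5 seconds of random jitter.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
```

With Requests, a call might look like `retry_with_backoff(lambda: requests.get(url, timeout=10), retryable=(requests.exceptions.Timeout, requests.exceptions.ConnectionError))`, so that HTTP errors such as 404 are not retried while timeouts and connection failures are.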