Scraping and Sentiment Analysis of Amazon Product Reviews: Hands-On with Python + BeautifulSoup (Including Anti-Blocking Strategies)

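This article presents an end-to-end technical approach to analyzing Amazon product reviews, organized into four core modules: 1) review scraping with Python's requests and BeautifulSoup, with a focus on handling anti-scraping measures and locating elements precisely; 2) a TextBlob-based sentiment analysis module, with an explanation of how the polarity score is used; 3) a Matplotlib visualization module that plots the rating distribution; 4) compliance requirements, including request-rate control and API usage rules. The approach runs from data collection through analysis and presentation, balances technical detail with legal compliance, and is suited to …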


万邦科技-Ace · Published 2025-05-30 16:30:56

1. Data Scraping Module (Python Example)

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'en-US'
}

def hook_text(review, tag, hook):
    # Return the stripped text of the element with this data-hook, or None if it is missing
    node = review.find(tag, {'data-hook': hook})
    return node.text.strip() if node else None

def scrape_amazon_reviews(product_id, max_pages=5):
    base_url = f"https://www.amazon.com/product-reviews/{product_id}"
    reviews = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}/?pageNumber={page}"
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            break  # stop on a non-200 response (e.g. blocked or throttled)
        soup = BeautifulSoup(response.text, 'html.parser')

        for review in soup.find_all('div', {'data-hook': 'review'}):
            # The rating text is expected to start with the number, e.g. "4.0 out of 5 stars"
            rating_text = hook_text(review, 'i', 'review-star-rating')
            reviews.append({
                'rating': float(rating_text.split()[0]) if rating_text else None,
                'title': hook_text(review, 'a', 'review-title'),
                'body': hook_text(review, 'span', 'review-body'),
                'date': hook_text(review, 'span', 'review-date')
            })
        time.sleep(2)  # throttle requests between pages

    return pd.DataFrame(reviews)

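Key points:

Replace product_id with the ASIN of the target product
time.sleep() spaces out requests, which lowers the chance of triggering rate-based blocking (it does not guarantee access)
The data-hook attributes are used to locate review elements precisely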
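The fixed pause between pages can be extended with retries that wait longer after each failed request. The following is a minimal sketch, not part of the original code: `fetch_with_backoff`, its parameters, and the `fetch` callable (assumed to return a `(status_code, body)` pair) are hypothetical names introduced for illustration.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it reports HTTP 200, waiting longer after each failure.

    `fetch` is a hypothetical callable that returns (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < max_retries:
            # Exponential backoff (2 s, 4 s, 8 s, ...) plus up to 1 s of random jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    return None  # every attempt failed
```

In the scraper, `fetch` could wrap `requests.get(url, headers=headers, timeout=10)` and return its `status_code` and `text`. Passing `sleep` as a parameter keeps the function easy to test without real delays.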

2. Sentiment Analysis Module (NLP Example)

from textblob import TextBlob

def analyze_sentiment(review_text):
    analysis = TextBlob(review_text)
    return {
        'polarity': analysis.sentiment.polarity,  # sentiment polarity (-1 to 1)
        'subjectivity': analysis.sentiment.subjectivity  # subjectivity (0 to 1)
    }

How the output is used:

Polarity > 0.3 is classified as a positive review
Polarity < -0.3 is classified as a negative review
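The two thresholds above can be applied with a small helper. This is a sketch under one assumption: the article does not name the band between -0.3 and 0.3, so the label 'neutral' and the function name `label_sentiment` are illustrative choices.

```python
def label_sentiment(polarity, threshold=0.3):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    # Assumption: scores between the two thresholds are treated as neutral
    return 'neutral'
```

For example, `label_sentiment(analyze_sentiment(text)['polarity'])` would label a single review, and the same function could be applied to a whole DataFrame column with `df['polarity'].apply(label_sentiment)`.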

3. Data Visualization (Matplotlib Example)

import matplotlib.pyplot as plt

def plot_rating_distribution(df):
    plt.figure(figsize=(8, 4))
    df['rating'].value_counts().sort_index().plot(kind='bar', color='#FF9900')
    plt.title('Amazon Review Rating Distribution')
    plt.xlabel('Star Rating')
    plt.ylabel('Count')
    plt.xticks(rotation=0)
    plt.show()
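One detail worth noting: `value_counts()` only lists ratings that actually occur, so a star level with no reviews would be absent from the bar chart. A minimal sketch of counting with explicit zero entries follows; `rating_counts` and its `levels` parameter are hypothetical names, and it assumes ratings are whole-star values such as 4.0.

```python
from collections import Counter


def rating_counts(ratings, levels=(1, 2, 3, 4, 5)):
    """Count reviews per star level, keeping levels that have no reviews as 0."""
    # Skip missing ratings (None) and convert values like 4.0 to 4
    counts = Counter(int(r) for r in ratings if r is not None)
    return {level: counts.get(level, 0) for level in levels}
```

The resulting dict keeps all five star levels in order, so it could be wrapped in a `pd.Series` and plotted the same way as above.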

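4. Compliance Notes

Follow Amazon's robots protocol (check /robots.txt)
Keep the request rate from a single IP at or below 2 requests per second
Commercial use should go through an official API (SP-API, which supersedes the legacy MWS)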
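The first two points can be sketched with the standard library alone. The example below is illustrative, not a definitive implementation: `build_robots_checker` and `Throttle` are hypothetical names, and the robots.txt rules in the usage note are made up for the example rather than taken from Amazon's actual file.

```python
import time
import urllib.robotparser


def build_robots_checker(robots_lines):
    """Parse robots.txt content (a list of lines) into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser


class Throttle:
    """Enforce a minimum interval between requests (default: at most 2 per second)."""

    def __init__(self, max_per_second=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A possible usage: build the checker from the lines of a downloaded robots.txt, call `checker.can_fetch(user_agent, url)` before each request, and call `throttle.wait()` right before sending it. With hypothetical rules `["User-agent: *", "Disallow: /private/"]`, a URL under `/private/` would be rejected and other paths allowed.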
