Web Scraping with Python: Building a Simple Web Scraper from Scratch


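With the rapid growth of the internet, data has become the "oil" of the new era. Web scrapers are an important tool for collecting data from the internet, and they play a major role in data analysis, market research, AI training, and other fields. This article explains in detail how to build a basic but fully functional web scraper in Python and demonstrates how it works through concrete code examples.

What Is a Web Scraper?

A web scraper is a program that fetches web pages automatically: it requests pages the way a browser would and extracts the data it needs. Scraping is widely used for search engine indexing, price monitoring, public opinion analysis, and more.

In Python, libraries such as requests, BeautifulSoup, and Scrapy make it quick to implement a scraper.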

Environment Setup

First, make sure the following Python packages are installed on your system:

pip install requests beautifulsoup4 lxml

requests: sends HTTP requests.
beautifulsoup4: parses HTML documents.
lxml: provides faster HTML parsing.
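Before moving on, it can help to confirm that the three packages are actually importable. The following is a minimal sketch using only the standard library; the helper name `check_packages` and the `REQUIRED` mapping are illustrative assumptions, not part of any of these libraries. Note that the beautifulsoup4 package is imported under the module name `bs4`.

```python
from importlib.util import find_spec

# Map each pip package name to the module name used in import statements.
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
}


def check_packages(required=REQUIRED):
    """Return {pip_name: True/False} depending on whether the module is importable."""
    return {pip_name: find_spec(module) is not None
            for pip_name, module in required.items()}


for pip_name, installed in check_packages().items():
    print(f"{pip_name}: {'OK' if installed else 'missing'}")
```

Any package reported as missing can be installed with the pip command shown earlier.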

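Project Goal

We will write a scraper that extracts product names and prices from a fictional product page and saves them to a CSV file.

Target URL: https://example.com/products (this is a placeholder address)

Step 1: Sending the HTTP Request

We start by using the requests module to send a GET request to the target page.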

import requests


def fetch_page(url):
    """Fetch a page and return its HTML text, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'
    }
    try:
        # A timeout keeps the request from hanging indefinitely.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        print(f"Failed to fetch page: {url}, status code: {response.status_code}")
        return None
    except requests.RequestException as e:
        print(f"Error fetching page: {e}")
        return None


# Example call
html_content = fetch_page("https://example.com/products")

Note: some websites inspect the user agent, so setting a reasonable User-Agent header helps avoid having requests rejected by the server.

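Step 2: Parsing the HTML

Next, we use BeautifulSoup to parse the HTML and extract the product information.

Assume the page is structured as follows:

<div class="product">
    <h3 class="name">Product A</h3>
    <span class="price">$99.99</span>
</div>

The corresponding parsing code is: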

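from bs4 import BeautifulSoup


def parse_products(html):
    """Extract the name and price of every product block in the page."""
    soup = BeautifulSoup(html, 'lxml')
    products = []
    for item in soup.find_all('div', class_='product'):
        name_tag = item.find('h3', class_='name')
        price_tag = item.find('span', class_='price')
        # Skip blocks missing either field instead of raising AttributeError.
        if name_tag is None or price_tag is None:
            continue
        products.append({
            'name': name_tag.get_text(strip=True),
            'price': price_tag.get_text(strip=True)
        })
    return products


# Example call
if html_content:
    product_list = parse_products(html_content)
    print(product_list)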
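The parser returns each price as a string such as "$99.99", which cannot be summed or compared numerically. One possible follow-up step is to normalize it into a Decimal. This is a sketch under stated assumptions: the helper `parse_price` is hypothetical, and it assumes US-style formatting (comma as thousands separator, period as decimal point).

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert a price string such as '$1,299.00' to Decimal, or return None."""
    # Keep digits and the decimal point; drop currency symbols, commas, spaces.
    cleaned = re.sub(r"[^0-9.]", "", text)
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Malformed input such as "1.2.3" ends up here.
        return None


print(parse_price("$99.99"))
print(parse_price("$1,299.00"))
print(parse_price("N/A"))
```

Decimal is used rather than float so that monetary values are not affected by binary rounding.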

Step 3: Saving the Data to a CSV File

We save the scraped data in CSV format for later processing.

import csv


def save_to_csv(data, filename='products.csv'):
    """Write a list of dicts to a CSV file, using the first row's keys as the header."""
    if not data:
        print("No data to save.")
        return
    with open(filename, mode='w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Data saved to {filename}")


# Example call (assumes product_list was produced by the parsing step)
save_to_csv(product_list)
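To see what the CSV step produces and how the data can be loaded again later, here is a small round-trip sketch. It writes to an in-memory buffer rather than a file so it has no side effects; the sample rows are made up for illustration.

```python
import csv
import io

products = [
    {"name": "Product A", "price": "$99.99"},
    {"name": "Product B", "price": "$1,299.00"},  # the comma forces quoting
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(products)

# Read the same text back; DictReader uses the header row as dictionary keys.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows == products)
```

The csv module quotes fields containing commas automatically, which is why the second price survives the round trip unchanged.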

Putting It All Together

Now we combine the functions above into one complete script:

import csv

import requests
from bs4 import BeautifulSoup


def fetch_page(url):
    """Fetch a page and return its HTML text, or None on failure."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'
    }
    try:
        # A timeout keeps the request from hanging indefinitely.
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        print(f"Failed to fetch page: {url}, status code: {response.status_code}")
        return None
    except requests.RequestException as e:
        print(f"Error fetching page: {e}")
        return None


def parse_products(html):
    """Extract the name and price of every product block in the page."""
    soup = BeautifulSoup(html, 'lxml')
    products = []
    for item in soup.find_all('div', class_='product'):
        name_tag = item.find('h3', class_='name')
        price_tag = item.find('span', class_='price')
        # Skip blocks missing either field instead of raising AttributeError.
        if name_tag is None or price_tag is None:
            continue
        products.append({
            'name': name_tag.get_text(strip=True),
            'price': price_tag.get_text(strip=True)
        })
    return products


def save_to_csv(data, filename='products.csv'):
    """Write a list of dicts to a CSV file, using the first row's keys as the header."""
    if not data:
        print("No data to save.")
        return
    with open(filename, mode='w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Data saved to {filename}")


def main():
    url = "https://example.com/products"
    html = fetch_page(url)
    if html:
        products = parse_products(html)
        save_to_csv(products)


if __name__ == "__main__":
    main()

Advanced Tips and Caveats

1. Handling Pagination

If the target site is paginated, we need to iterate over multiple pages.

products = []  # accumulates results from all pages
for page in range(1, 6):  # Scrape the first 5 pages
    url = f"https://example.com/products?page={page}"
    html = fetch_page(url)
    if html:
        products.extend(parse_products(html))
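When the number of pages is not known in advance, one option is to keep requesting pages until one fails or comes back empty. The sketch below illustrates that idea; `crawl_pages` is a hypothetical helper, and the fetch and parse functions are passed in as parameters so that the example can run offline with stand-ins. In the real scraper you would pass the article's `fetch_page` and `parse_products` instead.

```python
def crawl_pages(fetch, parse, base_url, max_pages=5):
    """Collect items page by page, stopping early at the first empty or failed page."""
    items = []
    for page in range(1, max_pages + 1):
        html = fetch(f"{base_url}?page={page}")
        if not html:
            break  # request failed or page does not exist
        page_items = parse(html)
        if not page_items:
            break  # ran past the last page with data
        items.extend(page_items)
    return items


# Stand-ins for fetch_page and parse_products so the sketch runs offline.
FAKE_SITE = {
    "https://example.com/products?page=1": "A,B",
    "https://example.com/products?page=2": "C",
}
fake_fetch = FAKE_SITE.get  # returns None for unknown pages
fake_parse = lambda html: html.split(",")

print(crawl_pages(fake_fetch, fake_parse, "https://example.com/products"))
```

The `max_pages` cap remains as a safety limit in case a site keeps returning non-empty pages indefinitely.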

2. Adding a Delay Between Requests

To avoid putting excessive load on the server, add a delay between requests.

import time

time.sleep(2)  # Wait 2 seconds after each request
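A fixed sleep always waits the full duration, even when parsing the previous page already took some time. One alternative is a small throttle that only sleeps for whatever remains of the minimum interval. The `Throttle` class below is an illustrative sketch, not part of any library, and the 0.2 second interval is chosen only to keep the demo quick.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between consecutive wait() calls."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(min_interval=0.2)  # short interval to keep the demo quick
start = time.monotonic()
for page in range(1, 4):
    throttle.wait()  # the first call returns immediately
    print(f"would fetch page {page} at t={time.monotonic() - start:.2f}s")
```

In a real crawl, `throttle.wait()` would be called right before each request, with an interval of a second or more.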

3. Exception Handling and Retries

More thorough exception handling makes the scraper more robust.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def fetch_page_with_retry(url):
    session = requests.Session()
    # Retry up to 5 times on the listed server errors, backing off between attempts.
    retry = Retry(total=5, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        response = session.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Treat remaining 4xx/5xx responses as failures
        return response.text
    except requests.RequestException as e:
        print(f"Request failed after retries: {e}")
        return None
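To make the idea of retrying with exponential backoff concrete, here is a hand-rolled sketch using only the standard library. It illustrates the general technique and is not how urllib3's Retry computes its delays internally; `retry_call` and its parameters are hypothetical names. The `sleep` parameter is injectable so the demo can record delays instead of actually waiting.

```python
import time


def retry_call(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n, then try again.

    Re-raises the last exception once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Offline demo: a function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
result = retry_call(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Doubling the delay after each failure gives a struggling server progressively more time to recover.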

4. Using a Proxy IP Pool

For large-scale collection, consider using proxy IPs to reduce the risk of being blocked.

# user, pass, ip, and port are placeholders for your proxy's credentials and address.
proxies = {
    'http': 'http://user:pass@ip:port',
    'https': 'http://user:pass@ip:port'
}
response = requests.get(url, proxies=proxies)
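A "pool" implies rotating through several proxies rather than reusing one. A minimal round-robin sketch is shown below; the addresses come from a documentation-only IP range and are purely hypothetical, and `proxy_rotation` is an illustrative helper. Each yielded dictionary has the shape that the `proxies` argument of requests expects.

```python
from itertools import cycle

# Hypothetical proxy endpoints (documentation-only IP range); replace with real ones.
PROXY_URLS = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(urls):
    """Yield requests-style proxies dicts, cycling through urls forever."""
    for url in cycle(urls):
        yield {"http": url, "https": url}


rotation = proxy_rotation(PROXY_URLS)
for _ in range(4):
    print(next(rotation)["http"])  # the fourth value wraps around to the first proxy
```

Each request would then take `next(rotation)` as its `proxies` argument, so consecutive requests leave through different addresses.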

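5. Respecting the Robots Protocol

Many websites publish a robots.txt file that specifies which paths crawlers may access, for example https://example.com/robots.txt. Developers should follow these rules and avoid unauthorized scraping.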
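Python's standard library can evaluate robots.txt rules through urllib.robotparser. The sketch below parses a hypothetical robots.txt that is inlined as a string so the example runs offline; the rules and the crawler name "demo-scraper" are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "demo-scraper"  # hypothetical crawler name
print(parser.can_fetch(AGENT, "https://example.com/products"))     # allowed path
print(parser.can_fetch(AGENT, "https://example.com/admin/users"))  # disallowed path
print(parser.crawl_delay(AGENT))                                   # requested delay in seconds
```

For a live site, calling `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads the real file in place of the inlined text.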

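Summary

This article showed how to build a basic web scraper in Python, covering the core workflow of sending HTTP requests, parsing HTML, and saving data, with working code for each step. It also discussed several more advanced techniques, including pagination, exception handling, and proxies.

Web scraping is a powerful way to collect data from the web, but it must be used legally and ethically: respect the target site's rules and keep the request rate reasonable so the server is not overloaded.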

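Appendix: Full Source Code

The complete source code for this project is available on GitHub: https://github.com/example/web-scraper-demo

For further study, the official documentation is recommended:

Requests: https://docs.python-requests.org/
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

If you want to build a more sophisticated scraping system, you can also try Scrapy, a dedicated scraping framework.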

