
Source Code for Scraping "Xiaojiejie" (Young Women) Photos

2023-07-24 Category: Original Code

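This code implements a multithreaded image downloader that fetches image sets from the website https://www.xiezhen.xyz/. It uses the requests library for HTTP requests, lxml to parse the HTML, and concurrent.futures.ThreadPoolExecutor to run the downloads in parallel.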

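The main functions and flow of the code are as follows:

Import the required modules: time, requests, etree (from the lxml library), os, and concurrent.futures.
Define download_image(url, img_path), which downloads a single image. It takes the image URL and the target folder as arguments, sends an HTTP request, and writes the response body to a file in that folder, named after the last segment of the URL.
Define process_page(page), which handles one listing page. It builds the page URL from the page number and requests it, extracts the link to each article (image set) on the page, opens each article, and collects its img elements. It then creates a folder named after the article title and uses a ThreadPoolExecutor to download the article's images into that folder in parallel.
In the __name__ == '__main__' block, a second ThreadPoolExecutor processes the listing pages in parallel. A loop submits one process_page task per page (pages 1 to 572), and concurrent.futures.as_completed is then used to wait for the tasks to finish.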
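The submit-then-as_completed pattern described above can be sketched in isolation. This is a minimal illustration, not the post's code: fake_process_page and run_pages are hypothetical names, and the worker does no network access so the example runs offline. It also shows one thing the pattern makes possible, which is calling future.result() so that an exception raised inside a worker is reported instead of being silently dropped.

```python
import concurrent.futures


def fake_process_page(page):
    """Hypothetical stand-in for process_page: no network, just returns a value."""
    if page == 3:
        raise ValueError(f"page {page} failed")
    return page * 10


def run_pages(pages, max_workers=4):
    """Submit one task per page and collect results and errors as tasks finish."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the page number it was created for.
        future_to_page = {executor.submit(fake_process_page, p): p for p in pages}
        for future in concurrent.futures.as_completed(future_to_page):
            page = future_to_page[future]
            try:
                # result() re-raises any exception raised inside the worker.
                results[page] = future.result()
            except Exception as exc:
                errors[page] = exc
    return results, errors


if __name__ == "__main__":
    results, errors = run_pages(range(1, 6))
    print(sorted(results.items()))
    print({page: str(exc) for page, exc in errors.items()})
```

Passing an explicit max_workers is a design choice worth considering when the tasks hit a single website, since it caps how many requests are in flight at once.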

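Note that downloading is I/O-bound, so multithreading can speed it up considerably, but any state shared between threads must be protected to avoid thread-safety problems. The time.sleep(0.5) delay, which runs after each article inside process_page, is presumably there to avoid hammering the server with requests and to reduce its load. To run this code, make sure the requests and lxml libraries are installed (for example with pip install requests lxml).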
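To make the thread-safety and delay points concrete, here is a minimal sketch of a throttle shared by all worker threads. It is an assumption-laden illustration rather than part of the original code: Throttle and throttled_task are hypothetical names, and the actual network request is omitted. A lock guards the shared timestamp, so requests from all threads together are spaced at least min_interval seconds apart, whereas a plain time.sleep in each thread only slows that one thread.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Space out calls from any number of threads by at least min_interval seconds."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self):
        """Block until this caller's reserved time slot, then return that slot."""
        # Only the bookkeeping happens under the lock; sleeping happens outside
        # it so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_interval
        if slot > now:
            time.sleep(slot - now)
        return slot


def throttled_task(throttle, item):
    """Hypothetical worker: wait for a slot, then do the (omitted) request."""
    slot = throttle.wait()
    return item, slot


if __name__ == "__main__":
    throttle = Throttle(0.5)
    with ThreadPoolExecutor(max_workers=4) as executor:
        for item, slot in executor.map(lambda i: throttled_task(throttle, i), range(4)):
            print(item, round(slot, 2))
```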

import concurrent.futures
import os
import time

import requests
from lxml import etree

# Headers sent with every request so the server sees a browser-like client.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
TIMEOUT = 30  # seconds; without a timeout a stalled request would block its thread forever


def download_image(url, img_path):
    """Download a single image and save it in the folder img_path."""
    response = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()  # do not save an error page as an image
    img_name = url.split('/')[-1]
    with open(os.path.join(img_path, img_name), 'wb') as f:
        f.write(response.content)
    print(f'Image {img_path}/{img_name} downloaded.')


def process_page(page):
    """Download all images of every article linked from one listing page."""
    page_url = f'https://www.xiezhen.xyz/page/{page}'
    response = requests.get(page_url, headers=HEADERS, timeout=TIMEOUT)
    response.raise_for_status()
    html = etree.HTML(response.content)
    article_urls = html.xpath('//div[@class="excerpts"]/article/a/@href')
    for article_url in article_urls:
        response = requests.get(article_url, headers=HEADERS, timeout=TIMEOUT)
        response.raise_for_status()
        html = etree.HTML(response.content)
        img_nodes = html.xpath('//article/p/img')
        # Use the part of the page title before the first '-' as the folder name.
        img_title = html.xpath('//title/text()')[0].split('-')[0].strip()
        img_path = f'J:/xiezhen/{img_title}'
        os.makedirs(img_path, exist_ok=True)  # safe even if another thread created it first
        with concurrent.futures.ThreadPoolExecutor() as executor:
            futures = []
            for node in img_nodes:
                img_url = node.attrib['src']
                futures.append(executor.submit(download_image, img_url, img_path))
            for future in concurrent.futures.as_completed(futures):
                try:
                    future.result()  # surface errors raised inside download_image
                except Exception as exc:
                    print(f'Image download failed: {exc}')
        time.sleep(0.5)  # short pause between articles to go easier on the server


if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = []
        for page in range(1, 573):
            futures.append(executor.submit(process_page, page))
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()  # surface errors raised inside process_page
            except Exception as exc:
                print(f'Page failed: {exc}')
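The code builds the folder name from the page title and the file name from the last segment of the image URL. Titles may contain characters that Windows does not allow in file names, and URLs may carry a query string, so a small cleanup step can help. The following is a minimal sketch under those assumptions; safe_folder_name and image_name_from_url are hypothetical helpers, not part of the original code.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def safe_folder_name(title, fallback="untitled"):
    """Replace characters that Windows forbids in file names and trim the ends."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]+', "_", title).strip(" ._")
    return cleaned or fallback


def image_name_from_url(url, fallback="image.jpg"):
    """Return the last path segment of the URL, ignoring query string and fragment."""
    name = PurePosixPath(urlsplit(url).path).name
    # Reject empty or special names that would not make a usable file name.
    if name in {"", ".", ".."}:
        return fallback
    return name


if __name__ == "__main__":
    print(safe_folder_name("A/B: C? "))
    print(image_name_from_url("https://example.com/a/b/photo.jpg?size=large"))
```

In the code above, these helpers could be applied to img_title and img_name before the paths are built.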
