Scraping Images from Tuchong's "美女" Tag - UUpython

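This is a simple Python script for downloading images from Tuchong, a social platform centered on photography. It fetches the post listings for a tag, collects the image links in each post, and downloads the images with a thread pool.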

The main functions and flow of the code are as follows:

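Import the required modules: os, time, requests, re, and concurrent.futures.ThreadPoolExecutor.
Define normalize_directory_name(name), which cleans up a directory name: characters that are illegal in paths are replaced with spaces, repeated whitespace is collapsed, and the result is capped at 100 characters.
Define download_image(image_info), which downloads one image. It reads the user ID and image ID from image_info, builds the image URL, fetches the content with requests.get(), and returns the image ID together with the image data.
In the main loop, for pages 1 through 98, build the listing URL for the tag and send an HTTP request to get that page's list of posts.
Use ThreadPoolExecutor to download the images of each page in parallel. A list named futures is built first, with one executor.submit() call per image in every post, each passing download_image and that image's info. The script then iterates over futures and calls future.result() to get each image ID and its data.
For each downloaded image, look through the posts to find the one it belongs to. Posts with only one image are skipped; for the others, the image list is checked for a matching image ID. The post title (or the author ID if the title is empty) is passed through normalize_directory_name() and used as the folder name. The folder is created if it does not exist, and the image is written to that path.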
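The URL construction described in the steps above can be sketched in isolation. This is only an illustration: the helper names `build_page_url` and `build_image_url` are hypothetical and do not appear in the script, while the URL patterns themselves are copied from it. The percent-encoded segment in the listing URL is simply the tag "美女" passed through `urllib.parse.quote`.

```python
from urllib.parse import quote

TAG = '美女'  # the tag the script crawls


def build_page_url(page, count=20, order='weekly'):
    """Hypothetical helper: listing URL for one page of posts under TAG."""
    return (f'https://tuchong.com/rest/tags/{quote(TAG)}/posts'
            f'?page={page}&count={count}&order={order}&before_timestamp=')


def build_image_url(image_info):
    """Hypothetical helper: image URL built from one entry of a post's image list."""
    user_id = image_info['user_id']
    img_id_str = image_info['img_id_str']
    return f'https://photo.tuchong.com/{user_id}/f/{img_id_str}.webp'


if __name__ == '__main__':
    print(build_page_url(1))
    # Made-up IDs, used only to show the shape of the resulting URL
    print(build_image_url({'user_id': 123, 'img_id_str': '456'}))
```

Keeping the tag as readable text and encoding it at build time makes it easier to point the same script at a different tag.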

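The goal of the code is to download images from Tuchong. A post may contain several images, and the script saves them in a folder named after the post title. Multithreaded downloading improves throughput, but thread safety and the server's request-rate limits also need to be taken into account. When using this code, be sure to follow the site's terms of use and the applicable laws and regulations.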
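One way to address the rate-limit concern is to cap the number of worker threads and pause between listing pages. The sketch below is an assumption-laden illustration rather than part of the original script: `crawl_pages`, `MAX_WORKERS`, and `PAGE_DELAY` are hypothetical names, the values are arbitrary, and the fetch and download steps are passed in as functions so the example runs without touching the network. Each worker only returns data and shares no mutable state, which sidesteps most thread-safety issues.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4    # assumed cap on concurrent downloads
PAGE_DELAY = 1.0   # assumed pause in seconds between listing pages


def crawl_pages(pages, fetch_page, download, delay=PAGE_DELAY):
    """Download every item of every page with a bounded pool, pausing between pages.

    fetch_page(page) returns a list of items; download(item) returns a result.
    Both are parameters so this sketch does not depend on any real endpoint.
    """
    results = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        for page in pages:
            items = fetch_page(page)
            # executor.map keeps input order and re-raises worker exceptions here
            results.extend(executor.map(download, items))
            time.sleep(delay)  # wait before requesting the next page
    return results


if __name__ == '__main__':
    # Stand-in functions instead of real HTTP calls
    fake_fetch = lambda page: [page * 10, page * 10 + 1]
    fake_download = lambda item: item * 2
    print(crawl_pages([1, 2], fake_fetch, fake_download, delay=0))
```

In the real script, `fetch_page` would wrap the listing request and `download` would be `download_image`; suitable values for the worker cap and delay depend on what the site tolerates.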

import os
import time
import requests
import re
from concurrent.futures import ThreadPoolExecutor
 
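# Sanitize a directory name by removing characters that are illegal in paths
def normalize_directory_name(name):
    # Replace illegal characters with spaces (str() guards against non-string input)
    cleaned_name = re.sub(r'[\\/:*?"<>|]', ' ', str(name))
    # Collapse repeated whitespace and cap the length at 100 characters
    normalized_name = re.sub(r'\s+', ' ', cleaned_name).strip()[:100]
    return normalized_name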
 
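# Download one image and return (image ID, raw bytes)
def download_image(image_info):
    user_id = image_info['user_id']
    img_id_str = image_info['img_id_str']
    img_url = f'https://photo.tuchong.com/{user_id}/f/{img_id_str}.webp'
    # The timeout keeps a stalled connection from blocking a worker thread forever
    image_data = requests.get(url=img_url, proxies=None, timeout=30).content
    return img_id_str, image_data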
 
for k in range(1, 99):
    url = f'https://tuchong.com/rest/tags/%E7%BE%8E%E5%A5%B3/posts?page={k}&count=20&order=weekly&before_timestamp='
 
    response = requests.get(url=url, timeout=30)
    json_data = response.json()
 
    post_list = json_data['postList']
 
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(download_image, image_info) for post in post_list for image_info in post.get('images', [])]
 
        for future in futures:
            img_id_str, image_data = future.result()
 
            # Find the post this image belongs to and use its title as the folder name
            for post in post_list:
                images = post.get('images', [])
                if len(images) <= 1:
                    continue
                if any(image_info['img_id_str'] == img_id_str for image_info in images):
                    title = post['title']
                    if not title:  # Fall back to the author ID when the title is empty
                        title = str(post['author_id'])
                    title = normalize_directory_name(title)
                    image_dir = f'G:/tuchongspider/{title}'
                    # Create the folder if needed; exist_ok avoids a separate existence check
                    os.makedirs(image_dir, exist_ok=True)
 
                    image_path = f'{image_dir}/{title}-{img_id_str}.webp'
                    with open(image_path, 'wb') as f:
                        f.write(image_data)
 
                    print(f'Downloaded {img_id_str}.webp')
                    break
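The script above downloads the images of single-image posts and then discards them, and it rescans post_list for every finished download. A possible alternative, sketched here under assumptions, is to filter before submitting and to remember which post each future belongs to. The function name `download_all` is hypothetical, and the `download` argument stands in for `download_image` so the sketch can run offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(post_list, download):
    """Yield (post, img_id_str, data) for every image of posts with more than one image.

    download(image_info) must return (img_id_str, data), like download_image.
    """
    with ThreadPoolExecutor() as executor:
        future_to_post = {}
        for post in post_list:
            images = post.get('images', [])
            if len(images) <= 1:  # same filter as the script, applied before downloading
                continue
            for image_info in images:
                future_to_post[executor.submit(download, image_info)] = post
        # as_completed yields futures as they finish, not in submission order
        for future in as_completed(future_to_post):
            img_id_str, data = future.result()
            yield future_to_post[future], img_id_str, data


if __name__ == '__main__':
    # Made-up posts and a fake downloader, for illustration only
    posts = [
        {'title': 'a', 'images': [{'img_id_str': '1'}, {'img_id_str': '2'}]},
        {'title': 'b', 'images': [{'img_id_str': '3'}]},
    ]
    fake_download = lambda info: (info['img_id_str'], b'bytes')
    for post, img_id_str, data in download_all(posts, fake_download):
        print(post['title'], img_id_str, len(data))
```

The caller receives the post alongside each image, so the title lookup and file writing can stay in the main thread exactly as in the script.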
