
Scraping Novel Chapter Content from a Novel-Reading Website

2023-08-25 Category: Reposted Code

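This code is a simple scraper script that downloads the chapters of novels from a novel-reading website. It uses the requests library to send HTTP requests, BeautifulSoup to parse the HTML, and a regular expression to match entries in the chapter list.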

The main functions and flow of the code are as follows:

Import the required modules:

requests: sends HTTP requests and retrieves page content.
re: regular-expression matching.
BeautifulSoup: parses HTML page content.

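Define the get_soup function:

Sends a GET request to the given URL, with optional query parameters, and retrieves the page.
Parses the response with BeautifulSoup and returns the resulting BeautifulSoup object.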
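
As a rough illustration of what passing parameters to a GET request amounts to, the sketch below builds a comparable query string with the standard library instead of requests. The helper name build_search_url is hypothetical; the URL and the searchkey parameter are taken from the script, and requests may differ in encoding details.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_search_url(base_url, params=None):
    """Hypothetical helper: append URL-encoded params to base_url as a query string."""
    if not params:
        return base_url
    return base_url + "?" + urlencode(params)

search_url = build_search_url(
    "https://www.ibiquges.info/modules/article/waps.php",
    {"searchkey": "玄鉴仙族"},
)
print(search_url)  # the title appears percent-encoded as UTF-8 bytes

# Decoding the query string recovers the original search key.
decoded = parse_qs(urlsplit(search_url).query)
print(decoded["searchkey"][0] == "玄鉴仙族")
```

The non-ASCII title is percent-encoded in the URL, which is why the search works even though the title contains Chinese characters.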

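Define the get_chapter_content function:

Fetches the chapter page at the given chapter URL.
Locates the div with id content using BeautifulSoup and removes the nested p and div elements, which are not wanted in the output.
Returns the cleaned content element, or None if the page has no such div.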
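
The cleaning step can be tried offline on a small HTML string. This is a minimal sketch, assuming markup shaped like the made-up SAMPLE_HTML below; the real site's pages may look different, and extract_chapter_text is a hypothetical name, not a function from the script.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating a chapter page (an assumption, not the real markup).
SAMPLE_HTML = """
<html><body>
<div id="content">
First line of the chapter.<br/>
Second line of the chapter.
<p>Bookmark this site!</p>
<div class="ad">advertisement</div>
</div>
</body></html>
"""

def extract_chapter_text(html):
    """Return the chapter text, or None if there is no <div id="content">."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", {"id": "content"})
    if content is None:
        return None
    # Remove nested <p> elements, then nested <div> elements, in place.
    for p in content.find_all("p"):
        p.decompose()
    for d in content.find_all("div"):
        d.decompose()
    # Join the remaining text nodes with newlines, dropping blank ones.
    return content.get_text("\n", strip=True)

text = extract_chapter_text(SAMPLE_HTML)
print(text)
```

Because decompose() destroys the element together with its children, any chapter text that the site happens to wrap in p tags would be lost as well, so this approach only suits pages where the body text sits directly inside the content div.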

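Define the main function:

Defines a list of novel titles, namelist.
Iterates over the list and scrapes each novel in turn.
Searches the site for each title to find the novel's link, then fetches its chapter list and extracts the chapter links and titles with a regular expression.
For each chapter link, tries up to 20 times to fetch the chapter content, then appends it to the novel's text file.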
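
The chapter-list step can be sketched on its own as follows. The catalog markup here is made up to fit the script's pattern, and list_chapters is a hypothetical helper. The script joins the site root and the chapter path by plain string concatenation; urljoin is swapped in here as an alternative that also copes with absolute links.

```python
import re
from urllib.parse import urljoin

# Made-up catalog markup in the shape the script's pattern expects.
SAMPLE_CATALOG = (
    '<div id="list"><dl>'
    '<dd><a href="/1_1/100.html">Chapter 1</a></dd>'
    '<dd><a href="/1_1/101.html">Chapter 2</a></dd>'
    '</dl></div>'
)

# Same pattern as in the script: group 1 is the link, group 2 is the title.
CHAPTER_PATTERN = re.compile(r'<dd><a href="(.*?)">(.*?)</a></dd>')

def list_chapters(catalog_html, base_url):
    """Return (absolute_url, title) pairs for every catalog entry."""
    return [
        (urljoin(base_url, href), title)
        for href, title in CHAPTER_PATTERN.findall(catalog_html)
    ]

chapters = list_chapters(SAMPLE_CATALOG, "https://www.ibiquges.info")
for chapter_url, chapter_title in chapters:
    print(chapter_url, chapter_title)
```

A regular expression over serialized HTML is sensitive to small markup changes such as extra attributes or whitespace; iterating over the a elements with BeautifulSoup would be a sturdier option.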

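In short, the script searches the site for each novel by title, fetches its chapter list, then downloads the chapters one by one and saves them to a text file named after the novel. Before running it, make sure requests and BeautifulSoup (the beautifulsoup4 package) are installed; re is part of the Python standard library and needs no installation. Also respect the site's terms of use and avoid excessive requests.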

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# time: 2023/7/25 15:22
# file: xbiquge.py
# author: lyx
 
import requests
import re
from bs4 import BeautifulSoup
 
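def get_soup(url, params=None):
    # Fetch a page and parse it. The timeout keeps a stalled request
    # from hanging the script indefinitely.
    res = requests.get(url=url, params=params, timeout=10)
    res.encoding = 'utf-8'
    return BeautifulSoup(res.text, 'html.parser')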
 
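def get_chapter_content(url):
    # Return the cleaned <div id="content"> element, or None if the request
    # fails or the page has no such div, so that the caller can retry.
    try:
        chapter = requests.get(url, timeout=10)
    except requests.RequestException:
        return None
    chapter.encoding = 'utf-8'
    chapter1 = BeautifulSoup(chapter.text, 'html.parser')
    chapter_div = chapter1.find('div', {'id': 'content'})
    if chapter_div:
        # Drop the nested <p> and <div> elements, which are unwanted here.
        paragraphs = chapter_div.find_all('p')
        for p in paragraphs:
            p.decompose()
        div = chapter_div.find_all('div')
        for d in div:
            d.decompose()
    return chapter_div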
 
def main():
    namelist = ['阵问长生','神明模拟器','半岛检查官','全民逃荒，我的物品能升级','玄鉴仙族']
    for bookname in namelist:
        print(f"Scraping book: {bookname}")
        url = 'https://www.ibiquges.info/modules/article/waps.php'
        data = {'searchkey': bookname}
        soup = get_soup(url, data)
        content = soup.find('a', string=bookname)
 
        if content:
            url1 = 'https://www.ibiquges.info'
            url = content['href']
            content = get_soup(url)
            catalog1 = content.find('div', id='list')
            catalog2 = str(catalog1)
            pattern = r'<dd><a href="(.*?)">(.*?)</a></dd>'
            matches = re.findall(pattern, catalog2)
            result = [(match[0], match[1]) for match in matches]
            with open(bookname + '.txt', 'a', encoding='utf-8') as file:
                for chapter_url, chapter_title in result:
                    url2 = url1 + str(chapter_url)
                    # Try up to 20 times until the chapter content is found.
                    for _ in range(20):
                        chapter_div = get_chapter_content(url2)
                        if chapter_div is not None:
                            middle_text = chapter_div.get_text("\n", strip=True)
                            print('\n\n\n' + chapter_title + '\n\n\n')
                            file.write('\n\n\n' + chapter_title + '\n\n\n')
                            file.write(middle_text)
                            break
        else:
            print("No matching URL found")
 
if __name__ == "__main__":
    main()
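
The script's retry loop makes up to 20 attempts per chapter with no pause in between. One way to follow the advice about not hammering the site is to wait between attempts. The sketch below is an assumption-laden illustration rather than part of the original script: fetch_with_retries, its parameters, and the flaky_fetch stand-in are all hypothetical names, and the fetch function is passed in so the example runs without network access.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None value or attempts run out.

    Waits delay * attempt_number seconds between attempts. Returns None
    if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = fetch(url)
        except Exception:
            # A real scraper would more likely catch requests.RequestException.
            result = None
        if result is not None:
            return result
        if attempt < attempts:
            sleep(delay * attempt)
    return None

# Stand-in for a chapter fetcher that fails twice before succeeding.
calls = []

def flaky_fetch(url):
    calls.append(url)
    return "chapter text" if len(calls) >= 3 else None

waits = []  # record the requested pauses instead of actually sleeping
result = fetch_with_retries(flaky_fetch, "https://example.com/chapter", sleep=waits.append)
print(result, len(calls), waits)
```

In the script, get_chapter_content could be passed as the fetch argument, and a short pause between successive chapters would reduce the load on the site further.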
