Python实现一个论文下载器的过程

站长资源 2024/11/23 佚名

2 0 1

在科研学习的过程中，我们难免需要查询相关的文献资料，而想必很多小伙伴都知道SCI-HUB，此乃一大神器，它可以帮助我们搜索相关论文并下载其原文。可以说，SCI-HUB造福了众多科研人员，用起来也是“美滋滋”。

然而，当师姐告诉我：“xx，可以帮我下载几篇文献嘛"text-align: center">

一、代码分析

代码分析的详细思路跟以往依旧如此雷同，逃不过的还是：抓包分析->模拟请求->代码整合。由于一会儿kimol君还得去搬砖，今天就不详细展开了。

1. 搜索论文

通过论文的URL、PMID、DOI号或者论文标题等搜索到对应的论文，并通过bs4库找出PDF原文的链接地址，代码如下：

def search_article(artName):
 '''
 搜索论文
 ---------------
 输入：论文名
 ---------------
 输出：搜索结果（如果没有返回""，否则返回PDF链接）
 '''
 url = 'https://www.sci-hub.ren/'
 headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding':'gzip, deflate, br',
    'Content-Type':'application/x-www-form-urlencoded',
    'Content-Length':'123',
    'Origin':'https://www.sci-hub.ren',
    'Connection':'keep-alive',
    'Upgrade-Insecure-Requests':'1'}
 data = {'sci-hub-plugin-check':'',
   'request':artName}
 res = requests.post(url, headers=headers, data=data)
 html = res.text
 soup = BeautifulSoup(html, 'html.parser')
 iframe = soup.find(id='pdf')
 if iframe == None: # 未找到相应文章
  return ''
 else:
  downUrl = iframe['src']
  if 'http' not in downUrl:
   downUrl = 'https:'+downUrl
  return downUrl

2. 下载论文

得到了论文的链接地址之后，只需要通过requests发送一个请求，即可将其下载：

def download_article(downUrl):
 '''
 根据论文链接下载文章
 ----------------------
 输入：论文链接
 ----------------------
 输出：PDF文件二进制
 '''
 headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding':'gzip, deflate, br',
    'Connection':'keep-alive',
    'Upgrade-Insecure-Requests':'1'}
 res = requests.get(downUrl, headers=headers)
 return res.content

二、完整代码

将上述两个函数整合之后，我的完整代码如下：

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 5 16:32:22 2021
@author: kimol_love
"""
import os
import time
import requests
from bs4 import BeautifulSoup
 
def search_article(artName):
 '''
 搜索论文
 ---------------
 输入：论文名
 ---------------
 输出：搜索结果（如果没有返回""，否则返回PDF链接）
 '''
 url = 'https://www.sci-hub.ren/'
 headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding':'gzip, deflate, br',
    'Content-Type':'application/x-www-form-urlencoded',
    'Content-Length':'123',
    'Origin':'https://www.sci-hub.ren',
    'Connection':'keep-alive',
    'Upgrade-Insecure-Requests':'1'}
 data = {'sci-hub-plugin-check':'',
   'request':artName}
 res = requests.post(url, headers=headers, data=data)
 html = res.text
 soup = BeautifulSoup(html, 'html.parser')
 iframe = soup.find(id='pdf')
 if iframe == None: # 未找到相应文章
  return ''
 else:
  downUrl = iframe['src']
  if 'http' not in downUrl:
   downUrl = 'https:'+downUrl
  return downUrl
  
def download_article(downUrl):
 '''
 根据论文链接下载文章
 ----------------------
 输入：论文链接
 ----------------------
 输出：PDF文件二进制
 '''
 headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language':'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding':'gzip, deflate, br',
    'Connection':'keep-alive',
    'Upgrade-Insecure-Requests':'1'}
 res = requests.get(downUrl, headers=headers)
 return res.content
 
def welcome():
 '''
 欢迎界面
 '''
 os.system('cls')
 title = '''
    _____ _____ _____  _ _ _ _ ____ 
    / ____|/ ____|_ _| | | | | | | | _ \ 
    | (___ | |  | |______| |__| | | | | |_) |
    \___ \| |  | |______| __ | | | | _ < 
    ____) | |____ _| |_  | | | | |__| | |_) |
    |_____/ \_____|_____| |_| |_|\____/|____/
    
   '''
 print(title)
 
if __name__ == '__main__':
 while True:
  welcome()
  request = input('请输入URL、PMID、DOI或者论文标题：')
  print('搜索中...')
  downUrl = search_article(request)
  if downUrl == '':
   print('未找到相关论文，请重新搜索！')
  else:
   print('论文链接：%s'%downUrl)
   print('下载中...')
   pdf = download_article(downUrl)
   with open('%s.pdf'%request, 'wb') as f:
    f.write(pdf)
   print('---下载完成---')
  time.sleep(0.8)

不出所料，代码一跑，我便轻松完成了师姐交给我的任务，不香嘛？

python论文下载器,python,下载器

广告合作：本站广告合作请联系QQ：858582 申请时备注：广告合作（否则不回）
免责声明：本站资源来自互联网收集,仅供用于学习和交流,请遵循相关法律法规,本站一切资源不代表本站立场,如有侵权、后门、不妥请联系本站删除！

评论“Python实现一个论文下载器的过程”

暂无评论...

www.wwsws.com 伏龙阁资源网

39,976影音资源

44,792技术资源

21,817软件资源

651,128站长资源

Python实现一个论文下载器的过程

一、代码分析

1. 搜索论文

2. 下载论文

二、完整代码

Python实现邮件发送的详细设置方法(遇到问题)

利用python为PostgreSQL的表自动添加分区

评论“Python实现一个论文下载器的过程”

RTX 5090要首发性能要翻倍！三星展示GDDR7显存

更新日志

友情链接

Python实现一个论文下载器的过程

一、代码分析

1. 搜索论文

2. 下载论文

二、完整代码

Python实现邮件发送的详细设置方法(遇到问题)

利用python为PostgreSQL的表自动添加分区

评论“Python实现一个论文下载器的过程”

RTX 5090要首发 性能要翻倍！三星展示GDDR7显存

更新日志

友情链接

RTX 5090要首发性能要翻倍！三星展示GDDR7显存