Bypassing nginx hotlink protection with Node.js: an image download case study
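Problem
Today's project requirement was to collect information from several websites, including blockchain statistics charts and similar images.
I used Node.js to send a GET request for each image and save it to local disk. The test code, which uses Node's built-in http and https modules, is as follows: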
import fs from 'fs';
import path from 'path';
import http from 'http';
import https from 'https';

// ES modules have no built-in __dirname; use the current working directory.
const __dirname = path.resolve();
const filePath = path.join(__dirname, 'imgtmp');
// Make sure the target directory exists before writing into it.
fs.mkdirSync(filePath, { recursive: true });

// Download url into imgtmp/filename; callback(filename) runs once the file is closed.
function downloadfile(url, filename, callback) {
  try {
    console.log('Downloading file:', filename);
    // Pick the module that matches the URL scheme.
    const mod = url.startsWith('https://') ? https : http;
    const req = mod.get(url, {
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded'
      }
    }, (res) => {
      const writePath = path.join(filePath, filename);
      const file = fs.createWriteStream(writePath);
      res.pipe(file);
      file.on('error', (error) => {
        console.log('There was an error writing the file. Details:', error);
      });
      file.on('close', () => {
        callback(filename);
      });
      file.on('finish', () => {
        file.close();
        console.log('Completely downloaded.');
      });
    });
    req.on('error', (error) => {
      console.log(`Error downloading file. Details: ${error}`);
    });
  } catch (error) {
    console.log('Image download failed!', error);
  }
}

const url = 'https://xx.xxxx.com/d/file/zxgg/a2cffb8166f07c0232eca49f8c9cc242.jpg'; // image URL
const filename = path.basename(url);
downloadfile(url, filename, () => {
  console.log(filename, 'downloaded successfully');
});
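Running the code reported that the file had downloaded successfully.
But when I opened the image, I was stunned: it was shown as corrupted, and its size was only 304 bytes.
My guess was that an error message had been saved in place of the image, so I opened the file as text in EditPlus and, sure enough, found an error message.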
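The manual check described above can also be automated. The following is a minimal sketch, not part of the original test code: `detectImageType` is a hypothetical helper that inspects the first bytes of the downloaded data and reports whether they match the well-known JPEG, PNG, or GIF file signatures. An HTML or plain-text error body matches none of them.

```javascript
// Hypothetical helper (not in the original script): identify an image by its
// leading "magic bytes". Returns 'jpeg', 'png', 'gif', or null.
function detectImageType(buffer) {
  // True when the buffer begins with exactly the given byte values.
  const startsWith = (bytes) =>
    buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b);

  if (startsWith([0xff, 0xd8, 0xff])) return 'jpeg';
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'png';
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return 'gif'; // ASCII "GIF8"
  return null;
}
```

Reading the saved file with `fs.readFileSync` and passing the resulting Buffer to `detectImageType` should return `null` for a file that contains an error message instead of image data.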
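Solution
A web search confirmed that the nginx server hosting the images was enforcing Referer-based hotlink protection, and a little more searching turned up the key to the problem.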
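To make the mechanism concrete: with Referer-based hotlink protection, the server compares the Referer request header against a list of allowed sites and refuses requests that do not match. A Node.js script sends no Referer unless one is set explicitly. The sketch below is a simplified model of such a check written in JavaScript for illustration only; it is not nginx's actual implementation, and `isRefererAllowed` and `allowedHosts` are hypothetical names.

```javascript
// Simplified model of a Referer-based hotlink check (illustration only).
// allowedHosts is a hypothetical whitelist of host names.
function isRefererAllowed(referer, allowedHosts) {
  // Assumption: requests that carry no Referer at all are rejected.
  if (!referer) return false;
  try {
    return allowedHosts.includes(new URL(referer).hostname);
  } catch {
    return false; // the Referer value is not a valid URL
  }
}
```

Under this model, a request with no Referer header is rejected, while the same request carrying a Referer from an allowed host is served.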
Opening the site in Chrome, I found the Referer header sent with the image request in the developer tools.
The Referer the site expected was simply its own address, so I updated the code and added a Referer entry to headers, "referer": 'https://www.xxxx.com/':
import fs from 'fs';
import path from 'path';
import http from 'http';
import https from 'https';

// ES modules have no built-in __dirname; use the current working directory.
const __dirname = path.resolve();
const filePath = path.join(__dirname, 'imgtmp');
// Make sure the target directory exists before writing into it.
fs.mkdirSync(filePath, { recursive: true });

// Download url into imgtmp/filename; callback(filename) runs once the file is closed.
function downloadfile(url, filename, callback) {
  try {
    console.log('Downloading file:', filename);
    // Pick the module that matches the URL scheme.
    const mod = url.startsWith('https://') ? https : http;
    const req = mod.get(url, {
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        // Present the request as coming from the site's own pages.
        'referer': 'https://www.xxxx.com/'
      }
    }, (res) => {
      const writePath = path.join(filePath, filename);
      const file = fs.createWriteStream(writePath);
      res.pipe(file);
      file.on('error', (error) => {
        console.log('There was an error writing the file. Details:', error);
      });
      file.on('close', () => {
        callback(filename);
      });
      file.on('finish', () => {
        file.close();
        console.log('Completely downloaded.');
      });
    });
    req.on('error', (error) => {
      console.log(`Error downloading file. Details: ${error}`);
    });
  } catch (error) {
    console.log('Image download failed!', error);
  }
}

const url = 'https://xx.xxxx.com/d/file/zxgg/a2cffb8166f07c0232eca49f8c9cc242.jpg'; // image URL
const filename = path.basename(url);
downloadfile(url, filename, () => {
  console.log(filename, 'downloaded successfully');
});
Running the code again, the image downloaded successfully and opened normally.
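Two small additions could make this kind of script more robust. The sketch below is only a suggestion under stated assumptions: `buildImageHeaders` and `describeBadResponse` are hypothetical helpers that are not part of the original code, and the first one assumes the server accepts the embedding site's origin as the Referer, as it did in this case.

```javascript
// Hypothetical helpers, not part of the original script.

// Build request headers whose Referer is the origin of the page that embeds
// the image, e.g. 'https://www.example.com/' for any page on that site.
function buildImageHeaders(pageUrl) {
  return { Referer: new URL(pageUrl).origin + '/' };
}

// Return a description of the problem if the response is clearly not an
// image, or null if it looks fine. Node.js exposes incoming header names in
// lower case, so 'content-type' is the key to read.
function describeBadResponse(statusCode, headers) {
  if (statusCode !== 200) return `unexpected status ${statusCode}`;
  const type = String(headers['content-type'] || '');
  if (!type.startsWith('image/')) return `unexpected content type "${type}"`;
  return null;
}
```

Inside the response callback, calling `describeBadResponse(res.statusCode, res.headers)` before `res.pipe(file)` and aborting on a non-null result would keep an error page from being saved under an image filename and reported as a successful download.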
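Postscript
I also tested another approach: using Playwright to open the page in a browser, then calling await page.locator('selector path').screenshot({ path: 'image save path' }); to save a screenshot of the image as rendered on the page.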
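Comparing the two, the Playwright screenshot approach requires, while iterating over the image elements, walking back from the current element through its parentNode and iterating over childNodes to build a selector, so the algorithm is relatively complex. The image produced by screenshot() is also slightly blurrier than the original. I therefore recommend fetching the original image with an HTTP request that carries the Referer header.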
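PS: While debugging the second approach I wrote a function that builds a selector by walking up from an element. It is offered here for reference; corrections are welcome.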
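/**
 * Build a CSS selector path for an element by walking up the DOM.
 * Stops at the nearest ancestor that has an id, or at <body>.
 */
function getSelectorPath(element) {
  if (element.id) {
    return '#' + element.id;
  }
  const parent = element.parentNode;
  // Stop at <body>, at a detached element, or when the parent is not an element.
  if (element === document.body || !parent || parent.nodeType !== 1) {
    return element.tagName.toLowerCase();
  }
  let ix = 0; // number of element siblings that precede `element`
  const siblings = parent.childNodes;
  for (let i = 0; i < siblings.length; i++) {
    const sibling = siblings[i];
    // Match by node identity so that siblings with identical content are not confused.
    if (sibling === element) {
      return `${getSelectorPath(parent)} > ${element.tagName.toLowerCase()}:nth-child(${ix + 1})`;
    }
    if (sibling.nodeType === 1) {
      ix++;
    }
  }
}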