Crawling NSFOCUS Vulnerability Bulletins
The Alibaba Cloud vulnerability database encrypts its URLs, and I haven't worked out how to get around that yet. CNNVD does publish XML files, but the information in them is incomplete: there are no remediation suggestions. I don't know much about crawling; I only know requests. I wanted a source that is both easy to crawl and complete, and after looking at several vulnerability databases, NSFOCUS (绿盟) fits both requirements perfectly.
Step 1. Fetch the bulletin list for each page (test.py)
List page URL structure: http://www.nsfocus.net/index.php?act=sec_bug&type_id=&os=&keyword=&page= followed by the page number. Changing the page parameter fetches the bulletin list for the corresponding page.
First, try fetching and printing a page. The Chinese text comes back garbled, so set the response encoding to utf-8 and then parse it with BeautifulSoup; a sketch of this step follows.
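A minimal sketch, assuming page 1 and reusing the User-Agent header from the full script later in this post:

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
    url = 'http://www.nsfocus.net/index.php?act=sec_bug&type_id=&os=&keyword=&page=1'

    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'          # without this the Chinese text is garbled
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.prettify())               # inspect the page structure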
The page content is fetched successfully. Now three fields need to be extracted: the date (date), the link (link), and the title (title). The entries sit in the <li> elements under the vul_list list.
    vul_items = soup.find('ul', class_='vul_list').find_all('li')
Extract the three fields from each entry:
    for vul in vul_items:
        date = vul.find('span').text
        title = vul.find('a').text
        link = vul.find('a')['href']
        print(date, title, link)
Step 2. Parse the bulletin page (detail.py)
To keep the structure clear, this part goes into a new file, detail.py.
Bulletin page URL structure: http://www.nsfocus.net followed by the link extracted above.
First, fetch and print an arbitrary bulletin page to confirm it can be scraped normally; a sketch follows.
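A minimal sketch, assuming link holds one of the href values extracted in Step 1:

    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
    base_url = 'http://www.nsfocus.net'

    link = '...'                         # an href taken from the list page in Step 1
    response = requests.get(base_url + link, headers=headers)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.text)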
The fields to extract are the title, the affected systems, the description, and the remediation suggestion.
title and affected_systems are easy to get:

    title = soup.find('div', align='center').text
    affected_systems = soup.find('blockquote').text.strip()
The other fields have no obvious markers, though; the content is mostly separated by <br> tags, so splitting the text with split() is simpler:
    text = soup.find('div', class_='vulbar').text
    description = text.split('描述:')[1].split('<**>')[0]
    suggestion = text.split('<**>')[1].split('浏览次数')[0]
Refactor the earlier greb() function into a Detail class:
    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
    base_url = r'http://www.nsfocus.net'

    class Detail:
        def __init__(self, url):
            # url is the relative link scraped from the list page
            self.url = base_url + url
            response = requests.get(self.url, headers=headers)
            response.encoding = 'utf-8'
            self.soup = BeautifulSoup(response.text, "html.parser")
            self.vulbar = self.soup.find('div', class_='vulbar')
            self.text = self.vulbar.text

        def get_title(self):
            return self.soup.find('div', align='center').text

        def get_affected_systems(self):
            return self.soup.find('blockquote').text.strip()

        def get_description(self):
            # the description sits between '描述:' and '建议:'; strip the <*...*> markers
            return self.text.split('描述:')[1].split('建议:')[0].replace('<*', '').replace('*>', '')

        def get_suggestion(self):
            # the suggestion sits between '*>建议:' and the '浏览次数' footer
            return self.text.split('*>建议:')[1].split('浏览次数')[0]
Now each field is available through a single accessor call.
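For example (the relative link here is hypothetical; real ones come from the Step 1 list page):

    from detail import Detail

    detail = Detail('/vulndb/12345')     # hypothetical link path
    print(detail.get_title())
    print(detail.get_affected_systems())
    print(detail.get_description())
    print(detail.get_suggestion())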
Step 3. Write the results to files
    import os
    import requests
    from bs4 import BeautifulSoup
    from detail import Detail
    from time import sleep
    from docx import Document
    from docx.oxml.ns import qn

    pages = 5
    output_dir = 'output'
    os.makedirs(output_dir, exist_ok=True)

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}

    def add_paragraph_with_font(document, text):
        # add a paragraph and force SimSun (宋体) for every script range
        paragraph = document.add_paragraph(text)
        run = paragraph.runs[0]
        run.font.name = '宋体'
        run._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')
        run._element.rPr.rFonts.set(qn('w:ascii'), '宋体')
        run._element.rPr.rFonts.set(qn('w:hAnsi'), '宋体')
        run._element.rPr.rFonts.set(qn('w:cs'), '宋体')

    for page in range(1, pages + 1):
        url = r'http://www.nsfocus.net/index.php?act=sec_bug&type_id=&os=&keyword=&page=' + str(page)
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        soup = BeautifulSoup(response.text, "html.parser")
        vul_items = soup.find('ul', class_='vul_list').find_all('li')
        for vul in vul_items:
            try:
                date = vul.find('span').text
                title = vul.find('a').text
                link = vul.find('a')['href']
                sleep(3)                 # rate-limit the detail requests
                detail = Detail(link)

                document = Document()
                document.core_properties.author = ''

                heading = document.add_heading(title, level=0)
                heading_run = heading.runs[0]
                heading_run.font.name = '宋体'
                # the east-asian font must be set on the w:rFonts element itself
                heading_run._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')

                add_paragraph_with_font(document, '受影响版本:')
                add_paragraph_with_font(document, detail.get_affected_systems())
                document.add_paragraph('')
                add_paragraph_with_font(document, '描述:')
                add_paragraph_with_font(document, detail.get_description())
                add_paragraph_with_font(document, '建议:\n')
                add_paragraph_with_font(document, detail.get_suggestion())

                document.save(os.path.join(output_dir, f"{date}-{title}.docx"))
            except Exception as e:
                print(f"发生错误: {title} {link}, Error: {e}")
                continue
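One caveat: the save step embeds the bulletin title in the file name, and titles can contain characters that are illegal in file names (such as / or :), in which case document.save raises and the except branch skips that bulletin. A minimal hypothetical guard, sanitize_filename (not part of the original script), could be applied to the title first:

    import re

    def sanitize_filename(name):
        # replace characters that are illegal in Windows/Linux file names
        return re.sub(r'[\\/:*?"<>|]', '_', name)

    # document.save(os.path.join(output_dir, f"{date}-{sanitize_filename(title)}.docx"))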