Scraping NSFOCUS Vulnerability Advisories

The Aliyun vulnerability database encrypts its URLs, and I haven't figured out how to bypass that yet. CNNVD does publish open XML files, but the information in them is incomplete: there are no remediation suggestions. I don't know much about crawling and only know requests, so I wanted a source that is both easy to scrape and complete. After looking at several vulnerability databases, NSFOCUS fits both requirements perfectly.

Step 1. Fetch the advisory list pages (test.py)

List page URL structure: http://www.nsfocus.net/index.php?act=sec_bug&type_id=&os=&keyword=&page= + page number

Changing the page parameter fetches the advisory list for the corresponding page.

First, try printing the response.

The Chinese text comes back garbled, so set the encoding to utf-8 and parse the response with BeautifulSoup.
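
Roughly, the code at this point looks like the following (a reconstruction, since the original only appeared in screenshots):

```python
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
url = 'http://www.nsfocus.net/index.php?act=sec_bug&type_id=&os=&keyword=&page=' + str(1)

response = requests.get(url, headers=headers)
response.encoding = 'utf-8'   # without this, the Chinese text prints as mojibake
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)        # sanity check: readable Chinese means the encoding is right
```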

The page content is fetched successfully. Now we need to pick out three fields for each entry: the date, the link, and the title.

The entries are the <li> elements under vul_list:

```python
vul_items = soup.find('ul', class_='vul_list').find_all('li')  # one <li> per advisory; avoids shadowing the built-in list
```

Extract the three fields:

```python
for vul in vul_items:
    date = vul.find('span').text
    title = vul.find('a').text
    link = vul.find('a')['href']
    print(date, title, link)
```

Step 2. Fetch advisory details (detail.py)

To keep the structure clear, this code goes in a new file, detail.py.

Advisory page URL structure: http://www.nsfocus.net + link

Print an arbitrary page first to confirm it can be fetched normally.
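
A minimal sketch of that check; here link is a placeholder for an href collected in Step 1:

```python
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
base_url = 'http://www.nsfocus.net'
link = '...'   # placeholder: substitute a real link scraped in Step 1

response = requests.get(base_url + link, headers=headers)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', class_='vulbar').text)   # dump the advisory body
```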

The fields we want are:

  • title (optional, since the previous step already captured it)

  • affected_systems

  • description

  • suggestion

title and affected_systems are easy to get:

```python
title = soup.find('div', align='center').text
affected_systems = soup.find('blockquote').text.strip()
```

The remaining content has no clear structural markers, though; it is mostly separated by <br> tags, so splitting the page text with split() is simpler:

```python
text = soup.find('div', class_='vulbar').text
description = text.split('描述:')[1].split('<**>')[0]
suggestion = text.split('<**>')[1].split('浏览次数')[0]
```

Refactor the whole greb() function into a Detail class:

```python
import requests
from bs4 import BeautifulSoup


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}
base_url = r'http://www.nsfocus.net'

class Detail:
    def __init__(self, url):
        self.url = base_url + url
        response = requests.get(self.url, headers=headers)
        response.encoding = 'utf-8'
        self.soup = BeautifulSoup(response.text, "html.parser")
        self.vulbar = self.soup.find('div', class_='vulbar')
        self.text = self.vulbar.text

    def get_title(self):
        return self.soup.find('div', align='center').text

    def get_affected_systems(self):
        return self.soup.find('blockquote').text.strip()

    def get_description(self):
        # the body text contains <*...*> source markers; strip them out
        return self.text.split('描述:')[1].split('建议:')[0].replace('<*', '').replace('*>', '')

    def get_suggestion(self):
        return self.text.split('*>建议:')[1].split('浏览次数')[0]
```

Now each field is available through a single method call.
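
For example, assuming link holds an href scraped in Step 1:

```python
from detail import Detail

detail = Detail(link)   # link: an href taken from the list page in Step 1
print(detail.get_title())
print(detail.get_affected_systems())
print(detail.get_description())
print(detail.get_suggestion())
```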

Step 3. Write the reports to files

```python
import os
import re
import requests
from bs4 import BeautifulSoup
from detail import Detail
from time import sleep
from docx import Document
from docx.oxml.ns import qn


pages = 5
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0'}

def add_paragraph_with_font(document, text):
    paragraph = document.add_paragraph(text)

    # run.font.name alone only covers Latin text; the east-Asian font
    # has to be set on the rFonts element directly
    run = paragraph.runs[0]
    run.font.name = '宋体'
    run._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')
    run._element.rPr.rFonts.set(qn('w:ascii'), '宋体')
    run._element.rPr.rFonts.set(qn('w:hAnsi'), '宋体')
    run._element.rPr.rFonts.set(qn('w:cs'), '宋体')

for page in range(1, pages+1):
    url = r'http://www.nsfocus.net/index.php?act=sec_bug&type_id=&os=&keyword=&page='+str(page)
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, "html.parser")
    vul_items = soup.find('ul', class_='vul_list').find_all('li')

    for vul in vul_items:
        try:
            date = vul.find('span').text
            title = vul.find('a').text
            link = vul.find('a')['href']
            sleep(3)  # be polite: pause between requests
            detail = Detail(link)
            document = Document()

            core_properties = document.core_properties
            core_properties.author = ''  # blank out the author metadata

            # add_heading(level=0) uses the Title style; set its east-Asian
            # font the same way as in add_paragraph_with_font
            heading = document.add_heading(title, level=0)
            heading_run = heading.runs[0]
            heading_run.font.name = '宋体'
            heading_run._element.rPr.rFonts.set(qn('w:eastAsia'), '宋体')

            add_paragraph_with_font(document, '受影响版本:')
            add_paragraph_with_font(document, detail.get_affected_systems())
            document.add_paragraph('')
            add_paragraph_with_font(document, '描述:')
            add_paragraph_with_font(document, detail.get_description())
            add_paragraph_with_font(document, '建议:\n')
            add_paragraph_with_font(document, detail.get_suggestion())

            # titles can contain characters that are invalid in filenames
            safe_title = re.sub(r'[\\/:*?"<>|]', '_', title)
            document.save(os.path.join(output_dir, f"{date}-{safe_title}.docx"))
        except Exception as e:
            print(f"Error: {title} {link}, {e}")
            continue
```
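
Running the script walks the first five list pages and drops one .docx per advisory into output/, named date-title.docx.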