Scraping HTML and XML Data with BeautifulSoup

2025-08-12 20:24:02

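Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. It offers a simple, flexible way to parse and traverse a document, along with helper methods for pulling out the data you need.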

Installation

    pip install beautifulsoup4

Usage

Import the library. At the top of your Python script, import Beautiful Soup:

    from bs4 import BeautifulSoup

Read the HTML or XML document. Load the document into a variable, either by reading it from a file or by passing its content to Beautiful Soup directly as a string:

    # Read an HTML document from a file
    with open('example.html', 'r', encoding='utf-8') as f:
        html_doc = f.read()

    # Or pass an HTML string directly
    html_doc = '<html><body><h1>Hello, World!</h1></body></html>'

Create a Beautiful Soup object. Pass the document content and the parser type to the BeautifulSoup constructor:

    soup = BeautifulSoup(html_doc, 'html.parser')

Parse and extract data. Use the methods and attributes Beautiful Soup provides to locate the elements you need. Elements can be selected by tag name, class name, attribute, and so on:

    # Select an element by tag name
    title = soup.h1
    print(title.text)  # print the element's text content

    # Select all <p> elements (add class_='...' to filter by class name)
    paragraphs = soup.find_all('p')
    for p in paragraphs:
        print(p.text)

    # Select elements by attribute value
    links = soup.find_all('a', href='http://example')
    for link in links:
        print(link['href'])

Example
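
As a warm-up before the scraping task, the selection steps described above can be run end to end on a small inline document. This is only a minimal sketch: the sample HTML, the `intro` class, and the example.com link are made up for illustration and do not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_doc = """
<html><body>
  <h1>Hello, World!</h1>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph.</p>
  <a href="http://example.com/a">Link A</a>
  <a name="anchor-only">No href here</a>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# By tag name: soup.h1 is the first <h1> in the document.
heading = soup.h1.text

# By class name: the keyword is class_ because `class` is reserved in Python.
intro_texts = [p.text for p in soup.find_all('p', class_='intro')]

# By attribute: href=True matches every <a> that has an href attribute.
hrefs = [a['href'] for a in soup.find_all('a', href=True)]

print(heading)
print(intro_texts)
print(hrefs)
```

Passing `href=True` instead of a fixed URL is a convenient way to skip anchors that have no link target, which would otherwise raise a `KeyError` on `a['href']`.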

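Task: scrape member data from .personalitycafe /members/ .html, collecting roughly 20,000 users together with each user's follower and following counts, and save the results to a CSV file.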

Import the required libraries:

    import csv

    import requests
    from bs4 import BeautifulSoup

Send the HTTP request and create the Beautiful Soup object:

    url = ' .personalitycafe /members/'  # incomplete address; supply the full URL
    response = requests.get(url)
    response.raise_for_status()  # stop early on an HTTP error status
    html_doc = response.text
    soup = BeautifulSoup(html_doc, 'html.parser')

Parse the user list and extract the required fields:

    user_list = soup.find_all('li', class_='member')
    data = []
    for user in user_list:
        username = user.find('a', class_='username').text
        follower_count = user.find('dd', class_='follow_count').text
        following_count = user.find('dd', class_='following_count').text
        data.append([username, follower_count, following_count])

Save the data to a CSV file:

    filename = 'user_data.csv'
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Username', 'Follower Count', 'Following Count'])
        writer.writerows(data)
    print(f"Data saved to {filename}.")

The scraped user data is then saved to a CSV file named "user_data.csv", with each row holding a username and its follower and following counts.
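
The task calls for roughly 20,000 users, which a single page request is unlikely to cover. One possible approach, sketched here under stated assumptions, is to request numbered pages until enough rows have been collected. The helper names `parse_users`, `collect_users`, and `fake_fetch` are hypothetical, the selectors are the ones used in the example above, and the way the real site numbers its pages is not known from the text, so the page fetcher is passed in as a function.

```python
from bs4 import BeautifulSoup


def parse_users(html_doc):
    """Extract [username, followers, following] rows from one page of HTML."""
    soup = BeautifulSoup(html_doc, 'html.parser')
    rows = []
    for user in soup.find_all('li', class_='member'):
        rows.append([
            user.find('a', class_='username').text,
            user.find('dd', class_='follow_count').text,
            user.find('dd', class_='following_count').text,
        ])
    return rows


def collect_users(fetch_page, target=20000, max_pages=2000):
    """Fetch pages 1, 2, 3, ... until `target` rows are collected,
    a page yields no users, or `max_pages` is reached."""
    data = []
    for page in range(1, max_pages + 1):
        rows = parse_users(fetch_page(page))
        if not rows:
            break  # an empty page is treated as the end of the list
        data.extend(rows)
        if len(data) >= target:
            break
    return data[:target]


def fake_fetch(page):
    """Stand-in for a real HTTP request: serves 3 pages of 2 users each."""
    if page > 3:
        return '<ul></ul>'
    items = ''.join(
        f'<li class="member"><a class="username">user{page}_{i}</a>'
        f'<dd class="follow_count">{i}</dd>'
        f'<dd class="following_count">{i + 1}</dd></li>'
        for i in range(2)
    )
    return f'<ul>{items}</ul>'


users = collect_users(fake_fetch, target=5)
print(len(users))
print(users[0])
```

Against a real site, `fetch_page` would wrap `requests.get` and return `response.text` for the given page number. Keeping the fetcher separate from the parsing loop also makes it easy to add a delay between requests or to test the loop offline, as done here.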

Note that the selectors used here depend on the target site's structure and HTML markup, so the code may need further adjustment before it extracts the data correctly.
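
One adjustment that is often needed is handling entries where an expected element is absent: `find` returns `None` in that case, and reading `.text` from it raises an `AttributeError`. The sketch below shows one way to fall back to a default value. The helper `text_or_default` and the sample markup are hypothetical and only mirror the class names used in the example.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, class_name, default=''):
    """Return the stripped text of the first matching child, or `default`."""
    element = parent.find(name, class_=class_name)
    if element is None:
        return default
    return element.get_text(strip=True)


# Hypothetical markup: the second member has no follower/following counts.
html_doc = """
<ul>
  <li class="member">
    <a class="username"> alice </a>
    <dd class="follow_count">12</dd>
    <dd class="following_count">7</dd>
  </li>
  <li class="member">
    <a class="username">bob</a>
  </li>
</ul>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
rows = [
    [
        text_or_default(user, 'a', 'username'),
        text_or_default(user, 'dd', 'follow_count', default='0'),
        text_or_default(user, 'dd', 'following_count', default='0'),
    ]
    for user in soup.find_all('li', class_='member')
]
print(rows)
```

Whether a missing count should become `'0'`, an empty string, or cause the row to be skipped is a design choice that depends on how the CSV will be used later.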

Beautiful Soup

Some common Beautiful Soup operations and tips

Extract elements by tag name:

    elements = soup.find_all('tag_name')

Extract elements with a CSS selector:

    elements = soup.select('css_selector')

Get an element's text content:

    text = element.get_text()

Get an element's attribute value:

    attribute_value = element['attribute_name']
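
The four operations above can be tried together on a small document. The snippet is a minimal sketch with made-up markup: the `main` id, the `note` class, and the `/about` link are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_doc = """
<div id="main">
  <p class="note">First <b>bold</b> note</p>
  <p>Plain paragraph</p>
  <a href="/about" title="About us">About</a>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Tag name: every <p> in the document.
paragraphs = soup.find_all('p')

# CSS selector: <p class="note"> elements inside <div id="main">.
notes = soup.select('div#main p.note')

# Text content: get_text() joins the text of nested tags as well.
note_text = notes[0].get_text()

# Attribute value: indexing raises KeyError if the attribute is missing,
# while .get() returns None instead.
link = soup.find('a')
href = link['href']
target = link.get('target')

print(len(paragraphs), note_text, href, target)
```

Using `element.get('attribute_name')` rather than `element['attribute_name']` is the safer choice when an attribute may not be present on every element.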


Published by 讯客互联 in its IT业界 column. You are welcome to share this article; when reposting, please credit the source.
