[Web Scraping from Scratch] BeautifulSoup: Parsing Libraries, Part 1

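<aside> 😀 Today we look at BeautifulSoup, a widely used parsing library for web scraping. It copes well with malformed pages when extracting data, which makes it one of the most tolerant parsing libraries available.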

</aside>

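This post introduces one approach to data parsing: beautifulsoup4. bs4 is a parsing tool specific to Python, whereas the regex approach covered earlier relies only on regular expressions and is not tied to any programming language.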
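To make the contrast concrete, here is a minimal sketch that pulls the same title out of a small HTML string both ways. The `html` string and the variable names are assumptions made up for illustration, and the built-in `"html.parser"` is used so that nothing beyond bs4 needs to be installed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this comparison.
html = "<html><head><title>The Dormouse's story</title></head></html>"

# Regex approach: portable to any language with a regex engine,
# but the pattern has to mirror the markup closely.
match = re.search(r"<title>(.*?)</title>", html, re.S)
title_by_regex = match.group(1) if match else None

# bs4 approach: parse the document once, then navigate by tag name.
soup = BeautifulSoup(html, "html.parser")
title_by_bs4 = soup.title.string

print(title_by_regex)
print(title_by_bs4)
```

Both lines print the same title text; the difference is that the bs4 version does not depend on how the tag happens to be written in the source.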

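As covered earlier, data parsing comes down to two things: locating a tag, then reading the data stored in that tag or in its attributes. Following that idea, parsing with bs4 takes two steps:

1. Instantiate a BeautifulSoup object and load the page source into it.
2. Call the object's attributes and methods to locate tags and extract data.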
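The two steps can be sketched roughly as follows. The `page_source` string is a made-up stand-in for a real page, and `"html.parser"` is chosen here only because it ships with Python; any installed parser could be used instead.

```python
from bs4 import BeautifulSoup

# Hypothetical page source; in practice this would come from a file or a response body.
page_source = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    "</p>"
)

# Step 1: instantiate a BeautifulSoup object and load the page source into it.
soup = BeautifulSoup(page_source, "html.parser")

# Step 2: locate a tag, then read its text and its attributes.
link = soup.find("a", id="link1")
link_text = link.get_text()      # text stored inside the tag
link_href = link["href"]         # value of a single attribute
link_classes = link["class"]     # class is multi-valued, so bs4 returns a list

print(link_text, link_href, link_classes)
```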

Installing bs4

bs4 can be installed directly with pip. After that, you also need to install the lxml parser:

pip install bs4
pip install lxml

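When installing, you can pass `-i` (short for `--index-url`) to point pip at a mirror hosted in China, which is usually faster from there.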

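BeautifulSoup supports the HTML parser in the Python standard library as well as several third-party parsers. The table below summarizes the characteristics of each parser.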

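| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Fast; tolerant of malformed documents | Requires a C library |
| lxml XML parser | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")` | Fast; the only parser that supports XML | Requires a C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Best tolerance of malformed documents; parses documents the way a browser does; produces HTML5-format documents | Slow; requires an external Python package |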
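Since lxml and html5lib are optional installs, one way to use the table is to try parsers in order of preference and fall back to the built-in one. The sketch below assumes a hypothetical helper named `make_soup`; it relies on bs4 raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: build a soup with the first parser that is available."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately malformed markup: neither <p> nor <b> is closed.
soup, parser_name = make_soup("<p>Hello <b>world")
print(parser_name)
print(soup.b.get_text())
```

Because `"html.parser"` always ships with Python, the final fallback should succeed even on a machine with no third-party parser installed.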

Instantiating BeautifulSoup

There are two cases when instantiating BeautifulSoup: loading data from a local HTML document, and loading data fetched from the web.

Loading a local HTML file

First, write a simple HTML file to use in the examples that follow (file name: test.html).

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

There are two ways to instantiate BeautifulSoup from a local file.

Method 1