Python Web Scraping (6): BeautifulSoup, a Parsing Tool


Preface

The following notes on BeautifulSoup record only the commonly used features; for a deeper understanding, consult the official documentation.

BeautifulSoup : https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Introduction to BeautifulSoup

Beautiful Soup is a Python library for extracting data from HTML and XML files.

| Parser | Typical usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python's standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good error tolerance | Poor error tolerance in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast; good error tolerance | Requires the C-based lxml library to be installed |
| lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") | Very fast; the only parser that supports XML | Requires the C-based lxml library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; parses documents the same way a browser does; produces valid HTML5 | Very slow; external Python dependency |
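As a quick illustration of why the parser choice matters, here is a minimal sketch of how different parsers handle the same invalid markup. The output comments follow the examples in the official documentation and may differ slightly across Beautiful Soup versions:

# The optional parsers must be installed first, e.g.:
#   pip install beautifulsoup4 lxml html5lib
from bs4 import BeautifulSoup

broken = "<a></p>"  # invalid HTML: a stray closing </p>

print(BeautifulSoup(broken, "html.parser"))  # <a></a>  -- the stray tag is simply dropped
print(BeautifulSoup(broken, "lxml"))         # <html><body><a></a></body></html>
print(BeautifulSoup(broken, "html5lib"))     # <html><head></head><body><a><p></p></a></body></html>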

Kinds of objects

Beautiful Soup converts a document into a tree of Python objects; you mainly deal with four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

# The following code demonstrates the four kinds of objects
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)     # <class 'bs4.element.Tag'>
tag.name      # 'b'
tag.attrs     # {'class': ['boldest']}
tag['class']  # ['boldest']

print(type(tag.string))  # <class 'bs4.element.NavigableString'>
tag.string.replace_with("No longer bold")
print(tag.string)  # No longer bold

print(soup.name)  # [document]

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
print(type(comment))  # <class 'bs4.element.Comment'>

Note: if you do not pass a parser explicitly (e.g. BeautifulSoup(markup) with no second argument), Beautiful Soup emits "UserWarning: No parser was explicitly specified". Always name the parser you want, as in the examples above.

Navigating the parse tree

We will use this "Alice in Wonderland" document as the example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')  # specify the parser explicitly
link = soup.a
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

"""
p
body
html
[document]
"""
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
sibling_soup = BeautifulSoup(html, 'lxml')
print(sibling_soup.a.next_sibling)   # prints the whitespace (the newline between the first two <a> tags)
print(sibling_soup.a.next_sibling.next_sibling)  # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
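Besides .parents and .next_sibling, a few other navigation attributes come up constantly. A minimal sketch, reusing the soup built from html_doc above (output comments assume that document):

head_tag = soup.head
print(head_tag.contents)          # [<title>The Dormouse's story</title>]  -- direct children as a list
print(head_tag.title.string)      # The Dormouse's story                   -- the tag's single string child

for child in soup.p.children:     # iterate over the direct children of the first <p>
    print(child)                  # <b>The Dormouse's story</b>

for s in soup.stripped_strings:   # every string in the document, with whitespace stripped
    print(repr(s))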

Searching the tree: find() and find_all()

Beautiful Soup defines many search methods; here we focus on two of them: find() and find_all().

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

The following code finds all tags that are surrounded by string content:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# p
# a
# a
# a
# p
soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(text=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

CSS selectors

Beautiful Soup supports most CSS selectors. Pass a string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:
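A short sketch against the Alice html_doc soup from the previous section (output comments assume that document; exact formatting may vary by Beautiful Soup version):

soup.select("title")          # [<title>The Dormouse's story</title>]
soup.select("p.story > a")    # the three <a class="sister"> links (direct children of <p class="story">)
soup.select("a#link2")        # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select('a[href^="http://example.com/"]')  # attribute selector: all three sister links

soup.select_one(".title b").get_text()  # "The Dormouse's story"  -- select_one() returns only the first match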

Modifying the tree

Beautiful Soup's strength is searching the tree, but it also makes it easy to modify the tree.
The details are omitted here; see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id40
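For reference, a minimal sketch of a few common modification operations (renaming a tag, editing attributes, new_tag() and append()), following the patterns shown in the official documentation:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b

tag.name = "blockquote"        # rename the tag
tag['class'] = 'verybold'      # change an attribute
tag['id'] = 1                  # add an attribute
tag.string = "New text"        # replace the tag's contents
print(tag)                     # <blockquote class="verybold" id="1">New text</blockquote>

del tag['id']                  # delete an attribute
new_tag = soup.new_tag("a", href="http://www.example.com")  # create a new tag
tag.append(new_tag)            # append it inside the <blockquote>
print(tag)                     # <blockquote class="verybold">New text<a href="http://www.example.com"></a></blockquote>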
