Python Web Scraping: Locating Elements and Extracting Values with BeautifulSoup

Posted on: June 2, 2020
Categories: python3, selenium

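Ways to get a given tag and its attribute values from a web page:

1. Tag name: tag.name. The tag object itself has type <class 'bs4.element.Tag'>.

2. All attributes: tag.attrs, which returns a dict mapping attribute names to values.

3. A single attribute: tag.get('attr_name') or tag['attr_name']. tag.get() returns None when the attribute is missing, while tag['attr_name'] raises KeyError.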
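The three access patterns can be tried on a small snippet. This is a minimal sketch: the markup, the URL, and the variable names are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a                      # first <a> tag in the document

name = tag.name                   # the tag name as a string
attrs = tag.attrs                 # dict of every attribute on the tag
href_get = tag.get("href")        # value, or None if the attribute is absent
href_idx = tag["href"]            # value, or KeyError if the attribute is absent
missing = tag.get("title")        # the sample tag has no title attribute
classes = tag["class"]            # class is multi-valued, so this is a list

print(type(tag), name, attrs)
print(href_get, missing, classes)
```

tag.get() is usually the safer choice when scraping pages where an attribute may or may not be present.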

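Getting a tag's text content:

1. tag.string returns the text of the current tag, but only when the tag has a single child (one text node, or one child tag that itself resolves to a single string). If the tag has several children, tag.string is None.

2. tag.get_text() returns all the strings inside the tag, including those of nested tags, joined into one string.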
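The difference between the two shows up as soon as a tag has more than one child. The following sketch uses made-up markup to illustrate it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<div><p>Hello</p><p>Good <b>morning</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

single = first_p.string           # one text child, so the text is returned
multi = second_p.string           # a text node plus <b>, so this is None
text = second_p.get_text()        # all nested strings joined together
nested = soup.div.get_text()      # strings of both <p> tags, concatenated
spaced = soup.div.get_text(" ", strip=True)  # stripped pieces joined by a space

print(repr(single), repr(multi))
print(repr(text), repr(nested), repr(spaced))
```

Passing a separator and strip=True to get_text() is a convenient way to keep the text of adjacent tags from running together.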

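Useful BeautifulSoup helpers

1. stripped_strings

The extracted strings can contain a lot of extra spaces or blank lines. Iterating over .stripped_strings yields each string with leading and trailing whitespace removed and skips strings that are whitespace only.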


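# soup is assumed to be a BeautifulSoup object that has already been parsed
for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'

2. Pretty-printing the page:

soup.prettify() returns the parse tree as a formatted string, with each tag and each string on its own line.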
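Both helpers can be compared on a small document of your own. This is a minimal sketch with invented markup; .strings is used alongside .stripped_strings only to show what gets filtered out.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with deliberately messy whitespace.
html = "<div>\n  <p>  first  </p>\n\n  <p>second</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # includes whitespace-only strings
clean = list(soup.stripped_strings)  # trimmed, whitespace-only strings skipped
pretty = soup.prettify()             # formatted markup as a str

print(raw)
print(clean)
print(pretty)
```

prettify() is mainly useful for inspecting the structure of a page while working out which selectors to use.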

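Finding elements with BeautifulSoup:

1. find_all(class_="class") returns every matching tag. The result has type <class 'bs4.element.ResultSet'>, which behaves like a list.

2. find(class_="class") returns the first matching tag, of type <class 'bs4.element.Tag'>, or None if nothing matches.

3. select_one() takes a CSS selector and returns the first matching tag, of type <class 'bs4.element.Tag'>, or None if nothing matches.

4. select() takes a CSS selector and returns every matching tag as a <class 'bs4.element.ResultSet'>.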

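5. soup = BeautifulSoup(backdata, 'html.parser')  # parse the response data into a BeautifulSoup object

soup.find_all('tag_name', attrs={'attr_name': 'attr_value'})  # returns a list-like ResultSet

limit caps the number of results that find_all returns.

recursive=False makes find_all look only at the tag's direct children.

soup.find_all(text=re.compile('content')) matches on text, and the regular expression allows partial matches. This requires import re. Newer BeautifulSoup releases name this argument string.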
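The lookup methods and find_all options above can be sketched together on one small document. The markup is invented for illustration, backdata stands in for a downloaded page, and the sketch uses the string= spelling of the text filter, assuming BeautifulSoup 4.4 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a real HTTP response body.
backdata = """
<div id="box">
  <p class="story">first content here</p>
  <p class="story">second</p>
  <a class="sister" href="/elsie">Elsie</a>
  <a class="sister" href="/lacie">Lacie</a>
</div>
"""
soup = BeautifulSoup(backdata, "html.parser")

all_links = soup.find_all(class_="sister")        # every tag with that class
first_link = soup.find(class_="sister")           # first match only, or None
css_first = soup.select_one("a.sister")           # CSS selector, first match
css_all = soup.select("div#box > p.story")        # CSS selector, all matches

by_attrs = soup.find_all("a", attrs={"href": "/lacie"})  # filter on attributes
limited = soup.find_all("p", limit=1)                    # stop after one result
top_level = soup.find_all("p", recursive=False)          # <p> is nested, so empty
in_box = soup.div.find_all("p", recursive=False)         # direct children of <div>
by_text = soup.find_all(string=re.compile("content"))    # partial text match

print(len(all_links), first_link["href"], css_first["href"])
print([p.get_text() for p in css_all])
print(by_attrs, limited, top_level, in_box, by_text)
```

Note that a text search returns the matching strings themselves rather than the tags that contain them; use .parent on a result to get back to the enclosing tag.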

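Working with child nodes:

1. contents

The .contents attribute returns a tag's direct children as a list.

2. children

.children is an iterator over a tag's direct children, for use in a loop.

3. descendants

.contents and .children only cover direct children, while .descendants iterates over all of a tag's descendants: children, grandchildren, and so on.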
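A short sketch makes the difference between direct children and all descendants visible. The markup is invented for illustration, and the isinstance checks are only there to separate tags from text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup, used only for illustration.
html = "<ul><li>one</li><li><b>two</b></li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

direct_list = ul.contents                             # list of the two <li> tags
direct_names = [child.name for child in ul.children]  # same nodes, via iteration

# .descendants walks the whole subtree, tags and text nodes alike.
tag_names = [n.name for n in ul.descendants if isinstance(n, Tag)]
strings = [str(n) for n in ul.descendants if isinstance(n, NavigableString)]

print(direct_list)
print(direct_names, tag_names, strings)
```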

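Working with parent nodes:

1. parent

The .parent attribute returns an element's direct parent.

2. find_parents()

Returns all matching ancestors of the element, starting from the nearest one.

3. find_parent()

Returns the nearest matching ancestor. With no filter, that is the direct parent.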
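The three ways of moving upward can be sketched on a nested document. The markup and the id values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical nested markup, used only for illustration.
html = "<div id='outer'><div id='inner'><p><b>text</b></p></div></div>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

parent_name = b.parent.name                           # direct parent of <b>
nearest_div = b.find_parent("div")["id"]              # closest <div> ancestor
all_divs = [d["id"] for d in b.find_parents("div")]   # every <div> ancestor

print(parent_name, nearest_div, all_divs)
```

Passing a tag name or other filter to find_parent() is a handy way to climb from a matched element to the container you actually want to process.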

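Working with sibling nodes:

1. next_siblings iterates over the siblings that follow the element.

2. previous_siblings iterates over the siblings that precede the element.

3. find_next_siblings() returns all matching siblings that follow the element.

4. find_next_sibling() returns the first matching sibling that follows the element. To search backward instead, use find_previous_sibling().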
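Sibling navigation is easiest to follow on markup without whitespace between tags, since whitespace text nodes also count as siblings. The following sketch uses invented markup and id values.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between tags, so every sibling is a tag.
html = "<p><a id='a1'>1</a><a id='a2'>2</a><b>x</b><a id='a3'>3</a></p>"
soup = BeautifulSoup(html, "html.parser")
a1 = soup.find(id="a1")
a2 = soup.find(id="a2")

after = [t.name for t in a2.next_siblings]                 # everything after a2
before = [t["id"] for t in a2.previous_siblings]           # everything before a2
next_a = a2.find_next_sibling("a")["id"]                   # first later <a>
later_as = [t["id"] for t in a1.find_next_siblings("a")]   # all later <a> tags
prev_a = a2.find_previous_sibling("a")["id"]               # first earlier <a>

print(after, before, next_a, later_as, prev_a)
```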

tingyuxinsheng@gmail.com
