python - How to use Python to extract multiple independently nested JSON objects and keys from a website

Posted on: June 5, 2020
Categories: python, python3, selenium

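I want to extract multiple independent JSON objects and their associated keys from a web page. By "independently nested" I mean that each JSON object is nested inside its own <script type="application/ld+json"> element.

I am currently trying to do this with beautifulsoup, json, and requests, but I can't get it to work. I have read similar posts (e.g. here, here, and here), but none of them address this problem: how to extract multiple independently nested JSON objects at once, and then pull specific keys out of those objects. The other examples assume the JSON objects all sit inside a single nest.

Here is a working example of where I am so far: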

# Using Python 3.8.1, 32 bit, Windows 10

from bs4 import BeautifulSoup

import requests

import json


#%% Create variable with website location

reno = 'https://www.foodpantries.org/ci/nv-reno'


#%% Download the webpage

renoContent = requests.get(reno)


#%% Make into nested html

renoHtml = BeautifulSoup(renoContent.text, 'html.parser')


#%% Keep only the HTML that contains the JSON objects I want

spanList = renoHtml.find("div", class_="span8")


#%% Get JSON objects.

data = json.loads(spanList.find('script', type='application/ld+json').text)

print(data)

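This is where I'm stuck. I can get the JSON data for the first location, but I can't get it for the other 9 locations contained in the spanList variable. How do I get Python to fetch the JSON data for the other 9 locations? I did try spanList.find_all, but that raised AttributeError: ResultSet object has no attribute 'text'. If I remove .text from the json.loads call, I instead get TypeError: the JSON object must be str, bytes or bytearray, not ResultSet.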
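Both errors appear to share one cause: find_all returns a ResultSet (a list-like collection of tags), so the text lookup and json.loads have to be applied to each tag rather than to the collection. A minimal sketch on an inline HTML string follows; the two sample entries ("Pantry A", "Pantry B") are made up for illustration and are not taken from the site.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two separate ld+json blocks.
sample_html = """
<div class="span8">
  <script type="application/ld+json">{"name": "Pantry A"}</script>
  <script type="application/ld+json">{"name": "Pantry B"}</script>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
container = soup.find("div", class_="span8")

# find_all returns a ResultSet, so each tag is parsed individually.
scripts = container.find_all("script", type="application/ld+json")
parsed = [json.loads(tag.string) for tag in scripts]

print(len(parsed))
print([item["name"] for item in parsed])
```

Because the list comprehension walks whatever find_all returns, the same code should handle 10 entries, 20, or none without changes.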

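My hunch is that this is complicated because each JSON object sits in its own <script type="application/ld+json"> element. None of the other examples I've seen have a similar situation. It seems that json.loads only recognizes the first JSON object and then stops.

Another complication is that the number of locations varies by city. I'm hoping for a solution that automatically extracts all locations no matter how many are on the page (for example, Reno has 10 and Las Vegas has 20).

I also can't figure out how to pull values out of this JSON payload using key names such as name and streetAddress. This may be related to how I'm extracting the JSON objects via json.dumps, but I'm not sure.

Here is an example of how the JSON objects are laid out: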
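On the key lookup: json.loads turns a JSON string into a Python dict, while json.dumps goes the other way (dict to string), so keys are read from the result of json.loads. A small sketch, assuming a payload shaped like the listings on the page (shortened here; the variable names are illustrative):

```python
import json

# Shortened sample payload, shaped like one ld+json listing.
raw = """
{
  "@type": "LocalBusiness",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "2301 Kings Row",
    "addressLocality": "Reno",
    "addressRegion": "NV",
    "postalCode": "89503"
  },
  "name": "Desert Springs Baptist Church"
}
"""

record = json.loads(raw)                      # str -> dict

name = record["name"]                         # top-level key
street = record["address"]["streetAddress"]   # key nested under "address"
phone = record.get("telephone", "")           # .get avoids KeyError when a field is absent

print(name)
print(street)
```

Note that streetAddress and the other address fields live one level down, under "address", which is easy to miss when looking them up directly on the top-level dict.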

           <script type = "application/ld+json">
            {
            "@context": "https://schema.org",
            "@type": "LocalBusiness",
            "address": {
            "@type":"PostalAddress",
            "streetAddress":"2301 Kings Row",
            "addressLocality":"Reno",
            "addressRegion":"NV",
            "postalCode": "89503"
            },
            "name": "Desert Springs Baptist Church"
            ,"image": 
             "https://www.foodpantries.org/gallery/28591_desert_springs_baptist_church_89503_wzb.jpg"
            ,"description": "Provides a food pantry.  Must provide ID and be willing to fill out intake 
              form Pantry.Hours: Friday 11:00am - 12:00pmFor more information, please call. "
            ,"telephone":"(775) 746-0692"
            }

My end goal is to export the data contained in the keys name, streetAddress, addressLocality, addressRegion, and postalCode to a CSV file.
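Once the objects are parsed into a list of dicts, the export step could look roughly like the sketch below. It uses only the standard csv module; the to_row helper, the sample records, and the pantries.csv file name mentioned in the comment are assumptions for illustration (the second record is invented to show a missing field).

```python
import csv
import io

FIELDS = ["name", "streetAddress", "addressLocality", "addressRegion", "postalCode"]

def to_row(record):
    """Flatten one parsed ld+json dict into a flat CSV row (hypothetical helper)."""
    address = record.get("address", {})
    row = {"name": record.get("name", "")}
    for key in FIELDS[1:]:
        row[key] = address.get(key, "")   # missing fields become empty cells
    return row

# Hypothetical already-parsed data, one dict per listing.
data = [
    {"name": "Desert Springs Baptist Church",
     "address": {"streetAddress": "2301 Kings Row", "addressLocality": "Reno",
                 "addressRegion": "NV", "postalCode": "89503"}},
    {"name": "Example Pantry",
     "address": {"streetAddress": "1 Main St", "addressLocality": "Reno",
                 "addressRegion": "NV"}},
]

# StringIO keeps the sketch self-contained; for a real file, use
# open("pantries.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(to_row(r) for r in data)

csv_text = buffer.getvalue()
print(csv_text)
```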

Best answer

IIUC, you just need to call the .find_all method on spanList to get all the JSON objects.

Try this:

from bs4 import BeautifulSoup
import requests
import json

reno = 'https://www.foodpantries.org/ci/nv-reno'
renoContent = requests.get(reno)
renoHtml = BeautifulSoup(renoContent.text, 'html.parser')
json_scripts = renoHtml.find("div", class_="span8").find_all('script', type='application/ld+json')
data = [json.loads(script.text, strict=False) for script in json_scripts] 
#use strict=False to bypass json.decoder.JSONDecodeError: Invalid control character
print(data)
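To see why strict=False matters here: some of the description strings on the page contain a raw line break inside a JSON string, which the default strict parser rejects. A small self-contained illustration (the sample string is shortened and only assumed to resemble the page's data):

```python
import json

# A JSON string value containing a raw, unescaped newline character.
raw = '{"description": "Provides a food pantry.\nMust provide ID."}'

try:
    json.loads(raw)                 # default strict=True rejects control characters
    strict_ok = True
except json.JSONDecodeError as exc:
    strict_ok = False
    print("strict parse failed:", exc.msg)

# strict=False allows control characters such as newlines inside strings.
relaxed = json.loads(raw, strict=False)
print(relaxed["description"].splitlines())
```

The relaxed parse keeps the line break in the resulting string, so it may be worth normalizing whitespace before writing such values to a CSV file.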
