BeautifulSoup parsing: getting information from child tags

Popularity: 112   Published: 2023-07-16 10:18:55.0

I have the following "website" (this is part of the HTML):

<div class="moduleBody">
     <div class="feature">
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
         <div class="relatedInfo">
              <span class="relatedTopics">
              <span class="timestamp">22 Mar 2014</span>
         </div>
      </div>
</div> 

I want to extract sometext and somelink. To do that, I wrote the following Python code:

for links in soup.find_all('div','moduleBody'):
        for link in links.find_all('div','feature'):
            if not("video" in (link['href'])):
                print "Name: "+link.text
                #sibling_page=urllib2.urlopen("major_link"+link['href'])
                print " Link extracted: "+link['href']

However, this code doesn't print anything. Can you tell me where my mistake is?

Your div has no href attribute. You have to look one level down, at the <a> element.

from bs4 import BeautifulSoup

html = """
<div class="moduleBody">
     <div class="feature">
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
         <div class="relatedInfo">
              <span class="relatedTopics">
              <span class="timestamp">22 Mar 2014</span>
         </div>
      </div>
</div>
"""

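# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

for links in soup.find_all("div", "moduleBody"):
    for link in links.find_all("div", "feature"):
        # The href lives on the <a> inside each feature div, not on the div itself.
        for a in link.find_all("a"):
            if "video" not in a["href"]:
                print("Name: " + a.text)
                print("Link extracted: " + a["href"])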

This prints:

Name: sometext
Link extracted: somelink
Name: sometext
Link extracted: somelink

It finds the link twice because your HTML is broken. BeautifulSoup repairs it as follows:

<div class="moduleBody">
 <div class="feature">
  <div class="feature">
   <h2>
    <a href="somelink">
     sometext
    </a>
   </h2>
   <div class="relatedInfo">
    <span class="relatedTopics">
     <span class="timestamp">
      22 Mar 2014
     </span>
    </span>
   </div>
  </div>
 </div>
</div>
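
If you want to check how the parser repaired your markup, a minimal sketch like the one below may help. It assumes the bundled "html.parser" backend; other parsers (lxml, html5lib) may repair broken markup differently. The name enclosing is just an illustrative variable, not something from the question.

```python
from bs4 import BeautifulSoup

# Same markup as in the question: the first <div class="feature"> is never closed.
html = """
<div class="moduleBody">
     <div class="feature">
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
         <div class="relatedInfo">
              <span class="relatedTopics">
              <span class="timestamp">22 Mar 2014</span>
         </div>
      </div>
</div>
"""

# Assumption: the parser bundled with Python is used.
soup = BeautifulSoup(html, "html.parser")

# Print the tree as the parser actually built it.
print(soup.prettify())

# Collect the feature divs that enclose the link after the repair.
anchor = soup.find("a")
enclosing = [
    parent for parent in anchor.parents
    if parent.name == "div" and "feature" in parent.get("class", [])
]
print(len(enclosing))
```

Because the unclosed div ends up wrapping the second one, the same link sits inside both feature divs, which is why a loop over the feature divs reaches it twice.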

In your second for loop, the link variable holds a reference to <div class="feature">...</div>, which has no href attribute.
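
To see this for yourself, here is a small sketch that inspects the attributes of the div and of the <a> inside it. The markup is a trimmed-down stand-in for the question's HTML, and raised is only an illustrative flag.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for one feature div from the question.
html = '<div class="feature"><h2><a href="somelink">sometext</a></h2></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", "feature")
print(div.attrs)        # only the class attribute is present on the div
print(div.get("href"))  # .get() returns None for a missing attribute

# Subscript access on a missing attribute raises KeyError instead.
raised = False
try:
    div["href"]
except KeyError:
    raised = True
print("KeyError raised:", raised)

# The href is one level further down, on the <a> element.
a = div.find("a")
print(a["href"], a.text)
```

Using .get("href") instead of ["href"] is one way to probe for the attribute without an exception when you are not sure a tag has it.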

It depends heavily on your structure, but if the <div class="feature"> tag always starts with an <h2> tag that contains only an <a> tag, then you can fetch the anchor tag <a> first:

for links in soup.find_all('div', 'moduleBody'):
    for link in links.find_all('div', 'feature'):
        # Walk down from the feature div: first <h2>, then its first <a>.
        anchor_tag = link.h2.a
        if 'video' not in anchor_tag['href']:
            print('Name: %s' % anchor_tag.text)
            print('Link extracted: %s' % anchor_tag['href'])
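
As an alternative sketch, not the approach from the answers above, a CSS selector can express the same "feature div, then h2, then a" path in one step. It assumes a BeautifulSoup version where select() is available and the bundled "html.parser" backend; results is an illustrative name. Since select() returns each matching element once, the link is not reported twice even with the broken markup.

```python
from bs4 import BeautifulSoup

# Markup from the question, including the unclosed first feature div.
html = """
<div class="moduleBody">
     <div class="feature">
     <div class="feature">
         <h2>
             <a href="somelink">sometext</a>
         </h2>
         <div class="relatedInfo">
              <span class="relatedTopics">
              <span class="timestamp">22 Mar 2014</span>
         </div>
      </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
# Match <a href=...> that is a direct child of an <h2> directly inside a feature div.
for a in soup.select("div.moduleBody div.feature > h2 > a[href]"):
    if "video" not in a["href"]:
        results.append((a.get_text(strip=True), a["href"]))

for name, href in results:
    print("Name: %s" % name)
    print("Link extracted: %s" % href)
```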

By the way, your HTML is malformed: the first <div class="feature"> tag should be closed.

<div class="moduleBody">
 <div class="feature"></div>
 <div class="feature">
     <h2>
         <a href="somelink">sometext</a>
     </h2>
     <div class="relatedInfo">
          <span class="relatedTopics">
          <span class="timestamp">22 Mar 2014</span>
     </div>
  </div>
</div> 
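
One thing to watch for once the first feature div is closed: it is then empty, so it has no <h2>, and chained access such as link.h2.a would fail on it because link.h2 is None. A guarded sketch, assuming the corrected markup (shortened here) and the bundled "html.parser" backend, could look like this. The names module, feature, heading and found are illustrative.

```python
from bs4 import BeautifulSoup

# Corrected markup, shortened: the first feature div is closed and empty.
html = """
<div class="moduleBody">
 <div class="feature"></div>
 <div class="feature">
     <h2>
         <a href="somelink">sometext</a>
     </h2>
 </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

found = []
for module in soup.find_all("div", "moduleBody"):
    for feature in module.find_all("div", "feature"):
        # An empty feature div has no <h2>, so feature.h2 is None there.
        heading = feature.h2
        anchor = heading.a if heading is not None else None
        if anchor is None or not anchor.has_attr("href"):
            continue  # skip feature divs without a usable link
        if "video" not in anchor["href"]:
            found.append((anchor.text, anchor["href"]))

print(found)
```

With the markup fixed, the two feature divs are siblings rather than nested, so the link is collected only once.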