Problem description
I have the following "website" (this is part of its HTML):
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
I want to extract sometext and somelink. To do that, I wrote the following Python code:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        if not("video" in (link['href'])):
            print "Name: "+link.text
            #sibling_page=urllib2.urlopen("major_link"+link['href'])
            print " Link extracted: "+link['href']
However, this code prints nothing. Can you tell me where my mistake is?
Answer 1
Your div has no href attribute. You have to look one level down, at the <a> element.
from bs4 import BeautifulSoup

html = """
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")

for links in soup.find_all("div", "moduleBody"):
    for link in links.find_all("div", "feature"):
        # Search inside the current feature div; the href lives on the <a> element.
        for a in link.find_all("a"):
            if "video" not in a["href"]:
                print("Name: " + a.text)
                print("Link extracted: " + a["href"])
This prints:
Name: sometext
Link extracted: somelink
Name: sometext
Link extracted: somelink
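It finds the link twice because your HTML is broken: the first <div class="feature"> is never closed, so the second one ends up nested inside it. BeautifulSoup repairs it as follows: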
<div class="moduleBody">
 <div class="feature">
  <div class="feature">
   <h2>
    <a href="somelink">
     sometext
    </a>
   </h2>
   <div class="relatedInfo">
    <span class="relatedTopics">
     <span class="timestamp">
      22 Mar 2014
     </span>
    </span>
   </div>
  </div>
 </div>
</div>
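If the duplicate is unwanted, one option is to select the anchors directly instead of looping over every feature div. The following is a minimal sketch, assuming the same broken HTML as above and the built-in "html.parser"; the helper name extract_links is hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_links(soup):
    """Return (text, href) pairs for non-video anchors inside feature divs.

    extract_links is a hypothetical helper used for illustration only.
    """
    results = []
    # select() returns each matching element once, even when the anchor
    # sits inside two nested feature divs.
    for a in soup.select("div.moduleBody div.feature a[href]"):
        if "video" not in a["href"]:
            results.append((a.get_text(strip=True), a["href"]))
    return results


for name, href in extract_links(soup):
    print("Name: " + name)
    print("Link extracted: " + href)
```

Selecting the anchor itself means the nesting of the surrounding divs no longer affects how many times it is reported.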
Answer 2
In your second for loop, the link variable holds a reference to <div class="feature">...</div>, which has no href attribute.
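The missing attribute can be checked directly. This is a small sketch on a simplified one-div snippet (an assumption made for brevity, not the asker's full page) that compares attribute lookup on the div with lookup on the anchor inside it.

```python
from bs4 import BeautifulSoup

# Simplified, well-formed snippet used only for this illustration.
snippet = '<div class="feature"><h2><a href="somelink">sometext</a></h2></div>'
soup = BeautifulSoup(snippet, "html.parser")

div = soup.find("div", "feature")

# The div carries only a class attribute, so there is no href to read.
print(div.has_attr("href"))
# .get() returns None for a missing attribute instead of raising.
print(div.get("href"))

# Subscript access on a missing attribute raises KeyError.
try:
    div["href"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# The href lives on the <a> element one level further down.
anchor = div.find("a")
print(anchor["href"])
```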
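How to fix it depends heavily on your structure, but if the <div class="feature"> tag always starts with an <h2> tag that contains only an <a> tag, then you can fetch the anchor tag <a> first: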
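# Assumes soup is an existing BeautifulSoup object for the page.
for links in soup.find_all('div', 'moduleBody'):
    for link in links.find_all('div', 'feature'):
        # Step down from the feature div to its <h2>, then to the <a> inside it.
        anchor_tag = link.h2.a
        if 'video' not in anchor_tag['href']:
            print('Name: %s' % anchor_tag.text)
            print('Link extracted: %s' % anchor_tag['href'])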
By the way, your HTML is malformed: the first <div class="feature"> tag should be closed.
<div class="moduleBody">
    <div class="feature"></div>
    <div class="feature">
        <h2>
            <a href="somelink">sometext</a>
        </h2>
        <div class="relatedInfo">
            <span class="relatedTopics">
            <span class="timestamp">22 Mar 2014</span>
        </div>
    </div>
</div>
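Note that with this corrected markup the first feature div is empty, so it has no <h2> and a chained lookup such as link.h2.a would fail on it. A minimal sketch of a guarded loop, assuming the markup above and the built-in "html.parser" (the variable names module, feature and found are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<div class="moduleBody">
<div class="feature"></div>
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

found = []
for module in soup.find_all("div", "moduleBody"):
    for feature in module.find_all("div", "feature"):
        # feature.h2 is None when the div has no <h2> descendant,
        # as is the case for the empty first feature div.
        h2 = feature.h2
        anchor = h2.a if h2 is not None else None
        if anchor is None or not anchor.has_attr("href"):
            continue
        if "video" not in anchor["href"]:
            found.append((anchor.text, anchor["href"]))

for name, href in found:
    print("Name: %s" % name)
    print("Link extracted: %s" % href)
```

Skipping feature divs without an anchor keeps the loop from raising on empty or differently structured blocks.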