I need help for my new plugin (College Humor + BeautifulSoup)
I really hate BeautifulSoup. Sometimes it works great, but mostly I don't understand why it isn't working.
I need to print every "title, desc, thumb, video" with this code, but it returns just the first ones.
import urllib2
from BeautifulSoup import BeautifulSoup

CH_ROOT = "http://www.collegehumor.com"
CH_RECENT = "/originals/recent"
CH_VIEWED = "/originals/most-viewed"
CH_LIKED = "/originals/most-liked"
CH_PLAYLIST = "/moogaloop"

def getHTML(url):
    try:
        print 'common :: getHTML :: url = ' + url
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        link = response.read()
        response.close()
    except urllib2.HTTPError, e:
        print "HTTP error: %d" % e.code
    except urllib2.URLError, e:
        print "Network error: %s" % e.reason.args[1]
    else:
        return link

html = getHTML(CH_ROOT + CH_RECENT)
soup = BeautifulSoup(html)
for result in soup.findAll("div", id="tab_content_0"):
    title = result.findAll("strong", {"class":"title"})[0].a.string.strip()
    desc = result.findAll("div", {"class":"linked_details"})[0].p.string.strip()
    thumb = result.findAll("img", {"class":"media_thumb"})[0]['src']
    video = CH_ROOT + CH_PLAYLIST + result.findAll("a", {"class":"video_link"})[0]['href']
    print title, desc, thumb, video
rwparris2
2009-09-18, 20:02
There should only be 1 div with the id tab_content_0. That's why you only get 1 result. If you do len(soup.findAll("div", id="tab_content_0")) you should get 1 as a result.
What you're really interested in is the <li class="video"> stuff. So search for that instead!
If you find yourself using findAll(something)[0], just use find(something). It stops searching after the first result, and is therefore faster on large documents.
In BeautifulSoup, a class doesn't need to be passed in a dict unless you just like to be explicit: a plain string as the second argument is matched against the class attribute.
from BeautifulSoup import MinimalSoup as BeautifulSoup, SoupStrainer

videoContainers = SoupStrainer('li', 'video')
for liTag in BeautifulSoup(html, parseOnlyThese = videoContainers):
    title = liTag.find('strong', 'title').a.string.strip()
    desc = liTag.p.string.strip()
    thumb = liTag.find('img', 'media_thumb')['src']
    video = liTag.find('a', 'video_link')['href']
    print title, desc, thumb, video
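To see why the original loop ran only once, it can help to count the matches on a small page. The sketch below is not BeautifulSoup: it is a stand-in written against Python 3's standard html.parser so it runs without extra packages, and SAMPLE_HTML, TagCounter and count_tags are made-up names for illustration (the real College Humor markup is more involved).

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified markup: one container per tab, each
# holding some <li class="video"> entries. The real page differs.
SAMPLE_HTML = """
<div id="tab_content_0">
  <ul>
    <li class="video"><strong class="title"><a href="/video:1">First</a></strong></li>
    <li class="video"><strong class="title"><a href="/video:2">Second</a></strong></li>
  </ul>
</div>
<div id="tab_content_1">
  <ul>
    <li class="video featured"><strong class="title"><a href="/video:3">Third</a></strong></li>
  </ul>
</div>
"""


class TagCounter(HTMLParser):
    """Count start tags with a given name whose attribute contains a value."""

    def __init__(self, name, attr, value):
        super().__init__()
        self.name, self.attr, self.value = name, attr, value
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Split on whitespace so class="video featured" still matches "video".
        tokens = (dict(attrs).get(self.attr) or "").split()
        if tag == self.name and self.value in tokens:
            self.count += 1


def count_tags(html, name, attr, value):
    counter = TagCounter(name, attr, value)
    counter.feed(html)
    return counter.count


# An id is unique on the page, so a loop over this match runs once ...
container_count = count_tags(SAMPLE_HTML, "div", "id", "tab_content_0")
# ... while the repeated class is what yields one iteration per video.
video_count = count_tags(SAMPLE_HTML, "li", "class", "video")
print(container_count, video_count)
```

On this sample the container is matched once and the video items three times, which is the same difference as between looping over the div and looping over the li tags.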
Dan Dare
2009-09-19, 13:05
@queeup
I found BeautifulSoup to be just like any other tool: you can use it to hammer nails in, or just as well to hammer your own fingers :-)
I found that starting small helps: do the outer search first and print what you found, then write the "for" loop and display the larger bits it finds. That helps you write the code that parses the little bits without having to go back to the DOM tree explorer tool every time.
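One way to "start small" without a DOM explorer is to print a shallow outline of the page first and only then write the extraction code. This is a rough sketch, assuming Python 3 and only the standard html.parser module rather than BeautifulSoup; OutlineParser, outline and SAMPLE_HTML are hypothetical names and the markup is invented for the example.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Hypothetical, simplified markup for illustration only.
SAMPLE_HTML = """
<div id="tab_content_0">
  <ul>
    <li class="video">
      <a class="video_link" href="/video:1"><img class="media_thumb" src="t1.jpg"></a>
      <strong class="title"><a href="/video:1">First</a></strong>
    </li>
  </ul>
</div>
"""


class OutlineParser(HTMLParser):
    """Record tag#id.class labels, indented by depth, down to max_depth."""

    def __init__(self, max_depth):
        super().__init__()
        self.max_depth = max_depth
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if self.depth < self.max_depth:
            attrs = dict(attrs)
            label = tag
            if attrs.get("id"):
                label += "#" + attrs["id"]
            if attrs.get("class"):
                label += "." + ".".join(attrs["class"].split())
            self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def outline(html, max_depth=3):
    parser = OutlineParser(max_depth)
    parser.feed(html)
    return parser.lines


# Look at the outer layers first, then raise max_depth to see the inner bits.
for line in outline(SAMPLE_HTML, max_depth=3):
    print(line)
```

Raising max_depth one step at a time mirrors the advice above: confirm the outer container, then the list items, then the small pieces inside each item.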
@rwparris2
Nice code, thanks for that. Personally I like the explicit dicts for parameters: they make maintenance easier and help other people learn and understand what the code was trying to do, although it's true that leaving them out makes the code tidier.
Yea my fingers hurt :)
Just one issue left. How can I limit the search results?
@rwparris2's code finds 60 results, but I only need the ('li', 'video') items inside ("div", id="tab_content_0").
tab_content_0 = recent
('li', 'video') = 20
tab_content_1 = most-viewed
('li', 'video') = 20
tab_content_2 = most-liked
('li', 'video') = 20
Dan Dare
2009-09-19, 16:03
Then set the strainer to tab_content_0, do a findAll('li', 'video') on that, and loop over the results, something like:
divContent = BeautifulSoup(html, parseOnlyThese = SoupStrainer('div', 'tab_content_0'))
liTags = divContent.findAll('li', 'video')
for liTag in liTags:
    title = liTag.find('strong', 'title').a.string.strip()
    desc = liTag.p.string.strip()
    thumb = liTag.find('img', 'media_thumb')['src']
    video = liTag.find('a', 'video_link')['href']
    print title, desc, thumb, video
grrrrr am I stupid?
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer
html = urllib.urlopen('http://www.collegehumor.com/originals/recent')
divContent = BeautifulSoup(html, parseOnlyThese=SoupStrainer('div', 'tab_content_0'))
print divContent
Nothing prints :(
Dan Dare
2009-09-19, 16:40
Try SoupStrainer('div', { 'id' : 'tab_content_0' })
rwparris2
2009-09-19, 16:42
Try SoupStrainer('div', { 'id' : 'tab_content_0' })
Right, the no-dict shortcut only applies to classes.
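The id-versus-class mix-up can be reproduced on a toy page. The following is a minimal sketch, assuming Python 3 and the standard html.parser module as a stand-in for BeautifulSoup; ScopedLinkCollector, video_links and SAMPLE_HTML are hypothetical names and the markup is invented. It collects links only from video items inside one container, and shows that looking for "tab_content_0" among the classes finds nothing, since it is an id.

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup: the tab name is an id, not a class.
SAMPLE_HTML = """
<div id="tab_content_0" class="tab">
  <ul>
    <li class="video"><div class="linked_details"><a href="/video:1">Recent one</a></div></li>
    <li class="video"><div class="linked_details"><a href="/video:2">Recent two</a></div></li>
  </ul>
</div>
<div id="tab_content_1" class="tab">
  <ul>
    <li class="video"><div class="linked_details"><a href="/video:3">Most viewed</a></div></li>
  </ul>
</div>
"""


class ScopedLinkCollector(HTMLParser):
    """Collect hrefs from <li class="video"> items inside one matching <div>."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.div_depth = 0      # > 0 while inside the target container
        self.in_video = False   # True while inside a matching <li>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1      # nested div inside the container
            elif self.value in (attrs.get(self.attr) or "").split():
                self.div_depth = 1       # entering the target container
        elif self.div_depth and tag == "li":
            self.in_video = "video" in (attrs.get("class") or "").split()
        elif self.div_depth and self.in_video and tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "li":
            self.in_video = False


def video_links(html, value, attr="id"):
    collector = ScopedLinkCollector(attr, value)
    collector.feed(html)
    return collector.links


by_id = video_links(SAMPLE_HTML, "tab_content_0", attr="id")
by_class = video_links(SAMPLE_HTML, "tab_content_0", attr="class")
print(by_id)
print(by_class)
```

Matching on the id returns only the links from the first tab, while matching the same string as a class returns an empty list, which is the "nothing prints" symptom from the previous post.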
Thanks anyway, both of you :) I'm confused atm. My brain stopped. I need a break. Hey Dan, are you sick and tired of my questions? lol...
Dan Dare
2009-09-19, 17:00
In my view, asking questions is the best way to learn, but you still have to put in the effort to understand the answers - not that you don't, that was a generic thought :) That's how I learned everything I know, whether it was asking people, the Internet or even myself sometimes :) Ask, understand and try (sometimes lots of trying)...