Accessing XBMC's IMDb scraping from within a (python) video plugin or script?
gzusrawx
2008-10-02, 04:46
Is there currently a way to access the imdb scraping from within a plugin or script? I want to be able to pass imdb urls to xbmc from another source in the video plugin to view the movie information.
Nuka1195
2008-10-02, 05:59
there is a theater showtimes plugin that has an imdb module. most of the regex is stolen from the imdb.xml of xbmc.
it may be broken now, but you can see how to do it.
it's in the xbmc-addons svn.
BigBellyBilly
2008-10-28, 19:44
The myTV and DVDProfiler scripts have IMDb scraping modules. They were originally based on the ones Nuka mentioned, and they were working the last time I tried them.
It would be nice to be able to tap into the main XBMC scrapers though...
exposing the scrapers to python is certainly doable. it would need some internal reorganization, but i think we want to do that anyway after atlantis.
nate12o6
2008-12-05, 23:05
I second this.
This would be great.
nate12o6
2008-12-05, 23:22
It would be great if there was an API that would return imdb info when passed an imdb id.
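As a rough sketch of the caller's side of such an API: the title pages used elsewhere in this thread live under http://www.imdb.com/title/<id>, so a plugin holding a full IMDb URL could first reduce it to the id. The names `imdb_id_from_url` and `title_url` are made up for illustration and are not part of XBMC, and the "tt plus digits" id shape is an assumption about IMDb's URLs.

```python
import re

# Assumption: IMDb title ids look like "tt" followed by digits.
_ID_RE = re.compile(r'/title/(tt\d+)')

def imdb_id_from_url(url):
    """Return the title id embedded in an IMDb URL, or None if there is none."""
    match = _ID_RE.search(url)
    return match.group(1) if match else None

def title_url(imdb_id):
    """Build a title page URL for an id (same URL shape as used in this thread)."""
    return 'http://www.imdb.com/title/' + imdb_id
```

Returning None instead of raising lets the plugin decide what to do with URLs that are not title pages, such as search results.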
Try this function.
It's far from perfect but it works quite well.
imdb() - pass it the movie name as a string.
It returns genre, year, image, rating, plot.
import re
import urllib
import urllib2

def imdb(url):
    # 'url' is really the movie name; it is quoted and sent to the IMDb search page.
    req = urllib2.Request('http://www.imdb.com/find?s=all&q='+urllib.quote(url))
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
    response = urllib2.urlopen(req).read()
    # The search page links to the best match; otherwise IMDb served the title page directly.
    alt=re.compile('<b>Media from <a href="/title/(.+?)/">').findall(response)
    if len(alt)>0:
        req = urllib2.Request('http://imdb.com/title/'+alt[0])
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
        response = urllib2.urlopen(req).read()
        genre=re.compile(r'<h5>Genre:</h5>\n<a href=".+?">(.+?)</a>').findall(response)
        year=re.compile(r'<a href="/Sections/Years/.+?/">(.+?)</a>').findall(response)
        image=re.compile(r'<img border="0" alt=".+?" title=".+?" src="(.+?)" /></a>').findall(response)
        rating=re.compile(r'<div class="meta">\n<b>(.+?)</b>').findall(response)
        # Try the plot summary page first.
        req = urllib2.Request('http://www.imdb.com/title/'+alt[0]+'/plotsummary')
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
        response = urllib2.urlopen(req).read()
        plot=re.compile('<p class="plotpar">\n(.+?)\n<i>\n').findall(response)
        try:
            if plot[0].find('div')>0:
                plot[0]='No Plot found on Imdb'
        except IndexError: pass
        if len(plot)<1:
            # No summary found, fall back to the synopsis page.
            req = urllib2.Request('http://www.imdb.com/title/'+alt[0]+'/synopsis')
            req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
            plotter = urllib2.urlopen(req).read();clean=re.sub('\n','',plotter)
            plot=re.compile('<div id="swiki.2.1">(.+?)</div>').findall(clean)
            try:
                if plot[0].find('div')>0:
                    plot[0]='No Plot found on Imdb'
            except IndexError:
                plot=['No plot found on Imdb']
        # Note: raises IndexError if any of the patterns above found nothing.
        return genre[0],year[0],image[0],rating[0],plot[0]
    else :
        # The search response is already the title page, so parse it directly.
        genre=re.compile(r'<h5>Genre:</h5>\n<a href=".+?">(.+?)</a>').findall(response)
        year=re.compile(r'<a href="/Sections/Years/.+?/">(.+?)</a>').findall(response)
        image=re.compile(r'<img border="0" alt=".+?" title=".+?" src="(.+?)" /></a>').findall(response)
        rating=re.compile(r'<div class="meta">\n<b>(.+?)</b>').findall(response)
        bit=re.compile(r'<a class="tn15more inline" href="/title/(.+?)/plotsummary" onClick=".+?">.+?</a>').findall(response)
        try:
            req = urllib2.Request('http://www.imdb.com/title/'+bit[0]+'/plotsummary')
        except IndexError: pass  # no plot link found; req is still the search request
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
        response = urllib2.urlopen(req).read()
        plot=re.compile('<p class="plotpar">\n(.+?)\n<i>\n').findall(response)
        try:
            if plot[0].find('div')>0:
                plot[0]='No Plot found on Imdb'
        except IndexError: pass
        if len(plot)<1:
            try:
                req = urllib2.Request('http://www.imdb.com/title/'+bit[0]+'/synopsis')
            except IndexError: pass  # same fallback as above
            req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14')
            plotter = urllib2.urlopen(req).read();clean=re.sub('\n','',plotter)
            plot=re.compile('<div id="swiki.2.1">(.+?)</div>').findall(clean)
            try:
                if plot[0].find('div')>0:
                    plot[0]='No Plot found on Imdb'
            except IndexError:
                plot=['No Plot found on imdb']
        return genre[0],year[0],image[0],rating[0],plot[0]
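One thing to watch with the function above: it indexes genre[0], year[0] and so on directly, so a single pattern that stops matching raises IndexError for the whole lookup. A minimal sketch of a guard, assuming a helper named `first_match` (a hypothetical name, not something from the original post) and a made-up HTML fragment in the page layout those regexes expect:

```python
import re

def first_match(pattern, text, default=''):
    """Return the first captured group for pattern in text, or default."""
    found = re.findall(pattern, text)
    return found[0] if found else default

# Made-up fragment in the page layout the regexes in this thread expect.
sample = '<h5>Genre:</h5>\n<a href="/Sections/Genres/Drama/">Drama</a>'

genre = first_match(r'<h5>Genre:</h5>\n<a href=".+?">(.+?)</a>', sample)
year = first_match(r'<a href="/Sections/Years/.+?/">(.+?)</a>', sample, default='Unknown')
```

Here the genre pattern matches and the year pattern does not, so the caller gets the default for the year rather than an exception.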
Dan Dare
2009-05-30, 21:30
Exposing the scrapers to Python is probably the way to go. Replicating the logic in each Python script that needs the info just adds more places to break when the website structure changes...
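Until the scrapers are exposed, one way to limit that breakage inside a single script is to keep every page pattern in one table, so a layout change means editing one place. This is only a sketch: `PATTERNS` and `parse_title_page` are hypothetical names, and the patterns are copied from the function posted earlier in this thread, so they only match the page layout of that time.

```python
import re

# One table to update when the site layout changes.
PATTERNS = {
    'genre': r'<h5>Genre:</h5>\n<a href=".+?">(.+?)</a>',
    'year': r'<a href="/Sections/Years/.+?/">(.+?)</a>',
    'rating': r'<div class="meta">\n<b>(.+?)</b>',
}

def parse_title_page(html):
    """Return a dict of field -> first match, or None where nothing matched."""
    info = {}
    for field, pattern in PATTERNS.items():
        found = re.findall(pattern, html)
        info[field] = found[0] if found else None
    return info
```

A shared module like this still has to be copied between scripts, which is why exposing the built-in scrapers would be the better long-term fix.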