XBMC Community Forum  

Old 2008-07-26, 16:41   #1
jelockwood
Member
 
Join Date: Mar 2008
Posts: 82
Developing an Amazon Movie Scraper

I have recently started using XBMC (on a Mac) and found that while the IMDB scraper works well enough, there are many DVDs not on IMDB that are on Amazon.

[Note: While the examples below use a film title of "Soylent Green" I have manually searched IMDB using a browser to confirm other titles are definitely not listed.]

Surprisingly, there is no existing Amazon scraper. As part of an effort to make one myself, I started by looking at the existing scrapers to see how they work. Following on from this, I made some initial efforts to convert the current FilmAffinity scraper to use English results rather than Spanish ones (you can download a copy here if you are interested: http://homepage.mac.com/jelockwood/....affinityen.zip).

While I have not yet got an Amazon scraper even partially working, I have found some important information about the format of the various URLs that Amazon uses.

1. Amazon itself normally replaces spaces in title searches with a plus (+) symbol; however, a literal space or %20 also seems to work.

All of the following search URLs work when entered in a web browser:

Code:
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent+green&x=0&y=0
Code:
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent green&x=0&y=0
Code:
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent%20green&x=0&y=0
and indeed also the slightly shorter

Code:
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent%20green
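
As a rough illustration of point 1, the search URL can be assembled from a title as sketched below. This is only a sketch in Python, not scraper XML; the names `SEARCH_BASE` and `search_url` are made up for this example, and the URL format is simply the shorter form shown above.

```python
from urllib.parse import quote, quote_plus

# Base of the shorter search URL shown above (the x=0&y=0 part is dropped).
SEARCH_BASE = "http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords="


def search_url(title: str, plus_for_space: bool = True) -> str:
    """Build a DVD search URL for a title (hypothetical helper).

    plus_for_space=True encodes spaces as '+', otherwise as '%20';
    both forms were observed to work in a browser.
    """
    encode = quote_plus if plus_for_space else quote
    return SEARCH_BASE + encode(title)


print(search_url("soylent green"))
print(search_url("soylent green", plus_for_space=False))
```
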
2. The URL of a result is normally in a rather messy and complicated format like this

Code:
http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/ref=sr_1_1?ie=UTF8&s=dvd&qid=1217077050&sr=1-1
as you can see there would appear to be two different ID numbers plus a text field. However I have been able to determine that the following much simpler form of the URL also works.

Code:
http://www.amazon.com/dp/B0016I0AJG/
Therefore we just need to extract the ID number beginning with a B (they all seem to begin with a B).
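
A minimal sketch of that extraction, written in Python rather than scraper XML. The pattern assumes the ID is the path segment after /dp/ and consists of a B followed by uppercase letters and digits; that is a guess based on the single example above. `DP_ID` and `short_url` are hypothetical names.

```python
import re

# Assumption: the ID follows "/dp/" and looks like B + uppercase letters/digits.
DP_ID = re.compile(r"/dp/(B[A-Z0-9]+)")


def short_url(result_url):
    """Return the simplified /dp/ URL for a search-result URL, or None."""
    match = DP_ID.search(result_url)
    if match is None:
        return None
    return "http://www.amazon.com/dp/%s/" % match.group(1)


long_url = ("http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/"
            "ref=sr_1_1?ie=UTF8&s=dvd&qid=1217077050&sr=1-1")
print(short_url(long_url))
```
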

3. The thumbnail image normally has a URL of the form

Code:
http://ecx.images-amazon.com/images/I/51bU-puSlkL._SL500_AA240_.jpg
and the large image a URL of the form

Code:
http://ecx.images-amazon.com/images/I/51bU-puSlkL._SS500_.jpg
as you can see the ID number is totally different to anything previously used. However I have also found that the following URL produces the same large image and uses the main ID number from the original URL

Code:
http://ecx.images-amazon.com/images/P/B0016I0AJG.01.L.jpg
or the older alternative host name

Code:
http://images.amazon.com/images/P/B0016I0AJG.01.L.jpg
Note these forms of the URL must use a P rather than an I.
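
To illustrate point 3, the large image URL can be derived from the main ID alone. This is a sketch only: `cover_url` is a made-up helper, and the `.01.L.jpg` suffix is copied from the example above without knowing what other variants exist.

```python
def cover_url(product_id, host="ecx.images-amazon.com"):
    """Build the large cover image URL from the main ID (hypothetical helper).

    Note the /images/P/ path; the /images/I/ path uses a different ID.
    """
    return "http://%s/images/P/%s.01.L.jpg" % (host, product_id)


print(cover_url("B0016I0AJG"))
print(cover_url("B0016I0AJG", host="images.amazon.com"))
```
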

Based on all the above, would anyone care to assist by coming up with an initial Scraper by coding up the CreateSearchUrl and GetSearchResults sections? I will then try scraping the info fields.

PS. On a different topic: if a VIDEO_TS folder sits inside a folder named after the film, that folder name can be used for IMDB scraping. However, as mentioned, not all DVDs are listed on IMDB. It should be possible to use an NFO file to provide at least some metadata, but I am unsure of the correct naming and placement in this scenario.

e.g. /DVDs/Soylent Green/VIDEO_TS/

What should the NFO file be called and in which of the three possible folders (DVDs, Soylent Green, or VIDEO_TS) should it be placed?
Old 2008-07-26, 17:30   #2
blittan
Team-XBMC Handyman
 
Join Date: Jun 2004
Location: Sweden
Posts: 1,302

There is more information on the wiki about scrapers.
__________________
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.


Old 2008-07-28, 13:04   #3
jelockwood
Member
 
Join Date: Mar 2008
Posts: 82

Quote:
Originally Posted by blittan View Post
There is more information on the wiki about scrapers.
Been there and read it already.

It may have enough information for people already expert in writing regex but for the majority of us it still does not help. What is really needed is some examples in the Wiki.

What I would suggest would be that for the CreateSearchUrl it should include an example Url and describe how it is created.

Likewise, for the GetSearchResults it should show a URL returned as a result of using the CreateSearchUrl and describe how the example regex extracts the needed information.

I am not expecting it to cover every eventuality or source, but it does not have any examples at all. This is why I found it much more helpful to look at an existing scraper (the FilmAffinity one) and compare the xml code with what appears in a web-browser doing the same thing.

Don't get me wrong, I think the developers of XBMC have done a great job, but like most/all open-source projects (especially Linux people) they assume all the users have the same level of programming expertise as themselves. I am not a full-time programmer, but I have found bugs in some open-source projects and submitted successful fixes, and I still find the documentation less than desirable.

Hmm, just had a thought: part of the difficulty is that the only way to 'test' a scraper is to use it within XBMC, and you don't get any detailed feedback; it either works or it does not. However, I do have a utility to test regex against sample text, which might at least help me test the extracting portion.

Anyway, if anyone else is interested, the original Amazon information I posted may assist in a group effort.
Old 2008-07-28, 13:10   #4
jmarshall
Team-XBMC Developer
 
Join Date: Oct 2003
Posts: 15,077

Totally agreed about the documentation. The problem is that it's only ever developers or very experienced users that bother writing any.

I'll see if I can get C-Quel to make some comments in this thread as to what the functions do and hopefully it can be used as a base or example that others can see and get some good use out of.

Cheers,
Jonathan
Old 2008-07-28, 13:50   #5
spiff
Grumpy Bastard Developer
 
Join Date: Nov 2003
Posts: 7,715

afaik c-quel already has an amazon scraper that is almost finished.

i will answer all specific questions, but you need to grasp the principles from examples.

Code:
<CreateSearchUrl dest="3">
  <!-- $$1 holds the url encoded search string; the empty expression selects all of it as \1 -->
  <RegExp input="$$1" output="http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&amp;field-keywords=\1" dest="3">
    <expression noclean="1"/>
  </RegExp>
</CreateSearchUrl>
this is as basic as it gets - we get our url encoded search string passed in $$1, we select it all and append it at the appropriate spot to create the search url.
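
To make the before and after concrete, here is a rough Python emulation of that step. It is not how XBMC runs scrapers internally; it assumes, going only by the description above, that the empty expression captures the whole input as \1. The function name `create_search_url` and the sample input `soylent%20green` are illustrative.

```python
import re

# The output template from the RegExp element, with \1 as the placeholder.
OUTPUT = "http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=\\1"


def create_search_url(encoded_title):
    """Emulate the CreateSearchUrl step for an already url encoded title."""
    # Assumption: an empty <expression> behaves like "capture everything".
    match = re.match(r"(.*)", encoded_title, re.DOTALL)
    return OUTPUT.replace("\\1", match.group(1))


# Input ($$1):  soylent%20green
# Output:       the search URL with the title appended after field-keywords=
print(create_search_url("soylent%20green"))
```
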

Last edited by spiff; 2008-07-28 at 14:14.
Old 2008-07-28, 13:57   #6
jelockwood
Member
 
Join Date: Mar 2008
Posts: 82

Quote:
Originally Posted by jmarshall View Post
Totally agreed about the documentation. The problem is that it's only ever developers or very experienced users that bother writing any.

I'll see if I can get C-Quel to make some comments in this thread as to what the functions do and hopefully it can be used as a base or example that others can see and get some good use out of.

Cheers,
Jonathan
Just to make it clearer (for others) what I was referring to by examples (since there is a bit in the Wiki): for the example regexp given, it should show what the resulting URL generated by the scraper and regexp actually looks like.

Note to others the relevant Wiki page is http://www.xbmc.org/wiki/?title=Scraper.xml

On that page it shows an example GetSearchResults using JadedVideo but does not show what the full real URL was so you cannot compare before and after. Also it would help if they use the same example source for both the CreateSearchUrl and the GetSearchResults (they use different ones).
Old 2008-08-06, 03:12   #7
flipped cracker
Senior Member
 
Join Date: May 2008
Location: anaheim, ca
Posts: 197

Quote:
Originally Posted by jelockwood View Post
Hmm, just had a thought: part of the difficulty is that the only way to 'test' a scraper is to use it within XBMC, and you don't get any detailed feedback; it either works or it does not. However, I do have a utility to test regex against sample text, which might at least help me test the extracting portion.

what utility is it that you're using to "test" the scraper?
Old 2008-08-06, 12:39   #8
spiff
Grumpy Bastard Developer
 
Join Date: Nov 2003
Posts: 7,715

there is no such utility available. there was one (/tools/Scrap) but "somebody" forgot to commit parts of the code and consequently it was lost in a hdd crash or something along those lines. at some point hopefully somebody will redo the utility as it is really very handy :/
however i feel it's a waste of my time personally; i have only so much coding time to put into the project and i would much rather use it improving xbmc.

you can use any regular expression evaluator to test the extraction part, i have been using 'the regexp coach' myself (when on win32) or kregexpeditor in linux.
Old 2008-08-06, 15:42   #9
DonJ
Team-XBMC Scraper Specialist
 
Join Date: May 2005
Posts: 328

That "somebody" was me, sorry about this.
Old 2008-08-07, 04:30   #10
jelockwood
Member
 
Join Date: Mar 2008
Posts: 82

Quote:
Originally Posted by spiff View Post
there is no such utility available. there was one (/tools/Scrap) but "somebody" forgot to commit parts of the code and consequently it was lost in a hdd crash or something along those lines. at some point hopefully somebody will redo the utility as it is really very handy :/
however i feel it's a waste of my time personally; i have only so much coding time to put into the project and i would much rather use it improving xbmc.

you can use any regular expression evaluator to test the extraction part, i have been using 'the regexp coach' myself (when on win32) or kregexpeditor in linux.
Unfortunately for me, I am using Mac OS X, and it seems the range of regexp tools is not as good. I have tried RegExhibit, pretty much the only one for Mac, but it does not seem to follow exactly the same rules as the scraper XML files appear to use.

For example, here is a genuine search result from Amazon, which I believe is what the scraper needs to match:

Code:
<div class="productTitle"><a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/ref=sr_1_1?ie=UTF8&s=dvd&qid=1217775403&sr=1-1"> Soylent Green </a><span class="binding"> ~ John Barclay, Whit Bissell, Jan Bradley, and Chuck Connors</span><span class="binding"> (<span class="format">DVD</span> - 2008)</span></div>
With RegExhibit, this seems to need a regex like this

Code:
<div class="productTitle"><a href="http://www.amazon.com/.*/dp/(B[a-zA-Z0-9]*)/ref=sr_1_[0-9]*\?ie=UTF8&s=dvd&qid=[0-9]*&sr=1-[0-9]*"> .* </a><span class="binding">?.~ .*</span><span class="binding"> ?\(<span class="format">DVD</span> - .*\)</span></div>
to get the important ID number (in some places [a-zA-Z0-9] does not seem to work), and like this

Code:
<div class="productTitle"><a href="http://www.amazon.com/.*/dp/B[a-zA-Z0-9]*/ref=sr_1_[0-9]*\?ie=UTF8&s=dvd&qid=[0-9]*&sr=1-[0-9]*"> (.*) </a><span class="binding">?.~ .*</span><span class="binding"> ?\(<span class="format">DVD</span> - .*\)</span></div>
to get the movie title.

RegExhibit can be obtained here http://homepage.mac.com/roger_jolly/software/index.html
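
For anyone without a regex tool to hand, the two patterns above can also be checked with a few lines of Python. Bear in mind this is only a rough check: Python's `re` module is not the regex engine XBMC uses, so a pattern that matches here may still behave differently inside a scraper. The helper name `first_group` is made up for this sketch.

```python
import re

# The genuine search result quoted above, split over lines for readability.
SAMPLE = (
    '<div class="productTitle"><a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/ref=sr_1_1?ie=UTF8&s=dvd&qid=1217775403&sr=1-1">'
    ' Soylent Green </a><span class="binding"> ~ John Barclay, Whit Bissell, Jan Bradley, and Chuck Connors</span>'
    '<span class="binding"> (<span class="format">DVD</span> - 2008)</span></div>'
)

# First pattern from the post: the capture group is the ID.
ID_PATTERN = (
    r'<div class="productTitle"><a href="http://www.amazon.com/.*/dp/(B[a-zA-Z0-9]*)/ref=sr_1_[0-9]*\?ie=UTF8&s=dvd&qid=[0-9]*&sr=1-[0-9]*">'
    r' .* </a><span class="binding">?.~ .*</span>'
    r'<span class="binding"> ?\(<span class="format">DVD</span> - .*\)</span></div>'
)

# Second pattern from the post: the capture group is the title.
TITLE_PATTERN = (
    r'<div class="productTitle"><a href="http://www.amazon.com/.*/dp/B[a-zA-Z0-9]*/ref=sr_1_[0-9]*\?ie=UTF8&s=dvd&qid=[0-9]*&sr=1-[0-9]*">'
    r' (.*) </a><span class="binding">?.~ .*</span>'
    r'<span class="binding"> ?\(<span class="format">DVD</span> - .*\)</span></div>'
)


def first_group(pattern, text):
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


print(first_group(ID_PATTERN, SAMPLE))
print(first_group(TITLE_PATTERN, SAMPLE))
```
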
All times are GMT +2.

