Developing an Amazon Movie Scraper


jelockwood
2008-07-26, 16:41
I have recently started using XBMC (on a Mac) and found that while the IMDB scraper works well enough, there are many DVDs not on IMDB that are on Amazon.

[Note: While the examples below use a film title of "Soylent Green" I have manually searched IMDB using a browser to confirm other titles are definitely not listed.]

Surprisingly there is no existing Amazon scraper. As part of an effort to make one myself I started off by looking at the existing scrapers to see how they worked, and following on from this I made some initial efforts to convert the current FilmAffinity scraper to use English results rather than Spanish results (you can download a copy here if you are interested http://homepage.mac.com/jelockwood/.Public/filmaffinityen.zip).

While I have not yet got an Amazon scraper even partially working, I have found some important information about the format of the various URLs that Amazon uses.

1. Amazon itself normally replaces spaces in Title searches with a plus (+) symbol, however it does seem to also work with a space (or %20).

Search URLs like the following all work when entered in a web browser

http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent+green&x=0&y=0
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent green&x=0&y=0
http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent%20green&x=0&y=0

and indeed also the slightly shorter

http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords=soylent%20green
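As a rough illustration of point 1, here is a minimal Python sketch that builds the shorter search URL from a title. The function name build_search_url is made up for this example; the base URL is the one quoted above, and quote_plus is used only because it produces the "+" form that Amazon itself was observed to use.

```python
from urllib.parse import quote_plus

# Base URL copied from the examples above; everything after
# "field-keywords=" is the encoded title.
SEARCH_BASE = ("http://www.amazon.com/s/ref=nb_ss_d"
               "?url=search-alias=dvd&field-keywords=")

def build_search_url(title):
    """Return a DVD search URL, encoding spaces as '+' (hypothetical helper)."""
    return SEARCH_BASE + quote_plus(title)

print(build_search_url("soylent green"))
```

Swapping quote_plus for urllib.parse.quote would give the %20 form instead, which the post reports as working equally well.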

2. The URL of a result is normally a rather messy and complicated format like this

http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/ref=sr_1_1?ie=UTF8&s=dvd&qid=1217077050&sr=1-1

as you can see there would appear to be two different ID numbers plus a text field. However I have been able to determine that the following much simpler form of the URL also works.

http://www.amazon.com/dp/B0016I0AJG/

Therefore we just need to extract the ID number beginning with a B (they all seem to begin with a B).
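A small Python sketch of that extraction step, for anyone wanting to try it outside XBMC. The helper name short_product_url is invented here, and the pattern assumes the ID is the ten-character upper-case alphanumeric segment after /dp/ (true of the IDs seen in this thread, but an assumption nonetheless). It deliberately does not require a leading B.

```python
import re

# Assumption: the product ID is the 10-character segment that follows "/dp/".
ID_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?:[/?]|$)")

def short_product_url(result_url):
    """Reduce a long search-result URL to the short /dp/<ID>/ form."""
    match = ID_PATTERN.search(result_url)
    if match is None:
        return None
    return "http://www.amazon.com/dp/" + match.group(1) + "/"

long_url = ("http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/"
            "ref=sr_1_1?ie=UTF8&s=dvd&qid=1217077050&sr=1-1")
print(short_product_url(long_url))
```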

3. The thumbnail image normally has a URL of the form

http://ecx.images-amazon.com/images/I/51bU-puSlkL._SL500_AA240_.jpg

and the large image a URL of the form

http://ecx.images-amazon.com/images/I/51bU-puSlkL._SS500_.jpg

as you can see the ID number is totally different to anything previously used. However I have also found that the following URL produces the same large image and uses the main ID number from the original URL

http://ecx.images-amazon.com/images/P/B0016I0AJG.01.L.jpg

or the older alternative host name

http://images.amazon.com/images/P/B0016I0AJG.01.L.jpg

Note these forms of the URL must use a P rather than an I.
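The image URL pattern in point 3 can be expressed as a one-line template. This is only a sketch of the observation above: cover_image_url is a hypothetical name, and the ".01.L.jpg" suffix is copied from the two example URLs rather than from any documented Amazon scheme.

```python
def cover_image_url(product_id, host="ecx.images-amazon.com"):
    """Build the large-image URL from a product ID (note the P, not I, path)."""
    return "http://{0}/images/P/{1}.01.L.jpg".format(host, product_id)

print(cover_image_url("B0016I0AJG"))
print(cover_image_url("B0016I0AJG", host="images.amazon.com"))
```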

Based on all the above, would anyone care to assist by coming up with an initial Scraper by coding up the CreateSearchUrl and GetSearchResults sections? I will then try scraping the info fields.

PS. On a different topic: if one has a VIDEO_TS folder inside a folder named after the film, one can use that folder name for IMDB scraping. However, as mentioned, not all DVDs are listed on IMDB. I can see it should be possible to use an NFO file to provide at least some metadata, but I am unsure of the correct naming and placement in this scenario.

e.g. /DVDs/Soylent Green/VIDEO_TS/

What should the NFO file be called and in which of the three possible folders (DVDs, Soylent Green, or VIDEO_TS) should it be placed?

blittan
2008-07-26, 17:30
There is more information on the wiki about scrapers.

jelockwood
2008-07-28, 13:04
Been there and read it already.

It may have enough information for people already expert in writing regex but for the majority of us it still does not help. What is really needed is some examples in the Wiki.

What I would suggest would be that for the CreateSearchUrl it should include an example Url and describe how it is created.

Likewise, for the GetSearchResults it should show a URL returned as a result of using the CreateSearchUrl and describe how the example regex extracts the needed information.

I am not expecting it to cover every eventuality or source, but it does not have any examples at all. This is why I found it much more helpful to look at an existing scraper (the FilmAffinity one) and compare the xml code with what appears in a web-browser doing the same thing.

Don't get me wrong I think the developers of XBMC have done a great job but like most/all open-source projects (especially Linux people) they assume all the users have the same level of programming expertise as themselves. I am not a full-time programmer but I have found bugs in some open-source projects and submitted successful fixes and I still find the documentation less than desirable.

Hmm, just had a thought, part of the difficulty is that the only way to 'test' a scraper is to use it within XBMC, and you don't get any detailed feedback, it either works or it does not. However I do have a utility to test regex against sample text, this might at least help me test the extracting portion.

Anyway, if any one else is interested, the original Amazon information I posted may assist in a group effort.

jmarshall
2008-07-28, 13:10
Totally agreed about the documentation. The problem is that it's only ever developers or very experienced users that bother writing any.

I'll see if I can get C-Quel to make some comments in this thread as to what the functions do and hopefully it can be used as a base or example that others can see and get some good use out of.

Cheers,
Jonathan

spiff
2008-07-28, 13:50
afaik c-quel already has an amazon scraper that is almost finished.

i answer all specific questions. but you need to grasp the principles from examples.


<CreateSearchUrl dest="3">
<RegExp input="$$1" output="http://www.amazon.com/s/ref=nb_ss_d?url=search-alias=dvd&amp;field-keywords=\1" dest="3">
<expression noclean="1"/>
</RegExp>
</CreateSearchUrl>


this is as basic as it gets - we get our url encoded search string passed in $$1, we select it all and append it at the appropriate spot to create the search url.

jelockwood
2008-07-28, 13:57
Just to make it clearer (for others) what I was referring to by examples (since there is a bit in the Wiki). For the example RegExp given, it should show what the resulting URL actually looks like that has been generated from the Scraper and RegExp.

Note to others the relevant Wiki page is http://www.xbmc.org/wiki/?title=Scraper.xml

On that page it shows an example GetSearchResults using JadedVideo but does not show what the full real URL was so you cannot compare before and after. Also it would help if they use the same example source for both the CreateSearchUrl and the GetSearchResults (they use different ones).

flipped cracker
2008-08-06, 03:12
However I do have a utility to test regex against sample text, this might at least help me test the extracting portion.

what utility is it that you're using to "test" the scraper?

spiff
2008-08-06, 12:39
there is no such utility available. there was one (/tools/Scrap) but "somebody" forgot to commit parts of the code and consequently it was lost in a hdd crash or something along that. at one point hopefully somebody will redo the utility as it is really very handy :/
however i feel it's a waste of my time personally, i have just this much coding time to put into the project and i much rather use it improving xbmc.

you can use any regular expression evaluator to test the extraction part, i have been using 'the regexp coach' myself (when on win32) or kregexpeditor in linux.

DonJ
2008-08-06, 15:42
That "somebody" was me, sorry about this.

jelockwood
2008-08-07, 04:30
you can use any regular expression evaluator to test the extraction part, i have been using 'the regexp coach' myself (when on win32) or kregexpeditor in linux.

Unfortunately for me I am using Mac OS X and it seems the range of Regexp tools is not as good. I have tried using RegExhibit - pretty much the only one for Mac, but it does not seem to use exactly the same rules (as it appears Scrapers.xml use).

For example, the following genuine search result from Amazon which I believe to be what the scraper needs to search for -

<div class="productTitle"><a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B0016I0AJG/ref=sr_1_1?ie=UTF8&s=dvd&qid=1217775403&sr=1-1"> Soylent Green </a><span class="binding"> ~ John Barclay, Whit Bissell, Jan Bradley, and Chuck Connors</span><span class="binding"> (<span class="format">DVD</span> - 2008)</span></div>

Seems with RegExhibit to need regex like this

<div class="productTitle"><a href="http://www.amazon.com/.*/dp/(B[a-zA-Z0-9]*)/ref=sr_1_[0-9]*\?ie=UTF8&s=dvd&qid=[0-9]*&sr=1-[0-9]*"> .* </a><span class="binding">?.~ .*</span><span class="binding"> ?\(<span class="format">DVD</span> - .*\)</span></div>

to get the important ID number (in some places [a-zA-Z0-9] does not seem to work), and like this

<div class="productTitle"><a href="http://www.amazon.com/.*/dp/B[a-zA-Z0-9]*/ref=sr_1_[0-9]*\?ie=UTF8&s=dvd&qid=[0-9]*&sr=1-[0-9]*"> (.*) </a><span class="binding">?.~ .*</span><span class="binding"> ?\(<span class="format">DVD</span> - .*\)</span></div>

to get the movie title.

RegExhibit can be obtained here http://homepage.mac.com/roger_jolly/software/index.html
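For anyone without a regex tester to hand, a few lines of Python can serve the same purpose. The sketch below is not the regex from the post; it is a simplified alternative that captures the ID and the title in one pass from a shortened copy of the sample result, using negated character classes instead of .* so that it does not depend on the ref/qid/sr parameters. Python's re module is not guaranteed to behave identically to the engine inside XBMC.

```python
import re

# Shortened copy of the sample search result quoted above.
sample = ('<div class="productTitle"><a href="http://www.amazon.com/'
          'Soylent-Green-John-Barclay/dp/B0016I0AJG/ref=sr_1_1?ie=UTF8&s=dvd'
          '&qid=1217775403&sr=1-1"> Soylent Green </a>'
          '<span class="binding"> ~ John Barclay</span></div>')

# Group 1 = product ID, group 2 = title with surrounding spaces trimmed.
RESULT = re.compile(
    r'<div class="productTitle"><a href="http://www\.amazon\.com/'
    r'[^"/]*/dp/([A-Z0-9]+)/[^"]*">\s*([^<]*?)\s*</a>')

match = RESULT.search(sample)
print(match.group(1), "|", match.group(2))
```

Remember that a pattern tested this way still has to be XML-escaped before it goes into the scraper file.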

ShortySco
2008-08-07, 05:24
Unfortunately for me I am using Mac OS X and it seems the range of Regexp tools is not as good.

Might be of help......

I have used some tools in the past for checking my regular expressions; many of the ones I found were internet/Flash based (and, I assume, platform independent). Sorry, I can't remember the names/places, but Google will.

Shorty

spiff
2008-08-07, 11:09
remember; the scraper is a xml file so any special chars needs to be xml'ized for the regexp to function properly in the scraper (or rather, for the scraper xml to load correctly in the first place). this is why there are all these &lt; &amp; etc stuffs in the other scrapers (which i assume you use for reference)
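To illustrate the escaping step described above, here is a short Python sketch that converts a regex as tested in a standalone tool into the form that can be pasted into an XML file. The helper name xmlize is made up; escape comes from Python's standard library and handles &, < and >, with the double quote added explicitly.

```python
from xml.sax.saxutils import escape

def xmlize(regex):
    """Escape &, <, > and double quotes so the regex is safe inside XML."""
    return escape(regex, {'"': "&quot;"})

raw = r'<span class="srTitle">([^<]*)</span>'
print(xmlize(raw))
```

Note that this only deals with XML escaping. Characters that are special to the regex engine itself (such as ? or parentheses) still need a backslash in the usual way.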

jelockwood
2008-08-11, 14:25
Yep, I spotted that on the Wiki and have taken that in to consideration when trying RegExhibit. It was the fact that in some places [a-zA-Z0-9] works and others I have had to use .* that makes it difficult to know if my regex will be right. Also escaping characters like ~ (tilde), a question mark ?, a space character, and ( ) <- not actual regex but real parenthesis is confusing. These are not listed in the Wiki.

While I am at it, here are some questions I feel the Wiki does not adequately answer.

1. I am using folder names as the search criteria. The folder names include the year of the film, e.g. "Soylent Green (1973)". While IMDB works very well with all of that as a search string, Amazon does not like the year being included (with or without parentheses) if you type that using a web-browser, suggesting it would equally not like it being sent by a scraper.

Does your typical scraper when using folder names like this, include the year when creating a search URL, or does it strip it off, if so how?

2. When a scraper looks at the search results, XBMC displays a list of titles found, but the scraper has to use the ID number to generate the URL to access the selected result. I am not clear from the Wiki where these two different steps are done, and how the results are linked. As you saw from my last post, I have found the relevant html code returned by Amazon and somewhat got regex code that can extract either the ID or the title.
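On question 1, the kind of stripping being asked about can be shown with a single regular expression. The Python sketch below is only an illustration of the idea (split_title_year is a hypothetical helper, and it assumes the year is a four-digit number in parentheses at the end of the folder name); it says nothing about what XBMC itself does before calling the scraper.

```python
import re

# Assumption: the year, if present, is "(NNNN)" at the very end of the name.
YEAR_SUFFIX = re.compile(r"^(.*?)\s*\((\d{4})\)\s*$")

def split_title_year(folder_name):
    """Return (title, year) with year set to None when no suffix is found."""
    match = YEAR_SUFFIX.match(folder_name)
    if match:
        return match.group(1), match.group(2)
    return folder_name.strip(), None

print(split_title_year("Soylent Green (1973)"))
print(split_title_year("2001 A Space Odyssey"))
```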

(Oops, I thought I had already posted this reply, but came back later and discovered this message still open for editing in a tab in my web-browser.)

spiff
2008-08-11, 15:22
1) i'm not completely sure but afaik its done by the scraper. if not i will remedy that
2) no you dont have to use the id. you are free to list whatever url you want in the returned results from GetSearchResults. flow is simply;

we call CreateSearchUrl with buffer 1 set to the cleaned title (url encoded) and iirc buffer 2 set to the year - if this is not so i will fix that. xbmc grabs the url, then calls GetSearchResults with buffer 1 set to the returned html. this function returns an xml list with the obtained results. remember, that's ALL the scraper does - it is translating some general html into a fixed xml format

jelockwood
2008-08-17, 19:55
The only way to test a scraper is for it to work almost completely. A bit of a catch22.

Despite the very poor documentation, it seems clear enough there are three main sections to a scraper.

1. Create the Search URL
2. Process the results to list the returned movies and let a user choose one
3. For the chosen movie get the meta data fields

I am reasonably clear on CreateSearchUrl. However the documentation for GetSearchResults sucks big time, especially as I now believe it has at least one major error in the documentation. So having sat down and looked at two existing working scrapers (IMDB and FilmAffinity) and what documentation there is in the Wiki I am going to break down the steps as I so far understand them and ask for confirmation and clarification of my understanding.

I will use the FilmAffinity GetSearchResults code as the example here. So firstly here is the exact code in that scraper.

<GetSearchResults dest="8">
<RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
<RegExp input="$$1" output="\1" dest="7">
<expression>&lt;img src="http://www.filmaffinity.com/imgs/movies/full/[0-9]*/([0-9]*).jpg"&gt;</expression>
</RegExp>
<RegExp dest="5" input="$$1" output="&lt;entity&gt;&lt;title&gt;\1 (\2)&lt;/title&gt;&lt;url&gt;http://www.filmaffinity.com/en/film$$7.html&lt;/url&gt;&lt;id&gt;$$7&lt;/id&gt;&lt;/entity&gt;">
<expression noclean="1">&lt;title&gt;([^&lt;]*)\(([0-9]*)\) - FilmAffinity</expression>
</RegExp>
<RegExp input="$$1" output="\1" dest="4">
<expression noclean="1">(&lt;b&gt;&lt;a href="/en/film.*)</expression>
</RegExp>
<RegExp dest="5+" input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 (\3)&lt;/title&gt;&lt;url&gt;http://www.filmaffinity.com/en/film\1.html&lt;/url&gt;&lt;id&gt;\1&lt;/id&gt;&lt;/entity&gt;">
<expression repeat="yes" noclean="1,2">&lt;a href="/en/film([0-9]*).html[^&gt;]*&gt;([^&lt;]*)&lt;/a&gt;[^\(]*\(([0-9]*)</expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetSearchResults>

It seems fairly clear that the first RegExp line outputs the following

<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<results>
\1
</results>

where \1 is the content of variable 1

It is not clear what the second RegExp is doing other than it seems to be related to the unique ID number of the film on the website, here is what the second one seems to translate to.

<img src="http://www.filmaffinity.com/imgs/movies/full/[0-9]*/([0-9]*).jpg">

Interestingly, while this URL is valid and loads the film thumbnail, it does not seem to exist in the results returned by FilmAffinity.

The third RegExp is returning

<entity>
<title>\1 (\2)</title>
<url>http://www.filmaffinity.com/en/film$$7.html</url>
<id>$$7</id>
</entity>

Here we see a case that seems to clearly point to an error in the Wiki. The Wiki suggests

<entity>
<title>?</title>
<url>?</url>
<url>?</url>
</entity>

and makes no mention of <id>?</id> at all. Note: both the IMDB and FilmAffinity scrapers do use the ID tags.

I would guess that <title>?</title> is the title of the film as returned by the website, <url>?</url> is the URL to access details for that selected film, and <id>?</id> is a unique ID number of that film as used by the website.

I have no idea what purpose the fourth RegExp has, it appears in this case to return

(<b><a href="/en/film.*)

The fifth RegExp seems almost exactly the same as the third one.

<entity>
<title>\2 (\3)</title>
<url>http://www.filmaffinity.com/en/film\1.html</url>
<id>\1</id>
</entity>

So while it seems clear that the 3rd and 5th RegExp fill in the returned XML, it is still not clear to me what bit does the actual searching of the results to find the list of films returned by the website.

I would particularly like an explanation of what the second and fourth RegExp bits do.

spiff
2008-08-17, 23:18
while i never touched the filmaffinity scraper, all of this might not be 100% correct. however, since you want to use it as a reference...

\1 is NOT the contents of variable one, it is the first selection in the regexp - big diff. $$1 is the contents of variable 1. remember that we evaluate expressions in a LIFO order, meaning we evaluate the innermost expressions first.

"the second expression" (which will be the first one that gets evaluated) grabs the film id and sticks it in buffer 7. however this one will only match IF we were redirected to a perfect match, i.e. we are on a film info page. the third expression will fill buffer 5 with this match IF we have one (the $$7 is used to indicate the contents of buffer 7, just like you do input="$$1" - the contents of buffer 1).

as i have always said the wiki is NOT an authoritative documentation source. the <id> tag is an optional tag that entities may or may not fill. it is really handy on sites that use id's to form urls etc.

the fourth expression is irrelevant, it does not help anything.

the fifth expression is the real meat if we are on a search page.

when that is done we return to the first expression to evaluate that. it takes the contents of buffer 5, selects it all and sticks a <results> tag around our matches. empty <expression> tags mean select anything in input
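The evaluation order described above can be mimicked with a toy model in Python. This is emphatically not XBMC's implementation: run_regexp and expand are invented names, the buffer handling is a guess based on the explanation in this post, and the two HTML lines use made-up film IDs. It only shows the inner "fifth" expression appending one entity per match to buffer 5, then the outer expression (empty, so it selects everything) wrapping buffer 5 in a results tag.

```python
import re

def expand(template, match, buffers):
    # \N is filled from the current match, $$N from the numbered buffers.
    def repl(m):
        if m.group(1) is not None:
            return match.group(int(m.group(1))) or ""
        return buffers.get(int(m.group(2)), "")
    return re.sub(r"\\(\d)|\$\$(\d+)", repl, template)

def run_regexp(buffers, source, expression, output, dest,
               append=False, repeat=False):
    # An empty expression is modelled as "select everything in the input".
    pattern = expression or "(.*)"
    matches = list(re.finditer(pattern, buffers.get(source, ""), re.S))
    if not repeat:
        matches = matches[:1]
    text = "".join(expand(output, m, buffers) for m in matches)
    buffers[dest] = (buffers.get(dest, "") if append else "") + text

# Hypothetical search page; the film IDs are invented for the example.
html = ('<b><a href="/en/film111111.html">Soylent Green</a></b> (1973)\n'
        '<b><a href="/en/film222222.html">Green Card</a></b> (1990)\n')

FIFTH = r'<a href="/en/film([0-9]*).html[^>]*>([^<]*)</a>[^\(]*\(([0-9]*)'
ENTITY = (r"<entity><title>\2 (\3)</title>"
          r"<url>http://www.filmaffinity.com/en/film\1.html</url>"
          r"<id>\1</id></entity>")

buffers = {1: html}
# Innermost first: collect every search hit into buffer 5 (dest="5+").
run_regexp(buffers, 1, FIFTH, ENTITY, dest=5, append=True, repeat=True)
# Outermost last: wrap the whole of buffer 5 and store it in buffer 8.
run_regexp(buffers, 5, "", r"<results>\1</results>", dest=8)
print(buffers[8])
```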

jelockwood
2008-08-18, 01:57
Thank you this reply has I feel helped a lot. I can now see how part of the fifth expression finds the results and then outputs the title and url etc. In the case of Amazon this needs a much longer and more complicated expression, as its results include each result in several slightly different forms and we only want to list each once. Here is just one raw result section which is in my opinion the most useful form.

<a href="http://www.amazon.com/Soylent-Green-John-Barclay/dp/B000VAHR0U/ref=sr_1_3?ie=UTF8&s=dvd&qid=1219010497&sr=1-3"><span class="srTitle">Soylent Green</span></a>

~ John Barclay, Whit Bissell, Jan Bradley, and Chuck Connors <span class="bindingBlock">(<span class="binding">DVD</span> - 2007)</span></td></tr>

B000VAHR0U is in this example the ID number and is used to form urls to access both artwork and individual DVD info, and in this case the title is Soylent Green. Amazon on the results page often do not list the real year of the film, instead listing the year that edition of the DVD was issued, making the year useless for searching purposes. Would simply ignoring the year in the fifth expression be ok and not including it in the built title? That is just returning "Soylent Green" rather than the useless "Soylent Green (2007)".

My first guess at an expression to search with for section 5 would be

<expression repeat="yes" noclean="1,2">www.amazon.com/[a-zA-Z0-9]*/dp/([B0-9]*)/ref=sr_[0-9]_[0-9]\?ie=UTF8&amp;s=dvd&amp;qid=[0-9]*&amp;sr=[0-9]-[0-9]&quot;&gt;&lt;span class=&quot;srTitle&quot;&gt;([a-zA-Z0-9]*)&lt;/span&gt;&lt;/a&gt;</expression>

Does this look correct to you? In particular the escaping of various characters? What is the correct way to escape a ? or a space (or is this not necessary)?

Note: It is possible for film titles to begin with a number or a letter, for example "2001 A Space Odyssey".

Gaarv
2008-08-18, 11:41
If executed, the regexp you gave won't work for the title, because you missed the space.

Maybe you did already, but I highly suggest you read the URL pointed to in the wiki: http://www.regular-expressions.info/repeat.html

It's also specified that laziness won't work, but it does to an extent, i.e.:

[0-9A-Za-z .!#".. whatever]*
can simply be replaced by
[^<>]* meaning any repetition of characters except "<" and ">". Pretty handy as long as you have a matching pattern to end the repetition.

Below, an application with the string you provided:

www.amazon.com/[^&lt;&gt;/]*/dp/([0-9A-Z]*) etc

I'm not sure why you would want to obtain title and year in one regex; if you make several, there's less chance of mistakes.
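The difference between these approaches is easy to see on a small sample. The Python sketch below uses a made-up two-result snippet (the second title is invented for the example) and compares the letters-and-digits class, a greedy .*, the negated class suggested above, and a lazy .*? quantifier. Python's re is used here purely as a test bench; behaviour inside XBMC depends on the regex engine it is built with.

```python
import re

html = ('<span class="srTitle">Soylent Green</span></a> '
        '<span class="srTitle">Silent Running</span></a>')

# Letters and digits only: fails because the titles contain a space.
strict = re.search(r'srTitle">([a-zA-Z0-9]*)</span>', html)

# Greedy .* runs on to the LAST </span> and swallows both results.
greedy = re.search(r'srTitle">(.*)</span>', html).group(1)

# Negated class and lazy quantifier both stop at the first closing tag.
negated = re.findall(r'srTitle">([^<>]*)</span>', html)
lazy = re.findall(r'srTitle">(.*?)</span>', html)

print(strict)
print(greedy)
print(negated)
print(lazy)
```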

spiff
2008-08-18, 13:01
laziness now works, we have switched parser to pcre so you can be as lazy as you want :)

C-Quel
2008-08-19, 23:47
try this....

http://pastebin.com/m657636a8

old but with cleanup + minor changes should work

jelockwood
2008-08-21, 01:30
Many, many, many thanks for pointing me towards this. I felt I was getting closer to getting the GetSearchResults working but had not yet succeeded. Yours does work.

For your information I found the following results from your unmodified Amazon scraper.

1. It successfully did a CreateSearchUrl
2. It only listed one result using its GetSearchResults
3. When that result was used it only succeeded in filling in "Studio", "Runtime" and obtaining the movie thumbnail.

I have so far 'improved' it by

1. Listing the entire page of results returned by Amazon (a maximum of twelve results). This was done by adding a repeat command to your GetSearchResults.
2. Filling in the movie "Title", "Year" (the proper film year not the DVD year), and the "Plot".
3. I have very slightly changed the filename you used for the movie thumbnail to one that I believe will still return a result in the very few cases where yours might not. My modified filename always returns the largest available artwork (usually 500 pixels), whereas yours only gets 500 pixel tall artwork, and I believe a very few DVDs may not have artwork available that big.

While I have added/changed the code to also do "Directors" and "Actors", this is not working. My currently not working approach was to have a first regex to get the block listing all the actor(s) or director(s) and then a second regex which is supposed to extract the individual names from that block.

My efforts so far can be obtained from the following link

http://homepage.mac.com/jelockwood/.Public/amazonustest.zip

As far as I can see there is no available information on the Amazon product page to do MPAA rating (Amazon use a GIF and no text), nor a tagline or summary, genre, or writer. There might be a way of getting a rating (that is reader score) by using the following text align="absbottom" alt="4.5 out of 5 stars" height="12". Note: There are several entries of this text in a product page and we would always want to look only at the first.
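The rating idea in the paragraph above might look like this when tried in Python. It is a sketch only: first_rating is a hypothetical helper, the pattern is built from the single alt text quoted above, and re.search is used because it returns the first match, which is the one the post says we want.

```python
import re

RATING = re.compile(r'alt="([0-9.]+) out of 5 stars"')

def first_rating(html):
    """Return the first 'N out of 5 stars' value on the page as a float."""
    match = RATING.search(html)
    return float(match.group(1)) if match else None

# Two occurrences, as on a real product page; only the first should be used.
page = ('<img align="absbottom" alt="4.5 out of 5 stars" height="12">'
        '<img align="absbottom" alt="3.0 out of 5 stars" height="12">')
print(first_rating(page))
```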

If anyone else would like to help out it would be much appreciated. In particular getting the actor(s) and director(s) working is a priority.

For everyone's benefit this is what the block containing all the actors looks like

<li> <b>Actors:</b> <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Charlton%20Heston">Charlton Heston</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Edward%20G.%20Robinson">Edward G. Robinson</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Dick%20Van%20Patten">Dick Van Patten</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Chuck%20Connors">Chuck Connors</a>, <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Joseph%20Cotten">Joseph Cotten</a></li>

And the Directors block is virtually identical

<li> <b>Directors:</b> <a href="/s?ie=UTF8&search-alias=dvd&field-keywords=Richard%20Fleischer">Richard Fleischer</a></li>
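The two-step approach described above (first isolate the block, then pull the names out of it) can be checked outside XBMC with something like the following Python sketch. The function name names_in_block is invented, and the sample is a shortened version of the blocks quoted above. This is not the scraper's actual regex, just the same idea in a form that is quick to test.

```python
import re

# Step 2 pattern: the link text of every anchor inside the block.
NAME = re.compile(r'<a href="[^"]*">([^<]*)</a>')

def names_in_block(html, label):
    """Step 1: grab the <li> block for the label. Step 2: list its names."""
    block = re.search(r"<b>" + re.escape(label) + r":</b>(.*?)</li>",
                      html, re.S)
    if block is None:
        return []
    return NAME.findall(block.group(1))

html = ('<li> <b>Actors:</b> <a href="/s?ie=UTF8&search-alias=dvd'
        '&field-keywords=Charlton%20Heston">Charlton Heston</a>, '
        '<a href="/s?ie=UTF8&search-alias=dvd'
        '&field-keywords=Edward%20G.%20Robinson">Edward G. Robinson</a></li>\n'
        '<li> <b>Directors:</b> <a href="/s?ie=UTF8&search-alias=dvd'
        '&field-keywords=Richard%20Fleischer">Richard Fleischer</a></li>')

print(names_in_block(html, "Actors"))
print(names_in_block(html, "Directors"))
```

The lazy (.*?) in step 1 is what keeps the Actors block from running on into the Directors block.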

jelockwood
2008-08-23, 16:19
I made good progress today and have now managed to also get "Directors", "Actors", "Rating", and "Votes" all working!

I have even partially managed to get MPAA (aka. Certification) working. I say partially because the only text to search and match for in the Amazon HTML code is in lowercase and I would prefer to return and display it in block capitals as is more usual. Does anyone have any suggestions on how to use regex to convert to uppercase? Currently it returns "pg" rather than the desired "PG".

With this done the scraper will be pretty much finished in that I will have done as many fields as Amazon provide. I will then do more extensive testing (with more titles), and then make a second version for Amazon UK.

jelockwood
2008-08-23, 22:19
(There does not seem to be a way to edit ones own posts.)

I have now come up with a way of returning MPAA ratings in uppercase, it is rather brute force as I had to write a copy of the regex for each MPAA rating meaning there are multiple copies for MPAA but only one will ever match, then rather than using the usual \1 I hard coded the text to return, e.g. PG-13.
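The same brute-force idea, one hard-coded output per possible input, can be written as a lookup table. The Python sketch below is only an analogy for the approach described above; the helper name and the exact set of lowercase strings are assumptions, since the thread only confirms "pg" and "PG-13".

```python
# One entry per rating, mirroring "one regex per rating" in the scraper.
# The lowercase keys are assumed spellings, not taken from Amazon's HTML.
CERTIFICATES = {
    "g": "G",
    "pg": "PG",
    "pg-13": "PG-13",
    "r": "R",
    "nc-17": "NC-17",
}

def normalise_certificate(raw):
    """Map the lowercase text found in the page to its display form."""
    return CERTIFICATES.get(raw.strip().lower())

print(normalise_certificate("pg"))
print(normalise_certificate("pg-13"))
```

Adding UK certificates would just mean more entries in the table, much as the post goes on to suggest adding more hard-coded regex copies.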

This now looks as complete as is possible with the information Amazon provide.


Hmm, just had a thought, even when one searches Amazon US it can return UK results (and vice versa), it will list the US results first generally as a better match. However if one does search Amazon US and happen to select a UK DVD then one would presumably get a UK Certification rather than an MPAA rating. I will be able to 'fix' this by adding yet more hard coded copies for both types of certification. This will be useful anyway since I want to do a UK version as well. I will probably leave it as an exercise for the reader to do Germany, France, Australia, etc.

C-Quel
2008-08-24, 17:27
Keep up the good work nice to see more hands on deck :)

jelockwood
2008-08-25, 20:43
The Amazon US scraper looks done now, and I have converted it to do Amazon UK. This was not as straightforward as it sounds as there were more differences than you might think, even such minor ones as two spaces where the US one had a single space.

Despite this I have managed to get all the equivalent fields as the US one working with one exception. This is the information for the "Plot" field. Unfortunately Amazon UK make this more difficult in several ways, firstly the most appropriate block of text is much longer, longer than would be preferred. Secondly, it has html formatting mixed in with the text. As far as I can see the scraper would clean up and remove that html formatting but I am struggling to get a regex to extract the text since the embedded html code makes matching more difficult.

Here is what the raw section looks like

<b>Amazon.co.uk Review</b><br />
While <I>Soylent Green</I> may be one of the many <a href="/exec/obidos/tg/feature/-/298867/${0}">dystopian visions</a> of the future, the film stands out because it's one of the few titles that addresses current environmental issues head on. Adapted from Harry Harrison's novel <I>Make Room, Make Room</I>, it gives us a nightmarish vision of an over-populated, polluted future on the brink of collapse--a vision that gets uncomfortably closer every year. Charlton Heston as police officer Thorn investigates a murder in between suppressing food riots and uncovers the nightmarish truth about Soylent Green, the new foodstuff being sold to the poor. <p> The film neatly combines police procedural with conspiracy thriller. Heston's scenes are counterpointed by more elegiac ones in which the centenarian Edward G Robinson as his friend Sol broods on the world he has outlived--his death in a euthanasia chamber is a gloriously lachrymose moment, which he plays to the hilt. Heston, too, is good as Thorn, a morally equivocal cop who loots the apartments of the victims whose deaths he investigates--he's a man just getting by in an impossible world. <p> <B>On the DVD:</B> <I>Soylent Green</I> on disc comes with a commentary from director Richard Fleischer, the highpoint of which is a memorable description of what it was like to work with the brilliant ailing, entirely deaf Robinson. He is joined by Leigh Taylor-Young whose work on the film as heroine led to years of serious environmentalist commitment. It has a useful contemporary making-of documentary and touching shots of Robinson's 100th birthday party with telegrams from Sinatra and others. The feature itself is presented in anamorphic widescreen with its original mono sound. --<I>Roz Kaveney</I>

<br /><br />

(It will be easier to view if you copy and paste it in to an editor.)

All I need is a regex for the above and then both Amazon scrapers can be released.

Note: The URL to view the original Amazon UK page that the above came from is

http://www.amazon.co.uk/Soylent-Green-Charlton-Heston/dp/B0000AISK8/ref=sr_1_1?ie=UTF8&s=dvd&qid=1219686117&sr=1-1

w00dst0ck
2008-08-26, 10:32
This works for me:

Review</b><br />(.*)--<I

w00dst0ck
2008-08-26, 14:17
Tested with some titles at amazon.co.uk and found out that they sometimes use --<i> and sometimes --<I> at the end of the review.

Try this regex and clean it afterwards:
Review</b><br />(.*?)--<
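
For anyone who wants to try the pattern outside XBMC, here is a minimal sketch in Python (not the scraper's own XML format). The sample text is a shortened stand-in for the review above, the `re.DOTALL` and `re.IGNORECASE` flags are assumptions about how the scraper engine treats newlines and case, and the tag stripper is a crude placeholder for whatever clean-up the scraper itself does.

```python
import re

# Pattern from the post above. DOTALL lets "." cross the newline after <br />,
# IGNORECASE is an assumption so that <b>/<B> both match.
REVIEW_RE = re.compile(r"Review</b><br />(.*?)--<", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper, good enough for a sketch


def extract_plot(html):
    """Return the review text with HTML tags removed, or None if no match."""
    match = REVIEW_RE.search(html)
    if match is None:
        return None
    text = TAG_RE.sub("", match.group(1))
    return " ".join(text.split())  # collapse runs of whitespace


# Shortened, hand-made stand-in for the raw Amazon.co.uk section.
sample = (
    "<b>Amazon.co.uk Review</b><br />\n"
    " While <I>Soylent Green</I> may be one of the many "
    '<a href="/x">dystopian visions</a> of the future. <p> '
    "<B>On the DVD:</B> mono sound. --<i>Roz Kaveney</i>\n"
)
print(extract_plot(sample))
```

Because the group is non-greedy it stops at the first "--<", so a plain "--" inside the review text (as in "collapse--a vision") does not cut the match short.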

jelockwood
2008-09-20, 05:44
Tested with some titles at amazon.co.uk and found out that they sometimes use --<i> and sometimes --<I> at the end of the review.

Try this regex and clean it afterwards:
Review</b><br />(.*?)--<

Sorry for the delay in replying, I have been busy doing other things. Your code above does indeed seem to work. However, before I came back and checked, I had come up with this, which also seems to work

<b>Amazon.co.uk Review</b><br />\n ([^\n]*)

I could strip off the author bit the same way you do, like so

<b>Amazon.co.uk Review</b><br />\n ([^\n]*)--<

However I would not be surprised if there are some entries with no author listed and then it would fail to match.

Let me know if you see any drawbacks from mine.
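
One possible way round the missing-author worry, sketched in Python rather than scraper XML: make the "--<I>Author</I>" tail optional so the same pattern matches either way. The sample strings are made up, the optional space after the newline is an assumption about the raw page, and whether the XBMC regex engine accepts this exact syntax would need checking.

```python
import re

PLOT_RE = re.compile(
    r"<b>Amazon\.co\.uk Review</b><br />\n ?"  # header line, as in the post above
    r"([^\n]*?)"                               # review text, kept to a single line
    r"(?:--<[iI]>[^\n]*)?$",                   # optional "--<I>Author</I>" tail
    re.MULTILINE,
)


def review_line(html):
    """Return the review line without the author credit, or None if absent."""
    match = PLOT_RE.search(html)
    return match.group(1).strip() if match else None


# Made-up samples: one with an author credit, one without.
with_author = (
    "<b>Amazon.co.uk Review</b><br />\n"
    " A bleak future--a good film. --<I>Roz Kaveney</I>\n"
)
without_author = (
    "<b>Amazon.co.uk Review</b><br />\n"
    " A bleak future--a good film.\n"
)
print(review_line(with_author))
print(review_line(without_author))
```

Both calls should give the same text; the HTML inside the review would still need cleaning afterwards.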

w00dst0ck
2008-10-01, 18:11
Do you have an example url for that?

jelockwood
2008-12-05, 15:26
I finally got round to setting up a download page for the Amazon video Scrapers I have written (as discussed in this thread).

There is one for Amazon.com and a matching one for Amazon.co.uk, just to make it clear these are Video scrapers, not music or TV scrapers. You can download them as a Zip file from the following link, the Zip file includes the matching Scraper logos as well.

These have been tested on a Mac but should work on all XBMC platforms.

Note: I still recommend people normally should use the standard IMDB scraper first, but if like me, you have some DVDs which IMDB does not list (e.g. Simpsons Christmas Special) then these scrapers will help.

Here is the download link http://homepage.mac.com/jelockwood/scrapers.html

gyrene2083
2008-12-11, 04:40
jelockwood,

I am looking forward to testing your scraper out. I have been following this thread since Oct. I have found that many of the DVDs on IMDB don't have cover art, whereas Amazon does, and that DVD information is missing as well. I appreciate all your efforts, and I am looking forward to testing this out.

jelockwood
2008-12-11, 21:12
jelockwood,

I am looking forward to testing your scraper out. I have been following this thread since Oct. I have found that many of the DVDs on IMDB don't have cover art, whereas Amazon does, and that DVD information is missing as well. I appreciate all your efforts, and I am looking forward to testing this out.

If the only problem is cover art, then you could use the IMDB scraper, and manually select a local picture, or put a picture in the directory with a .tbn file extension. I wrote the scrapers because some titles are not listed at all on IMDB and I still wanted to include them in the XBMC library.

The download link is now live so you can give it a go.

spiff
2008-12-14, 16:34
both are now sitting in svn (r16563). cheers again!

jelockwood
2009-01-11, 07:10
I just tried using the Amazon scrapers (which I mostly wrote) for the first time in several weeks, and damn, they don't work any more for me.

Currently neither is finding any results (so it is not simply an issue of scraping info from a selected result). This was the original problem that I had (constructing a correct query in the scraper, and then getting/showing the list of results). This was originally solved by C-Quel generously providing his original Amazon scraper effort which I then finished off.

Could anyone else confirm whether the Amazon scrapers (either US or UK) are currently still working for them, and if so, which DVD title they used successfully?

If on the other hand, other users confirm it is broken, would anyone be able to assist in diagnosing it?

What held me up last time was that I could not (without a LAN packet sniffer) see what request the scraper sent out and what result it got back from Amazon, and so could not see how far it got. Once I got past this and moved on to scraping the film info, that could be easily tested by seeing how many fields successfully returned results.
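
One way to take the packet sniffer out of the loop is to rebuild the search request outside XBMC and paste it into a browser. The following is only a sketch in Python: the URL layout is the one from the first post in this thread, while the `build_search_url` helper and its `host` parameter are made up for illustration, and using the same path on www.amazon.co.uk is an assumption.

```python
from urllib.parse import quote_plus

# URL layout taken from the first post in this thread.
SEARCH_TEMPLATE = (
    "http://{host}/s/ref=nb_ss_d?url=search-alias=dvd&field-keywords={keywords}"
)


def build_search_url(title, host="www.amazon.com"):
    """Build the DVD search URL, encoding spaces as '+' like Amazon does."""
    return SEARCH_TEMPLATE.format(host=host, keywords=quote_plus(title))


print(build_search_url("soylent green"))
```

The page the browser returns for that URL can then be saved and the search regex tried against it offline.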

C-Quel
2009-01-12, 00:55
Try this...

change Get SearchResults from

imageColumn"[^:]*a href="([^"]*)"[^:]*[^>]*alt="([^"]*)"

to

productTitle"><a href="([^"]*)"> ([^<]*)</a>

or properly formatted

productTitle&quot;&gt;&lt;a href=&quot;([^&quot;]*)&quot;&gt; ([^&lt;]*)&lt;/a&gt;

might not be perfect, as i simply glanced at amazon with no tools to hand.
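
A quick way to check both forms, sketched in Python: run the plain pattern against a fragment of results markup, then XML-escape it to get the form that goes into the scraper file. The HTML fragment below is hand-made for illustration, not copied from a live Amazon page.

```python
import re
from html import escape

# The new search-results pattern from the post above, in its plain form.
RESULT_RE = re.compile(r'productTitle"><a href="([^"]*)"> ([^<]*)</a>')

# Hand-made fragment; the real markup should be checked against a saved page.
sample = (
    '<div class="productTitle"><a href="http://www.amazon.com/'
    'Soylent-Green-John-Barclay/dp/B0016I0AJG/ref=sr_1_1"> Soylent Green</a></div>'
)

for url, title in RESULT_RE.findall(sample):
    print(url, "|", title)

# Escaping the quotes and angle brackets gives the "properly formatted" version.
escaped = escape(RESULT_RE.pattern, quote=True)
print(escaped)
```

The escaped output should be character-for-character the same as the "properly formatted" line above, which is an easy check that nothing was mistyped when editing the scraper by hand.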

jelockwood
2009-01-12, 04:38
I have already thanked C-Quel (again) via a private message, but this fix does so far look successful. I will do some more testing and then put updated versions on my download page and issue a request for them to be included as updated and fixed versions in XBMC.

Many thanks again to C-Quel and everyone else who has helped out in the past.


vdrfan
2009-01-12, 09:47
I have already thanked C-Quel (again) via a private message, but this fix does so far look successful. I will do some more testing and then put updated versions on my download page and issue a request for them to be included as updated and fixed versions in XBMC.

Many thanks again to C-Quel and everyone else who has helped out in the past.

Please use our tracker instead and attach a unified diff to the previous scraper.

ultrabrutal
2009-01-12, 10:16
Amazon does not give permission to get info via http. They have a webservice to use which is legal, however you have to delete the info after 3 months... hehe this means that movies should automatically start to disappear from the library if they were scanned via an Amazon webservice scraper ;)

spiff
2009-01-12, 13:38
both scrapers disabled in svn

nekrosoft13
2009-01-12, 23:27
Amazon does not give permission to get info via http. They have a webservice to use which is legal, however you have to delete the info after 3 months... hehe this means that movies should automatically start to disappear from the library if they were scanned via an Amazon webservice scraper ;)

you gonna ruin everything

jelockwood
2009-01-13, 00:20
Ok, I did some more testing of this updated version and found a couple more issues.

1. There was a problem with processing the DVD title on some entries on Amazon.co.uk, because some titles are formatted differently from others. I believe I have successfully modified the scraper to better cope with this.

2. I took the opportunity to add support for scraping the DVD "Writers" information if available on the Amazon pages. This applies to both the Amazon.com and Amazon.co.uk versions.

I have put the updated versions at this URL for those keen to get it before it appears in the next XBMC release.

http://homepage.mac.com/jelockwood/scrapers.html

Gamester17
2009-01-13, 11:19
Thanks, however please (always) create a new ticket on trac for each new scraper update:
http://xbmc.org/trac (unified diff if possible, or better yet both diff and the full file).

Thanks again :grin:

jelockwood
2009-01-13, 12:30
Thanks, however please (always) create a new ticket on trac for each new scraper update:
http://xbmc.org/trac (unified diff if possible, or better yet both diff and the full file).

Thanks again :grin:

I have reopened and updated the original Trac I used to submit the first version. The purpose of the previous message (from me) was to let those people know what was happening who have been following this thread.

Gamester17
2009-01-13, 13:27
Please create new trac tickets for updates if and when the old ticket has been closed, instead of reopening the old ticket (only if the old ticket has never been closed is it OK to post updates to it). This process is to make tracking management easier.

Thanks again! :nod:

C-Quel
2009-01-13, 20:07
Neither does iMDB

Amazon does not give permission to get info via http. They have a webservice to use which is legal, however you have to delete the info after 3 months... hehe this means that movies should automatically start to disappear from the library if they were scanned via an Amazon webservice scraper ;)

ultrabrutal
2009-01-13, 20:42
Neither does iMDB

Which is probably why spiff disabled it and made themoviedb.org default scraper

Clumsy
2009-01-13, 23:21
*bangs his head in disbelief*

azido
2009-01-14, 11:30
Which is probably why spiff disabled it and made themoviedb.org default scraper

which should mean we all need to create an account on themoviedb and tvdb and support them by adding our movies to those databases and finally get rid of those ***** (censored) commercial db-owners that create completely nonsense rules like deleting info after 3 months.. (when will the day come site owners treat us to remove our bookmarks to their sites after some time period? sick..)

ultrabrutal
2009-01-14, 13:28
which should mean we all need to create an account on themoviedb and tvdb and support them by adding our movies to those databases and finally get rid of those ***** (censored) commercial db-owners that create completely nonsense rules like deleting info after 3 months.. (when will the day come site owners treat us to remove our bookmarks to their sites after some time period? sick..)

I totally agree, but I also understand their rules and of course they must be respected. I have already created an account and started to contribute...

Nuka1195
2009-01-14, 16:03
ultrabrutal, would you be so kind as to inform plex http://forums.plexapp.com/ also inform them they should apply for their own appid and key for weather.com if they haven't yet. that's weather.coms tos.

i'm sure they would appreciate it.

ultrabrutal
2009-01-14, 16:14
nuka, no idea what you are talking about

azido
2009-01-14, 17:24
I totally agree, but I also understand their rules and of course they must be respected. I have already created an account and started to contribute...

sure, i might have been overreacting a bit with my sarcasm. it's just that in general i have an anger against rules that don't make any other sense than just to treat people - which in my opinion this 3 months-rule of amazon is. it's like your grandma is no longer allowed to read fairytales to you, you have to read them personally - and IF she does, she has to forget what she was reading in a certain amount of time.. :;):

anyway, it's up to them to define their rules and same goes for imdb. i wonder if they ever had those eggbots on efnet on target which also parse imdb information to irc-channels ~cough~

i've contributed a lot lately to tvdb.org and of course i will add my movies to themoviedb.org if they don't exist (or the german descriptions are missing yet). good to see a private alternative here.

ultrabrutal
2009-01-14, 17:29
"rules are meant to be broken"

"it's better to ask for forgiveness than to ask for permission"

"it's not a crime unless someone finds out"

I hate double standards and people acting like kids

azido
2009-01-14, 17:51
I hate double standards and people acting like kids

am i?

having an opinion is not acting like a kid :rolleyes:

ultrabrutal
2009-01-14, 17:57
I'm not talking about you here :) More the general reaction...

How IMDB chooses to react to violations we do not know. First they need to know there is a violation, and secondly they need to decide whether they want to pursue it.

XavHorneT
2009-01-22, 12:47
Hey jelockwood
I see your scraper, and i have a personal question.
I've got some music DVDs; do you think it's possible that your scraper could be used for the "musicvideos" db? With your scraper I would get some information on my DVDs, but i think it would be pretty cool to find them in the Video Clip section rather than in the movie section.
Thank you for your help.

jelockwood
2009-01-23, 16:38
Hey jelockwood
I see your scraper, and i have a personal question.
I've got some music DVDs; do you think it's possible that your scraper could be used for the "musicvideos" db? With your scraper I would get some information on my DVDs, but i think it would be pretty cool to find them in the Video Clip section rather than in the movie section.
Thank you for your help.

There are three main sections to XBMC - Movies, TV, and Music. Audio CDs (obviously) go in Music, but DVDs (of music) would not. DVDs of music might themselves be split into two sub-categories. Films like "Ray" (about Ray Charles) or "Walk the Line" (about Johnny Cash) should be considered movies and therefore put in the movies section. However, DVDs containing music videos could be ripped into individual videos, and you could then let XBMC treat them as normal music videos rather than DVDs. (For example I plan to do this for a Red Hot Chili Peppers DVD.)

In answer to your query, my scrapers as they currently stand will scrape music films like "Ray" or "Walk the Line", and should also do similar ones like the Rolling Stones "Shine a Light", and probably live performances. My scrapers do not currently do Audio CDs (even though these are listed on Amazon), but it looks like it would be fairly easy to modify the scrapers to create new (separate) versions specifically for that. My scrapers do not do music videos, I have not looked at this at all so I cannot yet comment on how difficult this would be.

I have been working on a project to rip (a lot) of DVDs on to a NAS for a friend, and have nearly finished that, I then will be doing the Audio CDs, and if (like for films) I find that there are titles missed by the default music scraper then I might get round to doing an Amazon version. However as XBMC now has an iTunes plugin I am also hoping that if all the information is already available via iTunes then this might not be necessary.

XavHorneT
2009-01-24, 16:33
Thank you for all your explanations; I just asked in case you had this type of evolution in mind.
It was just because I have some music DVDs (like Red Hot Chili Peppers - Live at Slane Castle), and I was thinking that it would be good to have some more information and a poster from Amazon.
But I understand that it would be hard work to do this, which is why I preferred to ask you.
Thank you again.

joolz
2009-01-24, 18:48
Ultrabrutal, I dislike you and it is my opinion that everyone else does, too!

joolz
2009-02-09, 00:10
malloc, nice one :)