Quick Scraper Question (Hope so :))
Schenk2302
2009-04-22, 23:01
Hi everyone,
I'm trying to write a scraper but I'm stuck on one step.
I use scrap.exe to test my scraper:
CreateSearchUrl returns fine.
GetSearchResults returns fine.
The details URL is fine.
But GetDetails returns nothing, with the error: Unable to parse details.xml
Here's my code:
<scraper name="TEST" content="movies" thumb="cinefacts.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" language="de">
<CreateSearchUrl dest="3">
<RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
<expression noclean="1"/>
</RegExp>
</CreateSearchUrl>
<GetSearchResults dest="8">
<RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>" dest="8">
<RegExp input="$$1" output="<entity><title>\3 \4</title><url>http://www.cinefacts.de/kino/\1/\2/filmdetails.html</url></entity>" dest="5">
<expression repeat="yes">><a href="/kino/([0-9]*)/(.[^\/]*)/filmdetails.html">[^>]*(.[^<]*)</b></a><br>[^>]*[^\t]+\t+[^ ]+[^0-9]+([^<]+)</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetSearchResults>
<GetDetails dest="3">
<RegExp input="$$5" output="<details>\1</details>" dest="3">
<!--Title -->
<RegExp input="$$1" output="<title>\1</title>" dest="5+">
<expression trim="1" noclean="1"><h1>([^<]*)</expression>
</RegExp>
</RegExp>
</GetDetails>
</scraper>
Maybe someone could have a quick look at this and point me in the right direction.
Thanks so much in advance
Schenk
unfortunately scrap.exe is outdated and we lost the source.
and the reason it does not work is that you are missing the expression for the outermost RegExp in GetDetails, i.e.
....
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
Schenk2302
2009-04-22, 23:30
Hi Spiff,
thanks for your answer, that solved the problem with scrap.exe :)
But now I tried it in XBMC and it doesn't work. I know that scrap.exe is outdated, but is there any way to see at which point XBMC gets stuck with my scraper, or better, why it doesn't work? Are there any scraper logs? At this point I have absolutely no clue where to start looking for the error, because with scrap.exe it works just fine. Thanks again for any hints or info.
Greetz
Schenk
my answer depends on which of three cases applies;
1) you speak c++ and can compile
or
2) you can compile
or
3) neither
Schenk2302
2009-04-22, 23:36
:grin:
maybe 2), more likely 3)
Could you explain why?
Thanks
Schenk
if 1, i could have gotten away with instructions
2 means i'll have to do a patch for you which i will do shortly - here it is; http://dureks.dyndns.org:8080/scraperlog.diff
3 means i don't have to do anything
:)
Schenk2302
2009-04-22, 23:40
2 sounds like something I could try
3 makes me cry because I want that Cinefacts scraper working :)
Schenk2302
2009-04-22, 23:43
little side note:
I made a cinefacts.de scraper for MediaPortal but have now switched to XBMC and would like to use it here. It was hard enough for me to do this in MP; in XBMC I'm getting depressed because it's totally different :)
heh, different does not mean bad. don't give up, you'll get the hang of it =P
Schenk2302
2009-04-22, 23:53
Spiff, I know I'm being kind of lazy, but is there a compiled version with your patch to download, or do I really have to compile it myself? That really scares me :shocked:
that was the prerequisite for 2)
Schenk2302
2009-04-23, 23:16
Hey Spiff,
I don't want to waste your time, but I have one question left. I'm getting on with my scraper and the first parts work very well. Now I parse the genres and that works, but the output looks like Action, Thriller, Horror. How do I get rid of the commas?
Thanks in advance
Schenk
<RegExp input="$$2" output="\1\2" dest="3">
<expression noclean="1,2" repeat="yes">(.*?),(.*)</expression>
</RegExp>
also you should use multiple <genre> tags so maybe something like this?
<RegExp input="$$2" output="<genre>\1</genre>" dest="3">
<expression noclean="1" repeat="yes">(.*?),</expression>
</RegExp>
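For anyone who wants to try the splitting idea outside XBMC, here is a minimal Python sketch. It is only an illustration: Python's `re` module stands in for the scraper's regex engine, `genres_to_tags` is a made-up helper name, and the sample string is the one from the question above.

```python
import re

def genres_to_tags(raw):
    """Turn 'Action, Thriller, Horror' into one <genre> tag per entry."""
    # [^,]+ matches each run of non-comma characters, so the last genre
    # is captured even though no comma follows it.
    names = [part.strip() for part in re.findall(r"[^,]+", raw)]
    return "".join("<genre>%s</genre>" % name for name in names if name)

print(genres_to_tags("Action, Thriller, Horror"))
# <genre>Action</genre><genre>Thriller</genre><genre>Horror</genre>
```

In Python, a pattern like `(.*?),` only matches entries that are followed by a comma, so the last genre would be dropped unless the input ends with one. Whether the scraper engine behaves the same way is an assumption worth checking.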
Schenk2302
2009-04-26, 02:37
Another question, sorry in advance:
If my movie title has a German umlaut in it (ä), the scraper can't find the movie, but when I write ae instead of the umlaut it is found. I tried all the encoding stuff but can't find the answer:
Here's my code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<scraper name="Cinefacts.de" content="movies" thumb="cinefacts.jpg" language="de">
<CreateSearchUrl dest="3">
<RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
<expression noclean="1"/>
</RegExp>
</CreateSearchUrl>
<GetSearchResults dest="8">
<RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>" dest="8">
<RegExp input="$$1" output="<entity><title>\3 (\4)</title><url>http://www.cinefacts.de/kino/\1/\2/filmdetails.html</url></entity>" dest="5">
<expression repeat="yes">><a href="/kino/([0-9]*)/(.[^\/]*)/filmdetails.html">[^<]*<b title="([^"]*)" class="headline">[^<]+</b></a><br>[^<]+<br>+[^0-9]+([^<]*)</td></expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetSearchResults>
</scraper>
thanks again for any hints!!!
Schenk
try escaping the character code with '\xE4'
not sure if that's included in the regular expression engine though
you need to set the SearchStringEncoding on the CreateSearchUrl function
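To see why the encoding of the search string matters, here is a small Python sketch. It is not XBMC code: it only shows how the same title is percent-encoded differently under ISO-8859-1 and UTF-8. The title "Männer" is an arbitrary example, and that the site expects ISO-8859-1 is an assumption based on the iso-8859-1 declarations in the scraper above.

```python
from urllib.parse import quote

title = "Männer"  # arbitrary example title containing an umlaut

# ISO-8859-1 encodes 'ä' as the single byte 0xE4.
latin1 = quote(title, encoding="iso-8859-1")
# UTF-8 encodes 'ä' as the two bytes 0xC3 0xA4.
utf8 = quote(title, encoding="utf-8")

print(latin1)  # M%E4nner
print(utf8)    # M%C3%A4nner
```

A server that decodes the query as ISO-8859-1 would read the UTF-8 form as two unrelated characters, which would explain why the title is not found.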
Schenk2302
2009-04-26, 16:40
As always thanks for this one, Spiff !!!
Schenk2302
2009-04-26, 17:07
Is there something equivalent for the GetDetails section? The plot is displayed with the HTML codes for the umlauts instead of the characters.
Schenk2302
2009-04-28, 22:53
My <GetThumbnailLink dest="5"> outputs as many URLs as there are covers, but my <GetThumbnail dest="5"> only outputs the thumb from the first URL. How do I get all URLs processed, i.e. get all thumbs?
Thanks in advance and sorry for my poor English :)
Schenk
<details><thumbs><thumb>..</thumb><thumb>..</thumb></thumbs></details>
also see http://xbmc.org/forum/showthread.php?t=48643
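The shape of that result can be sketched in Python. This only illustrates the nesting (one `<thumbs>` element holding one `<thumb>` per cover); `build_details` is a hypothetical helper and the example.invalid URLs are placeholders, not real cover links.

```python
import xml.etree.ElementTree as ET

def build_details(thumb_urls):
    """Wrap every cover URL in its own <thumb> inside a single <thumbs>."""
    details = ET.Element("details")
    thumbs = ET.SubElement(details, "thumbs")
    for url in thumb_urls:
        ET.SubElement(thumbs, "thumb").text = url
    return ET.tostring(details, encoding="unicode")

print(build_details(["http://example.invalid/a.jpg",
                     "http://example.invalid/b.jpg"]))
```

The point is that all covers end up in one document, rather than one `<details>` document per cover.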
Schenk2302
2009-04-29, 01:02
Thanks spiff, but I can't follow you; maybe I have a mental block right now :)
This is how it looks right now:
<RegExp input="$$1" output="<url function="GetThumbnailLink">http://www.cinefacts.de/kino/film/\1/\2/plakate.html</url>" dest="5+">
<expression repeat ="yes"><a href="/kino/film/([0-9]*)/([^\/]*)/plakate.html"></expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
<!--Thumbnail-->
<GetThumbnailLink clearbuffers="no" dest="6">
<RegExp input="$$1" output="<details><url function="GetThumbnail">http://www.cinefacts.de/kino/film/\1</url></details>" dest="6">
<expression repeat="yes" noclean="1"><a href="/kino/film/([^"]+)">[^<]*<img</expression>
</RegExp>
</GetThumbnailLink>
<GetThumbnail dest="5">
<RegExp input="$$2" output="<details><thumbs>\1</thumbs></details>" dest="5+">
<RegExp input="$$1" output="<thumb>http://www.cinefacts.de/kino/plakat/\1</thumb>" dest="2">
<expression>="/kino/plakat/([^"]*)"</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetThumbnail>
I can't figure out where to change this.
Cheers
Schenk
why are you repeating in getthumbnaillink? i don't get what you are trying to achieve and hence it is impossible to help you
Schenk2302
2009-04-29, 14:25
Thanks for the answer, spiff; let me try to explain.
In GetDetails I output one URL like ...posters.html
In GetThumbnailLink there are as many output URLs as cover pages, like ...poster_1.html, poster_2.html etc.
I want all the output URLs to be parsed for the cover, and to get all of them.
I hope that makes things clearer for you, and I really appreciate your help!
Thanx
Schenk
aha.
then you want
<RegExp input="$$1" output="<url function="GetThumbnailLink" cache="some.xml" >http://www.cinefacts.de/kino/film/\1/\2/plakate.html</url>" dest="5+">
<expression repeat ="yes"><a href="/kino/film/([0-9]*)/([^\/]*)/plakate.html"></expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
<!--Thumbnail-->
<GetThumbnailLink clearbuffers="no" dest="6">
<RegExp input="$$7" output="<details>\1></details>" dest="6">
<RegExp input="$$1" output=";<url function="GetThumbnail">http://www.cinefacts.de/kino/film/\1</url>" dest="7+">
<expression repeat="yes" noclean="1"><a href="/kino/film/([^"]+)">[^<]*<img</expression>
<RegExp input="" output="<url function="CollectThumbnails>"></url>
<expression/>
</RegExp>
</RegExp>
</GetThumbnailLink>
<GetThumbnail clearbuffers="no" dest="5">
<RegExp input="$$1" output="<thumb>http://www.cinefacts.de/kino/plakat/\1</thumb>" dest="8">
<expression>="/kino/plakat/([^"]*)"</expression>
</RegExp>
<RegExp input="" output="<details></details> dest="5">
<expression noclean="1"/>
</RegExp>
</GetThumbnail>
<CollectThumbnails dest="2">
<RegExp input="$$8" output="<details><thumbs>\1</thumbs></details>" dest="">
<expression noclean="1"/>
</RegExp>
</CollectThumbnails>
i'm sure it's full of typos but i'm at work. shows the idea anyways
Schenk2302
2009-04-29, 22:49
Thanks spiff,
a few questions from me, the stupid boy:
cache="some.xml": what's that?
A few typos are okay, I think I found them, but the empty buffers and the empty dest: are those for real, or are they typos too?
Thanks again, I hope I'll soon be done with questioning and bothering you :)
Schenk
okay, the cache thing is actually to hack around a limitation i will lift soonish. you can only run a scraper function on a valid url.
if you set the cache property on a url, we cache to a local file with that name. usually it is used to run several functions on the same page. in this case we just need *some* valid url to run the last function on, and to avoid fetching anything we use the same url as the one we set the cache property on.
the expressions with the empty inputs are there on purpose. we need to return a valid xml from each function call, or the process stops. empty dest is a typo, it should be 2
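The flow described above can be modelled in a few lines of Python. This is a rough sketch of the idea, not XBMC's implementation: the `buffers` dict stands in for the scraper's numbered buffers kept alive with clearbuffers="no", the two page snippets are made up, and the function names simply mirror the scraper functions.

```python
import re

buffers = {8: ""}  # stand-in for buffer 8, which survives between calls

def get_thumbnail(page):
    """Append one <thumb> to buffer 8 and return an empty but valid result."""
    match = re.search(r'="/kino/plakat/([^"]*)"', page)
    if match:
        buffers[8] += ("<thumb>http://www.cinefacts.de/kino/plakat/%s</thumb>"
                       % match.group(1))
    return "<details></details>"

def collect_thumbnails():
    """Run last: wrap everything gathered in buffer 8 into one result."""
    return "<details><thumbs>%s</thumbs></details>" % buffers[8]

# Made-up cover pages, one per poster.
pages = ['<img src="/kino/plakat/1.jpg">', '<img src="/kino/plakat/2.jpg">']
for page in pages:
    get_thumbnail(page)

print(collect_thumbnails())
```

Each per-page call returns a valid (empty) document so the chain keeps going, and only the final call emits the collected thumbs.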
Schenk2302
2009-04-29, 23:26
Hey spiff,
Here's what I've got now. It didn't work, and I don't know whether I understood the cache thing right; there are probably many mistakes in it. Maybe you can see what's wrong:
<RegExp input="$$1" output="<url function="GetThumbnailLink" cache="http://www.google.de" >http://www.cinefacts.de/kino/film/\1/\2/plakate.html</url>" dest="5+">
<expression repeat ="yes"><a href="/kino/film/([0-9]*)/([^\/]*)/plakate.html"></expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
<!--Thumbnail-->
<GetThumbnailLink clearbuffers="no" dest="6">
<RegExp input="$$7" output="<details>\1</details>" dest="6">
<RegExp input="$$1" output="<url function="GetThumbnail">http://www.cinefacts.de/kino/film/\1</url>" dest="7+">
<expression repeat="yes" noclean="1"><a href="/kino/film/([^"]+)">[^<]*<img</expression>
<RegExp input="" output="<url function="CollectThumbnails"></url>"
<expression/>
</RegExp>
</RegExp>
</GetThumbnailLink>
<GetThumbnail clearbuffers="no" dest="5">
<RegExp input="$$1" output="<thumb>http://www.cinefacts.de/kino/plakat/\1</thumb>" dest="8">
<expression>="/kino/plakat/([^"]*)"</expression>
</RegExp>
<RegExp input="" output="<details></details>" dest="5">
<expression noclean="1"/>
</RegExp>
</GetThumbnail>
<CollectThumbnails dest="2">
<RegExp input="$$8" output="<details><thumbs>\1</thumbs></details>" dest="2">
<expression noclean="1"/>
</RegExp>
</CollectThumbnails>
</scraper>
Thanx Schenk
<RegExp input="$$1" output="<url function="GetThumbnailLink" cache="some.xml">http://www.cinefacts.de/kino/film/\1/\2/plakate.html</url>" dest="5+">
<expression repeat ="yes"><a href="/kino/film/([0-9]*)/([^\/]*)/plakate.html"></expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
<!--Thumbnail-->
<GetThumbnailLink clearbuffers="no" dest="6">
<RegExp input="$$7" output="<details>\1</details>" dest="6">
<RegExp input="$$1" output="<url function="GetThumbnail">http://www.cinefacts.de/kino/film/\1</url>" dest="7">
<expression repeat="yes" noclean="1"><a href="/kino/film/([^"]+)">[^<]*<img</expression>
</RegExp>
<RegExp input="" output="<url function="CollectThumbnails" cache="some.xml" >http://doesnt.matter</url>" dest="7+">
<expression/>
</RegExp>
</RegExp>
</GetThumbnailLink>
<GetThumbnail clearbuffers="no" dest="5">
<RegExp input="$$1" output="<thumb>http://www.cinefacts.de/kino/plakat/\1</thumb>" dest="8+">
<expression>="/kino/plakat/([^"]*)"</expression>
</RegExp>
<RegExp input="" output="<details></details>" dest="5">
<expression noclean="1"/>
</RegExp>
</GetThumbnail>
<CollectThumbnails dest="2">
<RegExp input="$$8" output="<details><thumbs>\1</thumbs></details>" dest="2">
<expression noclean="1"/>
</RegExp>
</CollectThumbnails>
</scraper>
Schenk2302
2009-04-30, 00:08
Error: Unable to parse GetThumbnailLink.xml
Schenk2302
2009-05-01, 00:56
Hey spiff, me again :=
I couldn't get this working and my head is about to explode.
I found a new way to parse, so I changed my code. I think it does the same thing but is a lot easier.
<!--Poster URL-->
<RegExp input="$$1" output="<url function="GetPosters">http://www.cinefacts.de/kino/film/\1/\2/\3/\4/plakat.html</url>" dest="5+">
<expression repeat="yes">"/kino/film/([0-9]*)/([^\/]*)/([^\/]*)/([^\/]*)/plakat.html"\)</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
<!--Poster-->
<GetPosters clearbuffers="no" dest="5">
<RegExp input="$$2" output="<?xml version=<details><thumbs>\1</thumbs></details>" dest="5+">
<RegExp input="$$1" output="<thumb>http://www.cinefacts.de/kino/plakat/\1</thumb>" dest="2">
<expression repeat="yes">href="/kino/plakat/([^"]*)"</expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetPosters>
</scraper>
Poster URL gives me (for example) two valid HTML pages.
GetPosters gives me two URLs but outputs them separately, so only one is available in XBMC. Could you describe once more, for dumb me, what to do to get all covers downloaded?
no. i have already explained it and if there is something i absolutely won't do, it is repeat myself
Schenk2302
2009-05-02, 02:20
I totally understand your point, but I can't make this work.
<!--Poster URL-->
<RegExp input="$$1" output="<url function="GetThumbnailLink" cache="\1.xml" >http://www.cinefacts.de/kino/film/\1/\2/plakate.html</url>" dest="5+">
<expression repeat ="yes"><a href="/kino/film/([0-9]*)/([^\/]*)/plakate.html"></expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
<!--Thumbnail-->
<GetThumbnailLink clearbuffers="no" dest="6">
<RegExp input="$$7" output="<details>\1</details>" dest="6">
<RegExp input="$$1" output="<url function="GetThumbnail">http://www.cinefacts.de/kino/film/\1</url>" dest="7">
<expression repeat="yes" noclean="1"><a href="/kino/film/([^"]+)">[^<]*<img</expression>
</RegExp>
<RegExp input="" output="<url function="CollectThumbnails" cache="\1.xml">http://www.cinefacts.de/kino/datenbank.html</url>" dest="7+">
<expression/>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetThumbnailLink>
<GetThumbnail clearbuffers="no" dest="5">
<RegExp input="$$1" output="<thumb>http://www.cinefacts.de/kino/plakat/\1</thumb>" dest="8+">
<expression>href="/kino/plakat/([^"]*)"</expression>
</RegExp>
<RegExp input="" output="<details></details>" dest="5">
<expression noclean="1"/>
</RegExp>
</GetThumbnail>
<CollectThumbnails dest="2">
<RegExp input="$$8" output="<details><thumbs>\1</thumbs></details>" dest="2">
<expression noclean="1"/>
</RegExp>
</CollectThumbnails>
</scraper>
GetThumbnailLink gives me the valid URLs, so I think something must be wrong after that.
If I change the dest in GetThumbnail from 8+ to 5+, it outputs two separate posters. I have no clue what else to do. I'd love to make this work too, because I'm nearly finished with the scraper; I even got fanart to work.
Anyway, thanks again for all your help.
Schenk
hi,
one problem is that you set the cache to .xml in the collectthumbnails call. use a literal filename.
other than that i do not see what could be wrong. are you sure the expressions are correct?
perhaps i could have the whole file so i could see what goes wrong?
Schenk2302
2009-05-02, 02:37
Hi spiff,
Thanks, I'll try to play around a little more.
Yes, I think the regexes are okay.
Here's my file, but with the old poster part.
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<scraper name="Cinefacts" content="movies" thumb="cinefacts.jpg" language="de">
<GetSettings dest="3">
<RegExp input="$$5" output="<settings>\1</settings>" dest="3">
<RegExp input="$$1" output="<setting label="Fanart" type="bool" id="fanart" default="true"></setting>" dest="5+">
<expression></expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetSettings>
<CreateSearchUrl dest="3" SearchStringEncoding="iso-8859-1">
<RegExp input="$$1" output="http://www.cinefacts.de/suche/suche.php?name=\1" dest="3">
<expression noclean="1"/>
</RegExp>
</CreateSearchUrl>
<GetSearchResults dest="8">
<RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>" dest="8">
<RegExp input="$$1" output="<entity><title>\3 (\4)</title><url>http://www.cinefacts.de/kino/\1/\2/filmdetails.html</url></entity>" dest="5">
<expression repeat="yes">><a href="/kino/([0-9]*)/(.[^\/]*)/filmdetails.html">[^<]*<b title="([^"]*)" class="headline">[^<]+</b></a><br>[^<]+<br>+[^0-9]+([^<]*)</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetSearchResults>
<GetDetails dest="3">
<RegExp input="$$5" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><details>\1</details>" dest="3">
<!--Title-->
<RegExp input="$$1" output="<title>\1</title>" dest="5+">
<expression trim="1" noclean="1"><h1>([^<]*)</expression>
</RegExp>
<!--Original Title-->
<RegExp input="$$1" output="<originaltitle>\1</originaltitle>" dest="5+">
<expression><dt class="c1">Originaltitel:</dt>[^<]*<dd class="first">(.[^<]*)</dd></expression>
</RegExp>
<!--Genre-->
<RegExp input="$$1" output="\1" dest="4+">
<expression noclean="1">Genre:([^:]*)Deutschlandstart:</expression>
</RegExp>
<RegExp input="$$4" output="<genre>\1</genre>" dest="5+">
<expression repeat="yes" noclean="1" trim="1">>*[ A-Za-z]([^<>]*)</a></expression>
</RegExp>
<!--Director Film-->
<RegExp input="$$1" output="\1" dest="7+">
<expression noclean="1">Regie:([^:]*)Buch:</expression>
</RegExp>
<RegExp input="$$7" output="<director>\1</director>" dest="5+">
<expression repeat="yes" ><a href="[^"]*">([^<]*)</a></expression>
</RegExp>
<!--Actors-->
<RegExp input="$$1" output="\1" dest="7+">
<expression noclean="1">Darsteller:([^|]*)</expression>
</RegExp>
<RegExp input="$$7" output="<actor><name>\1</name><role>\2</role></actor>" dest="5+">
<expression repeat="yes">>([^<>]*)</a></td>+[^<]+<[^>]+> als([ A-Za-z]*)</expression>
</RegExp>
<!--Studio-->
<RegExp input="$$1" output="<studio>\1</studio>" dest="5+">
<expression>Studio:([^\.]*)\.</expression>
</RegExp>
<!--Year-->
<RegExp input="$$1" output="<year>\1</year>" dest="5+">
<expression></a> ([0-9]*) </dd></expression>
</RegExp>
<!--MPAA-->
<RegExp input="$$1" output="<mpaa>\1</mpaa>" dest="5+">
<expression>FSK:</dt>[^>]*>([^<]*)<</expression>
</RegExp>
<!--Runtime-->
<RegExp input="$$1" output="<runtime>\1</runtime>" dest="5+">
<expression>L.nge:</dt>[^>]*>([^<]*)<</expression>
</RegExp>
<!--Plot-->
<RegExp input="$$1" output="<plot>\1</plot>" dest="5+">
<expression>KURZINHALT</h2></li>[^>]*>*([^<]*)[</li>]</expression>
</RegExp>
<!--Writers-->
<RegExp input="$$1" output="\1" dest="6+">
<expression noclean="1">Buch:([^:]*)Musik:</expression>
</RegExp>
<RegExp input="$$6" output="<credits>\1</credits>" dest="5+">
<expression repeat="yes" ><a href="[^"]*">([^<]*)</a></expression>
</RegExp>
<!--Poster URL-->
<RegExp input="$$1" output="<url function="GetPosters">http://www.cinefacts.de/kino/film/\1/\2/\3/\4/plakat.html</url>" dest="5+">
<expression repeat="yes">"/kino/film/([0-9]*)/([^\/]*)/([^\/]*)/([^\/]*)/plakat.html"\)</expression>
</RegExp>
<!--IMDB URL-->
<RegExp input="$$1" output="<url function="GetIMDBid">http://akas.imdb.com/find?s=tt;q=\2 (\1)</url>" dest="5+">
<expression><h1>[^<]*</h1>[^0-9]*([0-9]*) </li>[^:]*:</dt>[^<]*<dd class="first">(.[^<]*)</dd></expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
<!--Poster-->
<GetPosters clearbuffers="no" dest="5">
<RegExp input="$$2" output="<?xml version=<details><thumbs>\1</thumbs></details>" dest="5+">
<RegExp input="$$1" output="<thumb>http://www.cinefacts.de/kino/plakat/\1</thumb>" dest="2">
<expression repeat="yes">href="/kino/plakat/([^"]*)"</expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
</GetPosters>
<!--Get IMDB ID-->
<GetIMDBid dest="5">
<RegExp input="$$2" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"><details>\1</details>" dest="5">
<RegExp input="$$1" output="<url function="GetTMDBId">http://api.themoviedb.org/2.0/Movie.imdbLookup?imdb_id=\1&amp;api_key=57983e31fb435df4df77afb854740ea9</url>" dest="2+">
<expression>/title/([t0-9]*)</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetIMDBid>
<!-- Fanart -->
<GetTMDBId dest="5">
<RegExp conditional="fanart" input="$$1" output="<details><url function="GetTMDBFanart">http://api.themoviedb.org/2.0/Movie.getInfo?id=\1&amp;api_key=57983e31fb435df4df77afb854740ea9</url></details>" dest="5">
<expression><id>([0-9]*)</id></expression>
</RegExp>
</GetTMDBId>
<GetTMDBFanart dest="5">
<RegExp input="$$2" output="<details><fanart url="http://themoviedb.org/image/backdrops">\1</fanart></details>" dest="5">
<RegExp input="$$1" output="<thumb preview="/\1/\2_poster.jpg">/\1/\2.jpg</thumb>" dest="2">
<expression repeat="yes"><backdrop size="original">http://www.themoviedb.org/image/backdrops/([0-9]*)/([^\.]*).jpg</backdrop></expression>
</RegExp>
<expression noclean="1">(.+)</expression>
</RegExp>
</GetTMDBFanart>
</scraper>
Thanks
Schenk
http://xbmc.org/trac/ticket/6485
works for me
Schenk2302
2009-05-02, 03:46
You're my hero. Thanks so much for all your help!
I'll now take a close look at it, maybe change some details to improve it, and try to understand everything :)
The reason for this scraper is that the Moviemaze scraper is very good but only lists the cinema films, not the DVD releases and so on. Okay, that's it for now, and again thank you very, very much.
Good night
Schenk
Schenk2302
2009-05-03, 13:23
Hi spiff,
I want to improve the parsing of the IMDB ID. In most cases it works well and gets the right ID, but I'm stuck with one movie, for example, which gets the wrong ID.
The movie title is Beverly Hills Chihuahua. This code searches with the parsed original title and year, in this example for
Beverly Hills Chihuahua : South of the Border (AT) (2008)
<!--IMDB URL-->
<RegExp input="$$1" output="<url function="GetIMDBid">http://akas.imdb.com/find?s=tt;q=\2 (\1)</url>" dest="5+">
<expression><h1>[^<]*</h1>[^0-9]*([0-9]*) </li>[^:]*:</dt>[^<]*<dd class="first">(.[^<]*)</dd></expression>
</RegExp>
<expression noclean="1"/>
If I type this directly into the browser, it finds the IMDB results page and the regex matches.
The problem I have now is that here
<!--Get IMDB ID-->
<GetIMDBid dest="5">
<RegExp input="$$2" output="<?xml version="1.0" encoding="iso-8859-1" standalone="yes"><details>\1</details>" dest="5+">
<RegExp input="$$1" output="<url function="GetTMDBId">http://api.themoviedb.org/2.0/Movie.imdbLookup?imdb_id=\1&amp;api_key=57983e31fb435df4df77afb854740ea9</url>" dest="2+">
<expression>/title/([t0-9]*)</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetIMDBid>
it gets IMDB ID 0086960, which belongs to Beverly Hills Cop, so the fanart is wrong. The question now is: where does it get this wrong ID from, and why? The akas search result only shows the right movie.
Thanks for any help
Schenk
Schenk2302
2009-05-04, 13:36
And again another question:
<!--IMDB Rating URL -->
<RegExp input="$$1" output="<url function="GetIMDBRating">http://akas.imdb.com/find?s=tt;q=\2 (\1)</url>" dest="5+">
<expression><h1>[^<]*</h1>[^0-9]*([0-9]*) </li>[^:]*:</dt>[^<]*<dd class="first">(.[^<]*)</dd></expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetDetails>
<GetIMDBRating dest="5">
<RegExp input="$$2" output="<details>\1</details>" dest="5+">
<RegExp input="$$1" output="<url function="GetRating">http://www.imdb.com/title/\1</url>" dest="2+">
<expression>/title/([t0-9]*)/</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetIMDBRating>
<GetRating dest="5">
<RegExp input="$$1" output="<rating>\1</rating><votes>\2</votes>" dest="5+">
<expression><b>([0-9.]+)/10</b>[^<]*<a href="ratings" class="tn15more">([0-9,]+) votes</a></expression>
</RegExp>
<expression noclean="1"/>
</GetRating>
</scraper>
All of this works well in the scrap.exe test, but rating and votes are not shown in XBMC. Any reason why?
yes, ALL output from a scraper you want parsed needs the surrounding <details> tags.
and the number of expressions and RegExps doesnt add up.....
Schenk2302
2009-05-04, 13:59
Things could be so easy, thanks again it works now !!!
yes, ALL output from a scraper you want parsed needs the surrounding <details> tags.
Maybe my English isn't good enough, but I don't understand what this means...
and the number of expressions and RegExps doesnt add up....
Thanks so much for helping out!!!
<GetRating dest="5">
<RegExp input="$$1" output="<rating>\1</rating><votes>\2</votes>" dest="5+">
<expression><b>([0-9.]+)/10</b>[^<]*<a href="ratings" class="tn15more">([0-9,]+) votes</a></expression>
</RegExp>
<expression noclean="1"/>
</GetRating>
see; one <RegExp>, two <expression>
Schenk2302
2009-05-04, 14:59
Sorry, I'm too stupid; I can't see where you're leading me :(
<expression> tags only makes sense wrapped in a <RegExp>. the latter one is surely NOT wrapped in a <RegExp> -> it's useless
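The structural point can be checked mechanically. The sketch below is a hypothetical lint in Python, not part of XBMC: `stray_expressions` is a made-up helper, and the sample is a shortened GetRating with the output attribute written in escaped form so that it parses as XML. It only reports where an `<expression>` sits outside a `<RegExp>`; it says nothing about how XBMC itself reacts to such a line.

```python
import xml.etree.ElementTree as ET

def stray_expressions(xml_text):
    """Return the tag of every non-RegExp parent that holds an <expression>."""
    root = ET.fromstring(xml_text)
    return [parent.tag
            for parent in root.iter()
            for child in parent
            if child.tag == "expression" and parent.tag != "RegExp"]

# Shortened sample: one <RegExp>, two <expression> elements.
sample = """<GetRating dest="5">
  <RegExp input="$$1" output="&lt;rating&gt;\\1&lt;/rating&gt;" dest="5+">
    <expression>([0-9.]+)/10</expression>
  </RegExp>
  <expression noclean="1"/>
</GetRating>"""

print(stray_expressions(sample))  # ['GetRating']
```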
Schenk2302
2009-05-04, 15:16
That's what I thought you were trying to tell me, but if I delete it, parsing doesn't work (or maybe it only fails with scrap.exe?)
Thanks
Schenk
any result from scrap.exe is irrelevant
Schenk2302
2009-05-04, 20:40
I just tried it without the line in XBMC and it doesn't work; if I add the line, it works :nod:. I don't know why.
Schenk2302
2009-05-04, 21:04
Here we go again, stealing your time:
You know how I get the covers from cinefacts, because you set it up.
My question now: is it possible to add covers from a different site and get them all together from both sites?
Thanks for helping me learn :)
yes, have a look at the imdb scraper for instance. or how you add the fanart for that matter....
Schenk2302
2009-05-04, 23:07
That's freaking me out; here's what I have so far:
<GetPosterLinkURL dest="5">
<RegExp input="$$2" output="<details>\1</details>" dest="5+">
<RegExp input="$$1" output="<url function="GetPosterURL">http://www.moviemaze.de/filme/\1/\2</url>" dest="2+">
<expression><a href="/filme/([0-9]+)/([^"]*)"</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetPosterLinkURL>
<GetPosterURL dest="5">
<RegExp conditional="poster" input="$$1" output="<details><url function="GetPoster">http://www.moviemaze.de/media/poster/\1/\2</url></details>" dest="5">
<expression><a href="/media/poster/([0-9]+)/([^"]*)"</expression>
</RegExp>
</GetPosterURL>
<GetPoster dest="5">
<RegExp input="$$2" output="<details><poster url="http://www.moviemaze.de/filme/\1/poster_lg\2.jpg</poster></details>" dest="5">
<RegExp input="$$1" output="<thumb>http://www.moviemaze.de/filme/\1/poster_lg\2.jpg</thumb>" dest="2">
<expression repeat="yes">/([0-9]+)/poster([0-9]+)</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetPoster>
There must be something wrong in the last section, GetPoster, but as always I can't find it.
Thanks
Schenk
there is no <poster> tag, where did you get that idea from?
Schenk2302
2009-05-04, 23:11
I thought the tag name was free, so I just replaced fanart. :no:
well, all the time spent the last days struggling with <thumbs> and <thumb> should make it clear how you add thumbs to the result
Schenk2302
2009-05-05, 23:04
Hey spiff,
After struggling the whole night and day, I finally made it :) It works now, and in the end it was easier than I had thought all day! But I'm not finished bothering you: please let me know if you can help me here.
<RegExp input="$$1" output="<url function="GetPosterLinkURL">http://www.moviemaze.de/suche/result.phtml?searchword=\1</url>" dest="5+">
The problem here is that it only searches for the first word.
For example, for Das Hundehotel it only searches for Das; same for The Last..., where it only searches for The. Is there any way to change this?
Thanks again
Schenk
you need to run a replacement regexp, replacing ' ' with %20. something along this;
(grab the relevant title in, e.g. $$5)
<RegExp input="$$5" output="\1%20\2" dest="7">
<expression repeat="yes">(.*?) (.*)</expression>
</RegExp>
Schenk2302
2009-05-05, 23:33
As usual, I don't understand where to put this. Is it an inner, outer, or separate RegExp?
<!--Moviemaze Poster URL-->
<RegExp input="$$1" output="<url function="GetPosterLinkURL">http://www.moviemaze.de/suche/result.phtml?searchword=\1</url>" dest="5+">
<expression noclean="1"><h1>([^<]*)</expression>
</RegExp>
1. grab whatever you want to search for into a buffer (as i already stated).
<RegExp input="$$1" output="\1" dest="6">
<expression noclean="1"><h1>([^<]*)</expression>
</RegExp>
2. run the replacement regexp
<RegExp input="$$6" output="\1%20\2" dest="7">
<expression repeat="yes">(.*?) (.*)</expression>
</RegExp>
3. finally construct the url based on your new and shiny space-replaced title
<RegExp input="$$7" output="<url function="GetPosterLinkURL">http://www.moviemaze.de/suche/result.phtml?searchword=\1</url>" dest="5+">
<expression noclean="1"/>
</RegExp>
Schenk2302
2009-05-06, 10:24
<RegExp input="$$1" output="\1" dest="6">
<expression noclean="1"><h1>([^<]*)</expression>
</RegExp>
<RegExp input="$$6" output="\1%20\2" dest="7">
<expression repeat="yes">(.*?) (.*)</expression>
</RegExp>
<RegExp input="$$7" output="<url function="GetPosterLinkURL">http://www.moviemaze.de/suche/result.phtml?searchword=\1</url>" dest="5+">
<expression noclean="1"/>
</RegExp>
Thanks again spiff, i now understand and got it working, but now i think it only searches for e.g. Der letzte and not Der letzte Zug. Maybe i have to change the expression, but i don't know what the old expression (.*?) (.*) is doing.
-Schenk
well, you should know what that does - it is just a regular expression.
that being said; my bad. you want
<RegExp input="$$6" output="\1%20" dest="7">
<expression repeat="yes">([^ ]+)</expression>
</RegExp>
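A rough Python sketch of why the first expression only replaced one space and the corrected one replaces all of them. It assumes that repeat="yes" expands the output template once per non-overlapping match and concatenates the pieces; apply_regexp is a hypothetical helper for illustration, not XBMC code, and Python's re module may differ from the scraper's regex engine in details.

```python
import re


def apply_regexp(expression: str, output: str, text: str) -> str:
    """Hypothetical stand-in for a scraper <RegExp> with repeat="yes".

    Assumed behaviour (not taken from the XBMC source): expand the output
    template for every non-overlapping match and concatenate the results.
    """
    return "".join(m.expand(output) for m in re.finditer(expression, text))


title = "Der letzte Zug"

# First attempt: (.*) swallows the rest of the title, so there is a single
# match and only the first space is replaced.
first = apply_regexp(r"(.*?) (.*)", r"\1%20\2", title)

# Corrected version: one match per word, each one followed by %20.
second = apply_regexp(r"([^ ]+)", r"\1%20", title)

print(first)   # Der%20letzte Zug
print(second)  # Der%20letzte%20Zug%20
```

The second result still ends in %20, because the template is appended after every word, including the last one.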
w00dst0ck
2009-05-06, 14:43
@Schenk2302:
When I wrote the moviemaze.de scraper and had to deal with RegEx for the first time, this site helped me.
http://www.regex-tester.de/regex.html
Schenk2302
2009-05-06, 20:19
Hi woodstock,
Yes, thank you. I've already looked there too, but sometimes the penny just doesn't drop.
Regards
Schenk
Schenk2302
2009-05-06, 20:22
Hey spiff,
got it working with your help. little problem i have now is:
Der%20letzte%20Zug%20
How to get rid of the trailing %20 ?? Another regexp?
Thanks
Schenk
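One possible answer is indeed another pass that drops a final %20. Below is a minimal Python sketch of that idea; strip_trailing_space_escape is a made-up name. In the scraper this would presumably be one more RegExp over the buffer, with an expression along the lines of (.*)%20$ and output \1, but that is an untested assumption.

```python
import re


def strip_trailing_space_escape(encoded: str) -> str:
    """Remove a single trailing %20 if present; leave other input unchanged."""
    return re.sub(r"%20$", "", encoded)


print(strip_trailing_space_escape("Der%20letzte%20Zug%20"))  # Der%20letzte%20Zug
print(strip_trailing_space_escape("Hundehotel"))             # Hundehotel
```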
Schenk2302
2009-05-07, 01:18
Okay, now i'm trying to do some cosmetics:
Sometimes the plot on the site i parse is written with real umlauts (ä, ü, ö), like:
Anfänglich hält ...
sometimes it is written with entity tags, like:
pl&ouml;tzlich
I tried every encoding and noclean setting but can't get the second variant to display as "plötzlich"; instead it shows the raw "pl&ouml;tzlich" from above.
Any hints ???
Thanks in advance
Schenk
run a replacement regexp. or give me a list of tags that aren't cleaned/replaced properly and i'll add them to the list.
Schenk2302
2009-05-07, 16:41
i think there's only:
# &auml; -> ä
# &Auml; -> Ä
# &ouml; -> ö
# &Ouml; -> Ö
# &uuml; -> ü
# &Uuml; -> Ü
# &szlig; -> ß
thanks
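For reference, a small Python sketch of the replacement being asked for, assuming the site uses the standard HTML named entities for the German umlauts and ß (&auml;, &Auml;, &ouml;, &Ouml;, &uuml;, &Uuml;, &szlig;). It only illustrates the mapping; it says nothing about how XBMC's own cleaning step is implemented.

```python
import html

# Assumed entity spellings: the standard HTML named entities.
entities = ["&auml;", "&Auml;", "&ouml;", "&Ouml;", "&uuml;", "&Uuml;", "&szlig;"]

# html.unescape resolves named (and numeric) character references.
decoded = [html.unescape(entity) for entity in entities]

for entity, char in zip(entities, decoded):
    print(entity, "->", char)

print(html.unescape("pl&ouml;tzlich"))  # plötzlich
```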
Schenk2302
2009-05-08, 12:32
I tried everything but can't get this to work:
Here's the regexp:
KURZINHALT</h2></li>[^>]*>+(.*)</li>
Here's e.g. some text:
KURZINHALT</h2></li>
<li class="c1">Kult comes back! Starsky & Hutch sind wieder da!<br />
Jetzt erfahren die Buddies ihr Leinwandcomeback: Schrill, laut und jederzeit locker... Die amerikanischen Ausnahme-Comedians BEN STILLER („Verrückt nach Mary“, „Meine Braut, ihr Vater und ich“) und OWEN WILSON („Die Royal Tenenbaums“, „Shanghai Knights“) schlüpfen in die lässigen Outfits der Undercover-Agenten und heften sich an die Fersen des zwielichtigen Geschäftsmannes Reese Feldmann (VINCE VAUGHN) und dessen Freundin Kitty (JULIETTE LEWIS). Mit Hilfe ihres gerissenen Informanten Huggy Bear (SNOOP DOGG) und den entzückenden Cheerleadern Staci (CARMEN ELECTRA), Holly (AMY SMART) und Heather (BRANDIE RODERICK) wollen der angeknackste Starsky (Ben Stiller) und Womanizer Hutch (Owen Wilson) der Gerechtigkeit genügetun...</li>
The first &amp; is displayed correctly as &, but all the umlaut entities after it are displayed as written in the source instead of as umlauts.
Thanks in advance
Schenk
Schenk2302
2009-05-08, 16:14
maybe my above question was stupid again, but why is the &amp; displayed correctly as & and the umlauts not ???:(
because it is within html tags. i have found the issue but had no time to test
Schenk2302
2009-05-08, 16:55
Okay, thanks for taking care !!!
Schenk2302
2009-05-08, 22:40
hey spiff,
i have found some little cosmetic issues, and i think it's not just my scraper, because i tried ofdb too:
in xbmc, in the search results window, umlauts, . and : are shown fine, but i'm not able to display &, no matter which setting i try. Examples are Starsky & Hutch and Fast & Furious; they're shown without the & (Starsky Hutch). This also happens in the title after parsing.
another question from me, sorry :=
in the output=, am i allowed to put two different url functions in one line, one after another, because they use the same regexp???
Thanks
Schenk
a bare & is not an allowed char in xml, nor in html.
if the pages hold literal &'s they are not html compatible unless they are in CDATA or verbatim fields.
if they ARE in those fields, the scraper needs to handle the & -> &amp; conversion.
and yes on the question. you can add thousands of fields to the xml at the same time if you so see fit. to the parser it's all just text.
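To illustrate the point about the ampersand, here is a small Python sketch using only the standard library. It shows that a bare & makes a fragment ill-formed for an XML parser, while the escaped form parses and yields the original title. is_well_formed is a hypothetical helper; the scraper itself would have to do the equivalent escaping in its output.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


title = "Starsky & Hutch"

raw = f"<title>{title}</title>"              # bare ampersand
escaped = f"<title>{escape(title)}</title>"  # & written as &amp;

print(is_well_formed(raw))          # False
print(is_well_formed(escaped))      # True
print(escaped)                      # <title>Starsky &amp; Hutch</title>
print(ET.fromstring(escaped).text)  # Starsky & Hutch
```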
Schenk2302
2009-05-11, 22:00
<GetPosterLinkURL dest="5">
<RegExp input="$$2" output="<details>\1</details>" dest="5+">
<RegExp input="$$1" output="<url function="GetPosterURL">http://www.moviemaze.de/filme/\1/\2</url>" dest="2+">
<expression><a href="/filme/([0-9]+)/([^"]*)"</expression>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetPosterLinkURL>
<GetPosterURL dest="5">
<RegExp input="$$1" output="<details><url function="GetPoster">http://www.moviemaze.de/media/poster/\1/\2</url></details>" dest="5+">
<expression><a href="/media/poster/([0-9]+)/([^"]*)"</expression>
</RegExp>
</GetPosterURL>
<GetPoster clearbuffers="no" dest="5">
<RegExp input="$$1" output=";<thumb>http://www.moviemaze.de/filme/\1/poster_lg\2.jpg</thumb>;" dest="10+">
<expression repeat="yes">/([0-9]+)/poster([0-9]+)</expression>
</RegExp>
</GetPoster>
<GetThumbnailLink clearbuffers="no" dest="6">
<RegExp input="$$7" output="<details>\1</details>" dest="6">
<RegExp input="$$1" output="<url function="GetThumbnail">http://www.cinefacts.de/kino/film/\1</url>" dest="7">
<expression repeat="yes" noclean="1"><a href="/kino/film/([^"]+)">[^<]*<img</expression>
</RegExp>
<RegExp input="" output="<url function="CollectThumbnails" cache="film.xml">http://www.cinefacts.de/kino/datenbank.html</url>" dest="7+">
<expression/>
</RegExp>
<expression noclean="1"/>
</RegExp>
</GetThumbnailLink>
<GetThumbnail clearbuffers="no" dest="5">
<RegExp input="$$1" output="<thumb>http://www.cinefacts.de/kino/plakat/\1</thumb>" dest="8+">
<expression>href="/kino/plakat/([^"]*)"</expression>
</RegExp>
<RegExp input="" output="<details></details>" dest="5">
<expression noclean="1"/>
</RegExp>
</GetThumbnail>
<CollectThumbnails dest="2">
<RegExp input="$$10$$8" output="<details><thumbs>\1</thumbs></details>" dest="2+">
<expression noclean="1"/>
</RegExp>
</CollectThumbnails>
Hey spiff,
this almost works but i have one problem. let me explain: if there are covers on both cinefacts and moviemaze, all covers are shown. If there is only a cover on cinefacts, it is shown. But when there's only a cover on moviemaze, none is shown. I think it has something to do with the $$10$$8 but i don't know exactly, and i'd really appreciate it if you could help here :)
Thanks so much
Schenk
Schenk2302
2009-05-12, 14:01
could anyone help with the above? :sad:
thanks in advance
Schenk
why are you concatenating to buffer 2 in the collect function?
i really do not have the time to help now. you do know that new builds log all the scraper output?
Schenk2302
2009-05-12, 14:55
Thanks for the answer spiff,
no, i didn't know, but i've upgraded now. the log shows me the output when both moviemaze and cinefacts have covers. it shows nothing when only moviemaze has one.
as for the collect section, that's what you gave me!
maybe, but i never said that the code i give you is exactly correct. i only give pointers to illustrate the concepts.
Schenk2302
2009-05-12, 17:13
spiff, you added the script as a ticket :grin:
i'm not even having this discussion