| Scraper Development Developers forum for meta data scrapers. Scraper developers only! Not for posting feature requests, bugs, or end-user support requests! |
#1
Junior Member
Join Date: Aug 2008
Posts: 25
Hi there,
I'm currently developing a scraper for http://www.moviemaze.de. I managed to scrape most of the provided information, but I have some trouble with the following issues:

Director:
Moviemaze.de HTML
HTML Code:
<td valign=top class="standard"> <span class="fett">Regie:</span> </td>
<td valign=top class="standard_justify"> Guillermo Del Toro </td>
</tr>
Code:
<RegExp input="$$6" output="<director>\1</director>" dest="5+">
  <RegExp input="$$1" output="\2" dest="6">
    <expression trim=2">Regie([^"]*)"standard_justify"(.*?)<</expression>
  </RegExp>
  <expression>([A-Za-z0-9 ,.]+)</expression>
</RegExp>
I decided to get the result in two steps because it's surrounded by tabs. Any ideas why XBMC displays TRIM?

Actors:
Moviemaze.de HTML
HTML Code:
<tr>
<td valign=top class="standard"> <span class="fett">Darsteller:</span> </td>
<td valign=top class="standard_justify" width=100%>
<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane)
</td>
</tr>
Code:
<RegExp input="$$2" output="<actor><name>\1</name><role>\2</role></actor>" dest="5+">
  <RegExp input="$$1" output="\2" dest="2">
    <expression repeat="yes">Darsteller:([^%]*)%>(.*?)</tr</expression>
  </RegExp>
  <expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>
</RegExp>

German umlauts [äöüß]:
I can't match words that contain German umlauts or ß. I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9äÄöÖüÜ ,.]+) without success.

Can somebody please help me?

Regards,
w00dst0ck
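The two-step approach described above can be sketched outside XBMC. This is a minimal illustration using Python's re module, not the XBMC scraper engine; the tab and newline layout of the sample markup and the helper name extract_director are assumptions made for the example.

```python
import re

# Sample markup adapted from the post. The exact tabs and newlines are an
# assumption about how the page is laid out.
HTML = (
    '<td valign=top class="standard">\n\t<span class="fett">Regie:</span>\n</td>\n'
    '<td valign=top class="standard_justify">\n\t\tGuillermo Del Toro\n\t</td>\n</tr>'
)


def extract_director(html):
    """Hypothetical helper mirroring the nested RegExp from the post."""
    # Step 1: capture everything between the "standard_justify" attribute
    # and the next "<" (the outer expression from the post).
    outer = re.search(r'Regie([^"]*)"standard_justify"(.*?)<', html, re.DOTALL)
    if outer is None:
        return None
    # Step 2: take the first run of name characters, which skips the
    # leading ">" and the surrounding tabs and newlines.
    inner = re.search(r'[A-Za-z0-9 ,.]+', outer.group(2))
    return inner.group(0).strip() if inner else None


print(extract_director(HTML))
```

The second step is what removes the surrounding whitespace: tabs and newlines are not in the character class, so the match starts at the first letter of the name.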
#2
Grumpy Bastard Developer
Join Date: Nov 2003
Posts: 7,715
If that's a copy & paste, note the missing opening quote in trim=2":
Code:
<expression trim=2">Regie([^"]*)"standard_justify"(.*?)<</expression>
Actors: try adding the href stuff with a ?, i.e. make it optional.
Umlauts: this might have to do with encoding. Make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding).
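The "make the href optional" suggestion can be tried in isolation. The following is a rough sketch in Python's re module (an assumption, since the thread is about the XBMC scraper engine); ACTOR_RE and parse_actors are hypothetical names, and the pattern only handles actors that have a role in parentheses.

```python
import re

# A shortened version of the actor cell quoted in the first post.
ACTORS = (
    '<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
    '<a href="/celebs/114/1.html">Heath Ledger</a> (Joker), '
    'Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon)'
)

# The opening and closing anchor tags are each wrapped in an optional
# non-capturing group, so linked and unlinked names match the same pattern.
ACTOR_RE = re.compile(r'(?:<a href="[^"]*">)?([^<(,]+?)(?:</a>)?\s*\(([^)]*)\)')


def parse_actors(fragment):
    """Return (name, role) pairs; hypothetical helper for illustration."""
    return [(name.strip(), role.strip()) for name, role in ACTOR_RE.findall(fragment)]


for name, role in parse_actors(ACTORS):
    print(name, "->", role)
```

Excluding "<", "(" and "," from the name class keeps a match from starting inside a tag or running across the separator between two actors.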
__________________
Always read the XBMC online-manual, FAQ and search the forum before posting. Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules. For troubleshooting and bug reporting please make sure you read this first. |
#3
Junior Member
Join Date: Aug 2008
Posts: 25
Thanks for the reply!
Quote:
Quote:
The moviemaze.de page is encoded with iso-8859-1, and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
#4
Grumpy Bastard Developer
Join Date: Nov 2003
Posts: 7,715
you set the encoding of the scraper xml file using exactly that kinda header as you just pasted
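To see why an encoding mismatch defeats a character class like [äöüß], here is a small sketch in Python (not XBMC; the function name is hypothetical). It decodes the same ISO-8859-1 bytes with two different encodings and checks whether an umlaut-aware pattern still matches.

```python
import re

# Bytes as they would arrive from a page served in ISO-8859-1.
RAW = "Matthias Schweighöfer".encode("iso-8859-1")

# The character class from the first post, including the umlauts.
NAME_RE = re.compile(r"[A-Za-z0-9äÄöÖüÜß ,.]+")


def matches_when_decoded_as(encoding):
    """Decode RAW with the given encoding and test the whole string."""
    text = RAW.decode(encoding, errors="replace")
    return NAME_RE.fullmatch(text) is not None


print(matches_when_decoded_as("iso-8859-1"))
print(matches_when_decoded_as("utf-8"))
```

With the correct encoding, the ö survives and the pattern matches the whole name. With the wrong one, the byte becomes a replacement character that no class in the pattern covers.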
#5
Member
Join Date: May 2008
Posts: 35
Quote:
For umlauts, as spiff already said, use "iso-8859-1". But if you're using wildcard matches, be sure to include "&", "#", ";" and digits, because umlauts appear as numeric character references in the source:
Heerf & # 2 5 2 ; hrer
Ma & # 2 2 3 ; arbeit
which are displayed as:
Heerführer
Maßarbeit
I had to space out the code to illustrate; the spaces aren't needed, of course.
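Both ways of coping with numeric character references can be shown side by side. This is a sketch in Python, assuming the page source contains the references without the illustrative spaces; words_raw and words_decoded are hypothetical names, and html.unescape is Python's standard-library decoder, not something the XBMC engine is known to offer.

```python
import html
import re

# Page source with umlauts written as numeric character references.
SOURCE = "Heerf&#252;hrer Ma&#223;arbeit"


def words_raw(text):
    """Match on the raw source by allowing '&', '#', ';' and digits."""
    return re.findall(r"[A-Za-z0-9&#;]+", text)


def words_decoded(text):
    """Decode the references first, then match real umlauts."""
    return re.findall(r"[A-Za-zäÄöÖüÜß]+", html.unescape(text))


print(words_raw(SOURCE))
print(words_decoded(SOURCE))
```

The first variant corresponds to the advice in the post: widen the character class so the references are captured as part of the word. The second is only possible where a decoding step is available before matching.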
__________________
XBMC Linux Ubuntu 8.04 - Antec Fusion Black Gigabyte MA78GM-S2H - AMD Athlon 64 X2 BE-2350 Corsair 2Go DDRII PC6400 - Samsung Spinpoint 500 Go Sony Bravia KDL-40W4000 - Logitech Harmony 555 |
#6
Junior Member
Join Date: Aug 2008
Posts: 25
#7
Member
Join Date: May 2008
Posts: 35
I did this quickly; you will have to adapt it to limit its application within the whole page, but it gives the idea.
Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))
#8
Junior Member
Join Date: Aug 2008
Posts: 25
Hi there, and thanks for your help.
I have only one problem left, with the actors part.
HTML Code:
<tr>
<td valign=top class="standard"> <span class="fett">Darsteller:</span> </td>
<td valign=top class="standard_justify" width=100%>
Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz
</td>
</tr>
Code:
<!--Actors-->
<RegExp input="$$2" output="<actor><name>\2</name><role>\5</role></actor>" dest="5+">
  <RegExp input="$$1" output="\2" dest="2">
    <expression trim="2" repeat="yes">Darsteller:([^%]*)%>(.*?)</tr</expression>
  </RegExp>
  <expression repeat="yes">(<a href\="[^>]*>)?(.*?)(</a>)?( \((.*?)\))?, </expression>
</RegExp>
But I think the XBMC scraper engine doesn't understand \t. Any idea how I can solve this?
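The cell above mixes actors with and without a role, and the last entry has no trailing ", ", so a pattern that requires that separator will skip it. A minimal sketch of one way around both, in Python rather than the XBMC engine: split on commas outside parentheses, strip the surrounding whitespace, and treat the role as optional. The tab layout of CELL and the names ENTRY_RE and parse_cell are assumptions for the example.

```python
import re

# Cell content adapted from the post; the leading and trailing tabs are an
# assumption about the page layout.
CELL = (
    "\n\t\tTil Schweiger (Ludo), Nora Tschirner (Anna), "
    "Matthias Schweighöfer (Moritz), Wolfgang Stumph, "
    "Barbara Rudnik (Lilli), Christian Tramitz\n\t"
)

# Name, followed by an optional role in parentheses.
ENTRY_RE = re.compile(r"^(.*?)(?:\s*\((.*?)\))?$")


def parse_cell(cell):
    """Return (name, role) pairs; role is None when the page gives none."""
    actors = []
    # Split on commas that are not inside a pair of parentheses.
    for entry in re.split(r",\s*(?![^()]*\))", cell.strip()):
        match = ENTRY_RE.match(entry.strip())
        if match is None:
            continue
        name = match.group(1).strip()
        if name:
            actors.append((name, match.group(2)))
    return actors


for name, role in parse_cell(CELL):
    print(name, "->", role)
```

Stripping the cell before splitting removes the tabs without needing \t in the pattern at all.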
#9
Grumpy Bastard Developer
Join Date: Nov 2003
Posts: 7,715
\\t
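The thread does not say why the backslash has to be doubled. One plausible reading, assumed here and not confirmed for XBMC, is that the pattern passes through a layer that consumes one level of backslash escaping before it reaches the regex library. The sketch below models that layer with a hypothetical one_unescape_pass function in Python.

```python
import re


def one_unescape_pass(written):
    """Hypothetical stand-in for a layer that turns each doubled backslash
    into a single one before the regex is compiled."""
    return written.replace("\\\\", "\\")


# What the author types in the scraper: backslash, backslash, t.
WRITTEN = r"\\t*([A-Za-z ]+)\\t*"

# After the assumed unescape pass the regex library sees \t, i.e. a tab.
COMPILED = re.compile(one_unescape_pass(WRITTEN))

SAMPLE = "\t\tTil Schweiger\t"
print(COMPILED.fullmatch(SAMPLE).group(1))

# Without that pass, \\t means a literal backslash followed by "t",
# so the same text does not match.
print(re.compile(WRITTEN).fullmatch(SAMPLE))
```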
#10
Junior Member
Join Date: Aug 2008
Posts: 25