XBMC Community Forum  

Old 2008-08-19, 12:44   #1
w00dst0ck
Junior Member
 
Join Date: Aug 2008
Posts: 25
moviemaze.de scraper development - help needed

Hi there,

I'm currently developing a scraper for http://www.moviemaze.de. I managed to scrape most of the provided information, but I am having trouble with the following issues:

Director:

Moviemaze.de HTML
HTML Code:
									<td valign=top class="standard">
											<span class="fett">Regie:</span>
										</td>

										<td valign=top class="standard_justify">
										Guillermo Del Toro										</td>
									</tr>
my regex:
Code:
<RegExp input="$$6" output="&lt;director&gt;\1&lt;/director&gt;" dest="5+">
	<RegExp input="$$1" output="\2" dest="6">
	<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>
	</RegExp>
	<expression>([A-Za-z0-9 ,.]+)</expression>
</RegExp>
scrap.exe delivers the right result, but XBMC displays "TRIM" as the result.
I decided to extract the value in two steps because it is surrounded by tabs.
Any idea why XBMC displays "TRIM"?


Actors:

Moviemaze.de HTML
HTML Code:
									<tr>
										<td valign=top class="standard">
											<span class="fett">Darsteller:</span>

										</td>
										<td valign=top class="standard_justify" width=100%>
										<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), <a href="/celebs/114/1.html">Heath Ledger</a> (Joker), <a href="/celebs/127/1.html">Michael Caine</a> (Alfred), Maggie Gyllenhaal (Rachel Dawes), Gary Oldman (Lt. James Gordon), Aaron Eckhart (Harvey Dent), <a href="/celebs/47/1.html">Morgan Freeman</a> (Lucius Fox), Monique Curnen (Detective Ramirez), Ron Dean (Detective Wuertz), Cillian Murphy (Dr. Jonathan Crane)										</td>

									</tr>
My regex:
Code:
<RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;\2&lt;/role&gt;&lt;/actor&gt;" dest="5+">
	<RegExp input="$$1" output="\2" dest="2">
		<expression repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
	</RegExp>
	<expression repeat="yes" trim="1">([A-Za-z \-]*)\(([A-Za-z \-]*)\)</expression>
</RegExp>
It works, but it misses the actors whose names are wrapped in an href link, and I have not managed to find a solution.


German Umlaut [äöüß]:

I can't match words that contain German umlauts [äöüß].
I've tried expressions like ([A-Za-z0-9äÄöÖüÜß ,.]+) and ([A-Za-z0-9&auml;&Auml;&ouml;&Ouml;&uuml;&Uuml; ,.]+) without success.


Can somebody please help me?

regards,
w00dst0ck
Old 2008-08-19, 13:06   #2
spiff
Grumpy Bastard Developer
 
Join Date: Nov 2003
Posts: 7,715

If that's a copy & paste:
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>
The trim attribute is missing its opening " (it should read trim="2").

Actors: try adding the href part with a ? after it, i.e. make it optional.

Umlauts: this might have to do with encoding. Make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding).
__________________
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.
Old 2008-08-19, 14:27   #3
w00dst0ck

Thanks for the reply!

Quote:
Originally Posted by spiff View Post
If that's a copy & paste:
Code:
<expression trim=2">Regie([^&quot;]*)&quot;standard_justify&quot;(.*?)&lt;</expression>
The trim attribute is missing its opening " (it should read trim="2").
I added the missing quotation mark, but it makes no difference. I also tried without the trim option, but then no information is returned at all.


Quote:
Originally Posted by spiff View Post
Umlauts: this might have to do with encoding. Make sure the scraper is set to the encoding you are saving the file in (and also that it matches moviemaze's encoding).
How do I set the encoding of the scraper?
The moviemaze.de page is encoded with iso-8859-1 and my results are generated with:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
Old 2008-08-19, 14:45   #4
spiff

You set the encoding of the scraper XML file with exactly the kind of header you just pasted.
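
Why the declared encoding has to match the bytes actually saved can be seen in a small Python sketch. Python is used purely for illustration here (XBMC's scraper engine is not Python), and the sample word is borrowed from elsewhere in this thread.

```python
# A word with an umlaut, saved as iso-8859-1 (the encoding moviemaze.de uses).
text = "Heerführer"
latin1_bytes = text.encode("iso-8859-1")

# Reading the bytes back with the same encoding round-trips cleanly.
round_trip = latin1_bytes.decode("iso-8859-1")

# Reading them as UTF-8 fails, because the single byte 0xFC is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_trip, utf8_ok)
```

The same mismatch is what a wrong or missing encoding header would cause for any umlaut typed directly into the scraper file.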
Old 2008-08-19, 14:56   #5
Gaarv
Member
 
Join Date: May 2008
Posts: 35

Quote:
Originally Posted by w00dst0ck View Post
I added the missing quotation mark, but it makes no difference. I also tried without the trim option, but then no information is returned at all.
It seems you are also missing a ">" after "standard_justify". It is not required for the expression to match, but it will then be returned in \2, and that is a problem.
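
The effect of the missing ">" can be reproduced outside XBMC. The following is only a sketch in Python: the HTML string is an abbreviated copy of the director row quoted in the first post, and re.DOTALL is an assumption made so that "." also crosses line breaks in Python.

```python
import re

# Abbreviated copy of the moviemaze.de director row from the first post.
HTML = (
    '<span class="fett">Regie:</span>\n'
    '\t</td>\n\n'
    '\t<td valign=top class="standard_justify">\n'
    '\tGuillermo Del Toro\t\t</td>\n'
)

# Original pattern: nothing consumes the ">" that closes the <td ...> tag,
# so it ends up at the start of the second capture group.
without_gt = re.search(r'Regie([^"]*)"standard_justify"(.*?)<', HTML, re.DOTALL)

# Corrected pattern: the ">" is matched outside the capture group.
with_gt = re.search(r'Regie([^"]*)"standard_justify">(.*?)<', HTML, re.DOTALL)

print(repr(without_gt.group(2)))
print(repr(with_gt.group(2)))

# Only whitespace is left around the name, which is what trimming removes.
director = with_gt.group(2).strip()
print(director)
```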

For the umlauts, use "iso-8859-1" as spiff already said, but if you match with character classes, be sure to include "&", "#", ";" and digits, because umlauts appear like this in the page source:

Heerf & # 2 5 2 ; hrer
Ma & # 2 2 3 ; arbeit

Heerführer
Maßarbeit

I had to space out the character references so the forum would not render them; the spaces are of course not in the real source.
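
A short Python sketch of the two ways to deal with such numeric character references. It is an illustration only: the widened character class is an assumption modelled on the expressions posted earlier in the thread, and html.unescape is Python's decoder, not something the XBMC scraper engine provides.

```python
import html
import re

# The two words as they appear in the page source (without the extra spaces).
raw_words = ["Heerf&#252;hrer", "Ma&#223;arbeit"]

# Option 1: widen the character class so it also accepts "&", "#" and ";".
WORD = re.compile(r"[A-Za-z0-9&#; ,.]+")
matched = [WORD.fullmatch(word) is not None for word in raw_words]

# Option 2: decode the numeric character references after matching.
decoded = [html.unescape(word) for word in raw_words]

print(matched)
print(decoded)
```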
__________________

XBMC Linux Ubuntu 8.04 - Antec Fusion Black
Gigabyte MA78GM-S2H - AMD Athlon 64 X2 BE-2350
Corsair 2Go DDRII PC6400 - Samsung Spinpoint 500 Go
Sony Bravia KDL-40W4000 - Logitech Harmony 555
Old 2008-08-19, 16:32   #6
w00dst0ck

Quote:
Originally Posted by Gaarv View Post
It seems you are also missing a ">" after "standard_justify". It is not required for the expression to match, but it will then be returned in \2, and that is a problem.
Thanks, that solved the problem.

The umlaut problem is solved by using (.*) instead of an explicit character class.

Next I will try to solve my href problem...
Old 2008-08-19, 17:08   #7
Gaarv

I put this together quickly; you will have to adapt it or limit where on the page it applies, but it gives the idea:

Code:
[^=",]*([A-Za-z0-9 ./]*\([A-Za-z0-9 ./]*\))
In the example you gave, this matches all actor names and roles, sometimes with a </a> that needs to be cleaned up.
Old 2008-08-20, 09:51   #8
w00dst0ck

Hi there, and thanks for your help.
I have only one problem left, in the actors part.
HTML Code:
									<tr>
										<td valign=top class="standard">
											<span class="fett">Darsteller:</span>

										</td>
										<td valign=top class="standard_justify" width=100%>
										Til Schweiger (Ludo), Nora Tschirner (Anna), Matthias Schweighöfer (Moritz), Alwara Höfels (Miriam), Jürgen Vogel (Jürgen Vogel), Rick Kavanian (Chefredakteur), Armin Rohde (Bello), Wolfgang Stumph, Barbara Rudnik (Lilli), Christian Tramitz										</td>
									</tr>

Code:
<!--Actors-->	
	<RegExp input="$$2" output="&lt;actor&gt;&lt;name&gt;\2&lt;/name&gt;&lt;role&gt;\5&lt;/role&gt;&lt;/actor&gt;" dest="5+">
		<RegExp input="$$1" output="\2" dest="2">
			<expression trim="2" repeat="yes">Darsteller:([^%]*)%&gt;(.*?)&lt;/tr</expression>
		</RegExp>
		<expression repeat="yes">(&lt;a href\="[^&gt;]*&gt;)?(.*?)(&lt;/a&gt;)?( \((.*?)\))?, </expression>
	</RegExp>
The 10 tabs in front of the first actor name end up in \2. I tried to strip them with [\t]{10}?, but I think the XBMC scraper engine doesn't understand \t.
Any idea how I can solve this?
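
For comparison, the same idea (optional link, optional role, surrounding tabs skipped) can be tried out in Python before porting it to the scraper XML. This is a sketch under assumptions: the pattern and the helper name parse_actors are hypothetical, the sample cell is shortened from the two examples in this thread, and XBMC's regex engine may treat escapes such as \s differently.

```python
import re

# One cast entry: optional link, name, optional "(role)", then "," or end of text.
ACTOR = re.compile(
    r'\s*(?:<a href="[^"]*">)?'  # leading tabs/spaces and an optional opening link
    r'([^<(,]+?)'                # actor name (any characters, so umlauts are fine)
    r'(?:</a>)?'                 # optional closing link
    r'(?:\s*\(([^)]*)\))?'       # optional role in parentheses
    r'\s*(?:,|$)'                # separator, or trailing whitespace at the end
)


def parse_actors(cell):
    """Return (name, role) pairs; role is "" when none is listed."""
    return [(m.group(1), m.group(2) or "") for m in ACTOR.finditer(cell)]


# Shortened cell content, mixing entries from both examples in the thread.
sample = (
    '\t\t<a href="/celebs/89/1.html">Christian Bale</a> (Bruce Wayne/ Batman), '
    'Matthias Schweighöfer (Moritz), Wolfgang Stumph, Christian Tramitz\t\t'
)

for name, role in parse_actors(sample):
    print(name, "|", role)
```

Ending each entry on either a comma or the end of the text also picks up the last actor, which a pattern that insists on a trailing ", " would skip.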
Old 2008-08-20, 12:47   #9
spiff

\\t
Old 2008-08-20, 15:23   #10
w00dst0ck

Found another solution.

Will submit the working scraper at http://xbmc.org/trac/ticket/4563
All times are GMT +2.