XBMC Community Forum  

2008-08-28, 23:17   #1
pko66
Bug in scrap.exe

I think I've found a bug in scrap.exe. It could be in XBMC's parsing of scrapers, but I think it is in scrap.exe: the two behave differently when dealing with "cleaning" expressions (that is, when you do not specify noclean="1"). It was driving me crazy...

I was trying to extract <genre> for the example scraper in "scraper for dummies". The interesting part of $$1 is:
Code:
...<font class = 'titulo3'>Género:</font><br>Terror / Thriller<br><br><font class = 'titulo3'>Nacionalidad:</font>...
The idea is this. First, regexp1:
Code:
<RegExp input="$$1" output="\1" dest="9">
	<expression noclean="1">G.nero:(.[^:]*)Nacionalidad:</expression>
</RegExp>
should store this in $$9:
Code:
</font><br>Terror / Thriller<br><br><font class = 'titulo3'>
Then regexp2:
Code:
<RegExp input="$$9" output="\1/" dest="7">
	<expression noclean="1">&gt;(.[^&lt;&gt;]*)&lt;</expression>
</RegExp>
should cut out the innermost part and add "/" at the end, storing this in $$7:
Code:
Terror / Thriller/
And finally regexp3:
Code:
<RegExp input="$$7" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
	<expression repeat="yes" trim="1">([^/]*)/</expression>
</RegExp>
appends this to $$8:
Code:
<genre>Terror</genre><genre>Thriller</genre>
This works in both scrap.exe and XBMC.
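For anyone who wants to check the three steps outside of XBMC, here is a rough Python simulation of the chain. It is only a sketch: Python's re module stands in for the scraper's regexp engine, and the names page, buf9, buf7 and buf8 are made up here to mirror $$1, $$9, $$7 and $$8.

```python
import re

# The interesting part of $$1, as quoted in the post.
page = ("<font class = 'titulo3'>Género:</font><br>Terror / Thriller<br><br>"
        "<font class = 'titulo3'>Nacionalidad:</font>")

# regexp1 with noclean="1": the raw capture goes to buffer 9.
buf9 = re.search(r"G.nero:(.[^:]*)Nacionalidad:", page).group(1)

# regexp2 with noclean="1": first run of text between '>' and '<', plus "/".
buf7 = re.search(r">(.[^<>]*)<", buf9).group(1) + "/"

# regexp3 with repeat="yes" and trim="1": one <genre> per "/"-terminated piece.
pieces = [piece.strip() for piece in re.findall(r"([^/]*)/", buf7)]
buf8 = "".join(f"<genre>{piece}</genre>" for piece in pieces)

print(buf9)
print(buf7)
print(buf8)
```

The expressions are written unescaped here (> and < instead of &gt; and &lt;) because they are not embedded in XML.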

In my first attempt, though, I forgot to add noclean="1" to both regexp1 and regexp2. That should not work: with cleaning on, regexp1 stores just "Terror / Thriller" in $$9, so the expression in regexp2 does not match anything, and since $$7 is not cleared, regexp3 uses its previous content and generates some random <genre> or nothing. That is what happens in XBMC, but in scrap.exe it actually worked and gave me correct results!
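The failure without noclean="1" can be sketched in Python as well. Two things here are assumptions, not the real implementation: clean() is a simple tag-stripping stand-in for whatever cleaning XBMC actually does, and the "Drama/" value is a hypothetical leftover in buffer 7 from an earlier run.

```python
import re

def clean(text):
    # Assumed stand-in for the scraper's cleaning: drop anything tag-like.
    return re.sub(r"<[^>]*>", "", text).strip()

page = ("<font class = 'titulo3'>Género:</font><br>Terror / Thriller<br><br>"
        "<font class = 'titulo3'>Nacionalidad:</font>")

# Hypothetical stale content left in buffer 7 by a previous run.
buffers = {7: "Drama/"}

# regexp1 without noclean: the capture is cleaned before being stored.
m1 = re.search(r"G.nero:(.[^:]*)Nacionalidad:", page)
buffers[9] = clean(m1.group(1))

# regexp2: no '>' or '<' is left in buffer 9, so there is no match
# and buffer 7 keeps whatever it held before.
m2 = re.search(r">(.[^<>]*)<", buffers[9])
if m2 is not None:
    buffers[7] = clean(m2.group(1)) + "/"

# regexp3 then works on the stale buffer.
genres = [piece.strip() for piece in re.findall(r"([^/]*)/", buffers[7])]
print(buffers[9], m2, genres)
```

In this sketch the genre comes from the old buffer content, which matches the "random <genre>" behaviour seen in XBMC.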

It then occurred to me that, since cleaning should strip all HTML content, leaving out noclean="1" in regexp1 should directly return "Terror / Thriller", so this shorter version (which drops regexp2) should do the same:

Code:
<RegExp input="$$7" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
	<RegExp input="$$1" output="\1/" dest="7">
		<expression>G.nero:(.[^:]*)Nacionalidad:</expression>
	</RegExp>
	<expression repeat="yes" trim="1">([^/]*)/</expression>
</RegExp>
In XBMC this works like a charm, but scrap.exe returns this:
Code:
<genre><</genre><genre>font><br>Terror</genre><genre>Thriller<br><br><font class = 'titulo3'></genre>
which shows that scrap.exe is not cleaning \1 in the inner regexp.
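Both results of the shorter version can be reproduced with a small Python sketch. The clean() helper and the clean_inner switch are made up for illustration: clean() is an assumed tag-stripping stand-in for the real cleaning, and the switch just toggles whether the inner capture gets cleaned. Cleaning of the outer pieces is left out to keep it short; only trim="1" is modelled there.

```python
import re

def clean(text):
    # Assumed stand-in for the scraper's cleaning: drop anything tag-like.
    return re.sub(r"<[^>]*>", "", text).strip()

def short_version(page, clean_inner):
    # Inner regexp: capture between "Género:" and "Nacionalidad:", add "/".
    inner = re.search(r"G.nero:(.[^:]*)Nacionalidad:", page).group(1)
    if clean_inner:
        inner = clean(inner)
    buf7 = inner + "/"
    # Outer regexp with repeat="yes" and trim="1".
    pieces = [piece.strip() for piece in re.findall(r"([^/]*)/", buf7)]
    return "".join(f"<genre>{piece}</genre>" for piece in pieces)

page = ("<font class = 'titulo3'>Género:</font><br>Terror / Thriller<br><br>"
        "<font class = 'titulo3'>Nacionalidad:</font>")

xbmc_like = short_version(page, clean_inner=True)
scrap_like = short_version(page, clean_inner=False)
print(xbmc_like)
print(scrap_like)
```

With the inner capture left uncleaned, the "/" inside "</font>" becomes a split point, which is where the stray "<" genre in the scrap.exe output comes from.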

In "scraper for dummies" I'm using the first, longer version with regexp 1, 2 and 3 because it works in both cases, so it is less likely to confuse people who try it by hand.
2008-08-29, 01:30   #2
jmarshall
Team-XBMC Developer
 

Unfortunately, scrap.exe is out of date, and no longer maintained. The original author lost the sources to his updated build.

This means that the only way to test it reliably at this point is directly from XBMC.
2008-08-29, 12:17   #3
pko66

Ok, I will include that info in the wiki for everyone to know.

scrap.exe is still useful enough for some quick tests. I haven't found any other bugs apart from that noclean issue.
All times are GMT +2.

