Can XBMC or scrapers pre-process file names before performing lookups?


VTurn
2007-10-14, 23:12
Hi,

XBMC has trouble finding my films on IMDb or with other scrapers because it gets confused by the file naming convention I am using.

Let me explain. I name my files like this:
Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre) - Jean-Pierre Jeunet (2001).avi
That is:

original name
English name
Director
Year
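
To make the convention concrete, here is a minimal Python sketch that splits such a file name into its four parts. The pattern and the name split_name are hypothetical illustrations, not anything XBMC provides; the English title is treated as optional.

```python
import re

# Hypothetical pattern for "Original (English) - Director (Year).ext".
# The "(English)" part is optional, for films whose original title is English.
NAME_RE = re.compile(
    r"^(?P<original>.+?)"           # original title (as short as possible)
    r"(?: \((?P<english>[^)]*)\))?"  # optional English title in brackets
    r" - (?P<director>.+?)"         # director, after " - "
    r" \((?P<year>\d{4})\)"         # four-digit year in brackets
    r"\.\w+$"                       # file extension
)

def split_name(filename):
    """Return a dict of the named parts, or None if the name does not fit."""
    m = NAME_RE.match(filename)
    return m.groupdict() if m else None

parts = split_name(
    "Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre)"
    " - Jean-Pierre Jeunet (2001).avi"
)
print(parts)
```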


The thing is... when I send this info to a scraper, it gets confused. Is there a way I can "help" the scraper by telling it which piece of information is which?

Couldn't find anything in the Wiki.

V.

VTurn
2007-10-14, 23:26
I mean... is there any solution other than playing with the scraper XML file (http://www.xboxmediacenter.com/wiki/index.php?title=How_To_Write_Media_Info_Scrapers)?

V.

VTurn
2007-10-15, 01:14
OK, I think I'm lost. I tried to modify the scraper as follows:


Original:
<CreateSearchUrl dest="3">
<RegExp input="$$1" output="http://www.allocine.fr/recherche/?motcle=\1/" dest="3">
<expression></expression>
</RegExp>
</CreateSearchUrl>

Modified:
<CreateSearchUrl dest="3">
<RegExp input="$$1" output="http://www.allocine.fr/recherche/?motcle=\1/" dest="3">
<expression trim="1" noclean="1">[^\(-]*</expression>
</RegExp>
</CreateSearchUrl>

But that does not seem to help. My guess was that the scraper engine would take my file name "title (original title) - director (year).avi", apply the above regex to clean it up into "title ", and then send that to the search engine.
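
For what it is worth, the expression can be tried outside XBMC. The Python sketch below applies it to the raw file name; it assumes Python's re module is close enough to the scraper's regex engine for this purpose, which is not confirmed. One thing it shows is that the pattern contains no capturing group, so there is nothing for \1 to refer to, at least in Python.

```python
import re

name = ("Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre)"
        " - Jean-Pierre Jeunet (2001).avi")

# Same expression as in the modified scraper: any run of characters
# that are neither "(" nor "-".
m = re.match(r"[^\(-]*", name)

whole_match = m.group(0)    # the text the pattern consumed
group_count = m.re.groups   # number of capturing groups in the pattern

print(repr(whole_match))
print(group_count)
```

Asking Python for m.group(1) here raises IndexError, because the pattern defines no group 1.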

It does not seem to work, though... :blush:

Any ideas?

V.

SleepyP
2007-10-15, 03:38
isn't this what the RegEx in Advanced Settings is for, or does that only apply to TV shows?

VTurn
2007-10-15, 09:51
isn't this what the RegEx in Advanced Settings is for, or does that only apply to TV shows?
Well, given the name of the property in AdvancedSettings.xml (http://www.xboxmediacenter.com/wiki/index.php?title=AdvancedSettings.xml), I doubt it...
<tvshowmatching>

Contains regular expression to match the season and episode numbers in filenames.

V.

VTurn
2007-10-15, 09:59
OK, I've spent some time learning more about RegEx (never thought I'd need to someday), but now I am really confused as to why the expression below does not work, because it should do exactly what I want:
<CreateSearchUrl dest="3">
<RegExp input="$$1" output="http://www.allocine.fr/recherche/?motcle=\1" dest="3">
<expression repeat="no" trim="1" noclean="1" clear="no">([^(]*)</expression>
</RegExp>
</CreateSearchUrl>

([^(]*) should match all the first characters until it finds the first '(', then return this group in the URL, then build the search string.
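
Here is a small Python sketch of that expectation. It assumes, as this post does, that $$1 holds the raw file name and that \1 in the output attribute is replaced by the first captured group; Python's Match.expand stands in for the scraper's substitution step and is not XBMC code.

```python
import re

# Assumption: $$1 contains the raw, unmodified file name.
raw_input = ("Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre)"
             " - Jean-Pierre Jeunet (2001).avi")

# The expression from the scraper: capture everything up to the first "(".
m = re.match(r"([^(]*)", raw_input)

# Substitute group 1 into the output template, like output="...\1".
search_url = m.expand(r"http://www.allocine.fr/recherche/?motcle=\1")
print(search_url)
```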

Any ideas, anybody?

V.

spiff
2007-10-15, 10:34
search query input is url encoded

VTurn
2007-10-15, 14:48
search query input is url encoded
Thank you for your reply, but I am not sure I understand what you mean.
Do you mean I should replace ([^(]*) with (%5B%5E(%5D*)?

According to the Scraper.xml (http://www.xboxmediacenter.com/wiki/index.php?title=Scraper.xml) page, none of the characters above need encoding. Also, the reference imdb.xml (http://xbmc.svn.sourceforge.net/viewvc/xbmc/trunk/XBMC/system/scrapers/video/imdb.xml?view=markup) file from SVN does contain un-encoded regexps (but those are for other functions, not CreateSearchUrl).

I'll try this tonight, but if it works, I'm not sure I'll understand why... :sad:

V.

VTurn
2007-10-15, 23:52
Unfortunately, no luck :sad:

Does anybody have any idea?

V.

spiff
2007-10-15, 23:52
INPUT, i.e. the contents of $$1

VTurn
2007-10-16, 09:53
INPUT, i.e. the contents of $$1
Thank you for taking time to reply, spiff.

Even if $$1 is URL encoded, the parentheses still remain, so the file name "Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre) - Jean-Pierre Jeunet (2001).avi" becomes Le%20Fabuleux%20destin%20d'Am%E9lie%20Poulain%20(Amelie%20of%20Montmartre)%20-%20Jean-Pierre%20Jeunet%20(2001).avi

If I apply the ([^(]*) regex to this string, I still get Le%20Fabuleux%20destin%20d'Am%E9lie%20Poulain%20, which should really return the proper movie in the search (this would be the resulting URL: http://www.allocine.fr/recherche/?motcle=Le%20Fabuleux%20destin%20d'Am%E9lie%20Poulain%20)
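
The reasoning above can be reproduced with a short Python sketch. The encoding choices are assumptions made only to match the string quoted in this post (Latin-1 bytes, with the apostrophe and round brackets left unescaped); they are not taken from XBMC's code.

```python
import re
from urllib.parse import quote

name = ("Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre)"
        " - Jean-Pierre Jeunet (2001).avi")

# Assumed encoding: Latin-1 (so "é" becomes %E9), with "'", "(" and ")"
# left as-is, which reproduces the encoded string quoted in the post.
encoded = quote(name, safe="'()", encoding="latin-1")

# Apply the scraper expression to the encoded input.
title_part = re.match(r"([^(]*)", encoded).group(1)

print(encoded)
print(title_part)
```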

By the way, is there any way to run the scraper engine in debug mode so that I can understand what it does and log the different values?

Again, thank you for your patience.

V.

spiff
2007-10-16, 10:14
there is a scraper development environment in tools/scrap

however it's broken (there's a binary which works) and we have been shouting at the author for months but he just does not want to respond :/

otherwise, looking at the source itself is your best bet. this stuff would be taking place in CIMDB::GetURL()

VTurn
2007-10-19, 02:05
Issue resolved!

After I looked at the source based on your suggestion, I noticed that the content of $$1 is actually pre-processed a lot inside the code before it even reaches the scraper.

So I started XBMC in debug mode, and found out that the query URL is actually printed, so I finally got it.

In fact, the file name "Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre) - Jean-Pierre Jeunet (2001).avi" is actually transformed to "Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre)   Jean-Pierre Jeunet". Note that the <space><minus sign><space> in the middle becomes <space><space><space>, so once it is escaped, it becomes %20%20%20.
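
A rough Python imitation of the transformation I observed is below. It only mimics the visible result (extension dropped, trailing year dropped, " - " turned into three spaces); simulate_cleanup is a hypothetical name, and the real logic inside XBMC may well work differently.

```python
import re

def simulate_cleanup(filename):
    """Mimic the observed pre-processing; NOT XBMC's actual code."""
    stem = re.sub(r"\.[A-Za-z0-9]{2,4}$", "", filename)  # drop the extension
    stem = re.sub(r"\s*\(\d{4}\)\s*$", "", stem)         # drop a trailing "(year)"
    return stem.replace(" - ", "   ")                    # " - " -> three spaces

cleaned = simulate_cleanup(
    "Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre)"
    " - Jean-Pierre Jeunet (2001).avi"
)
print(repr(cleaned))
```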

Finally, I managed to get what I want with the following regular expression :
<expression>([^(]+)(%20%20%20|%20%28)</expression>

The %20%28 is there to strip out everything after the first round bracket. In case the movie title is already in English (like "The Rock - Michael Bay (1996).avi"), there is no such bracket, so I use the %20%20%20 to get rid of the director's name.
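
For anyone who wants to experiment with this expression outside XBMC, here is a Python sketch. The inputs are the pre-processed names described above, encoded with urllib.parse.quote as Latin-1; both the encoding and the use of Python's re module are assumptions, and XBMC's engine may behave differently. In Python, [^(]+ is greedy, so group 1 runs up to the last delimiter it can find. A non-greedy variant ([^(]+?), which is my own addition and not part of the scraper, stops at the first delimiter.

```python
import re
from urllib.parse import quote, unquote

GREEDY = re.compile(r"([^(]+)(%20%20%20|%20%28)")   # expression from the post
LAZY = re.compile(r"([^(]+?)(%20%20%20|%20%28)")    # hypothetical variant

def extract(pattern, cleaned_name):
    """Encode the cleaned name, apply the pattern, return decoded group 1."""
    encoded = quote(cleaned_name, encoding="latin-1")  # assumed encoding
    m = pattern.search(encoded)
    return unquote(m.group(1), encoding="latin-1") if m else None

amelie = ("Le Fabuleux destin d'Amélie Poulain (Amelie of Montmartre)"
          "   Jean-Pierre Jeunet")
rock = "The Rock   Michael Bay"

for label, pattern in (("greedy", GREEDY), ("lazy", LAZY)):
    print(label, repr(extract(pattern, amelie)), repr(extract(pattern, rock)))
```

In both variants the director's name is dropped; they differ only in whether the bracketed English title stays in the query under Python's matching rules.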

V.