OFDB scraper


morte0815
2007-02-12, 11:43
Hi,

This is my first "release" of the ofdb (German counterpart of imdb) scraper.

The main features work, but there are some problems:

- Umlauts are not readable (site encoding?)

- I parse the genres into individual tags rather than a single genre tag. Maybe the scraper parser code could be changed to accept this (in databases you should not store lists in attributes :) ). If this will not be changed, then I will change the parser...

- atm no original title, since there is no tag for it
- mpaa is only fetched if the movie was shown in cinemas... A possible addition here: some way to check whether a regex failed. Example:


<RegExp name="theregextocheck" .../>
<RegExp name="aName" condition="theregextocheck"> <!-- this one will only be called if the condition matches -->
cheers morte

spiff
2007-02-12, 12:29
i have no idea how ofdb expects its urls to be encoded. it certainly does not use normal url encoding (umlauts SHOULD be percent-encoded, but it does not accept that).
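
To make the encoding question concrete, here is a small Python sketch (not scraper XML) showing how the same umlaut percent-encodes differently under two character sets. The helper name `encode_query` is made up for illustration, and nothing here claims to know which form ofdb actually expects.

```python
from urllib.parse import quote

def encode_query(title: str, charset: str) -> str:
    """Percent-encode a search title using the given character set.

    Hypothetical helper: it only illustrates that the bytes behind an
    umlaut depend on the charset chosen before percent-encoding.
    """
    return quote(title, encoding=charset)

# The same title yields two different query strings.
utf8_form = encode_query("Das Mädchen", "utf-8")         # "ä" becomes two bytes
latin1_form = encode_query("Das Mädchen", "iso-8859-1")  # "ä" becomes one byte
print(utf8_form)
print(latin1_form)
```

If a site rejects one form, trying the other is a cheap experiment before looking for something more exotic.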

several <genre> tags, sure. i don't see why this is easier to do than the current / separated list though. as for storing them like that in the db, it's to speed up the queries; constructing the / separated string each time takes time....

just add the original title as a tag, we will get it in there eventually.

as for those conditions, they are easily simulated using two regexps + clear="#ofbuffer". let one regexp grab the conditional block, clearing the buffer no matter what. then the next expression will either have nothing to look through (and hence fail), or you did grab something during the first expression so they will succeed....
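
The two-step trick described above can be sketched in Python to show the control flow. This is not scraper XML: the function name `grab_conditional`, the patterns, and the sample markup are all assumptions for illustration only.

```python
import re

def grab_conditional(page: str, block_pattern: str, inner_pattern: str) -> str:
    """Simulate a condition with two expressions and a cleared buffer."""
    # Step 1: try to grab the conditional block. The buffer is reset no
    # matter what, so it is empty when the block is missing.
    block = re.search(block_pattern, page, re.S)
    buffer = block.group(1) if block else ""

    # Step 2: the second expression only ever sees the buffer. With an
    # empty buffer it has nothing to look through and simply fails.
    inner = re.search(inner_pattern, buffer)
    return inner.group(1) if inner else ""

# Hypothetical markup: the rating should only count inside the cinema block.
in_cinema = "<div class='kino'>Freigabe: FSK 16</div>"
video_only = "<div class='video'>Freigabe: FSK 18</div>"
print(grab_conditional(in_cinema, r"<div class='kino'>(.*?)</div>", r"FSK (\d+)"))
print(grab_conditional(video_only, r"<div class='kino'>(.*?)</div>", r"FSK (\d+)"))
```

The second call returns an empty string even though the rating pattern would match the raw page, which is exactly the "condition" behaviour asked for.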

morte0815
2007-02-12, 12:55
encoding the urls is one thing... but the other thing is that you can't read the results. in the list where you choose the movie, the umlauts work like a charm, but e.g. the plot is not readable if there are umlauts in it.

the thing with the genre comes from a db design pattern... it is normal not to store lists in an attribute. it should also make the genre queries easier, since you do not have to split the genre result but only iterate through all genre tags to find the movies with the needed genre...

the original title will be added, and the condition thing will be tested :)

thanks for the response

morte

sCAPe
2007-02-12, 13:43
spiff: Any chance of integrating this ofdb scraper into SVN when it's stable and finished? Or should users download this scraper themselves?

Don't know how it is planned for all future scrapers?

sCAPe

spiff
2007-02-12, 13:45
yes ofc we will stick it in svn.

everything deemed to be of high enough quality will be stuck in svn. the more the merrier

morte0815
2007-02-12, 15:48
So here is the version with the original title (<originaltitle>)!

It also reorders the title: "Matrix, the" => "The Matrix"
If someone does not want this behaviour, tell me and I'll write another version..
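
For readers who want to see the reordering rule spelled out, here is a minimal Python sketch. It is not the scraper's actual regular expression; the function name `reorder_title` and the article list are assumptions.

```python
import re

# Assumption: only these trailing articles are moved to the front.
ARTICLES = ("the", "der", "die", "das")

def reorder_title(title: str) -> str:
    """Turn 'Matrix, the' into 'The Matrix'; leave other titles alone."""
    title = title.strip()
    # Capture everything before the last comma and the single word after it.
    match = re.match(r"^(.*),\s*(\w+)$", title)
    if match and match.group(2).lower() in ARTICLES:
        return f"{match.group(2).capitalize()} {match.group(1)}"
    return title

print(reorder_title("Matrix, the"))
print(reorder_title("Haus am See, Das"))
```

Checking the trailing word against a list keeps titles that merely contain a comma from being rearranged.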

CYA Morte

asg
2007-02-12, 15:54
<RegExp input="$$1" output="&lt;horst&gt;\1&lt;/horst&gt;" dest="8">.. horst :D

Thanks
asciii

morte0815
2007-02-12, 15:56
well, that's some debugging tag... forgot to delete it.. but as long as it works... :)

:rolleyes:AND NO: MY NAME IS NOT HORST! :grin:

cheers morte

Solo0815
2007-02-12, 17:31
THX!
I'll try it later this evening and report back!

BTW: Horst is a nice name for a debugging tag ;) Are there Karl and Hans in the script, too *LOL* jk

Solo0815
2007-02-12, 18:06
encoding the urls is one thing... but the other thing is that you can't read the results. in the list where you choose the movie, the umlauts work like a charm, but e.g. the plot is not readable if there are umlauts in it.
That's the only "bug" I noticed so far. Nice scraper!

spiff
2007-02-12, 20:31
Revision: 7816
http://svn.sourceforge.net/xbmc/?rev=7816&view=rev
Author: spiff_
Date: 2007-02-12 10:06:31 -0800 (Mon, 12 Feb 2007)

Log Message:
-----------
fixed: various encoding related stuff in scraper code. should fix ofdb.

Revision: 7817
http://svn.sourceforge.net/xbmc/?rev=7817&view=rev
Author: spiff_
Date: 2007-02-12 10:22:03 -0800 (Mon, 12 Feb 2007)

Log Message:
-----------
added: original title column to video database (still not accessible from ui).

Revision: 7819
http://svn.sourceforge.net/xbmc/?rev=7819&view=rev
Author: spiff_
Date: 2007-02-12 12:26:44 -0800 (Mon, 12 Feb 2007)

Log Message:
-----------
changed: allow multiple <genre>, <director> and <credits> tags in scraper xml output.

now, what did you have in mind? just an info label or something more substantial?

also; please confirm my fixes, then feel free to hand me a dehorstified version and i'll add it to svn.

suggested change:

<!--Genre-->
<RegExp input="$$1" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="5+">
<expression repeat="yes">view.php\?page=genre&amp;Genre=[^&quot;]+&quot;&gt;([^&lt;]*)&lt;</expression>
</RegExp>
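
What the repeated expression above does can be sketched in Python: every genre link on the page yields its own <genre> tag instead of one joined string. The pattern is the suggested one with the XML entities decoded (and the dot in `view.php` escaped); the function name `genre_tags` and the sample HTML are assumptions.

```python
import re

# The suggested expression with &amp;, &quot;, &lt; and &gt; decoded.
GENRE_LINK = re.compile(r'view\.php\?page=genre&Genre=[^"]+">([^<]*)<')

def genre_tags(html: str) -> str:
    """Emit one <genre> element per match, like repeat="yes" with dest="5+"."""
    return "".join(f"<genre>{name}</genre>" for name in GENRE_LINK.findall(html))

# Hypothetical page fragment with two genre links.
sample = (
    '<a href="view.php?page=genre&Genre=Action">Action</a> / '
    '<a href="view.php?page=genre&Genre=Thriller">Thriller</a>'
)
print(genre_tags(sample))
```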

morte0815
2007-02-12, 22:44
Hi,

an info label should be adequate for the original title.

THANKS for all the fixes!!
(I'll test it asap)

dehorstified version :laugh::laugh::laugh:

the version without horst is attached :)

cheers morte

morte0815
2007-02-12, 23:10
Now the scraper seems to work!

StompSC
2007-02-13, 01:14
Hi,

@morte0815

Great Work!!!
A few hours ago I noticed that spiff made some changes to get ofdb working, and I thought I'd make a second attempt at getting my own scraper to work, but a second SVN checkout showed me that you've already done it!

And it works fine... with some problems for me... ???

The plot (parsed from Inhaltsangabe.htm) does not work for all movies.

It works for only 1 out of 9 movies from 2007
and 6 out of 16 from 2006.

Movies that don't work, for example (2006):
"Catch a Fire", "Haus am See", "Departed", "Das Parfum", "Dead or Alive",
"Spiel der Macht"
The plot HTML page for these movies seems to be a bit different, but it's too late for me now to take a deeper look...

@morte 0815 & spiff

There are sometimes problems with the minus character '-'.
Some movies can only be found by removing the '-' from the search string, even though it is part of the title on ofdb....
Example: "Departed - Unter Feinden" will only be found if I remove the '-'. Then "Departed - Unter Feinden (2006)..." is found. In the browser search it works with the '-'. Could this be a problem with different but identical-looking characters?
Perhaps it is possible to remove these characters from the search string; this should fix the problem...
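
One way to act on that suggestion, sketched in Python rather than scraper XML: map look-alike dash characters to a plain hyphen, then drop free-standing hyphens from the search string. The function name and the list of dash characters are assumptions, and whether a look-alike character is really the cause is only a guess.

```python
# Assumption: these Unicode characters may be mistaken for a plain '-'
# (hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign).
DASH_LIKE = "\u2010\u2011\u2012\u2013\u2014\u2212"

def normalize_search(title: str) -> str:
    """Unify dash variants and remove hyphens that stand alone as a word."""
    table = str.maketrans({ch: "-" for ch in DASH_LIKE})
    unified = title.translate(table)
    # Keep hyphens inside words (e.g. "Spider-Man"), drop separators.
    words = [word for word in unified.split() if word != "-"]
    return " ".join(words)

print(normalize_search("Departed - Unter Feinden"))
print(normalize_search("Departed \u2013 Unter Feinden"))
```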

There are problems with articles like "Der" and "Das" too.
Example: "Das Haus am See" will only be found if I manually remove the "Das". Then the movie is found: "Haus am See, Das / Lake House, The (2006)". This doesn't work in the browser either, so I think that's a problem with ofdb...

BTW:
Will a moviename.nfo file in the movie folder be taken into account for the URL search?

Disclaimer:
This is no criticism. I'm only trying to help turn this great scraper into a fantastic one... :laugh:

Thanks,
StompSC

morte0815
2007-02-13, 13:17
:grin:1. thanks for your help. :grin:

2. the problems:
I simply didn't test movies where a line break occurred in the plot. So here is an updated version.

The thing with "der, die, das, the": I know of the problem, but I do not have any idea how to fix this... I tried simply cutting off those words if they occur at the beginning, but then it didn't find "Die Hard". So if you have an idea, please let me know.
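
One possible idea, sketched in Python and purely as an assumption (it is not something the scraper does): search with the full title first and only fall back to the title without its leading article if nothing is found. That way "Die Hard" is still tried unchanged. The function name `search_candidates` is made up.

```python
# Assumption: the same four articles mentioned in the post.
LEADING_ARTICLES = ("der", "die", "das", "the")

def search_candidates(title: str) -> list[str]:
    """Return search strings in the order they should be tried."""
    candidates = [title]  # the unmodified title always goes first
    first_word, _, rest = title.partition(" ")
    if rest and first_word.lower() in LEADING_ARTICLES:
        candidates.append(rest)  # fallback without the article
    return candidates

print(search_candidates("Die Hard"))
print(search_candidates("Das Haus am See"))
```

The caller would stop at the first candidate that returns results, so the shortened form is only used when the full title fails.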

the "-"-problem: i will try to fix this.

so long

morte!

StompSC
2007-02-13, 17:27
Great, I will try it this evening...

The 'der die das' problem is not a big one; I know about it and delete the article manually.

What about the *.nfo file?

Background:
We have 4 xboxes in our family network with a NAS for movie storage.
If I have a movie with matching problems, I could create a *.nfo file in the movie folder for the other users (my parents, for example).

Thanks, StompSC

morte0815
2007-02-13, 18:05
In your case you could just export the entire database to an XML file... copy that file onto the server and import it there... so your entire movie database would be exactly the same on each xbox...

But i think there will be some additional functionality for your problem to make it more comfortable.

so long
morte

StompSC
2007-02-13, 19:57
Mirroring the database is an option, but if I have to do that manually, it's no alternative.

I've tested the scraper with 70 movies from 2006 and 2007 and it works fine!
Only 3 movies have no plot, but these have no plot on ofdb either. :;):

some things:
The '-' is annoying; perhaps you could change the scraper to ignore the '-'.
The "der, die, das" issue is sometimes a problem of the ofdb search engine, but I can't detect a pattern.
Many plots contain some rectangles, perhaps special characters? Often at the end of a sentence (formatting characters?), sometimes within a sentence.
Examples:
"Arthur und die Minimoys" - a long '-' is shown as a square
"
<br />
<br />
<br />
<br />
Mit Hilfe...
"
The <br /> are shown as squares.
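
A cleanup pass along these lines might remove the squares caused by markup and line-break characters. This is a Python sketch under assumptions (the function name `clean_plot` is made up, and the real fix would live in the scraper's regular expressions); it does not address the long dash.

```python
import re

def clean_plot(raw: str) -> str:
    """Replace <br> tags and runs of whitespace with single spaces."""
    # Turn every <br>, <br/> or <br /> into a space.
    text = re.sub(r"<br\s*/?>", " ", raw, flags=re.IGNORECASE)
    # Collapse carriage returns, newlines and repeated blanks.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_plot("\r\n<br />\r\n<br />\r\nMit Hilfe..."))
```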

But this is only a cosmetic problem; I'm happy that the German scraper works so well!
Thank you, morte0815 and spiff, for the great work.

Now I must go to my family, because today is my birthday... :grin:
But the scraper came first :D

Bye, StompSC

morte0815
2007-02-13, 20:19
:grin:HAPPY BIRTHDAY!:grin:

sCAPe
2007-02-14, 18:31
Hi morte,

yesterday i tried the scraper from SVN (so not the latest version in this thread).
I noticed that the german umlauts (ä,ö,ü) are not displaying correctly!?

Has this been fixed in your latest .xml in this thread or is this still an XBMC issue?

spiff
2007-02-14, 19:19
they work fine here. make sure you have the correct font (unicode).

morte0815
2007-02-14, 20:39
Did you use the latest xbmc version (from the svn) or just the scraper from the svn?

sCAPe
2007-02-16, 18:41
Now i updated to the latest SVN and it works like a charm! :grin: Thanks morte for this great scraper.

Although, i have a feature suggestion, but i don't know if this is possible:
spiff:
Would it be possible for the ofdb scraper to download everything from ofdb, like it does now, but with a few additions from IMDB?
That is because the ofdb database is missing some information which the IMDB database has.
So in ofdb we have almost all the data filled in, except for a few fields which are "not available". These few could then be fetched from IMDB (as they are only names and don't need to be translated..).

Don't know if this is possible?

spiff
2007-02-16, 19:02
yeh, chaining is on my list.

sCAPe
2007-02-16, 19:05
You are the man. :cool: Thanks.

morte0815
2007-02-18, 14:54
@sCAPe
I just updated the scraper to download information, that is not available from ofdb, from imdb... (actors, roles, credits, runtime, mpaa)...

sCAPe
2007-02-19, 14:17
Thanks morte for your work.

DonJ
2007-02-20, 18:26
morte0815 I've just added nfo support for scrapers, would be nice if you could update ur ofdb scraper to use it.

morte0815
2007-02-21, 11:56
Updated and committed!

If someone finds another URL form for links into ofdb, please post it so I can update the scraper.

cheers

StompSC
2007-02-22, 10:46
Hi,

In the SVN there are some changes for *.nfo file parsing:

DonJ_ added: Scraper support for nfo files
morte0815 changed: added regexp for nfo-parsing
...

Where can I get some info about using the new features?
Does the nfo file only contain a URL for now, or could I make different entries for different parsers?

Or better: is it possible to choose which parser to use via a setting in the nfo file?

Would it be possible to make the following addition to the scraper sources?
An XML file in the movie folder, perhaps a "moviename.xml" with the detail info, like:

<details>
<title>...</title>
<originaltitle>...</originaltitle>
<year>...</year>
<director>...</director>
<outline>...</outline>
<plot>...</plot>
<thumb>...</thumb>

and so on...

So, if this file exists, it is used instead of an online request...

StompSC

DonJ
2007-02-22, 13:28
You simply put a URL into an .nfo file (it is now documented to some extent at http://www.xboxmediacenter.com/wiki/index.php?title=Nfo), e.g.
http://www.imdb.com/title/tt0333766
or
http://www.ofdb.de/view.php?page=film&fid=167

The URL is matched against all scrapers of the content type a dir is set to. E.g. if you set the content type to movies, all movie scrapers check the nfo file for a matching URL. This means that nfo files override the scraper setting. E.g. a directory is set to use the imdb scraper but you have a German movie in it. Simply create an nfo for that movie with the ofdb link in it and you are sorted!
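
The matching described above can be pictured with a short Python sketch. It is only an illustration: the pattern table, the scraper names, and `match_nfo` are assumptions, and the real URL patterns live in each scraper's XML.

```python
import re

# Assumption: one URL pattern per movie scraper, modelled on the two
# example links above. The real expressions may differ.
SCRAPER_URL_PATTERNS = {
    "imdb": re.compile(r"imdb\.com/title/(tt\d+)"),
    "ofdb": re.compile(r"ofdb\.de/view\.php\?page=film&fid=(\d+)"),
}

def match_nfo(nfo_text: str):
    """Return (scraper name, movie id) for the first scraper whose URL
    pattern matches the nfo contents, or None if none matches."""
    for name, pattern in SCRAPER_URL_PATTERNS.items():
        found = pattern.search(nfo_text)
        if found:
            return name, found.group(1)
    return None

print(match_nfo("http://www.ofdb.de/view.php?page=film&fid=167"))
print(match_nfo("http://www.imdb.com/title/tt0333766"))
```

Because every scraper of the content type gets a chance to match, the URL in the nfo decides which scraper is used, not the directory setting.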

Concerning your feature request: Why would you want a static feature like this? The nfo feature should do everything you want. I really don't see a need for it.

StompSC
2007-02-22, 23:44
@DonJ

Thanks for the info and the wiki update, I'm now up to date... (It is easier than I expected)...

I made the feature request because - call me a pedant - there are, for example, still some characters in the plot that shouldn't be there (see my older posting in this thread).
With such a file, I could manually edit the plot and fix it.
Because we have four xboxes in our home network, I'd have to fix it only once.
That's the reason...
And the file (which I wish to have in the movie folder) is currently created temporarily (by the scraper) and parsed by the application. So the application only has to look for the file and either use it or start the scraper.
For a developer who knows the stuff it's only a few lines of code...
And for people who don't like to use it, there are no side effects: no xml file, no problem...

StompSC

jmarshall
2007-02-23, 00:56
StompSC: Correct that it'd only be a few lines of code. Please feel free to do up a patch. The code for loading from XML file is already there (utils/IMDb.cpp).

If you can't do that, or don't have the inclination, you can always export your db, edit the resultant XML file and then reimport.

Cheers,
Jonathan

sCAPe
2007-02-23, 21:12
morte0815:
I just found some bugs with the ofdb scraper.

1) After the automatic scan of a complete folder has run, I want to rescan a specific movie (because it was matched wrongly). But when I rescan the correct movie (with manual selection), the movie poster (thumbnail) doesn't get updated.
All information is updated correctly, but the thumbnail remains the previous one.

2) For some movies there are strange "square" signs in the description. I think it is because of a carriage-return character in the text... But it happens only with some movies (example: Blood Diamond).

3) When I click on the first button (upper left corner) to switch to the cast view (Besetzung / Schauspieler), there are no entries... Are they not fetched from OFDB as well?

morte0815
2007-02-25, 11:55
Do you have the latest version from the svn?
If so, then i really have to look into it again :)
The question: what skin do you use?
I know that the cast-not-showing problem occurs if I use Clearity. If I switch back to PM3, all works fine.

And the problem with the automatic scans... isn't a problem of the scraper itself (I think the db is the bad guy here :) )

Please give me some time to fix the scraper, since I have to study for my exams these days...

Cheers morte

sCAPe
2007-02-26, 12:53
Yeah, no prob.
I am using standard PMIII skin and the latest SVN (XBMC and OFDB scraper).

StompSC
2007-03-01, 21:48
@jmarshall

You've made the following changes to the sources:

"added: .nfo parsing can now read xml files of the form outputted by scrapers or export from db."

I've taken a look at NfoFile.cpp.....???

Is it only necessary to rename the .xml file to .nfo?
So the .nfo file either contains a link or is an XML file with the movie data. Is this correct?

StompSC

btw: Thanks for spending your time and brain and for changing the code! :nod:

jmarshall
2007-03-01, 23:04
.nfo is parsed as an xml file. So rename .xml -> .nfo. The format is identical to that output by the scrapers or by the video db export.
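
To illustrate the two kinds of .nfo content discussed in this thread, here is a small Python sketch (not the XBMC code in NfoFile.cpp). It assumes the XML uses a <details> root with child tags like those posted earlier; `read_nfo` is a made-up name.

```python
import xml.etree.ElementTree as ET

def read_nfo(nfo_text: str):
    """Return the movie fields if the nfo is XML, or None if it is not
    (for example when it only contains a URL)."""
    try:
        root = ET.fromstring(nfo_text.strip())
    except ET.ParseError:
        return None  # not XML, so treat it as a plain URL nfo
    fields: dict[str, list[str]] = {}
    for child in root:
        # Tags such as <genre> may repeat, so collect values in lists.
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return fields

sample_nfo = (
    "<details><title>Departed</title>"
    "<genre>Krimi</genre><genre>Thriller</genre></details>"
)
print(read_nfo(sample_nfo))
print(read_nfo("http://www.ofdb.de/view.php?page=film&fid=167"))
```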