Filmweb scraper
Hi,
This is my first "release" of the Filmweb (Polish movie database site) scraper.
The main features work, but there is one big problem:
- xbmc doesn't show the Polish characters at all :sad:
Filmweb - Scraper (http://smuto.w.interia.pl/filmwebraper.rar)
cheers smuto
I tested this scraper after spiff's fix:
fixed: various encoding related stuff in scraper code. should fix ofdb.
This fix gives me correct writing to the database, but the strings are not readable in the GUI.
filmweb.rar (http://smuto.w.interia.pl/filmweb.rar)
Revision: 7829
http://svn.sourceforge.net/xbmc/?rev=7829&view=rev
Author: spiff_
Date: 2007-02-13 16:16:50 -0800 (Tue, 13 Feb 2007)
Log Message:
-----------
fixed: if scraper says it returns utf8 formatted xml, obey.
you should have a look at the actors. and don't use tag line, use <originaltitle>
- I fixed the actor tag (whitespace bug)
- I added the <originaltitle> tag
Filmweb-scraper (http://smuto.w.interia.pl/filmweb.rar)
added to svn with some small modifications due to changes to buffer cleanings. also renamed it to filmweb.pl.
cheers
spiff
Rating tag issue
e.g. a rating of 8,73 (comma) from the scraper is shown as 8.0 (dot) in the GUI
one more
"votes" should be localized.
smuto
yes, it needs to conform to standard floating point numbers, i.e. use a . and not a ,
localization is added.
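A minimal sketch of the comma-to-dot normalisation described above, in Python rather than scraper XML. The function name `parse_rating` is made up for illustration; in the real scraper the same substitution would happen in the regexp output.

```python
def parse_rating(text):
    """Turn a rating such as '8,73' into a standard float.

    Hypothetical helper: swaps the decimal comma for a dot before parsing,
    because a standard float parser does not accept '8,73'.
    """
    return float(text.strip().replace(",", "."))
```

Ratings that already use a dot pass through unchanged, so the helper is safe to apply to every scraped value.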
done
i hope this is the last update
@spiff - i want to thank you very much.:D
please update the scraper to reflect the <nfourl> changes
I made the update, but had no time for tests
filmweb.xml (http://smuto.w.interia.pl/filmweb.xml)
The scraper at this link, filmweb.xml (http://smuto.w.interia.pl/filmweb.xml),
is only for testing by forum users (I need more testers).
The last good working one I will always upload to SVN as a patch.
smuto
oh right, I did commit the last one - will stick to the sf ones from now on
Hi! (siema)
I have many movies and many nfo files associated with them, with URLs to imdb and filmweb entries inside. The Filmweb scraper works well only for URLs starting with www.filmweb.pl/Film,id...
It doesn't work for URLs like http://frantic.filmweb.pl/ and I have plenty of them...
Please update the regexp, probably the matching one in NfoUrl, so that the filmweb URL is correctly recognized in the nfo...
Regards,
dowiew
You know, I don't know why, but a regexp for the text link has never worked for me, and I don't have time for tests.
filmweb only masks the id number
your link
http://frantic.filmweb.pl/
is the same as the one with the id
http://www.filmweb.pl/Film?id=1107
or this one
http://www.filmweb.pl/Film,id=1107
Sorry, but I don't plan to update the scraper; I hope nfo files will start to contain the id.
smuto
i hope i can use my native language in this topic
help!!
quite simple. output xml in the format
<actor>
<thumb>...</thumb>
<name>something</name>
<role>somethingelse</role>
</actor>
But I don't have the thumb URL in the cast list, so I tried with a "url function".
The first attempt was without luck, but I like this idea (maybe this should work in the library via "Set Actor Thumb"):
<actor>
<thumb><url function="ActorLink">...</url></thumb>
<name>something</name>
<role>somethingelse</role>
</actor>
The second attempt, also without luck:
<actor>
<name>something</name>
<role>somethingelse</role>
</actor>
<url function="ActorLink">somethinglink</url>
function="ActorLink"
<actor>
<name>something</name>
<thumb>...</thumb>
</actor>
I don't know, but maybe I need the same numerator:
actor$1 -> function="ActorLink$1"
actor$2 -> function="ActorLink$2"
my WIP
filmweb.xml (http://smuto.w.interia.pl/filmweb.xml)
smuto
don't add the actors at that point.
1) make sure none of the functions you call clear buffers
2) make sure not to destroy the buffer which holds the id when it enters getdetails (# of htmls +1)
4) grab the url and chain once per actor
5) use the id to grab the role from the filmography list.
that should do, no?
I made a lite version of the scraper for tests
filmweb_only_actor_test.xml (http://smuto.w.interia.pl/filmweb_actor_test.xml)
from scrap.exe for "Goodbye Bafana"
details.xml (http://smuto.w.interia.pl/details.xml)
ActorLink.xml (http://smuto.w.interia.pl/ActorLink.xml)
Why do I have only one (the last) entry in ActorLink.xml? scrap visits all the URLs from details.
Well, it looks like you don't repeat the thumb expression anyway.
scrap will only show you the last outputted xml.
In xbmc the actors will be pushed to a list for each returned xml.
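To illustrate the difference between scrap (which shows only the last XML) and xbmc (which collects actors from every returned XML), here is a rough Python sketch. It is not xbmc's actual code; `collect_actors` and the sample documents are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET


def collect_actors(xml_documents):
    """Append the <actor> entries of every returned XML document to one list.

    Hypothetical model of the behaviour described above: each chained
    function call returns its own XML, and no result replaces an earlier one.
    """
    actors = []
    for doc in xml_documents:
        root = ET.fromstring(doc)
        for actor in root.iter("actor"):
            actors.append({
                "name": actor.findtext("name", default=""),
                "role": actor.findtext("role", default=""),
                "thumb": actor.findtext("thumb", default=""),
            })
    return actors
```

With one document per ActorLink call, the list ends up with one entry per actor, even though a tool that only dumps the last document would show a single entry.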
thx a lot
filmweb.xml with actor's thumb (http://smuto.w.interia.pl/filmweb.xml) - 100% working
but it becomes extremely slow - sometimes, to collect the thumb URLs, the scraper visits more than 20 pages
so if someone wants to use it - just grab it from here
One more question:
I edited the TheTVDB.com scraper to match Polish strings first
tvdb-pl.xml (http://smuto.w.interia.pl/tvdb-pl.xml)
I tried to set the encoding to ISO-8859-2 in the scraper, but without success.
The gui charset in langinfo.xml:
<charsets>
<gui unicodefont="false">CP1250</gui>
<subtitle>CP1250</subtitle>
</charsets>
The Polish xbmc language strings are in "utf-8".
Polish subtitles are mostly in CP1250.
When I change the gui charset to
<gui >ISO-8859-2</gui>
the tvdb-pl scraper works perfectly.
What is the gui charset for?
smuto
if the returned xml is not utf8, it will be assumed to be in the gui charset and is converted from that to utf8.
is this the best behaviour? not sure
as for the scraper being slow - not much we can do about that as long as the site is organized as it is...
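The charset fallback described above can be sketched in Python. This is only an illustration, not xbmc's code: the function name is made up, and detecting "not utf8" by a failed decode is an assumption (xbmc may instead rely on what the scraper declares).

```python
def to_utf8_text(raw, gui_charset):
    """Decode scraper output, falling back to the GUI charset.

    Assumption: bytes that fail to decode as UTF-8 are treated as being in
    gui_charset, mirroring the behaviour described in the thread.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(gui_charset)
```

This also shows why the tvdb-pl scraper only worked after the gui charset was changed: bytes written in ISO-8859-2 decode without error as CP1250, but letters such as "ą" and "ś" sit at different positions in the two charsets, so they come out as the wrong characters.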
update - added scraper settings
one week for tests before we add this to SVN
filmweb.xml- with settings (http://smuto.w.interia.pl/filmweb.xml)
I have a problem with the encoding of labels in the scraper file
http://smuto.w.interia.pl/scraper_settings.jpg
Is there a way to add the "Automatically grab actor thumbs" setting to the scraper settings window?
smuto
Just add a setting to the xml: label="Auto Grab Actor Thumbs" id="autograb" type="bool" default="false"
Duplicate your ActorLink: have one input with conditional="autograb" (with thumb)
and a copy of ActorLink with conditional="!autograb" that does not output <thumb></thumb>
EDIT: line 104, pos 239: change &nbsp; to &amp;nbsp;
I don't think it fits as a scraper setting. You see, if you do it in the scraper, it means you won't return the URLs at all. The global setting is whether or not to actually grab the thumbs, not whether or not to grab the URLs. Small but important difference here: if you disable it at scraper level, you cannot grab them manually either. Hence dual settings make sense to me.
I'm trying to update the <nfourl>.
Can someone help me?
For now I only use the link with the id:
http://www.filmweb.pl/Film?id=999999
I'm trying to add the link with the movie title to <nfourl>:
http://movie.title.filmweb.pl/
this is my wip
<NfoUrl dest="3">
<RegExp input="$$1" output="http://www.filmweb.pl/Film?id=\1" dest="3">
<expression noclean="1">Film.id=([0-9]*)</expression>
</RegExp>
<RegExp input="$$1" output="http://\1.filmweb.pl" dest="3+">
<expression noclean="1">http://([^\/]+).filmweb.pl</expression>
</RegExp>
</NfoUrl>
but the movie title regexp works for both URLs
How can I force the scraper to use the id if it's present?
easiest solution (I don't have time to analyze the regexps):
output xml, i.e. <url>theurl</url>
first url block will take priority
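The "first url block takes priority" idea can be modelled outside the scraper as an ordered list of candidates. The Python below is only a sketch under assumptions: the function names are made up, the id pattern is the one from the WIP above (with `+` instead of `*`), and the title pattern escapes its dots and skips the plain `www` host so it does not shadow the id form.

```python
import re

# Same idea as the WIP expression: '.' matches both 'Film?id=' and 'Film,id='.
ID_RE = re.compile(r"Film.id=([0-9]+)")
# Title-style host such as http://frantic.filmweb.pl/ (dots escaped here).
TITLE_RE = re.compile(r"http://([^/]+)\.filmweb\.pl")


def nfo_urls(nfo_text):
    """Return candidate URLs in priority order: id form first, title form second."""
    urls = []
    match = ID_RE.search(nfo_text)
    if match:
        urls.append("http://www.filmweb.pl/Film?id=" + match.group(1))
    match = TITLE_RE.search(nfo_text)
    if match and match.group(1) != "www":
        urls.append("http://%s.filmweb.pl" % match.group(1))
    return urls


def pick_url(nfo_text):
    """Take the first candidate, as the first <url> block would win."""
    urls = nfo_urls(nfo_text)
    return urls[0] if urls else None
```

Because the id candidate is always emitted first, an nfo that contains both link styles resolves to the id URL, and the title URL is used only when no id is present.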
thx a lot - it's working
added as a patch to SVN
can someone review and then commit this to the SVN?
I have a script before every location URL
e.g.
details.1.html (http://smuto.w.interia.pl/details.1.html)
How can I force the scraper to skip this?
smuto
I added "spoof" to the URL, maybe this helps
you can test my wip scraper
filmweb.xml_test (http://smuto.w.interia.pl/filmweb.xml)
spoof is for setting the referer. it probably does the trick indeed. sorry for the late response
Maybe it's not an xbmc problem, but maybe you can help.
Recently, in movie info from the filmweb scraper, accented characters are shown as entities
e.g.
latin small letter o with acute
&oacute; -> ó
Is there a way to fix this?
smuto
hmm, it should convert those tags when you load the xml?
if not, make sure cleaning is performed on the field. latter would remove them though
with or without "noclean" I still get the same
e.g.
xbmc shows
kt&oacute;ry -> ktoacute;ry
in the .xml from scrap.exe
który -> który
I really don't know what to do
I need to update SVN (small url function link fix),
but for now this one is good for testing entities
filmweb.xml_test (http://smuto.w.interia.pl/filmweb.xml)
A good test is "Kingdom of the Crystal Skull":
the title tag is OK
the outline & plot tags are wrong
For myself I edited the source file HTMLUtil.cpp
edited HTMLUtil.cpp (http://smuto.w.interia.pl/HTMLUtil.cpp)
strReturn.Replace("&ndash;", "-");
strReturn.Replace("&oacute;", "ó");
It's working, but I hope you can help fix this for all Polish users
smuto
I added fanart to the filmweb scraper
I use Polish Wikipedia to map from the filmweb id to the imdb id
we still have the problem with entities; I hope spiff finds time to help us
you can test the new scraper from here
filmweb.xml_test_scraper (http://smuto.w.interia.pl/filmweb.xml)
smuto
hi.
i see nothing wrong, nor any other way to handle this so i just committed your replaces along with the new scraper. please use trac in the future:)
spiff
@smuto: There are some problems with titles that start with numbers, e.g. "1410" or "27 dresses" - the numbers are cut off from them. Fanart support is really great.
I hope that an xbmc build with the edited HTMLUtil.cpp will be ready soon. In the meantime, could you put your compiled xbmc default.xbe at smuto.w.interia.pl? (It would be great for me, because I want to rescan my movie library, and Polish plots with no entity problems are something I'm looking for...)
Even though entities have been fixed with changeset 15625 (http://xbmc.org/trac/changeset/15625), it seems that the "oacute problem" still exists (I checked the filmweb scraper on xbmc builds 15640 and 15728). Smuto - do you agree with me?
@haken - you just need to update the scraper
filmweb.xml (http://smuto.w.interia.pl/filmweb.xml)
@spiff
i see nothing wrong, nor any other way to handle this so i just committed your replaces
but this is not a good idea - "oacute" & "ndash" are just the most popular ones
this means I would have to add all entities to the replaces
next in my queue are
strReturn.Replace("&nbsp;", "");
strReturn.Replace("&rsquo;", "'");
smuto
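For comparison with adding one Replace per entity, here is a small Python sketch that decodes all named HTML entities in one go with the standard library's `html.unescape`. It is not the HTMLUtil.cpp fix, just an illustration of the general approach; the helper name and the optional ASCII punctuation map are assumptions.

```python
import html

# Optional mapping to plain ASCII punctuation, in the spirit of replacing
# an en dash with '-' and a right single quote with "'".
ASCII_PUNCT = str.maketrans({"\u2013": "-", "\u2019": "'", "\u00a0": " "})


def decode_entities(text, ascii_punct=False):
    """Decode every named or numeric HTML entity in text.

    Hypothetical helper: html.unescape covers the whole entity table, so
    no per-entity replace list has to be maintained.
    """
    decoded = html.unescape(text)
    return decoded.translate(ASCII_PUNCT) if ascii_punct else decoded
```

For example, `decode_entities("kt&oacute;ry")` gives "który" without any entry for `oacute` having been added by hand.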
I don't know why, but sometimes the wikipedia search doesn't work
I changed the way of scraping the link after the search - please test
filmweb.xml (http://smuto.w.interia.pl/filmweb.xml)
Is there a way to show a custom label in the skin?
I need something like this for testing:
ListItem.IMDbID or ListItem.FilmwebID
smuto
@smuto: I think that there are some changes in the filmweb.pl website - descriptions cannot be scraped, nor can high-res posters. I looked inside the scraper, but it is too complicated for me;)
Update: The scraper is ok! It was something else - now everything works perfectly. I was surprised because previously the scraper either worked or didn't work at all... Sorry!
Any chance to fix this scraper? I really love it :laugh:. But it has stopped working for me now. It finds the title but doesn't download any cover or any info about the movie. :sniffle:
This must be something with the site, because now it's working.
The problem seems to be that the website sometimes forces users to see a welcome page. I tried to fix this with spoof, but it will be harder than expected. So if someone wants to work on that, she/he is welcome.
smuto
It looks like the site is always forcing the welcome page :/ I refresh and every time I see this message:
Tool.doNotEscapeHTML ($simpleMarkupTool.renderMarkup($desc.description
Any ideas what is wrong?
---added
I googled the phrase doNotEscapeHTML and found many pages with this message (e.g. http://www.meg.ryan.filmweb.pl/FilmReview?id=483666&review.id=5968). It looks like there is a problem with the filmweb.pl site :/
nightman
2009-03-05, 23:10
It looks like the site is always forcing the welcome page :/
I googled the phrase doNotEscapeHTML and found many pages with this message (e.g. http://www.meg.ryan.filmweb.pl/FilmReview?id=483666&review.id=5968)
Any ideas what is wrong?
Try using the URL with "&amp;" changed to "&". For me it doesn't show the welcome page and, better still, it doesn't show the "doNotEscapeHTML" error.
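The tip above amounts to a plain string substitution on the URL before it is requested. A tiny Python sketch follows; the function name is made up, and the escaped input URL in the usage note is an assumption about what the broken link looked like.

```python
def fix_escaped_url(url):
    """Replace HTML-escaped ampersands so the query parameters separate properly."""
    return url.replace("&amp;", "&")
```

Applied to a link of the form `...FilmReview?id=483666&amp;review.id=5968`, this gives `...FilmReview?id=483666&review.id=5968`, where `id` and `review.id` are two separate parameters again.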
zxcvbn1971
2009-05-15, 10:26
I'm afraid this scraper does not work anymore. Only 30% of movies are correctly added to the XBMC database. The remaining movies are added with errors (no title or description - the famous doNotEscapeHTML) or not added at all....
:(
hi,
Anything new about this scraper? Does it work for you or not?