Filmweb scraper


smuto
2007-02-14, 01:48
Hi,
This is my first "release" of the Filmweb (Polish movie database site) scraper.

The main features work, but there is one big problem:
- XBMC doesn't show Polish characters at all :sad:

Filmweb - Scraper (http://smuto.w.interia.pl/filmwebraper.rar)

cheers smuto

I tested this scraper after spiff's fix:
fixed: various encoding related stuff in scraper code. should fix ofdb.

This fix gives me correct writing to the database, but the strings are not readable in the GUI.

smuto
2007-02-14, 01:51
filmweb.rar (http://smuto.w.interia.pl/filmweb.rar)

spiff
2007-02-14, 01:51
404

spiff
2007-02-14, 02:17
Revision: 7829
http://svn.sourceforge.net/xbmc/?rev=7829&view=rev
Author: spiff_
Date: 2007-02-13 16:16:50 -0800 (Tue, 13 Feb 2007)

Log Message:
-----------
fixed: if scraper says it returns utf8 formatted xml, obey.

You should have a look at the actors. And don't use the tagline, use <originaltitle>.

smuto
2007-02-14, 16:23
- I fixed the actor tag (whitespace bug)
- I added the <originaltitle> tag

Filmweb-scraper (http://smuto.w.interia.pl/filmweb.rar)

spiff
2007-02-14, 19:12
added to svn with some small modifications due to changes to buffer cleanings. also renamed it to filmweb.pl.

cheers

spiff

smuto
2007-02-18, 12:39
Rating tag issue

e.g.
a rating of 8,73 (comma) from the scraper is shown as 8.0 (dot) in the GUI

One more thing:
"votes" should be localized.

smuto

spiff
2007-02-18, 13:04
yes, it needs to conform to standard floating point numbers, i.e. use a . and not a ,

localization is added.
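
The decimal separator point above can be sketched outside the scraper. This is a minimal Python illustration, not the scraper's own regexp; the function name normalize_rating is hypothetical and only shows the comma-to-dot conversion that the rating value needs before it is parsed as a float.

```python
def normalize_rating(raw: str) -> float:
    """Turn a rating like '8,73' (decimal comma) into a float.

    Hypothetical helper for illustration; a real scraper would do the
    same substitution with a RegExp before emitting the rating tag.
    """
    # Replace the decimal comma with a dot so float() can parse it.
    return float(raw.strip().replace(",", "."))


rating = normalize_rating("8,73")
```

A value that already uses a dot passes through unchanged, so the conversion is safe to apply to every rating.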

smuto
2007-02-18, 13:25
done
i hope this is the last update

@spiff - i want to thank you very much.:D

spiff
2007-02-18, 13:57
latest is in svn.

spiff
2007-02-21, 13:06
please update the scraper to reflect the <nfourl> changes

smuto
2007-02-22, 20:06
I made the update, but had no time for tests
filmweb.xml (http://smuto.w.interia.pl/filmweb.xml)

smuto
2007-02-23, 13:13
The scraper at this link, filmweb.xml (http://smuto.w.interia.pl/filmweb.xml),
is only for testing by forum users (I need more testers).

I will always upload the last known-good version to SVN as a patch.

smuto

spiff
2007-02-23, 13:35
oh rite i did commit the last one - will stick to the sf ones from now on

dowiew
2007-04-12, 20:01
Hi! (siema)
I have many movies and many nfo files associated with them, with URLs to IMDb and Filmweb entries inside. The Filmweb scraper works well only for URLs starting with www.filmweb.pl/Film,id...
It doesn't work for URLs like http://frantic.filmweb.pl/ and I have plenty of them...
Please update the regexp, probably the matching one in NfoUrl, so that the Filmweb URL is correctly recognized in the nfo...
Regards,
dowiew

smuto
2007-04-12, 22:47
You know, I don't know why, but a regexp on a text link has never worked for me, and I have no time for tests.

Filmweb only masks the id number.
Your link
http://frantic.filmweb.pl/

is the same as this one with the id
http://www.filmweb.pl/Film?id=1107

or this one
http://www.filmweb.pl/Film,id=1107

Sorry, but I don't plan to update the scraper. I hope nfo files will start to contain the id.

smuto

smuto
2007-04-12, 23:04
i hope i can use my native language in this topic

smuto
2007-12-09, 01:58
help!!

quite simple. output xml in the format

<actor>
<thumb>...</thumb>
<name>something</name>
<role>somethingelse</role>
</actor>

But I don't have a thumb URL in the cast list, so I tried the "url function".

First attempt, without luck, but I like this idea (maybe this should work in the library via "Set Actor Thumb"):

<actor>
<thumb><url function="ActorLink">...</url></thumb>
<name>something</name>
<role>somethingelse</role>
</actor>

Second attempt, also without luck:
<actor>
<name>something</name>
<role>somethingelse</role>
</actor>
<url function="ActorLink">somethinglink</url>

function="ActorLink"
<actor>
<name>something</name>
<thumb>...</thumb>
</actor>

I don't know, but maybe I need the same numbering:

actor$1 -> function="ActorLink$1"
actor$2 -> function="ActorLink$2"

my WIP
filmweb.xml (http://smuto.w.interia.pl/filmweb.xml)

smuto

spiff
2007-12-09, 12:17
dont add the actors at that point.

1) make sure none of the functions you call clear buffers
2) make sure not to destroy the buffer which holds the id when it enters getdetails (# of htmls +1)
4) grab the url and chain once per actor
5) use the id to grab the role from the filmography list.

that should do, no?

smuto
2007-12-10, 02:34
I made a lite version of the scraper for tests
filmweb_only_actor_test.xml (http://smuto.w.interia.pl/filmweb_actor_test.xml)

Output from scrap.exe for "Goodbye Bafana":

details.xml (http://smuto.w.interia.pl/details.xml)

ActorLink.xml (http://smuto.w.interia.pl/ActorLink.xml)

Why do I have only one (the last) entry in ActorLink.xml? scrap visits all the URLs from details.

C-Quel
2007-12-10, 11:19
Well looks like you dont repeat the thumb expression anyway.

spiff
2007-12-10, 13:44
scrap will only show you the last outputted xml.

in xbmc the actors will be pushed to a list for each returned xml
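
The accumulation described above can be sketched in Python. This is only an illustration of the idea, not XBMC's code: the function name collect_actors, the details root element and the example.invalid URLs are assumptions made for the example; only the actor, name, role and thumb tags come from this thread.

```python
import xml.etree.ElementTree as ET


def collect_actors(xml_results):
    """Append the actors of every returned XML document to one list.

    Mirrors the described behaviour: each chained function call returns
    its own XML, and the actors from all of them end up in a single list.
    """
    actors = []
    for xml_text in xml_results:
        root = ET.fromstring(xml_text)
        for actor in root.iter("actor"):
            actors.append({
                "name": actor.findtext("name", default=""),
                "role": actor.findtext("role", default=""),
                "thumb": actor.findtext("thumb", default=""),
            })
    return actors


# One XML document per chained ActorLink call (sample data, made up).
results = [
    "<details><actor><name>First Actor</name>"
    "<thumb>http://example.invalid/1.jpg</thumb></actor></details>",
    "<details><actor><name>Second Actor</name>"
    "<thumb>http://example.invalid/2.jpg</thumb></actor></details>",
]
actors = collect_actors(results)
```

A tool that shows only the last returned document would display just the second actor here, while the accumulated list holds both.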

smuto
2007-12-11, 15:10
thx a lot

filmweb.xml with actor's thumb (http://smuto.w.interia.pl/filmweb.xml) - 100% working

But it has become extremely slow: to collect the thumb URLs, the scraper sometimes visits more than 20 pages.

So if someone wants to use it, just grab it from here.

one more question

I edited the TheTVDB.com scraper to match Polish strings first:
tvdb-pl.xml (http://smuto.w.interia.pl/tvdb-pl.xml)
I tried to set the encoding to ISO-8859-2 in the scraper, but without success.

The GUI charset in langinfo.xml:
<charsets>
<gui unicodefont="false">CP1250</gui>
<subtitle>CP1250</subtitle>
</charsets>

The Polish XBMC language strings are in UTF-8.
Polish subtitles are mostly in CP1250.

When I change the GUI charset to
<gui >ISO-8859-2</gui>
the tvdb-pl scraper works perfectly.

What is the GUI charset for?
smuto

spiff
2007-12-11, 15:14
If the returned XML is not UTF-8, it is assumed to be in the GUI charset and is converted from that to UTF-8.
Is this the best behaviour? Not sure.
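
A rough Python sketch of that fallback, to show why the GUI charset matters for a scraper that returns ISO-8859-2. The function name to_utf8 is hypothetical, and detecting "not UTF-8" by a failed decode is an assumption of this sketch; XBMC may decide it differently, for example from the declared encoding.

```python
def to_utf8(raw: bytes, gui_charset: str = "cp1250") -> bytes:
    """Return UTF-8 bytes, treating non-UTF-8 input as the GUI charset."""
    try:
        # Already valid UTF-8: pass through unchanged.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption of this sketch: anything else is in the GUI charset.
        text = raw.decode(gui_charset)
    return text.encode("utf-8")


# ISO-8859-2 bytes for a Polish letter, read with two different GUI charsets.
iso_bytes = "ą".encode("iso-8859-2")
wrong = to_utf8(iso_bytes, "cp1250").decode("utf-8")
right = to_utf8(iso_bytes, "iso-8859-2").decode("utf-8")
```

With the GUI charset set to CP1250 the byte is mapped to a different character, which would match the observation that the tvdb-pl scraper only displays correctly once the GUI charset is ISO-8859-2.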

as for the scraper being slow - not much we can do about that as long as the site is organized as it is...

smuto
2008-01-19, 13:36
Update: added scraper settings.

One week of testing before we add this to SVN.

filmweb.xml- with settings (http://smuto.w.interia.pl/filmweb.xml)



I have a problem with the encoding of labels in the scraper file:
http://smuto.w.interia.pl/scraper_settings.jpg
Is there a way to add the "Automatically grab actor thumbs" setting to the scraper settings window?

smuto

C-Quel
2008-01-19, 13:44
Just add a setting to the xml: label="Auto Grab Actor Thumbs" id="autograb" type="bool" default="false"

Duplicate your ActorLink: one copy with conditional="autograb" (with thumb)

and the other copy of ActorLink with conditional="!autograb", which does not output <thumb></thumb>

EDIT: line 104, pos 239 change &nbsp to &amp;nbsp;

spiff
2008-01-26, 14:00
i dont think it fits as a scraper setting. you see, if you do it in the scraper it means you won't return the urls at all. the global setting is whether or not to actually grab the thumbs, not whether or not to grab the urls. small but important difference here - if you disable it at scraper level it means you cannot grab them manually either... hence dual settings makes sense to me

smuto
2008-02-14, 15:39
I'm trying to update the <nfourl>.
Can someone help me?

For now I only use the link with the id:
http://www.filmweb.pl/Film?id=999999

I'm trying to add the link with the movie title to <nfourl>:
http://movie.title.filmweb.pl/

This is my WIP:

<NfoUrl dest="3">
<RegExp input="$$1" output="http://www.filmweb.pl/Film?id=\1" dest="3">
<expression noclean="1">Film.id=([0-9]*)</expression>
</RegExp>
<RegExp input="$$1" output="http://\1.filmweb.pl" dest="3+">
<expression noclean="1">http://([^\/]+).filmweb.pl</expression>
</RegExp>
</NfoUrl>


But the movie title regexp matches both URLs.
How can I force the scraper to use the id if it's present?

spiff
2008-02-14, 15:43
easiest solution (i dont have time to analyze the regexp's).

output xml, i.e. <url>theurl</url>

first url block will take priority
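
The "id first, title second" priority can be illustrated in Python. This is a sketch of the matching logic only, not scraper XML: the names ID_RE, TITLE_RE and resolve_filmweb_url are made up for the example, and the patterns escape the dots and exclude the www host so that the title pattern no longer matches the id form.

```python
import re

# Matches both id forms seen in this thread: Film?id=1107 and Film,id=1107.
ID_RE = re.compile(r"filmweb\.pl/Film[?,]id=(\d+)")
# Matches title subdomains such as frantic.filmweb.pl, but not www.filmweb.pl.
TITLE_RE = re.compile(r"https?://(?!www\.)([^/\s]+?)\.filmweb\.pl")


def resolve_filmweb_url(nfo_text):
    """Return the Filmweb URL to use for an nfo file, preferring the id."""
    match = ID_RE.search(nfo_text)
    if match:
        return f"http://www.filmweb.pl/Film?id={match.group(1)}"
    match = TITLE_RE.search(nfo_text)
    if match:
        return f"http://{match.group(1)}.filmweb.pl"
    return None


by_id = resolve_filmweb_url("see http://www.filmweb.pl/Film,id=1107")
by_title = resolve_filmweb_url("see http://frantic.filmweb.pl/")
```

Checking the id pattern first plays the same role as emitting the id-based url block first in the scraper output.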

smuto
2008-02-14, 16:21
thx a lot - it's working

add as a patch to SVN

smuto
2008-02-23, 18:38
can someone review and then commit this to the SVN?

smuto
2008-05-25, 22:58
I have a script before every location URL,

e.g.
details.1.html (http://smuto.w.interia.pl/details.1.html)

How can I force the scraper to skip it?

smuto

smuto
2008-05-26, 20:06
I added "spoof" to the URL, maybe this helps.
You can test my WIP scraper:

filmweb.xml_test (http://smuto.w.interia.pl/filmweb.xml)

spiff
2008-05-26, 23:46
spoof is for setting the referer. it probably does the trick indeed. sorry for the late response

smuto
2008-06-10, 16:21
Maybe it's not an XBMC problem, but maybe you can help.

Recently, in the movie info from the Filmweb scraper, accented characters are shown as entities,

e.g.
latin small letter o with acute
ó -> &oacute;

Is there a way to fix this?
smuto

spiff
2008-06-10, 16:39
hmm, it should convert those tags when you load the xml?
if not, make sure cleaning is performed on the field. latter would remove them though

smuto
2008-06-11, 22:35
With or without "noclean" I still get the same result,

e.g.
XBMC shows
który -> ktoacute;ry

in the .xml from scrap.exe
który -> kt&oacute;ry

I really don't know what to do.

I need to update SVN (small url function link fix),
but for now this one is good for testing entities:
filmweb.xml_test (http://smuto.w.interia.pl/filmweb.xml)

A good test is "Kingdom of the Crystal Skull":
the title tag is OK
the outline & plot tags are wrong

smuto
2008-07-08, 22:51
For myself I edited the source file HTMLUtil.cpp:
edited HTMLUtil.cpp (http://smuto.w.interia.pl/HTMLUtil.cpp)

strReturn.Replace("&ndash;", "-;");
strReturn.Replace("&oacute;", "ó");


It's working, but I hope you can help fix this for all Polish users.

smuto

smuto
2008-09-14, 22:23
I added fanart to the Filmweb scraper.

I use Polish Wikipedia to map the Filmweb id to the IMDb id.

We still have the problem with entities; I hope spiff finds time to help us.

You can test the new scraper from here:
filmweb.xml_test_scraper (http://smuto.w.interia.pl/filmweb.xml)

smuto

spiff
2008-09-18, 01:28
hi.

i see nothing wrong, nor any other way to handle this, so i just committed your replaces along with the new scraper. please use trac in the future:)

spiff

haken
2008-09-24, 19:22
@smuto: There are some problems with titles that start with numbers, e.g. "1410" or "27 dresses": the numbers are cut off. Fanart support is really great.
I hope that an XBMC build with the edited HTMLUtil.cpp will be ready soon. In the meantime, could you put your compiled XBMC default.xbe at smuto.w.interia.pl? That would be great for me, because I want to rescan my movie library, and Polish plots with no entity problems are what I'm looking for...

haken
2008-10-01, 20:47
Even though entities have been fixed with changeset 15625 (http://xbmc.org/trac/changeset/15625), it seems that the "oacute problem" still exists (I checked the Filmweb scraper on XBMC builds 15640 and 15728). Smuto, do you agree with me?

smuto
2008-10-03, 20:09
@haken - you just need to update the scraper
filmweb.xml (http://smuto.w.interia.pl/filmweb.xml)

@spiff

i see nothing wrong, nor any other way to handle this, so i just committed your replaces

But this is not a good idea: "oacute" & "ndash" are just the most common ones.
This means I would have to add all entities to the replaces.
Next in my queue are

strReturn.Replace("&nbsp;", "");
strReturn.Replace("&rsquo;", "'");
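
For comparison, a table-driven decoder covers all named entities at once instead of one Replace call per entity. The sketch below uses Python's standard html.unescape purely to illustrate the idea; it is not XBMC code, and the wrapper name decode_entities is hypothetical.

```python
import html


def decode_entities(text: str) -> str:
    """Decode every named or numeric HTML entity in one pass."""
    # html.unescape uses the full HTML5 entity table from the standard library.
    return html.unescape(text)


word = decode_entities("kt&oacute;ry")
```

Note that a generic decoder maps &ndash; to the real en dash and &nbsp; to a non-breaking space, which differs from the hand-written replacements quoted above, so any further substitution would be a separate step.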

smuto

smuto
2008-10-07, 10:04
I don't know why, but sometimes the Wikipedia search doesn't work.

I changed the way the link is scraped after the search, please test:
filmweb.xml (http://smuto.w.interia.pl/filmweb.xml)

Is there a way to show a custom label in the skin?

I need something like this for testing:
ListItem.IMDbID or ListItem.FilmwebID

smuto

haken
2008-11-01, 09:50
@smuto: I think there have been some changes to the filmweb.pl website: descriptions cannot be scraped, and neither can high-res posters. I looked inside the scraper, but it is too complicated for me ;)

Update: The scraper is OK! It was something else, and now everything works perfectly. I was surprised because previously the scraper either worked or didn't work at all... Sorry!

Neku
2009-02-07, 12:01
Any chance of fixing this scraper? I really love it :laugh:. But it has stopped working for me now. It finds the title but doesn't download any cover or any info about the movie. :sniffle:

Neku
2009-02-07, 22:47
It must be something with the site, because now it's working.

smuto
2009-02-18, 13:34
The problem seems to be that the website sometimes forces users to see a welcome page. I tried to fix this with spoof, but it will be harder than expected. So if someone wants to work on that, she/he is welcome.

smuto

waszka
2009-02-23, 23:59
It looks like the site always forces the welcome page :/ I refresh and every time I see the message:
Tool.doNotEscapeHTML ($simpleMarkupTool.renderMarkup($desc.description

Any ideas what is wrong?

---added
I googled the phrase doNotEscapeHTML and found many pages with this message (e.g. http://www.meg.ryan.filmweb.pl/FilmReview?id=483666&review.id=5968). It looks like there is a problem with the filmweb.pl site :/

nightman
2009-03-05, 23:10
Try using the URL with "&" changed to "&amp;". For me it doesn't show the welcome page and, even better, it doesn't show the "doNotEscapeHTML" error.

zxcvbn1971
2009-05-15, 10:26
I'm afraid this scraper does not work anymore. Only 30% of movies are correctly added to the XBMC database. The remaining movies are added with errors (no title or description, the famous doNotEscapeHTML) or not added at all....
:(

wojak
2009-07-30, 09:53
Hi,
Anything new about this scraper? Does it work for you or not?