Allocine.fr (TV Shows) scraper


The_Dogg
2007-03-23, 17:25
I'm working on a TV Show scraper for allocine.fr.

I'm down to the episode list, but I have a little problem:

I use the scrap.exe tool to test it, and when the tool gets the links for the episode list, an "&" sign gets lost. Let me show you:

</status><premiered>
7 Août 2005</premiered><episodeguide><url>http://www.allocine.fr/series/episodes_gen_csaison=1511&cserie=513.html</url>
<url>http://www.allocine.fr/series/episodes_gen_csaison=2450&cserie=513.html</url></episodeguide></details>
Episodelist URL 1:http://www.allocine.fr/series/episodes_gen_csaison=1511cserie=513.html
Episodelist URL 2:http://www.allocine.fr/series/episodes_gen_csaison=2450cserie=513.html
GetEpisodeListInternal 2 returned :
GetEpisodeList returned :
Error: Unable to parse episodelist.xml

This is the output of the scrap.exe tool.

You can see that in the <details> tag the URLs are OK:
<url>http://www.allocine.fr/series/episodes_gen_csaison=1511&cserie=513.html</url>
But on the "Episodelist URL" lines the & sign is missing from the link, so the website returns a nearly empty page:
Episodelist URL 1:http://www.allocine.fr/series/episodes_gen_csaison=1511cserie=513.html

And here is the code from scraper.xml:
<RegExp input="$$8" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;" dest="5+">
<RegExp input="$$2" output="&lt;url&gt;http://www.allocine.fr/series/episodes_gen_csaison=\1&amp;cserie=$$4.html&lt;/url&gt;" dest="8">
<expression repeat="yes">&quot;/series/casting_gen_csaison=([0-9]*)&amp;cserie=$$4.html&quot; class=&quot;link1&quot;>[0-9]&lt;/a&gt;</expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>

I tried replacing the &amp; with a plain &, and I tried doubling it (&amp;&amp; and &&), but the & sign never shows up. When I replace the &amp; with &quot;, though, the " sign appears right where I need it. Only the &amp; doesn't seem to work.

Any help would be appreciated.

The_Dogg

The_Dogg
2007-03-23, 18:08
After a little more research I found a way to make the missing & show up :)

I had to write &amp;amp; instead of &amp;

So the resulting scraper code is:

<RegExp input="$$8" output="&lt;episodeguide&gt;\1&lt;/episodeguide&gt;" dest="5+">
<RegExp input="$$2" output="&lt;url&gt;http://www.allocine.fr/series/episodes_gen_csaison=\1&amp;amp;cserie=$$4.html&lt;/url&gt;" dest="8">
<expression repeat="yes">&quot;/series/casting_gen_csaison=([0-9]*)&amp;cserie=$$4.html&quot; class=&quot;link1&quot;>[0-9]&lt;/a&gt;</expression>
</RegExp>
<expression noclean="1"></expression>
</RegExp>
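
For anyone who would rather compute the escaped string than count the "amp;"s by hand, here is a minimal Python sketch. It is not part of the scraper or of scrap.exe. The helper name escape_for_passes is made up for illustration, and it assumes the text will be XML-parsed a known number of times. Note that it only fits the URL text: the &lt;url&gt; tags in the output attribute stay single-escaped because they are meant to turn into real tags after the first parse.

```python
from xml.sax.saxutils import escape


def escape_for_passes(text: str, passes: int) -> str:
    """Escape text once per XML parse it has to survive (hypothetical helper)."""
    for _ in range(passes):
        # Each call turns & into &amp; (and < > into &lt; &gt;).
        text = escape(text)
    return text


url = "http://www.allocine.fr/series/episodes_gen_csaison=1511&cserie=513.html"

# Two parses: scraper.xml itself, then the XML the scraper returns.
print(escape_for_passes(url, 2))
```

With passes=2 the & in the URL comes out as &amp;amp;, which matches what ended up working in the scraper above.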

:laugh:

spiff
2007-03-23, 18:37
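The reason is that you are in an XML document, and you return XML. The & has to survive every parse: scraper.xml is parsed once when it is loaded, which turns &amp;amp; into &amp;, and the XML your scraper returns is parsed again, which turns &amp; into &. With a single &amp; the second parse sees a bare &, which is not a valid XML char on its own, so it gets stripped.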
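
The two parses can be reproduced outside the scraper with a rough Python sketch. This only illustrates the idea and is not how scrap.exe is implemented: run_two_passes is a made-up name, the URL is shortened, and Python's strict parser raises an error on a bare & where scrap.exe apparently just drops it.

```python
import xml.etree.ElementTree as ET


def run_two_passes(output_attr_source: str) -> str:
    """Simulate both XML parses for one output attribute (illustrative only)."""
    # Pass 1: the scraper definition is parsed, decoding entities once.
    scraper = ET.fromstring(f'<RegExp output="{output_attr_source}"/>')
    generated_xml = scraper.get("output")
    # Pass 2: the XML the scraper produced is parsed, decoding entities again.
    return ET.fromstring(generated_xml).text


double = "&lt;url&gt;episodes_gen_csaison=1511&amp;amp;cserie=513.html&lt;/url&gt;"
single = "&lt;url&gt;episodes_gen_csaison=1511&amp;cserie=513.html&lt;/url&gt;"

# Double-escaped: the & survives both passes.
print(run_two_passes(double))

# Single-escaped: pass 2 sees a bare &, which is not well-formed XML.
try:
    run_two_passes(single)
except ET.ParseError as err:
    print("second pass failed:", err)
```

The double-escaped string comes back with a plain & in the URL, while the single-escaped one fails in the second pass.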