XBMC Community Forum  

2009-01-12, 13:43   #1
asphinx
Team Arcade
Join Date: Nov 2008
Location: Finland
Posts: 231
How do you make a site scrapeable?

This might sound like a very uneducated question, but how, and with what markup language, should one write a website so that:
a) it's easy to scrape and well organized, and
b) one is still able to style it, whether with CSS or XSL or...

I don't need or expect someone to give me a complete tutorial on how to make it, just some pointers as to which markup I should look into.

I've tried looking at thetvdb.com and themoviedb.org, but I epically fail to understand with what it's been written.
__________________
Fiinix Design presents: Posters, for Movies, TV Shows, Games, Arcade etc.

» Latest Poster-Pack: The Silhouettes, TV Shows
» Upcoming Poster-Pack: To Be Announced

» Game & Emulator Poster, request here
» Movie/TV Genre Poster, here

2009-01-12, 13:46   #2
spiff
Grumpy Bastard Developer
Join Date: Nov 2003
Posts: 7,715

both of those offer xml-based apis.

the only thing needed to make a site scrapeable is a pattern that can be described using regular expressions. repeatability is the key: the same kind of data should always show up in the same markup.
__________________
Always read the XBMC online-manual, FAQ and search the forum before posting.
Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules.
For troubleshooting and bug reporting please make sure you read this first.

2009-01-12, 13:57   #3
asphinx

Thank you for the swift reply.

OK, I've already tried to make a mockup of the structure using XML (not final). So my follow-up question is simply: how does this look from a scraping viewpoint?
(I'm thinking about creating a gamedb.)


Code:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="simple.xsl" ?>

<GAME_LIBRARY>

<GAME>
<GAME_ID>id of current game in list</GAME_ID>

<!-- Could possibly use the XBE ID tag??? or just a general id to tag everything for easy scraping -->

<GAME_TITLE>
	<HEADER>Title</HEADER>
	<NAME>Halo: Combat Evolved</NAME>
	<IMG_URL>path to poster image</IMG_URL>
</GAME_TITLE>

<GAME_DEVELOPER>
	<HEADER>Developer(s)</HEADER>
	<NAME>Bungie Studios</NAME>
	<IMG_URL>path to logo</IMG_URL>
</GAME_DEVELOPER>

<GAME_PUBLISHER>
	<HEADER>Publisher(s)</HEADER>
	<NAME>Microsoft Game Studios</NAME>
	<IMG_URL>path to logo</IMG_URL>
</GAME_PUBLISHER>

<GAME_PLATFORM>
	<HEADER>Platform(s)</HEADER>
	<NAME>Xbox</NAME>
	<IMG_URL>path to logo</IMG_URL>
	<NAME>PC</NAME>
	<IMG_URL>path to logo</IMG_URL>
</GAME_PLATFORM>

<GAME_RELEASED>
	<HEADER>Released</HEADER>
	<YEAR>2001</YEAR>
	<MONTH>November</MONTH>
	<DATE>15</DATE>
</GAME_RELEASED>

<GAME_GENRE>
	<HEADER>Genre</HEADER>
	<SHORTHAND>FPS</SHORTHAND>
	<LONG>First-person Shooter</LONG>
</GAME_GENRE>

<GAME_SYNOPSIS>
	<HEADER>Synopsis</HEADER>
<SYNOPSIS>
Enter the mysterious world of Halo, an alien planet shaped like a ring. As mankind's super soldier Master Chief, you must uncover the secrets of Halo and fend off the attacking Covenant. During your missions, you'll battle on foot, in vehicles, inside, and outside with alien and human weaponry. Your objectives include attacking enemy outposts, raiding underground labs for advanced technology, rescuing fallen comrades, and sniping enemy forces. Halo also lets you battle three other players via intense split screen combat or fight cooperatively with a friend through the single-player missions.
</SYNOPSIS>
</GAME_SYNOPSIS>

<!-- COMMENT//Possibility of dynamically acquiring from gamerankings.com? -->

<GAME_RATING>
	<TEXT>95</TEXT>
	<IMG_URL>path to maybe generated img file</IMG_URL>
</GAME_RATING>

<!-- COMMENT//From gamerankings.com, percent rating from 0-100% -->

<GAME_VGCRS>
	<HEADER>Rated</HEADER>
	<ESRB>M</ESRB>
	<ESRB_URL>path to image of rating</ESRB_URL>
	<BBFC>15</BBFC>
	<BBFC_URL>path to image of rating</BBFC_URL>
	<PEGI>16+</PEGI>
	<PEGI_URL>path to image of rating</PEGI_URL>
	<USK>16</USK>
	<USK_URL>path to image of rating</USK_URL>
	<OFLC>MA15+</OFLC>
	<OFLC_URL>path to image of rating</OFLC_URL>
</GAME_VGCRS>

<GAME_VGCRS_DESC>
	<HEADER>not sure why this would be required</HEADER>
	<TEXT_1>Blood and gore</TEXT_1>
	<IMG_1>path to blood and gore image</IMG_1>
	<TEXT_2>Violence</TEXT_2>
	<IMG_2>path to violence image</IMG_2>
</GAME_VGCRS_DESC>
</GAME>

</GAME_LIBRARY>

2009-01-12, 14:01   #4
spiff

my eyes hurt, please drop the upper case. don't see the point of the <header> entries, not that it matters since they can easily be skipped.

the platform tags should be xml'ized, i.e. just have multiple
<platform>
..
</platform>
<platform>
..
</platform>

instead of using img_url then name then img_url then name... much easier to parse and much more xml'ish.

only a brief look, mind you
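
For illustration, a minimal sketch of the repeated <platform> suggestion above, applied to the platform block from the mockup. The lowercase names follow the request to drop the upper case. The wrapper element game_platforms is a hypothetical name, and pairing each name with its own img_url inside one <platform> is an assumption about what the final layout would look like, not a required schema.

```xml
<game_platforms>
	<!-- one <platform> element per platform; each name stays with its own logo -->
	<platform>
		<name>Xbox</name>
		<img_url>path to logo</img_url>
	</platform>
	<platform>
		<name>PC</name>
		<img_url>path to logo</img_url>
	</platform>
</game_platforms>
```

Every platform now has exactly the same shape, so one pattern (or one loop in an XML parser) covers any number of platforms, and there is no need to guess which img_url belongs to which name.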

2009-01-12, 14:22   #5
asphinx

dropping the upper case now..

The header tags are for display purposes; I thought it would be better if they were also scrapeable. But if they are unnecessary then by all means they will be cut out. No need to have data that isn't going to be scraped anyway.

(Now to just figure out how to display images in the browser based on URL data only)

Going to review the platform tag, and consequently the game_vgcrs tag, for better formatting. Thank you!

Other than that we're good? This is scrapeable with ease?

2009-01-12, 14:23   #6
spiff

yeah, xml is a piece of cake

2009-01-12, 14:30   #7
asphinx

Maybe for you. Mind you, I have never done anything with XML, so this is a first for me. I like a challenge though, so..

A final quickie, if I may? What formatting do you need to use to make the URL path in the XML easy to scrape, but also so that it displays in a browser?
I can't wrap my head around it: it seems easy to make it scrapeable, but in the browser view it just displays the path, not the actual image (styling with CSS).

2009-01-12, 14:32   #8
spiff

url format does not matter. you just need to remember that you are storing xml, so you need to escape special chars, in particular:

& -> &amp;
" -> &quot; (prob not relevant in a url)
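
As an illustration of the escaping rule above, a URL with a query string might be stored as follows. The host example.com, the query parameters, and the <poster> element with its attributes are all made up for this example; only the escaping is the point.

```xml
<game>
	<!-- raw URL: http://example.com/poster.php?game=halo&size=large -->
	<img_url>http://example.com/poster.php?game=halo&amp;size=large</img_url>

	<!-- inside an attribute delimited by double quotes, a literal quote must be written as &quot; -->
	<poster url="http://example.com/poster.php?game=halo&amp;size=large"
	        title="Halo &quot;Combat Evolved&quot; poster"/>
</game>
```

An XML parser hands back the plain & and " when it reads these values, so the stored URL is unchanged from the consumer's point of view. A literal < in text needs the same treatment and is written as &lt;.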

2009-01-12, 14:42   #9
asphinx

I'll keep that in mind. Yes I'm aware that formatting doesn't matter when storing URL data, but it matters when I want to be able to display it in a browser as well. And that's what I'm having issues with currently, then again, I should probably ask somewhere else for that...
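
One possible approach to the display problem, sketched under assumptions: the mockup already points at an XSL stylesheet (simple.xsl), and an XSLT template can copy the stored path into the src attribute of an HTML img element. The XML then keeps holding the plain URL for scraping while the browser shows the picture. Element names are taken from the mockup; the layout of the generated page is invented for illustration.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
	<xsl:output method="html"/>

	<xsl:template match="/GAME_LIBRARY">
		<html>
			<body>
				<xsl:for-each select="GAME">
					<h2><xsl:value-of select="GAME_TITLE/NAME"/></h2>
					<!-- {...} is an attribute value template: the text of IMG_URL becomes the src value -->
					<img src="{GAME_TITLE/IMG_URL}" alt="{GAME_TITLE/NAME}"/>
				</xsl:for-each>
			</body>
		</html>
	</xsl:template>
</xsl:stylesheet>
```

The path only shows up as text under CSS because CSS styles the element's text as it is; as far as I know it has no general way to use that text as an image source, which is why the sketch does the conversion in the stylesheet instead.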

2009-01-12, 18:46   #10
Gamester17
Team-XBMC Project Manager
Join Date: Sep 2003
Location: Sweden
Posts: 10,582
For your information...

Quote:
Originally Posted by asphinx
I've tried looking at thetvdb.com and themoviedb.org, but I epically fail to understand with what it's been written.

Both thetvdb.com and themoviedb.org originally used the same open source website framework as their base, and that framework is available for anyone to download and use for free. See http://sourceforge.net/projects/tvdb

Quote:
TVDB - Online TV Database

A web/XML interface and database schema for managing TV series information and user-submitted graphics. Will be interfaced by a number of HTPC plugins and software. Currently used by plugins for Meedio, Media Portal, and XBox Media Center.

Why reinvent the wheel if you don't have to?

Last edited by Gamester17; 2009-01-12 at 18:49.
All times are GMT +2.
Copyright © 2008, XBMC Project