| Scraper Development Developers forum for meta data scrapers. Scraper developers only! Not for posting feature requests, bugs, or end-user support requests! |
#1
Senior Member
Join Date: Dec 2006
Posts: 109
This is the first "chapter" (and the only one so far) of a "course" on scraper creation that I am writing as I learn how to make them. My intention is to add it to the wiki when it is finished. Please give me your opinion about it:
Scraper creation for dummies

Chapter one

First, some very important reference information. You do not need to read it right now, but keep the URLs on hand:

- Introduction to scraper creation: http://xbmc.org/wiki/?title=How_To_W..._Info_Scrapers
- Reference to scraper structure: http://xbmc.org/wiki/?title=Scraper.xml
- Tool to test scrapers: http://xbmc.org/wiki/?title=Scrap (download both files NOW, scrap.exe and libcurl.dll)
- Some info about regular expressions: http://xbmc.org/wiki/?title=Regular_...Ex%29_Tutorial
- More info on regular expressions from Wikipedia: http://en.wikipedia.org/wiki/Regex

I. How a scraper works

In a nutshell:

1) If there is a movie.nfo, use it (section NfoUrl) and then go to step 5.
2) Otherwise, generate a search URL from the file's name (section CreateSearchUrl) and get the results page.
3) From the results, generate a listing (section GetSearchResults) that has, for each "candidate" movie, a user-friendly name and one (or more) associated URLs.
4) Show the listing to the user, who chooses a movie and with it the associated URL(s).
5) Get the content of the URL(s) and extract from it (section GetDetails) the appropriate data for the movie to store in videodb.

Each one of those four sections is built as a RegExp entry that has this structure: Code:
<RegExp input=INPUT output=OUTPUT dest=DEST>
<expression>EXPRESSION</expression>
</RegExp>
- OUTPUT is a string that is built up by the RegExp.
- DEST is the name of the buffer where OUTPUT will be stored.
- EXPRESSION is a regular expression that manipulates INPUT to extract information from it as "fields". If EXPRESSION is empty, a field "1" that contains the whole INPUT is created automatically.

Here a "buffer" is just a memory section used for communication between each section and the rest of XBMC. There are ten buffers named 1 to 9. To refer to the *content* of a buffer you use "$$n", where n is the number of the buffer.

The fields are extracted from the input by EXPRESSION simply by marking patterns with "(" and ")", and they are numbered sequentially: the first one is \1, the second \2, up to a maximum of 9.

A very easy example: Code:
<RegExp input="$$1" output="\1" dest="3">
<expression></expression>
</RegExp>
- The output will be stored in buffer 3.
- As the expression is empty, all the input ($$1) will be stored in field \1.
- As the output is simply \1, all its content will be used for the output, that is, $$1.

So the end result is that the content of buffer 1 is copied to buffer 3.

If you do not know anything about regular expressions, this is the moment to make a quick study of their principles from the references above.

Another example: this time we use a string as input and a very simple regular expression to select part of it. Code:
<RegExp input="Movie: The Dark Knight" output="The title is \1" dest="3">
<expression>Movie: (.*)</expression>
</RegExp>
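To make the two examples above concrete, here is a minimal Python sketch that imitates how one RegExp entry is evaluated. It is only a simulation under stated assumptions, not XBMC's actual engine: `run_regexp` and the `buffers` dictionary are hypothetical names, only the first match is used, and the cleaning of fields is not modeled.

```python
import re

def run_regexp(buffers, input_text, expression, output, dest):
    """Imitate one <RegExp> entry (hypothetical helper, not XBMC code)."""
    # Expand every $$n in the input with the content of buffer n.
    text = re.sub(r"\$\$(\d+)", lambda m: buffers.get(int(m.group(1)), ""), input_text)
    if expression == "":
        # Empty expression: the whole input becomes field \1.
        fields = [text]
    else:
        match = re.search(expression, text)
        if match is None:
            return  # no match: dest keeps its previous value
        fields = list(match.groups(default=""))

    def field(m):
        # Replace \n in the output with field n (empty if it does not exist).
        index = int(m.group(1)) - 1
        return fields[index] if index < len(fields) else ""

    buffers[dest] = re.sub(r"\\([1-9])", field, output)

buffers = {1: "Hello, world"}
run_regexp(buffers, "$$1", "", r"\1", 3)
print(buffers[3])  # Hello, world
run_regexp(buffers, "Movie: The Dark Knight", r"Movie: (.*)", r"The title is \1", 3)
print(buffers[3])  # The title is The Dark Knight
```

The first call copies buffer 1 to buffer 3; the second one shows what the "Movie: (.*)" example leaves in buffer 3.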
II. The most important sections in a scraper

Now let's have a look at the 3 "important" sections: CreateSearchUrl, GetSearchResults and GetDetails. First, there is some basic information about them that we need to know.

CreateSearchUrl must generate (into buffer 3) the URL that will be used to get the listing of possible movies. To do that, you need the name of the file selected to be scraped, which XBMC stores in buffer 1.

GetSearchResults must generate (in buffer 8) the listing of movies (in user-ready form) and their associated URLs. The result of downloading the content of the URL generated by CreateSearchUrl is stored by XBMC in buffer 5. The listing must have this structure: Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<results>
<entity>
<title></title>
<url></url>
</entity>
<entity>
<title></title>
<url></url>
</entity>
</results>
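When a scraper misbehaves, it can help to check that a generated listing really has this shape. The following Python sketch does that with the standard library; `parse_results` is a hypothetical helper written for this post, not part of XBMC or scrap.exe, and the sample listing is made up.

```python
import xml.etree.ElementTree as ET

def parse_results(xml_text):
    """Return (title, [urls]) pairs from a <results> listing, or raise ValueError."""
    root = ET.fromstring(xml_text)
    if root.tag != "results":
        raise ValueError("root element must be <results>")
    pairs = []
    for entity in root.findall("entity"):
        title = entity.findtext("title")
        # An entity may carry one or more URLs.
        urls = [u.text for u in entity.findall("url") if u.text]
        if not title or not urls:
            raise ValueError("every <entity> needs a <title> and at least one <url>")
        pairs.append((title, urls))
    return pairs

# Made-up listing with two candidates.
sample = (
    "<results>"
    "<entity><title>Dummy</title><url>http://www.nada.com</url></entity>"
    "<entity><title>Dummy 2</title><url>http://www.nada.com/2</url></entity>"
    "</results>"
)
print(parse_results(sample))
```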
Once the user has selected a movie, the associated URL(s) will be downloaded.

Last, GetDetails must generate (in buffer 3) the listing of detailed information about the movie in the correct format, using the content of the URL(s) selected from GetSearchResults. The content of the first URL will be in $$1, the second in $$2 and so on. The listing must have this structure: Code:
<details>
<title></title>
<year></year>
<director></director>
<top250></top250>
<mpaa></mpaa>
<tagline></tagline>
<runtime></runtime>
<thumb></thumb>
<credits></credits>
<rating></rating>
<votes></votes>
<genre></genre>
<actor>
<name></name>
<role></role>
</actor>
<outline></outline>
<plot></plot>
</details>
- Some fields can be missing (?)
- <thumb> contains the URL of the image to be downloaded later
- <genre>, <credits>, <director> and <actor> can be repeated as many times as needed

Some important details to remember:

1) When you need to use special characters literally in a regular expression, do not forget to escape them: \ -> \\, ( -> \(, . -> \., etc.
2) Since the scraper itself is an XML file, the characters that have a meaning in XML cannot be used directly and must be replaced by their entities: & -> &amp;, < -> &lt;, > -> &gt;, " -> &quot;, ' -> &apos;
3) If you use non-ASCII characters (umlauts, ñ, etc.) in the output of your XML, they must be encoded as iso-8859-1.

III. Our first working scraper

Now, with all that information, let's create our first scraper. Just create a "dummy.xml" file with this content and study it a little. It should be fairly easy to understand with what we already know: Code:
<scraper name="dummy" content="movies" thumb="imdb.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <!-- Copies the content of buffer 1 to buffer 3 -->
  <NfoUrl dest="3">
    <RegExp input="$$1" output="\1" dest="3">
      <expression></expression>
    </RegExp>
  </NfoUrl>
  <!-- Always returns the same fixed search URL -->
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="&lt;url&gt;http://www.nada.com&lt;/url&gt;" dest="3">
      <expression></expression>
    </RegExp>
  </CreateSearchUrl>
  <!-- Always returns a listing with a single candidate -->
  <GetSearchResults dest="8">
    <RegExp input="$$1" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;&lt;entity&gt;&lt;title&gt;Dummy&lt;/title&gt;&lt;url&gt;http://www.nada.com&lt;/url&gt;&lt;/entity&gt;&lt;/results&gt;" dest="8">
      <expression></expression>
    </RegExp>
  </GetSearchResults>
  <!-- Always returns the same fake details -->
  <GetDetails dest="3">
    <RegExp input="$$1" output="&lt;details&gt;&lt;title&gt;The Dummy Movie&lt;/title&gt;&lt;year&gt;2008&lt;/year&gt;&lt;director&gt;Dummy Dumb&lt;/director&gt;&lt;top250&gt;&lt;/top250&gt;&lt;mpaa&gt;&lt;/mpaa&gt;&lt;tagline&gt;Some dumb dummies&lt;/tagline&gt;&lt;runtime&gt;&lt;/runtime&gt;&lt;thumb&gt;&lt;/thumb&gt;&lt;credits&gt;Dummy Dumb&lt;/credits&gt;&lt;rating&gt;&lt;/rating&gt;&lt;votes&gt;&lt;/votes&gt;&lt;genre&gt;&lt;/genre&gt;&lt;actor&gt;&lt;name&gt;Dummy Dumb&lt;/name&gt;&lt;role&gt;The dumb dummy&lt;/role&gt;&lt;/actor&gt;&lt;outline&gt;&lt;/outline&gt;&lt;plot&gt;Some dummies doing dumb things&lt;/plot&gt;&lt;/details&gt;" dest="3">
      <expression></expression>
    </RegExp>
  </GetDetails>
</scraper>
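The output attributes of this scraper carry XML inside an XML attribute, so both escaping rules from section II come into play. As a rough illustration (plain Python standard library, nothing XBMC-specific; the strings "movie.php" and "movieXphp" are made up), this sketch shows what each rule protects against:

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Rule 1: an unescaped dot matches any character, an escaped one only a dot.
loose = re.search(r"movie.php", "movieXphp") is not None
strict = re.search(re.escape("movie.php"), "movieXphp") is not None
print(loose, strict)  # True False

# Rule 2: text placed in an XML attribute needs & < > " replaced by entities.
output = "<url>http://www.nada.com</url>"
escaped = escape(output, {'"': "&quot;"})
print(escaped)  # &lt;url&gt;http://www.nada.com&lt;/url&gt;

# An XML parser reading the scraper line gives the original text back.
element = ET.fromstring('<RegExp input="$$1" output="%s" dest="3"/>' % escaped)
print(element.get("output") == output)  # True
```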
To test it in Windows, put scrap.exe and libcurl.dll (referenced at the beginning of this lesson) and the dummy.xml file in any directory, and then execute, for example: Code:
scrap dummy.xml "Hello, world"
You can also try it in a "real" XBMC: copy dummy.xml to XBMC\system\scrapers\video, restart XBMC, choose any directory from your sources that contains a video file not yet in the library, "set content" of the directory to use "dummy" as the scraper, and finally select "movie info" on the video file. All our fake data will be added to the video database.
#2
Team-XBMC Project Manager
Join Date: Sep 2003
Location: Sweden
Posts: 10,582
Great stuff! Can you please add this to the XBMC Online Manual (wiki) as well?
__________________
Always read the XBMC online-manual, FAQ and search the forum before posting. Do not e-mail XBMC-Team members directly asking for support. Read/follow the forum rules. For troubleshooting and bug reporting please make sure you read this first. |
#3
Grumpy Bastard Developer
Join Date: Nov 2003
Posts: 7,715
- You have 20 buffers to play with.
- Yes, you can skip the contents of tags (or the whole tags).
- You can use any encoding as long as you tag the scraper, or the returned XML, with it (remember, the scraper is an XML file, so it obeys the <?xml tag).
#4
Senior Member
Join Date: Dec 2006
Posts: 109
Quote:
Ok, so you use their content using $$1 ... $$20, I suppose. Just as I thought (and I tested it with the second scraper). I will modify dummy.xml to simplify it before posting it in the wiki.
Quote:
I have the second chapter almost ready; in it I rewrite the culturalia scraper step by step. The site is really easy to manage but at the same time allows some interesting treatments, and I am learning a lot. It is not ready yet for two reasons:
- I do not know who wrote the original culturalia scraper and I would like to include some attribution (I also hope he/she doesn't mind me using his/her work).
- I am not entirely satisfied with the regular expressions used in some fields. They work on all the examples I tried but are far from "elegant", and since this will go to the wiki for everyone to see...
My culturalia scraper, like the original, cannot obtain thumbnails since culturalia's web server blocks them. I could probably hack the connections (that may be for chapter four), but instead I plan to implement scraping them from IMDB (that was my original intention when I decided to learn the craft of scraper making). That will be, if I am capable enough, the content of chapter three.
#5
Senior Member
Join Date: Dec 2006
Posts: 109
Here is the first version of chapter 2. It is not ready yet, and I have some doubts (how does clear="yes" work when dest is something like "8+"?). Also, although the scraper works OK with scrap.exe, it behaved differently when I tried it for real in XBMC: it did not scrape the "genres" field correctly. Somehow scrap.exe and XBMC do not interpret the scraper in the same way.
Chapter two

Now that we know how to create a skeleton scraper, let's re-create a real one. I've chosen a fairly simple one, the one used to scrape the Spanish site culturalia.es (in fact the URL is www.culturalianet.com). First of all, we must know how the site we intend to write the scraper for works.

[?? Who wrote the original culturalia scraper? I would like to include a "thank you" here.]

Open www.culturalianet.com. To perform a search, write "la noche es nuestra" (the Spanish title of "We Own the Night") in the "buscar:" box at the top of the page. When you press the "Buscar" ("Search") button, the URL opened is: Code:
http://www.culturalianet.com/bus/resu.php?texto=la+noche+es+nuestra&donde=1
So CreateSearchUrl only has to put the text held in buffer 1 into that URL. For example: Code:
<RegExp input="$$1" output="http://www.culturalianet.com/bus/resu.php?texto=\1&amp;donde=1" dest="3">
<expression></expression>
</RegExp>
Now we must understand how the results page is formatted; for that, the "View selection source" function of Firefox is very useful. Just select the end of the header of the listing and some of the first entries and choose "View selection source". This is what I get: Code:
Se han encontrado 249 artículos. Se muestran del 1 al 25.
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=1">26</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=2">51</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=3">76</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=4">101</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=5">126</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=6">151</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=7">176</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=8">201</a>
<a href="resu.php?donde=1&texto=la%20noche%20es%20nuestra&muestro=9">226</a>
</td></tr><tr><td><b><a href="../art/ver.php?art=29405" target="_top">Noche es nuestra, La.</a></b></td></tr>
<tr><td colspan="2"><i>We Own the Night</i>. De James Gray (2007)</td></tr><tr><td><b><a href="../art/ver.php?art=23798" target="_top">10 + 2: La noche mágica.</a></b></td></tr>
<tr><td colspan="2"><i>10 2: La noche mágica</i>. De Miquel Pujol Lozano (2000)</td>
Every movie in the listing follows the same pattern, so the expression must be applied repeatedly: Code:
<expression repeat="yes">
The ID we get from: Code:
<a href="../art/ver.php?art=29405" target="_top">
with this pattern: Code:
<a href='../art/ver.php\?art=([0-9]*)' target='_top'>
The Spanish title: Code:
(.[^<]*)\.</a>
The original title: Code:
.[^\(]*<i>(.[^<]*)
The director's name: Code:
</i>\. De (.[^\(]*)
The year: Code:
 \(([0-9]*)\)
All together (inside the scraper file, < and > must be written as &lt; and &gt;): Code:
<expression repeat="yes">&lt;a href='../art/ver.php\?art=([0-9]*)' target='_top'&gt;(.[^&lt;]*)\.&lt;/a&gt;.[^\(]*&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;\. De (.[^\(]*) \(([0-9]*)\)</expression>
The fields we get are:
\1 ID of the movie's article in culturalianet.com
\2 Spanish title
\3 Original title
\4 Director's name
\5 Movie's year of first exhibition
Each of our <entity> elements will have a <title> in the form:
'Noche es nuestra, La' (We Own the Night) de James Gray (2007)
or, with our actual fields:
'\2' (\3) de \4 (\5)
There will also be a <url> generated by: Code:
http://www.culturalianet.com/art/ver.php?art=\1
The complete section is: Code:
<GetSearchResults dest="8">
<RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
<RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;'\2' (\3) de \4 (\5)&lt;/title&gt;&lt;url&gt;http://www.culturalianet.com/art/ver.php?art=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
<expression repeat="yes">&lt;a href='../art/ver.php\?art=([0-9]*)' target='_top'&gt;(.[^&lt;]*)\.&lt;/a&gt;.[^\(]*&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;\. De (.[^\(]*) \(([0-9]*)\)</expression>
</RegExp>
<expression noclean="1" />
</RegExp>
</GetSearchResults>
Also, and this is an XML standard, you can shorten empty XML elements like <expression></expression> by writing instead: <expression/>

So, how does XBMC execute this? It goes to the inner RegExp first and, using input="$$1" (the content of our search URL), applies the expression to it and generates our fields: Code:
<expression repeat="yes">&lt;a href='../art/ver.php\?art=([0-9]*)' target='_top'&gt;(.[^&lt;]*)\.&lt;/a&gt;.[^\(]*&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;\. De (.[^\(]*) \(([0-9]*)\)</expression>
which generates this output in buffer 5: Code:
<entity><title>'\2' (\3) de \4 (\5)</title><url>http://www.culturalianet.com/art/ver.php?art=\1</url></entity>
Then the outer RegExp gets executed. It uses as input $$5, which has just been generated. It does not modify anything (an empty <expression> means all the input goes to \1), but remember to use the noclean attribute to keep the necessary formatting. It simply takes all the <entity> elements generated and inserts them into the correct XML structure: Code:
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?><results>\1</results>
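The inner and outer steps can be tried outside XBMC with a few lines of Python. This is only a sketch under assumptions: the page slice is a hypothetical copy of the two entries shown earlier, rewritten with single quotes around attribute values because that is what the expression expects; Python's `re` module stands in for XBMC's regular expression engine; and the cleaning of fields is not modeled.

```python
import re

# Hypothetical slice of the results page (single-quoted attributes assumed).
page = (
    "<tr><td><b><a href='../art/ver.php?art=29405' target='_top'>"
    "Noche es nuestra, La.</a></b></td></tr> "
    "<tr><td colspan='2'><i>We Own the Night</i>. De James Gray (2007)</td></tr>"
    "<tr><td><b><a href='../art/ver.php?art=23798' target='_top'>"
    "10 + 2: La noche mágica.</a></b></td></tr> "
    "<tr><td colspan='2'><i>10 2: La noche mágica</i>. De Miquel Pujol Lozano (2000)</td>"
)

# The expression of the inner RegExp, without XML entities.
EXPRESSION = (
    r"<a href='../art/ver.php\?art=([0-9]*)' target='_top'>(.[^<]*)\.</a>"
    r".[^\(]*<i>(.[^<]*)</i>\. De (.[^\(]*) \(([0-9]*)\)"
)
# The output template of the inner RegExp.
ENTITY = (
    r"<entity><title>\2 (\3) de \4 (\5)</title>"
    r"<url>http://www.culturalianet.com/art/ver.php?art=\1</url></entity>"
)

# Inner RegExp with repeat="yes": one <entity> per match, collected in buffer 5.
buffer5 = "".join(m.expand(ENTITY) for m in re.finditer(EXPRESSION, page))
# Outer RegExp with an empty expression: the whole of buffer 5 becomes \1.
buffer8 = "<results>" + buffer5 + "</results>"
print(buffer8)
```

Running it prints a listing with two entities, one per movie in the slice.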
#6
Senior Member
Join Date: Dec 2006
Posts: 109
Now XBMC will show the user the list of movies and one will be selected. The associated URL, the article page of the movie, will be downloaded and fed to buffer 1, and we need to parse it to extract the information we need.
Go now to a movie article, like http://www.culturalianet.com/art/ver.php?art=29405, and look at the underlying HTML code. Much as we did when parsing the search results page, we must detect the patterns in the page that allow us to select the correct fields, and then use them to build our <details> XML structure. Some parts are fairly straightforward, like title, duration, plot or year. This expression extracts the Spanish title, the year and the original title into fields 1, 2 and 3 respectively: Code:
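'titulo2'>(.[^\<]*)\. \(([0-9]*)\)</font></u><br><br><i>(.[^<]*)</i>
and the fields are used to build this output: Code:
<title>\1 (\3)</title><originaltitle>\3</originaltitle><year>\2</year>
Other parts, like the actors, need two steps. First we isolate the block of the page between "Actores:" and "Productor:" and store it in buffer 7: Code:
<RegExp input="$$1" output="\1" dest="7">
<expression noclean="1">Actores:([^:]*)Productor:</expression>
</RegExp>
Then we apply this expression repeatedly to that block to pick each name: Code:
<expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
and generate one of these for every name: Code:
<actor><name>\1</name><role></role></actor>
Putting both steps together: Code: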
<RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;&lt;/role&gt;&lt;/actor&gt;" dest="8">
<RegExp input="$$1" output="\1" dest="7">
<expression noclean="1">Actores:([^:]*)Productor:</expression>
</RegExp>
<expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
</RegExp>
Something we haven't used before: when we want the output to be appended to an existing buffer instead of overwriting it, we simply write, for example, dest="7+". We generate all the items into buffer 8 and use buffer 7 as a temporary buffer in each RegExp.

For the final version of the scraper we use some new attributes of <expression>, with these meanings:
- repeat="yes" -> repeats the expression as long as there are matches
- noclean="1" -> does NOT strip HTML tags and special characters from field 1. The field can be 1 ... 9. By default, all fields are "cleaned"
- trim="1" -> trims white space from field 1. The field can be 1 ... 9
- clear="yes" -> if there is no match for the expression, dest is cleared. By default, dest keeps its previous value

[?NOTE: what happens when using clear="yes" if dest is "8+"?]

So, without further ado, this is the whole scraper. The extraction of the different fields is similar to the "actors" field we saw. This is just one of many possible ways of getting the info and probably not the best one; it is slightly different from the original culturalia.xml scraper.

One additional comment: to avoid trouble with some special characters ("é" in "Género", for example), which can get different encodings depending on your text editor and can be difficult to type, I'm using a dot instead, since an unescaped dot means "any character" in regular expressions. Code:
<scraper name="Culturalia.es V2" content="movies" thumb="culturalia.gif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <NfoUrl dest="3">
    <RegExp input="$$1" output="http://www.culturalianet.com/art/ver.php?art=\1" dest="3">
      <expression noclean="1">art/ver\.php\?art=([0-9]*)</expression>
    </RegExp>
  </NfoUrl>
  <CreateSearchUrl dest="3">
    <RegExp input="$$1" output="http://www.culturalianet.com/bus/resu.php?texto=\1&amp;donde=1" dest="3">
      <expression noclean="1"/>
    </RegExp>
  </CreateSearchUrl>
  <GetSearchResults dest="8">
    <RegExp input="$$5" output="&lt;?xml version=&quot;1.0&quot; encoding=&quot;iso-8859-1&quot; standalone=&quot;yes&quot;?&gt;&lt;results&gt;\1&lt;/results&gt;" dest="8">
      <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\2 (\3) de \4 (\5)&lt;/title&gt;&lt;url&gt;http://www.culturalianet.com/art/ver.php?art=\1&lt;/url&gt;&lt;/entity&gt;" dest="5">
        <expression repeat="yes">&lt;a href='../art/ver.php\?art=([0-9]*)' target='_top'&gt;(.[^&lt;]*)\.&lt;/a&gt;.[^\(]*&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;\. De (.[^\(]*) \(([0-9]*)\)</expression>
      </RegExp>
      <expression noclean="1"/>
    </RegExp>
  </GetSearchResults>
  <GetDetails dest="3">
    <RegExp input="$$8" output="&lt;details&gt;\1&lt;/details&gt;" dest="3">
      <!-- Titles, year -->
      <RegExp input="$$1" output="&lt;title&gt;\1 (\3)&lt;/title&gt;&lt;originaltitle&gt;\3&lt;/originaltitle&gt;&lt;year&gt;\2&lt;/year&gt;" dest="8">
        <expression trim="1" noclean="1">'titulo2'&gt;(.[^\&lt;]*)\. \(([0-9]*)\)&lt;/font&gt;&lt;/u&gt;&lt;br&gt;&lt;br&gt;&lt;i&gt;(.[^&lt;]*)&lt;/i&gt;</expression>
      </RegExp>
      <!-- Director's names -->
      <RegExp input="$$7" output="&lt;director&gt;\1&lt;/director&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">Director:([^:]*)Actores:</expression>
        </RegExp>
        <expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
      </RegExp>
      <!-- Runtime -->
      <RegExp input="$$7" output="&lt;runtime&gt;\1 minutos&lt;/runtime&gt;" dest="8+">
        <RegExp input="$$1" output="\1&lt;" dest="7">
          <expression noclean="1" clear="yes">Duraci.n:(.*)minutos</expression>
        </RegExp>
        <expression noclean="1" trim="1">&gt;(.[^&lt;&gt;]*)&lt;</expression>
      </RegExp>
      <!-- Thumbnail -->
      <RegExp input="$$1" output="&lt;thumb&gt;http://www.culturalianet.com/imatges/articulos/\1-1.jpg&lt;/thumb&gt;" dest="8+">
        <expression>imatges/articulos/([0-9]*)-</expression>
      </RegExp>
      <!-- Credits -->
      <RegExp input="$$7" output="&lt;credits&gt;\1&lt;/credits&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">Gui.n:([^:]*)Fotograf.a:</expression>
        </RegExp>
        <expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
      </RegExp>
      <!-- Genres -->
      <RegExp input="$$7" output="&lt;genre&gt;\1&lt;/genre&gt;" dest="8+">
        <RegExp input="$$9" output="\1" dest="7">
          <RegExp input="$$1" output="\1/" dest="9">
            <expression>G.nero:([^:]*)Nacionalidad:</expression>
          </RegExp>
          <expression>&gt;(.[^&lt;&gt;]*)&lt;</expression>
        </RegExp>
        <expression repeat="yes" trim="1">(.[^/]*)/</expression>
      </RegExp>
      <!-- Actors -->
      <RegExp input="$$7" output="&lt;actor&gt;&lt;name&gt;\1&lt;/name&gt;&lt;role&gt;&lt;/role&gt;&lt;/actor&gt;" dest="8+">
        <RegExp input="$$1" output="\1" dest="7">
          <expression noclean="1">Actores:([^:]*)Productor:</expression>
        </RegExp>
        <expression noclean="1" repeat="yes">&gt;(.[^&lt;&gt;]*)&lt;</expression>
      </RegExp>
      <!-- Plot -->
      <RegExp input="$$1" output="&lt;plot&gt;\1&lt;/plot&gt;" dest="8+">
        <expression>Sinopsis:&lt;/b&gt;&lt;br&gt;([^=]*)&lt;br&gt;&lt;br&gt;</expression>
      </RegExp>
      <expression noclean="1"/>
    </RegExp>
  </GetDetails>
</scraper>
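The two-step pattern used for directors, credits and actors (isolate a block in buffer 7, then append one item per match to buffer 8) can be imitated in Python to see what ends up in <details>. This is a sketch under assumptions, not XBMC's engine: `append_items` is a hypothetical helper, the page fragment is made up (the real culturalianet markup may differ), and cleaning and trimming are not modeled.

```python
import re

# Hypothetical fragment of a movie page; the real markup may differ.
page = (
    "<b>Director:</b><br><a href='x'>James Gray</a><br>"
    "<b>Actores:</b><br><a href='x'>Joaquin Phoenix</a><br>"
    "<a href='x'>Mark Wahlberg</a><br><b>Productor:</b>"
)

def append_items(buffers, start, end, template):
    """Isolate the block between two labels, then append one item per name."""
    block = re.search(start + r"([^:]*)" + end, buffers[1])
    if block is None:
        return  # no match: buffer 8 is left untouched
    buffers[7] = block.group(1)  # temporary buffer, overwritten on every call
    for match in re.finditer(r">(.[^<>]*)<", buffers[7]):
        # dest="8+" appends to the buffer instead of overwriting it.
        buffers[8] += template.replace(r"\1", match.group(1))

buffers = {1: page, 7: "", 8: ""}
append_items(buffers, "Director:", "Actores:", r"<director>\1</director>")
append_items(buffers, "Actores:", "Productor:",
             r"<actor><name>\1</name><role></role></actor>")
# The outermost RegExp wraps everything collected in buffer 8.
details = "<details>" + buffers[8] + "</details>"
print(details)
```

With this fragment the result holds one <director> and two <actor> items, in the order in which the sections were run.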
#7
Senior Member
Join Date: Dec 2006
Posts: 109
The first two chapters (with some corrections) are now in the wiki. Unfortunately, wrapping the code in <xml> to format it correctly makes the articles very wide, and that affects not only the code but also the text and makes everything very unpleasant to read. Does anybody know how to fix that?
#8
Senior Member
Join Date: Dec 2006
Posts: 109
This is the link to the article: HOW-TO Write Media Info Scrapers (the complete dummies guide)
#9
Team-XBMC Project Manager
Join Date: Sep 2003
Location: Sweden
Posts: 10,582
Thank you very much pko66 (and spiff of course) but...
Quote:
http://xbmc.org/wiki/?title=HOW-TO_W...mmies_guide%29
http://xbmc.org/wiki/?title=HOW-TO_W...ntroduction%29
http://xbmc.org/wiki/?title=Scraper.xml
Can at least the first two listed above be merged into one single article called "HOW-TO write Media Info Scrapers"?
#10
Senior Member
Join Date: Dec 2006
Posts: 109
Quote:
The other how-to, the one I've renamed as "introduction", can very well be merged (in fact it already is, a little, since it was one of my main sources of information). My idea is to finish the "course" first (probably two more chapters), then modify it to incorporate all the info from "introduction" that is missing, and then simply erase the latter. BTW, is there something that can be done about the horizontal scroll bar? It makes the article pretty unreadable.