Any plans for reformatting the Scraper XML format any time soon?


Nicezia
2009-04-25, 12:44
The only reason I ask is that I'm developing a scraper editor/tester for the current format of the XML...

Sneak Peek: http://i694.photobucket.com/albums/vv306/Nicezia/temp.jpg
So far it's awfully simple, but for an amateur programmer it's pretty complex behind the scenes.

I've already implemented all of the options in the class, minus the custom functions (I'm having a difficult time wrapping my head around just how I want to implement those). So far, nesting of RegExps is also not as complex as the actual XML allows.


But if I follow my todo list it should be able to create the simpler movie scrapers by Thursday or Friday...

So again my question is, do you foresee me having to relearn the structure of your scraper XML anytime in the next few months?

Schenk2302
2009-04-25, 13:17
Hi,

it seems you could read my mind. I'm really looking forward to your editor :)

Thanks in advance

Schenk

vdrfan
2009-04-25, 13:39
Cool. Nope, scrapers won't change except for the regexp stuff, and the layout (backend) will most likely stay the same. Any plans to make it for other platforms as well? Maybe Mono is the way to go.

Nicezia
2009-04-25, 20:26
Yeah, actually I've just made a copy of the class written in C++. Considering my C++ isn't really up to par, the C++ version may be a little longer in the making...
But doing it in Visual Basic for now is giving me enough insight to visualize how to port it to C++ (sans .NET). The only real problems I see so far in making it cross-platform are:
A) breaking my reliance on LINQ (but LINQ makes XML so easy...)
B) I have no experience building a GUI without the Visual Studio designer (I am looking into X Window programming, though)

I'll look into cross-platform support, and Mono in particular, as soon as I can get a working version out. First things first.

rwparris2
2009-04-25, 21:26
Looks cool!
I started writing a tester in python, but haven't gotten very far with it.

If you know any python maybe we could work together? I don't have any experience with c#/c++ so I wouldn't be much help there.

Also, IMO a GUI for writing scrapers isn't really necessary. It is nice, but being crossplatform is nicer :) Editing the scraper's XML directly isn't hard if you have a simple way to test it.

spiff
2009-04-25, 21:33
great stuff!

the scraper xml format will not change, however i will change the default to matching regular expressions case insensitive and add a tag to indicate sensitive matching (see #6262). also i plan some minor changes on the returned xml format, but that doesn't really matter for you.

the key to the nesting stuff and other functions is recursive code (see CIMDB::InternalGetDetails).
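
For a tester written in Python, the planned case-insensitive default could be modelled roughly as below. This is only a sketch: the attribute name `cs` and the helper `compile_expression` are assumptions made for illustration, since the thread does not say what the new tag or attribute will be called.

```python
import re
import xml.etree.ElementTree as ET


def compile_expression(expr_node):
    # Hypothetical: cs="yes" requests case-sensitive matching.
    # Without it, matching is case insensitive (the planned default).
    flags = 0 if expr_node.get("cs") == "yes" else re.IGNORECASE
    return re.compile(expr_node.text or "", flags)


default_expr = ET.fromstring(
    "<expression>&lt;TITLE&gt;(.*?)&lt;/TITLE&gt;</expression>")
strict_expr = ET.fromstring(
    '<expression cs="yes">&lt;TITLE&gt;(.*?)&lt;/TITLE&gt;</expression>')

html = "<title>Alien</title>"
default_match = compile_expression(default_expr).search(html)
strict_match = compile_expression(strict_expr).search(html)

print(default_match.group(1))  # Alien
print(strict_match)            # None
```

Reading the flag from an attribute keeps the editor open to new options without changing how expressions are stored.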

Nicezia
2009-04-25, 23:05
rwparris2, sure, it isn't necessary, but I started this because I got tired of making tiny, stupid mistakes, like leaving the semicolon off the end of an entity. First I wrote a little program to replace entities in a string of text. Then I thought it would be nice to see what goes into which buffer as I build my sections, and then I figured, why not take it all the way?

As for Python, that's my next area of study (I'm thinking about making a Python plugin for XBMC that uses libpurple, so instant messaging can be integrated into XBMC).

spiff, cool: I've already allowed for expanding to support new attributes (assuming you plan to use the case insensitivity as a conditional attribute).

rwparris2
2009-04-25, 23:24
What are your thoughts on this thing working cross-platform? Will you be able to get away from LINQ? Can you make the GUI and backend separate so that it can be run from a command line?

I need to know whether to continue with my python implementation or not... linux & osx users could use a similar tool as well :)

Nicezia
2009-04-25, 23:33
Definitely on the cross-platform front. I plan on making a console implementation of it as well (my initial reason for rewriting the class in C).

When it gets to that point, however, I might need some help setting it up so that it can compile in different environments (makefiles and all that), since I have never worked with compiler options before; my only programming experience is with Visual Studio flavors. I'm going to switch the entire project to Mono, as suggested, as soon as I finish coding it in Visual Basic. Seeing as Mono has Linux and Mac implementations, that should solve the cross-platform problem effectively.

It would be shortsighted of me not to make it cross-platform, since I actually use XBMC on my Linux box.

Nicezia
2009-04-25, 23:43
spiff, thanks a million for that reference (CIMDB::InternalGetDetails)! I should be able to implement nesting fully now.

althekiller
2009-04-25, 23:49
May I suggest changing the name to "ScrapeMe"? :)

No sense in promoting all the uneducated folk calling them "scrappers" in the forums.

Nicezia
2009-04-26, 04:44
The name of it is ScrapeMe; that was just a typo from when I created the project, and I've just been too busy coding to go back and fix it. As you'll notice, the name is correct in the actual running window. (I got the name because I was listening to Nirvana's "Rape Me" when I came up with the idea!)

Nicezia
2009-04-26, 08:19
Okay, I have a few questions about how the scraper XML communicates with XBMC.

When it starts executing nested statements, does it start from the deepest nested RegExp or from the outermost RegExp?

I tried reading the C++ source code, but it's a bit complex for me, with lots of stuff I still don't understand.

I've already got the tester working at the root-level RegExp, and I've managed to get an entire scraper running in the tester all the way through (a really simple one I wrote, with no custom functions and no nested expressions).


<RegExpA>
  <RegExpB>
    <RegExpC>
      <expression/>
    </RegExpC>
    <expression/>
  </RegExpB>
  <expression/>
</RegExpA>


Would A or C be the first to execute?

And are there 9 buffers in total to save data, or 9 buffers per expression?

spiff
2009-04-26, 14:15
it's evaluated as a lifo, last in, first out, i.e. innermost first, so C then B then A.

there is a total of 20 buffers and they are global to the scraper parser.
usually these are cleared after executing a function, unless the clearbuffers="no" param is set.
the reason for having this parameter available is that it allows passing info between functions that are executed one after another (i.e. <url function="foo"..> chains)
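
A minimal Python sketch of the evaluation order and buffer handling described above. It is not XBMC's implementation: the helper names (`substitute`, `run_regexp`, `run_function`), the simplified handling of the `input`, `output` and `dest` attributes, and the sample `GetDetails` function are assumptions made for illustration. Only the innermost-first order, the 20 global buffers and the `clearbuffers="no"` behaviour come from the thread.

```python
import re
import xml.etree.ElementTree as ET

NUM_BUFFERS = 20  # per the thread: 20 buffers, global to the scraper parser


def substitute(template, buffers, groups=()):
    # Expand $$N (buffer N) and \N (capture group N) in one pass.
    def repl(m):
        if m.group(1) is not None:
            return buffers[int(m.group(1))]
        return groups[int(m.group(2)) - 1] or ""
    return re.sub(r"\$\$(\d+)|\\(\d)", repl, template)


def run_regexp(node, buffers):
    # Nested <RegExp> children run before their parent: innermost first.
    for child in node.findall("RegExp"):
        run_regexp(child, buffers)
    # Assumption: an empty <expression/> captures the whole input.
    pattern = node.find("expression").text or "(.*)"
    source = substitute(node.get("input", "$$1"), buffers)
    match = re.search(pattern, source, re.DOTALL)
    if match:
        dest = int(node.get("dest", "1"))
        buffers[dest] = substitute(node.get("output", r"\1"),
                                   buffers, match.groups())


def run_function(func_node, buffers):
    for node in func_node.findall("RegExp"):
        run_regexp(node, buffers)
    result = buffers[int(func_node.get("dest", "1"))]
    # Buffers are cleared after a function unless clearbuffers="no".
    if func_node.get("clearbuffers", "yes") != "no":
        for i in range(1, NUM_BUFFERS + 1):
            buffers[i] = ""
    return result


SCRAPER = r"""
<GetDetails dest="3">
  <RegExp input="$$2" output="&lt;title&gt;\1&lt;/title&gt;" dest="3">
    <RegExp input="$$1" output="\1" dest="2">
      <expression>&lt;h1&gt;(.*?)&lt;/h1&gt;</expression>
    </RegExp>
    <expression>(.+)</expression>
  </RegExp>
</GetDetails>
"""

buffers = [""] * (NUM_BUFFERS + 1)  # index 0 unused, buffers are 1..20
buffers[1] = "<html><h1>Alien</h1></html>"
result = run_function(ET.fromstring(SCRAPER.strip()), buffers)
print(result)  # <title>Alien</title>
```

The outer RegExp reads buffer 2, which only the inner RegExp fills, so the example produces a result only because the inner one runs first.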

Nicezia
2009-04-26, 22:20
So if I'm understanding right, there are 20 global buffers (cleared between functions unless specified otherwise), there are 9 buffers available for RegExp captures, and execution of expressions works its way backwards towards the root expression?

The last question I need to ask is about noclean: exactly what HTML is stripped if this is NOT set?

spiff
2009-04-27, 11:15
yes, yes and yes, exactly as i have already explained. the nine buffers available for regexp captures are a property of your regexp parser; 9 is the minimum it must support.

i can hardly be more precise than the code itself: see CHTMLUtil::RemoveTags

Nicezia
2009-04-28, 16:20
Sorry to bother you one more time, but I'm just about done with my RegExp engine, and there's one more thing I need to know. I wouldn't ask, but I'm not good at C++ at all and the regexp engine looks all Greek to me (other things, like the httputil and XML utils, were easy as pie to read through).

Do you make use of nested parentheses in regular expressions in the scrapers?

spiff
2009-04-28, 16:36
they can so you must support it yah
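
As an illustration of what supporting nested parentheses involves, here is how Python's `re` module numbers nested capture groups: by the position of each opening parenthesis, left to right. Perl-style engines generally follow the same convention, but whether XBMC's engine agrees in every detail is an assumption, not something stated in the thread. The sample pattern and input are made up.

```python
import re

# Group 1 wraps groups 2 and 3; numbering follows the opening parentheses.
pattern = re.compile(r"((\d{4})-(\d{2}))-(\d{2})")
m = pattern.search("Released 2009-04-25")

print(m.group(1))  # 2009-04
print(m.group(2))  # 2009
print(m.group(3))  # 04
print(m.group(4))  # 25
```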

Schenk2302
2009-05-21, 23:59
Hi Nicezia,

are you planning to release the editor in the near future?

Thanks in advance

Schenk

Nicezia
2009-05-22, 05:53
The editor will be an extension of my ScraperXML (https://sourceforge.net/projects/scraperxml/) library, so my first goal is to support all scraper types (and a few I've come up with on my own) in that library before moving on to the editor. Now that I have nearly all of the base methods coded into the library, it's just a matter of accounting for the different functions of the other scraper types, and development of the library is moving pretty fast. So I will say (and don't hold me to this) that in a month's time I should be able to start developing the scraper editor.

Even though it's not really necessary, I may even turn the whole scraper editor project into an XML editor, so you can test scrapers as you make changes without needing an extra program.