Scraping helper tools for plugin coders


queeup
2009-09-04, 09:31
Scraping

When writing plugins, part of your job is getting content from a website. If you're in luck, everything you need is nicely presented in one or more RSS feeds. If this is not the case, you'll have to extract the info you need from the individual pages of the website. To do this you have to dive into the HTML source of a webpage. Fortunately there are some handy tools and add-ons available to help you find the data you want.

Firefox Add-ons
XPath Checker (https://addons.mozilla.org/en-US/firefox/addon/1095)
To gather data from a webpage in your Python code, you use a lot of XPath expressions. With this add-on you can easily try out your expressions and check that they collect the right data before implementing them in your Python code.
http://wiki.plexapp.com/images/thumb/6/6f/Xpath_checker.png/450px-Xpath_checker.png
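Once an expression checks out in the add-on, using it from Python can look roughly like the sketch below. It is only an illustration: the sample markup, the class names and the find_episodes helper are made up, and it uses the standard library's xml.etree.ElementTree (Python 3), which understands only a subset of XPath and needs well-formed markup. Real-world HTML usually needs a more forgiving parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a real page.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="episode">
      <a href="/watch?id=1">Episode 1</a>
    </div>
    <div class="episode">
      <a href="/watch?id=2">Episode 2</a>
    </div>
    <div class="advert">
      <a href="/buy">Buy now</a>
    </div>
  </body>
</html>
"""


def find_episodes(page_source):
    """Return (title, href) pairs for links inside <div class="episode">."""
    root = ET.fromstring(page_source)
    # An expression like this one can be tried in XPath Checker first.
    links = root.findall(".//div[@class='episode']/a")
    return [(link.text, link.get("href")) for link in links]


episodes = find_episodes(SAMPLE_PAGE)
print(episodes)
```

The advert link is skipped because its parent div does not match the class predicate.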

XPather (https://addons.mozilla.org/en-US/firefox/addon/1192)
Requires: DOM inspector plugin (https://addons.mozilla.org/en-US/firefox/addon/6622)
...generates XPaths while browsing or inspecting HTML/XML/*ML documents; evaluates your XPaths and inspects the results; extracts the content.

The XPather is a simple Firefox extension that integrates both with the browser and its DOM Inspector. Thus, it's very lightweight and cross-platform. It is valuable mainly as a web/XML-app development and hacking tool.

HttpFox (https://addons.mozilla.org/en-US/firefox/addon/6647)
HttpFox logs all HTTP connections and gives a nice overview of which files are being downloaded. This can help you find the paths or URLs of images and videos.
http://wiki.plexapp.com/images/thumb/7/78/HttpFox.png/450px-HttpFox.png

Webdeveloper Toolbar (https://addons.mozilla.org/en-US/firefox/addon/60)
This toolbar offers a *lot* of options. A couple of nice ones that might come in handy are:


Outline > Outline Current Element: draws a rectangle around the item under the mouse pointer and displays the path to the element in the document

http://wiki.plexapp.com/images/thumb/1/18/Webdeveloper_toolbar_outline.png/450px-Webdeveloper_toolbar_outline.png


Information > Display Element Information: gives you information about the element itself, but also about its parent and child elements

http://wiki.plexapp.com/images/thumb/b/b2/Webdeveloper_toolbar_info.png/450px-Webdeveloper_toolbar_info.png


View Source > View Generated Source: Can't find an element you're looking for? Chances are the element was dynamically added to the webpage (for instance a Flash player that was inserted on page load with the popular SWFObject JavaScript library). With this option you can view the source after any scripts have changed it.
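When the player is inserted by a script, the values you need are often still sitting in the script text of the regular source, so a regular expression can pull them out. The sketch below assumes a page that uses SWFObject 1.x style addVariable calls. The sample source, the URLs and the player_variables helper are made up for illustration, and a real page may quote or name things differently.

```python
import re

# Hypothetical page fragment: the player div is empty in the regular
# source, the Flash player is added by the script on page load.
SAMPLE_SOURCE = """
<div id="player"></div>
<script type="text/javascript">
  var so = new SWFObject("/player.swf", "mpl", "470", "320", "9");
  so.addVariable("file", "http://example.com/videos/episode1.flv");
  so.addVariable("image", "http://example.com/thumbs/episode1.jpg");
  so.write("player");
</script>
"""

# Matches addVariable("name", "value") with double-quoted arguments.
ADD_VARIABLE = re.compile(r'addVariable\(\s*"([^"]+)"\s*,\s*"([^"]+)"\s*\)')


def player_variables(page_source):
    """Return the name/value pairs passed to addVariable as a dict."""
    return dict(ADD_VARIABLE.findall(page_source))


variables = player_variables(SAMPLE_SOURCE)
print(variables["file"])
```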


Firebug (https://addons.mozilla.org/en-US/firefox/addon/1843) & Firebug Extensions (http://getfirebug.com/extensions/index.html)
Firebug integrates with Firefox to put a wealth of development tools at your fingertips while you browse. You can edit, debug, and monitor CSS, HTML, and JavaScript live in any web page. Firebug also has some nice extensions.
https://addons.mozilla.org/en-US/firefox/images/p/11826/

Fiddler Web Debugger (http://www.fiddler2.com/fiddler2/)
Fiddler is a Web Debugging Proxy which logs all HTTP(S) traffic between your computer and the Internet. Fiddler allows you to inspect all HTTP(S) traffic, set breakpoints, and "fiddle" with incoming or outgoing data. Fiddler includes a powerful event-based scripting subsystem, and can be extended using any .NET language.

Fiddler is freeware and can debug traffic from virtually any application, including Internet Explorer, Mozilla Firefox, Opera, and thousands more.
http://i32.tinypic.com/4q56dc.png

Internet Explorer Developer Toolbar (http://www.microsoft.com/windows/internet-explorer/developers.aspx)
Built into IE 8. For IE 6 & 7, download it here (http://www.microsoft.com/downloadS/details.aspx?familyid=E59C3964-672D-4511-BB3E-2D5E1DB91038&displaylang=en)
Developers have all of the tools that they need, right out of the box without the need to download or install anything. By simply hitting the F12 key to bring them up, developers have access to:

DOM manipulation & HTML tree view
CSS tracing
JavaScript debugging
JavaScript profiling


Source : Plex Wiki (http://wiki.plexapp.com/index.php/Scraping_helper_tools)

stacked
2009-09-04, 10:41
Thanks for the info. Live HTTP Headers (https://addons.mozilla.org/en-US/firefox/addon/3829), Firebug (https://addons.mozilla.org/en-US/firefox/addon/1843), and Wireshark (http://www.wireshark.org/) are also pretty useful.

sansat
2009-09-04, 18:18
Thanks for information on these tools.

I tried the Web Developer toolbar on the link below, which uses JavaScript:

http://www.rajshri.com/midpage.aspx?cntid=33663

With the regular page source, going to pages 2, 3, etc. of the above link still gives only the source for the first page of episodes. Using the Web Developer toolbar (View Source > View Generated Source) I get the source after the JavaScript has executed, and it shows the episodes listed on the other pages (2, 3, etc.), which is very nice. But how can we get the URL of each page for use in the match=re.compile('') commands? Or what would be the method to follow for such sites, where the parent URL lists its pages inside frames or JavaScript?

Please let me know.

Thanks

sansat
2009-09-08, 17:34
Does anyone have a solution for JavaScript in the situation described above?

jmarshall
2009-09-09, 02:20
Thanks for taking the initiative guys! Most useful.

Cheers,
Jonathan

rwparris2
2009-09-09, 02:39
stacked wrote: "Thanks for the info. Live HTTP Headers (https://addons.mozilla.org/en-US/firefox/addon/3829), Firebug (https://addons.mozilla.org/en-US/firefox/addon/1843), and Wireshark (http://www.wireshark.org/) are also pretty useful."
Agreed, Wireshark and Firebug make an awesome combo.

The XPath Checker (https://addons.mozilla.org/en-US/firefox/addon/1095) looks useful, thanks for that one!

sansat wrote: "...or what would be the method to follow for such sites where the parent URL, has listed pages inside frames or java script etc.."
That information is coming from somewhere. Using Firebug, Wireshark, and attempting to read the JavaScript itself are the best methods to find out where.
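As one illustration of what "finding out where" can lead to: on ASP.NET WebForms pages (the .aspx extension hints at this, but it is only an assumption here), pager links often call __doPostBack, which re-submits the page's form with the hidden fields plus __EVENTTARGET and __EVENTARGUMENT. If the traffic captured in Firebug or Wireshark shows such a POST, its body can be rebuilt in Python. The sketch below uses the Python 3 standard library. The sample form, the 'pager$page3' target and the helper names are made up, and the code only builds the request body, it does not send it.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlencode


class HiddenFieldParser(HTMLParser):
    """Collects the name/value pairs of all hidden <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback(page_source, event_target, event_argument=""):
    """Return a URL-encoded POST body imitating a __doPostBack call."""
    parser = HiddenFieldParser()
    parser.feed(page_source)
    fields = dict(parser.fields)
    # These two fields are what __doPostBack fills in before submitting.
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields)


# Hypothetical form, standing in for the first page fetched from the site.
SAMPLE_FORM = """
<form method="post" action="landing.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
  <a href="javascript:__doPostBack('pager$page3','')">3</a>
</form>
"""

body = build_postback(SAMPLE_FORM, "pager$page3")
print(parse_qs(body, keep_blank_values=True)["__EVENTTARGET"])
```

The resulting body would be sent as a POST to the form's action URL. Whether a given site accepts it depends on what the captured traffic actually shows.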

Dan Dare
2009-09-09, 16:38
Fiddler Web Debugger
http://www.fiddler2.com/fiddler2/

Internet Explorer 8 - integrated Developer Tools (F12)
http://www.microsoft.com/windows/internet-explorer/default.aspx

Internet Explorer 6 & 7 - Internet Explorer Developer Toolbar
http://www.microsoft.com/downloadS/details.aspx?familyid=E59C3964-672D-4511-BB3E-2D5E1DB91038&displaylang=en

queeup
2009-09-09, 18:41
Update:
added Fiddler, IE Dev Toolbar and XPather

Dan Dare
2009-09-09, 18:44
@queeup
Should try the 2.2 beta Fiddler as well, it has a few improvements.

sansat
2009-09-09, 19:18
Can anyone please let me know how to use Python to get the video information shown on each page of a site that uses JavaScript?

link :

http://www.rajshri.com/landing.aspx?lgid=274&catid=257

Using the tools above, such as Firebug, I am able to get the video links after the JavaScript is executed. But how do I code in Python to get the video information after the JavaScript has executed? For example, the video reference below is found on page 3 of the above link after the JavaScript is executed. How do I get this information in Python?



<a href="moviesmidpage.aspx?cntid=9196" id="ctlLandingContent_rptContent_ctl02_ahrefHeader" class="red_txt12" style="padding-left: 5px;">
Hera Pheri</a>


Please let me know .

Thanks
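For reference, once the HTML of a given page has been obtained (how to request it is a separate problem and is assumed solved here), a link like the one quoted above can be matched with re.compile and turned into an absolute URL. This is a rough sketch using the Python 3 standard library. The pattern is tailored to this one snippet, and the find_movies helper is made up for illustration.

```python
import re
from urllib.parse import urljoin

# URL of the listing page, used to resolve the relative links.
BASE_URL = "http://www.rajshri.com/landing.aspx?lgid=274&catid=257"

# The snippet quoted in the post, standing in for the fetched page source.
GENERATED_SOURCE = """
<a href="moviesmidpage.aspx?cntid=9196" id="ctlLandingContent_rptContent_ctl02_ahrefHeader" class="red_txt12" style="padding-left: 5px;">
Hera Pheri</a>
"""

# Group 1: relative link, group 2: title with surrounding whitespace trimmed.
LINK = re.compile(
    r'<a href="(moviesmidpage\.aspx\?cntid=\d+)"[^>]*>\s*(.*?)\s*</a>',
    re.DOTALL,
)


def find_movies(source, base_url):
    """Return (title, absolute_url) pairs for every matching link."""
    return [(title, urljoin(base_url, href))
            for href, title in LINK.findall(source)]


movies = find_movies(GENERATED_SOURCE, BASE_URL)
print(movies)
```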

queeup
2009-09-09, 21:04
Sorry @sansat, but this is not the Help and Support forum :( Please start a new thread in Plugin/Script (Python) Help and Support (http://xbmc.org/forum/forumdisplay.php?f=27).