GSoC 2012 Statistics gathering

Posted By: Team XBMC on Jun 20, 2012 in Site News

Hi everyone!

As you may or may not know, I’ve been tasked with cleaning up parts of the scraping process as my GSoC project. A first step is to gather some information so that we can say, statistically, how your media is named: do you usually use dots instead of spaces, do most of you add the year to a movie’s folder name, and so on. Many of you might have changed your naming to fit XBMC, so we also want to catch the formats we do not handle and try to find patterns in those as well!
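
To make the kind of question concrete, here is a minimal sketch of how a single name could be classified. It is only an illustration: `describe_name`, the flag names and the extension list are hypothetical and are not taken from the addon.

```python
import re

# Hypothetical pattern: a four-digit year from 1900-2099 not embedded in a longer number.
# Note that a title such as "2001 A Space Odyssey" would also be flagged.
YEAR_RE = re.compile(r"(?<!\d)(19|20)\d{2}(?!\d)")


def describe_name(name):
    """Return simple flags describing how a file or folder name is formatted."""
    # Drop a few common video extensions so their dot is not counted as a separator.
    stem = re.sub(r"\.(mkv|avi|mp4)$", "", name, flags=re.IGNORECASE)
    return {
        "dots_as_spaces": "." in stem and " " not in stem,
        "has_year": YEAR_RE.search(stem) is not None,
    }


print(describe_name("Some.Movie.2010.mkv"))
print(describe_name("Some Movie (2010)"))
```

Counting flags like these over many submitted paths is one way to get the statistics described above.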

To do this, we’ve created a script that you can find in the Program Addons folder of your XBMC install, called “Statistics gathering for scraping GSoC 2012.” The script gathers only the bare essential metadata for items scanned into the library: title, year, runtime, TV show, episode number, season number, etc. The file path is also stored, since it is what generates this data and is very important to know. That said, I want to note that we will NOT store any username or password information from any part of these URLs. This script is not meant to track users, and it cannot track you! The data is uploaded to a server that we on the team can view. No data about your system is uploaded except the metadata and file URLs, and no binary data from the files of any kind!
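
For the curious, removing credentials from a URL-style path can look roughly like the sketch below. This is not the addon’s code: `strip_credentials` is a hypothetical helper, and it does not handle nested forms such as XBMC’s stacked paths.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_credentials(url):
    """Return the URL without any 'user:password@' part in front of the host."""
    parts = urlsplit(url)
    if "@" not in parts.netloc:
        # Local paths and URLs without credentials are returned unchanged.
        return url
    # Everything after the last '@' in the network location is host[:port].
    host = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(strip_credentials("smb://user:secret@nas/movies/Some.Movie.2010.mkv"))
```

Only the host and path survive, which is all that is needed to study naming patterns.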

Even with these assurances, we know that not everyone will be willing to send in that limited information, so we’ve made this addon “opt in.” In order to participate in the study, you have to install the addon from the Program Addons folder in XBMC. I urge everyone, both those displeased with how our scraper handles their media and those for whom it works perfectly, to install this addon if you want to do your part to make XBMC even more awesome. The more information we have, the better our scrapers will be.

Once again, the addon is called “Statistics gathering for scraping GSoC 2012,” and it may be found in the Program Addons section.

Cheers, Tobias

Update: There’s a (known) bug in the current version of the addon which makes it fail on start. It will be fixed ASAP. We are sorry for the inconvenience.

Update 2: The bug has been fixed and the updated addon is now available from the official repository. Before you install it, make sure it’s version 0.0.4. If not you can either force refresh XBMC’s official repository (using the context menu) or wait a few hours until it happens automatically. Once again, we are sorry for the inconvenience.

Update 3: The server has been taken down while I go through and back up the data. You all might be interested to know that you almost took down the server with all the data :) So thank you all so much, you have given me more than I ever hoped for! I will post updates ASAP!


Discussion - 43 Comments

  • DasMarx Jun 20, 2012 

    Using Frodo Alpha 2 and can’t install this plugin. After clicking on install it will show “downloading 0%” for 1 sec.

  • skybot Jun 20, 2012 

    I can’t install this addon :( I don’t know why. I use the latest OpenELEC build and I want to help.

  • Ryan Jun 20, 2012 

    Thing is, all my files have been renamed to work with the current scraping methods, so would my uploaded data be of any use?

  • Toby Jun 20, 2012 

    I used to rigorously rename all my movies, but for some time now, I have stopped doing this. Since the scrapers have been working so well, I just leave the filename to whatever it is set to by the torrent creator. So now I have a mix of nicely renamed files and original-torrent-named-files.
    Not so for music, I keep strict order there.

    Will my statistics data still be helpful?

  • Konrad Jun 20, 2012 

    Great idea. I usually keep movie title to the minimum (title only), which obviously results in the occasional error, which is fine for me.

  • Nogood5 Jun 20, 2012 

    Always room for improvement, so sure, I will install the addon tonight.

  • Ryan Jun 20, 2012 

    Will the data collected be publicly available? I for one would be interested in seeing the stats.

  • D0nR0s4 Jun 20, 2012 

    Weird, I get a script failed error. Any way to find out what failed?

  • Martijn Jun 20, 2012 

    All data is welcome even though you have no problems with scraping.

  • Martijn Jun 20, 2012 

    @skybot
    There was a small problem with adding it to the repo, which should now be fixed. The version should be 0.0.3.

  • Basker Jun 20, 2012 

    Great idea, but one small question. Is this data going to be private and protected? And will we have any kind of feedback on what they gather?

  • Basker Jun 20, 2012 

    I guess I cannot count.

  • Montellese Jun 20, 2012 

    There’s a (known) bug in the current version (0.0.3) of the addon which makes it fail on start. It will be fixed ASAP. We are sorry for the inconvenience.

  • Joey Jun 20, 2012 

    Hate to be a little bit paranoid here, but what will Team-XBMC do if MPAA, RIAA, DOJ, or the FBI gets a court order and demand you hand over the data you collect from this statistics gathering?

    Can you somehow guarantee those of us that “opt in” for this statistics study will never get traced back to our IP and ISP? If you at least can promise us that you will not store any of our IP and ISP, not even temporary, then I for one will “opt in” and offer my support.

    Also, could you maybe add an auto-uninstall option based on a date or days active to this addon? Would be great if this addon was only active for say 30-days after installation and then automatically uninstall itself :)

  • Montellese Jun 20, 2012 

    I’m not the author of the script but I have seen the stored data during a test run and there’s no IP or anything in there. It’s just the filename and some metadata like title etc. Furthermore any username/passwords in file paths are removed before the data is uploaded.

    Concerning auto-uninstall, this is not possible in XBMC. Just install the addon, run it once and uninstall it. There’s no need to run it twice as it would actually falsify the overall data.

  • Eyaz Jun 20, 2012 

    Please can you help me? I’ve got XBMC on my iPad 3, and when streaming it through Apple TV 3 it’s not full screen. How do I sort that out? Please help me, thanks.

  • Joe T Jun 20, 2012 

    Any way you can improve ability to scrape movies labeled in morse code? XBMC never seems to get correct info for “… .- -. – .- .- -. -.. – …. . .. -.-. . -.-. .-. . .- — -… ..- -. -. -.– ” I have to manually create nfo :(

  • Stephan Jun 20, 2012 

    I tried the plugin, but on the screen where you select which source directories to submit, the list of directories is too long for my display (running at 1080p) and the list isn’t scrollable. This has to be solved, otherwise my data at least will be useless/counterproductive.

  • Montellese Jun 20, 2012 

    The bug has been fixed and the updated addon is now available from the official repository. Before you install it, make sure it’s version 0.0.4. If not you can either force refresh XBMC’s official repository (using the context menu) or wait a few hours until it happens automatically. Once again, we are sorry for the inconvenience.

  • XtuX Jun 20, 2012 

    First thing to do when i get home from work !

  • Joey Jun 20, 2012 

    Montellese: Concerning auto-uninstall, this is not possible in XBMC. Just install the addon, run it once and uninstall it. There’s no need to run it twice as it would actually falsify the overall data.

    Maybe in the future you could have a “run-once” option for those types of addons then? ;)

    “This addon will only run once after installation then automatically uninstall itself when done”

  • Hasu0bs Jun 20, 2012 

    On an openELEC Generic PVR version with AudioEngine it crashes when i select Next.
    Here is a log:
    http://sprunge.us/YGfL

    My Files are on my Desktop Computer accessed over Windows smb share…

  • HTPC Champion Jun 20, 2012 

    I just spread your message on our german blog!

    http://www.htpc-blogger.de/?p=749

    Greetings
    Rene

  • Anonymous Jun 20, 2012 

    @Montellese
    Great point, I think the project author should definitely clarify that it should only be run once. OR a super slick upgrade to the current version, would be the ability to know what files it has already uploaded and not to upload those ones twice.

  • Tristan Jun 21, 2012 

    Sent! Always happy to send you guys stats and info!

  • ponpat Jun 22, 2012 

    I would love to help. Even though most of my library has been renamed to work, I still have some folders and files that I have not renamed and that do not work. But when I start the script I always get a “Script Error” without any useful information. Is there any other way to help you with the stats? I think I used very common naming styles before I renamed my files/folders (a folder for the series and a folder for each season with the episodes in it, by number and name; or a folder for every episode with a subtitle, a sample and stuff like that, depending on where I got my files).

  • monkeysweat Jun 22, 2012 

    i’ve always been sloppy with my movie names and just let it be and only renamed the folder if it wasn’t going to be easy for me to find it later.

    Tv shows I used to be very anal and make sure they were all s**e** but now I let it fly,, if it scrapes, its all good :)

    music i’ve never found a setup that works, but then again ive never tried, haha

  • topfs2 Jun 22, 2012 

    @Ryan
    All data is important; perfectly scraped data is usable if we want to train an AI to do some parts of the scraping process.

    @Ryan
    It will not be publicly available, since it’s private data. We might show statistics and graphs publicly, if there is interest.

    @Joey
    We store no binary data; everything is file paths and URLs. So there is _no_ way to tell whether you have simply named an empty file a certain way or whether you actually have a movie, i.e. the data is unusable to them. Also, all usernames and passwords are stripped and we do not keep the submitter’s IP. If you are using only internal paths there should be nothing trackable, at least if all your files and URLs are within a LAN. If you are uncertain whether a government could trace your paths with this data, i.e. if your URLs lead to externally viewable paths without a password, then don’t submit the data.

    Hope that answered some questions!
    Cheers,
    Tobias

  • angel129116 Jun 22, 2012 

    Sent! Totally painless and anything to help with this amazing software!

  • jak0p Jun 23, 2012 

    Just a quick hint before sending in my data:
    I’m using The MovieDB for scraping movies. Being German means having special characters like ä ü ö in titles, which are usually replaced by ae ue oe in file names. So far so good. But scraping with a name like:
    Maenner sind Schweine (http://www.themoviedb.org/search?query=Maenner+sind+schweine)
    will return 0 results (tMDB won’t convert ae oe ue back), whereas
    Männer sind Schweine (http://www.themoviedb.org/search?query=M%C3%A4nner+sind+schweine)

    will get the right result. Would be very very convenient if the scraper would do both scrapes (i.e. first scan for ue oe ae and then silently replace ue oe ae with ü ö and ä and do another scrape) and then show both results.

    (I know this may be the wrong place to post, but hey, it’s all about scrapers and how to optimize them ;)
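
The two-pass lookup suggested in this comment could be sketched as follows. This is an illustration only: `query_variants` is a hypothetical helper, it handles lowercase digraphs only, and the blind replacement is wrong for words such as “Quelle”, which is why the original title is always kept as the first variant.

```python
from urllib.parse import quote_plus

# Common ASCII transliterations of German umlauts (lowercase only in this sketch).
DIGRAPHS = {"ae": "ä", "oe": "ö", "ue": "ü"}


def query_variants(title):
    """Return the title as-is, plus a variant with digraphs turned back into umlauts."""
    variants = [title]
    replaced = title
    for digraph, umlaut in DIGRAPHS.items():
        replaced = replaced.replace(digraph, umlaut)
    if replaced != title:
        variants.append(replaced)
    return variants


for variant in query_variants("Maenner sind Schweine"):
    # quote_plus percent-encodes the UTF-8 bytes and turns spaces into '+'.
    print(quote_plus(variant))
```

A scraper could issue one search per variant and present the combined results.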

  • Paxxi Jun 24, 2012 

    Tried running the script yesterday and it crashed while scraping.

    Here’s the info from xbmc.log, send me a pm if you need more info or want me to test something

    22:29:58 T:7068 ERROR: D:\Program\XBMC 11.0\system\python\Lib\urllib.py:1224: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
    res = map(safe_map.__getitem__, s)
    22:29:58 T:7068 ERROR: Error Type:
    22:29:58 T:7068 ERROR: Error Contents: u'\xc2'
    22:29:58 T:7068 ERROR: Traceback (most recent call last):
    File "D:\Program\XBMC 11.0\portable_data\addons\script.statistics.gsoc2012.scraper\default.py", line 12, in
    sm.doModal()
    File "D:\Program\XBMC 11.0\portable_data\addons\script.statistics.gsoc2012.scraper\state.py", line 17, in doModal
    self.active.doModal()
    File "D:\Program\XBMC 11.0\portable_data\addons\script.statistics.gsoc2012.scraper\states.py", line 122, in doModal
    extraction.extractVideoFilesFromDirectory(files, videoFiles, source["file"], unscrapedIsCanceled, midProgress)
    File "D:\Program\XBMC 11.0\portable_data\addons\script.statistics.gsoc2012.scraper\extraction.py", line 169, in extractVideoFilesFromDirectory
    extractVideoFilesFromDirectory(files, videoFiles, f["file"], isInterrupted)
    File "D:\Program\XBMC 11.0\portable_data\addons\script.statistics.gsoc2012.scraper\extraction.py", line 169, in extractVideoFilesFromDirectory
    extractVideoFilesFromDirectory(files, videoFiles, f["file"], isInterrupted)
    File "D:\Program\XBMC 11.0\portable_data\addons\script.statistics.gsoc2012.scraper\extraction.py", line 171, in extractVideoFilesFromDirectory
    path = removeFromStackAndRecurse(f["file"])
    File "D:\Program\XBMC 11.0\portable_data\addons\script.statistics.gsoc2012.scraper\url.py", line 17, in removeFromStackAndRecurse
    newnetloc_list.append(quote_plus(removeFromStackAndRecurse(netloc_unquoted)))
    File "D:\Program\XBMC 11.0\system\python\Lib\urllib.py", line 1230, in quote_plus
    s = quote(s, safe + ' ')
    File "D:\Program\XBMC 11.0\system\python\Lib\urllib.py", line 1224, in quote
    res = map(safe_map.__getitem__, s)
    KeyError: u'\xc2'
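
The traceback suggests that a unicode string containing a non-ASCII character (`\xc2`) reached `quote_plus`, which the Python 2 `urllib` shipped with XBMC 11 cannot look up. A common workaround is to encode the text to UTF-8 bytes before quoting. The sketch below shows that idea in Python 3 syntax; `quote_path_component` is a hypothetical name, and the addon’s `url.py` has not been inspected.

```python
from urllib.parse import quote_plus


def quote_path_component(text):
    """Percent-encode text after encoding it to UTF-8 bytes."""
    if isinstance(text, str):
        # Encoding explicitly means the quoting step only ever sees bytes.
        text = text.encode("utf-8")
    return quote_plus(text)


print(quote_path_component("\xc2"))
print(quote_path_component("a b"))
```

Under Python 2 the analogous change would be to call `.encode("utf-8")` on the unicode value before passing it to `urllib.quote_plus`.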

  • Plasstech Jun 25, 2012 

    Will do when I get home, Anyone running the SQL setup? Hoping it will work properly with it. I guess we will see

    -Pete

  • topfs2 Jun 25, 2012 

    @Anonymous
    It’s fine to run it twice; it will not store perfect duplicates, but this logic is done server-side.
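
How the server actually detects duplicates is not described here. One simple way to drop perfect duplicates is to key each submitted record by a hash of its canonical form, as in this sketch; `SubmissionStore`, `record_key` and the record fields are all hypothetical.

```python
import hashlib
import json


def record_key(record):
    """Hash a record in a canonical form so identical records share the same key."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SubmissionStore:
    """In-memory stand-in for a server-side store that ignores exact duplicates."""

    def __init__(self):
        self._seen = {}

    def add(self, record):
        """Store the record and return True, or return False if it was already stored."""
        key = record_key(record)
        if key in self._seen:
            return False
        self._seen[key] = record
        return True

    def __len__(self):
        return len(self._seen)


store = SubmissionStore()
record = {"file": "smb://nas/movies/Some.Movie.2010.mkv", "title": "Some Movie", "year": 2010}
print(store.add(record), store.add(dict(record)), len(store))
```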

  • Brad Jun 25, 2012 

    Is there a forum thread where everyone can keep track of this project or make suggestions?

  • Sylus Jun 26, 2012 

    @brad

    Take a look at the forum: Development / GSoC 2012.

  • Buddha Jun 26, 2012 

    I took 5 minutes to read the naming conventions for movies and tv series/episodes in the XBMC wiki when I first downloaded XBMC.

    I use a program called “The Renamer” to automatically rename my material to satisfy those requirements.

    My scrapes have probably been running at about a 99.9 percent success rate over two years. The couple of problems I’ve had were when TVDB or IMDb mucked about with their code, and you guys sorted it out.

    As far as I’m concerned the scrapers work fine if you name your stuff properly.

  • V-Turn Jun 26, 2012 

    I’ve got an error while running the script, see http://pastebin.com/BdQ8RycB

  • Eric the Red Jun 27, 2012 

    Personally I think XBMC should move away from using file/folder to identify files, it’s a big whopping mess when you move files around or rename them, forcing you to rescan the library and clean it (this, at least, could/should be combined in a single action – while you add media, prune stuff which isn’t there anymore since you’re scanning anyway).

    For the rest, the scrapers are running pretty much flawlessly for me. Like Buddha, I also use a renaming tool to stick to the common conventions, which are what I use anyway. Things like “- SXXEXX” are just convenient.

    It would be nice if XBMC ignored everything between [] brackets, but I don’t want to get into the whole “did you download that from the internet OMG” discussion.
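
Both ideas in this comment, reading an “SXXEXX” marker and ignoring bracketed text, can be sketched with two regular expressions. These are illustrative patterns with hypothetical helper names, not the expressions XBMC uses.

```python
import re

# Anything between square brackets, including the brackets themselves.
BRACKETED_RE = re.compile(r"\[[^\]]*\]")
# Season/episode markers such as S01E02 (case-insensitive, one or two digits each).
EPISODE_RE = re.compile(r"S(\d{1,2})E(\d{1,2})", re.IGNORECASE)


def strip_brackets(name):
    """Remove [bracketed] parts and collapse the whitespace left behind."""
    cleaned = BRACKETED_RE.sub(" ", name)
    return " ".join(cleaned.split())


def season_episode(name):
    """Return (season, episode) as integers, or None if no marker is found."""
    match = EPISODE_RE.search(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


name = "[grp] Some Show - S01E02 [720p]"
print(strip_brackets(name))
print(season_episode(name))
```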

  • MacGyver Jun 28, 2012 

    I use EmberMM and Sickbeard to create .nfos, .tbns, and -fanart.jpgs that sit along-side all my media. So I don’t think I am using the scrapers at all. Will my data be useful, or screw up your data?

    • topfs2 Jul 01, 2012 

      All data is valuable, and data that you have corrected manually is even more valuable, as it might reveal file formats XBMC does not understand but that we could work on making it understand! So please upload :)

  • Tobbe Jun 30, 2012 

    I would also like to contribute, but as for some other people here, the script fails even at version 0.0.4 =/

  • Arucard Jul 03, 2012 

    @Eric the Red
    I’m not sure what alternative you had in mind, but if you wanted to move towards identifying your media according to file hashes you’d also need a scraping source that stores the file hash for all file variants (SD, HDTV, Web-DL, 1080p bluray, dvdrip, different encoders, etc.) of an episode/movie. This is currently not the case with most scraping sources (I only know of anidb.net). Other methods for identifying media would similarly require corresponding changes in all supported scraper sources, which is why this is probably quite difficult to do.

    It might be useful for XBMC to calculate and remember the file hashes for files being scraped though, so any duplicate entries can be removed when they have the same hash value. This would cause the library updates to be much slower when adding files though, since the hash value would need to be generated for every file, even files that were only renamed (since the hash would be required to know whether or not it is the same file as the existing one)
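
    A rough sketch of that idea, to show where the cost comes from: every file has to be read in full to compute its hash. The function names and the choice of SHA-1 are my own assumptions, nothing XBMC does today.

```python
import hashlib


def file_hash(path, chunk_size=1024 * 1024):
    """Hash a file's contents in chunks so large videos are never loaded whole."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths by content hash and return the groups with more than one path."""
    by_hash = {}
    for path in paths:
        by_hash.setdefault(file_hash(path), []).append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```

    A renamed file hashes to the same value as before, so it would be recognised as the existing library entry, but only after it has been read again from start to finish.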

    On another note, you can already clean your library automatically when you update it, with the option in the <videolibrary> section of your advancedsettings.xml.
    http://wiki.xbmc.org/index.php?title=AdvancedSettings.xml#.3Cvideolibrary.3E
    One of the reasons why this isn’t enabled by default is that people can have networked media sources for their library which can become temporarily unavailable. These sources would then be unintentionally removed from the library.

About XBMC

XBMC is a free and open source media player application developed by the XBMC Foundation, a non-profit technology consortium. XBMC is available for multiple operating systems and hardware platforms, featuring a 10-foot user interface for use with televisions and remote controls. It allows users to play and view most videos, music, podcasts, and other digital media files from local and network storage media and the internet.