The first thing I came across was using COM to connect to an Internet Explorer instance from Python, drive it back and forth, and grab the HTML content as it's interpreted by IE's engine. This had 3 major drawbacks:

1. Since we want to migrate everything from our Windows servers to Linux, it would be pointless to go with this approach; I'd have to rewrite it in a month or so.
2. That's leaving aside the "lameness" of the technologies involved.
3. I'm looking for a solution that doesn't require an open browser window, mainly because it should work on servers without X (GTK doesn't work without X; credits go to Alex Novac, and yes, it was silly of me to think otherwise).
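That said, "needs X" and "needs a monitor" are not the same thing: on a headless server you can point GTK at a virtual framebuffer. A sketch, assuming a Debian/Ubuntu box (package names may differ elsewhere, and the script name is just a placeholder):

```shell
# Xvfb is an X server that renders to memory instead of a display
# (package name assumed for Ubuntu)
sudo apt-get install xvfb

# xvfb-run starts Xvfb, sets DISPLAY accordingly, and runs the given
# command under it; replace your_gtk_script.py with your own script
xvfb-run -a python your_gtk_script.py
```

The `-a` flag makes xvfb-run pick a free display number automatically, which matters if several instances run at once.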
The next thing I tried was a Python solution that relied on integration with Gecko via hulahop. I must admit I couldn't get it to work under Ubuntu Jaunty Jackalope, due to incompatibilities in the system's libraries. I'm sure that with enough time and patience it can be persuaded to work, but as I didn't have any, I moved on and tried pywebkitgtk. This proved to be quite okay (I'm not a Safari fan) and it worked out of the box.
First of all, install pywebkitgtk. Under Ubuntu, you can do it directly from the repository:
sudo apt-get install python-webkitgtk libwebkit-1.0-1 libwebkit-dev
…it will attempt to install a lot of other stuff, linked libraries and so on. Just say yes.
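Before bothering with a window, a quick sanity check that the bindings import cleanly can save some head-scratching (this assumes the Python 2 era bindings the package above installs):

```shell
# If either module is missing or miscompiled, this prints a traceback
# instead of the OK message
python -c "import gtk, webkit; print 'pywebkitgtk OK'"
```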
After the installation is complete, it’s generally a good idea to test it! The following code should display a window with Google’s first page in it:
#!/usr/bin/env python

import gtk
import webkit

window = gtk.Window()
view = webkit.WebView()
view.open('http://www.google.com')
window.add(view)
window.show_all()
# quit the GTK main loop when the window is closed
window.connect('delete-event', lambda window, event: gtk.main_quit())
gtk.main()
…if it doesn't, maybe you did something wrong. See if all the packages are in their place. For the conversation's sake, let's assume it worked and move on. As I said in the first paragraph, I want to load a webpage, wait for it to execute all the JS in it and take the generated HTML source. A strange problem with pywebkitgtk is that neither the WebView object nor the encapsulated WebFrame object has a "get_html()" method or anything similar. Really, there is no clean way to get the site's content. But, fortunately, on pywebkitgtk's wiki I found this hack that does just that:
class WebView(webkit.WebView):
    def get_html(self):
        # stash the real title, copy the rendered DOM into document.title,
        # read it back through get_title(), then restore the title
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html
As previously stated, in my application I didn't want to have a browser window open, and with GTK it's possible to run your app without calling window.show() or window.show_all(). Long story short, this is how I did it:
#!/usr/bin/env python

import sys
import warnings
from optparse import OptionParser

import gtk
import webkit

warnings.filterwarnings('ignore')

class WebView(webkit.WebView):
    def get_html(self):
        # same document.title hack as above to extract the rendered DOM
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html

class Crawler(gtk.Window):
    def __init__(self, url, file):
        # kudos to Nicholas Herriot (see comments); suggested for Ubuntu Koala
        gtk.gdk.threads_init()
        gtk.Window.__init__(self)
        self._url = url
        self._file = file

    def crawl(self):
        # note: the window is never shown, so no X window pops up
        view = WebView()
        view.open(self._url)
        view.connect('load-finished', self._finished_loading)
        self.add(view)
        gtk.main()

    def _finished_loading(self, view, frame):
        # fires once the page (including its JS) has finished loading
        with open(self._file, 'w') as f:
            f.write(view.get_html())
        gtk.main_quit()

def get_cmd_options():
    """Gets and validates the input from the command line."""
    usage = 'usage: %prog [options] args'
    parser = OptionParser(usage)
    parser.add_option('-u', '--url', dest='url',
                      help='URL to fetch data from')
    parser.add_option('-f', '--file', dest='file',
                      help='Local file path to save data to')
    (options, args) = parser.parse_args()
    if not options.url:
        print 'You must specify an URL. Run %s --help for more details.' % sys.argv[0]
        sys.exit(1)
    if not options.file:
        print 'You must specify a destination file. Run %s --help for more details.' % sys.argv[0]
        sys.exit(1)
    return options

def main():
    options = get_cmd_options()
    crawler = Crawler(options.url, options.file)
    crawler.crawl()

if __name__ == '__main__':
    main()
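Assuming you saved the script as crawler.py (the filename is mine, pick whatever you like), a typical invocation looks like this:

```shell
# fetch the page, let WebKit run its JavaScript, and dump the
# resulting HTML into google.html
python crawler.py --url http://www.google.com --file google.html
```

Running it with a missing `--url` or `--file` prints the corresponding error message and exits with status 1, courtesy of get_cmd_options().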
Download it, try it out. It worked wonders for me and I hope it will prove useful to other people too…