I’ve been bragging about this post for quite some time now. Well, I won’t do that any more, because it turns out that pywebkitgtk isn’t the best way to do things and that my first solution to the problem sucks :( Yes, the sad truth…

Yesterday, I tried to put the application on the server – a Debian Lenny machine without X. And this is where it all broke down. I don’t want to install Xorg on this machine just so that a small script will work, so I looked for alternative ways to run the script. One of the first alternatives I found was Xvfb, which, according to Wikipedia…

In the X Window System, Xvfb or X virtual framebuffer is an X11 server that performs all graphical operations in memory, not showing any screen output. From the point of view of the client, it acts exactly like any other server, serving requests and sending events and errors as appropriate. However, no output is shown. This virtual server does not require the computer it is running on to even have a screen or any input device. Only a network layer is necessary.

…should get the job done. But it didn’t. While running under Xvfb, GTK kept throwing segmentation faults and crashing the whole script.
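For context, the usual way to point an X client at Xvfb is to start the virtual server on a spare display number and set DISPLAY accordingly. A minimal sketch (the display number, screen geometry, and script name are all just example values):

```shell
# Start a virtual X server on display :99 with a 1024x768, 24-bit screen
Xvfb :99 -screen 0 1024x768x24 &

# Run the script against the virtual display (crawler.py is a placeholder name)
DISPLAY=:99 python crawler.py
```

This is the setup that kept segfaulting for me with GTK.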

I was faced with the following decision: spend hours, or perhaps days, figuring out why Xvfb and GTK make such uneasy bedfellows, or rewrite a 50-line crawler script. I knew from my previous research on the matter that Python also had WebKit bindings through Qt, so I gave it a try. And it proved to be a much better solution than GTK.

Qt to the rescue

Although I’m a Gnome/GTK fan, I must admit that Qt is a much better candidate for this job. First, it has extensive documentation, whereas pywebkitgtk’s is scarce. Second, it actually works in my particular case, which is a huge advantage ;)

Under Ubuntu and Debian, you can install the needed packages by simply typing…

sudo apt-get install python-qt4 libqt4-webkit

…in the console, and you’re done: you can now run Python applications that use Qt and WebKit. The rewritten crawler code is:

#!/usr/bin/env python

import sys
import signal

from optparse import OptionParser
from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Crawler( QWebPage ):
	"""Loads a URL in an off-screen WebKit page and saves the rendered HTML."""

	def __init__( self, url, outfile ):
		QWebPage.__init__( self )
		self._url = url
		self._file = outfile

	def crawl( self ):
		# Restore the default SIGINT handler so Ctrl+C can still stop the script
		signal.signal( signal.SIGINT, signal.SIG_DFL )
		self.connect( self, SIGNAL( 'loadFinished(bool)' ), self._finished_loading )
		self.mainFrame().load( QUrl( self._url ) )

	def _finished_loading( self, result ):
		# 'result' is False when the page failed to load; save whatever was rendered either way.
		# toHtml() returns a QString, so encode it explicitly to avoid choking on non-ASCII pages.
		f = open( self._file, 'w' )
		f.write( unicode( self.mainFrame().toHtml() ).encode( 'utf-8' ) )
		f.close()
		sys.exit( 0 )

def main():
	app = QApplication( sys.argv )
	options = get_cmd_options()
	crawler = Crawler( options.url, options.file )
	crawler.crawl()
	sys.exit( app.exec_() )

def get_cmd_options():
	"""
		gets and validates the input from the command line
	"""
	usage = "usage: %prog [options] args"
	parser = OptionParser(usage)
	parser.add_option('-u', '--url', dest = 'url', help = 'URL to fetch data from')
	parser.add_option('-f', '--file', dest = 'file', help = 'Local file path to save data to')

	(options,args) = parser.parse_args()

	if not options.url:
		print 'You must specify a URL. See', sys.argv[0], '--help for more details.'
		sys.exit( 1 )
	if not options.file:
		print 'You must specify a destination file. See', sys.argv[0], '--help for more details.'
		sys.exit( 1 )

	return options

if __name__ == '__main__':
	main()
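
As a side note, the option-parsing part of the script can be exercised on its own, without Qt installed. A minimal sketch of the same optparse pattern, feeding parse_args() an example argument list instead of sys.argv (the URL and file name are just placeholders):

```python
from optparse import OptionParser

# Build a parser with the same two options the crawler uses
parser = OptionParser(usage="usage: %prog [options]")
parser.add_option('-u', '--url', dest='url', help='URL to fetch data from')
parser.add_option('-f', '--file', dest='file', help='Local file path to save data to')

# Parse an example argument list instead of sys.argv
options, args = parser.parse_args(['-u', 'http://example.com', '-f', 'page.html'])
print(options.url)   # -> http://example.com
```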

This time it really works. I feel warm and fuzzy on the inside ;)