Downloading a page’s content with python and WebKit

Posted on Tuesday, July 7th, 2009

I’ve been bragging about this post for quite some time now. Well, I won’t do that any more, because it seems that pywebkitgtk isn’t the best way to do things and that my first solution to the problem sucks :( Yes, the sad truth…

Yesterday, I tried to put the application on the server – a Debian Lenny machine without X. And this is where it all broke down. I didn’t want to install Xorg on this machine just so that a small script would work, so I looked for alternative ways to run the script. One of the first alternatives I found was Xvfb, which, according to Wikipedia…

In the X Window System, Xvfb or X virtual framebuffer is an X11 server that performs all graphical operations in memory, not showing any screen output. From the point of view of the client, it acts exactly like any other server, serving requests and sending events and errors as appropriate. However, no output is shown. This virtual server does not require the computer it is running on to even have a screen or any input device. Only a network layer is necessary.

…should get the job done. But it didn’t. While running under Xvfb, GTK kept throwing segmentation faults and crashing the whole script.

I was faced with the following decision: spend hours or perhaps days trying to see why Xvfb and GTK make such uneasy bedfellows, or rewrite a 50-line crawler script. I knew from my previous research on the matter that Python also has WebKit bindings through Qt, so I gave it a try. And it proved to be a much better solution than GTK.

Qt to the rescue

Although I’m a Gnome/GTK fan, I must admit that Qt is a much better candidate for this job. First of all, it has extensive documentation, whereas pywebkitgtk’s is scarce. Second, it works in my particular case, which can prove to be a huge advantage ;)

Under Ubuntu and Debian, you can install the package by simply typing…

sudo apt-get install python-qt4 libqt4-webkit

…in the console. And you’re done. You can now run applications with Python and Qt. The rewritten crawler code is:

#!/usr/bin/env python
 
import sys
import signal
 
from optparse import OptionParser
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage
 
 
class Crawler( QWebPage ):
	def __init__(self, url, file):
		QWebPage.__init__( self )
		self._url = url
		self._file = file
 
	def crawl( self ):
		signal.signal( signal.SIGINT, signal.SIG_DFL )
		self.connect( self, SIGNAL( 'loadFinished(bool)' ), self._finished_loading )
		self.mainFrame().load( QUrl( self._url ) )
 
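	def _finished_loading( self, result ):
		# toHtml() returns a QString; convert it to UTF-8 bytes before writing
		html = unicode( self.mainFrame().toHtml() ).encode( 'utf-8' )
		# 'output' avoids shadowing the built-in name 'file'
		output = open( self._file, 'w' )
		output.write( html )
		output.close()
		sys.exit( 0 )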
 
def main():
	app = QApplication( sys.argv )
	options = get_cmd_options()
	crawler = Crawler( options.url, options.file )
	crawler.crawl()
	sys.exit( app.exec_() )
 
def get_cmd_options():
	"""
		gets and validates the input from the command line
	"""
	usage = "usage: %prog [options] args"
	parser = OptionParser(usage)
	parser.add_option('-u', '--url', dest = 'url', help = 'URL to fetch data from')
	parser.add_option('-f', '--file', dest = 'file', help = 'Local file path to save data to')
 
	(options,args) = parser.parse_args()
 
	if not options.url:
		print 'You must specify a URL.',sys.argv[0],'--help for more details'
		sys.exit(1)
	if not options.file:
		print 'You must specify a destination file.',sys.argv[0],'--help for more details'
		sys.exit(1)
 
	return options
 
if __name__ == '__main__':
	main()

This time it really works. I feel warm and fuzzy on the inside ;)
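
A side note on the option handling: optparse has since been deprecated in favour of argparse, which can enforce required options by itself. The block below is only a minimal sketch of the same two options, assuming Python 2.7 or later (so not the stock Python on Lenny); the `argv` parameter and the description text are my additions for illustration, not part of the script above.

```python
import argparse


def get_cmd_options(argv=None):
    """Parse and validate the command line; argv=None means sys.argv[1:]."""
    parser = argparse.ArgumentParser(description='Save a rendered page to a file')
    # required=True makes argparse print the usage text and exit with status 2
    # when an option is missing, which replaces the manual checks
    parser.add_argument('-u', '--url', required=True,
                        help='URL to fetch data from')
    parser.add_argument('-f', '--file', required=True,
                        help='Local file path to save data to')
    return parser.parse_args(argv)
```

Calling `get_cmd_options(['-u', 'http://example.com/', '-f', 'page.html'])` gives an object with `url` and `file` attributes, just like the `options` value used in `main()`.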

6 Responses to “Downloading a page’s content with python and WebKit”

Trackbacks (1)

Comments (5)

  1. • Drake •

    That’s awesome you got a Qt version working! – it works fine for me on Ubuntu.
    Do you know a way to make Webkit wait until any AJAX calls have completed before emitting the loadFinished() signal?

  2. Tudor

    I’m afraid that’s not possible, because loadFinished is a standard event that gets fired when the page has finished loading, whereas additional AJAX requests are not part of that load. It’s even harder because AJAX calls are asynchronous: even if a function was called, you don’t know when the AJAX request will finish.

    Perhaps you can look for some changes the AJAX calls will make in the page’s DOM. Monitor the HTML code inside for these kinds of changes and start when you detect them.

  3. • Drake •

    > Perhaps you can look for some changes the AJAX calls will make in the page’s DOM

    yeah that would be a reasonable workaround.

    To start the AJAX calls I would need to trigger certain JavaScript functions, either directly or indirectly, e.g. through button click events.
    I suppose I could do that by modifying the HTML to put the necessary JavaScript functions in the onLoad event, but that is a hack. Do you know the proper way to trigger JavaScript events?

  4. Guilherme

    OH, Man! Oh, man! Oh, man!

    You just made my day a lot happier! Thank you!
    A million times, thank you! =)

  5. Hello. This article was very useful for me but I had to create something that fit my needs. Here[1]’s the source, as a contribution.

    [1]: http://github.com/emyller/webkitcrawler
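
Picking up the DOM-monitoring idea from comment 2: one rough way to approximate "wait for AJAX" is to poll the page's HTML and treat it as finished once it stops changing. The helper below is only a sketch under that assumption. `wait_for_stable_html` and all its parameters are hypothetical names, and it is written toolkit-agnostic on purpose, taking one callable that returns the current HTML and one that pumps pending events.

```python
import time


def wait_for_stable_html(get_html, process_events, checks=3,
                         interval=0.2, timeout=10.0):
    """Return the HTML once it was unchanged for `checks` consecutive polls.

    Returns None if the content never settles before `timeout` seconds pass.
    """
    deadline = time.time() + timeout
    last = get_html()
    unchanged = 0
    while time.time() < deadline:
        process_events()      # let the toolkit handle pending network/JS events
        time.sleep(interval)
        current = get_html()
        if current == last:
            unchanged += 1
            if unchanged >= checks:
                return current
        else:
            # the DOM changed, so restart the count
            last = current
            unchanged = 0
    return None
```

In the crawler above, this could presumably be called from `_finished_loading` with `lambda: unicode(self.mainFrame().toHtml())` as `get_html` and `QApplication.processEvents` as `process_events`. It only notices changes that show up in the serialized HTML, and a slow request can still complete after the page looks stable, so the `checks` and `interval` values need tuning per site.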
