Pywebkitgtk under Debian Lenny

Posted on Wednesday, July 8th, 2009

If you’ve read the previous post, you know that I consider using Python with WebKit and Qt a much better solution than using GTK, so that’s what I recommend. But if you want to give GTK a try and need to install pywebkitgtk under Debian Lenny, this is what you need to do:

  1. Open /etc/apt/sources.list and append the following lines to the file.

    # Unstable Sid
    deb http://http.us.debian.org/debian/ unstable main contrib non-free 
    # Unstable Sources
    deb-src http://http.us.debian.org/debian/ unstable main contrib non-free
  2. Run apt-get update in the shell, so that apt will become aware of the changes in the sources.list file
  3. Run apt-get install pywebkitgtk

Note that you have to be logged in as root or be able to sudo.

Downloading a page’s content with python and WebKit

Posted on Tuesday, July 7th, 2009

I’ve been bragging about this post for quite some time now. Well, I won’t do that any more, because it seems that pywebkitgtk isn’t the best way to do things out there and that my first solution to the problem sucks :( Yes, the sad truth…

Yesterday, I tried to put the application on the server, a Debian Lenny machine without X. And this is where it all broke down. I didn’t want to install Xorg on this machine just so that a small script would work, so I looked for alternative ways to run the script. One of the first alternatives I found was Xvfb, which, according to Wikipedia:

In the X Window System, Xvfb or X virtual framebuffer is an X11 server that performs all graphical operations in memory, not showing any screen output. From the point of view of the client, it acts exactly like any other server, serving requests and sending events and errors as appropriate. However, no output is shown. This virtual server does not require the computer it is running on to even have a screen or any input device. Only a network layer is necessary.

…should get the job done. But it didn’t. While running under Xvfb, GTK kept throwing segmentation faults and crashing the whole script.
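For reference, this is roughly how a script gets launched under Xvfb. It is a minimal sketch, assuming the xvfb-run wrapper that ships with the Xvfb package is installed. The helper names and the crawler.py filename are made up for the example, and nothing here fixes the crashes described above.

```python
import subprocess


def xvfb_command(script_args, screen="1024x768x24"):
    """Wrap a command line so that it runs under a throwaway Xvfb server."""
    return [
        "xvfb-run",
        "--auto-servernum",                     # let xvfb-run pick a free display number
        "--server-args=-screen 0 %s" % screen,  # size and depth of the virtual screen
    ] + list(script_args)


def run_under_xvfb(script_args):
    """Run the wrapped command and return the exit status of xvfb-run."""
    return subprocess.call(xvfb_command(script_args))
```

Something like run_under_xvfb(["python", "crawler.py", "-u", "http://www.example.com", "-f", "out.html"]) would then start the crawler with a virtual display. A non-zero return value means the run failed, which is what I kept getting with GTK.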

I was faced with the following decision: spend hours or perhaps days trying to see why Xvfb and GTK make such uneasy bedfellows, or rewrite a 50-line crawler script. I knew from my previous research on the matter that Python also had bindings for WebKit through Qt, so I gave it a try. And it proved to be a much better solution than GTK.

Qt to the rescue

Although I’m a Gnome/GTK fan, I must admit that Qt is a much better candidate for this job. First of all, it has extensive documentation, whereas pywebkitgtk’s is scarce. Second, it works in my particular case, which can prove to be a huge advantage ;)

Under Ubuntu and Debian, you can install the package by simply typing…

sudo apt-get install python-qt4 libqt4-webkit

…in the console. And you’re done. You can run applications with python and Qt. The rewritten crawler code is:

#!/usr/bin/env python
 
import sys
import signal
 
from optparse import OptionParser
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage
 
 
class Crawler( QWebPage ):
	def __init__(self, url, file):
		QWebPage.__init__( self )
		self._url = url
		self._file = file
 
	def crawl( self ):
		signal.signal( signal.SIGINT, signal.SIG_DFL )
		self.connect( self, SIGNAL( 'loadFinished(bool)' ), self._finished_loading )
		self.mainFrame().load( QUrl( self._url ) )
 
	def _finished_loading( self, result ):
		file = open( self._file, 'w' )
		file.write( self.mainFrame().toHtml() )
		file.close()
		sys.exit( 0 )
 
def main():
	app = QApplication( sys.argv )
	options = get_cmd_options()
	crawler = Crawler( options.url, options.file )
	crawler.crawl()
	sys.exit( app.exec_() )
 
def get_cmd_options():
	"""
		gets and validates the input from the command line
	"""
	usage = "usage: %prog [options] args"
	parser = OptionParser(usage)
	parser.add_option('-u', '--url', dest = 'url', help = 'URL to fetch data from')
	parser.add_option('-f', '--file', dest = 'file', help = 'Local file path to save data to')
 
	(options,args) = parser.parse_args()
 
	if not options.url:
		print 'You must specify a URL. Run',sys.argv[0],'--help for more details'
		sys.exit(1)
	if not options.file:
		print 'You must specify a destination file. Run',sys.argv[0],'--help for more details'
		sys.exit(1)
 
	return options
 
if __name__ == '__main__':
	main()

This time it really works. I feel warm and fuzzy on the inside ;)
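One thing the script above leaves to chance is the encoding of the saved file. Here is a minimal sketch of a helper that always writes UTF-8. It assumes the HTML has first been turned into a Python unicode string (in Python 2 with PyQt4, probably something like unicode(self.mainFrame().toHtml()), which I haven’t checked on every PyQt4 version). The save_html name is made up for the example.

```python
import io


def save_html(path, html):
    """Write a unicode string to path as UTF-8, regardless of the system locale."""
    # io.open behaves the same on Python 2.6+ and Python 3; in text mode it
    # expects unicode text and does the encoding itself.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
```

In _finished_loading, the three open/write/close lines could then be swapped for a single save_html(self._file, ...) call. This should avoid the UnicodeEncodeError that Python 2 raises when a unicode string with non-ASCII characters is written to a plain file object.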

Pywebkitgtk – execute Javascript from python

Posted on Thursday, June 18th, 2009

Last week I got a new assignment at my job: a crawler that was supposed to periodically visit some sites and download their content. Sounds simple, doesn’t it? Well, it’s not. Mainly because we also want to get all the Flash content, and some of it is inserted with JavaScript, via libraries like SWFObject or, in some cases, directly with document.write. I needed a snapshot of what the page actually looks like when the user is looking at it in a browser.

This meant that I had to get the content *after* all the JavaScript contained in the page had finished executing. In developer language, this means after the window.onload event takes place. And, of course, I also needed a JavaScript interpreter. So any attempt to use wget/cURL/file_get_contents was destined to fail from the start. I needed browser power :) So I googled around for some.
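To make the problem concrete, here is a small sketch of what a tool that only downloads the markup gets to see. The page in STATIC_HTML is made up for the example, and so are the TagCollector and tags_in names. The parser is the one from Python’s standard library, which does not run any JavaScript.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


# Made-up page: the Flash movie only exists after the script has run.
STATIC_HTML = """<html><head><title>Demo</title></head>
<body>
<script type="text/javascript">
document.write('<embed src="movie.swf" type="application/x-shockwave-flash">');
</script>
</body></html>"""


class TagCollector(HTMLParser):
    """Record the name of every element found in the markup."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_in(markup):
    """Return the element names a static parser finds, in document order."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

Here tags_in(STATIC_HTML) contains 'script' but not 'embed'. To the parser, the embed tag is just text inside the script, and it only becomes part of the page once a browser engine executes that script.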

The first thing I came across was using COM to connect to an Internet Explorer instance from python, use it to navigate back and forth and get the HTML content as it’s interpreted by IE’s engine. This had 3 major drawbacks:

  • it requires Internet Explorer
  • it requires Microsoft Windows
  • it requires an opened IE window

Since we want to migrate everything from our Windows servers to Linux, it would be pointless to go with this approach, since I’d have to rewrite it in a month or so. Leaving aside the “lameness” of the technologies involved :) And I’m looking for a solution that doesn’t require an open browser window, mainly because it should work on servers without X, and also because I simply don’t want one :P (GTK doesn’t work without X; credits go to Alex Novac. And yes, it was silly of me to think otherwise.)

This solution wasn’t good enough, so I kept looking and came across the HtmlUnit Java library. This library is used to write tests in Java for web-based applications. Pretty cool. And not so much. Although Java was once my one true love, after all these years spent with scripting languages, declaring variables, compiling the code, writing only OOP code and so on seemed a little…unfamiliar. But it takes more than anApiWithReallyLongCamelCasedClassNames to stop me, so I installed Eclipse and ran some tests. Disappointing! The library isn’t very tolerant of messy HTML and JavaScript and, since nobody out there in the real world actually abides by W3C recommendations, this library is somewhat useless in my case.

The next thing I tried was a Python-based solution that relied on integration with Gecko via hulahop. I must admit that I couldn’t get it to work under Ubuntu Jaunty Jackalope, due to incompatibilities in the system’s libraries. I’m sure that with enough time and patience, it can be persuaded to work. But, as I didn’t have any, I moved on and tried pywebkitgtk. This proved to be quite okay (I’m not a Safari fan) and it worked out of the box.

After spending several days searching the web, reading articles and trying out different software, I decided to share my findings with the world and write a tutorial on how to get the content of a page in Python *after* its JavaScript has finished executing. Here it goes:

First of all, install pywebkitgtk. Under Ubuntu, you can do it directly from the repository:

sudo apt-get install python-webkitgtk libwebkit-1.0-1 libwebkit-dev

…it will attempt to install a lot of other stuff, linked libraries and so on. Just say yes :P
After the installation is complete, it’s generally a good idea to test it! The following code should display a window with Google’s home page in it:

#!/usr/bin/env python
 
import gtk
import webkit
 
window = gtk.Window()
view = webkit.WebView()
view.open('http://www.google.com')
window.add(view)
window.show_all()
window.connect('delete-event', lambda window, event: gtk.main_quit())
 
gtk.main()

…if it doesn’t, maybe you did something wrong. Check that all the packages are in place. For the conversation’s sake, let’s assume it worked and move on. As I said at the beginning, I want to load a web page, wait for it to execute all the JS in it and take the generated HTML source. A strange problem with pywebkitgtk is that neither the WebView object nor the encapsulated WebFrame object has a “get_html()” method or something similar. Really, there is no clean way to get the site’s content. But, fortunately, on pywebkitgtk’s wiki I found this hack that does just that:

class WebView(webkit.WebView):
    def get_html(self):
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html

It executes a piece of JavaScript that takes the content of the whole document and stores it in the title. And since there is a get_title() method that returns the title’s content, this workaround gets the job done. Kind of lame, but it suffices.
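One detail worth knowing about this hack: document.documentElement.innerHTML returns what is inside the html element, so the html tag itself (with any attributes it had) and the doctype are not part of the string. If a complete document is needed, a wrapper can be put back by hand. Below is a minimal sketch; the as_full_document name and its doctype parameter are made up for the example, and the original doctype has to be supplied by the caller because the hack cannot recover it.

```python
def as_full_document(inner_html, doctype=None):
    """Put an <html> wrapper back around documentElement.innerHTML output."""
    parts = []
    if doctype:
        # Optional: the caller decides which doctype line to prepend.
        parts.append(doctype)
    parts.append("<html>" + inner_html + "</html>")
    return "\n".join(parts)
```

With the WebView subclass above, this would be called as as_full_document(view.get_html()).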

As previously stated, in my application I didn’t want to have a browser window open, and with GTK it is possible to run your app without calling window.show() or window.show_all(). Long story short, this is how I did it:

#!/usr/bin/env python
import sys, thread # kudos to Nicholas Herriot (see comments)
import gtk
import webkit
import warnings
from time import sleep
from optparse import OptionParser
 
warnings.filterwarnings('ignore')
 
class WebView(webkit.WebView):
	def get_html(self):
		self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
		html = self.get_main_frame().get_title()
		self.execute_script('document.title=oldtitle;')
		return html
 
class Crawler(gtk.Window):
	def __init__(self, url, file):
		gtk.gdk.threads_init() # suggested by Nicholas Herriot for Ubuntu Koala
		gtk.Window.__init__(self)
		self._url = url
		self._file = file
 
	def crawl(self):
		view = WebView()
		view.open(self._url)
		view.connect('load-finished', self._finished_loading)
		self.add(view)
		gtk.main()
 
	def _finished_loading(self, view, frame):
		with open(self._file, 'w') as f:
			f.write(view.get_html())
		gtk.main_quit()
 
def main():
	options = get_cmd_options()
	crawler = Crawler(options.url, options.file)
	crawler.crawl()
 
def get_cmd_options():
	"""
		gets and validates the input from the command line
	"""
	usage = "usage: %prog [options] args"
	parser = OptionParser(usage)
	parser.add_option('-u', '--url', dest = 'url', help = 'URL to fetch data from')
	parser.add_option('-f', '--file', dest = 'file', help = 'Local file path to save data to')
 
	(options,args) = parser.parse_args()
 
	if not options.url:
		print 'You must specify a URL. Run',sys.argv[0],'--help for more details'
		sys.exit(1)
	if not options.file:
		print 'You must specify a destination file. Run',sys.argv[0],'--help for more details'
		sys.exit(1)
 
	return options
 
if __name__ == '__main__':
	main()

Download it, try it out. It worked wonders for me and I hope it will prove useful to other people too…
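Since GTK needs an X server even when no window is shown, it may be worth checking for one before starting the crawl. Here is a minimal sketch, assuming the usual convention that X clients find their server through the DISPLAY environment variable. The has_x_display name is made up for the example.

```python
import os


def has_x_display(environ=None):
    """Return True when the DISPLAY variable is set to a non-empty value."""
    if environ is None:
        environ = os.environ
    return bool(environ.get("DISPLAY", "").strip())
```

Calling it at the top of main() and exiting with a clear message when it returns False gives a friendlier failure than whatever GTK does on its own. Note that a set DISPLAY only says where the server should be, not that it is actually reachable.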