I’ve been bragging about this post for quite some time now. Well, I won’t do that any more, because it seems that pywebkitgtk isn’t the best way to do things out there, and my first solution to the problem sucks.
Yes, the sad truth…
Yesterday, I tried to put the application on the server – a Debian Lenny machine without X. And this is where it all broke down. I don’t want to install Xorg on this machine just so that a small script will work, so I looked for alternative ways to run the script. One of the first alternatives I found was Xvfb, which, according to Wikipedia…
In the X Window System, Xvfb or X virtual framebuffer is an X11 server that performs all graphical operations in memory, not showing any screen output. From the point of view of the client, it acts exactly like any other server, serving requests and sending events and errors as appropriate. However, no output is shown. This virtual server does not require the computer it is running on to even have a screen or any input device. Only a network layer is necessary.
…should get the job done. But it didn’t. While running under Xvfb, GTK kept throwing segmentation faults and crashing the whole script.
I was faced with the following decision: spend hours or perhaps days trying to see why Xvfb and GTK make such uneasy bedfellows, or rewrite a 50-line crawler script. I knew from my previous research on the matter that Python also had WebKit bindings for Qt, so I gave it a try. And it proved to be a much better solution than GTK.
Qt to the rescue
Although I’m a Gnome/GTK fan, I must admit that Qt is a much better candidate for this job. First of all, it has extensive documentation, whereas pywebkitgtk’s is scarce. Second, it works in my particular case, which can prove to be a huge advantage.
Under Ubuntu and Debian, you can install the package by simply typing…
sudo apt-get install python-qt4 libqt4-webkit
…in the console. And you’re done. You can run applications with python and Qt. The rewritten crawler code is:
    #!/usr/bin/env python

    import sys
    import signal

    from optparse import OptionParser
    from PyQt4.QtCore import *
    from PyQt4.QtGui import *
    from PyQt4.QtWebKit import QWebPage

    class Crawler( QWebPage ):
        def __init__(self, url, file):
            QWebPage.__init__( self )
            self._url = url
            self._file = file

        def crawl( self ):
            signal.signal( signal.SIGINT, signal.SIG_DFL )
            self.connect( self, SIGNAL( 'loadFinished(bool)' ), self._finished_loading )
            self.mainFrame().load( QUrl( self._url ) )

        def _finished_loading( self, result ):
            file = open( self._file, 'w' )
            file.write( self.mainFrame().toHtml() )
            file.close()
            sys.exit( 0 )

    def main():
        app = QApplication( sys.argv )
        options = get_cmd_options()
        crawler = Crawler( options.url, options.file )
        crawler.crawl()
        sys.exit( app.exec_() )

    def get_cmd_options():
        """
            gets and validates the input from the command line
        """
        usage = "usage: %prog [options] args"
        parser = OptionParser(usage)
        parser.add_option('-u', '--url', dest = 'url', help = 'URL to fetch data from')
        parser.add_option('-f', '--file', dest = 'file', help = 'Local file path to save data to')

        (options,args) = parser.parse_args()

        if not options.url:
            print 'You must specify an URL.',sys.argv[0],'--help for more details'
            exit(1)
        if not options.file:
            print 'You must specify a destination file.',sys.argv[0],'--help for more details'
            exit(1)

        return options

    if __name__ == '__main__':
        main()
This time it really works. I feel warm and fuzzy on the inside.
That’s awesome you got a Qt version working! – it works fine for me on Ubuntu.
Do you know a way to make Webkit wait until any AJAX calls have completed before emitting the loadFinished() signal?
I’m afraid that’s not possible, because loadFinished is a standard event that gets fired when the page has finished loading, whereas additional AJAX requests are not covered by it. It’s even harder because AJAX calls are asynchronous: even if a function was called, you don’t know when the AJAX request will finish.
Perhaps you can look for some changes the AJAX calls will make in the page’s DOM. Monitor the page’s HTML for these kinds of changes and start when you detect them.
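A rough sketch of that monitoring idea, in case it helps. It takes repeated snapshots of the page source and treats the page as "settled" once the source stops changing. The `SettleDetector` class and its `stable_polls` parameter are made up for illustration and are not part of PyQt. This is only a heuristic, because a slow AJAX call can still arrive after the page looks stable.

```python
class SettleDetector(object):
    """Reports when polled content has stopped changing.

    Hypothetical helper, not part of PyQt: `get_html` is any callable
    that returns the current page source as a string.
    """

    def __init__(self, get_html, stable_polls=3):
        self._get_html = get_html
        self._stable_polls = stable_polls
        self._last = None       # most recent snapshot seen
        self._unchanged = 0     # consecutive polls that matched it

    def poll(self):
        """Take one snapshot. Return True once `stable_polls`
        consecutive polls matched the previous snapshot."""
        current = self._get_html()
        if current == self._last:
            self._unchanged += 1
        else:
            # Content changed (or first poll): restart the count.
            self._last = current
            self._unchanged = 0
        return self._unchanged >= self._stable_polls
```

In the crawler, one way to use it would be to start a `QTimer` from `_finished_loading`, call `poll()` on every `timeout()` with `get_html` wrapping `self.mainFrame().toHtml()`, and only write the file once `poll()` returns True. Driving it from a timer rather than a sleep loop keeps the Qt event loop running, so the AJAX requests can actually complete.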
> Perhaps you can look for some changes the AJAX calls will make in the page’s DOM
yeah that would be a reasonable workaround.
To start the AJAX calls I would need to trigger certain JavaScript functions, either directly or indirectly, e.g. through button click events.
I suppose I could do that by modifying the HTML to put the necessary JavaScript functions in the onLoad event, but that is a hack. Do you know the proper way to trigger JavaScript events?
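One possible approach, as a sketch only: `QWebFrame` has an `evaluateJavaScript()` method that runs a script in the loaded page, so the HTML does not need to be modified. The helper below only builds the JavaScript string for a click. `build_click_js` and the `#load-more` selector in the usage note are made-up names, and the snippet assumes the page's WebKit supports `document.querySelector`.

```python
import json


def build_click_js(css_selector):
    """Return JavaScript that dispatches a click on the first element
    matching `css_selector`. The script evaluates to true if an
    element was found and false otherwise."""
    template = (
        "(function () {"
        "  var el = document.querySelector(%s);"
        "  if (!el) { return false; }"
        "  var ev = document.createEvent('MouseEvents');"
        "  ev.initEvent('click', true, true);"  # bubbles, cancelable
        "  el.dispatchEvent(ev);"
        "  return true;"
        "})()"
    )
    # json.dumps produces a quoted, escaped string literal that is
    # also valid JavaScript, so the selector cannot break the script.
    return template % json.dumps(css_selector)
```

Inside `_finished_loading` this could be used as `self.mainFrame().evaluateJavaScript(build_click_js('#load-more'))`. A page function can be called directly the same way, by passing its call expression as the script string.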
OH, Man! Oh, man! Oh, man!
You just made my day a lot happier! Thank you!
A million times, thank you! =)
Hello. This article was very useful for me but I had to create something that fit my needs. Here[1]’s the source, as a contribution.
[1]: http://github.com/emyller/webkitcrawler