I’ve been bragging about this post for quite some time now. Well, I won’t do that any more, because it seems that pywebkitgtk isn’t the best way to do things out there and that my first solution to the problem sucks.
Yes, the sad truth…
Yesterday, I tried to put the application on the server – a Debian Lenny machine without X. And this is where it all broke down. I don’t want to install Xorg on this machine just so that a small script will work, so I looked for alternative ways to run the script. One of the first alternatives I found was Xvfb, which, according to Wikipedia…
In the X Window System, Xvfb or X virtual framebuffer is an X11 server that performs all graphical operations in memory, not showing any screen output. From the point of view of the client, it acts exactly like any other server, serving requests and sending events and errors as appropriate. However, no output is shown. This virtual server does not require the computer it is running on to even have a screen or any input device. Only a network layer is necessary.
…should get the job done. But it didn’t. While running under Xvfb, GTK kept throwing segmentation faults and crashing the whole script.
I was faced with the following decision: spend hours or perhaps days trying to see why Xvfb and GTK make such uneasy bedfellows, or rewrite a 50-line crawler script. I knew from my previous research on the matter that Python also had bindings for WebKit and Qt, so I gave it a try. And it proved to be a much better solution than GTK.
Qt to the rescue
Although I’m a Gnome/GTK fan, I must admit that Qt is a much better candidate for this job. First of all, it has extensive documentation, whereas pywebkitgtk’s is scarce. Second, it works in my particular case, which can prove to be a huge advantage.
Under Ubuntu and Debian, you can install the package by simply typing…
sudo apt-get install python-qt4 libqt4-webkit
…in the console. And you’re done. You can run applications with python and Qt. The rewritten crawler code is:
#!/usr/bin/env python
import sys
import signal
from optparse import OptionParser
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage


class Crawler( QWebPage ):
    def __init__(self, url, file):
        QWebPage.__init__( self )
        self._url = url
        self._file = file

    def crawl( self ):
        # restore the default SIGINT handler so Ctrl+C still stops the script
        signal.signal( signal.SIGINT, signal.SIG_DFL )
        # save the page as soon as WebKit reports that loading has finished
        self.connect( self, SIGNAL( 'loadFinished(bool)' ), self._finished_loading )
        self.mainFrame().load( QUrl( self._url ) )

    def _finished_loading( self, result ):
        # toHtml() returns the DOM as it looks after the page's scripts have run
        file = open( self._file, 'w' )
        file.write( self.mainFrame().toHtml() )
        file.close()
        sys.exit( 0 )


def main():
    # QWebPage needs a running QApplication, even though no window is shown
    app = QApplication( sys.argv )
    options = get_cmd_options()
    crawler = Crawler( options.url, options.file )
    crawler.crawl()
    sys.exit( app.exec_() )


def get_cmd_options():
    """
    gets and validates the input from the command line
    """
    usage = "usage: %prog [options] args"
    parser = OptionParser(usage)
    parser.add_option('-u', '--url', dest = 'url', help = 'URL to fetch data from')
    parser.add_option('-f', '--file', dest = 'file', help = 'Local file path to save data to')
    (options,args) = parser.parse_args()
    if not options.url:
        print 'You must specify an URL.',sys.argv[0],'--help for more details'
        exit(1)
    if not options.file:
        print 'You must specify a destination file.',sys.argv[0],'--help for more details'
        exit(1)
    return options


if __name__ == '__main__':
    main()
This time it really works. I feel warm and fuzzy on the inside
That’s awesome you got a Qt version working! – it works fine for me on Ubuntu.
Do you know a way to make Webkit wait until any AJAX calls have completed before emitting the loadFinished() signal?
I’m afraid that’s not possible, because loadFinished is a standard event that gets fired when the page has finished loading, whereas additional AJAX requests are not covered by it. It’s even harder because AJAX calls are asynchronous, and even if a function was called, you don’t know when the AJAX request will finish.
Perhaps you can look for some changes the AJAX calls will make in the page’s DOM. Monitor the HTML code for this kind of change and start when you detect it.
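A minimal sketch of that polling idea, kept free of Qt so the logic is easy to follow. The class name `DomWatcher`, its parameters and the state strings are my own invention for illustration, not part of PyQt4; `get_html` is assumed to be any callable that returns the page's current HTML as text.

```python
import time


class DomWatcher(object):
    """Polls a page's HTML until a marker appears or a deadline passes."""

    def __init__(self, get_html, marker, timeout=10.0, clock=time.time):
        self._get_html = get_html   # callable returning the current HTML as text
        self._marker = marker       # text the AJAX call is expected to add
        self._clock = clock         # injectable so the timeout can be tested
        self._deadline = clock() + timeout
        self.state = 'waiting'

    def tick(self):
        """Check the page once; meant to be called from a repeating timer."""
        if self.state != 'waiting':
            # already decided, do not touch the page again
            return self.state
        if self._marker in self._get_html():
            self.state = 'found'
        elif self._clock() >= self._deadline:
            self.state = 'timeout'
        return self.state
```

In the crawler above, `get_html` could be something like `lambda: unicode(self.mainFrame().toHtml())`, and `tick` could be called from a `QTimer` started in `_finished_loading`, writing the file once the state is no longer `'waiting'`. A timer is used rather than a sleep loop because blocking the Qt event loop would also block the AJAX request itself.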
> Perhaps you can look for some changes the AJAX calls will make in the page’s DOM
yeah that would be a reasonable workaround.
To start the AJAX calls I would need to trigger certain JavaScript functions, either directly, or indirectly through e.g. button click events.
I suppose I could do that by modifying the HTML to put the necessary JavaScript functions in the onLoad event, but that is a hack. Do you know the proper way to trigger JavaScript events?
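One possible approach, without touching the page's HTML, is to run a small piece of JavaScript in the loaded frame through `QWebFrame.evaluateJavaScript`. The sketch below only builds that JavaScript; the helper name `click_script` and the `#load-more` selector used afterwards are made up for illustration, and I am assuming the WebKit build supports `document.querySelector`.

```python
import json


def click_script(selector):
    """Return JavaScript that dispatches a click on the first matching element."""
    return (
        "(function () {"
        "var el = document.querySelector(%s);"
        "if (!el) { return false; }"
        "var evt = document.createEvent('MouseEvents');"
        "evt.initEvent('click', true, true);"
        "el.dispatchEvent(evt);"
        "return true;"
        "})()"
    ) % json.dumps(selector)  # json.dumps quotes and escapes the selector safely
```

Inside the crawler, once loadFinished has fired, this could be used as `self.mainFrame().evaluateJavaScript(click_script('#load-more'))`. A page function can be called directly the same way by passing its call as a string.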
OH, Man! Oh, man! Oh, man!
You just made my day a lot happier! Thank you!
A million times, thank you! =)
[...] Google it. I remember searching that night in a mix of Chinese and English until three in the morning and trying several methods, before finally finding a fairly reliable one: Downloading a page’s content with python and WebKit [...]
Hello. This article was very useful for me but I had to create something that fit my needs. Here[1]’s the source, as a contribution.
[1]: http://github.com/emyller/webkitcrawler
I use your script to crawl a web page on Windows, but the Chinese characters are garbled. Can you give me some advice on fixing this problem? Thank you!
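One likely cause, though I have not verified it on Windows, is that the HTML is written with the platform's default encoding instead of UTF-8. A minimal sketch of an explicit UTF-8 write follows; `save_html` is a hypothetical helper, not something in the script above.

```python
import io


def save_html(path, html):
    """Write HTML text to path as UTF-8, whatever the platform default is."""
    # io.open behaves the same on Python 2 and 3 and expects unicode text
    with io.open(path, 'w', encoding='utf-8') as handle:
        handle.write(html)
```

In `_finished_loading` this would replace the open/write/close lines with something like `save_html(self._file, unicode(self.mainFrame().toHtml()))`. If the characters are still wrong after that, the page itself was probably decoded with the wrong charset before it ever reached the file.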
Hi, got this error
cannot connect to X server
Wasn’t one of the requirements that the script run without X?
Thanks
Install Xvfb!
I used the exact same code on Windows, but the output file still has all the original JavaScript! Do you guys have any idea why?
I don’t have access to a windows machine, so I can’t help you…
[...] using things like wget or curl was out of question. I eventually find a nice way to do things here. The last thing that I needed was to parallelize it (the webpages were quite [...]
Tried those two examples but with no luck.
On Ubuntu 11.04 this link helped me http://github.com/emyller/webkitcrawler but explanations from Tudor were excellent.
Ubuntu python-webkitgtk … can not be found in 11.04 repository.
Thanks
I made a web scraping tool using WebKit and PySide; I hope it can help others.
@Suncokret – the package is ‘python-webkit’, which is pywebkitgtk. So much for meaningful names.
Cannot grab with this… (it cannot connect to most of the resources), for example on my blog topsidershoes.org
I don’t know why. Tried the two examples.
Tudor,
What’s the correct way to run this script? When I try simply running it from the shell I get the “Cannot connect to the X server” error (even though I have xvfb installed).
I have managed to get it to run by passing it as a parameter to xvfb-run (command below) though it only returns .
xvfb-run -a -s "-screen 0 640x480x16" python qttest.py -u=www.google.com -fout.html
Thanks,
Kevin
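A hedged sketch of how such an invocation could be assembled. Two details in it are assumptions about what goes wrong with the command quoted above, not confirmed fixes: optparse takes everything after a short option letter as its value, so `-u=www.google.com` is read as `=www.google.com`, and `QUrl` probably needs the `http://` scheme to treat the address as an absolute URL. The function name `xvfb_command` is made up for illustration.

```python
def xvfb_command(script, url, outfile):
    """Build the argument list for running the crawler under xvfb-run."""
    return [
        'xvfb-run',
        '-a',                          # pick a free display number automatically
        '-s', '-screen 0 640x480x16',  # arguments passed on to the Xvfb server
        'python', script,
        '-u', url,                     # value separated from the flag, no '='
        '-f', outfile,
    ]
```

The resulting list can be handed to `subprocess.call`, for example `subprocess.call(xvfb_command('qttest.py', 'http://www.google.com', 'out.html'))`, or typed by hand in the same order in a shell.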
[...] can help us mimic browsers more closely Found some interesting ones called Pywebkitgtk and PyQt http://blog.motane.lu/2009/07/07/dow…on-and-webkit/ Or even compiling something like firefox into xbmc? Not sure how feasible that is, but saw some [...]
Thank you for this post, I have found it to be very informative. I am still having trouble getting the DOM after everything has loaded. The URL in question is "http://director.flyerservices.com/LCL/AccessibleFlyer/AccessibleCitySelector.aspx?OrganizationId=797d6dd1-a19f-4f1c-882d-12d6601dc376&BannerId=3d5f3800-c099-11d9-9669-0800200c9a66&BannerName=LOB&PublicationType=1&Language=en&Version=TEXT&NoRedirect=true&province=9". Instead of getting a list of cities I just get the tags. The only success I have had is using Crowbar to render the contents and save them to a file, but I would rather do everything in Python. Any suggestions?