Last week I got a new assignment at my job: a crawler that was supposed to periodically visit some sites and download their content. Sounds simple, doesn’t it? Well, it’s not. Mainly because we also want to get all the Flash content, and some of it is inserted with JavaScript, via various libraries like SWFObject or, in some cases, directly with document.write. I needed a snapshot of what the page actually looks like when the user is viewing it in a browser.
This meant that I had to get the content *after* all the JavaScript contained in the page finished executing. In developer language, this means after the window.onload event takes place. And, of course, I also needed a JavaScript interpreter. So any attempt to use wget/cURL/file_get_contents was destined to fail from the start. I needed browser power.
So I googled around for some.
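To make the problem concrete, here is a minimal sketch of what a plain download gets to see. It uses only the standard library; the sample page, the movie.swf name and the TagCollector class are made up for illustration. The embed tag exists only as text inside the script, so a static parse never finds it:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

# Hypothetical page: the Flash embed is written by JavaScript at load time.
PAGE = """<html><body>
<script>document.write('<embed src="movie.swf">');</script>
</body></html>"""


class TagCollector(HTMLParser):
    """Records the name of every start tag found in the static source."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def static_tags(source):
    """Return the tags visible without running any JavaScript."""
    collector = TagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags


# The script body is treated as plain text, so 'embed' is not in the list.
print(static_tags(PAGE))
```

This is the same view wget or cURL would give you: the script is there, but nothing it would have produced.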
The first thing I came across was using COM to connect to an Internet Explorer instance from python, use it to navigate back and forth and get the HTML content as it’s interpreted by IE’s engine. This had 3 major drawbacks:
- it requires Internet Explorer
- it requires Microsoft Windows
- it requires an opened IE window
Since we want to migrate everything from our Windows servers to Linux, it would be pointless to go with this approach, since I’d have to rewrite it in a month or so. Leaving aside the “lameness” of the technologies involved.
And I’m looking for a solution that doesn’t require an opened browser window, mainly because it should work on servers without X because I don’t want to
(GTK doesn’t work without X – credits go to Alex Novac – and yes, it was retarded of me to think otherwise).
This solution wasn’t good enough, so I kept looking and came across the HtmlUnit Java library. This library is used to write tests in Java for web-based applications. Pretty cool. And not so much. Although Java was once my one true love, after all these years spent with scripting languages, declaring variables, compiling the code, writing only OOP code and so on seemed a little…unfamiliar. But it takes more than anApiWithReallyLongCamelCasedClassNames to stop me, so I installed Eclipse and ran some tests. Disappointing! The library isn’t very tolerant of messy HTML and JavaScript, and, since nobody out there in the real world actually abides by W3C recommendations, this library is somewhat useless in my case.
The next thing I tried was a Python-based solution that relied on integration with Gecko via hulahop. I must admit that I couldn’t get it to work under Ubuntu Jaunty Jackalope, due to incompatibilities in the system’s libraries. I’m sure that with enough time and patience, it can be persuaded to work. But, as I didn’t have any, I moved on and tried pywebkitgtk. This proved to be quite okay (I’m not a Safari fan) and it worked out of the box.
After spending several days searching the web, reading articles and trying out different software, I decided to share my findings with the world and write a tutorial on how to get the content of a page in Python *after* its JavaScript finished executing. Here it goes:
First of all, install pywebkitgtk. Under Ubuntu, you can do it directly from the repository:
sudo apt-get install python-webkitgtk libwebkit-1.0-1 libwebkit-dev
…it will attempt to install a lot of other stuff, linked libraries and so on. Just say yes.
After the installation is complete, it’s generally a good idea to test it! The following code should display a window with Google’s first page in it:
#!/usr/bin/env python
import gtk
import webkit
window = gtk.Window()
view = webkit.WebView()
view.open('http://www.google.com')
window.add(view)
window.show_all()
window.connect('delete-event', lambda window, event: gtk.main_quit())
gtk.main()
…if it doesn’t, maybe you did something wrong. See if all the packages are in place. For the conversation’s sake, let’s assume it worked and move on. As I said in the first paragraph, I want to load a webpage, wait for it to execute all the JS in it and take the generated HTML source. A strange problem with pywebkitgtk is that neither the WebView object nor the encapsulated WebFrame object has a “get_html()” method or something similar. Really, there is no clean way to get the site’s content. But, fortunately, on pywebkitgtk’s wiki I’ve found this hack that does just that:
class WebView(webkit.WebView):
    def get_html(self):
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html
It executes a bit of JavaScript that takes the content of the whole document and stores it in the title. And since there is a get_title() method that returns the title’s content, this workaround gets the job done. Kind of lame, but it suffices.
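One detail worth knowing: documentElement.innerHTML returns what is inside the html element, so the string you get back has no html tags around it and no doctype. If you need a complete document on disk, a small helper along these lines can put them back. This is only a sketch; wrap_inner_html is a made-up name, and the default doctype is an assumption, since the original page's doctype and any attributes on its html tag are not recoverable from innerHTML.

```python
def wrap_inner_html(inner_html, doctype='<!DOCTYPE html>'):
    """Rebuild a full document around documentElement.innerHTML output.

    The doctype is an assumed default, not the one the page was served with.
    """
    return '%s\n<html>%s</html>\n' % (doctype, inner_html)


# Example with a hand-written string standing in for get_html() output.
document = wrap_inner_html('<head><title>t</title></head><body>hi</body>')
print(document)
```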
As previously stated, in my application I didn’t want to have a browser window open, and with GTK it is possible to run your app without calling window.show() or window.show_all(). Long story short, this is how I did it:
#!/usr/bin/env python
import sys  # kudos to Nicholas Herriot (see comments)
import gtk
import webkit
import warnings
from time import sleep
from optparse import OptionParser

warnings.filterwarnings('ignore')


class WebView(webkit.WebView):
    def get_html(self):
        # stash the rendered markup in the title, read it back, then restore the title
        self.execute_script('oldtitle=document.title;document.title=document.documentElement.innerHTML;')
        html = self.get_main_frame().get_title()
        self.execute_script('document.title=oldtitle;')
        return html


class Crawler(gtk.Window):
    def __init__(self, url, file):
        gtk.gdk.threads_init()  # suggested by Nicholas Herriot for Ubuntu Koala
        gtk.Window.__init__(self)
        self._url = url
        self._file = file

    def crawl(self):
        view = WebView()
        view.open(self._url)
        view.connect('load-finished', self._finished_loading)
        self.add(view)
        # the window is never shown; the main loop only drives the page load
        gtk.main()

    def _finished_loading(self, view, frame):
        with open(self._file, 'w') as f:
            f.write(view.get_html())
        gtk.main_quit()


def main():
    options = get_cmd_options()
    crawler = Crawler(options.url, options.file)
    crawler.crawl()


def get_cmd_options():
    """
    gets and validates the input from the command line
    """
    usage = "usage: %prog [options] args"
    parser = OptionParser(usage)
    parser.add_option('-u', '--url', dest = 'url', help = 'URL to fetch data from')
    parser.add_option('-f', '--file', dest = 'file', help = 'Local file path to save data to')
    (options,args) = parser.parse_args()
    if not options.url:
        print 'You must specify an URL.',sys.argv[0],'--help for more details'
        sys.exit(1)
    if not options.file:
        print 'You must specify a destination file.',sys.argv[0],'--help for more details'
        sys.exit(1)
    return options


if __name__ == '__main__':
    main()
Download it, try it out. It worked wonders for me and I hope it will prove useful to other people too…
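A side note on the command line handling: the two hand-written checks in get_cmd_options() could also lean on optparse's own error reporting. The sketch below is an alternative under that assumption, not what the script above does; parse_options is a made-up name, and it takes the argument list as a parameter so it can be tried without touching sys.argv. parser.error() prints the usage line plus the message to stderr and exits with status 2.

```python
from optparse import OptionParser


def parse_options(argv):
    """Parse argv (without the program name) and require both options."""
    parser = OptionParser(usage="usage: %prog -u URL -f FILE")
    parser.add_option('-u', '--url', dest='url',
                      help='URL to fetch data from')
    parser.add_option('-f', '--file', dest='file',
                      help='Local file path to save data to')
    options, args = parser.parse_args(argv)
    if not options.url:
        parser.error('You must specify a URL.')
    if not options.file:
        parser.error('You must specify a destination file.')
    return options


# Example call with an explicit argument list.
options = parse_options(['-u', 'http://www.google.com', '-f', 'google.html'])
print(options.url)
```

In the real script you would call parse_options(sys.argv[1:]).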
I definitely have to try and write some more Python code when I finish my school exams. It’s just fucking amazing how quickly you get productive on things that would take at least three times as many lines of code in other languages (Python vs. Java/C/C++/C#).
You could say it’s laziness over speed (compared to C/C++), but after all, productivity is more important than speed, and more important than both is readability, which Python has built in as a feature rather than leaving it to the virtue of the programmer.
[...] been bragging with this post for quite some time now. Well, I won’t do that any more, because it seems that pywebkitgtk [...]
Oh, man! I was looking exactly for this for several hours already
I’m totally new to python/webkit and it was not obvious for me how I can solve a problem. Thanks a lot!
Try the Qt version, as it worked better for me! More details here.
hi
Thanks for the examples
im working on windows and I get this error:
Do you have any clue about what could be causing it?
Thanks
What version of python are you using? Is pywebkitgtk properly installed? I must admit that I haven’t tested on windows…
PS: perhaps the Qt version will yield better results, so have a look over this link: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/ .
Thanks
the Python version is 2.6.1
At home I use Ubuntu and it works, at work I’ll install a virtual machine with linux.
Thanks for your answer
I don’t use Windows so I don’t test my work in it.
Thanks… that was really helpful. I was wondering if it is possible to make an AJAX request from Python. I mean, once the page is loaded I want to go to another page by clicking on a link in that page. The HTML code for the link is as follows.
When I click on this link an AJAX request is sent and it updates the box on the current page. Is it possible to send the request from Python and get the response?
<a onclick="groups.render_box('group_members', {&quot;page&quot;: 2});return false;" href="#">Next</a>
This may help people, the very first example does not work on Ubuntu 9.10. The GTK threads are not initialized.
I found that by adding:
gtk.gdk.threads_init()
Oops, did not finish my post!
Yes, you need to add gtk.gdk.threads_init() to initialize GTK threads. I added it just after my last import.
You also have to add the import statement to use threads.
Hence:
import threads
Hope this helps some one….
Got a question. On the last example you don’t define the ‘sys’ object anywhere. So how did this work?
Hence line:
print 'You must specify an URL.',sys.argv[0],'--help for more details'
will cause a python error.
NameError: global name ‘sys’ is not defined
What is the proper syntax? As sys seems to have a list of arguments?
That got pasted wrong. I always ran it with the right arguments, so it never reached the sys.argv line. Python is an interpreted language, so a missing name only fails when that line actually runs.
I didn’t test it on Koala yet, nor will I anytime soon, since that was a project for my former company. But it worked on Jaunty and on Debian (can’t remember the version).
Added import sys and gtk.gdk.threads_init() in the code.
Tudor, you saved me a lot of work with this tutorial so I send you many thanks
I have discovered that you can use the ‘console-message’ signal to read content
from the html.
New code:
...
view.connect('load-finished', self._finished_loading)
view.connect("console-message", self._console)
...
def _finished_loading(self, view, frame):
    view.execute_script('console.log(document.documentElement.innerHTML);')
....
def _console(self, view, msg, line, sourceid):
    with open(self._file, 'w') as f:
        f.write(msg)
    gtk.main_quit()
This works well if there are no other console.log messages from the source html.
If there are, you can add some keys to your console message to recognize that they’re yours.
In self._finished_loading you can add third-party javascript sources, run javascript commands through view.execute_script… then do view.execute_script("console.log…") to get the content back to the python script.
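The "keys" idea can be sketched like this. The marker string, DUMP_SCRIPT and extract_dump are all made-up names for illustration; the only thing taken from the comment above is logging the markup through console.log and picking it up in the console-message handler.

```python
# Hypothetical prefix; any string unlikely to show up in the page's own logs.
MARKER = 'PYWEBKIT-DUMP:'

# JavaScript to hand to view.execute_script() once the page has loaded.
DUMP_SCRIPT = "console.log('%s' + document.documentElement.innerHTML);" % MARKER


def extract_dump(message):
    """Return the markup if this console message is ours, otherwise None."""
    if message.startswith(MARKER):
        return message[len(MARKER):]
    return None


# A message logged by the page itself is ignored, ours is unwrapped.
print(extract_dump('some debug output from the page'))
print(extract_dump(MARKER + '<head></head><body></body>'))
```

In the _console handler you would then call extract_dump(msg) and only write the file and quit the main loop when it returns something other than None.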
Definitely Use the main_resource from the main frame as you will see in the python api for webkit it is what holds on to the bulk from all the ajax type stuff running, only problem is I havent been able to render javascript and “dump” (as it were), an entire page from this api, It needs some rethinking to be a true crawler.
but here is what is required to view your gmail inbox in pywebkitgtk,
if you run the pywebkit browser demo
when in your gmail inbox will dump all your mail from lists in javascript source
Now If I could Just get the links extracted so I can navigate and recursively download my mail. Only because current provider blocks all SMTP.!!!
It seems the code’s execution has nothing to do with python-webkitgtk: I had not installed it, but the code above still does the job correctly~~
It’d be great if anyone could write a Scrapy extension for this.
Thanks !
This is wicked.
Interestingly, I get an exception for the import of threads.
I am on Ubuntu10.10 with Python 2.6.6.
I removed the import and stuff seems to be working ….
replace
gtk.gdk.threads_init()
with
import gobject
gobject.threads_init()
works even better
Hi,
Using selenium can be a solution as well
I have a problem installing python-webkitgtk. I googled for a solution, but couldn’t find much help.
Has anyone faced the same problem and solved it?
hi.
I have copied and pasted your code, but it fails and
I don’t know why:
>>> def crawl(self):
  File "", line 1
    def crawl(self):
    ^
IndentationError: unexpected indent
>>> view = WebView()
  File "", line 1
    view = WebView()
    ^
IndentationError: unexpected indent
>>> view.open(self._url)
  File "", line 1
    view.open(self._url)
    ^
IndentationError: unexpected indent
...
There’s a problem with the indentation in the code you pasted. Just re-indent everything using either tabs or spaces, not a combination of the two.
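If re-indenting by hand is tedious, something like the following can clean up a pasted snippet before you run it. It is a rough sketch: normalize_indentation is a made-up helper, and the tab width of 4 is an assumption that has to match whatever the pasted code used. It turns tabs into spaces and strips the indentation shared by all lines, which is what triggers "unexpected indent" when a method body is pasted on its own.

```python
import textwrap


def normalize_indentation(source, tab_size=4):
    """Expand tabs to spaces, then remove the common leading indentation."""
    expanded = source.expandtabs(tab_size)
    return textwrap.dedent(expanded)


# A pasted fragment: indented as a whole, with a tab mixed in on line 2.
pasted = "    def crawl(self):\n\t    view = 1\n        return view\n"

fixed = normalize_indentation(pasted)
compile(fixed, '<pasted>', 'exec')  # raises SyntaxError if still broken
print(fixed)
```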
Sourabh Singi> Can you please give more details on what distribution you are using?