Scraping google.com
I don't know when this happened, but it seems you can't scrape (parse) google.com as easily as before. At least with Python, a simple
import urllib  # Python 2 standard library
s = urllib.urlopen('http://www.google.com/search?q=define%3Aesoteric')
r = s.read()
print r

will give you this notice instead:

/snip/ Your client does not have permission to get URL /search?q=define%3Aesoteric from this server /snip/


And here's me not understanding why my regexes are not matching. :) Apparently, you should at least send some kind of User-Agent header. In Python you can do this easily by subclassing URLopener and setting its version attribute, which urllib sends as the User-Agent string, like:
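import urllib
class MyOpener(urllib.URLopener):
    # urllib sends this string as the User-Agent header
    version = "InternetExploiter/666"
# Make urllib.urlopen() use this opener from now on
urllib._urlopener = MyOpener()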
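
For anyone on Python 3, where urllib.urlopen and URLopener are no longer the way to go, here is a rough sketch of the same trick using urllib.request. The helper names build_request and fetch are my own, made up for illustration, and I'm only assuming the server still cares about nothing more than the User-Agent header; it may well check other things.

```python
import urllib.request

SEARCH_URL = "http://www.google.com/search?q=define%3Aesoteric"
USER_AGENT = "InternetExploiter/666"  # same made-up agent string as above


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Fetch the URL and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()

# Usage (performs a real network request):
#     html = fetch(SEARCH_URL)
```

Setting the header per request, instead of replacing a module-wide opener, keeps the change local to the calls that need it.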


And I'm really starting to love TextMate. Just type your little Python script and hit Cmd-R. Brilliant.