Parsing websites on a budget

Thursday, Jul 23, 2015
Categories: Developer
Tags: javascript, web app, web development, web scraping, YQL

Say you are an up-and-coming web developer. You want to make a web app that can access content on other sites. Perhaps you want to make a word cloud from a news article on the BBC, or you want to see which videos a site has embedded. You could achieve that with some server-side voodoo. The ‘standard’ way to go about it would be to download the site to your server and then serve its contents to your webpage. But that would require two things:

  1. Having enough bandwidth
  2. Knowing a server side language

Say, hypothetically, that you fulfil neither of these requirements. What now? Well, you can try to bypass the server and pull the website’s contents directly into your page. But that won’t work, thanks to the Same Origin Policy: the browser stops scripts on one origin from reading content served by another. So most, if not all, attempts you make to read content from another domain will be denied by your browser.

Another approach would be to use iframe elements in your page to load foreign content. But that technique can be used maliciously (if, for example, someone superimposes a hidden PayPal pay button in an iframe on top of another visible button). So many sites have scripts in place to detect whether they are being displayed in an iframe, along with measures to prevent it.
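To make the iframe point concrete, here is a minimal sketch of the kind of detection script a site might run. The function names `isFramed` and `bustOutOfFrame` are made up for illustration. Real sites vary in how they do this, and many rely on response headers such as `X-Frame-Options` instead of a script.

```javascript
// Returns true when `win` is not the top-level window,
// i.e. the page is being displayed inside a frame.
function isFramed(win) {
  return win.top !== win.self;
}

// Hypothetical "frame-busting" helper: if the page is framed,
// navigate the top-level window to this page's own address.
function bustOutOfFrame(win) {
  if (isFramed(win)) {
    win.top.location = win.self.location.href;
    return true;
  }
  return false;
}

// In a browser, a page would call: bustOutOfFrame(window);
```

Passing the window in as an argument, rather than reading the global `window` directly, is only there so the logic can be tried outside a browser with stand-in objects.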

It seems hopeless! But despair not: you can do what I did and harness the almighty power of Yahoo! Query Language (YQL). YQL is like SQL – but with a ‘Y’. And where SQL queries tables for information, YQL queries web pages (among other things). So, for example, to extract every hyperlink from a page, the query would be:

select * from html where url='' and xpath='//a'

You can then send that query in an XMLHttpRequest (XHR) to the endpoint specified in the YQL documentation to get your results. For the above query, the XHR URL would be:

*%20from%20html%20where%20url%3D''%20and%20xpath%3D'%2F%2Fa'&format=json&
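As a rough sketch of how this might look in a web app, the snippet below builds the query, URL-encodes it, and sends it with an XHR. The names `YQL_ENDPOINT`, `buildYqlQuery`, `buildYqlRequestUrl` and `fetchYql` are my own for illustration, and the endpoint value is a placeholder: substitute the one given in the YQL documentation. The sketch does not escape its inputs, so it assumes the page URL and XPath contain no single quotes.

```javascript
// Placeholder only (an assumption): use the endpoint from the YQL documentation.
const YQL_ENDPOINT = 'https://example.invalid/yql';

// Build the YQL statement that selects nodes matching `xpath` on `pageUrl`.
// No escaping is done here, so inputs must not contain single quotes.
function buildYqlQuery(pageUrl, xpath) {
  return "select * from html where url='" + pageUrl + "' and xpath='" + xpath + "'";
}

// Percent-encode the query and attach it to the endpoint, asking for JSON.
function buildYqlRequestUrl(endpoint, query) {
  return endpoint + '?q=' + encodeURIComponent(query) + '&format=json';
}

// Send the request from the browser and hand the parsed JSON to `onSuccess`.
function fetchYql(requestUrl, onSuccess, onError) {
  const xhr = new XMLHttpRequest();
  xhr.open('GET', requestUrl);
  xhr.onload = function () {
    if (xhr.status < 200 || xhr.status >= 300) {
      onError(new Error('Request failed with status ' + xhr.status));
      return;
    }
    let data;
    try {
      data = JSON.parse(xhr.responseText);
    } catch (err) {
      onError(err);
      return;
    }
    onSuccess(data);
  };
  xhr.onerror = function () {
    onError(new Error('Network error'));
  };
  xhr.send();
}
```

In a page, usage might look like `fetchYql(buildYqlRequestUrl(YQL_ENDPOINT, buildYqlQuery(someUrl, '//a')), console.log, console.error)`, where `someUrl` is whichever page you want to query.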

The YQL page will give you the URL for any request you make, and you can then use that as a template in your web app. Yay! You can see an implementation of this in a project of mine: WebWeb (source code). Happy coding!