Recently I wrote about how to use the Yahoo! Weather API with WordPress, and in the comments I was asked how to use it without relying on WordPress. The answer is cURL.
According to Wikipedia, the name cURL comes from "Client for URLs"; it is essentially a command-line interface for a web client. This means you can fetch web content from a script on your site. Websites most often use it to access web APIs such as Twitter, Flickr or, as in this case, the Yahoo! Weather API.
Note: There are actually loads of different options and settings for cURL, but we are only interested in a few. If you want to check them all out, you can view the docs on php.net.
Below is the code we will be using:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $file);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']);
$result = curl_exec($ch);
curl_close($ch);
Broken down we have:
- $ch = curl_init(); initialise a cURL handle
- curl_setopt($ch, CURLOPT_URL, $file); specify the file or URL to load
- curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); return the response as a string from curl_exec() instead of printing it
- curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']); set the HTTP Referer header sent with the request
- $result = curl_exec($ch); perform the cURL request
- curl_close($ch); close the handle and free its resources
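The steps above do not check whether the request actually succeeded. As a minimal sketch (not part of the original post), they could be wrapped in a helper that validates the URL, sets timeouts and returns null on failure. The function name fetch_url and the 10 second default are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original post): returns the body, or null on failure.
function fetch_url(string $url, int $timeoutSeconds = 10): ?string {
    // Reject anything that is not a syntactically valid URL before touching the network.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);             // return the body as a string
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);  // limit time spent connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);         // limit the whole transfer

    $result = curl_exec($ch); // false on failure
    if ($result === false) {
        error_log('cURL error: ' . curl_error($ch));
        $result = null;
    }
    curl_close($ch);

    return $result;
}
```

Checking the return value this way means a network failure shows up as null rather than as an empty or misleading result further down the line.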
A bit of rejigging from the original WordPress Yahoo! post and you end up with:
<?php
function bm_getWeather ($code = '', $temp = 'c') {
    // Build the feed URL: p is the location code, u is the temperature unit ('c' by default)
    $file = 'http://weather.yahooapis.com/forecastrss?p=' . $code . '&u=' . $temp;

    // Fetch the feed as a string
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $file);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']);
    $result = curl_exec($ch);
    curl_close($ch);

    // Pull the individual attributes out of the <yweather:condition> element
    $output = array (
        'temperature' => bm_getWeatherProperties('temp', $result),
        'weather' => bm_getWeatherProperties('text', $result),
        'weather_code' => bm_getWeatherProperties('code', $result),
        'class' => 'weatherIcon-' . bm_getWeatherProperties('code', $result),
    );
    return $output;
}

function bm_getWeatherProperties ($needle, $data) {
    // The outer < and > serve as the regex delimiters
    $regex = '<yweather:condition.*' . $needle . '="(.*?)".*/>';
    // Return an empty string rather than an undefined index when nothing matches
    if (preg_match($regex, $data, $matches)) {
        return $matches[1];
    }
    return '';
}
8 Responses to “Using cURL to Read the Contents of a Web Page”
Or you can use file_get_contents() for basic GET requests. http://php.net/manual/en/function.file-get-contents.php
It's easier, and often faster than cURL, though some hosts disable it.
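For comparison, here is a rough sketch of the file_get_contents() approach. The helper name fetch_simple and the timeout value are assumptions for illustration. Fetching URLs this way depends on the allow_url_fopen setting, which is what hosts turn off when they "disable it".

```php
<?php
// Hypothetical helper: fetch a URL (or local path) with file_get_contents(), null on failure.
function fetch_simple(string $location, int $timeoutSeconds = 10): ?string {
    // The 'http' context options only affect http:// and https:// streams.
    $context = stream_context_create([
        'http' => ['method' => 'GET', 'timeout' => $timeoutSeconds],
    ]);

    // @ hides the warning emitted on failure; the false return value is checked instead.
    $body = @file_get_contents($location, false, $context);

    return $body === false ? null : $body;
}
```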
I use file_get_contents too, but it's the fact that hosts disable it that makes cURL the more dependable choice. cURL also has a whole stack of options that let you grab just the parts you need, fake different browsers, etc.
Or send POST requests. It's very useful for dealing with REST APIs.
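A POST request with cURL might look something like the sketch below. The helper build_post_options and the endpoint https://api.example.com/items are placeholders made up for illustration. Building the option array separately keeps it easy to inspect before anything is sent.

```php
<?php
// Hypothetical helper: builds the cURL options for a form-encoded POST request.
function build_post_options(string $url, array $fields): array {
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,                       // send a POST instead of a GET
        CURLOPT_POSTFIELDS     => http_build_query($fields),  // URL-encode the form fields
        CURLOPT_RETURNTRANSFER => true,                       // return the response as a string
    ];
}

// Usage sketch; the endpoint below is a placeholder, not a real API.
// $ch = curl_init();
// curl_setopt_array($ch, build_post_options('https://api.example.com/items', ['name' => 'test']));
// $response = curl_exec($ch);
// curl_close($ch);
```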
I read a bit on cURL in Smashing Magazine. This is cool as well.
NB: Love the website.
Another great tool is the PHP Simple HTML DOM Parser. It lets you select portions of the page you're scraping using jQuery-style selectors.
This library has helped me countless numbers of times.
I've heard about this and had a play around with their demos, but never used it for anything complex. It's quite a cool idea, and definitely helpful for those sites that don't have an official API. What sort of situations have you needed to use this in? It'd be interesting to see some real-world examples.
I generally use SimpleXML. It's included with PHP5, so it's fairly standard across servers.
It turns an XML document into an object/array data structure. If you need to parse an HTML page, you can pass it through Tidy first.
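As a rough sketch of that approach applied to the weather feed from the post, the yweather:condition attributes could be read with SimpleXML instead of a regex. The sample XML below is made up to mirror the element the post's regex targets (it is not real API output), and read_condition is a hypothetical name.

```php
<?php
// Made-up sample shaped like the feed's <yweather:condition> element; not real API output.
$sample = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
  <channel>
    <item>
      <yweather:condition text="Cloudy" code="26" temp="12" />
    </item>
  </channel>
</rss>
XML;

// Hypothetical helper: returns the condition attributes, or null if they cannot be read.
function read_condition(string $rss): ?array {
    // Collect parse errors internally instead of emitting warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($rss);
    if ($xml === false || !isset($xml->channel->item)) {
        return null;
    }

    // The second argument (true) means 'yweather' is a prefix, not a namespace URI.
    $condition = $xml->channel->item->children('yweather', true)->condition;
    if (count($condition) === 0) {
        return null;
    }

    $attrs = $condition->attributes();
    return [
        'temperature'  => (string) $attrs['temp'],
        'weather'      => (string) $attrs['text'],
        'weather_code' => (string) $attrs['code'],
    ];
}

print_r(read_condition($sample));
```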
I scrape stock market data.
I have used curl, but curl is a bit complicated. Also, it doesn't work that well when a web site requires an extended dialog with the server, such as login and password and cookies, especially ASP.NET sites.
I have also used biterscripting IS (Internet Sessions). It also provides a command-line interface for a web client ( http://www.biterscripting.com/..._automatedinternet.html ). There are only a few commands to learn, and they work really, really well when it comes to conducting an extended dialog with a web server, including logging in, form filling, exchanging cookies and setting the standard ASP.NET variables such as __VIEWSTATE.
For web servers that don't require a login, simple commands work.
var string page
cat "http://www.something.com/path/to/some/page.ext" > $page
That gets the source for the page into a string variable without doing anything special.
script SS_WebPageToCSV.txt page("http://www.something.com/path/to/some/page.ext")
That extracts a table from a web page and puts it in CSV format.
You need the IS (Internet Session) only when the web server requires the client to establish explicit sessions. Other web pages are available with the simple cat/repro command.