Using cURL to Read the Contents of a Web Page

Recently I wrote about how to use the Yahoo! weather API with WordPress, and in the comments I was asked how to use it without relying on WordPress. The answer is cURL.

According to Wikipedia, the name cURL comes from “Client for URLs”, and it is essentially a command line interface for a web client. This means you can fetch web content from within a script on your site. It is most often used when a site needs to access web APIs such as Twitter, Flickr, or, as in this case, the Yahoo! weather API.

Note: There are actually loads of different options and settings for cURL, but we are only interested in a few. If you want to check them all out, you can view the docs on php.net.
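
For example, as well as the handful of options used in this post, you can tell cURL to follow redirects, give up after a timeout, or send a custom user agent string. A rough sketch of how those are set (the user agent string here is just an example value):

$ch = curl_init();
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any HTTP redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // give up if the request takes longer than 10 seconds
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (example)'); // pretend to be a particular browser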

Below is the code we will be using:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $file);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']);
$result = curl_exec($ch);
curl_close($ch);

Broken down we have:

  • $ch = curl_init(); initiate the cURL session and get a handle for it
  • curl_setopt($ch, CURLOPT_URL, $file); specify the URL (or file) to load
  • curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); tell cURL to return the raw content as a string rather than printing it out
  • curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']); set the HTTP referer to the current page
  • $result = curl_exec($ch); perform the cURL request (see the note on error checking below)
  • curl_close($ch); close the session and free the handle
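
One extra thing worth knowing: with CURLOPT_RETURNTRANSFER set, curl_exec() returns false if the request fails, so you can swap the last two lines of the snippet for something like this to see what went wrong (just a sketch):

$result = curl_exec($ch);

if ($result === false) {
	// the request failed; curl_error() explains why
	echo 'cURL error: ' . curl_error($ch);
}

curl_close($ch);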

A bit of rejigging from the original WordPress Yahoo! post and you end up with:

<?php
// Fetch the Yahoo! weather feed for a location code and pull out the bits we need
function bm_getWeather ($code = '', $temp = 'c') {
	$file = 'http://weather.yahooapis.com/forecastrss?p=' . $code . '&u=' . $temp;

	// grab the raw feed with cURL
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $file);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_REFERER, $_SERVER['REQUEST_URI']);
	$result = curl_exec($ch);
	curl_close($ch);

	// extract the individual properties from the feed
	$output = array (
		'temperature' => bm_getWeatherProperties('temp', $result),
		'weather' => bm_getWeatherProperties('text', $result),
		'weather_code' => bm_getWeatherProperties('code', $result),
		'class' => 'weatherIcon-' . bm_getWeatherProperties('code', $result),
	);

	return $output;
}

// Grab a single attribute from the yweather:condition element in the feed
function bm_getWeatherProperties ($needle, $data) {
	// the surrounding < and > act as the regex delimiters
	$regex = '<yweather:condition.*' . $needle . '="(.*?)".*/>';
	preg_match($regex, $data, $matches);

	// return an empty string if the attribute wasn't found
	return isset($matches[1]) ? $matches[1] : '';
}
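
Call the function with a Yahoo! location code and a temperature unit and you get an array back. For example (the location code here is just a made up example):

$weather = bm_getWeather('UKXX0085', 'c'); // 'UKXX0085' is just an example location code
echo $weather['weather'] . ', ' . $weather['temperature'] . '°c';

The array also includes a weather_code and a class entry (weatherIcon- plus the code) that you can use to hook up CSS weather icons.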

Posted in: Web Design

14 thoughts on “Using cURL to Read the Contents of a Web Page”

    1. I use file_get_contents too – but it’s the fact that some hosts disable it that makes cURL so much more flexible. Also, cURL has a whole stack of different options that let you grab just the parts you need, fake different browsers, etc.

    1. I’ve heard about this and had a play around with their demos, but never used it for anything complex. It’s quite a cool idea, and definitely helpful for those sites that don’t have an official API. What sort of situations have you needed to use this in? It’d be interesting to see some real world examples.

    2. I generally use SimpleXML. It’s included with PHP5, so it’s fairly standard across servers.

      It turns an XML document into an object/array data structure. If you need to parse an HTML page, you can pass it through Tidy first.
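
      For the Yahoo! feed in the post it would look something like this (a rough sketch, assuming $result holds the RSS returned by the cURL call):

      $xml = simplexml_load_string($result);
      $condition = $xml->channel->item->children('yweather', true)->condition; // elements under the yweather prefix
      $temperature = (string) $condition->attributes()->temp;
      $weather = (string) $condition->attributes()->text;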

  1. I scrape stock market data.

    I have used curl, but curl is a bit complicated. Also, it doesn’t work that well when a web site requires an extended dialog with the server, such as login and password and cookies, especially ASP.NET sites.

    I have also used biterscripting IS (Internet Sessions). It also provides a command line interface for a web client. ( http://www.biterscripting.com/helppages_automatedinternet.html ). There are only a few commands to learn and they work really, really well when it comes to conducting an extended dialog with a web server, including logging in, form filling, exchanging cookies and setting the standard ASP.NET variables _VIEWSTATE, etc, …

    For web servers that don’t require a login, simple commands work.

    var string page
    cat "http://www.something.com/path/to/some/page.ext" > $page

    That gets the source for the page into a string variable without doing anything special.

    script SS_WebPageToCSV.txt page(“http://www.something.com/path/to/some/page.ext”)

    That extracts a table from a web page and puts it in CSV format.

    You need the IS (internet session) only when the web server requires that the client establish explicit sessions. Other web pages are available with the simple cat/repro command.

  2. $document = new DOMDocument();
    @$document->loadHTML($html);
    $title = $document->getElementsByTagName('title')->item(0)->nodeValue;

    Set $html, and you will have the title 🙂
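
    Roughly, combined with the cURL snippet from the post, that might look like this (just a sketch, with a placeholder URL):

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);
    curl_close($ch);

    $document = new DOMDocument();
    @$document->loadHTML($html);
    echo $document->getElementsByTagName('title')->item(0)->nodeValue;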

  3. Hi,
    I also use an HTML DOM parser library, but it doesn’t work for grabbing content from each and every web site, so cURL is best.
    I parsed walmart.com for my demo application; at first I wrote HTML DOM code to grab the info, but the shipping and other details didn’t come through.

    How can I develop a secure web API to grab data quickly?

  4. Just to note, your regex needs updating (I copied and pasted yours and didn’t get any valid RSS). After checking it over I realised your regex had a space in after ‘yweather’; removing it made the example work for me:

    $regex = '<yweather:condition.*' . $needle . '="(.*?)".*/>';

  5. I am searching for a solution for my WordPress website, which comes with the Visual Composer plugin. Unfortunately this plugin uses file_get_contents(), which is not supported by my web hosting company for security reasons; however, they do support cURL.

    The support for this plugin is very poor and I have been waiting for their answer for a week now. Without this plugin my WP theme can go in the bin, as it is responsible for creating the layout for my website.

    Is there anyone who could help me, or at least point me towards what should be done?

    Thanks
