Web Scraping with PHP & CURL


We’ll start off simple, requesting and downloading a webpage, downloading images, then gradually move onto some more advanced topics, such as submitting forms (registration, login, etc…) and possibly even cracking captchas. In the end, we’ll roll everything we’ve learnt into one PHP class that can be used for quickly and easily building scrapers for almost any site.

So, first off, let’s write our first scraper in PHP and cURL to download a webpage:

<?php
    // Defining the basic cURL function
    function curl($url) {
        $ch = curl_init();  // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);    // Setting cURL's URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the data from the function
    }
?>

The function is then used like so:

 

<?php
    $scraped_website = curl("http://www.example.com");  // Executing our curl function to scrape the webpage http://www.example.com and return the results into the $scraped_website variable
?>
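One thing the basic function doesn’t do is tell us when a request fails: curl_exec() simply returns FALSE in that case. Below is a minimal sketch of one way to surface the failure, assuming we’re happy to throw an exception. The name curl_checked() is made up for this illustration and isn’t used elsewhere in the article.

```php
<?php
    // Hypothetical variant of curl() that reports failures instead of silently returning FALSE
    function curl_checked($url) {
        $ch = curl_init();  // Initialising cURL
        curl_setopt($ch, CURLOPT_URL, $url);    // Setting the URL to request
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Returning the page data rather than printing it
        $data = curl_exec($ch); // FALSE on failure, the page contents on success
        if ($data === FALSE) {
            $error = curl_error($ch);   // Human-readable description of what went wrong
            curl_close($ch);    // Closing cURL before bailing out
            throw new RuntimeException("cURL request failed: " . $error);
        }
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the downloaded data
    }
```

Calling it inside a try/catch block and catching RuntimeException lets the script log the problem and move on to the next URL rather than scraping an empty result.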

THE SCRAPING FUNCTION

In order to extract the required data from the complete page we’ve downloaded, we need to create a small function that will scrape data from between two strings, such as tags.

<?php
    // Defining the basic scraping function
    function scrape_between($data, $start, $end){
        $data = stristr($data, $start); // Stripping all data from before $start
        $data = substr($data, strlen($start));  // Stripping $start
        $stop = stripos($data, $end);   // Getting the position of the $end of the data to scrape
        $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data to scrape
        return $data;   // Returning the scraped data from the function
    }
?>

The comments in the function should explain it pretty clearly, but just to clarify further:

1. We define the scraping function as scrape_between(), which takes the parameters $data (string, the source you want to scrape from), $start (string, at which point you wish to scrape from), $end (string, at which point you wish to finish scraping).

2. stristr() is used to strip all data from before the $start position.

3. substr() is used to strip the $start from the beginning of the data. The $data variable now holds the data we want scraped, along with the trailing data from the input string.

4. stripos() is used to get the position of the $end of the data we want scraped, then substr() is used to leave us with just what we wanted scraped in the $data variable.

5. The data we wanted scraped, in $data, is returned from the function.
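To make those steps concrete, here is the same sequence run by hand on a small made-up string, with each intermediate value kept in its own variable. The input and the $step variable names are just for illustration.

```php
<?php
    $data  = "<html><head><title>Example Domain</title></head></html>";   // Made-up input
    $start = "<title>";
    $end   = "</title>";

    $step1 = stristr($data, $start);            // "<title>Example Domain</title></head></html>"
    $step2 = substr($step1, strlen($start));    // "Example Domain</title></head></html>"
    $stop  = stripos($step2, $end);             // 14, the offset where "</title>" begins
    $step3 = substr($step2, 0, $stop);          // "Example Domain"

    echo $step3;
```

Note that if $start or $end never appears in the input, the function as written ends up with an empty (falsy) result, so it’s worth treating an empty return value as “not found”.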

In a later part we’re going to look at using Regular Expressions (Regex) for finding strings to scrape that match a certain structure. But, for now, this small function is more than enough.

MODIFYING THE CURL FUNCTION

Gradually, as this series progresses, I’m going to introduce more and more of cURL’s options and features. Here, we’ve made a few small modifications to our function:

<?php    
    // Defining the basic cURL function
    function curl($url) {
        // Assigning cURL options to an array
        $options = Array(
            CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
            CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
            CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
            CURLOPT_CONNECTTIMEOUT => 120,   // Setting the maximum amount of time (in seconds) to wait while trying to connect
            CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time (in seconds) the whole cURL request may take
            CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
            CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
            CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
        );

        $ch = curl_init();  // Initialising cURL 
        curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL 
        return $data;   // Returning the data from the function 
    }
?>

If you look at the function above, it may seem rather different to the one we created earlier. However, it’s essentially the same, just with a few minor tweaks.

The first thing to note is that, rather than setting the options up one-by-one using curl_setopt(), we’ve created an array called $options to store them all. The array key stores the name of the cURL option and the array value stores its setting. This array is then passed to cURL using curl_setopt_array().

Aside from that, and the extra settings introduced, the function works the same way as before.

The extra cURL settings that have been added are CURLOPT_FOLLOWLOCATION, CURLOPT_AUTOREFERER, CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT, CURLOPT_MAXREDIRS and CURLOPT_USERAGENT. (CURLOPT_RETURNTRANSFER and CURLOPT_URL were already set in the first version.) They are explained in the comments of the function above.
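A side benefit of keeping the options in an array is that individual settings can be swapped out per request. The sketch below shows one way that might look. curl_options() and its $overrides parameter are hypothetical, and the list of defaults is trimmed down for brevity.

```php
<?php
    // Hypothetical helper: builds the options array, letting the caller override the defaults
    function curl_options($url, $overrides = array()) {
        $defaults = array(
            CURLOPT_RETURNTRANSFER => TRUE,  // Return the webpage data
            CURLOPT_FOLLOWLOCATION => TRUE,  // Follow 'location' HTTP headers
            CURLOPT_CONNECTTIMEOUT => 120,   // Seconds to wait while connecting
            CURLOPT_TIMEOUT => 120,          // Seconds the whole request may take
            CURLOPT_URL => $url,             // The URL to request
        );
        // The + operator keeps the left-hand value when a key exists in both arrays,
        // so anything passed in $overrides wins over the matching entry in $defaults
        return $overrides + $defaults;
    }
```

Inside curl(), the call would then become something like curl_setopt_array($ch, curl_options($url, array(CURLOPT_TIMEOUT => 30))); for a request that should give up after 30 seconds.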

PUTTING IT ALL TOGETHER

We place both of those functions in our PHP script and we can use them like so:

<?php
    $scraped_page = curl("http://www.imdb.com");    // Downloading IMDB home page to variable $scraped_page
    $scraped_data = scrape_between($scraped_page, "<title>", "</title>");   // Scraping downloaded data in $scraped_page for content between <title> and </title> tags

    echo $scraped_data; // Echoing $scraped_data, should show "The Internet Movie Database (IMDb)"
?>

As you can see, this small scraper visits the IMDb website, downloads the page and scrapes the page title from between the ‘title’ tags, then echoes the result.

SCRAPING MULTIPLE DATA POINTS FROM A WEB PAGE

Visiting a web page and scraping one piece of data is hardly impressive, let alone worth building a script for. I mean, you could just as easily open your web browser and copy/paste it for yourself.

So, we’ll expand on this a bit and scrape multiple data points from a web page.

For this we’ll still be using IMDb as our target site. However, we’re going to try scraping the search results page for a list of URLs returned for a specific search query.

First up, we need to find out how the search form works. Lucky for us the search query is shown in the URL on the search results page:

http://www.imdb.com/search/title?title=goodfellas

Here, goodfellas is the keyword being searched for and title is the attribute being searched within. For our purposes, searching for the name of a film here is pretty pointless: it’s only going to return a single value that we’d actually want. So, instead, let’s try searching by genre:

http://www.imdb.com/search/title?genres=action

Note: These different attributes can be found by going to the http://www.imdb.com/search/title page and performing a search, then looking at the URL of the search results page.

Now that we have this, we can feed the URL (http://www.imdb.com/search/title?genres=action) into our script, which gives us back a page containing the list of results we want to scrape.
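If you’d rather not glue the query string together by hand, PHP’s http_build_query() can do it from an array of attribute => value pairs. This is just a small sketch. The helper name build_search_url() is made up, and it assumes the search page accepts the same attributes we saw in the URLs above.

```php
<?php
    // Hypothetical helper: builds a search URL from an array of attribute => value pairs
    function build_search_url($params) {
        return "http://www.imdb.com/search/title?" . http_build_query($params);    // http_build_query() URL-encodes the values for us
    }

    echo build_search_url(array("genres" => "action"));    // http://www.imdb.com/search/title?genres=action
```

This keeps the search attributes in one place, which makes it easy to swap genres for title or another attribute later on.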

Now we need to break up this page into separate sections for each result, then iterate over the sections and scrape the URL of each result.

<?php
    $url = "http://www.imdb.com/search/title?genres=action";    // Assigning the URL we want to scrape to the variable $url
    $results_page = curl($url); // Downloading the results page using our curl() function

    $results_page = scrape_between($results_page, '<div id="main">', '<div id="sidebar">'); // Scraping out only the middle section of the results page that contains our results

    $separate_results = explode('<td class="image">', $results_page);   // Exploding the results into separate parts into an array

    $results_urls = array();    // Initialising the array that will hold the scraped URLs

    // For each separate result, scrape the URL
    foreach ($separate_results as $separate_result) {
        if ($separate_result != "") {
            $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, 'href="', '" title='); // Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
        }
    }

    print_r($results_urls); // Printing out our array of URLs we've just scraped
?>

Here’s an explanation of what’s happening, in case it’s not already clear.

1. Assigning the search results page URL we want to scrape to the $url variable.

2. Downloading the results page using our curl() function.

3. Next, we scrape out just the section of results we need, stripping away the header, sidebar, etc…

4. We need to identify each search result by a common string that can be used to explode the results. This string, which every result has, is <td class="image">. We use this to explode the results into the array $separate_results.

5. For each separate result, if it’s not empty, we scrape the URL data from between the start point of href=" and end point of " title= and add it to our $results_urls array. Because IMDb uses relative URLs rather than full ones, we prepend http://www.imdb.com to each result to give us a full URL which can be used later.

6. Right at the end, we print out our array of URLs, just to check that the script worked properly.
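If you want to try these steps without hitting IMDb at all, the sketch below runs them against a small chunk of made-up markup. The HTML is invented to match the strings the script looks for and is not copied from the real page. One deliberate difference: the first chunk produced by explode() is whatever comes before the first delimiter, so this version skips any chunk that has no href in it rather than only skipping empty ones.

```php
<?php
    // Same scrape_between() as defined earlier, repeated so this snippet runs on its own
    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start);
        $data = substr($data, strlen($start));
        $stop = stripos($data, $end);
        return substr($data, 0, $stop);
    }

    // Made-up markup, standing in for a downloaded results page
    $results_page = '<div id="header">IMDb</div>'
        . '<div id="main"><table>'
        . '<tr><td class="image"><a href="/title/tt0000001/" title="First Film"></a></td></tr>'
        . '<tr><td class="image"><a href="/title/tt0000002/" title="Second Film"></a></td></tr>'
        . '</table></div>'
        . '<div id="sidebar">Links</div>';

    $results_page = scrape_between($results_page, '<div id="main">', '<div id="sidebar">');    // Keeping only the results section
    $separate_results = explode('<td class="image">', $results_page);   // One chunk per result, plus the leading chunk

    $results_urls = array();
    foreach ($separate_results as $separate_result) {
        if (stripos($separate_result, 'href="') !== FALSE) {    // Skipping chunks that contain no link
            $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, 'href="', '" title=');
        }
    }

    print_r($results_urls); // Two full URLs, one per made-up result
```

Testing against a saved or hand-written sample like this is a handy way to check your start and end strings before pointing the scraper at the live site.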

That’s pretty cool, no? Well, it would be, but so far all we have is a short list of URLs.

If we take our scraper script so far, we can perform a basic search on IMDb and scrape the single page of results that is returned for the movies’ URLs.

But what if we want to scrape all of the results pages? What if we then want to scrape all of the results for their specific attributes, such as movie name, release date, description, director and so on…?

Well, that’s what we’ll be doing next. Using PHP and cURL to navigate the results pages and scrape multiple pages of the website for data and organise that data into a logical structure for further use.

So, our first task is to get the URLs from all of the results pages. This involves evaluating whether there is another page of results and, if there is, visiting it, scraping the results URLs and adding them to our array.

If we take our script from the previous section and include our scrape_between() and curl() functions, we need to make the following changes to the script.

<?php

    $continue = TRUE;   // Assigning a boolean value of TRUE to the $continue variable

    $results_urls = array();    // Initialising the array that will hold the scraped URLs

    $url = "http://www.imdb.com/search/title?genres=action";    // Assigning the URL we want to scrape to the variable $url

    // While $continue is TRUE, i.e. there are more search results pages
    while ($continue == TRUE) {

        $results_page = curl($url); // Downloading the results page using our curl() function

        $results_page = scrape_between($results_page, '<div id="main">', '<div id="sidebar">'); // Scraping out only the middle section of the results page that contains our results

        $separate_results = explode('<td class="image">', $results_page);   // Exploding the results into separate parts into an array

        // For each separate result, scrape the URL
        foreach ($separate_results as $separate_result) {
            if ($separate_result != "") {
                $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, 'href="', '" title='); // Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
            }
        }

        // Searching for a 'Next' link. If it exists scrape the url and set it as $url for the next loop of the scraper
        if (strpos($results_page, "Next&nbsp;&raquo;") !== FALSE) { // Comparing with FALSE, as strpos() can legitimately return 0
            $continue = TRUE;
            $url = scrape_between($results_page, '<span class="pagination">', '</span>');
            if (strpos($url, "Prev</a>") !== FALSE) {
                $url = scrape_between($url, "Prev</a>", ">Next");
            }
            $url = "http://www.imdb.com" . scrape_between($url, 'href="', '"');
        } else {
            $continue = FALSE;  // Setting $continue to FALSE if there's no 'Next' link
        }
        sleep(rand(3,5));   // Sleep for 3 to 5 seconds. Useful if not using proxies. We don't want to get into trouble.
    }
?>

First up, we retrieve the initial results page. Then we scrape all of the results and add them to the array $results_urls. Then we check to see if there is a “Next” link to another page of results. If there is, we scrape its URL and loop through the script again to repeat the scraping of results from the next page. The loop continues to visit the next page, scraping the results, until there are no more pages of results.
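The “is there a Next link?” check is the fiddliest part of the loop, so here it is pulled out into its own function and run against made-up pagination markup. This is only a sketch: next_page_url() is a hypothetical name, and the sample HTML is invented to match the strings the script searches for.

```php
<?php
    // Same scrape_between() as defined earlier, repeated so this snippet runs on its own
    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start);
        $data = substr($data, strlen($start));
        $stop = stripos($data, $end);
        return substr($data, 0, $stop);
    }

    // Hypothetical helper: returns the full URL of the next results page, or FALSE on the last page
    function next_page_url($results_page) {
        if (strpos($results_page, "Next&nbsp;&raquo;") === FALSE) {
            return FALSE;   // No 'Next' link, so there are no more pages
        }
        $pagination = scrape_between($results_page, '<span class="pagination">', '</span>');
        if (strpos($pagination, "Prev</a>") !== FALSE) {
            $pagination = scrape_between($pagination, "Prev</a>", ">Next");    // Skipping past the 'Prev' link
        }
        return "http://www.imdb.com" . scrape_between($pagination, 'href="', '"');
    }

    // Made-up pagination markup for a middle page and for the last page
    $middle_page = '<span class="pagination"><a href="/search/title?start=1">Prev</a> | <a href="/search/title?start=101">Next&nbsp;&raquo;</a></span>';
    $last_page   = '<span class="pagination"><a href="/search/title?start=51">Prev</a></span>';

    var_dump(next_page_url($middle_page));  // The 'Next' link's URL with the IMDb host prepended
    var_dump(next_page_url($last_page));    // bool(false)
```

With a helper like this, the loop condition boils down to “keep going while next_page_url() returns a URL”, which is a little easier to read than juggling the $continue flag.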

Now we have an array with all of the results URLs, which we can loop over with foreach() to visit each URL and scrape the results. I’ll leave that to you; with what we’ve covered so far it should be easy to figure out.

I’ll get you started:

foreach($results_urls as $result_url) {
    // Visit $result_url 
    // Scrape data from page 
    // Add to array or other suitable data structure
}

 

ISSUES:

The script is taking longer than the max_execution_time you have set in your php.ini.

You can change this directly by editing the following line in your php.ini file:

max_execution_time = 600 ; 600 seconds = 10 minutes

Alternatively, you can use PHP’s set_time_limit() function in your script. Personally, I would do this.

So, for the code above, edit it like so:

foreach($results_urls as $result_url) {
 
    set_time_limit(60);    // Setting execution time to 1 minute for each iteration of the loop
 
    $listings_page = curl($result_url);    // Retrieving listings page
      
    $listing_titles[] = scrape_between($listings_page, '<span class="itemprop" itemprop="name">', '</span>');    // Scraping the listing title and adding to array
     
    sleep(rand(3,5));   // Sleep for 3 to 5 seconds. Useful if not using proxies. We don't want to get into trouble.
  
}
 
print_r($listing_titles);    // Printing out the array of titles on screen
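Earlier we talked about organising the scraped data into a logical structure. One simple option is an associative array per listing, collected into one outer array. The sketch below shows the idea on made-up markup. The description element and its itemprop attribute are assumptions for illustration only, so check the real page source for the strings to scrape between.

```php
<?php
    // Same scrape_between() as defined earlier, repeated so this snippet runs on its own
    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start);
        $data = substr($data, strlen($start));
        $stop = stripos($data, $end);
        return substr($data, 0, $stop);
    }

    // Made-up markup, standing in for a downloaded listing page
    $listings_page = '<h1><span class="itemprop" itemprop="name">First Film</span></h1>'
        . '<div itemprop="description"> A short made-up description. </div>';

    $listings = array();    // One entry per listing, each holding several attributes

    $listings[] = array(
        "title" => scrape_between($listings_page, '<span class="itemprop" itemprop="name">', '</span>'),
        "description" => trim(scrape_between($listings_page, '<div itemprop="description">', '</div>')),   // trim() removes stray whitespace around the text
    );

    print_r($listings); // Printing out the structured data on screen
```

Inside the real loop you would build one of these arrays per $result_url, which makes it straightforward to write the results to a CSV file or a database later.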

 
