Web Scraping with PHP & CURL

We'll start off simple, requesting and downloading a webpage and downloading images, then gradually move on to some more advanced topics, such as submitting forms (registration, login, etc.) and possibly even cracking captchas. In the end, we'll roll everything we've learnt into one PHP class that can be used for quickly and easily building scrapers for almost any site.

So, first off, let's write our first scraper in PHP and cURL to download a webpage:
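A bare-bones version, setting each of cURL's options one-by-one with curl_setopt(), might look like this:

<?php
    function curl($url) {
        $ch = curl_init();  // Initialising cURL

        curl_setopt($ch, CURLOPT_URL, $url);    // Setting cURL's URL option with the $url variable passed into the function
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data rather than output it

        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL
        return $data;   // Returning the data from the function
    }
?>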

This function is then used as such:
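<?php
    $scraped_website = curl("http://www.imdb.com");    // Executing our curl function to download http://www.imdb.com and return its HTML to $scraped_website
?>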

 

THE SCRAPING FUNCTION

In order to extract the required data from the complete page we’ve downloaded, we need to create a small function that will scrape data from between two strings, such as tags.
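<?php
    function scrape_between($data, $start, $end) {
        $data = stristr($data, $start); // Stripping all data from before $start
        $data = substr($data, strlen($start));  // Stripping $start from the beginning of the data
        $stop = strpos($data, $end);    // Getting the position of the $end of the data we want scraped
        $data = substr($data, 0, $stop);    // Stripping all data from after and including the $end of the data we want scraped
        return $data;   // Returning the scraped data from the function
    }
?>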

The comments in the function should explain it pretty clearly, but just to clarify further:

1. We define the scraping function as scrape_between(), which takes the parameters $data (string, the source you want to scrape from), $start (string, the point at which to start scraping) and $end (string, the point at which to stop scraping).

2. stristr() is used to strip all data from before the $start position.

3. substr() is used to strip the $start from the beginning of the data. The $data variable now holds the data we want scraped, along with the trailing data from the input string.

4. strpos() is used to get the position of the $end of the data we want scraped, then substr() is used to leave us with just what we wanted scraped in the $data variable.

5. The data we wanted scraped, in $data, is returned from the function.
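For example, a quick test with a made-up string (the tags here are purely for illustration):

echo scrape_between("<h1>Hello World</h1>", "<h1>", "</h1>");   // Outputs: Hello World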

In a later part we’re going to look at using Regular Expressions (Regex) for finding strings to scrape that match a certain structure. But, for now, this small function is more than enough.
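As a quick preview (the pattern here is just an example; nothing in this part depends on it), grabbing a page title from a downloaded page in $data with a regular expression would look something like:

if (preg_match("/<title>(.*?)<\/title>/si", $data, $matches)) {
    echo $matches[1];   // The first captured group holds the text between the tags
}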

MODIFYING THE CURL FUNCTION

Gradually, as this series progresses, I'm going to introduce more and more of cURL's options and features. Here, we've made a few small modifications to our function:

<?php
    function curl($url) {
        // Assigning cURL options to an array
        $options = array(
            CURLOPT_RETURNTRANSFER => TRUE,  // Setting cURL's option to return the webpage data
            CURLOPT_FOLLOWLOCATION => TRUE,  // Setting cURL to follow 'location' HTTP headers
            CURLOPT_AUTOREFERER => TRUE, // Automatically setting the referer when following 'location' HTTP headers
            CURLOPT_CONNECTTIMEOUT => 120,   // Setting the amount of time (in seconds) before the request times out
            CURLOPT_TIMEOUT => 120,  // Setting the maximum amount of time for cURL to execute queries
            CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
            CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8",  // Setting the useragent
            CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
        );

        $ch = curl_init();  // Initialising cURL 
        curl_setopt_array($ch, $options);   // Setting cURL's options using the previously assigned array data in $options
        $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
        curl_close($ch);    // Closing cURL 
        return $data;   // Returning the data from the function 
    }
?>

If you look at the function above, it may seem rather different to the one we created in Part 1; however, it's essentially the same, just with a few minor tweaks.

The first thing to note is that, rather than setting the options up one-by-one using curl_setopt(), we've created an array called $options to store them all. The array key stores the name of the cURL option and the array value stores its setting. This array is then passed to cURL using curl_setopt_array().

Aside from that and the extra settings introduced, this function works in exactly the same way.

The extra cURL settings that have been added are CURLOPT_RETURNTRANSFER, CURLOPT_FOLLOWLOCATION, CURLOPT_AUTOREFERER, CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT, CURLOPT_MAXREDIRS and CURLOPT_USERAGENT. They are explained in the comments of the function above.

PUTTING IT ALL TOGETHER

We place both of those functions in our PHP script and we can use them like so:

<?php
    $scraped_page = curl("http://www.imdb.com");    // Downloading the IMDb home page to the $scraped_page variable
    $scraped_data = scrape_between($scraped_page, "<title>", "</title>");   // Scraping downloaded data in $scraped_page for content between <title> and </title> tags

    echo $scraped_data; // Echoing $scraped_data, should show "The Internet Movie Database (IMDb)"
?>

As you can see, this small scraper visits the IMDb website, downloads the page, scrapes the page title from between the <title> tags, then echoes the result.

SCRAPING MULTIPLE DATA POINTS FROM A WEB PAGE

Visiting a web page and scraping one piece of data is hardly impressive, let alone worth building a script for. I mean, you could just as easily open your web browser and copy/paste it for yourself.

So, we’ll expand on this a bit and scrape multiple data points from a web page.

For this we’ll still be using IMDb as our target site, however, we’re going to try scraping the search results page for a list of URLs returned for a specific search query.

First up, we need to find out how the search form works. Lucky for us the search query is shown in the URL on the search results page:

http://www.imdb.com/search/title?title=goodfellas

The value of the parameter (goodfellas) is the keyword being searched for.

The name of the parameter (title) is the attribute being searched within. For our purposes, searching for the name of a film here is pretty pointless: it's only going to return a single result that we'd actually want. So, instead, let's try searching by genre:

http://www.imdb.com/search/title?genres=action

Note: These different attributes can be found by going to the http://www.imdb.com/search/title page and performing a search, then looking at the URL of the search results page.

Now that we have this, we can feed the URL (http://www.imdb.com/search/title?genres=action) into our script and get back a page with the list of results we want to scrape.
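As an aside, if you'd rather build these query URLs in code than by hand, PHP's http_build_query() can assemble the parameters for you (using the genres parameter we found above):

$url = "http://www.imdb.com/search/title?" . http_build_query(array("genres" => "action"));    // Gives http://www.imdb.com/search/title?genres=action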

Now we need to break up this page into separate sections for each result, then iterate over the sections and scrape the URL of each result.

<?php
    $url = "http://www.imdb.com/search/title?genres=action";    // Assigning the search results page URL we want to scrape to the $url variable
    $results_page = curl($url); // Downloading the results page using our curl() function

    $results_page = scrape_between($results_page, "<div id=\"main\">", "<div id=\"sidebar\">"); // Scraping out only the middle section of the page that contains our results

    $separate_results = explode("<td class=\"image\">", $results_page);    // Exploding the results into separate parts in an array

    // For each separate result, scrape the URL and add it to our array
    foreach ($separate_results as $separate_result) {
        if ($separate_result != "") {
            $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title=");    // Prepending the domain to the relative URL we scrape
        }
    }

    print_r($results_urls); // Printing out our array of URLs to check the script worked
?>

Now, an explanation of what's happening here, if it's not already clear:

1. Assigning the search results page URL we want to scrape to the $url variable.

2. Downloading the results page using our curl() function.

3. Here we scrape out just the section of the page that holds the results, stripping away the header, sidebar, etc. (the <div id="main"> and <div id="sidebar"> markers are taken from IMDb's markup at the time of writing).

4. We need to identify each search result by a common string that can be used to explode the results. This string, which every result has, is <td class="image">. We use this to explode the results into the array $separate_results.

5. For each separate result, if it's not empty, we scrape the URL data from between the start point of href=" and end point of " title= and add it to our $results_urls array. But, in the process, because IMDb uses relative URLs instead of absolute ones, we need to prepend http://www.imdb.com to our result to give us a full URL which can be used later.

6. Right at the end, we print out our array of URLs, just to check that the script worked properly.
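If everything has worked, the output should look something like this (the title IDs below are placeholders; yours will reflect whatever the search actually returns):

Array
(
    [0] => http://www.imdb.com/title/tt0000001/
    [1] => http://www.imdb.com/title/tt0000002/
    [2] => http://www.imdb.com/title/tt0000003/
)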

That’s pretty cool, no? Well, it would be, but so far all we have is a short list of URLs.

If we take our scraper script so far, we can perform a basic search on IMDb and scrape the single page of results that is returned for the movies’ URLs.

But what if we want to scrape all of the results pages? What if we then want to scrape all of the results for their specific attributes, such as movie name, release date, description, director and so on…?

Well, that’s what we’ll be doing next. Using PHP and cURL to navigate the results pages and scrape multiple pages of the website for data and organise that data into a logical structure for further use.

So, our first task is to get the URLs from all of the results pages. This involves evaluating whether there is another page of results and, if there is, visiting it, scraping the results URLs and adding them to our array.

If we take our script from last time and include our scrape_between() and curl() functions, we need to make the following changes to the script.

<?php
    $continue = TRUE;   // Assigning a boolean value of TRUE to the $continue variable

    $url = "http://www.imdb.com/search/title?genres=action";    // Assigning the URL we want to start scraping from to the $url variable

    // While $continue is TRUE, i.e. there are more pages of search results
    while ($continue == TRUE) {

        $results_page = curl($url); // Downloading the results page using our curl() function

        $results_page = scrape_between($results_page, "<div id=\"main\">", "<div id=\"sidebar\">"); // Scraping out only the middle section of the page that contains our results

        $separate_results = explode("<td class=\"image\">", $results_page);    // Exploding the results into separate parts in an array

        // For each separate result, scrape the URL and add it to our array
        foreach ($separate_results as $separate_result) {
            if ($separate_result != "") {
                $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title=");
            }
        }

        // Checking for a 'Next »' link to another page of results
        if (strpos($results_page, "Next&nbsp;&raquo;") !== FALSE) {
            $pagination = scrape_between($results_page, "<span class=\"pagination\">", "</span>");  // Scraping out the pagination links
            if (strpos($pagination, "Prev</a>") !== FALSE) {
                $pagination = stristr($pagination, "Prev</a>"); // Skipping past the 'Prev' link so the next href we find belongs to the 'Next' link
            }
            $url = "http://www.imdb.com" . scrape_between($pagination, "href=\"", "\"");    // Scraping the 'Next' link and setting it as the $url for the next loop
        } else {
            $continue = FALSE;  // No 'Next' link, so there are no more pages of results - stopping the loop
        }
    }

    print_r($results_urls); // Printing out our array of URLs we've just scraped
?>

First up, we retrieve the initial results page, scrape all of the results and add them to the array $results_urls. Then we check whether there is a "Next" link to another page of results; if there is, we scrape its URL, set it as the next page to visit and loop. The loop keeps visiting the next page and scraping its results until there are no more pages of results.

Now we have an array with all of the results URLs, which we can loop over with foreach() to visit each URL and scrape the data we want. I'll leave that to you; with what we've covered so far it should be easy to figure out.

I’ll get you started:

foreach($results_urls as $result_url) {
    // Visit $result_url 
    // Scrape data from page 
    // Add to array or other suitable data structure
}

 

ISSUES:

The script is taking longer than the max_execution_time you have set in your php.ini

You can raise the limit directly by editing this line in your php.ini file:

max_execution_time = 600 ; 600 seconds = 10 minutes

Alternatively, you can use PHP’s set_time_limit() function in your script. Personally, I would do this.

So, for the code above, edit it like so:

foreach($results_urls as $result_url) {
 
    set_time_limit(60);    // Setting execution time to 1 minute for each iteration of the loop
 
    $listings_page = curl($result_url);    // Retrieving listings page
      
    $listing_titles[] = scrape_between($listings_page, "<title>", "</title>");    // Scraping the listing title from between the <title> tags and adding it to the array
     
    sleep(rand(3,5));   // Sleep for 3 to 5 seconds. Useful if not using proxies. We don't want to get into trouble.
  
}
 
print_r($listing_titles);    // Printing out the array of titles on screen

 
