We’ll start off simple, requesting and downloading a webpage, downloading images, then gradually move on to some more advanced topics, such as submitting forms (registration, login, etc…) and possibly even cracking captchas. In the end, we’ll roll everything we’ve learnt into one PHP class that can be used for quickly and easily building scrapers for almost any site.
So, first off, let’s write our first scraper in PHP and cURL to download a webpage.
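A minimal first version (we’ll call the function curl() here, and set each cURL option one at a time with curl_setopt(); later in this post we’ll move the options into an array) looks something like this:

<?php
function curl($url) {
    $ch = curl_init(); // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Telling cURL to return the webpage data rather than print it
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
?>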
This function is then used as such:
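For example, to download the IMDb home page (the site we’ll be working with throughout this post) into a variable:

<?php
$scraped_page = curl("http://www.imdb.com"); // Downloading the IMDb home page to the $scraped_page variable
?>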
THE SCRAPING FUNCTION
In order to extract the required data from the complete page we’ve downloaded, we need to create a small function that will scrape data from between two strings, such as tags.
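Here’s that function, a minimal version built from stristr(), substr(), and strpos():

<?php
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start from the beginning of the data
    $stop = strpos($data, $end); // Getting the position of $end in what's left
    $data = substr($data, 0, $stop); // Stripping all data from after and including $end
    return $data; // Returning the scraped data
}
?>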
The comments in the function should explain it pretty clearly, but just to clarify further:
1. We define the scraping function as scrape_between(), which takes the parameters $data (string, the source you want to scrape from), $start (string, the point at which to start scraping) and $end (string, the point at which to stop scraping).
2. stristr() is used to strip all data from before the $start position.
3. substr() is used to strip the $start from the beginning of the data. The $data variable now holds the data we want scraped, along with the trailing data from the input string.
4. strpos() is used to get the position of the $end of the data we want scraped, then substr() is used to leave us with just what we wanted scraped in the $data variable.
5. The data we wanted scraped, in $data, is returned from the function.
In a later part we’re going to look at using Regular Expressions (Regex) for finding strings to scrape that match a certain structure. But, for now, this small function is more than enough.
MODIFYING THE CURL FUNCTION
Gradually, as this series progresses, I’m going to introduce more and more of cURL’s options and features. Here, we’ve made a few small modifications to our function:
<?php
function curl($url) {
    $options = array(
        CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );
    $ch = curl_init(); // Initialising cURL
    curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
?>
If you look at the function above, it may seem rather different from the one we created in Part 1; however, it’s essentially the same, just with a few minor tweaks.
The first thing to note is that, rather than setting the options up one-by-one using curl_setopt(), we’ve created an array called $options to store them all. The array key stores the name of the cURL option and the array value stores its setting. This array is then passed to cURL using curl_setopt_array().
Aside from that and the extra settings introduced, this function works just the same as before.
The extra cURL settings that have been added are CURLOPT_RETURNTRANSFER, CURLOPT_FOLLOWLOCATION, CURLOPT_AUTOREFERER, CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT, CURLOPT_MAXREDIRS and CURLOPT_USERAGENT. Each is explained in the comments of the function above.
PUTTING IT ALL TOGETHER
With both of those functions in our PHP script, we can use them like so:
", ""); // Scraping downloaded dara in $scraped_page for content betweenand tags echo $scraped_data; // Echoing $scraped data, should show "The Internet Movie Database (IMDb)" ?>
As you can see, this small scraper visits the IMDb website, downloads the page, scrapes the page title from between the <title> tags, then echoes the result.
SCRAPING MULTIPLE DATA POINTS FROM A WEB PAGE
Visiting a web page and scraping one piece of data is hardly impressive, let alone worth building a script for. I mean, you could just as easily open your web browser and copy/paste it for yourself.
So, we’ll expand on this a bit and scrape multiple data points from a web page.
For this we’ll still be using IMDb as our target site; however, this time we’re going to scrape the search results page for the list of URLs returned by a specific search query.
First up, we need to find out how the search form works. Lucky for us the search query is shown in the URL on the search results page:
http://www.imdb.com/search/title?title=goodfellas
In the URL above, goodfellas (after the equals sign) is the keyword being searched for, and title (before the equals sign) is the attribute being searched within. For our purposes, searching for the name of a film here is pretty pointless; it’s only going to return a single result that we’d actually want. So, instead, let’s try searching by genre:
http://www.imdb.com/search/title?genres=action
Note: These different attributes can be found by going to the http://www.imdb.com/search/title page and performing a search, then looking at the URL of the search results page.
Now that we have this, we can feed the URL (http://www.imdb.com/search/title?genres=action) into our script and get back a page with the list of results we want to scrape.
Now we need to break up this page into separate sections for each result, then iterate over the sections and scrape the URL of each result.
", "