How to write a scraper in PHP and cURL to download a webpage

Writing our first scraper in PHP and cURL to download a webpage:

// Defining the basic cURL function
function curl($url) {
    $ch = curl_init(); // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}

This function is then used as such:

$scraped_website = curl("http://www.example.com");

Ok, so now we know how to scrape the contents of a webpage. That’s the end of Part 1.

I know, it might not seem like a great deal has been accomplished here, but this basic function forms the basis of what we will be building on over the coming posts.

 

THE SCRAPING FUNCTION

In order to extract the required data from the complete page we’ve downloaded, we need to create a small function that will scrape data from between two strings, such as tags.

function scrape_between($data, $start, $end){
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}

 

The comments in the function should explain it pretty clearly, but just to clarify further:

1. We define the scraping function as scrape_between(), which takes the parameters $data (string, the source you want to scrape from), $start (string, at which point you wish to scrape from), $end (string, at which point you wish to finish scraping).

2. stristr() is used to strip all data from before the $start position.

3. substr() is used to strip the $start from the beginning of the data. The $data variable now holds the data we want scraped, along with the trailing data from the input string.

4. stripos() is used to get the position of the $end of the data we want scraped, then substr() is used to leave us with just what we wanted scraped in the $data variable.

5. The data we wanted scraped, in $data, is returned from the function.

In a later part we’re going to look at using Regular Expressions (Regex) for finding strings to scrape that match a certain structure. But, for now, this small function is more than enough.
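As a quick sanity check, the function can be tried on a hard-coded string before pointing it at a live page. This is only a minimal sketch: the $html sample below is made up for illustration and stands in for a downloaded page.

```php
<?php
// Same scrape_between() as defined above.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);        // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end);          // Position of $end in what is left
    return substr($data, 0, $stop);        // Keeping only what sits before $end
}

// Made-up sample markup, standing in for a downloaded page.
$html = "<html><head><TITLE>Example Domain</TITLE></head><body></body></html>";

// stristr() and stripos() are case-insensitive, so lowercase markers still match.
$title = scrape_between($html, "<title>", "</title>");
echo $title . "\n";
```

Note that only the first occurrence of $start is used, so markers need to be specific enough to land on the right part of the page.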

MODIFYING THE CURL FUNCTION

Gradually, as this series progresses, I’m going to introduce more and more of cURL’s options and features. Here, we’ve made a few small modifications to our function from Part 1.

function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the request times out
        CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );
    $ch = curl_init(); // Initialising cURL
    curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}

 

If you look at the function above, it may seem rather different to the one we created in Part 1. However, it’s essentially the same, just with a few minor tweaks.

The first thing to note is that, rather than setting the options up one-by-one using curl_setopt(), we’ve created an array called $options to store them all. The array key stores the name of the cURL option and the array value stores its setting. This array is then passed to cURL using curl_setopt_array().

Aside from that and the extra settings introduced, this function is exactly the same. So not really the same, but, yeah…

The extra cURL settings that have been added are CURLOPT_FOLLOWLOCATION, CURLOPT_AUTOREFERER, CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT, CURLOPT_MAXREDIRS and CURLOPT_USERAGENT (CURLOPT_RETURNTRANSFER was already set in Part 1). They are explained in the comments of the function above.
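One practical consequence of keeping the options in an array is that defaults can be overridden per request. The sketch below is an illustration rather than part of the original function: the helper name curl_options() and its $overrides parameter are assumptions. It uses array_replace() rather than array_merge(), because cURL option constants are integers and array_merge() renumbers integer keys.

```php
<?php
// Hypothetical helper: builds the option array for one request.
function curl_options($url, array $overrides = array()) {
    $defaults = array(
        CURLOPT_RETURNTRANSFER => TRUE,
        CURLOPT_FOLLOWLOCATION => TRUE,
        CURLOPT_AUTOREFERER    => TRUE,
        CURLOPT_CONNECTTIMEOUT => 120,
        CURLOPT_TIMEOUT        => 120,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_URL            => $url,
    );
    // array_replace() keeps the integer keys intact; values in $overrides win.
    return array_replace($defaults, $overrides);
}

// A shorter timeout for one particular request.
$options = curl_options("http://www.example.com", array(CURLOPT_TIMEOUT => 30));

// The result can be handed to cURL exactly as before:
// curl_setopt_array($ch, $options);
```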

PUTTING IT ALL TOGETHER

We place both of those functions in our PHP script and we can use them like so:

$scraped_page = curl("http://www.imdb.com"); // Downloading IMDB home page to variable $scraped_page
$scraped_data = scrape_between($scraped_page, "<title>", "</title>"); // Scraping downloaded data in $scraped_page for content between <title> and </title> tags
echo $scraped_data; // Echoing $scraped_data, should show "The Internet Movie Database (IMDb)"

As you can see, this small scraper visits the IMDb website, downloads the page and scrapes the page title from between the ‘title’ tags, then echoes the result.

SCRAPING MULTIPLE DATA POINTS FROM A WEB PAGE

Visiting a web page and scraping one piece of data is hardly impressive, let alone worth building a script for. I mean, you could just as easily open your web browser and copy/paste it for yourself.

So, we’ll expand on this a bit and scrape multiple data points from a web page.

For this we’ll still be using IMDb as our target site; however, we’re going to try scraping the search results page for a list of URLs returned for a specific search query.

First up, we need to find out how the search form works. Lucky for us the search query is shown in the URL on the search results page:

http://www.imdb.com/search/title?title=goodfellas

Here, goodfellas is the keyword being searched for.

The title before the equals sign is the attribute being searched within. For our purposes, searching for the name of a film here is pretty pointless: it’s only going to return a single value that we’d actually want. So, instead, let’s try searching by genre:

http://www.imdb.com/search/title?genres=action

Note: These different attributes can be found by going to the http://www.imdb.com/search/title page and performing a search, then looking at the URL of the search results page.

Now that we have this, we can feed the URL (http://www.imdb.com/search/title?genres=action) into our script, which returns a page with the list of results we want to scrape.
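Since the search is driven entirely by the query string, the URL can also be built from a parameter array rather than typed by hand. A minimal sketch, where the function name imdb_search_url() is made up for illustration; http_build_query() takes care of URL-encoding the values.

```php
<?php
// Hypothetical helper: builds a search URL from attribute => keyword pairs.
function imdb_search_url(array $params) {
    return "http://www.imdb.com/search/title?" . http_build_query($params);
}

echo imdb_search_url(array("title" => "goodfellas")) . "\n";
echo imdb_search_url(array("genres" => "action")) . "\n";
```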

Now we need to break up this page into separate sections for each result, then iterate over the sections and scrape the URL of each result.

$url = "http://www.imdb.com/search/title?genres=action"; // Assigning the URL we want to scrape to the variable $url
$results_page = curl($url); // Downloading the results page using our curl() function
// $section_start and $section_end are placeholders: set them to the markup that opens and closes the results section of the page
$results_page = scrape_between($results_page, $section_start, $section_end); // Scraping out only the middle section of the results page that contains our results
$separate_results = explode("<td class=\"image\">", $results_page); // Exploding the results into separate parts into an array
$results_urls = array(); // Array to hold the URLs we scrape
// For each separate result, scrape the URL
foreach ($separate_results as $separate_result) {
    if ($separate_result != "") {
        $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title="); // Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
    }
}
print_r($results_urls); // Printing out our array of URLs we've just scraped

 

Now with an explanation of what’s happening here, if it’s not already clear.

1. Assigning the search results page URL we want to scrape to the $url variable.

2. Downloading the results page using our curl() function.

3. Here we scrape out just the section of results we need, stripping away the header, sidebar, etc.

4. We need to identify each search result by a common string that can be used to explode the results. This string, that every result has, is the td class=”image”. We use this to explode the results into the array $separate_results.

5. For each separate result, if it’s not empty, we scrape the URL data from between the start point of href=” and end point of ” title= and add it to our $results_urls array. In the process, because IMDb uses relative URLs rather than full ones, we need to prepend http://www.imdb.com to our result to give us a full URL which can be used later.

6. Right at the end, we print out our array of URLs, just to check that the script worked properly.
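The explode-then-scrape steps above can be tried offline first. In this sketch the $results_page string is made-up markup that merely imitates two search results; the real page will differ, so treat the markers as assumptions to adjust.

```php
<?php
// Same scrape_between() as defined earlier.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    return substr($data, 0, $stop);
}

// Made-up markup imitating two search results.
$results_page = '<td class="image"><a href="/title/tt0000001/" title="First"><img></a></td>'
              . '<td class="image"><a href="/title/tt0000002/" title="Second"><img></a></td>';

$results_urls = array();
foreach (explode('<td class="image">', $results_page) as $separate_result) {
    // explode() yields an empty first element when the string starts with the delimiter
    if ($separate_result != "") {
        $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, 'href="', '" title=');
    }
}
print_r($results_urls);
```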

That’s pretty cool, no? Well, it would be, but so far all we have is a short list of URLs. So, up next time we’re going to cover traversing the pages of a website to scrape data from multiple pages and organise the data in a logical structure.

If we take our script from last time and include our scrape_between() and curl() functions, we need to make the following changes to the script. Don’t worry, I’ll talk you through them after.

$continue = TRUE; // Assigning a boolean value of TRUE to the $continue variable
$url = "http://www.imdb.com/search/title?genres=action"; // Assigning the URL we want to scrape to the variable $url
$results_urls = array(); // Array to hold every URL we scrape
// $section_start/$section_end and $paging_start/$paging_end are placeholders: set them to the markup
// that wraps the results section and the pagination links on the page being scraped
// While $continue is TRUE, i.e. there are more search results pages
while ($continue == TRUE) {
    $results_page = curl($url); // Downloading the results page using our curl() function
    $results_page = scrape_between($results_page, $section_start, $section_end); // Scraping out only the middle section of the results page that contains our results
    $separate_results = explode("<td class=\"image\">", $results_page); // Exploding the results into separate parts into an array
    // For each separate result, scrape the URL
    foreach ($separate_results as $separate_result) {
        if ($separate_result != "") {
            $results_urls[] = "http://www.imdb.com" . scrape_between($separate_result, "href=\"", "\" title="); // Scraping the page ID number and appending to the IMDb URL - Adding this URL to our URL array
        }
    }
    // Searching for a 'Next' link. If it exists scrape the url and set it as $url for the next loop of the scraper
    if (strpos($results_page, "Next »") !== false) { // Comparing with false so a match at position 0 still counts
        $continue = TRUE;
        $url = scrape_between($results_page, $paging_start, $paging_end);
        if (strpos($url, "Prev") !== false) {
            $url = scrape_between($url, "Prev", ">Next");
        }
        $url = "http://www.imdb.com" . scrape_between($url, "href=\"", "\"");
    } else {
        $continue = FALSE; // Setting $continue to FALSE if there's no 'Next' link
    }
    sleep(rand(3,5)); // Sleep for 3 to 5 seconds. Useful if not using proxies. We don't want to get into trouble.
}

First up we retrieve the initial results page. Then we scrape all of the results and add them to the array $results_urls. Then we check to see if there is a “Next” link to another page of results, if there is then we scrape that and loop through the script to repeat the scraping of results from the next page. The loop iterates and continues to visit the next page, scraping the results, until there are no more pages of results.
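The loop logic described above can be isolated and exercised without touching the network. The following is a sketch under stated assumptions: find_next_url() and crawl_pages() are hypothetical helpers, the "Next" link is assumed to look like `<a href="/path">Next</a>`, and the fake two-page site exists only for the demonstration. The already-visited check and the $max_pages cap are additions to guard against endless loops.

```php
<?php
// Same scrape_between() as earlier in this tutorial.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    return substr($data, 0, $stop);
}

// Hypothetical helper: returns the full URL behind the "Next" link, or null if there is none.
function find_next_url($html, $base_url) {
    $next_pos = strpos($html, ">Next");
    if ($next_pos === false) {
        return null; // No "Next" link: this is the last page
    }
    $before = substr($html, 0, $next_pos + 1); // Everything up to and including the link's ">"
    $href_pos = strrpos($before, 'href="');    // The last href before "Next" belongs to that link
    if ($href_pos === false) {
        return null;
    }
    return $base_url . scrape_between(substr($before, $href_pos), 'href="', '"');
}

// Follows "Next" links from $start_url. $fetch is any callable returning a page body,
// for example the curl() function above.
function crawl_pages($start_url, $fetch, $base_url, $max_pages = 50) {
    $pages = array();
    $url = $start_url;
    while ($url !== null && !isset($pages[$url]) && count($pages) < $max_pages) {
        $html = $fetch($url);
        if ($html === false) {
            break; // Download failed: stop rather than loop on bad data
        }
        $pages[$url] = $html;
        $url = find_next_url($html, $base_url);
    }
    return $pages;
}

// Offline demonstration with a fake two-page site instead of real requests.
$fake_site = array(
    "http://example.test/list"        => '<a href="/list?page=2">Next</a>',
    "http://example.test/list?page=2" => '<a href="/list">Prev</a>',
);
$pages = crawl_pages("http://example.test/list", function ($url) use ($fake_site) {
    return $fake_site[$url];
}, "http://example.test");
echo count($pages) . " pages visited\n";
```

Against a live site, the callable passed as $fetch would wrap curl() and include the same sleep(rand(3,5)) pause used in the script above.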

Now we have an array with all of the result URLs, which we can loop over with foreach() to visit each URL and scrape the results. I’ll leave that to you; with what we’ve covered so far it should be easy to figure out.

I’ll get you started:

foreach($results_urls as $result_url) {
    // Visit $result_url (Reference Part 1)
    // Scrape data from page (Reference Part 1)
    // Add to array or other suitable data structure (Reference Part 2)
}
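One possible shape for that loop, offered as a sketch and not as the series' official solution: scrape_titles() is a hypothetical function that stores one record per page in a nested array, and the canned pages below stand in for real downloads.

```php
<?php
// Same scrape_between() as earlier in this tutorial.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    return substr($data, 0, $stop);
}

// Visits each URL and collects one record per page. $fetch is any callable that
// returns the page body, such as the curl() function from earlier.
function scrape_titles(array $results_urls, $fetch) {
    $records = array();
    foreach ($results_urls as $result_url) {
        $page = $fetch($result_url);
        if ($page === false) {
            continue; // Skip pages that failed to download
        }
        $records[] = array(
            "url"   => $result_url,
            "title" => trim(scrape_between($page, "<title>", "</title>")),
        );
    }
    return $records;
}

// Offline demonstration with canned pages instead of real requests.
$canned = array(
    "http://example.test/a" => "<html><title> Film A </title></html>",
    "http://example.test/b" => "<html><title>Film B</title></html>",
);
$records = scrape_titles(array_keys($canned), function ($url) use ($canned) {
    return $canned[$url];
});
print_r($records);
```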

In the next post in the series I’ll post up the code you should have got and then we’ll cover downloading images and other files.

 

Recently I came across something that is either becoming more common, or I am more frequently encountering due to the jobs I take, and that is JavaScript client-side encryption of passwords before submission of a login form. Now since this usually uses JavaScript it can seem a bit tricky the first time you encounter it if you’re not used to working with JS.

This kind of thing is usually easily identifiable when you see the submit button of a form calling another function, like so:

<input type="submit" onclick="validerLogin();"/>

After identifying the .js file containing this site’s JavaScript, the function is found to be:

function validerLogin() {
    var doc=document.loginFrm;
    if (doc.usager.value.length==0 || doc.usager.value =='Utilisateur'){
        alert("Vous devez entrer votre nom d'usager.");
        doc.usager.focus();
        return false;
    } else if (doc.motdepasse.value.length == 0
            || doc.motdepasse.value == 'Mot de passe') {
        alert("Vous devez entrer votre mot de passe.");
        $('#motdepasseLbl').focus();
        return false;
    } else {
        md5Pass = $.md5(doc.motdepasse.value);
        doc.pwd.value = $.md5(md5Pass + doc.salt.value);
        doc.motdepasse.value = '';
        doc.submit();
    }
}

The website and its source code being entirely in French isn’t exactly helpful, but from a cursory glance it’s clear that, after ensuring a username and password have been entered, what is happening is:

  1. The value from the login form’s field named motdepasse is hashed with MD5 (using what appears to be the jQuery MD5 Plugin by Sebastian Tschan) and stored in the variable md5Pass.

    md5Pass = $.md5(doc.motdepasse.value);

     

  2. The hash stored in the variable md5Pass is put back through the MD5 function, this time with a salt appended (more on this in a moment), and the result is injected into the value attribute of the hidden pwd input field on the site’s login form.

    doc.pwd.value = $.md5(md5Pass + doc.salt.value);

     

  3. The value of motdepasse on the login form is being cleared.

     

    doc.motdepasse.value = '';

     

  4. The login form is being submitted.

     

    doc.submit();

     

SOLVING THE PROBLEM OF CLIENT-SIDE ENCRYPTION WITH PHP & CURL

Now, in our web bot we need to recreate all of these steps before submitting the form using cURL. So, step by step, as above:

  1. First up, we need to hash our password using MD5. Luckily for us, PHP has an md5() function.

    $md5_pass = md5($password); // md5ing the password

     

  2. Next up, we need to hash our hashed password again, this time with a salt.

    It turns out that the salt used for the second pass is a unique five-digit number that is generated and stored in the value of a hidden form input field named salt each time the login page is refreshed.

    <input type="hidden" name="salt" value="46545" />

    So, before we submit the form we need to retrieve the login page and scrape the salt string.

    $salt_page = curlRequest($login_url);   // Using cURL to request the login page with the salt string on it.
    $salt = scrapeBetween($salt_page, 'salt" value="', '" />');  // Scraping the salt string.
    // You can find the curlRequest() and scrapeBetween() functions in my tutorial series on Web Scraping with PHP & cURL.

    With this salt string, we can then proceed to re-hash our password.

    $md5_pass = md5($password); // md5ing the password.
    $salt_pass = md5($md5_pass.$salt);  // Concatenating and hashing the md5'd pass and salt.

     

  3. On submission of the form using cURL we need to ensure that we pass an empty string to the motdepasse field and our MD5/salted password to the pwd field, using an array like:

    $md5_pass = md5($password); // md5ing the password.
    $salt_pass = md5($md5_pass.$salt);  // Concatenating and hashing the md5'd pass and salt.
    $post_array = array ('usager' => $user, 'motdepasse' => '', 'salt' => $salt, 'pwd' => $salt_pass);   // Building post array.

     

  4. Then it’s as simple as making the cURL POST request!

    $md5_pass = md5($password); // md5ing the password
    $salt_pass = md5($md5_pass.$salt);  // Concatenating and hashing the md5'd pass and salt.
    $post_array = array ('usager' => $user, 'motdepasse' => '', 'salt' => $salt, 'pwd' => $salt_pass);   // Building post array.
    $site_login = curlPost($login_url, $post_array);    // Logging in using curlPost function. You can find this in my web scraping tutorial series, though it is just a standard POST form submission using cURL.
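Pulled together, the four steps might look like the sketch below. Several things here are assumptions and not the author's code: buildLoginFields() is a hypothetical helper, the body of curlPost() is a guess at a standard form POST, and the $cookie_file parameter is added on the assumption that the site ties the salt to a session, in which case the salt request and the login POST would need to share cookies. Only the scraping and hashing steps are run here, against the hidden field shown earlier.

```php
<?php
// Same logic as scrapeBetween() in the text, reproduced so the sketch is self-contained.
function scrapeBetween($data, $start, $end) {
    $data = stristr($data, $start);
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    return substr($data, 0, $stop);
}

// Hypothetical helper: builds the POST fields the way validerLogin() leaves the form.
function buildLoginFields($user, $password, $salt) {
    $md5_pass  = md5($password);         // First pass: plain MD5 of the password
    $salt_pass = md5($md5_pass . $salt); // Second pass: MD5 of the first hash with the salt appended
    return array('usager' => $user, 'motdepasse' => '', 'salt' => $salt, 'pwd' => $salt_pass);
}

// Assumed implementation of curlPost(); the cookie file keeps the session between requests.
function curlPost($url, array $fields, $cookie_file) {
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => TRUE,
        CURLOPT_FOLLOWLOCATION => TRUE,
        CURLOPT_POST           => TRUE,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // Sent as application/x-www-form-urlencoded
        CURLOPT_COOKIEJAR      => $cookie_file, // Where cookies are written when the handle closes
        CURLOPT_COOKIEFILE     => $cookie_file, // Where cookies are read from
    ));
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

// Offline check of the scraping and hashing steps.
$salt_page = '<form name="loginFrm"><input type="hidden" name="salt" value="46545" /></form>';
$salt      = scrapeBetween($salt_page, 'salt" value="', '" />');
$fields    = buildLoginFields("someuser", "secret", $salt);
print_r($fields);
```

If the session assumption holds, the request that fetches the salt page would need to use the same cookie file as the POST, otherwise the server may see a salt it never issued to this client.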


     
