Writing our first scraper in PHP and cURL to download a webpage:
// Defining the basic cURL function
function curl($url) {
    $ch = curl_init(); // Initialising cURL
    curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
This function is then used as such:
$scraped_website = curl("http://www.example.com");
Ok, so now we know how to scrape the contents of a webpage. That’s the end of Part 1.
I know it might not seem like a great deal has been accomplished here, but this basic function forms the basis of what we will be building on over the coming posts.
In order to extract the required data from the complete page we’ve downloaded, we need to create a small function that will scrape data from between two strings, such as tags.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start); // Stripping all data from before $start
    $data = substr($data, strlen($start)); // Stripping $start
    $stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
    $data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
    return $data; // Returning the scraped data from the function
}
The comments in the function should explain it pretty clearly, but just to clarify further:
1. We define the scraping function as scrape_between(), which takes the parameters $data (string, the source you want to scrape from), $start (string, at which point you wish to scrape from), $end (string, at which point you wish to finish scraping).
2. stristr() is used to strip all data from before the $start position.
3. substr() is used to strip the $start from the beginning of the data. The $data variable now holds the data we want scraped, along with the trailing data from the input string.
4. stripos() is used to get the position of the $end of the data we want scraped, then substr() is used to leave us with just what we wanted scraped in the $data variable.
5. The data we wanted scraped, in $data, is returned from the function.
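To make the steps above a bit more concrete, here is a minimal sketch that runs scrape_between() on a small hard-coded string instead of a downloaded page. The sample HTML is made up purely for illustration. The function is repeated so the snippet runs on its own.

```php
<?php
// Same function as above, repeated so this snippet is self-contained.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);        // Everything from the first $start onwards
    $data = substr($data, strlen($start)); // Dropping $start itself
    $stop = stripos($data, $end);          // Position of the first $end after $start
    return substr($data, 0, $stop);        // Keeping only what sits before $end
}

// Hypothetical sample input, not taken from any real site.
$sample = '<html><head><TITLE>My Test Page</TITLE></head><body>Hello</body></html>';

// stristr() and stripos() are case-insensitive, so lowercase markers still match.
$title = scrape_between($sample, '<title>', '</title>');

echo $title, PHP_EOL;
```

Because both lookups are case-insensitive, the function copes with markup that mixes upper and lower case tags. If $start is not found at all, the function ends up returning an empty result rather than raising an error.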
In a later part we’re going to look at using Regular Expressions (Regex) for finding strings to scrape that match a certain structure. But, for now, this small function is more than enough.
Gradually, as this series progresses, I’m going to introduce more and more of cURL’s options and features. Here, we’ve made a few small modifications to our function from Part 1.
function curl($url) {
    // Assigning cURL options to an array
    $options = Array(
        CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data
        CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers
        CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer when following 'location' HTTP headers
        CURLOPT_CONNECTTIMEOUT => 120, // Setting the amount of time (in seconds) before the connection attempt times out
        CURLOPT_TIMEOUT => 120, // Setting the maximum amount of time (in seconds) for cURL to execute queries
        CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow
        CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1a2pre) Gecko/2008073000 Shredder/3.0a2pre ThunderBrowse/3.2.1.8", // Setting the useragent
        CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function
    );

    $ch = curl_init(); // Initialising cURL
    curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options
    $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
    curl_close($ch); // Closing cURL
    return $data; // Returning the data from the function
}
If you look at the function above, it may seem rather different to the one we created in Part 1. However, it’s essentially the same, just with a few minor tweaks.
The first thing to note is that, rather than setting the options up one-by-one using curl_setopt(), we’ve created an array called $options to store them all. The array key stores the name of the cURL option and the array value stores its setting. This array is then passed to cURL using curl_setopt_array().
Aside from that and the extra settings introduced, this function is exactly the same. So not really the same, but, yeah…
The extra cURL settings that have been added are CURLOPT_FOLLOWLOCATION, CURLOPT_AUTOREFERER, CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT, CURLOPT_MAXREDIRS and CURLOPT_USERAGENT (CURLOPT_RETURNTRANSFER was already set in Part 1). They are explained in the comments of the function above.
We place both of those functions in our PHP script and we can use them like so:
$scraped_page = curl("http://www.imdb.com"); // Downloading IMDB home page to variable $scraped_page
$scraped_data = scrape_between($scraped_page, "<title>", "</title>"); // Scraping downloaded data in $scraped_page for content between <title> and </title> tags
echo $scraped_data; // Echoing $scraped_data, should show "The Internet Movie Database (IMDb)"
As you can see, this small scraper visits the IMDb website, downloads the page and scrapes the page title from between the ‘title’ tags, then echoes the result.
Visiting a web page and scraping one piece of data is hardly impressive, let alone worth building a script for. I mean, you could just as easily open your web browser and copy/paste it for yourself.
So, we’ll expand on this a bit and scrape multiple data points from a web page.
For this we’ll still be using IMDb as our target site. However, we’re going to try scraping the search results page for a list of URLs returned for a specific search query.
First up, we need to find out how the search form works. Lucky for us the search query is shown in the URL on the search results page:
http://www.imdb.com/search/title?title=goodfellas
In this URL, goodfellas is the keyword being searched for and title is the attribute being searched within.
For our purposes, searching for the name of a film here is pretty pointless, as it’s only going to return a single value that we’d actually want. So, instead, let’s try searching by genre:
http://www.imdb.com/search/title?genres=action
Note: These different attributes can be found by going to the http://www.imdb.com/search/title page and performing a search, then looking at the URL of the search results page.
Now that we have this, we can feed the URL (http://www.imdb.com/search/title?genres=action) into our script, which returns a page with the list of results we want to scrape.
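If you’d rather not hand-write the query string, the same URL can be put together in code. This is only a sketch: build_search_url() is a made-up helper name, not something from the original script, and it assumes the search page takes a single attribute=value pair as in the URLs above.

```php
<?php
// Hypothetical helper (not part of the original script): builds a search URL
// from an attribute name and the value to search for.
function build_search_url($attribute, $value) {
    $base = "http://www.imdb.com/search/title";
    // http_build_query() URL-encodes the value, so spaces and symbols are handled for us
    return $base . "?" . http_build_query(array($attribute => $value));
}

echo build_search_url("title", "goodfellas"), PHP_EOL;
echo build_search_url("genres", "action"), PHP_EOL;
```

Letting http_build_query() do the encoding means a multi-word keyword doesn’t break the URL.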
Now we need to break up this page into separate sections for each result, then iterate over the sections and scrape the URL of each result.
$url = "http://www.imdb.com/search/title?genres=action"; // Assigning the URL we want to scrape to the variable $url
$results_page = curl($url); // Downloading the results page using our curl() function
$results_page = scrape_between($results_page, "
Now with an explanation of what’s happening here, if it’s not already clear.
1. Assigning the search results page URL we want to scrape to the $url variable.
2. Downloading the results page using our curl() function.
3. Here we scrape out just the section of results we need, stripping away the header, sidebar, etc.
4. We need to identify each search result by a common string that can be used to explode the results. This string, which every result has, is td class="image". We use this to explode the results into the array $separate_results.
5. For each separate result, if it’s not empty, we scrape the URL data from between the start point of href=" and the end point of " title= and add it to our $results_urls array. In the process, because IMDb uses relative path URLs instead of full path, we need to prepend http://www.imdb.com to our result to give us a full URL which can be used later.
6. Right at the end, we print out our array of URLs, just to check that the script worked properly.
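Steps 4 to 6 above can be sketched offline like this. The HTML in $results_page is invented for illustration and only mimics the td class="image" pattern described above; it is not IMDb’s actual markup. One small difference from the description: the sketch checks that a path was actually scraped, so chunks without a link (such as the text before the first result) don’t end up in the array.

```php
<?php
// scrape_between() as defined earlier, repeated so the snippet runs on its own.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    return substr($data, 0, $stop);
}

// Hypothetical results section (made-up markup and IDs).
$results_page = '<table>'
    . '<tr><td class="image"><a href="/title/tt0000001/" title="First Film"><img src="a.jpg"></a></td></tr>'
    . '<tr><td class="image"><a href="/title/tt0000002/" title="Second Film"><img src="b.jpg"></a></td></tr>'
    . '</table>';

// Step 4: splitting the page into one chunk per result
$separate_results = explode('<td class="image">', $results_page);

// Step 5: scraping the relative URL out of each chunk and making it absolute
$results_urls = array();
foreach ($separate_results as $separate_result) {
    $path = scrape_between($separate_result, 'href="', '" title=');
    if ($path != "") { // Skipping chunks that contain no link
        $results_urls[] = "http://www.imdb.com" . $path;
    }
}

// Step 6: printing the array to check the result
print_r($results_urls);
```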
That’s pretty cool, no? Well, it would be, but so far all we have is a short list of URLs. So, up next time we’re going to cover traversing the pages of a website to scrape data from multiple pages and organise the data in a logical structure.
If we take our script from last time and include our scrape_between() and curl() functions, we need to make the following changes to the script. Don’t worry, I’ll talk you through them after.
$continue = TRUE; // Assigning a boolean value of TRUE to the $continue variable
$url = "http://www.imdb.com/search/title?genres=action"; // Assigning the URL we want to scrape to the variable $url

// While $continue is TRUE, i.e. there are more search results pages
while ($continue == TRUE) {
    $results_page = curl($url); // Downloading the results page using our curl() function
    $results_page = scrape_between($results_page, "
First up we retrieve the initial results page. Then we scrape all of the results and add them to the array $results_urls. Then we check to see if there is a “Next” link to another page of results, if there is then we scrape that and loop through the script to repeat the scraping of results from the next page. The loop iterates and continues to visit the next page, scraping the results, until there are no more pages of results.
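The loop described above can be tried without touching a live site. In this sketch, fake_curl() is a hypothetical stand-in for curl() that returns canned HTML, and the pagination markup, the example.com URLs and the "Next" link format are all assumptions made for illustration. On a real site you would swap in curl() and the markers you find in the page source.

```php
<?php
// scrape_between() as defined earlier, repeated so the snippet runs on its own.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    return substr($data, 0, $stop);
}

// Hypothetical stand-in for curl(): returns made-up HTML for two result pages.
function fake_curl($url) {
    $pages = array(
        "http://www.example.com/results?page=1" =>
            '<td class="image"><a href="/title/tt0000001/" title="A"></a></td>'
            . '<span class="pagination"><a href="/results?page=2">Next</a></span>',
        "http://www.example.com/results?page=2" =>
            '<td class="image"><a href="/title/tt0000002/" title="B"></a></td>'
            . '<span class="pagination"></span>',
    );
    return isset($pages[$url]) ? $pages[$url] : "";
}

$base = "http://www.example.com";
$url = $base . "/results?page=1";
$results_urls = array();
$continue = TRUE;

// While $continue is TRUE, i.e. there are more search results pages
while ($continue == TRUE) {
    $results_page = fake_curl($url);

    // Scraping every result URL on the current page
    foreach (explode('<td class="image">', $results_page) as $separate_result) {
        $path = scrape_between($separate_result, 'href="', '" title=');
        if ($path != "") {
            $results_urls[] = $base . $path;
        }
    }

    // Checking the (assumed) pagination block for a "Next" link
    $pagination = scrape_between($results_page, '<span class="pagination">', '</span>');
    if (stripos($pagination, 'Next') !== FALSE) {
        $url = $base . scrape_between($pagination, 'href="', '">Next'); // Moving on to the next page
    } else {
        $continue = FALSE; // No "Next" link, so this was the last page
    }
}

print_r($results_urls);
```

When pointing something like this at a real site, it is worth adding a cap on the number of pages visited so a markup change can’t leave the loop running forever.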
Now we have an array with all of the results URLs, which we can foreach() over to visit each URL and scrape the results. I’ll leave that to you; with what we’ve covered so far it should be easy to figure out.
I’ll get you started:
foreach ($results_urls as $result_url) {
    // Visit $result_url (Reference Part 1)
    // Scrape data from page (Reference Part 1)
    // Add to array or other suitable data structure (Reference Part 2)
}
In the next post in the series I’ll post up the code you should have got and then we’ll cover downloading images and other files.
Recently I came across something that is either becoming more common, or I am more frequently encountering due to the jobs I take, and that is JavaScript client-side encryption of passwords before submission of a login form. Now since this usually uses JavaScript it can seem a bit tricky the first time you encounter it if you’re not used to working with JS.
This kind of thing is usually easily identifiable when you see the submit button of a form calling another function, like so:
<input type="submit" onclick="validerLogin();" />
After identifying the .js file containing this site’s JavaScript, the function is found to be:
function validerLogin() {
    var doc = document.loginFrm;
    if (doc.usager.value.length == 0 || doc.usager.value == 'Utilisateur') {
        alert("Vous devez entrer votre nom d'usager.");
        doc.usager.focus();
        return false;
    } else if (doc.motdepasse.value.length == 0 || doc.motdepasse.value == 'Mot de passe') {
        alert("Vous devez entrer votre mot de passe.");
        $('#motdepasseLbl').focus();
        return false;
    } else {
        md5Pass = $.md5(doc.motdepasse.value);
        doc.pwd.value = $.md5(md5Pass + doc.salt.value);
        doc.motdepasse.value = '';
        doc.submit();
    }
}
The website, and its source code, being entirely in French isn’t exactly helpful, but from a cursory glance it’s clear that, after ensuring a password has been entered, what is happening is:
1. md5Pass = $.md5(doc.motdepasse.value);
2. doc.pwd.value = $.md5(md5Pass + doc.salt.value);
3. doc.motdepasse.value = '';
4. doc.submit();
Now, in our web bot we need to recreate all of these steps before submitting the form using cURL. So, step by step, as above:
$md5_pass = md5($password); // md5ing the password
It turns out that the salt being used to hash the password for the second time is actually a unique 5-digit number that is generated and stored in the value of a hidden form input field named salt each time the login page is refreshed.
<input type="hidden" name="salt" value="46545" />
So, before we submit the form we need to retrieve the login page and scrape the salt string.
$salt_page = curlRequest($login_url); // Using cURL to request the login page with the salt string on it.
$salt = scrapeBetween($salt_page, 'salt" value="', '" />'); // Scraping the salt string.
// You can find the curlRequest() and scrapeBetween() functions in my tutorial series on Web Scraping with PHP & cURL.
With this salt string, we can then proceed to re-encrypt our password.
$md5_pass = md5($password); // md5ing the password.
$salt_pass = md5($md5_pass . $salt); // Concatenating and encrypting the md5'd pass and salt.

Next, we build the array of POST fields:

$md5_pass = md5($password); // md5ing the password.
$salt_pass = md5($md5_pass . $salt); // Concatenating and encrypting the md5'd pass and salt.
$post_array = array('usager' => $user, 'motdepasse' => '', 'salt' => $salt, 'pwd' => $salt_pass); // Building post array.

Finally, we submit the form:

$md5_pass = md5($password); // md5ing the password
$salt_pass = md5($md5_pass . $salt); // Concatenating and encrypting the md5'd pass and salt.
$post_array = array('usager' => $user, 'motdepasse' => '', 'salt' => $salt, 'pwd' => $salt_pass); // Building post array.
$site_login = curlPost($login_url, $post_array); // Logging in using curlPost function. You can find this in my web scraping tutorial series, though it is just a standard POST form submission using cURL.
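To check the hashing steps without hitting the real login page, here is a small offline sketch. It reuses scrape_between() from the tutorial series in place of scrapeBetween(), and the login page snippet, username and password are placeholders made up for illustration. It stops short of sending anything: the last line only shows the URL-encoded body that a cURL POST could send via CURLOPT_POSTFIELDS.

```php
<?php
// scrape_between() from the tutorial series, repeated so the snippet runs on its own.
function scrape_between($data, $start, $end) {
    $data = stristr($data, $start);
    $data = substr($data, strlen($start));
    $stop = stripos($data, $end);
    return substr($data, 0, $stop);
}

// Hypothetical login page fragment, reusing the hidden salt field shown above.
$salt_page = '<form name="loginFrm"><input type="hidden" name="salt" value="46545" /></form>';

// Placeholder credentials, for illustration only.
$user = "demo-user";
$password = "example-password";

$salt = scrape_between($salt_page, 'salt" value="', '" />'); // Scraping the salt string
$md5_pass = md5($password);                                   // First MD5 pass, as the site's JavaScript does
$salt_pass = md5($md5_pass . $salt);                          // Second MD5 pass over hash + salt

// Field names follow the JavaScript above (doc.usager, doc.motdepasse, doc.salt, doc.pwd).
$post_array = array('usager' => $user, 'motdepasse' => '', 'salt' => $salt, 'pwd' => $salt_pass);

echo $salt, PHP_EOL;
echo strlen($salt_pass), PHP_EOL;           // md5() returns a 32-character hex string
echo http_build_query($post_array), PHP_EOL; // The form body in URL-encoded form
```

Since the salt changes on every page load, the page you scrape the salt from and the login POST most likely need to belong to the same session, which in practice usually means carrying cookies between the two requests. That part is an assumption about how the site works and is not covered in this sketch.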