Web Scraping for newbies - PHP

Ed!

So I've seen many requests on here that involve scraping data from somewhere.

Here is a simple way to do it in PHP:

We will use PHP's built-in DOM classes, DOMDocument and DOMElement.


I noticed that there is no DOM function for getting elements by their class name, so I wrote the following function:
Code:
function getElementsByClassName($className, $document)
{
	$result = array();
	foreach($document->getElementsByTagName("*") as $element)
	{
		if($element->hasAttribute("class"))
		{
			//Split the class attribute on whitespace so only whole class names match ("title" must not match "subtitle")
			$classes = preg_split('/\s+/', trim($element->getAttribute("class")));
			if(in_array($className, $classes))
			{
				array_push($result, $element);
			}
		}
	}
	return $result;
}
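
If you would rather not walk every element yourself, an XPath query can do the same lookup. This is only a rough sketch: findByClass() is a made-up helper name, it assumes PHP 7.1 or newer, and it assumes simple class names without quotes in them.

```php
<?php
// Sketch: find elements by class name with DOMXPath instead of a manual loop.
// findByClass() is a hypothetical helper name, not part of PHP's DOM extension.
function findByClass(DOMXPath $xpath, string $className, ?DOMNode $context = null): array
{
    // Pad the class attribute with spaces so only whole class names match,
    // e.g. "title" matches class="title big" but not class="subtitle".
    $query = sprintf(
        ".//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $className
    );
    $nodes = $xpath->query($query, $context);
    return $nodes === false ? [] : iterator_to_array($nodes);
}

// Small inline sample so the example runs without touching any website.
$doc = new DOMDocument();
$doc->loadHTML('<div id="list"><p class="title big">A</p><p class="subtitle">B</p></div>');
$xpath = new DOMXPath($doc);
foreach (findByClass($xpath, "title") as $element) {
    echo $element->nodeValue, "\n";
}
```

Passing an element as the third argument limits the search to that element's descendants, which is the same idea as passing $threadList or $thread to the function above.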

I'm going to use this thread's parent page for this example.
We will list all the threads on the first page.

We need to create a DOMDocument and load some content:
Code:
$htmlDoc = new DOMDocument();
@$htmlDoc->loadHTML(file_get_contents("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development"));
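
The @ hides every warning, including a failed download. One possible alternative, sketched here, is to let libxml collect its complaints about sloppy markup quietly and to check for real failures yourself. parseHtml() is a made-up helper name, and the inline HTML is just a stand-in for the downloaded page.

```php
<?php
// Sketch: parse HTML without the @ operator, so real failures stay visible.
// parseHtml() is a hypothetical helper name used only for this example.
function parseHtml(string $html): DOMDocument
{
    if (trim($html) === '') {
        throw new InvalidArgumentException("No HTML to parse");
    }
    // Collect libxml's complaints about malformed markup instead of printing warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting
    if (!$loaded) {
        throw new RuntimeException("Could not parse the HTML");
    }
    return $doc;
}

// file_get_contents() returns false on failure, so check that before parsing:
// $html = file_get_contents($url);
// if ($html === false) { die("Download failed"); }
$doc = parseHtml('<div id="threadlist"><b>unclosed</div>');
echo $doc->getElementById("threadlist")->nodeName, "\n";
```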

Now we need to find the element containing the data.
You can right click anywhere in the thread list and choose "Inspect Element" to see it in your browser's developer tools.


Next, find the element that contains all the threads. Just go up through the parent elements until you reach one that wraps the whole list.

Take note of the id of that element. We will need it for scraping:
Code:
$threadList = $htmlDoc->getElementById("threadlist"); //MyBroadband forum thread list

Now we need to find all the thread elements inside the main thread list container:


Code:
$threads = getElementsByClassName("threadbit", $threadList);

We only want the titles, so we look for the element that contains the title in each thread.


So now we loop through the thread elements and extract the values of the titles:
Code:
foreach($threads as $thread)
{
	$title = getElementsByClassName("title", $thread)[0]; //Will likely be the first (and only) element returned
	echo htmlspecialchars(trim($title->nodeValue)), "<br/>"; //Escape scraped text before output. Here you can store data in a database for caching and display it from there on the web
}
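
On the caching idea: a database is one option, but for a small list a plain JSON file may be enough. The sketch below swaps the database for a file on purpose. cachedTitles(), the file path and the 300 second lifetime are all assumptions for illustration, and the closure returns placeholder data where the scraping loop above would go.

```php
<?php
// Sketch: cache scraped titles in a JSON file so the forum is not fetched on every page view.
// cachedTitles(), the cache path and the lifetime are assumptions for illustration.
function cachedTitles(string $cacheFile, int $maxAgeSeconds, callable $scrape): array
{
    clearstatcache(true, $cacheFile); // make sure we see the file's current timestamp
    // Reuse the cached copy while it is fresh enough.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAgeSeconds) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }
    // Otherwise scrape again and overwrite the cache.
    $titles = $scrape();
    file_put_contents($cacheFile, json_encode($titles), LOCK_EX);
    return $titles;
}

// Usage: the closure stands in for the scraping loop.
$cacheFile = sys_get_temp_dir() . "/thread_titles.json";
$titles = cachedTitles($cacheFile, 300, function () {
    return ["First thread", "Second thread"]; // placeholder data, not real forum output
});
foreach ($titles as $title) {
    echo htmlspecialchars($title), "<br/>\n";
}
```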

Tada! You've just scraped all the titles of the threads displayed on the forum page.

Here is all the code combined, which you can just put in a .php file and run:
Code:
<?php
//Note that websites with bot protection will require you to use cURL instead to simulate sessions etc.

function getElementsByClassName($className, $document)
{
	$result = array();
	foreach($document->getElementsByTagName("*") as $element)
	{
		if($element->hasAttribute("class"))
		{
			//Split the class attribute on whitespace so only whole class names match ("title" must not match "subtitle")
			$classes = preg_split('/\s+/', trim($element->getAttribute("class")));
			if(in_array($className, $classes))
			{
				array_push($result, $element);
			}
		}
	}
	return $result;
}

//Load html document
$htmlDoc = new DOMDocument();
@$htmlDoc->loadHTML(file_get_contents("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development"));

$threadList = $htmlDoc->getElementById("threadlist"); //MyBroadband forum thread list

$threads = getElementsByClassName("threadbit", $threadList);

foreach($threads as $thread)
{
	$title = getElementsByClassName("title", $thread)[0]; //Will likely be the first (and only) element returned
	echo htmlspecialchars(trim($title->nodeValue)), "<br/>"; //Escape scraped text before output. Here you can store data in a database for caching and display it from there on the web
}

?>
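
For sites where file_get_contents is not enough, the note about cURL could look roughly like the sketch below. It only shows the general shape: fetchPage(), the user agent string and the cookie file path are assumptions, and real bot protection may need a lot more than this.

```php
<?php
// Sketch: download a page with PHP's cURL extension instead of file_get_contents().
// fetchPage(), the user agent string and the cookie file path are assumptions.
function fetchPage(string $url, string $cookieFile): string
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_USERAGENT      => "Mozilla/5.0 (compatible; ExampleScraper/1.0)", // hypothetical
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // send saved cookies on later requests
        CURLOPT_TIMEOUT        => 15,          // seconds
    ]);
    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException("Download failed: " . curl_error($handle));
    }
    return $body;
}

// Usage (needs network access, so it is left commented out):
// $html = fetchPage("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development",
//                   sys_get_temp_dir() . "/scraper_cookies.txt");
// @$htmlDoc->loadHTML($html);
```

The returned string can go straight into loadHTML in place of the file_get_contents call.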
 
Many solutions to a problem. Very good suggestion by the way.
I can understand the need to build your own, or to rebuild it again as a challenge and improve on it.
For the newbie, it's probably just easier to link into an existing stable code base.
 
[)roi(];17488346 said:
I can understand the need to build your own, or to rebuild it again as a challenge and improve on it.
For the newbie, it's probably just easier to link into an existing stable code base.

Very true. I guess it comes down to the need, how complex the problem is, etc.
 