Web Scraping for newbies - PHP

Ed!

Active Member
Joined
Mar 1, 2016
Messages
30
So I've seen many requests on here that involve scraping data from somewhere.

Here is a simple way to do so in PHP:

We will use PHP's built-in DOM classes (DOMDocument and DOMElement).


I noticed that there is no built-in DOM function for getting elements by their class name, so I wrote the following function:
Code:
function getElementsByClassName($className, $document)
{
	//Match $className as a whole class name, so "title" does not also match "threadtitle"
	$pattern = '/(^|\s)'.preg_quote($className, '/').'(\s|$)/i';
	$result = array();
	foreach($document->getElementsByTagName("*") as $element)
	{
		if($element->hasAttribute("class") && preg_match($pattern, $element->getAttribute("class")))
		{
			array_push($result, $element);
		}
	}
	return $result;
}
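As an alternative sketch (not what the rest of this tutorial uses), the same lookup can be done with PHP's built-in DOMXPath class. The function name getElementsByClassNameXPath() and the sample HTML are made up for illustration, and it assumes the class name contains no quote characters. It only matches whole class names, so "title" will not match "threadtitle".

```php
<?php
// Hypothetical helper: find elements by exact class name using DOMXPath.
function getElementsByClassNameXPath($className, DOMNode $context)
{
	// A DOMDocument is its own owner; any other node knows its document.
	$doc = ($context instanceof DOMDocument) ? $context : $context->ownerDocument;
	$xpath = new DOMXPath($doc);
	// Pad the class attribute with spaces so only whole class names match.
	$query = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' " . $className . " ')]";
	return iterator_to_array($xpath->query($query, $context));
}

// Made-up sample markup, just to show the behaviour.
$doc = new DOMDocument();
$doc->loadHTML('<div><h3 class="threadtitle"><a class="title big">Hello</a></h3></div>');

$found = getElementsByClassNameXPath("title", $doc);
echo count($found), " match: ", $found[0]->nodeValue, "\n";
```

Only the link is returned here, because the h3's class is "threadtitle" and not "title".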

I'm going to use this thread's parent page for this example.
We will list all the threads on the first page.

We need to create a DOMDocument and load some content:
Code:
$htmlDoc = new DOMDocument();
//The @ hides the warnings libxml raises for real-world (often invalid) HTML
@$htmlDoc->loadHTML(file_get_contents("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development"));
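The @ also hides a failed download, which makes problems hard to track down. Here is a slightly more defensive sketch using libxml_use_internal_errors(); the helper name loadHtml() and the sample markup are my own assumptions, not part of the original code.

```php
<?php
// Hypothetical helper: parse an HTML string without using @ to hide warnings.
function loadHtml($html)
{
	if ($html === false || $html === "") {
		return null; // e.g. file_get_contents() failed or returned nothing
	}
	$previous = libxml_use_internal_errors(true); // collect parse errors quietly
	$doc = new DOMDocument();
	$ok = $doc->loadHTML($html);
	libxml_clear_errors();                        // discard the collected errors
	libxml_use_internal_errors($previous);        // restore the old setting
	return $ok ? $doc : null;
}

// Made-up, deliberately sloppy markup instead of a live page.
$doc = loadHtml('<ul id="threadlist"><li class="threadbit">First<li class="threadbit">Second</ul>');
echo $doc->getElementsByTagName("li")->length, "\n";

// With a real page you would pass in the downloaded string:
// $doc = loadHtml(file_get_contents($url));
```

A null return then tells you the download or the parse failed, instead of the script carrying on with an empty document.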

Now we need to find the element containing the data.
You can right-click anywhere in the thread list and choose Inspect Element:
tVMduHZ.png


Now go up through the parent elements until you find the one that contains all the threads:
4DZkkNI.png

Take note of the id of that element. We will need it in our scraping:
Code:
$threadList = $htmlDoc->getElementById("threadlist"); //MyBroadband forum thread list
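If the id is not on the page (for example after a forum redesign), getElementById() returns null and the later calls will fail. Here is a small guard, shown on made-up markup; the id "no-such-id" is hypothetical and only there to show the null case.

```php
<?php
// Made-up markup standing in for the downloaded forum page.
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML('<div id="threadlist"><p>threads go here</p></div>');

$threadList = $htmlDoc->getElementById("threadlist");
$missing = $htmlDoc->getElementById("no-such-id"); // hypothetical id that is not in the markup

// getElementById() returns null when nothing matches, so check before using the result.
if ($threadList === null) {
	exit("Could not find the thread list, the page layout may have changed.");
}
echo $threadList->nodeName, "\n";
var_dump($missing);
```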

Now we need to find all the thread children of the main thread list container:
z8d0gtf.png


Code:
$threads = getElementsByClassName("threadbit", $threadList);

We only want the titles, so we look for the element that contains the title in each thread:
5Yyy8ZU.png


So now we loop through the thread elements and extract the values of the titles:
Code:
foreach($threads as $thread)
{
	$titles = getElementsByClassName("title", $thread);
	if(empty($titles)) continue; //Skip anything without a title element
	$title = $titles[0]; //Will likely be the first (and only) element returned
	echo htmlspecialchars(trim($title->nodeValue)), "<br/>"; //Here you can store data in a database for caching and display it from there on the web
}
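The comment in the loop mentions caching. As a rough sketch of that idea, using a JSON file rather than a database, the titles could be collected into an array and only re-scraped when the cache is stale. The function name cachedTitles(), the file path and the 10 minute lifetime are all made up for illustration, and the $scrape callback stands in for the scraping code above.

```php
<?php
// Hypothetical helper: return cached titles, re-scraping only when the cache is too old.
function cachedTitles($cacheFile, $maxAgeSeconds, callable $scrape)
{
	if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAgeSeconds) {
		$cached = json_decode(file_get_contents($cacheFile), true);
		if (is_array($cached)) {
			return $cached; // cache hit
		}
	}
	$titles = $scrape(); // cache miss: do the real work
	file_put_contents($cacheFile, json_encode($titles));
	return $titles;
}

// Stand-in for the real scraper so this runs without touching the forum.
$calls = 0;
$fakeScrape = function () use (&$calls) {
	$calls++;
	return array("First thread", "Second thread");
};

$cacheFile = sys_get_temp_dir() . "/thread_titles_demo.json";
if (is_file($cacheFile)) {
	unlink($cacheFile); // start the demo with an empty cache
}

$first = cachedTitles($cacheFile, 600, $fakeScrape);
$second = cachedTitles($cacheFile, 600, $fakeScrape);
echo implode(", ", $second), " (scraper ran ", $calls, " time(s))\n";
```

The second call is served from the file, so the forum would only be hit once every 10 minutes no matter how many visitors load your page.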

Tada! You've just scraped all the titles of the threads displayed on the forum page.

Here is all the code combined, which you can just put in a .php file and run:
Code:
<?php
//Note that websites with bot protection will require you to use cURL instead, to simulate sessions etc.

function getElementsByClassName($className, $document)
{
	//Match $className as a whole class name, so "title" does not also match "threadtitle"
	$pattern = '/(^|\s)'.preg_quote($className, '/').'(\s|$)/i';
	$result = array();
	foreach($document->getElementsByTagName("*") as $element)
	{
		if($element->hasAttribute("class") && preg_match($pattern, $element->getAttribute("class")))
		{
			array_push($result, $element);
		}
	}
	return $result;
}

//Load html document
$htmlDoc = new DOMDocument();
//The @ hides the warnings libxml raises for real-world (often invalid) HTML
@$htmlDoc->loadHTML(file_get_contents("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development"));

$threadList = $htmlDoc->getElementById("threadlist"); //MyBroadband forum thread list

$threads = getElementsByClassName("threadbit", $threadList);

foreach($threads as $thread)
{
	$titles = getElementsByClassName("title", $thread);
	if(empty($titles)) continue; //Skip anything without a title element
	$title = $titles[0]; //Will likely be the first (and only) element returned
	echo htmlspecialchars(trim($title->nodeValue)), "<br/>"; //Here you can store data in a database for caching and display it from there on the web
}

?>
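The note at the top of the script mentions cURL for sites with bot protection. Here is a minimal sketch of what that could look like with PHP's curl extension. The function name fetchHtml(), the user agent string and the cookie file location are assumptions of mine, and real bot protection may well need more than this.

```php
<?php
// Hypothetical helper: download a page with cURL, keeping cookies between requests.
function fetchHtml($url, $cookieFile)
{
	$ch = curl_init($url);
	curl_setopt_array($ch, array(
		CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
		CURLOPT_FOLLOWLOCATION => true,        // follow redirects
		CURLOPT_USERAGENT      => "Mozilla/5.0 (compatible; ExampleScraper/1.0)", // made-up user agent
		CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is freed
		CURLOPT_COOKIEFILE     => $cookieFile, // send previously saved cookies
		CURLOPT_TIMEOUT        => 15,          // give up after 15 seconds
	));
	$html = curl_exec($ch);
	if ($html === false) {
		throw new RuntimeException("Download failed: " . curl_error($ch));
	}
	return $html; // $ch goes out of scope here, which frees the handle
}

// Usage (not run here, since it needs network access):
// $html = fetchHtml("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development", sys_get_temp_dir() . "/cookies.txt");
// @$htmlDoc->loadHTML($html);
```

The returned string can be passed to loadHTML() in place of the file_get_contents() call.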
 

[)roi(]

Executive Member
Joined
Apr 15, 2005
Messages
6,282
Many solutions to a problem. Very good suggestion by the way.
I can understand the need to build your own, or the challenge of rebuilding it again and improving on it.
For a newbie it's probably just easier to link into an existing stable code base.
 

Ed!

Active Member
Joined
Mar 1, 2016
Messages
30
[)roi(];17488346 said:
I can understand the need to build your own, or the challenge of rebuilding it again and improving on it.
For a newbie it's probably just easier to link into an existing stable code base.

Very true. I guess it comes down to the need: how complex the problem is, etc.
 