Web Scraping for newbies - PHP

Ed! · Apr 22, 2016

So I've seen many requests on here that involve scraping data from somewhere.

Here is a simple way to do so in PHP:

We will use PHP's built-in DOM classes, mainly DOMDocument.

I've noticed that there was no DOM function for getting elements by their class name so I wrote the following function:

Code:

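function getElementsByClassName($className, $document)
{
	$result = array();
	//Walk every descendant element and keep those whose class list contains $className
	foreach($document->getElementsByTagName("*") as $element)
	{
		if($element->hasAttribute("class"))
		{
			//Split on whitespace so that "title" does not also match "threadtitle"
			$classes = preg_split('/\s+/', trim($element->getAttribute("class")));
			if(in_array($className, $classes))
			{
				$result[] = $element;
			}
		}
	}
	return $result;
}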
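As an aside, the same lookup can also be done with DOMXPath, which ships with PHP's DOM extension. This is only a sketch of one alternative, not a replacement for the function above: the name findByClass and the sample HTML are made up for illustration, and it assumes simple class names without quote characters. It matches whole class names only, so "title" does not also pick up "threadtitle".

```php
<?php
// Hypothetical helper (not from the original post): find elements by class name using XPath.
function findByClass($className, DOMDocument $document, $context = null)
{
	$xpath = new DOMXPath($document);
	// Pad the class attribute with spaces so that only whole class names match.
	$query = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' " . $className . " ')]";
	// Search below $context when one is given, otherwise search the whole document.
	$nodes = ($context === null) ? $xpath->query($query) : $xpath->query($query, $context);
	return iterator_to_array($nodes);
}

// Made-up sample markup, loosely shaped like a forum thread list.
$doc = new DOMDocument();
$doc->loadHTML('<ul id="threadlist"><li class="threadbit hot"><h3 class="threadtitle"><a class="title" href="#">First thread</a></h3></li></ul>');

$matches = findByClass("title", $doc);
echo count($matches), " match(es): ", $matches[0]->nodeValue, "\n";
```

Passing an element as the third argument limits the search to that element's descendants, which mirrors how the second parameter of getElementsByClassName is used in this post.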

I'm going to use this thread's parent page for this example.
We will list all the threads on the first page.

We need to create a DOMDocument and load some content:

Code:

$htmlDoc = new DOMDocument();
@$htmlDoc->loadHTML(file_get_contents("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development"));
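The @ above hides every warning, including a failed download. A slightly more defensive sketch might look like the following. The helper name loadHtmlString and the sample markup are assumptions for illustration only; in real use you would pass it the result of file_get_contents(), which returns false on failure.

```php
<?php
// Hypothetical helper: parse an HTML string without relying on the @ operator.
function loadHtmlString($html)
{
	// file_get_contents() returns false on failure; also reject empty input.
	if ($html === false || trim($html) === '') {
		return null;
	}
	// Collect libxml parse warnings internally instead of printing them.
	$previous = libxml_use_internal_errors(true);
	$doc = new DOMDocument();
	$ok = $doc->loadHTML($html);
	// Discard the collected warnings and restore the previous setting.
	libxml_clear_errors();
	libxml_use_internal_errors($previous);
	return $ok ? $doc : null;
}

// Real-world HTML is often sloppy; this sample has an unclosed <p>.
$doc = loadHtmlString('<div id="threadlist"><p>Unclosed paragraph</div>');
echo $doc === null ? "parse failed" : "parsed", "\n";
```

Returning null on failure lets the calling code stop early instead of running getElementById() on a document that never loaded.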

Now we need to find the element containing the data.
You can right-click anywhere in the thread list and choose Inspect Element.

Now you need to find the element containing all the threads, so just go up through the parent elements until you find one that contains all of them.

Take note of the id of that element. We will need it for our scraping.

Code:

$threadList = $htmlDoc->getElementById("threadlist"); //MyBB forum thread list

Now we need to find all the thread children of the main thread list container:

Code:

$threads = getElementsByClassName("threadbit", $threadList);

We only want the titles, so we inspect a thread and look for the element that contains its title.

Now we loop through the thread elements and extract the values of the titles:

Code:

foreach($threads as $thread)
{
	$title = getElementsByClassName("title", $thread)[0]; //Will likely be the first (and only) element returned
	echo $title->nodeValue, "<br/>"; //Here you can store data in a database for caching and display it from there on the web
}

Tada! You've just scraped all the titles of the threads displayed on the forum page.

Here is all the code combined, which you can just put in a .php file and run:

Code:

<?php
//Note that websites with bot protection will require you to use cURL instead, to simulate sessions etc.

function getElementsByClassName($className, $document)
{
	$result = array();
	//Walk every descendant element and keep those whose class list contains $className
	foreach($document->getElementsByTagName("*") as $element)
	{
		if($element->hasAttribute("class"))
		{
			//Split on whitespace so that "title" does not also match "threadtitle"
			$classes = preg_split('/\s+/', trim($element->getAttribute("class")));
			if(in_array($className, $classes))
			{
				$result[] = $element;
			}
		}
	}
	return $result;
}

//Load the HTML document
$htmlDoc = new DOMDocument();
@$htmlDoc->loadHTML(file_get_contents("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development"));

$threadList = $htmlDoc->getElementById("threadlist"); //MyBB forum thread list

$threads = getElementsByClassName("threadbit", $threadList);

foreach($threads as $thread)
{
	$title = getElementsByClassName("title", $thread)[0]; //Will likely be the first (and only) element returned
	echo $title->nodeValue, "<br/>"; //Here you can store data in a database for caching and display it from there on the web
}

?>
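For sites that need cookies or a browser-like user agent, the fetch step could be swapped for cURL along these lines. Treat this as a rough sketch only: the function name fetchWithCurl, the user agent string, and the timeout are assumptions, and real bot protection may need a lot more than this.

```php
<?php
// Hypothetical helper: fetch a page with cURL, keeping cookies between requests.
function fetchWithCurl($url, $cookieFile)
{
	$ch = curl_init($url);
	curl_setopt_array($ch, array(
		CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
		CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
		CURLOPT_USERAGENT => "Mozilla/5.0 (compatible; ExampleScraper/1.0)", // made-up user agent
		CURLOPT_COOKIEJAR => $cookieFile,      // write received cookies to this file
		CURLOPT_COOKIEFILE => $cookieFile,     // send cookies from this file on later requests
		CURLOPT_TIMEOUT => 15,                 // give up after 15 seconds
	));
	// curl_exec() returns the response body, or false on failure.
	// The handle is released when $ch goes out of scope at the end of the function.
	return curl_exec($ch);
}
```

The return value can be passed to loadHTML() in place of the file_get_contents() call, after checking that it is not false.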

[)roi(] · Apr 22, 2016

Why not just use Goutte or Simple HTML DOM?

Ed! · Apr 22, 2016

[)roi(] said:
Why not just use Goutte?

Many solutions to a problem. Very good suggestion by the way.

[)roi(] · Apr 23, 2016

Ed! said:
Many solutions to a problem. Very good suggestion by the way.

I can understand the need to build your own, or the challenge of rebuilding it and improving on it.
For the newbie it's probably just easier to link into an existing stable code base.

Ed! · Apr 23, 2016

[)roi(] said:
I can understand the need to build your own, or the challenge of rebuilding it and improving on it.
For the newbie it's probably just easier to link into an existing stable code base.

Very true. I guess it comes down to the need: how complex the problem is, etc.

[)roi(] · Apr 23, 2016

Ed! said:
Very true. I guess it comes down to the need: how complex the problem is, etc.

Anyway cool post, keep them coming.
