So I've seen many requests on here that involve scraping data from somewhere.
Here is a simple way to do so in PHP.
We will use PHP's built-in DOMDocument class, which parses HTML into a tree of DOMElement nodes.
There is no built-in DOM method for getting elements by their class name, so I wrote the following function:
Code:
function getElementsByClassName($className, $context)
{
    //$context may be the DOMDocument or any DOMElement to search inside
    $result = array();
    foreach($context->getElementsByTagName("*") as $element)
    {
        //Compare whole class names so that "title" does not also match "subtitle"
        $classes = preg_split('/\s+/', trim($element->getAttribute("class")));
        if(in_array($className, $classes, true))
        {
            $result[] = $element;
        }
    }
    return $result;
}
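As an alternative to looping over every element, the same lookup can be sketched with DOMXPath, which ships with PHP's DOM extension. The helper name queryByClass and the sample markup are my own for illustration. It matches whole class names only, so "title" does not match "subtitle".

```php
<?php
// Hypothetical helper: find elements whose class attribute contains
// $className as a whole, space-separated name, using DOMXPath.
function queryByClass(DOMXPath $xpath, string $className, ?DOMNode $context = null): array
{
    // Pad the normalised class list with spaces so " name " matches whole names only.
    // Assumes $className contains no double quote (it is embedded in a quoted XPath string).
    $expr = './/*[contains(concat(" ", normalize-space(@class), " "), " ' . $className . ' ")]';
    $nodes = $xpath->query($expr, $context);
    return $nodes === false ? [] : iterator_to_array($nodes);
}

// Made-up sample markup, not the real forum HTML.
$doc = new DOMDocument();
$doc->loadHTML('<div id="list"><p class="title big">A</p><p class="subtitle">B</p></div>');
$xpath = new DOMXPath($doc);

foreach (queryByClass($xpath, 'title') as $node) {
    echo $node->nodeValue, "\n"; // prints only the text of the element with class "title"
}
```

Passing an element as the third argument limits the search to that element's descendants.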
I'm going to use this thread's parent forum page for this example.
We will list all the threads on the first page.
We need to create a DOMDocument and load some content:
Code:
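libxml_use_internal_errors(true); //Collect parser warnings about sloppy HTML instead of printing them
$html = file_get_contents("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development") or die("Could not download the page");
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($html);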
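If you would rather see what the parser complained about than hide it, here is a minimal sketch using libxml_use_internal_errors() and libxml_get_errors(). The helper name loadHtmlString and the sample markup are made up for this example. In the tutorial the string would come from file_get_contents().

```php
<?php
// Hypothetical helper: parse an HTML string and log libxml warnings
// instead of silencing everything with @.
function loadHtmlString(string $html): DOMDocument
{
    if ($html === '') {
        throw new InvalidArgumentException('No HTML to parse');
    }
    $previous = libxml_use_internal_errors(true); // buffer parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    foreach (libxml_get_errors() as $error) {
        // Real-world pages are rarely valid HTML, so log and carry on.
        error_log('HTML warning: ' . trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting
    return $doc;
}

// Made-up sample markup, not the real forum HTML.
$doc = loadHtmlString('<html><body><div id="threadlist"><unknowntag>hi</unknowntag></div></body></html>');
echo $doc->getElementsByTagName('div')->length, "\n";
```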
Now we need to find the element containing the data.
Right-click anywhere in the thread list in your browser and choose Inspect Element:
Then go up through the parent elements until you find the one that contains all the threads:
Take note of that element's id. We will need it for our scraping.
Code:
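$threadList = $htmlDoc->getElementById("threadlist"); //The MyBroadband thread list container
if($threadList === null) die("Thread list not found; the page layout may have changed");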
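The "go up the parent elements" step can also be done in code if you want to double-check what you found in the inspector. This is only a sketch: nearestAncestorWithId is a made-up helper and the markup below just mimics a thread list. On a real page each thread row may carry its own id, in which case you would keep climbing until you reach the element that wraps all the rows.

```php
<?php
// Hypothetical helper: climb from a node to its nearest ancestor that has an id.
function nearestAncestorWithId(DOMNode $node): ?DOMElement
{
    for ($current = $node->parentNode; $current !== null; $current = $current->parentNode) {
        if ($current instanceof DOMElement && $current->hasAttribute('id')) {
            return $current;
        }
    }
    return null; // reached the top of the document without finding an id
}

// Made-up markup that mimics a thread list; the real page will differ.
$doc = new DOMDocument();
$doc->loadHTML('<ol id="threads"><li class="threadbit"><a class="title">First</a></li>'
    . '<li class="threadbit"><a class="title">Second</a></li></ol>');

$firstLink = $doc->getElementsByTagName('a')->item(0);
$container = nearestAncestorWithId($firstLink);
echo $container->getAttribute('id'), "\n"; // id of the enclosing list
```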
Now we need to find all the thread children of the main thread list container:
Code:
$threads = getElementsByClassName("threadbit", $threadList);
We only want the titles, so we look for the element that contains the title in each thread:
So now we loop through the thread elements and extract the values of the titles:
Code:
foreach($threads as $thread)
{
    $titles = getElementsByClassName("title", $thread);
    if(empty($titles)) continue; //Skip rows without a title element
    $title = $titles[0]; //Will likely be the first (and only) element returned
    echo htmlspecialchars(trim($title->nodeValue)), "<br/>"; //Here you can store data in a database for caching and display it from there on the web
}
Tada! You've just scraped all the titles of the threads displayed on the forum page.
Here is all the code combined, which you can put in a .php file and run:
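On caching: hitting the forum on every page view is slow and not very polite. Instead of a database, this rough sketch caches the titles in a JSON file for a few minutes. The helper cachedTitles, the file name and the five-minute lifetime are all assumptions for illustration, and the closure stands in for the scraping loop.

```php
<?php
// Hypothetical helper: return cached titles if the cache file is fresh,
// otherwise call $scrape and rewrite the cache.
function cachedTitles(string $cacheFile, int $ttlSeconds, callable $scrape): array
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttlSeconds) {
        $cached = json_decode((string) file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }
    $titles = $scrape();
    file_put_contents($cacheFile, json_encode($titles), LOCK_EX);
    return $titles;
}

// Stand-in for the scraping loop; a real callback would return the scraped titles.
$calls = 0;
$scrape = function () use (&$calls) {
    $calls++;
    return ['First thread', 'Second thread'];
};

$cacheFile = sys_get_temp_dir() . '/thread_titles_' . uniqid() . '.json';
$first = cachedTitles($cacheFile, 300, $scrape);  // scrapes and writes the cache
$second = cachedTitles($cacheFile, 300, $scrape); // served from the cache file
unlink($cacheFile); // clean up the demo file
```

The second call never runs the callback, so the forum is only fetched once per cache lifetime.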
Code:
<?php
//Note: websites with bot protection may require cURL instead, so that you can simulate sessions etc.
function getElementsByClassName($className, $context)
{
    //$context may be the DOMDocument or any DOMElement to search inside
    $result = array();
    foreach($context->getElementsByTagName("*") as $element)
    {
        //Compare whole class names so that "title" does not also match "subtitle"
        $classes = preg_split('/\s+/', trim($element->getAttribute("class")));
        if(in_array($className, $classes, true))
        {
            $result[] = $element;
        }
    }
    return $result;
}
//Load html document
libxml_use_internal_errors(true); //Collect parser warnings about sloppy HTML instead of printing them
$html = file_get_contents("http://mybroadband.co.za/vb/forumdisplay.php/132-Software-and-Web-Development") or die("Could not download the page");
$htmlDoc = new DOMDocument();
$htmlDoc->loadHTML($html);
$threadList = $htmlDoc->getElementById("threadlist"); //The MyBroadband thread list container
if($threadList === null) die("Thread list not found; the page layout may have changed");
$threads = getElementsByClassName("threadbit", $threadList);
foreach($threads as $thread)
{
    $titles = getElementsByClassName("title", $thread);
    if(empty($titles)) continue; //Skip rows without a title element
    $title = $titles[0]; //Will likely be the first (and only) element returned
    echo htmlspecialchars(trim($title->nodeValue)), "<br/>"; //Here you can store data in a database for caching and display it from there on the web
}
?>
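For sites that need cookies or a browser-like session, the note about cURL at the top of the script could look roughly like this. It is a sketch, not a drop-in answer for any particular site: fetchHtml, the user agent string and the timeout are my own assumptions, and it needs PHP's curl extension.

```php
<?php
// Hypothetical helper: fetch a page with cURL, keeping cookies between requests.
// Returns the body, or null on a transfer error or a non-2xx status.
function fetchHtml(string $url, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,                 // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,                 // follow redirects
        CURLOPT_TIMEOUT        => 15,                   // assumed value, in seconds
        CURLOPT_USERAGENT      => 'ExampleScraper/1.0', // assumed value
        CURLOPT_COOKIEJAR      => $cookieFile,          // cookies are saved here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile,          // cookies are read from here
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}
```

The returned string can be passed to loadHTML() in place of the file_get_contents() result, after checking that it is not null.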