RSS and MySQL

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
Hello, this is basically a follow up with the next phase of my plan, but I am a little lost again.

I have an RSS feed from News24, I want to store that RSS feed in my DB and then read from it and create a new rss feed to post the Title, link and description on the forum.

Now I have the second part figured out creating the RSS feed is very straight forward basically:

PHP:
//DB code up here
while($row = $result->fetch_assoc()){
?>
<item>
            <title><?php echo $row['title']; ?></title>
            <link><?php echo $row['link']; ?></link> //this link will be changed according to your post's URL
            <description><?php echo $row['description']; ?></description>
</item>

Easy.

The question is.....

How do I import the original RSS feed without making duplicate entries?

currently my foreach loop has this inside:

PHP:
$statement = $db->prepare("INSERT INTO market_feed (`title`, `description`, `link`, `date`) VALUES (:title, :description, :link, :date)");
$statement->execute(array(':title' => $title, ':description' => $description, ':link' => $link, ':date' => $date));

I hope this makes sense?

Baiscally I will run this as a cron job as well but I do not know how to prevent inserting duplicate feeds into the DB.

This is the feed in question:
http://feeds.24.com/articles/Fin24/Markets/rss

EDIT:

This is my current cronjob this is what I do to insert the RSS feed into the DB however I am at a loss as to what to do in order to prevent duplicate inserts, this is where my knowledge ends (this is a lot of learning in the past two days) Thanks to everyone who helped so far! I greatly appreciate it.

PHP:
<?php
$rss = new DOMDocument();
$rss->load('http://feeds.24.com/articles/Fin24/Markets/rss');
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
    $item = array ( 
        'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
        'desc' => $node->getElementsByTagName('description')->item(0)->nodeValue,
        'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
        'date' => $node->getElementsByTagName('pubDate')->item(0)->nodeValue,
        );
    array_push($feed, $item);
}
$limit = 5;
for($x=0;$x<$limit;$x++) {
    $title = str_replace(' & ', ' &amp; ', $feed[$x]['title']);
    $link = $feed[$x]['link'];
    $description = $feed[$x]['desc'];
    $date = date('l F d, Y', strtotime($feed[$x]['date']));

    $statement = $db->prepare("INSERT INTO market_feed (`title`, `description`, `link`, `date`) VALUES (:title, :description, :link, :date)");
    $statement->execute(array(':title' => $title, ':description' => $description, ':link' => $link, ':date' => $date));



}
?>


This is how I fetch it from the DB to display on the main page:

PHP:
<?php
//Select RSS feed to show

$sql = "SELECT * FROM (SELECT ID, title, description, link, date FROM market_feed ORDER BY ID DESC LIMIT 5) sub ORDER BY ID ASC";

$statement = $db->prepare($sql);
$statement->execute();

//Display the selected RSS feed

while ($row = $statement->fetch(PDO::FETCH_ASSOC)) {

    echo '<p class="no-buffer"><strong><a href="'.$row["link"].'" target="_blank" title="'.$row["title"].'">'.$row["title"].'</a></strong><br />';
    echo '<small><em>Posted on '.$row["date"].'</em></small></p>';
    echo '<p class="no-buffer-description">'.$row["description"].'</p>';  
  
}
?>
 
Last edited:

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
How would you know it is duplicate?
If you do it by looking at the "date" and "title" values, then you could add a "where not exist" clause to the insert statement.
See details here: http://stackoverflow.com/questions/3164505/mysql-insert-record-if-not-exists-in-table

This is the RSS feed from news24

Code:
<item>
<title>
Fin24.com | Emerging assets rise as traders push back Fed bets
</title>
<description>
Emerging-market currencies and stocks have advanced for a fifth day as traders pushed back forecasts for when the Federal Reserve will increase US interest rates.
</description>
<link>
http://www.fin24.com/Markets/International-Markets/emerging-assets-rise-as-traders-push-back-fed-bets-20160810
</link>
<pubDate>Wed, 10 Aug 2016 09:07:49 +0200</pubDate>
<enclosure url="http://scripts.24.co.za/img/sites/fin24.png" length="1" type="image/png"/>
</item>

I was thinking maybe match the title?

However I am not sure if that is the correct way or if that is even possible.

My idea would be, the cron job runs check if the title exists if it doesnt then it adds the new item if that makes sense?
 

Nod

Honorary Master
Joined
Jul 22, 2005
Messages
10,057
Ok, so the cron will basically do a count where the title = 'x', and if it is 0, then it writes the new entry, otherwise it skips?
Should work. I would also use the date field, just to catch possible updates to existing stories (if applicable to this kind of news).

There are many roads to the same destination. Pick the one that will work for you.
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
Ok, so the cron will basically do a count where the title = 'x', and if it is 0, then it writes the new entry, otherwise it skips?
Should work. I would also use the date field, just to catch possible updates to existing stories (if applicable to this kind of news).

There are many roads to the same destination. Pick the one that will work for you.

Just to clear it up since this is my first time doing the DB thing.

when you say

"count where the title = 'x', and if it is 0, then it writes the new entry, otherwise it skips?"

That zero is confusing me as the title is a string.

I was thinking the statement should have a if $title = $title then skip.

In otherwords if $title from the RSS feed matches that of the $title in the database then skip this insert and move on to the next.

I hope this makes sense?

This is my current cronjob this is where my knowledge end (this is a lot of learning in the past two days)

PHP:
<?php
$rss = new DOMDocument();
$rss->load('http://feeds.24.com/articles/Fin24/Markets/rss');
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
    $item = array ( 
        'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
        'desc' => $node->getElementsByTagName('description')->item(0)->nodeValue,
        'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
        'date' => $node->getElementsByTagName('pubDate')->item(0)->nodeValue,
        );
    array_push($feed, $item);
}
$limit = 5;
for($x=0;$x<$limit;$x++) {
    $title = str_replace(' & ', ' &amp; ', $feed[$x]['title']);
    $link = $feed[$x]['link'];
    $description = $feed[$x]['desc'];
    $date = date('l F d, Y', strtotime($feed[$x]['date']));

    $statement = $db->prepare("INSERT INTO market_feed (`title`, `description`, `link`, `date`) VALUES (:title, :description, :link, :date)");
    $statement->execute(array(':title' => $title, ':description' => $description, ':link' => $link, ':date' => $date));



}
?>
 
Last edited:

Nod

Honorary Master
Joined
Jul 22, 2005
Messages
10,057
For example if you do:
Code:
select count(name) from table where title="title"
The result will be a number of articles with that title. In your case ideally either 1 or 0.
If the value is >0 then you already have that title, if it is 0, then it doesn't exist, and you should add it to the DB.

From my previous link, you could do this all in the insert query:
Code:
INSERT INTO market_feed (`title`, `description`, `link`, `date`)
SELECT * FROM (SELECT :title, :description, :link, :date) AS tmp
WHERE NOT EXISTS (
    SELECT `title` FROM market_feed WHERE title= ':title'
) LIMIT 1;

There is also the option of using:
  • INSERT IGNORE, and (skip if exist)
  • INSERT … ON DUPLICATE KEY UPDATE (update field if exist)
  • REPLACE (replace if exist, or create if not)
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
@NOD, you are a beacon of light man!

MUCH appreciated, so far my have to buy a beer list is

@BIO
@Rward
@NOD
@Koeks
@kabel
@Hamster
@gkm

All you guys are amazing
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
@NOD

I did this:

PHP:
$statement = $db->prepare("INSERT INTO news_feed (`title`, `description`, `link`, `date`) SELECT * FROM (SELECT :title, :description, :link, :date) AS tmp WHERE NOT EXISTS ( SELECT `title` FROM news_feed WHERE title= ':title') LIMIT 1;");
$statement->execute(array(':title' => $title, ':description' => $description, ':link' => $link, ':date' => $date));

Which came close, this is the result in phpmyadmin
ID 1 - 5 was already in the DB

GIF
2016-08-10_11-20-02.jpg
Download the gif here:
http://i.giphy.com/3o6Zt07Qy69MBjOWaI.gif

PNG
2016-08-10_11-20-47.jpg
 
Last edited:

Nod

Honorary Master
Joined
Jul 22, 2005
Messages
10,057
2 and 9 seems to be identical? Is there a difference in title?
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
2 and 9 seems to be identical? Is there a difference in title?

none that I can find, hence why I am also scratching my brain.

Exported to CSV and took a screenshot:

2016-08-10_12-52-12.jpg
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
Well on the plus side.

At least my communication with the forum DB works!!

I take RSS feed from News website

I then store it in my DB

I then display it to my users on the home page and I create a new RSS feed and send that to a script on the forum and post it to the forum as a cron job every 10 minutes ignoring existing feed items.

Wow people!! What a world

However I still need to prevent duplicates to the DB form the News RSS

View attachment 381211

2016-08-10_13-01-29.jpg
 
Last edited:

MagicDude4Eva

Banned
Joined
Apr 2, 2008
Messages
6,479
I personally would take the link as the unique identifier - it is more likely that a title will change and not necessarily the link. Alternatively you could always create a SHA256 hash of title/description and store the hash as the unique key.
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
I personally would take the link as the unique identifier - it is more likely that a title will change and not necessarily the link. Alternatively you could always create a SHA256 hash of title/description and store the hash as the unique key.

That makes sense, I am in the deep end I might as well go all out. This hash sounds familiar I have no idea what it is, but I recall reading the word a few times.

I have no idea how it looks or works googling brings up mostly cannabis
 
Last edited:

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
Holy smolly so If I am understanding this correctly then we add a coloumn in the DB called unique_hash

then we create a new unique_hash in the foreach loop and on the SQL statement we do insert, check if unique_hash in loop matches unique_hash in DB if it does skip if it does not match add?

So in the loop I will also add
PHP:
$unique_hash = hash('$title', '$link');

EDIT:
I am sorry I see the first bit is the type of hash.

So I must do:

PHP:
$unique_hash = hash('SHA256', '$title.$link');

TESTING

I tested this now seems although I change the url, the hash seems to stay the same?

PHP:
<?php

$startTime = microtime(true);


// whatever you want to time
$title = "Tri-tip pig ribs, landjaeger tip tongue, capicola braunschweiger tip shoulder 
ham jerky pastrami tip venison. Ham ribs bresaola andouillette, ham pig 
braunschweiger, pig pastrami pig brisket pig landjaeger ham bresaola. Ribs 
pancetta cow short beef, kevin biltong ham pancetta ham turducken cow corned 
prosciutto pig tip cow corned belly hock. Ham tail pancetta shankle, ham 
bresaola cow sausage. Capicola tip pork kielbasa ham ham short ribs beef 
meatball. Cow ham beef ribs chuck.";

$link = "http://www.someurl.com/news/rss/";  

$unique_hash = hash('SHA256', '$title.$link');

echo "<br>";
echo "Time:  " . number_format(( microtime(true) - $startTime), 4) . " Seconds\n";
echo "<br>";
echo "$unique_hash";

?>
 
Last edited:

Nod

Honorary Master
Joined
Jul 22, 2005
Messages
10,057
Actually, the variables ($title.$link), does not need any quotes.
The quotes seem to be included in the resulting hash.

Tried all 3 (single, double and without) below:
Code:
$ unique_hash = hash('SHA256', "Tri-tip pig ribs, landjaeger tip tongue, capicola braunschweiger tip shoulder
ham jerky pastrami tip venison. Ham ribs bresaola andouillette, ham pig
braunschweiger, pig pastrami pig brisket pig landjaeger ham bresaola. Ribs
pancetta cow short beef, kevin biltong ham pancetta ham turducken cow corned
prosciutto pig tip cow corned belly hock. Ham tail pancetta shankle, ham
bresaola cow sausage. Capicola tip pork kielbasa ham ham short ribs beef
meatball. Cow ham beef ribs chuck..http://www.someurl.com/news/rss/")
4bd2fbd4de03512bc0d5c7f15a789b66ed786761e33c4508d8e55dae829487d1

$ unique_hash = hash('SHA256', 'Tri-tip pig ribs, landjaeger tip tongue, capicola braunschweiger tip shoulder
ham jerky pastrami tip venison. Ham ribs bresaola andouillette, ham pig
braunschweiger, pig pastrami pig brisket pig landjaeger ham bresaola. Ribs
pancetta cow short beef, kevin biltong ham pancetta ham turducken cow corned
prosciutto pig tip cow corned belly hock. Ham tail pancetta shankle, ham
bresaola cow sausage. Capicola tip pork kielbasa ham ham short ribs beef
meatball. Cow ham beef ribs chuck..http://www.someurl.com/news/rss/')
aadc03e92cff7b3f27a6e322d86ba5ceecb7066a809c0bade5bda76107f4d5fd

$ unique_hash = hash('SHA256', Tri-tip pig ribs, landjaeger tip tongue, capicola braunschweiger tip shoulder
ham jerky pastrami tip venison. Ham ribs bresaola andouillette, ham pig
braunschweiger, pig pastrami pig brisket pig landjaeger ham bresaola. Ribs
pancetta cow short beef, kevin biltong ham pancetta ham turducken cow corned
prosciutto pig tip cow corned belly hock. Ham tail pancetta shankle, ham
bresaola cow sausage. Capicola tip pork kielbasa ham ham short ribs beef
meatball. Cow ham beef ribs chuck..http://www.someurl.com/news/rss/)
c5d0f4d5fdae4d510418c1453b699c21aa40d06467cde0d03692d5eee697dc70
 
Last edited:

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
Change
Code:
$unique_hash = hash('SHA256', '$title.$link');
to
Code:
$unique_hash = hash('SHA256', "$title.$link");
Single quotes will print literal strings.

OMG. what a world! that works!

So much to learn!
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
Now back to number 1 since I assume this would mean to just swap out :title with :unique_hash however :title did not seem to work so maybe something else is wrong?

Or should I just go with :unique_hash and see what it does
 

MagicDude4Eva

Banned
Joined
Apr 2, 2008
Messages
6,479
Holy smolly so If I am understanding this correctly then we add a coloumn in the DB called unique_hash

then we create a new unique_hash in the foreach loop and on the SQL statement we do insert, check if unique_hash in loop matches unique_hash in DB if it does skip if it does not match add?

Perfectionist: Create a table where you have columns: feeditemhash, title, link, description and where feeditemhash is the unique key (I personally would add "feedsource_id" so that you can have multiple feeds and then have the unique key across two columns).

With the unique key the SQL-insert would just fail with a duplicate key error and you skip that error. Above really depends on what you want to achieve. If you add the "feedsource_id" you can cater for multiple feeds. You could very well opt to still have the unique key on the feeditemhash which then gives you the option that only the first "discovered" feed item will be inserted.
 

Thor

Honorary Master
Joined
Jun 5, 2014
Messages
44,236
Perfectionist: Create a table where you have columns: feeditemhash, title, link, description and where feeditemhash is the unique key (I personally would add "feedsource_id" so that you can have multiple feeds and then have the unique key across two columns).

With the unique key the SQL-insert would just fail with a duplicate key error and you skip that error. Above really depends on what you want to achieve. If you add the "feedsource_id" you can cater for multiple feeds. You could very well opt to still have the unique key on the feeditemhash which then gives you the option that only the first "discovered" feed item will be inserted.

I am a slight bit lost.


This is what my DB looks like:

2016-08-10_13-51-05.jpg


This is what both feeds looks like ( I made date text in this case since it's just a pubdate in the RSS (Wednesday August 10, 2016))
2016-08-10_13-51-26.jpg

what should the type be for the hash field/row?
 
Top