Finding keyword in a MySQL database PHP

SBSP

Senior Member
Joined
Sep 7, 2007
Messages
662
I'm trying to match search parameter as best as possible with records in a database.

E.G I want to find
001 Super Mario: 3
002 Super Mario Kart
Tekken

But the database stores the above records like this.
Rec1: Super Mario 3.
Rec2: Super Mario Kart
Rec3: Tekken

Using like '%%' will not work except for Tekken

I tried using MySQL Fulltext search ,[match against] and fiddled with the relevance counter to try and find a better match.
which gives good results but causes some basic inaccuracies and finding the wrong record


Here's an example.

It will find Tekken but its relevance score is too low because other rewords exist that are more relevant.

I.E
Fighting Edition: Tekken 6 / Tekken Tag Tournament 2 / SoulCalibur V (Because Tekken is mentioned more than once)
Not sure why Tekken 6 and Tekken 2 is ranked higher.

Also if you search Tekken 2 , It will also not ranked the highest because Tekken was mentioned more than once in other records.

The 2 in tekken 2 is not taken into concideration because the fulltext search max len setting is defaulted to 4 , Can probably decrease it but I imagine that will cause havoc.

Any ideas ?
 

Darko

Senior Member
Joined
Jul 9, 2008
Messages
624
You can use the OR statement to combine the searches.

SELECT productName
FROM products
WHERE (productName LIKE '%tekken%')
OR (productName LIKE '%mario%')
OR (productName LIKE '%kart%');

Or am I missing your question?
 

SBSP

Senior Member
Joined
Sep 7, 2007
Messages
662
Or am I missing your question?
Yes sort of :) Probably due to my bad explanation.

Some background. I obtained a database that contains meta data for over 60k games.
I.E game_title, description, players, co-op, platform ect ect.

The database comes from thegamesdb.net. Its mostly meant to be a Wiki for all games out there but is also meant to be a database used for scraping images. So say you have a collection of retro games loaded on I.E Retro pi or Lakka.

Most of these systems allow you to scrape an image database so that it loads thumbnail images of these games in a a menu. Simply gives a nice touch to your game library.

The database has a table called games_hashes, which is meant to store the hash value for games. So instead of searching by file name it allows matching by a MD5 Checksum. Their hash table is empty. Besides "trying" to be helpful Its a nice challenge to me.

So I would like to create a tool which people can run against their game libraries Which will then take each game they have , calculate the checksum and uplaod to the database via a JSON web service.

Sometime files are numbered.
001 Mario Kart.nes
002 Super Contra.nes

Also sometimes the database title has a : character in the title name but due to a : being an illegal character in a path name people substitute it with a hyphen. There are various other scenarios which i cannot think of now.

I have an idea straight SQL will not resolve the issue, I probably have to do something to the file name before trying to search it I.E replace hyphens with nothing/space remove colons ect ect. or even replace everything besides alpha numerical characters.

I have already tried various things but i keep running into other issues, like word1-word2 = word1word2 or if you replace with a space then you have to end up replacing spaces with a space. It gets out of hand :) .
 
Last edited:

Nod

Executive Member
Joined
Jul 22, 2005
Messages
9,041
If people type in "Mortal Kombat: Deception" in the search bar, then you could replace the "ugly" text like the ":" with a "%" (wildcard). This would take care of the cases where the colon was replaced with a dash, or something else.
There is also the MySQL's full text search function.

Have you looked into SphinxSearch?
It looks interesting at least.
 
Last edited:

SBSP

Senior Member
Joined
Sep 7, 2007
Messages
662
Had a look at elastic search. Its not what I expected. Its a database on it's own. bit above my head at the moment,.
Dont realy want to convert the data I have into an elastic database,. (If I understand it correctly)


Looking at Sphinx Search, thanks :)

I am making use of MySQL full text search,. And it works well for humans but for a computer it will rank results incorrectly.

Ie if you search "Tekken 2" , "Tekken 6 - We are the Tekken" will rank higher than Tekken 2 because "Tekken" is mentioned more than once in the title.
 

Mystic Twilight

Expert Member
Joined
Dec 23, 2010
Messages
1,992
Not exactly sure how you're implementing this or intend to have this implemented, have you considered making a program to leverage TF-IDF to do your sorting and searching based on assigned weights? See below.

 

gkm

Expert Member
Joined
May 10, 2005
Messages
1,517
I am making use of MySQL full text search,. And it works well for humans but for a computer it will rank results incorrectly.

Ie if you search "Tekken 2" , "Tekken 6 - We are the Tekken" will rank higher than Tekken 2 because "Tekken" is mentioned more than once in the title.
Something like ElasticSearch might unfortunately do the same, due to the higher frequency of the word: https://www.compose.com/articles/how-scoring-works-in-elasticsearch/
But maybe not, since ES also takes field length into account.

And yes, ES is its own database. I still think in the long run with will be worth the effort if you want the best possible results, since in my opinion it is the best you are going to be able to get for fast full text search. That being said, search algorithms do not work like humans, so you will not always get results ranked the way you would have done it.
 

gkm

Expert Member
Joined
May 10, 2005
Messages
1,517
Not exactly sure how you're implementing this or intend to have this implemented, have you considered making a program to leverage TF-IDF to do your sorting and searching based on assigned weights? See below.

That is also part of what ElasticSearch does.
 

Mystic Twilight

Expert Member
Joined
Dec 23, 2010
Messages
1,992
That is also part of what ElasticSearch does.
Yea, its just the OPs use case seems too small to implement an entire saas ecosystem when you could diy the solution. Of course don't reinvent the wheel, but maybe make your own bicycle wheel instead of fitting on tank treads to do the job. Maybe I'm misunderstanding at what scale OP is trying to achieve. Just a thought.
 

gkm

Expert Member
Joined
May 10, 2005
Messages
1,517
Yea, its just the OPs use case seems too small to implement an entire saas ecosystem when you could diy the solution. Of course don't reinvent the wheel, but maybe make your own bicycle wheel instead of fitting on tank treads to do the job. Maybe I'm misunderstanding at what scale OP is trying to achieve. Just a thought.
I agree it could be a fun project to build ones own TF-IDF rankings. And the article you have linked is really good. Thanks for providing it. I was under the impression the OP just wanted something that would work. Free text searching does take some effort, so I guess the OP shall have to learn some tool or roll something to solve it. Should be a fun project overall.
 

SBSP

Senior Member
Joined
Sep 7, 2007
Messages
662
Here's a query that will manipulate the weight and prioritize exact matches above the standard MySQL relevance full text search.

Code:
select
game_title,
MATCH(game_title) AGAINST('tekken 2') AS relevance1,
MATCH(game_title) AGAINST('"tekken 2"') AS relevance2,
CASE WHEN game_title = 'tekken 2' then 100 else 1 end as relevance3,

    MATCH(game_title) AGAINST('tekken 2')+
    MATCH(game_title) AGAINST('"tekken 2"')*
    CASE WHEN game_title = 'tekken 2' then 100 else '0'
end as 'final_relevance'

from games
where
MATCH(game_title) AGAINST('tekken 2') Limit 10
relevance.png

But not quite there yet, will resolve some searches, better than what I have had so far. I'm going to try the TF-IDF technique.
 
Last edited:

SBSP

Senior Member
Joined
Sep 7, 2007
Messages
662
TF-IDF implemented by ID-10-T :)

Its works quite well. I had to customize it a little by overwriting a divisor value when a "document" contains a single word in order to manipulate the TF-IDF formula. Single words rank higher because the divisor cause a higher return value. I.E Frequency/WordCount 1 / 1 = 1 where as 1 / 2 = 0.5

I can now search terms in any of the following order.
001 2 Tekken Bla bla bla
And it will rank Tekken 2 the highest.

Not perfect because
001 Tekken Junbkdsvgsad
Will rank tekken and tekken 2 with the same rank score. But I think i can fix this with a simple if statement or simply rather not return anything.

The article provided by Mystic Twilight is MAGIC! thanks again.The same site provides a python example which also helped.
Here's Both
https://towardsdatascience.com/tf-i...-in-python-on-real-world-dataset-796d339a4089
Python Example
https://towardsdatascience.com/natu...feature-engineering-using-tf-idf-e8b9d00e7e76

And another article that clarified it all for me.
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/


Not sure if my PHP script below makes any sense but here it is anyways.
No prepared statements , escaped strings or anthing just plain winging it as i read through the article.

PHP can be a real ******* when it comes declaration types and divisions! :mad:

Proof of concept!
Try it out
https://www.retroskraper.com/test.php

PHP:
<?php
require_once 'modules/mysqllogin.php';
require_once 'modules/filters.php';
require_once 'modules/public.php';

if( !empty($_GET['search']) ){$search=MySanitize($_GET['search']);}else{$search='';}

echo "<h4>Input=[$search]</h4>";

$MySQLServer=GETdbHostNameRO();
$MySQLDB=GETdbNameRO();
$MySQLUsername=GETdbUsernameRO();
$MySQLPassword=GETdbPasswordRO();

            $SqlQuery="
            select distinct games.* from games  where  MATCH(game_title) AGAINST('$search') Limit 10 ";
//BOF Generate Bag of words.
            $SearchDocumentCount=0;
            $DocRecArray = array();
            $con=mysqli_connect($MySQLServer,$MySQLUsername,$MySQLPassword,$MySQLDB);
            if (mysqli_connect_errno())
            {
            echo "Failed to connect to MySQL: " . mysqli_connect_error();
            }
            $result = mysqli_query($con,$SqlQuery);
            while($row = mysqli_fetch_array($result))
            {
                $SearchDocumentCount++;
                $game_title=$row['game_title'];
                array_push($DocRecArray,$game_title);
                //echo '1' . $game_title . '<br>';
            }
            mysqli_close($con);  
//EOF Generate Bag of words.      

            $BuildTable='';
            $SqlQuery="
            select distinct
            games.*,
             MATCH(game_title) AGAINST('$search') AS relevance1,
             MATCH(game_title) AGAINST('\"$search\"') AS relevance2,
             CASE WHEN game_title = '$search' then 50 else 1 end as relevance3,
           
             MATCH(game_title) AGAINST('$search')+
             MATCH(game_title) AGAINST('\"$search\"')+
             CASE WHEN game_title = '$search' then 50 else 1 end as final_relevance,
                         
            (SELECT filename from banners where games_id=games.id and filename like '%boxart/front%' limit 1) as 'LOCAL_IMG_LINK',
            (SELECT name from platforms where id=games.platform limit 1) as 'platform_name',
            (SELECT genre from genres where genres.id=games_genre.genres_id LIMIT 1 ) as 'genre_name',
            (SELECT name from pubs_list where id=games_pubs.pub_id) as 'publisher_name',
            (SELECT name from devs_list where id=games_devs.dev_id) as 'developer_name',
            games_genre.genres_id
                                   
            from games
            left join games_genre on games_genre.games_id=games.id
            left join games_pubs on games_pubs.games_id=games.id
            left join games_devs on games_devs.games_id=games.id
            where
            MATCH(game_title) AGAINST('$search')

            Limit 10
            ";
            $reccount=0;
            $BuildOutput='';
            $con=mysqli_connect($MySQLServer,$MySQLUsername,$MySQLPassword,$MySQLDB);
            if (mysqli_connect_errno())
            {
            echo "Failed to connect to MySQL: " . mysqli_connect_error();
            }
            $result = mysqli_query($con,$SqlQuery);
            while($row = mysqli_fetch_array($result))
            {
                $reccount++;
                $id=$row['id'];
                $relevance1=$row['relevance1'];
                $relevance2=$row['relevance2'];
                $relevance3=$row['relevance3'];
                $final_relevance=$row['final_relevance'];
               
                $game_title=$row['game_title'];
                $overview=$row['overview'];
                $release_date=$row['release_date'];
                $platform=$row['platform'];
                $platform_name=$row['platform_name'];
                $publisher_name=$row['publisher_name'];
                $developer_name=$row['developer_name'];
               
                $coop=$row['coop'];
                $players=$row['players'];          
                $LOCAL_IMG_LINK='images/thumb/' . $row['LOCAL_IMG_LINK']; //Local Image
               
                $genres_id=$row['genres_id'];
                $genre_name=$row['genre_name'];
               
                $FinalTFScore=0;
                $tf=0;
               
                //Break search up into words
                $searchArry = explode(' ',strtolower($search));
               
                $BuildOutput = $BuildOutput .  $reccount . ' <b>Document:</b> ' . $game_title . '<br>';
                foreach($searchArry as $SearchArryVal)
                {                  
                    //Calculate TF
                    (float)$tfreq=TermCounter($SearchArryVal,$game_title);
                    (float)$Wordcount = WordCounter($game_title);          
                    (float)$tf = $tfreq / $Wordcount ;

                    //Calculate IDF
                    $IDFDocumentMatchCount=0;
                    $IDFDocumentMatchCount = IDFDocMatchCount($DocRecArray,$SearchArryVal);
                    if($IDFDocumentMatchCount!=0)
                        {
                            $TF_IDF=(float)$tf*log($SearchDocumentCount/$IDFDocumentMatchCount);
                        }else
                        {
                            $TF_IDF=0.001;
                        }
                    $BuildOutput = $BuildOutput .  'Term: <b>' . $SearchArryVal . '</b> Freq=<b>' . $tfreq . '</b> WordCount=<b>' . $Wordcount . '</b> TFScore=<b>' . $tf . '</b> Document Count=<b>' . $SearchDocumentCount . '</b> Documents Containing <b>' . $SearchArryVal . '</b>=<b>' . $IDFDocumentMatchCount .  '</b><br>';          
                   
                    $FinalTFScore = (float)$FinalTFScore  + $TF_IDF;
                }
                $BuildOutput = $BuildOutput .  '-------------------------------------------------------------------------------------------------------------------------------------------<br>';
                //var_dump($tf);
               
                //$FinalTFScore=$FinalTFScore*20;
                $FinalRankingScore=$FinalTFScore*$final_relevance;

                $BuildTable=$BuildTable . "<tr id=\"tablerec_$reccount\"><td>$reccount</td><td>$game_title</td><td>$platform_name</td><td>$relevance1</td><td>$relevance2</td><td>$relevance3</td><td>$final_relevance</td><td>" . (float)$FinalTFScore . "</td><td>$FinalRankingScore</td></tr>";              
            }
            mysqli_close($con);  
           
function TermCounter($Term,$DOC)
{
    $TermCount=0;
    $arrDocument = explode(' ',$DOC);  
    foreach ($arrDocument as $value)
    {
        if(strtolower($value)==strtolower($Term)){$TermCount++;}
    }
    return $TermCount;
   
}
function WordCounter($string)
{
    $tringarr = explode(' ',$string);
    //Custom adjustment to over come high scores on documents containing a single word.
    if(count($tringarr) ==1)
    {
        return 2;
    }
    else
    {
        return count($tringarr);
    }
}

function IDFDocMatchCount($DocumentArry,$Term)
{
    //Count how many documents contain a speciffic word
    $RecMatchCount=0;
    foreach ($DocumentArry as $DocumentArryVal)
    {
        $DocumentArryVal = strtolower($DocumentArryVal);
        $Term = strtolower($Term);
       
        if(strpos($DocumentArryVal, $Term) == false){$RecMatchCount=$RecMatchCount+0;}else{$RecMatchCount++;}
    }
   
    return $RecMatchCount;
}

?>
<body style="font-family:verdana;font-size:11px;">
    <form action="test.php" method="get">
        Search<br><input type="text" name="search" value="<?php echo $search;?>" size="50"><br>
    <input type="submit"><br>
    </form>
   
    <?php //echo $SqlQuery;?>
    <h4>Search Results</h4>  
    <table style="border-collapse: collapse; border: 1px solid black;font-family:verdana;font-size:11px;" border="bordered">
        <tr><td><b>#</td><td><b>game_title</td><td><b>platform</td><td><b>relevance1</td><td><b>relevance2</td><td><b>relevance3</td><td><b>Total Relevance</td><td><b>TF-IDF</td><td><b>Final Rank</td></tr>
        <?php echo $BuildTable;?>
    </table>
        <br>
        <h4>TF-IDF Process Output</h4>
        <?php echo $BuildOutput;?>
    </table>
</body>
 
  • Like
Reactions: gkm
Top