[ic] weighted search result sorting
Kevin Walsh
interchange-users@icdevgroup.org
Thu Jan 2 17:23:00 2003
Paul Jordan [paul@gishnetwork.com] wrote:
>
> In recent testing of my search engine I realized that while, yes it DOES
> return a very good result set, it poorly sorts them. We have a content
> site, that for example sells images. So for example:
>
> sku keywords
> sku123 ocean, island, sky, trees, water
> sku124 sky, clouds, blue, day
>
> Lets say I have thousands similar to this. The problem arises when
> someone searches for the term 'sky'. It will pull both results from
> above, but if I sort by sku it will show the pictures of the island with
> water and sky, or any number of picture with sky in it WILL appear
> BEFORE a simple brilliant SKY by itself.... which is not good, if left
> to sorting a field value.
>
> I have been thinking of ways to "weight" the result set. I am not an
> expert on efficiency nor databases. I am using Mysql, but NOT an SQL
> query because I am doing full text searches.
>
> A pseudo idea would be like:
>
> sku keywords
> sku123 ocean_7, island_9, sky_5, trees_4, water_5
> sku124 sky_10, clouds_10, blue_3, day_2
>
> I have no idea if this is possible but in the above it is assumed that
> with substring matching turned on, 'sky' will still be a HIT for both,
> then maybe create some custom tf=? or method of sorting based on the
> numeric TOTALs of the corresponding _'n' with regards to the words
> matched by the users search spec.
>
> So now with the above, a search for 'sky' will still return both, but
> the first one visible will be sku124 (because sky=10) and for the other
> (sky=5)
>
> But if someone searched for 'sky ocean' then both would still be
> returned but sku123 will be first because (sky+ocean=12) and the other
> sku is (sky=10)
>
> I still want to return both, because a Graphic artists can just take the
> sky from one and the ocean from another, so both are relevant. I know I
> know this is starting to sound terribly inefficient :) but the but any
> normal tf=?,?,? will simply not work well at all for us.
>
I'd have a table along the lines of the following:
sku keyword weight
------- --------------- --------
sku123 ocean 7
sku123 island 9
sku123 sky 5
sku123 trees 4
sku123 water 5
sku124 sky 10
sku124 clouds 10
sku124 blue 3
sku124 day 2
PRIMARY INDEX: sku, keyword
NON-UNIQUE INDEX: keyword
The following (untested) SQL queries should then work as follows:
Query:
SELECT sku, SUM(weight) AS skuweight
FROM keywords
WHERE keyword IN ('sky')
GROUP BY sku
ORDER BY skuweight DESC
Results:
sku skuweight
------- ---------
sku124 10
sku123 5
Query:
SELECT sku, SUM(weight) AS skuweight
FROM keywords
WHERE keyword IN ('sky','ocean')
GROUP BY sku
ORDER BY skuweight DESC
Results:
sku skuweight
------- ---------
sku123 12
sku124 10
Or that's the theory, at least. :-)
You'd pass the keywords into a page which would split them into
individual lower-case words for use in the WHERE clause. If you
wanted to allow substring matches, you could have multiple "LIKE"
lines instead of one large "IN" line:
User types "sk oce" and the following is generated:
SELECT sku, SUM(weight) AS skuweight
FROM keywords
WHERE keyword LIKE '%sk%'
OR keyword LIKE '%oce%'
GROUP BY sku
ORDER BY skuweight DESC
Results:
sku skuweight
------- ---------
sku123 12
sku124 10
Once you have your SQL query, you'd execute it with the [query] tag
and display the results in any way you see fit.
--
_/ _/ _/_/_/_/ _/ _/ _/_/_/ _/ _/
_/_/_/ _/_/ _/ _/ _/ _/_/ _/ K e v i n W a l s h
_/ _/ _/ _/ _/ _/ _/ _/_/ kevin@cursor.biz
_/ _/ _/_/_/_/ _/ _/_/_/ _/ _/