[ic] weighted search result sorting

Paul Jordan interchange-users@icdevgroup.org
Thu Jan 2 15:35:59 2003


Hi List

In recent testing of my search engine I realized that while it DOES
return a very good result set, it sorts the results poorly. We have a
content site that sells images. For example:

sku		keywords
sku123	ocean, island, sky, trees, water
sku124	sky, clouds, blue, day

Let's say I have thousands of records similar to these. The problem
arises when someone searches for the term 'sky'. The search will pull
both results above, but if I sort by sku, the picture of the island with
water and sky (or any number of pictures that merely have sky in them)
WILL appear BEFORE a simple, brilliant SKY by itself. That is not good,
and it is what happens when the order is left to sorting on a field
value.

I have been thinking of ways to "weight" the result set. I am not an
expert on efficiency or databases. I am using MySQL, but NOT an SQL
query, because I am doing full text searches.

A pseudo idea would be like:

sku		keywords
sku123	ocean_7, island_9, sky_5, trees_4, water_5
sku124	sky_10, clouds_10, blue_3, day_2

I have no idea if this is possible, but the above assumes that with
substring matching turned on, 'sky' will still be a HIT for both. Then
maybe I could create some custom tf=? or other method of sorting based
on the numeric TOTALs of the corresponding _'n' suffixes for the words
matched by the user's search spec.

So now with the above, a search for 'sky' will still return both, but
the first one visible will be sku124 (because sky=10), followed by
sku123 (sky=5).

If someone searched for 'sky ocean', both would still be returned, but
sku123 would come first (sky+ocean=12), ahead of sku124 (sky=10).
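
To make the arithmetic concrete, here is a minimal sketch of that scoring
in Python. It is not Interchange code, and the function names
(parse_weighted_keywords, rank) are made up for illustration. It assumes
the keywords field really is stored as comma-separated word_n pairs, and
it matches whole words only rather than substrings.

```python
def parse_weighted_keywords(field):
    """Turn 'sky_10, clouds_10' into {'sky': 10, 'clouds': 10}."""
    weights = {}
    for token in field.split(","):
        token = token.strip()
        if not token:
            continue
        word, sep, weight = token.rpartition("_")
        if not sep or not weight.isdigit():
            # Assumption: a keyword without a numeric suffix counts as weight 1.
            word, weight = token, "1"
        weights[word.lower()] = int(weight)
    return weights


def rank(records, query):
    """Return (sku, score) pairs, highest total weight first."""
    terms = query.lower().split()
    scored = []
    for sku, field in records.items():
        weights = parse_weighted_keywords(field)
        # Score = sum of the weights of the search terms this record carries.
        score = sum(weights.get(term, 0) for term in terms)
        if score > 0:  # keep every record that matched at least one term
            scored.append((score, sku))
    # Highest score first; ties fall back to sku order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(sku, score) for score, sku in scored]


records = {
    "sku123": "ocean_7, island_9, sky_5, trees_4, water_5",
    "sku124": "sky_10, clouds_10, blue_3, day_2",
}

print(rank(records, "sky"))        # [('sku124', 10), ('sku123', 5)]
print(rank(records, "sky ocean"))  # [('sku123', 12), ('sku124', 10)]
```

Both records come back in each case; only the order changes.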

I still want to return both, because a graphic artist can just take the
sky from one and the ocean from another, so both are relevant. I know, I
know, this is starting to sound terribly inefficient :) but any normal
tf=?,?,? will simply not work well at all for us.

As of now I have a general "collection" form to gather the user's search
terms, then on my results page I separate out all the search terms and
do a nice juicy in-page co=1 search. So basically I do have all the
search terms separated out at one point, if that helps.

Any ideas on how to go about this? If any consultants have an idea but
feel it is way too complicated to share, or for me to handle, then
please contact me off-list with the idea, and if it is suitable for what
I am doing, we can work out arrangements. paul@gishnetwork.com

If you have any advice (methods) for me to look into, I would be
grateful. I like trying to do this on my own, but realize this may be a
complicated one.... or not :)

Another idea may be to base it on how many times the word appears in the
record (I don't like this one as it can get ugly). Obviously any method
will require a competent user inputting the database info; I do not
think there is any escaping that. I think I am pretty safe in assuming
any solution will require some sort of post-search (Perl) sorting
facility... correct?
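
For comparison, the occurrence-count idea could be sketched as a
post-search sort like the one below. This is Python rather than Perl, the
function names are hypothetical, and the sample descriptions are invented
purely for illustration (they are not real records from our site).

```python
import re


def count_hits(text, terms):
    """Count how many times any of the search terms appears as a whole word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sum(words.count(term) for term in terms)


def sort_by_occurrences(records, query):
    """Return matching skus, most occurrences of the search terms first."""
    terms = query.lower().split()
    hits = [(count_hits(text, terms), sku) for sku, text in records.items()]
    hits = [hit for hit in hits if hit[0] > 0]  # drop records with no match
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [sku for _, sku in hits]


# Invented sample data: free-text descriptions instead of weighted keywords.
sample = {
    "sku123": "ocean island sky trees water",
    "sku124": "sky clouds blue sky day sky",
}

print(sort_by_occurrences(sample, "sky"))
```

The weakness is visible even here: the ranking depends on how often
whoever entered the data happened to repeat a word, not on how central
that word is to the image.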

Thanks in advance.

Paul