Monday, March 26, 2012

How to Architect Full-Text Search

John Kane recently provided this response to a question of using a pattern
match in FTS.
John Kane 3/16/2005 6:27 PM PST
From Discussion Group: sqlserver.programming
Subject: Re: CONTAINS to behave as LIKE %abc%
Date: 3/16/2005 6:27 PM PST
From: John Kane
Farhan & Dave,
Farhan, the leading "*" (asterisk) wildcard in the search condition is
ignored and this is by design for SQL Server 7.0, 2000 and 2005.
SQL FTS only supports a trailing "*" (asterisk) wildcard, i.e.. a wildcard
word-based suffix search, for example, a search for "book*" will find book,
books, booking & booked. A leading "*" (asterisk) wildcard is not supported
because unlike T-SQL LIKE, SQL FTS is a language-specific linguistic search
method, while LIKE is a grep or pattern search method. Would you want to
search on "*og" and find God and Log in the same results?
Regards,
John-- SQL Full Text Search Blog http://spaces.msn.com/members/jtkane/
This is the behavior I want in my application that currently employs a FTS
Catalog to index text I recognize from an OCR process.
I am finding that customers want to search for substrings that are contained
in the words that are recognized.
The recognized text is stored in a TEXT datatype column in a SQL table and
is indexed by FTS.
My question is, how to architect a solution that provides a quick response
to the search parameters but allows finding all pattern matches when a
wildcard is added to the leading and trailing end of a word?
For example, find all documents that contain the string "emb". It should
return december, november, september.
Therefore, I want to use code such as "LIKE %emb%" but it must work on a
table with over one million rows of text to be searched.
I don't believe you can index a text datatype column.
Because I am processing documents of unknown length, varchar columns are not
large enough and I will exceed their storage limit.
If I can't use FTS because it doesn't support a wildcard prefix, what do you
suggest?
Thanks
You can index text columns, but that isn't what you were asking. Like
you mentioned, you cannot do "LIKE" type searching with full text
searching. The only alternative is do write a function that manually
goes through your fields and counts up the instances of match. This
would be incredibly slow with a large amount of fields though. Another
option would be to create your "own" index.
Below is a function i found somewhere that will return a count of the
number of times a string appears in a field. Below that is an example
sql of how to call it.
/************************************************** ****************
*
*Description:Counts the Instances OF a String within a String
*AS well You can pass Patterns or wild cards TO be searched AND
counted.
*
*Author: Brad Skidmore
*Date: 4/6/2004
*
************************************************** ****************/
CREATE FUNCTION dbo.CountTextFrequency
(
@.TextString text,
@.SubString varchar(8000)
)
RETURNS INT
AS
BEGIN
DECLARE @.Count int --Count the instances OF @.SubString
DECLARE @.Pos int --Pos inside CURRENT Chunk OF @.TextString
DECLARE @.txtLenint --Len OF the Chunk
DECLARE @.txtPosint --start pos OF the CURRENT chunk
DECLARE @.MyTextString varchar(8000) --text data OF CURRENT chunk
SET @.txtLen = 8000--Set the MAX Len a varchar can hold (Chunk the
BLOB text)
SET @.txtPos = 1--Start at 1 pos OF the Blob text
SET @.Count =0--Set the Count OF Substring
--Get the first Chunck
SET @.MyTextString = SUBSTRING(@.TextString, @.txtPos, @.txtLen)
--While the latest chunk has SOME data COUNT the instances OF
@.SubString
WHILE DATALENGTH(@.MyTextString) > 0
BEGIN
SET @.Pos = PATINDEX('%' + @.SubString + '%', @.MyTextString)
WHILE @.Pos > 0
BEGIN
SET @.Count = @.Count + 1
IF DATALENGTH(@.SubString) > 1
BEGIN
SET @.MyTextString = STUFF(@.MyTextString, 1, @.Pos +
DATALENGTH(@.SubString)-1 ,'')
END
ELSE
BEGIN
SET @.MyTextString = STUFF(@.MyTextString, 1, @.Pos ,'')
END
SET @.Pos = PATINDEX('%' + @.SubString + '%', @.MyTextString)
END
--Get Subsequent Chuncks
SET @.txtPos = @.txtPos + @.txtLen
SET @.MyTextString = SUBSTRING(@.TextString, @.txtPos, @.txtLen)
END
--Return the COUNT OF @.SubString found in ALL the chunks
RETURN(@.Count)
END
Here is the query to use this function...it will return a ranking type
result for you.
Select KEY, SUM(dbo.CountTextFrequency(SearchField,'test') +
dbo.CountTextFrequency(SearchField,'string')) as Score into FROM
database.dbo.tableiwanttosearch SearchIndex where
(dbo.CountTextFrequency(SearchField,'test') > 0 ) and
(dbo.CountTextFrequency(SearchField,'string') > 0 ) group by KEY
This is all rough..I tried it on my site, but found it took too long to
return results...I need it to be very fast.
Let me know if this helped.
|||Unfortunately SQL FTS does not provide this functionality. You will need to
use a Like statement for this and this does not allow you to search binary
documents stored in columns of the image data type. Like can also be very
slow.
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Binder" <rgondzur@.hotmail.com> wrote in message
news:OTgGRCuaFHA.2128@.TK2MSFTNGP14.phx.gbl...
> John Kane recently provided this response to a question of using a pattern
> match in FTS.
> John Kane 3/16/2005 6:27 PM PST
> From Discussion Group: sqlserver.programming
> Subject: Re: CONTAINS to behave as LIKE %abc%
> Date: 3/16/2005 6:27 PM PST
> From: John Kane
> Farhan & Dave,
> Farhan, the leading "*" (asterisk) wildcard in the search condition is
> ignored and this is by design for SQL Server 7.0, 2000 and 2005.
> SQL FTS only supports a trailing "*" (asterisk) wildcard, i.e.. a wildcard
> word-based suffix search, for example, a search for "book*" will find
book,
> books, booking & booked. A leading "*" (asterisk) wildcard is not
supported
> because unlike T-SQL LIKE, SQL FTS is a language-specific linguistic
search
> method, while LIKE is a grep or pattern search method. Would you want to
> search on "*og" and find God and Log in the same results?
> Regards,
> John-- SQL Full Text Search Blog http://spaces.msn.com/members/jtkane/
>
> This is the behavior I want in my application that currently employs a FTS
> Catalog to index text I recognize from an OCR process.
> I am finding that customers want to search for substrings that are
contained
> in the words that are recognized.
> The recognized text is stored in a TEXT datatype column in a SQL table and
> is indexed by FTS.
> My question is, how to architect a solution that provides a quick response
> to the search parameters but allows finding all pattern matches when a
> wildcard is added to the leading and trailing end of a word?
> For example, find all documents that contain the string "emb". It should
> return december, november, september.
> Therefore, I want to use code such as "LIKE %emb%" but it must work on a
> table with over one million rows of text to be searched.
> I don't believe you can index a text datatype column.
> Because I am processing documents of unknown length, varchar columns are
not
> large enough and I will exceed their storage limit.
> If I can't use FTS because it doesn't support a wildcard prefix, what do
you
> suggest?
>
> Thanks
>
>
sql

No comments:

Post a Comment