Solr Fuzzy Search for similar words

Asked by on 2012-03-26T23:57:29-04:00
I am trying to do a fuzzy search for "jahngir" ~ 0.2, but it does not return any results. My index has records with the data "JAHANGIR RAHMAN MD". If I search with the exact word, "jahangir" ~ 0.2, it works. Can someone please help me figure out what I am doing wrong? I have spent a lot of time trying to understand how Solr fuzzy search works. Any links that explain Solr fuzzy search would be helpful. Below is the text field type that I am using for indexing. Thanks in advance.

 <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
 <analyzer type="index">
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <!-- in this example, we will only use synonyms at query time
 <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
 -->
 <!-- Case insensitive stop word removal.
 add enablePositionIncrements=true in both the index and query
 analyzers to leave a 'gap' for more accurate phrase queries.
 -->
 <filter class="solr.StopFilterFactory"
 ignoreCase="true"
 words="stopwords.txt"
 enablePositionIncrements="true"
 />
 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
 <filter class="solr.PorterStemFilterFactory"/>
 <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
 <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
 <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
 </analyzer>
 <analyzer type="query">
 <tokenizer class="solr.WhitespaceTokenizerFactory"/>
 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
 <filter class="solr.StopFilterFactory"
 ignoreCase="true"
 words="stopwords.txt"
 enablePositionIncrements="true"
 />
 <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
 <filter class="solr.PorterStemFilterFactory"/>
 <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
 <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
 <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
 </analyzer>
</fieldType>


Here is the configuration that worked for me after the response. Thanks!

<!-- Modified to fit fuzzy queries --> 
 <fieldType name="text_exact_fuzzy" class="solr.TextField" omitNorms="false">
 <analyzer type="index">
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.StandardFilterFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <analyzer type="query">
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.StandardFilterFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 </fieldType>

Best Answer

Answered by on 2012-03-28T16:58:25-04:00
No, you do not need to enable stemming, and the use of a stemmer may be causing the problem.

You have far too many filters on the text field. You are converting a word to a Porter stem, which often is not a real word, then taking the phonetic key of that. The surface word will rarely match the phonetic key stored in the index. The phonetic key will be very different from the original word.

Use the analyzer page in the admin UI to see how terms are processed.

I recommend splitting the kinds of approximate match into different fields.

  • text_exact: lowercase, that's about it
  • text_stem: lowercase and stem
  • text_phonetic: lowercase and double metaphone, do not stem
Use fuzzy matching with text_exact, because it handles typing errors. Do not use fuzzy against the other fields.
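
As a rough sketch of the three-way split described above, the schema.xml fragment below defines one field type per kind of match and copies a single source field into the other two. The field names (name_exact, name_stem, name_phonetic) are made up for illustration, and the filter chains are one plausible reading of the list, not a tested configuration.

```xml
<!-- Sketch only: field names are illustrative assumptions. -->

<!-- Goes with the other fieldType definitions in schema.xml. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase only, so fuzzy queries compare against real words. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No stemmer here: the phonetic key is computed from the whole word. -->
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

<!-- Goes with the other field definitions in schema.xml. -->
<field name="name_exact" type="text_exact" indexed="true" stored="true"/>
<field name="name_stem" type="text_stem" indexed="true" stored="false"/>
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>

<!-- copyField passes the raw input on, so each destination runs its own analyzer. -->
<copyField source="name_exact" dest="name_stem"/>
<copyField source="name_exact" dest="name_phonetic"/>
```

With a layout like this, a fuzzy query would be written against the exact field only, for example name_exact:jahngir~ with whatever similarity value suits the data.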

You can weight these fields differently. The exact match is a higher-quality match than the rest, so it can have the biggest weight. The stemmed match is a better match than the phonetic one, so its weight should be smaller than exact but bigger than phonetic.
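
One way such weighting might be expressed is through the qf parameter of the (e)dismax query parser in solrconfig.xml. This is only a sketch: the handler name /names, the field names name_exact, name_stem and name_phonetic, and the boost values 3, 2 and 1 are all assumptions chosen to illustrate the ordering exact > stemmed > phonetic.

```xml
<!-- Sketch only: handler name, field names and boost values are assumptions. -->
<requestHandler name="/names" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Exact matches count most, stemmed matches next, phonetic matches least. -->
    <str name="qf">name_exact^3 name_stem^2 name_phonetic^1</str>
  </lst>
</requestHandler>
<!--
  Caveat: a bare fuzzy term in q is likely to be applied to every field in qf.
  To keep fuzzy matching on the exact field only, an explicit fielded clause
  such as name_exact:jahngir~ is one option.
-->
```

The absolute boost values matter less than their order, and they usually need tuning against real queries.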

All Answers

Answered by on 2012-03-27T08:16:26-04:00
In order to get fuzzy searches to work, you will need to enable the correct stemming and/or filter factory for your desired language. Please see the LanguageAnalysis topic (http://wiki.apache.org/solr/LanguageAnalysis) on the Solr wiki (http://wiki.apache.org/solr/) for more details.

Edit: Please see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#High_Level_Concepts for more details on the different ways of indexing your data and how this impacts the search of your data.