Solr Standalone Analysers

21 Jul 2019 solr java

To search text efficiently and effectively, Solr/Lucene splits text into tokens (which are actually graphs) at both index time and query time. These tokens/graphs can be pre- and post-filtered to provide additional flexibility. This post, however, only covers standalone analysers, which are not chainable and cannot be pre- or post-filtered.

Before we start with the list of standalone analysers, note that they only work on Solr text fields whose field type is solr.TextField.

Solr schema

To set up a solr.TextField to use an analyser, we need to configure the field type inside the Solr schema. In the example below we have a field type named text_german that uses org.apache.lucene.analysis.de.GermanAnalyzer. As this is a standalone analyser, it is used for analysis at both index and query time.

<fieldType name="text_german" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer"/>
</fieldType>
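
A field type only takes effect once a field references it. As a minimal sketch, assuming a hypothetical field named content_de (the name and the indexed/stored settings are illustrative and not part of the schema above), a field using this type could be declared like this:

```xml
<!-- Hypothetical field: the name "content_de" and the attribute values are assumptions. -->
<!-- type="text_german" points at the fieldType defined above, so GermanAnalyzer -->
<!-- is applied both when documents are indexed and when queries are parsed. -->
<field name="content_de" type="text_german" indexed="true" stored="true"/>
```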

Standalone analysers

All standalone analysers ultimately extend org.apache.lucene.analysis.Analyzer, but for the most part the standalone analysers extend org.apache.lucene.analysis.StopwordAnalyzerBase.

Analyzer - An Analyzer builds TokenStreams, which analyze text.
- AnalyzerWrapper - Extension to Analyzer suitable for Analyzers which wrap other Analyzers.
  - ShingleAnalyzerWrapper - A ShingleAnalyzerWrapper wraps a ShingleFilter around another Analyzer.
- DutchAnalyzer - Analyzer for Dutch language.
- KeywordAnalyzer - “Tokenizes” the entire stream as a single token.
- MorfologikAnalyzer - org.apache.lucene.analysis.Analyzer using Morfologik library.
- SimpleAnalyzer - An Analyzer that filters LetterTokenizer with LowerCaseFilter.
- SmartChineseAnalyzer - SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text.
- StopwordAnalyzerBase - Base class for Analyzers that need to make use of stopword sets.
  - ArabicAnalyzer - Analyzer for Arabic.
  - ArmenianAnalyzer - Analyzer for Armenian.
  - BasqueAnalyzer - Analyzer for Basque.
  - BrazilianAnalyzer - Analyzer for Brazilian Portuguese language.
  - BulgarianAnalyzer - Analyzer for Bulgarian.
  - CatalanAnalyzer - Analyzer for Catalan.
  - CJKAnalyzer - An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
  - ClassicAnalyzer - Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
  - CzechAnalyzer - Analyzer for Czech language.
  - DanishAnalyzer - Analyzer for Danish.
  - EnglishAnalyzer - Analyzer for English.
  - FinnishAnalyzer - Analyzer for Finnish.
  - FrenchAnalyzer - Analyzer for French language.
  - GalicianAnalyzer - Analyzer for Galician.
  - GermanAnalyzer - Analyzer for German language.
  - GreekAnalyzer - Analyzer for the Greek language.
  - HindiAnalyzer - Analyzer for Hindi.
  - HungarianAnalyzer - Analyzer for Hungarian.
  - IndonesianAnalyzer - Analyzer for Indonesian (Bahasa).
  - IrishAnalyzer - Analyzer for Irish.
  - ItalianAnalyzer - Analyzer for Italian.
  - JapaneseAnalyzer - Analyzer for Japanese that uses morphological analysis.
  - LatvianAnalyzer - Analyzer for Latvian.
  - LithuanianAnalyzer - Analyzer for Lithuanian.
  - NorwegianAnalyzer - Analyzer for Norwegian.
  - PersianAnalyzer - Analyzer for Persian.
  - PolishAnalyzer - Analyzer for Polish.
  - PortugueseAnalyzer - Analyzer for Portuguese.
  - RomanianAnalyzer - Analyzer for Romanian.
  - RussianAnalyzer - Analyzer for Russian language.
  - SoraniAnalyzer - Analyzer for Sorani Kurdish.
  - SpanishAnalyzer - Analyzer for Spanish.
  - StandardAnalyzer - Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
  - StopAnalyzer - Filters LetterTokenizer with LowerCaseFilter and StopFilter.
  - SwedishAnalyzer - Analyzer for Swedish.
  - ThaiAnalyzer - Analyzer for Thai language.
  - TurkishAnalyzer - Analyzer for Turkish.
  - UAX29URLEmailAnalyzer - Filters org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer with org.apache.lucene.analysis.standard.StandardFilter, org.apache.lucene.analysis.LowerCaseFilter and org.apache.lucene.analysis.StopFilter, using a list of English stop words.
  - UkrainianMorfologikAnalyzer - A dictionary-based Analyzer for Ukrainian.
- UnicodeWhitespaceAnalyzer - An Analyzer that uses UnicodeWhitespaceTokenizer.
- WhitespaceAnalyzer - An Analyzer that uses WhitespaceTokenizer.
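
Any analyser in the list above can be wired into a field type in the same way as the German example, by giving its fully qualified class name. The sketch below is only illustrative: the field type names text_english and text_exact are made up for this example, while the class names are the Lucene packages of EnglishAnalyzer and KeywordAnalyzer. Note that some of the listed analysers (for example MorfologikAnalyzer, SmartChineseAnalyzer and JapaneseAnalyzer) ship in separate Lucene analysis modules, so their jars may need to be added to Solr's classpath before they can be used.

```xml
<!-- Hypothetical field type names; only the analyzer class names come from Lucene. -->

<!-- English text handled by the standalone EnglishAnalyzer. -->
<fieldType name="text_english" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.en.EnglishAnalyzer"/>
</fieldType>

<!-- KeywordAnalyzer keeps the whole value as a single token, -->
<!-- which suits exact-match lookups such as identifiers. -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
</fieldType>
```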