Solr Standalone Analysers
To search text efficiently and effectively, Solr/Lucene splits it into tokens (which are actually graphs) at index time as well as at query time. These tokens/graphs can be both pre- and post-filtered to provide additional flexibility. This post, however, only covers standalone analysers, which are not chainable and cannot be pre- or post-filtered.
Before we start with the list of standalone analysers, note that these analysers only work on Solr text fields, that is, fields whose field type is solr.TextField.
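For contrast, a chained analyser, which this post does not cover, might be declared roughly as follows. This is only a sketch: the field type name text_chained is made up for illustration, and the particular char filter, tokenizer and token filter are just one plausible combination.

```xml
<!-- Hypothetical chained field type, shown only for contrast -->
<fieldType name="text_chained" class="solr.TextField">
  <analyzer>
    <!-- pre-filter: runs on the raw characters before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- splits the character stream into tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- post-filter: runs on the tokens produced by the tokenizer -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A standalone analyser replaces this whole chain with a single class, so none of the individual steps can be swapped out from the schema.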
Solr schema
To set up a solr.TextField to use an analyser, we need to configure the field type inside the Solr schema. In the example below we have a field type named text_german that uses the org.apache.lucene.analysis.de.GermanAnalyzer. As this is a standalone analyser, it is used for analysis at both index and query time.
<fieldType name="text_german" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer"/>
</fieldType>
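To put the field type to work, a field has to reference it by name. A minimal sketch, assuming a hypothetical field called description_de (the field name and the indexed/stored settings are illustrative and not taken from any real schema):

```xml
<!-- Hypothetical field that uses the text_german field type defined above -->
<field name="description_de" type="text_german" indexed="true" stored="true"/>
```

Because the analyser is declared on the field type, every field of type text_german gets the same analysis at index and query time.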
Standalone analysers
All standalone analysers ultimately extend org.apache.lucene.analysis.Analyzer, but most of them do so by extending org.apache.lucene.analysis.StopwordAnalyzerBase.
- Analyzer - An Analyzer builds TokenStreams, which analyze text.
- AnalyzerWrapper - Extension to Analyzer suitable for Analyzers which wrap other Analyzers.
- ShingleAnalyzerWrapper - A ShingleAnalyzerWrapper wraps a ShingleFilter around another Analyzer.
- DutchAnalyzer - Analyzer for Dutch language.
- KeywordAnalyzer - “Tokenizes” the entire stream as a single token.
- MorfologikAnalyzer - Analyzer using the Morfologik library.
- SimpleAnalyzer - An Analyzer that filters LetterTokenizer with LowerCaseFilter.
- SmartChineseAnalyzer - SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text.
- StopwordAnalyzerBase - Base class for Analyzers that need to make use of stopword sets.
- ArabicAnalyzer - Analyzer for Arabic.
- ArmenianAnalyzer - Analyzer for Armenian.
- BasqueAnalyzer - Analyzer for Basque.
- BrazilianAnalyzer - Analyzer for Brazilian Portuguese language.
- BulgarianAnalyzer - Analyzer for Bulgarian.
- CatalanAnalyzer - Analyzer for Catalan.
- CJKAnalyzer - An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
- ClassicAnalyzer - Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
- CzechAnalyzer - Analyzer for Czech language.
- DanishAnalyzer - Analyzer for Danish.
- EnglishAnalyzer - Analyzer for English.
- FinnishAnalyzer - Analyzer for Finnish.
- FrenchAnalyzer - Analyzer for French language.
- GalicianAnalyzer - Analyzer for Galician.
- GermanAnalyzer - Analyzer for German language.
- GreekAnalyzer - Analyzer for the Greek language.
- HindiAnalyzer - Analyzer for Hindi.
- HungarianAnalyzer - Analyzer for Hungarian.
- IndonesianAnalyzer - Analyzer for Indonesian (Bahasa).
- IrishAnalyzer - Analyzer for Irish.
- ItalianAnalyzer - Analyzer for Italian.
- JapaneseAnalyzer - Analyzer for Japanese that uses morphological analysis.
- LatvianAnalyzer - Analyzer for Latvian.
- LithuanianAnalyzer - Analyzer for Lithuanian.
- NorwegianAnalyzer - Analyzer for Norwegian.
- PersianAnalyzer - Analyzer for Persian.
- PolishAnalyzer - Analyzer for Polish.
- PortugueseAnalyzer - Analyzer for Portuguese.
- RomanianAnalyzer - Analyzer for Romanian.
- RussianAnalyzer - Analyzer for Russian language.
- SoraniAnalyzer - Analyzer for Sorani Kurdish.
- SpanishAnalyzer - Analyzer for Spanish.
- StandardAnalyzer - Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
- StopAnalyzer - Filters LetterTokenizer with LowerCaseFilter and StopFilter.
- SwedishAnalyzer - Analyzer for Swedish.
- ThaiAnalyzer - Analyzer for Thai language.
- TurkishAnalyzer - Analyzer for Turkish.
- UAX29URLEmailAnalyzer - Filters org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer with org.apache.lucene.analysis.standard.StandardFilter, org.apache.lucene.analysis.LowerCaseFilter and org.apache.lucene.analysis.StopFilter, using a list of English stop words.
- UkrainianMorfologikAnalyzer - A dictionary-based Analyzer for Ukrainian.
- UnicodeWhitespaceAnalyzer - An Analyzer that uses UnicodeWhitespaceTokenizer.
- WhitespaceAnalyzer - An Analyzer that uses WhitespaceTokenizer.
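Concrete analysers in the list can be wired up the same way as the German example, by putting the fully qualified class name in the class attribute. The sketch below uses two classes from the list. The field type names are invented for illustration, and the package names should be checked against the Lucene version in use, since classes have moved between packages across releases.

```xml
<!-- Hypothetical field type: keeps the whole value as a single token -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
</fieldType>

<!-- Hypothetical field type: English-specific analysis -->
<fieldType name="text_english" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.en.EnglishAnalyzer"/>
</fieldType>
```

Declared this way, an analyser runs with its default settings, which for the StopwordAnalyzerBase subclasses should mean the stopword set that ships with the class.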