Full-text search using text indexes
Text indexes in ClickHouse (also known as "inverted indexes") provide fast full-text search capabilities on string data. The index maps each token in the column to the rows which contain that token. Tokens are generated by a process called tokenization. For example, ClickHouse by default tokenizes the English sentence "All cats like mice." as ["All", "cats", "like", "mice"] (note that the trailing dot is ignored). More advanced tokenizers are available, for example for log data.
Creating a Text Index
To create a text index, first enable the corresponding experimental setting:
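For example, assuming the setting is named `allow_experimental_full_text_index` (as in recent ClickHouse versions; the exact name may differ in yours):

```sql
SET allow_experimental_full_text_index = 1;
```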
A text index can be defined on a String or FixedString column using the following syntax:
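A minimal sketch (table and column names are illustrative; the exact `text` index syntax may vary slightly across versions):

```sql
CREATE TABLE tab
(
    key UInt64,
    str String,
    -- text index with the default tokenizer on the string column
    INDEX text_idx(str) TYPE text(tokenizer = 'default')
)
ENGINE = MergeTree
ORDER BY key;
```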
The `tokenizer` argument specifies the tokenizer:
- `default` splits strings along non-alphanumeric ASCII characters.
- `ngram` splits strings into equally large n-grams.
- `split` splits strings along certain user-defined separator strings.
- `no_op` performs no tokenization, i.e. every row value is a token.
To test how the tokenizers split the input string, you can use ClickHouse's `tokens` function:
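For instance (the exact signature of `tokens` may vary by version; the input string here is illustrative):

```sql
-- split along non-alphanumeric ASCII characters
SELECT tokens('abc, def! ghi', 'default');
-- returns ['abc', 'def', 'ghi']
```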
If you choose the `ngram` tokenizer, you can set the ngram length using the (optional) parameter `ngram_size`. If `ngram_size` is not specified, the default ngram size is 3. The smallest and largest possible ngram sizes are 2 and 8.
If you choose the `split` tokenizer, you can set the separators using the (optional) parameter `separators`. The parameter expects a list of strings, for example `separators = [', ', '; ', '\n', '\\']`. Note that each string can consist of multiple characters (`', '` in the example). If parameter `separators` is not specified, a single whitespace `[' ']` is used by default.
The `split` tokenizer applies the split separators left-to-right. This can create ambiguities. For example, the separator strings `['%21', '%']` will cause `%21abc` to be tokenized as `['abc']`, whereas switching the order of the separator strings to `['%', '%21']` will output `['21abc']`. In most cases, you want matching to prefer longer separators. This can generally be achieved by passing the separator strings in order of descending length. If the separator strings happen to form a prefix code, they can be passed in arbitrary order.
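A sketch of a `split` tokenizer definition, with the separators ordered by descending length as recommended above (table and column names are illustrative):

```sql
CREATE TABLE log_tab
(
    line String,
    -- custom separators, longest first to avoid prefix ambiguities
    INDEX line_idx(line) TYPE text(tokenizer = 'split', separators = ['; ', ', ', ' '])
)
ENGINE = MergeTree
ORDER BY tuple();
```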
Text indexes in ClickHouse are implemented as secondary indexes. However, unlike other skipping indexes, text indexes have a default index GRANULARITY of 64. This value has been chosen empirically; it provides a good trade-off between speed and index size for most use cases. Advanced users can specify a different index granularity (we do not recommend this).
Advanced parameters
The default values of the following advanced parameters will work well in virtually all situations. We do not recommend changing them.
- Optional parameter `dictionary_block_size` (default: 128) specifies the size of dictionary blocks in rows.
- Optional parameter `max_cardinality_for_embedded_postings` (default: 16) specifies the cardinality threshold below which posting lists should be embedded into dictionary blocks.
- Optional parameter `bloom_filter_false_positive_rate` (default: 0.1) specifies the false-positive rate of the dictionary bloom filter.
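For illustration, an index definition spelling out all advanced parameters with their default values (a sketch; the parameter names follow this page, but verify the exact syntax for your version):

```sql
INDEX idx(str) TYPE text(
    tokenizer = 'default',
    dictionary_block_size = 128,
    max_cardinality_for_embedded_postings = 16,
    bloom_filter_false_positive_rate = 0.1
)
```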
Text indexes can be added to or removed from a column after the table has been created:
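For example (assuming the hypothetical table `tab` with String column `str`):

```sql
ALTER TABLE tab ADD INDEX text_idx(str) TYPE text(tokenizer = 'default');
ALTER TABLE tab MATERIALIZE INDEX text_idx;  -- build the index for existing data

ALTER TABLE tab DROP INDEX text_idx;
```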
Using a Text Index
Using a text index in SELECT queries is straightforward, as common string search functions leverage the index automatically.
Supported functions
The text index can be used if text functions are used in the `WHERE` clause of a SELECT query:
`=` and `!=`

`=` (equals) and `!=` (notEquals) match the entire given search term.
Example:
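A sketch (the table `tab` and column `str` are hypothetical):

```sql
SELECT count() FROM tab WHERE str = 'Hello, ClickHouse!';
```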
The text index supports `=` and `!=`, yet equality and inequality search only make sense with the `no_op` tokenizer (which causes the index to store entire row values).
`IN` and `NOT IN`

`IN` (in) and `NOT IN` (notIn) are similar to functions `equals` and `notEquals` but they match all (`IN`) or none (`NOT IN`) of the search terms.
Example:
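A sketch (hypothetical table and values):

```sql
SELECT count() FROM tab WHERE str IN ('Hello', 'World');
```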
The same restrictions as for `=` and `!=` apply, i.e. `IN` and `NOT IN` only make sense in conjunction with the `no_op` tokenizer.
`LIKE`, `NOT LIKE` and `match`

These functions currently use the text index for filtering only if the index tokenizer is either `default` or `ngram`. In order to use `LIKE` (like), `NOT LIKE` (notLike), and the `match` function with text indexes, ClickHouse must be able to extract complete tokens from the search term.
Example:
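A sketch of such a pattern (hypothetical table and column):

```sql
SELECT count() FROM tab WHERE str LIKE '%support%';
```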
`support` in the example could match support, supports, supporting, etc. This kind of query is a substring query and it cannot be sped up by a text index.
To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way:
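A sketch of the rewritten pattern (hypothetical table and column):

```sql
SELECT count() FROM tab WHERE str LIKE '% support %';
```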
The spaces left and right of `support` make sure that the term can be extracted as a token.
`startsWith` and `endsWith`

Similar to `LIKE`, functions `startsWith` and `endsWith` can only use a text index if complete tokens can be extracted from the search term.
Example:
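A sketch (hypothetical table and column):

```sql
SELECT count() FROM tab WHERE startsWith(str, 'clickhouse support');
```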
In the example, only `clickhouse` is considered a token. `support` is not a token because it can match support, supports, supporting, etc.
To find all rows that start with `clickhouse supports`, please end the search pattern with a trailing space:
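For example (hypothetical table and column):

```sql
SELECT count() FROM tab WHERE startsWith(str, 'clickhouse supports ');
```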
Similarly, `endsWith` should be used with a leading space:
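For example (hypothetical table, column, and search term):

```sql
SELECT count() FROM tab WHERE endsWith(str, ' olap');
```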
`hasToken` and `hasTokenOrNull`

Functions `hasToken` and `hasTokenOrNull` match against a single given token. Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token).
Example:
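A sketch (hypothetical table and column):

```sql
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse');
```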
Functions `hasToken` and `hasTokenOrNull` are the most performant functions to use with the text index.
`searchAny` and `searchAll`

Functions `searchAny` and `searchAll` match against at least one or all of the given tokens, respectively. Like `hasToken`, no tokenization of the search terms takes place.
Example:
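A sketch (hypothetical table, column, and search terms):

```sql
-- rows containing at least one of the tokens
SELECT count() FROM tab WHERE searchAny(str, ['clickhouse', 'olap']);
-- rows containing all of the tokens
SELECT count() FROM tab WHERE searchAll(str, ['clickhouse', 'olap']);
```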
Implementation
Index layout
Each text index consists of two (abstract) data structures:
- a dictionary which maps each token to a postings list, and
- a set of postings lists, each representing a set of row numbers.
Since a text index is a skip index, these data structures exist logically per index granule.
During index creation, three files are created (per part):
Dictionary blocks file (.dct)
The tokens in an index granule are sorted and stored in dictionary blocks of 128 tokens each (the block size is configurable by parameter `dictionary_block_size`). A dictionary blocks file (.dct) consists of all the dictionary blocks of all index granules in a part.
Index granules file (.idx)
The index granules file contains, for each dictionary block, the block's first token, its relative offset in the dictionary blocks file, and a bloom filter for all tokens in the block. This sparse index structure is similar to ClickHouse's sparse primary key index. The bloom filter allows skipping dictionary blocks early if the searched token is not contained in a block.
Postings lists file (.pst)
The posting lists for all tokens are laid out sequentially in the postings list file.
To save space while still allowing fast intersection and union operations, the posting lists are stored as roaring bitmaps.
If the cardinality of a posting list is less than 16 (configurable by parameter `max_cardinality_for_embedded_postings`), it is embedded into the dictionary.
Direct read
Certain types of text queries can be sped up significantly by an optimization called "direct read". More specifically, the optimization can be applied if the SELECT query does not project from the text column.
Example:
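A sketch (hypothetical table and column): `count()` does not read the indexed text column itself, so direct read can apply.

```sql
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse');
```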
Supported functions
Example: Hackernews dataset
Let's look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments on the popular Hacker News website. Here is the table without a text index:
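A simplified, illustrative schema (only the `comment` column matters for this example; the real dataset has more columns):

```sql
CREATE TABLE hackernews
(
    id UInt64,
    author String,
    time DateTime,
    comment String
)
ENGINE = MergeTree
ORDER BY id;
```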
The 28.7M rows are in a Parquet file in S3 - let's insert them into the `hackernews` table:
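A sketch of the insert; the actual S3 URL is omitted here, so substitute the dataset's real location:

```sql
INSERT INTO hackernews
SELECT * FROM s3('https://<bucket>/<path>/hackernews.parquet', 'Parquet');
```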
Consider the following simple search for the term `ClickHouse` (and its varied upper and lower cases) in the `comment` column:
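One way to express such a case-insensitive search (a sketch; the original query may have been written differently):

```sql
SELECT count() FROM hackernews WHERE hasToken(lower(comment), 'clickhouse');
```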
Notice it takes 3 seconds to execute the query:
We will use `ALTER TABLE` to add a text index on the lowercase of the `comment` column, then materialize it (which can take a while - wait for it to materialize):
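For example (the index name is illustrative):

```sql
ALTER TABLE hackernews ADD INDEX comment_idx(lower(comment)) TYPE text(tokenizer = 'default');
ALTER TABLE hackernews MATERIALIZE INDEX comment_idx;
```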
We run the same query...
...and notice the query executes 4x faster:
We can also search for one or all of multiple terms, i.e., disjunctions or conjunctions:
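A sketch using `searchAny` and `searchAll` (the search terms are illustrative):

```sql
-- disjunction: rows containing at least one of the terms
SELECT count() FROM hackernews WHERE searchAny(lower(comment), ['oltp', 'olap']);
-- conjunction: rows containing all of the terms
SELECT count() FROM hackernews WHERE searchAll(lower(comment), ['oltp', 'olap']);
```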