Groonga - An open-source fulltext search engine and column store

7.11. Scorer

7.11.1. Summary

Groonga has a scorer module that customizes the score function. A score function computes the score of a matched record. The default score function uses the number of times the search terms appear in the record. This is known as TF (term frequency).

TF is a fast score function, but it is not suitable for the following cases:

  • The search query contains one or more frequently appearing words such as "the" and "a".
  • The document repeats the same keyword many times, such as "They are keyword, keyword, keyword ... and keyword". Search engine spammers may use this technique.

Other score functions can handle these cases. For example, TF-IDF (term frequency-inverse document frequency) handles the first case, and Okapi BM25 handles the second case. However, they are slower than TF.
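
To make the difference concrete, here is a minimal Python sketch of plain TF and of the textbook TF-IDF formula (tf × log(N / df)). It is an illustration only: the whitespace tokenizer, the function names and the three-document corpus are assumptions made for this example, and Groonga's scorer_tf_idf may compute its score differently.

```python
import math

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def tf(term, document):
    # Term frequency: how many times the term appears in the document.
    return tokenize(document).count(term)

def idf(term, documents):
    # Inverse document frequency: log(N / df), where df is the number of
    # documents that contain the term. Returns 0.0 if no document does.
    df = sum(1 for document in documents if term in tokenize(document))
    if df == 0:
        return 0.0
    return math.log(len(documents) / df)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

documents = [
    "the engine is fast",
    "the engine is easy",
    "the the the library",
]

# "the" appears in every document, so its IDF is log(3 / 3) = 0 and
# TF-IDF ignores it, while plain TF rewards the repetition.
print(tf("the", documents[2]))                 # 3
print(tf_idf("the", documents[2], documents))  # 0.0
# "library" appears in only one document, so it keeps a positive score.
print(tf_idf("library", documents[2], documents) > 0)  # True
```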

Groonga provides a TF-IDF based scorer as scorer_tf_idf but doesn't provide an Okapi BM25 based scorer yet.

You don't need to solve scoring with the score function alone. A score function depends heavily on the search query. You may also be able to use metadata of the matched record.

For example, Google uses PageRank for scoring. You may be able to use the data type ("title" data is more important than "memo" data), tags, geolocation and so on.

Don't think about scoring only in terms of the score function.
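
As a rough illustration of combining a text score with record metadata, the Python sketch below multiplies a text score by a per-data-type weight. The weights, the field names and the final_score helper are all hypothetical and are not part of Groonga.

```python
# Hypothetical weights: "title" data counts more than "memo" data.
TYPE_WEIGHTS = {"title": 3.0, "memo": 1.0}

def final_score(text_score, data_type):
    # Scale the text score by the weight of the record's data type.
    # Unknown data types get a neutral weight of 1.0.
    return text_score * TYPE_WEIGHTS.get(data_type, 1.0)

records = [
    {"id": "a", "text_score": 4, "data_type": "memo"},
    {"id": "b", "text_score": 2, "data_type": "title"},
]

# Rank by the metadata-adjusted score, highest first.
ranked = sorted(
    records,
    key=lambda r: final_score(r["text_score"], r["data_type"]),
    reverse=True,
)
print([r["id"] for r in ranked])  # ['b', 'a']
```

Record "b" has the lower text score but ranks first because its data type carries more weight.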

7.11.2. Usage

This section describes how to use scorers.

Here are the schema definition and sample data used to show the usage.

Sample schema:

Execution example:

table_create Memos TABLE_HASH_KEY ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Memos title COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Memos content COLUMN_SCALAR Text
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms title_index COLUMN_INDEX|WITH_POSITION Memos title
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms content_index COLUMN_INDEX|WITH_POSITION Memos content
# [[0, 1337566253.89858, 0.000355720520019531], true]

Sample data:

Execution example:

load --table Memos
[
{
  "_key": "memo1",
  "title": "Groonga is easy",
  "content": "Groonga is very easy full text search engine!"
},
{
  "_key": "memo2",
  "title": "Mroonga is easy",
  "content": "Mroonga is more easier full text search engine!"
},
{
  "_key": "memo3",
  "title": "Rroonga is easy",
  "content": "Ruby is very helpful."
},
{
  "_key": "memo4",
  "title": "Groonga is fast",
  "content": "Groonga! Groonga! Groonga! Groonga is very fast!"
},
{
  "_key": "memo5",
  "title": "PGroonga is fast",
  "content": "PGroonga is very fast!"
},
{
  "_key": "memo6",
  "title": "PGroonga is useful",
  "content": "SQL is easy because many client libraries exist."
},
{
  "_key": "memo7",
  "title": "Mroonga is also useful",
  "content": "MySQL has replication feature. Mroonga can use it."
}
]
# [[0, 1337566253.89858, 0.000355720520019531], 7]

You can use a score function in match_columns. The syntax is as follows.

For a score function that doesn't require any parameters, such as scorer_tf_idf:

SCORE_FUNCTION(COLUMN)

You can specify a weight:

SCORE_FUNCTION(COLUMN) * WEIGHT

For a score function that requires one or more parameters, such as scorer_tf_at_most:

SCORE_FUNCTION(COLUMN, ARGUMENT1, ARGUMENT2, ...)

You can specify a weight:

SCORE_FUNCTION(COLUMN, ARGUMENT1, ARGUMENT2, ...) * WEIGHT

In match_columns, you can use a different score function for each column:

SCORE_FUNCTION1(COLUMN1) ||
  SCORE_FUNCTION2(COLUMN2) * WEIGHT ||
  SCORE_FUNCTION3(COLUMN3, ARGUMENT1) ||
  ...

Here is a simple usage example.

Execution example:

select Memos \
  --match_columns "scorer_tf_idf(content)" \
  --query "Groonga" \
  --output_columns "content, _score" \
  --sortby "-_score"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "content",
#           "Text"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "Groonga! Groonga! Groonga! Groonga is very fast!",
#         2
#       ],
#       [
#         "Groonga is very easy full text search engine!",
#         1
#       ]
#     ]
#   ]
# ]

"Groonga! Groonga! Groonga! Groonga is very fast!" contains four occurrences of Groonga. With the default TF-based scorer, _score would be 4. The actual _score is 2 because the select command uses the TF-IDF based scorer scorer_tf_idf().

Here is an example that uses a weight.

Execution example:

select Memos \
  --match_columns "scorer_tf_idf(content) * 10" \
  --query "Groonga" \
  --output_columns "content, _score" \
  --sortby "-_score"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "content",
#           "Text"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "Groonga! Groonga! Groonga! Groonga is very fast!",
#         22
#       ],
#       [
#         "Groonga is very easy full text search engine!",
#         10
#       ]
#     ]
#   ]
# ]

"Groonga! Groonga! Groonga! Groonga is very fast!" has 22 as _score. It had 2 as _score in the previous example, which doesn't specify a weight.

Here is an example that uses a scorer that requires an argument. The scorer_tf_at_most scorer requires one argument, the maximum score. You can use it to limit the TF score.

Execution example:

select Memos \
  --match_columns "scorer_tf_at_most(content, 2.0)" \
  --query "Groonga" \
  --output_columns "content, _score" \
  --sortby "-_score"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "content",
#           "Text"
#         ],
#         [
#           "_score",
#           "Int32"
#         ]
#       ],
#       [
#         "Groonga! Groonga! Groonga! Groonga is very fast!",
#         2
#       ],
#       [
#         "Groonga is very easy full text search engine!",
#         1
#       ]
#     ]
#   ]
# ]

"Groonga! Groonga! Groonga! Groonga is very fast!" contains four occurrences of Groonga. With the default TF-based scorer, _score would be 4. The actual _score is 2 because the scorer used in the select command limits the maximum score value to 2.
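
The capping behavior described above can be sketched in Python as follows. This is a simplified model, not Groonga's implementation: the tf_at_most helper and its word tokenization are assumptions made for illustration.

```python
import re

def term_frequency(term, text):
    # Hypothetical tokenizer: lowercase, then split into alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower()).count(term.lower())

def tf_at_most(term, text, maximum):
    # Plain TF, but never larger than the given maximum.
    return min(term_frequency(term, text), maximum)

memo4 = "Groonga! Groonga! Groonga! Groonga is very fast!"
memo1 = "Groonga is very easy full text search engine!"

print(term_frequency("Groonga", memo4))   # 4
print(tf_at_most("Groonga", memo4, 2.0))  # 2.0
print(tf_at_most("Groonga", memo1, 2.0))  # 1
```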

Here is an example that uses multiple scorers.

select Memos \
  --match_columns "scorer_tf_idf(title) || scorer_tf_at_most(content, 2.0)" \
  --query "Groonga" \
  --output_columns "title, content, _score" \
  --sortby "-_score"

This --match_columns uses scorer_tf_idf(title) and scorer_tf_at_most(content, 2.0). The _score value is the sum of their scores.

You can use the default scorer and a custom scorer in the same --match_columns. To use the default scorer, just specify the match column:

select Memos \
  --match_columns "title || scorer_tf_at_most(content, 2.0)" \
  --query "Groonga" \
  --output_columns "title, content, _score" \
  --sortby "-_score"

This --match_columns uses the default scorer (TF) for title and scorer_tf_at_most for content. The _score value is the sum of their scores.
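
How the per-column scores add up can be modeled with a short Python sketch. It assumes a simplified word tokenizer and hypothetical helper names, so the numbers it prints illustrate the summing rule rather than reproduce Groonga's actual output.

```python
import re

def term_frequency(term, text):
    # Hypothetical tokenizer: lowercase, then split into alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower()).count(term.lower())

def combined_score(term, record, content_maximum):
    # Default scorer (TF) for title, capped TF for content; the final
    # score is the sum of the two per-column scores.
    title_score = term_frequency(term, record["title"])
    content_score = min(term_frequency(term, record["content"]), content_maximum)
    return title_score + content_score

memo1 = {
    "title": "Groonga is easy",
    "content": "Groonga is very easy full text search engine!",
}
memo4 = {
    "title": "Groonga is fast",
    "content": "Groonga! Groonga! Groonga! Groonga is very fast!",
}

print(combined_score("Groonga", memo1, 2))  # 2
print(combined_score("Groonga", memo4, 2))  # 3
```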

7.11.3. Built-in scorers

The built-in scorers, such as scorer_tf_at_most, are each described in their own subsection.
