String matching in Dgraph v0.7.5

The recent release of Dgraph is packed with new features and improvements. Many of them are related to strings – full text search (with support for 15 languages!) and regular expression matching have been added, and handling of string values in multiple languages was greatly improved. All of these changes make Dgraph an excellent tool for working with multilingual applications.

Values in Many Languages

We’re working hard to keep the query language easy to use and clean. Dgraph, in v0.7.5, adopted and extended the language tag syntax from the RDF N-Quads standard. It is intuitive, well-known, and was partially supported in previous versions (during data loading).

Let’s start from the beginning – the data. Dgraph uses RDF N-Quads for data loading and backup. String literals in N-Quads may be followed by the @ sign and a language tag, e.g. "badger"@en or "Dachs"@de. Multiple such literals may be used as a value for a single entity/attribute pair.
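
To make the literal syntax concrete, here is a minimal Python sketch that splits a language-tagged literal into its text and its tag. The pattern is a simplified assumption that covers only the forms shown above, not the full N-Quads grammar, and the function name `parse_literal` is hypothetical.

```python
import re

# Simplified assumption: a double-quoted string, optionally followed by
# "@" and a language tag such as "en" or "pt-BR". Escapes are kept as-is.
LITERAL = re.compile(
    r'^"(?P<text>(?:[^"\\]|\\.)*)"(?:@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*))?$'
)


def parse_literal(literal):
    """Return (text, language tag or None) for a quoted literal."""
    match = LITERAL.match(literal)
    if match is None:
        raise ValueError("not a quoted literal: " + literal)
    return match.group("text"), match.group("lang")


print(parse_literal('"badger"@en'))
print(parse_literal('"Dachs"@de'))
print(parse_literal('"badger"'))  # untagged literal, no language
```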

When querying for a predicate with multiple values, the user is able to use the @lang notation known from RDF N-Quads. Several languages can be given as a preference list, e.g. @en:de denotes that the preferred language is English, but if no English value is present, a value in German should be returned.
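
The fallback behaviour of a preference list can be sketched in a few lines of Python. This only illustrates the rule described above and is not Dgraph's implementation; the helper `pick_value` is hypothetical, and returning `None` when no listed language is present is an assumption.

```python
def pick_value(values, preference):
    """Return the first value whose language appears in the preference list.

    values: dict mapping a language tag (e.g. "en") to a string.
    preference: a string such as "en:de", most preferred language first.
    """
    for lang in preference.split(":"):
        if lang in values:
            return values[lang]
    return None  # assumption: nothing is returned when no language matches


names = {"en": "badger", "de": "Dachs"}
print(pick_value(names, "en:de"))            # English is present, so it wins
print(pick_value({"de": "Dachs"}, "en:de"))  # falls back to German
```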

Language can also be specified in functions, which is important especially for full text search.

Example Data

The dataset used in all examples is the Freebase film data. As this post is string-oriented, queries focus on movie titles in multiple languages, and no other information is retrieved. As the data does not tell us what kind of entity a name belongs to, we use filtering to select only the movie titles, and to limit the number of results a bit: @filter(gt(count(genre), 1)).

The schema for the name field is very simple – it defines 3 types of indexes:

curl localhost:8080/query -XPOST -d $'
mutation {
  schema {
    name: string @index(term, fulltext, exact) .
  }
}' | python -m json.tool | less

The term index is used for term matching with the allofterms and anyofterms functions. Note that it was the only string index available in previous releases of Dgraph.

The fulltext index uses matching with language-specific stemming and stop words. Note that values indexed with fulltext are processed according to their language (if they are tagged). If values are untagged, English is used as the default language.

The exact index is used for regular expression matching.

Full Text Search (FTS)

Very Short Introduction to Natural Language Processing (NLP)

By definition (from Wikipedia):

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user).

This may sound trivial, but it’s not. Searching for the exact form of a word does not always satisfy the user. For example, nouns can be singular or plural, verbs have grammatical tenses, etc., and the user may be interested in all values related to the word in any inflected or derived form.

The simple but powerful idea is to find a method that can transform all the forms of a word to some common base. This process is called stemming. For many natural languages (including English), stemmers may be implemented using a set of well-known grammatical rules. There are also languages (like Polish) where a dictionary-based approach is required (i.e. inflected form -> stem mapping).

Stemmers are widely available only for languages with well-known grammatical rules.

Another problem with search is common words, like the, is, or at. In most cases, searching for them gives an enormous number of useless results. Those words are called stop words. Again, stop words are language specific. The common way of handling those words is simply to remove them from the search.

Dgraph FTS/NLP Processing

The following steps are applied to both data (while indexing), and the query pattern:

  1. Tokenization – text is divided into words.
  2. Normalization – all letters are transformed to lowercase. Unicode Normalization is applied.
  3. Stop words are removed.
  4. Stemming is applied.

Since stop word lists contain inflected forms, stop words are removed before stemming.
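
The four steps above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not Dgraph's code: the stop word list is a tiny hand-picked sample, NFKC is only one possible Unicode normalization form, and `toy_stem` is a hypothetical suffix stripper standing in for a real language-specific stemmer.

```python
import re
import unicodedata

# Assumption: a tiny illustrative English stop word list.
STOP_WORDS = {"the", "is", "at", "or"}


def toy_stem(word):
    """Strip a couple of common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def analyze(text):
    tokens = re.findall(r"\w+", text)                                    # 1. tokenization
    tokens = [unicodedata.normalize("NFKC", t.lower()) for t in tokens]  # 2. normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]                  # 3. stop word removal
    return [toy_stem(t) for t in tokens]                                 # 4. stemming


print(analyze("Barking dogs at the gates"))  # ['bark', 'dog', 'gate']
print(analyze("white or maybe black"))       # ['white', 'maybe', 'black']
```

Because the same function is applied to the indexed values and to the query pattern, "dogs" in a query and "dog" in a title end up as the same token.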

Full Text Search Functions

There are two new functions that provide basic support for full text search:

  1. alloftext – searches for values that contain all the specified words (using NLP).
  2. anyoftext – searches for values that contain one or more of the specified tokens (using NLP).
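
The difference between the two functions comes down to set logic over the processed tokens. The Python sketch below illustrates it with hypothetical helpers `all_of_text` and `any_of_text`; it skips stemming, uses a tiny assumed stop word list, and treats a query with no remaining tokens as matching nothing.

```python
# Assumption: a tiny illustrative stop word list; stemming is omitted for brevity.
STOP_WORDS = {"or", "the", "is", "at"}


def tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def all_of_text(value, query):
    """True when every query token occurs in the value."""
    wanted = tokens(query)
    return bool(wanted) and wanted <= tokens(value)


def any_of_text(value, query):
    """True when at least one query token occurs in the value."""
    return bool(tokens(query) & tokens(value))


query = "white or maybe black"
print(all_of_text("Black and White maybe", query))  # True
print(all_of_text("Black Swan", query))             # False
print(any_of_text("Black Swan", query))             # True
```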

Examples

Let’s query for white or maybe black, using the term matching function allofterms.

curl localhost:8080/query -XPOST -d '
{
  movie(func:allofterms(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
    [email protected]
    [email protected]
    [email protected]
  }
}'

The query gives no results. It may be worth trying a less strict match with the alloftext function:

curl localhost:8080/query -XPOST -d '
{
  movie(func:alloftext(name@en, "white or maybe black")) @filter(gt(count(genre), 1)) {
    [email protected]
    [email protected]
    [email protected]
  }
}'

The query returns 59 results. This example shows that removing a stop word (here, or) may help in some cases.

In the context of NLP, English is quite easy – there are no diacritics, and inflection is rather simple. So let’s try a similar query in German:

curl localhost:8080/query -XPOST -d '
{
  movie(func:allofterms(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
    [email protected]
    [email protected]
    [email protected]
  }
}'

Again, the query doesn’t return any results.

Now let’s try the NLP-enabled version of this query:

curl localhost:8080/query -XPOST -d '
{
  movie(func:alloftext(name@de, "weiss oder vielleicht schwarz")) @filter(gt(count(genre), 1)) {
    [email protected]
    [email protected]
    [email protected]
  }
}'

This returns 4 results.

It’s worth noting the inflected forms of schwarz: schwarzes and Schwartze. Also, Wei\u00dfer is interesting: \u00df is the escaped Unicode value of the grapheme ß. weiss matched Weißer – the form is inflected, and grapheme equivalency is preserved. As in the English example, the stop word (oder) is ignored.
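
Both effects can be reproduced with the Python standard library. The sketch below decodes the JSON escape and then folds ß to ss; using `str.casefold` together with NFKC normalization is an assumption chosen for illustration, and the source does not say which normalization Dgraph applies internally.

```python
import json
import unicodedata


def fold(text):
    """Normalize and case-fold a string; casefold maps the grapheme ß to 'ss'."""
    return unicodedata.normalize("NFKC", text).casefold()


# \u00df is the JSON escape for ß, so decoding restores the original title word.
title_word = json.loads('"Wei\\u00dfer"')
print(title_word)        # Weißer
print(fold(title_word))  # weisser

# After folding, the query word "weiss" is a prefix of the title word;
# a German stemmer would then reduce the inflected ending (not shown here).
print(fold(title_word).startswith(fold("weiss")))  # True
```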

Gotchas

In some cases, natural language processing can lead to surprising results. Let’s search for the answer to the famous question: To be, or not to be?

curl localhost:8080/query -XPOST -d '
{
  movie(func:alloftext(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
    [email protected]
  }
}'

The query gives no results, while the term matching query:

curl localhost:8080/query -XPOST -d '
{
  movie(func:allofterms(name@en, "To be, or not to be?")) @filter(gt(count(genre), 1)) {
    [email protected]
  }
}'

gives two results.

What happened? To be, or not to be? consists of stop words only. After FTS/NLP processing, there are no movies that match the query.
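
A short Python sketch shows why nothing is left to match. The stop word set here is an assumption limited to the four words of the quote, which the text above identifies as stop words; the helper `content_tokens` is hypothetical.

```python
import re

# Assumption: only the four words from the quote are listed as stop words here.
STOP_WORDS = {"to", "be", "or", "not"}


def content_tokens(text):
    """Tokenize, lowercase, and drop stop words."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]


print(content_tokens("To be, or not to be?"))  # []
print(content_tokens("white or maybe black"))  # ['white', 'maybe', 'black']
```

With an empty token list the full text query has nothing to look up, while term matching keeps every word and can still find the titles.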

Regular Expressions (regexp)

Regular expressions are extremely useful for creating sophisticated matchers.

For example, all titles starting with a word containing night but not knight may be matched using the following query:

curl localhost:8080/query -XPOST -d '
{
  movie(func:regexp(name, "^[a-zA-z]*[^Kk ]?[Nn]ight")) @filter(gt(count(genre), 1)) {
    [email protected]
    [email protected]
    [email protected]
  }
}'

There are 502 results in the test dataset.
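
It can help to try a pattern on a few sample strings before running it against the full dataset. The sketch below does that with Python's `re` module, which is an assumption standing in for Dgraph's own regular expression engine; the sample titles are made up for illustration. Under Python's semantics the pattern from the query still accepts a title starting with Knight, because `[a-zA-z]*` can consume the leading K and the optional class can match nothing. `STRICT` is a hypothetical tighter variant that requires the letter before night, if any, to be something other than k.

```python
import re

# The pattern from the query above, copied verbatim.
PATTERN = re.compile(r"^[a-zA-z]*[^Kk ]?[Nn]ight")

# Hypothetical stricter variant: any letter directly before "night" must not be k or K.
STRICT = re.compile(r"^(?:[a-zA-Z]*[a-jl-zA-JL-Z])?[Nn]ight")

titles = ["Night at the Museum", "Midnight Cowboy", "Knight Rider", "The Night Porter"]

for title in titles:
    print(title, bool(PATTERN.search(title)), bool(STRICT.search(title)))
```

Only the first word is inspected in both cases, so "The Night Porter" is rejected by either pattern.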

Summary

Dgraph supports an extensive and useful set of string matching methods.

Natural language processing, employed for full text search, may be the best choice for lookups based on user input. If stricter matching is required, term matching should give good results. And to get the most precise results for complicated text searches, regular expressions can be used.

Source: Dgraph Blog
