Friday 13 November 2020

Elasticsearch/Logstash/Kibana (ELK) - Part 2

Index Creation

What happens when we send a request to index a document, or to query a document, in Elasticsearch?

As you know, Elasticsearch is a distributed technology, which means it can solve a given problem faster. If we have a really large search problem, we can break it up into smaller ones, hand them out to several computers, and have all those computers work in parallel to solve the problem in a much shorter period of time than just one computer could.

An Elasticsearch index is very scalable. It is the logical, human-readable representation of the data; under the hood the data lives in units called shards. So a particular index can be split up into multiple shards.

For example,
documents 1 to 50   -> shard_0
documents 51 to 100 -> shard_1

Shards are distributed across one or more computers: Node 1 is one computer, Node 2 is another, both run Elasticsearch, and they can communicate with one another when on the same network (clustering). These smaller units of storage are called "shards".

Shards are placed across those nodes (computers) so that when an actual search request comes in, any of the nodes in the cluster is able to respond. For backup you can configure replica shards: when an index is created, each replica is an exact copy of a primary shard. Replicas are mainly used in production.
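As a rough sketch of that placement idea (a toy allocator with made-up names; real Elasticsearch allocation is far more sophisticated), primaries can be spread round-robin over the nodes, with the rule that a replica never sits on the same node as its primary:

```python
def allocate_shards(num_shards, num_replicas, nodes):
    """Toy shard allocator: spread primaries round-robin over the nodes and
    place each replica on a different node than its primary, mirroring the
    rule that a replica is useless on the node already holding the primary."""
    placement = {}
    for s in range(num_shards):
        primary = nodes[s % len(nodes)]
        replicas = []
        for i in range(1, len(nodes)):
            if len(replicas) == num_replicas:
                break
            candidate = nodes[(s + i) % len(nodes)]
            if candidate != primary:
                replicas.append(candidate)
        # like Elasticsearch, replicas that cannot be placed stay unassigned
        placement[f"shard_{s}"] = {"primary": primary, "replicas": replicas}
    return placement

cluster = allocate_shards(num_shards=2, num_replicas=1, nodes=["node1", "node2"])
# each shard's replica ends up on the other node
```

This also shows why a one-node cluster with replicas configured reports unassigned shards: there is simply no second node to hold the copy.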

How does search take place ?

The node takes the ID of the document we are trying to index and runs it through something called a hashing function; the result determines which shard the document goes to.
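The documented routing formula is shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document's ID. A minimal sketch in Python, using crc32 as a stand-in for the murmur3 hash Elasticsearch actually uses:

```python
import zlib

def route_to_shard(doc_id, num_primary_shards):
    # shard = hash(_routing) % number_of_primary_shards
    # crc32 stands in for Elasticsearch's murmur3 hash (illustration only)
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

# The same ID always hashes to the same shard, so the node that indexed
# the document and the node that later reads it agree on its location.
shard = route_to_shard("2343", 3)
```

This is also why the number of primary shards cannot be changed after index creation: the modulus would change, and existing documents would route to the wrong shard.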

Shards
So a shard is basically a container of inverted indices, also called segments. A segment lives inside a shard; a shard can have multiple segments, and each of these segments is an inverted index.

inverted index:
a long list of words in alphabetical order; next to each word we record which documents that word occurred in and its actual location. Each such word is called a token, and the process of producing tokens is called tokenization.
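A minimal sketch of such an inverted index in Python (illustrative only: a naive whitespace tokenizer, none of the real analysis described later):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every token to the (doc_id, position) pairs where it occurs,
    keeping tokens in alphabetical order like the long word list above."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            postings[token].append((doc_id, position))
    return dict(sorted(postings.items()))

index = build_inverted_index({
    "doc1": "the quick brown fox",
    "doc2": "the lazy brown dog",
})
# "brown" maps to [("doc1", 2), ("doc2", 2)]: both documents, position 2
```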

And as you are aware, an Elasticsearch index is made up of multiple shards, and that is how an index can span multiple nodes: the unit of separation is the shard.

The process of taking raw text and turning it into an inverted index is called analysis; it is what makes the data searchable, and searching fast. Once the inverted index is formed, it goes into an in-memory buffer.

So when we send one document to Elasticsearch for indexing, it goes through the analysis step, an inverted index is created for it, and the result is added to the buffer. Another document sent to Elasticsearch goes through the same analysis process, its inverted index is formed, and it is added to the buffer as well. Once this buffer fills up, it gets committed to a segment, which is an immutable index.

And this is the basis of storing your data so that it is searchable. Once the segments are committed, the shard is searchable, and you can rest assured that the data in it is not going to change: it has been processed through analysis, inverted indices were formed, and they were committed to a segment. So this is permanent data that you know you can search. That is how each of those shards is formed.
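The buffer-then-commit flow can be sketched like this (a toy model with made-up names such as ShardSketch and buffer_limit, not the real Lucene machinery):

```python
class ShardSketch:
    """Toy shard: documents collect in an in-memory buffer; when the buffer
    fills up it is committed to a segment. Committed segments are immutable
    (modeled here as tuples), and only committed segments are searchable."""

    def __init__(self, buffer_limit=2):
        self.buffer = []
        self.segments = []
        self.buffer_limit = buffer_limit

    def index(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.buffer_limit:
            self.commit()

    def commit(self):
        self.segments.append(tuple(self.buffer))  # freeze the batch
        self.buffer = []

    def search(self, word):
        # only committed segments are visible to search
        return [doc for segment in self.segments for doc in segment
                if word in doc["text"].lower().split()]

shard = ShardSketch(buffer_limit=2)
shard.index({"text": "quick fox"})   # sits in the buffer, not yet searchable
shard.index({"text": "lazy dog"})    # buffer full -> committed to a segment
```

Note how a document is invisible to search until its buffer is committed; this mirrors why freshly indexed documents in Elasticsearch only become searchable after a refresh.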

So the magical process that happens here is the analysis step: that is what turns a document into an inverted index.

Analysis Process

When we send a document into Elasticsearch, it goes through this analysis step. The objective of the step is to transform the document into an inverted index, which gets put into a shard segment. So analysis is the key to indexing documents, and it matters not only during indexing but also at query time, when we retrieve or read documents.

Example:
we will put these two sentences through the analysis process:

"sentence 1"
"sentence 2"

When indexing these two sentences, we need to get rid of unnecessary information so that we can get to the most important pieces of both documents; only those pieces are indexed when we convert the documents into an inverted index. This is called tokenization.

The text goes through a component called an analyzer, which does all of the analysis and has two parts:

- The first part is tokenization, done by a tokenizer.
- The second part is filtering.

These steps are done during filtering:
- Removing stop words (common words such as "the", "a", "is")
- Lowercasing
- Stemming (eg running -> run, swimming -> swim)
- Synonyms (eg thin, lean, skinny)

When text goes into this analyzer during indexing, it first gets tokenized, then filtering takes place, and the tokens that come out are the ones that make it into the inverted index. So this is the indexing step.
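The whole tokenize-then-filter flow can be sketched in a few lines of Python (a deliberately crude imitation with a toy stemmer and no synonym handling; real Elasticsearch analyzers are configurable chains of tokenizers and token filters):

```python
STOP_WORDS = {"the", "a", "an", "is", "and"}

def crude_stem(token):
    # toy stemmer for illustration: running -> run, swimming -> swim
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]          # drop the doubled consonant
    return token

def analyze(text):
    tokens = text.split()                                # 1. tokenizer
    tokens = [t.lower().strip(".,!?") for t in tokens]   # 2. lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop-word filter
    return [crude_stem(t) for t in tokens]               # 4. stemming filter

tokens = analyze("The fox is running.")
# -> ["fox", "run"]: these tokens are what lands in the inverted index
```

The same function would run over a query string at search time, which is why a search for "run" can match a document containing "running".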


Define Custom Index Structure

We could define an index "hr" and deal with this particular structure, which is the logical representation;
the actual data resides on disk in shards. We are usually concerned with the logical
representation of the index and how to load data into it.

Now let's get into the details of how the index structure can be defined if you had to create your index
manually and define the different fields, their properties, and so on. So far Elasticsearch has created the index
structure for us dynamically, on the fly. Now let's see how to create it manually.

PUT /customer
{
  "settings": {
    "number_of_replicas": 2,
    "number_of_shards": 1
  },
  "mappings": {
    "online": {
      "properties": {
        "gender": {
          "type": "text",
          "analyzer": "standard"
        },
        "age": {
          "type": "integer"
        },
        "total_spent": {
          "type": "float"
        }
      }
    }
  }
}

Let's add a document to the index, matching what we defined earlier.

PUT /customer/online/2343
{
  "gender": "male",
  "age": 22,
  "total_spent": 50000,
  "location": "Kashmir"
}

Notice that GET /customer returns all the data, and the mapping now includes "location" even though
we never mentioned it when defining the index; Elasticsearch added it dynamically by itself.

We can restrict this behaviour by setting the value of dynamic:
- false: new fields are ignored (not indexed).
- strict: indexing a document with an unknown field throws an error.

PUT /customer/_mapping/online
{
  "dynamic": "strict"
}

Analyzers

Elasticsearch has a wide range of built-in analyzers, which can be used in any index without further configuration.
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/analysis-analyzers.html
