Elasticsearch - Basic Concepts
Elasticsearch (the “E” in the popular ELK stack) is an open-source, distributed, near real-time search engine that can be used to solve a variety of use cases. Let’s understand the basic concepts of Elasticsearch in this post.
What is Elasticsearch?
- Elasticsearch is an open-source, distributed, and near real-time search engine.
- Elasticsearch is complemented by Logstash and Kibana, which make it more powerful. This trio is known as the ELK stack.
- Logstash is used to collect data in real time from sources such as .log or .txt files and store it in Elasticsearch.
- Kibana is used to explore, visualize, and query data stored in Elasticsearch in real time.
Uses of Elasticsearch
Elasticsearch is most commonly used for the following use cases:
- An e-commerce-style search box that lets you search for products across a very large inventory.
- A Google-like search engine that offers suggestions as you type, fuzzy matching (“did you mean this?”) when you mistype something, and results sorted by relevance. The relevance logic can be customized based on whether the search term appears in the title, description, keywords, or body content of the document.
- A centralized logging and monitoring system for microservice-based applications.
- A NoSQL, flexible-schema, JSON-document-based storage system.
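The fuzzy matching and field-based relevance described in the search engine use case can be expressed in a single Query DSL body. The following is a minimal sketch, not a tuned production query: the field names (`title`, `keywords`, `description`), the boost values, and the helper name `build_product_search` are assumptions made up for illustration.

```python
import json


def build_product_search(user_input: str) -> dict:
    """Build a Query DSL body that tolerates typos and favors title matches."""
    return {
        "query": {
            "multi_match": {
                "query": user_input,
                # "^3" and "^2" boost matches in those (hypothetical) fields
                # so that a hit in the title outranks a hit in the description.
                "fields": ["title^3", "keywords^2", "description"],
                # "AUTO" chooses the allowed edit distance from the term length.
                "fuzziness": "AUTO",
            }
        }
    }


body = build_product_search("choclate cake")
print(json.dumps(body, indent=2))
```

The boosts are the part you would normally adjust: raising or lowering them changes how much each field contributes to the relevance score.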
How does Elasticsearch work?
- Elasticsearch is built on top of Apache Lucene, a search engine library written in Java.
- Elasticsearch is also written in Java, so it needs a Java runtime (JVM) to run.
Let’s understand the key concepts of Elasticsearch:
A document is the basic unit of data that is indexed in Elasticsearch, expressed in JSON format. A document can hold free-form text (unstructured data) or a collection of key-value pairs (structured data). You can think of a document as a record or row in a relational database.
Each document is a collection of fields, which are key-value pairs. A field’s value can be of a type such as text, numeric, date, or geo. You can think of a field as a column in a relational database.
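To make the document and field concepts concrete, here is a small sketch of one document with a field of each type mentioned above. The product and its field names are hypothetical, invented for this example.

```python
import json

# A hypothetical product document; every key below is a field.
product = {
    "name": "Vanilla Cake",                           # text
    "price": 12.5,                                    # numeric
    "added_on": "2024-01-15",                         # date (ISO 8601 string)
    "store_location": {"lat": 40.71, "lon": -74.0},   # geo point
}

# Elasticsearch receives documents as JSON, so this string is what would be sent.
document_json = json.dumps(product)
print(document_json)
```

Note that a geo point is not detected automatically from the JSON shape; the field has to be declared as `geo_point` in the index mapping before documents are indexed.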
An index is a collection of documents that have similar characteristics. An index is the highest level entity that you can query against in Elasticsearch. You can think of the index as being similar to a database in a relational database schema. Any documents in an index are typically logically related. In the context of an e-commerce website, for example, you can have an index for Customers, one for Products, one for Orders, and so on. An index is identified by a name that is used to refer to the index while performing indexing, search, update, and delete operations against the documents in it.
Elasticsearch uses an inverted index, which supports very fast full-text searches. An inverted index is a data structure that maps each unique word to the document or documents in which it appears, so it leads you from a word to its document(s). To build it, the text of each document is split into individual search terms (i.e. words), and each term is then mapped to the documents that contain it.
doc_id_1 = I love Vanilla Cake. Cake makes me happy.
doc_id_2 = Both Chocolate and Vanilla Cake are available in our store.
doc_id_3 = I bought Cookies for the party.

Term        Document              Frequency
------------------------------------------------------------------------
Vanilla     doc_id_1, doc_id_2    2 (1 in doc_id_1, 1 in doc_id_2)
Cake        doc_id_1, doc_id_2    3 (2 in doc_id_1, 1 in doc_id_2)
Chocolate   doc_id_2              1 (1 in doc_id_2)
Cookies     doc_id_3              1 (1 in doc_id_3)
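The table above can be reproduced with a few lines of Python. This is a simplified sketch of the idea, not how Lucene implements it: the tokenizer here just lowercases the text and splits on non-alphanumeric characters, and the function name `build_inverted_index` is made up for the example.

```python
import re
from collections import defaultdict

docs = {
    "doc_id_1": "I love Vanilla Cake. Cake makes me happy.",
    "doc_id_2": "Both Chocolate and Vanilla Cake are available in our store.",
    "doc_id_3": "I bought Cookies for the party.",
}


def build_inverted_index(documents):
    """Map each lowercased term to {doc_id: occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Simplified analysis step: lowercase, then keep runs of letters/digits.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


index = build_inverted_index(docs)
for term in ("vanilla", "cake", "chocolate", "cookies"):
    postings = index[term]
    # Print the term, the documents containing it, and its total frequency.
    print(term, sorted(postings), sum(postings.values()))
```

Looking up a term is a single dictionary access, which is why a search never has to scan the text of every document.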
An Elasticsearch cluster is a group of one or more connected nodes. The power of an Elasticsearch cluster lies in the distribution of data, searching, and indexing across all the nodes in the cluster. You can think of a cluster as a distributed relational database.
A node is a single server that is a part of a cluster. A node stores data and participates in the cluster’s indexing and search capabilities.
Elasticsearch provides the ability to subdivide an index into multiple pieces called shards. Sharding allows an index to be distributed across a cluster. Each shard is in itself a fully functional and independent Lucene “index” that can be hosted on any node within the cluster. By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, Elasticsearch can ensure redundancy, which both protects against hardware failures and increases query capacity as nodes are added to a cluster.
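How a document ends up on a particular shard can be sketched as hashing a routing value (by default the document ID) and taking it modulo the number of primary shards. The code below is a simplified illustration under stated assumptions: it uses `zlib.crc32` as a stand-in hash rather than the hash Elasticsearch actually uses, and the shard count and the function name `pick_shard` are made up for the example.

```python
import zlib
from collections import Counter

NUM_PRIMARY_SHARDS = 3  # assumed value for this sketch


def pick_shard(doc_id: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Route a document to a shard from a stable hash of its ID."""
    # crc32 is only a stand-in; what matters is that it is deterministic,
    # so the same ID always maps to the same shard.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


doc_ids = [f"product-{n}" for n in range(1000)]
distribution = Counter(pick_shard(doc_id) for doc_id in doc_ids)
print(dict(distribution))  # how many of the 1000 documents landed on each shard
```

Because the shard number depends on the shard count, changing the number of primary shards would send existing IDs to different shards. This is one reason the number of primary shards is normally chosen when the index is created.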
Elasticsearch allows you to make one or more copies of your index’s shards, which are called “replica shards” or just “replicas”. Basically, a replica shard is a copy of a primary shard. Each document in an index belongs to one primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.
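Replicas only protect against hardware failure if a replica lives on a different node than its primary. The sketch below shows one way to picture such a layout. It uses a plain round-robin placement, which is an assumption chosen for simplicity and not Elasticsearch’s actual allocation algorithm; the node names and the function `place_shards` are hypothetical.

```python
def place_shards(nodes, num_primaries, num_replicas):
    """Round-robin placement that keeps every copy of a shard on a different node."""
    if num_replicas >= len(nodes):
        # This sketch refuses the layout instead of leaving replicas unassigned.
        raise ValueError("need more nodes than replicas to separate all copies")
    placement = {node: [] for node in nodes}
    for shard in range(num_primaries):
        # Copy 0 is the primary (P); copies 1..num_replicas are replicas (R).
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "P" if copy == 0 else "R"
            placement[node].append(f"{role}{shard}")
    return placement


layout = place_shards(["node-1", "node-2", "node-3"], num_primaries=3, num_replicas=1)
for node, shards in layout.items():
    print(node, shards)
# node-1 ['P0', 'R2']
# node-2 ['R0', 'P1']
# node-3 ['R1', 'P2']
```

In this layout, losing any single node still leaves one copy of every shard on the remaining two nodes.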
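Analogy with Relational Database
- Document: a record or row
- Field: a column
- Index: a database
- Cluster: a distributed relational database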
How to search data in Elasticsearch?
- Elasticsearch provides REST APIs for search queries, which can be executed from the command line using curl or through the Dev Tools Console in Kibana.
- Kibana can be used to search, explore, and visualize data stored in Elasticsearch without much technical knowledge.
- The Elasticsearch REST APIs provide comprehensive, JSON-style search capabilities through the Query DSL.
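As a rough sketch of what such a REST call looks like from code, the example below prepares a search request with a Query DSL `match` query using only the Python standard library. Several things here are assumptions for illustration: a local, unsecured development node at `http://localhost:9200`, an index named `products`, a field named `name`, and the helper names `build_search_request` and `search`.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local, unsecured development node


def build_search_request(index: str, field: str, text: str) -> urllib.request.Request:
    """Prepare a POST /<index>/_search request with a Query DSL match query."""
    body = {"query": {"match": {field: text}}, "size": 5}
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(index: str, field: str, text: str) -> list:
    """Send the request and return the matching hits (needs a running cluster)."""
    with urllib.request.urlopen(build_search_request(index, field, text)) as response:
        return json.load(response)["hits"]["hits"]


# Building the request does not touch the network, so it can be inspected safely.
request = build_search_request("products", "name", "vanilla cake")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `search("products", "name", "vanilla cake")` would send the request; each returned hit carries the matched document under `_source` along with its relevance `_score`.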