A Look at the Elasticsearch Analyzer

Analyzer

Reference: https://trend21c.tistory.com/2220?category=1042753

Step 1: Character filter: zero or more character filters
Step 2: Tokenizer: exactly one tokenizer
Step 3: Token filter: zero or more token filters

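Text goes through these three steps because Elasticsearch searches with an inverted index: each document has to be reduced to normalized terms at index time so that a query term can be looked up directly.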
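
To make the inverted index idea concrete, here is a minimal Python sketch. It is not how Elasticsearch is implemented; `analyze` and `build_inverted_index` are hypothetical helpers, and the "analyzer" is just lowercasing plus a whitespace split.

```python
from collections import defaultdict


def analyze(text):
    # Stand-in for an analyzer: lowercase, then split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {
    1: "The road to success",
    2: "The road to failure",
}
index = build_inverted_index(docs)
print(sorted(index["road"]))
print(sorted(index["success"]))
```

Because the terms were normalized before being stored, a lookup only finds a document if the query is analyzed the same way. The capitalized form "Road" is not a key in this index.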

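Step 1: Character filter
Converts the raw input text into a form suitable for analysis before it is tokenized. The request below specifies no tokenizer, so the stripped text comes back as a single token.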

GET _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "text": "<h3>The road to success and the road to failure are almost exactly the same.</h3>"
}


// Result
{
  "tokens" : [
    {
      "token" : """The road to success and the road to failure are almost exactly the same.""",
      "start_offset" : 0,
      "end_offset" : 81,
      "type" : "word",
      "position" : 0
    }
  ]
}
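
As a rough illustration of what a character filter does, the following Python sketch removes tags and decodes entities before any tokenization happens. It is a simplification under my own assumptions (every tag becomes a newline), not the actual `html_strip` implementation; `strip_html` is a hypothetical name.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")


def strip_html(text):
    # Replace every tag with a newline, then decode entities such as &amp;.
    return html.unescape(TAG.sub("\n", text))


raw = "<h3>The road to success &amp; failure</h3>"
print(repr(strip_html(raw)))
```

The point is only that this stage works on the raw character stream and knows nothing about tokens yet.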

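Step 2: Tokenizer
Splits the input into search-term tokens according to the configured rule. The whitespace tokenizer used here splits on whitespace only, so punctuation stays attached (the last token is same. with the period). The offsets refer to the original text, which is why the first token starts at 4, right after <h3>.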

GET _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "whitespace",
  "text": "<h3>The road to success and the road to failure are almost exactly the same.</h3>"
}


// Result
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "road",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "success",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "and",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "the",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "road",
      "start_offset" : 32,
      "end_offset" : 36,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "to",
      "start_offset" : 37,
      "end_offset" : 39,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "failure",
      "start_offset" : 40,
      "end_offset" : 47,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "are",
      "start_offset" : 48,
      "end_offset" : 51,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "almost",
      "start_offset" : 52,
      "end_offset" : 58,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "exactly",
      "start_offset" : 59,
      "end_offset" : 66,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "the",
      "start_offset" : 67,
      "end_offset" : 70,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "same.",
      "start_offset" : 71,
      "end_offset" : 76,
      "type" : "word",
      "position" : 13
    }
  ]
}
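
The offsets and positions in this result can be reproduced with a small Python sketch. It assumes a shortcut of my own: tags are overwritten with spaces of the same length so that offsets still point into the original text. `mask_tags` and `whitespace_tokenize` are hypothetical helpers, not Elasticsearch APIs.

```python
import re

TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"\S+")


def mask_tags(text):
    # Overwrite each tag with spaces of the same length so that
    # character offsets still point into the original text.
    return TAG.sub(lambda m: " " * len(m.group()), text)


def whitespace_tokenize(text):
    # Every run of non-whitespace characters becomes one token.
    tokens = []
    for position, match in enumerate(WORD.finditer(mask_tags(text))):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


text = "<h3>The road to success and the road to failure are almost exactly the same.</h3>"
tokens = whitespace_tokenize(text)
print(tokens[0])
print(tokens[-1])
```

For this input the sketch yields 14 tokens, with The at offsets 4 to 7 and same. at offsets 71 to 76, matching the response above.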

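Step 3: Token filter

Applies filters to the separated tokens to produce the final terms that are actually used for search. Filters run in the order listed. Here stop runs before lowercase, so the capitalized The does not match the lowercase stop-word list and survives as the, while to, and, the, and are are removed. The positions of the remaining tokens keep their gaps.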

GET _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "whitespace",
  "filter": [
    "stop",
    "lowercase",
    "snowball"
  ],
  "text": "<h3>The road to success and the road to failure are almost exactly the same.</h3>"
}


// Result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "road",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "success",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "road",
      "start_offset" : 32,
      "end_offset" : 36,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "failur",
      "start_offset" : 40,
      "end_offset" : 47,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "almost",
      "start_offset" : 52,
      "end_offset" : 58,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "exact",
      "start_offset" : 59,
      "end_offset" : 66,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "same.",
      "start_offset" : 71,
      "end_offset" : 76,
      "type" : "word",
      "position" : 13
    }
  ]
}
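
Why the first token the survives the stop filter comes down to filter order. The Python sketch below chains two toy filters in both orders. The stop-word set is a small subset picked for this example, stemming (snowball) is left out entirely, and all function names are hypothetical.

```python
# Subset of common English stop words, chosen for this example only.
STOP_WORDS = {"a", "an", "and", "are", "the", "to"}


def stop(tokens):
    # Case-sensitive removal; surviving tokens keep their positions.
    return [(term, pos) for term, pos in tokens if term not in STOP_WORDS]


def lowercase(tokens):
    return [(term.lower(), pos) for term, pos in tokens]


def apply_filters(tokens, filters):
    # Run the filters in the order given, each on the previous output.
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


words = "The road to success and the road to failure are almost exactly the same.".split()
tokens = list(zip(words, range(len(words))))

stop_first = apply_filters(tokens, [stop, lowercase])
lower_first = apply_filters(tokens, [lowercase, stop])
print(stop_first[0])
print(lower_first[0])
```

With stop first, the result starts with ("the", 0), as in the response above. With lowercase first, the leading The is removed too and the result starts with ("road", 1). If stop words should be dropped regardless of case, listing lowercase before stop is the simpler arrangement.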
