A Look at the Elasticsearch Analyzer

category ELK 2022. 7. 18. 18:57

Analyzer

Reference: https://trend21c.tistory.com/2220?category=1042753


Stage 1: Character filter: zero or more character filters

Stage 2: Tokenizer: exactly one tokenizer
Stage 3: Token filter: zero or more token filters

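Text goes through these three stages because Elasticsearch stores it in an inverted index. An inverted index maps each term to the documents that contain it, so the text first has to be broken down into normalized terms.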
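To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is not how Elasticsearch is implemented; the function name `build_inverted_index` and the sample documents are made up for illustration, and the "analyzer" is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Stand-in for an analyzer: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents.
docs = {
    1: "The road to success",
    2: "The road to failure",
}
index = build_inverted_index(docs)
print(index["road"])     # [1, 2]
print(index["success"])  # [1]
```

A search for a term then becomes a dictionary lookup, which is why the quality of the terms produced at analysis time matters so much.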

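Stage 1: Character filter
Transforms the original input text into a form suitable for analysis before it reaches the tokenizer. The example below applies the html_strip character filter, which removes the HTML tags. Because the request names no tokenizer, the text comes back as a single token.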

GET _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "text": "<h3>The road to success and the road to failure are almost exactly the same.</h3>"
}


// Result
{
  "tokens" : [
    {
      "token" : """The road to success and the road to failure are almost exactly the same.""",
      "start_offset" : 0,
      "end_offset" : 81,
      "type" : "word",
      "position" : 0
    }
  ]
}
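The effect of a character filter can be approximated outside Elasticsearch. The sketch below is a rough stand-in, assuming a simple regular expression is good enough for the sample input; the helper name `strip_html` is hypothetical, and the real html_strip filter handles many more cases than this.

```python
import html
import re

def strip_html(text):
    """Rough approximation of a character filter that removes HTML markup."""
    without_tags = re.sub(r"<[^>]+>", "", text)  # drop anything that looks like a tag
    return html.unescape(without_tags)           # decode entities such as &amp;

raw = "<h3>The road to success &amp; failure</h3>"
print(strip_html(raw))  # The road to success & failure
```

The point is only that this stage works on the raw character stream, before any tokens exist.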

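Stage 2: Tokenizer
Splits the input into individual tokens (search terms) according to the configured rule. The whitespace tokenizer used here splits only on whitespace, so punctuation stays attached to the word, as in the final token "same.".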

GET _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "whitespace",
  "text": "<h3>The road to success and the road to failure are almost exactly the same.</h3>"
}


// Result
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "road",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "to",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "success",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "and",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "the",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "road",
      "start_offset" : 32,
      "end_offset" : 36,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "to",
      "start_offset" : 37,
      "end_offset" : 39,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "failure",
      "start_offset" : 40,
      "end_offset" : 47,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "are",
      "start_offset" : 48,
      "end_offset" : 51,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "almost",
      "start_offset" : 52,
      "end_offset" : 58,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "exactly",
      "start_offset" : 59,
      "end_offset" : 66,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "the",
      "start_offset" : 67,
      "end_offset" : 70,
      "type" : "word",
      "position" : 12
    },
    {
      "token" : "same.",
      "start_offset" : 71,
      "end_offset" : 76,
      "type" : "word",
      "position" : 13
    }
  ]
}
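The token, offset, and position fields in the result above can be mimicked with a few lines of Python. This is a sketch under the assumption that "whitespace tokenization" means splitting on runs of non-whitespace characters; `whitespace_tokenize` is a hypothetical helper, not an Elasticsearch API. Offsets here are relative to the string passed in, whereas Elasticsearch reports them relative to the original text, which is why "The" starts at 4 (after "<h3>") in the result above.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace, recording each token's offsets and position."""
    return [
        {
            "token": m.group(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": pos,
        }
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]

tokens = whitespace_tokenize("The road to success")
for t in tokens:
    print(t)
```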

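Stage 3: Token filter

Applies filters to the separated tokens to turn them into the final terms actually used for search. Token filters run in the order listed: here stop drops stop words, lowercase lowercases what remains, and snowball stems words ("failure" becomes "failur", "exactly" becomes "exact"). Because stop is case-sensitive by default and runs before lowercase, the capitalized "The" is not removed and ends up as the token "the".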

 

GET _analyze
{
  "char_filter": [
    "html_strip"
  ],
  "tokenizer": "whitespace",
  "filter": [
    "stop",
    "lowercase",
    "snowball"
  ],
  "text": "<h3>The road to success and the road to failure are almost exactly the same.</h3>"
}


// Result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "road",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "success",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "road",
      "start_offset" : 32,
      "end_offset" : 36,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "failur",
      "start_offset" : 40,
      "end_offset" : 47,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "almost",
      "start_offset" : 52,
      "end_offset" : 58,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "exact",
      "start_offset" : 59,
      "end_offset" : 66,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "same.",
      "start_offset" : 71,
      "end_offset" : 76,
      "type" : "word",
      "position" : 13
    }
  ]
}
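The order of token filters changes the outcome, which a small Python sketch can show. Everything here is an assumption for illustration: `STOP_WORDS` is a tiny hand-picked list rather than Elasticsearch's default set, the helper names are hypothetical, and stemming is left out because the standard library has no Snowball stemmer.

```python
# Small hypothetical stop word list, not Elasticsearch's full default set.
STOP_WORDS = {"the", "to", "and", "are"}

def stop(tokens):
    """Drop tokens that exactly match a stop word (case-sensitive)."""
    return [t for t in tokens if t not in STOP_WORDS]

def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

def apply_filters(tokens, filters):
    """Run the token filters one after another, in list order."""
    for f in filters:
        tokens = f(tokens)
    return tokens

words = "The road to success and the road to failure are almost exactly the same.".split()

stop_first = apply_filters(words, [stop, lowercase])
lower_first = apply_filters(words, [lowercase, stop])
print(stop_first)   # starts with 'the': the capitalized "The" slipped past stop
print(lower_first)  # no 'the' at all
```

If the leading "the" is unwanted, listing lowercase before stop in the request's filter array is the usual way to get rid of it.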