Trying out the Elasticsearch nori plugin

Docker + Elasticsearch, Kibana setup (cluster)

Setup: 1. Three nodes in total. 2. node1 and node2 serve as both master and data nodes. 3. node3 is a data-only node. # elasticsearch image docker.elastic.co/elasticsearch/elasticsearch:7.9.1 # kibana image doc..

1995-dev.tistory.com

Elasticsearch local environment #2

GitHub - lgm3555/docker-elk-setting: docker-elk-setting

docker-elk-setting. Contribute to lgm3555/docker-elk-setting development by creating an account on GitHub.

github.com

Installing Nori, the Korean analysis plugin

# Docker install (Dockerfile instruction)
RUN elasticsearch-plugin install analysis-nori
# Direct Elasticsearch install
elasticsearch-plugin install analysis-nori

# Verify the installation
GET /_cat/plugins

result => es1_ojt analysis-nori 7.8.1
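
In a multi-node cluster the plugin has to be installed on every node, so it can help to check the `_cat/plugins` output node by node. The following is a minimal Python sketch, assuming the default plain-text response with the columns node name, component, and version; the helper name `nodes_with_plugin` is hypothetical and not part of any Elasticsearch client.

```python
def nodes_with_plugin(cat_output: str, plugin: str) -> dict:
    """Map node name -> plugin version for every row that lists `plugin`."""
    found = {}
    for line in cat_output.splitlines():
        parts = line.split()
        # Assumed row layout: <node name> <component> <version>
        if len(parts) >= 3 and parts[1] == plugin:
            found[parts[0]] = parts[2]
    return found


# The single row shown in the result above.
sample = "es1_ojt analysis-nori 7.8.1"
print(nodes_with_plugin(sample, "analysis-nori"))  # {'es1_ojt': '7.8.1'}
```

Any node missing from the returned mapping would still need the plugin installed.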

Comparing analysis with and without Nori

POST /_analyze
{
  "text": ["대한민국에 오신것을 환영합니다."]
}

result => 
{
  "tokens" : [
    {
      "token" : "대한민국에",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "오신것을",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<HANGUL>",
      "position" : 1
    },
    {
      "token" : "환영합니다",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<HANGUL>",
      "position" : 2
    }
  ]
}

POST /_analyze
{
  "tokenizer": "nori_tokenizer",
  "text": ["대한민국에 오신것을 환영합니다."]
}

result =>
{
  "tokens" : [
    {
      "token" : "대한",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "민국",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "에",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "오",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "시",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "ᆫ",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "것",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "을",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "환영",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "하",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "ᄇ니다",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "word",
      "position" : 10
    }
  ]
}
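
The two requests differ only in the `tokenizer` field: without it, `_analyze` falls back to the standard analyzer, which splits on whitespace and punctuation, while `nori_tokenizer` splits the text into morphemes. As a rough sketch of how the comparison could be scripted, the Python helpers below (hypothetical names) build the request bodies and reduce a response to its token strings. The sample response is an abbreviated copy of the first result above.

```python
import json
from typing import Optional


def analyze_body(text: str, tokenizer: Optional[str] = None) -> dict:
    """Build a body for POST /_analyze; the tokenizer field is optional."""
    body = {"text": [text]}
    if tokenizer is not None:
        body["tokenizer"] = tokenizer
    return body


def token_texts(response: dict) -> list:
    """Return just the token strings, in response order."""
    return [t["token"] for t in response["tokens"]]


# Abbreviated copy of the first response above (no tokenizer specified).
default_response = {
    "tokens": [
        {"token": "대한민국에", "position": 0},
        {"token": "오신것을", "position": 1},
        {"token": "환영합니다", "position": 2},
    ]
}

text = "대한민국에 오신것을 환영합니다."
print(json.dumps(analyze_body(text, "nori_tokenizer"), ensure_ascii=False))
print(token_texts(default_response))
```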

Creating a synonym dictionary

# Create the file under /usr/share/elasticsearch/config/
# Docker volumes entry
#"${PWD}/synonyms.txt:/usr/share/elasticsearch/config/synonyms.txt"

우리나라,대한민국,한국,코리아,korea

This declares 대한민국, 우리나라, 한국, 코리아, and korea as synonyms of one another.
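
To make the meaning of a comma-separated rule concrete, here is a small Python sketch that treats every term on the line as equivalent to every other term. It only models this equivalence form of the Solr-style synonym format (not explicit `=>` mappings), and the function names are hypothetical.

```python
def parse_equivalent_synonyms(line: str) -> list:
    """Split one comma-separated synonym rule into its terms."""
    return [term.strip() for term in line.split(",") if term.strip()]


def expansions(line: str) -> dict:
    """For each term, list the other terms it is treated as equivalent to."""
    terms = parse_equivalent_synonyms(line)
    return {term: [other for other in terms if other != term] for term in terms}


rule = "우리나라,대한민국,한국,코리아,korea"
print(expansions(rule)["대한민국"])  # ['우리나라', '한국', '코리아', 'korea']
```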

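Creating the synonym dictionary and applying it in the index settings produces the error below.

The probable cause: when the synonym rules are parsed, the entry "대한민국" passes through the analyzer and is split into "대한" and "민국", so building the filter fails.

The fix is to add the problematic synonym to the user dictionary.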

{
  "error":{
    "root_cause":[
      {
        "type":"illegal_argument_exception",
        "reason":"failed to build synonyms"
      }
    ],
    "type":"illegal_argument_exception",
    "reason":"failed to build synonyms",
    "caused_by":{
      "type":"parse_exception",
      "reason":"Invalid synonym rule at line 1",
      "caused_by":{
        "type":"illegal_argument_exception",
        "reason":"term: 대한민국 analyzed to a token (대한) with position increment != 1 (got: 0)"
      }
    }
  },
  "status":400
}
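
The innermost `reason` suggests that each token of an analyzed synonym term is expected to have a position increment of exactly 1. The Python sketch below imitates that check to show why a split term is rejected. It is a simplified model, not the actual Lucene code, and the token streams are assumptions: the first is what a tokenizer with `decompound_mode: mixed` would plausibly emit for 대한민국 (the compound plus its parts), the second is the single token expected once the term is in the user dictionary.

```python
def check_synonym_term(term: str, analyzed: list) -> list:
    """Reject a synonym term if any analyzed token has position increment != 1.

    `analyzed` is a list of (token, position_increment) pairs.
    """
    for token, pos_inc in analyzed:
        if pos_inc != 1:
            raise ValueError(
                f"term: {term} analyzed to a token ({token}) "
                f"with position increment != 1 (got: {pos_inc})"
            )
    return [token for token, _ in analyzed]


# ASSUMPTION: illustrative token streams, not captured from a real cluster.
split_stream = [("대한민국", 1), ("대한", 0), ("민국", 1)]
single_stream = [("대한민국", 1)]

try:
    check_synonym_term("대한민국", split_stream)
except ValueError as err:
    print("rejected:", err)

print("accepted:", check_synonym_term("대한민국", single_stream))
```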

Add 대한민국 to the user dictionary.

Creating a user dictionary

# Create the file under /usr/share/elasticsearch/config/
# Docker volumes entry
#"${PWD}/user_dictionary.txt:/usr/share/elasticsearch/config/user_dictionary.txt"

우리나라
대한민국

Adding the user dictionary and the synonym dictionary to the analyzer

PUT /synonyms_dic_test
{
  "mappings": {
    "properties": {
      "product_title": {
        "type": "text",
        "analyzer": "synonym_test"
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "korean_nori_tokenizer": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "user_dictionary.txt"
          }
        },
        "analyzer": {
          "synonym_test": {
            "type": "custom",
            "tokenizer": "korean_nori_tokenizer",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "synonyms.txt"
          }
        }
      }
    }
  }
}
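
The same request could also be issued from a script. Below is a minimal Python sketch using only the standard library: `build_index_body` reproduces the body above as a dict, and `create_index` sends it with an HTTP PUT. Both function names are hypothetical, and the base URL `http://localhost:9200` in the commented-out call is an assumption about a local, unsecured cluster.

```python
import json
import urllib.request


def build_index_body(user_dictionary: str = "user_dictionary.txt",
                     synonyms_path: str = "synonyms.txt") -> dict:
    """Return the mappings and settings shown above as a Python dict."""
    analysis = {
        "tokenizer": {
            "korean_nori_tokenizer": {
                "type": "nori_tokenizer",
                "decompound_mode": "mixed",
                "user_dictionary": user_dictionary,
            }
        },
        "analyzer": {
            "synonym_test": {
                "type": "custom",
                "tokenizer": "korean_nori_tokenizer",
                "filter": ["synonym"],
            }
        },
        "filter": {
            "synonym": {"type": "synonym", "synonyms_path": synonyms_path}
        },
    }
    return {
        "mappings": {
            "properties": {
                "product_title": {"type": "text", "analyzer": "synonym_test"}
            }
        },
        "settings": {"index": {"analysis": analysis}},
    }


def create_index(base_url: str, index_name: str, body: dict) -> dict:
    """Send PUT <base_url>/<index_name>; needs a reachable cluster."""
    request = urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


body = build_index_body()
print(json.dumps(body["settings"]["index"]["analysis"]["analyzer"], indent=2))
# Hypothetical local cluster URL; uncomment to actually create the index.
# create_index("http://localhost:9200", "synonyms_dic_test", body)
```

Both dictionary paths are resolved relative to the Elasticsearch config directory, which matches the volume mounts shown earlier.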

Using the analyzer

POST synonyms_dic_test/_analyze
{
  "analyzer":"synonym_test",
  "text":"대한민국에 오신것을 환영합니다."
}

result =>
{
  "tokens" : [
    {
      "token" : "대한민국",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "우리나라",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "한국",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "korea",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "SYNONYM",
      "position" : 0
    },
    ...
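
In the response, the original token and its synonyms share the same position and offsets, which is what lets a query for any one of them match the same spot in the text. A short Python sketch (hypothetical helper name) that groups tokens by position makes this visible. The input is limited to the four tokens shown above.

```python
from collections import defaultdict


def group_by_position(tokens: list) -> dict:
    """Group (token, type) pairs by their position in the token stream."""
    grouped = defaultdict(list)
    for t in tokens:
        grouped[t["position"]].append((t["token"], t["type"]))
    return dict(grouped)


# The four tokens visible in the truncated response above.
tokens = [
    {"token": "대한민국", "type": "word", "position": 0},
    {"token": "우리나라", "type": "SYNONYM", "position": 0},
    {"token": "한국", "type": "SYNONYM", "position": 0},
    {"token": "korea", "type": "SYNONYM", "position": 0},
]

for position, entries in group_by_position(tokens).items():
    print(position, entries)
```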


1995 Dev