ko-nlp / korpora

Korean corpus repository

License: Creative Commons Attribution 4.0 International

Python 99.96% Shell 0.04%

korpora's Introduction

Korpora: Korean Corpora Archives

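With the growing interest in natural language processing, governments, companies, and even dedicated individuals are increasingly releasing their data for free. However, because these datasets are scattered across many places, even high-quality corpora often go unnoticed. Their file formats and storage layouts also differ, which makes them hard to use. As a result, individual users have to write download and preprocessing code from scratch each time.

Korpora is an open-source Python package developed to ease this inconvenience, if only a little. The name is derived from corpora, the plural of the English word corpus, and is short for Korean Corpora. We hope Korpora will act as a catalyst so that more Korean datasets are released and Korean natural language processing moves up a level.

List of corpora

Korpora provides the following corpora.

corpus_name	description	link
korean_chatbot_data	Question and answer pairs for training a chatbot	https://github.com/songys/Chatbot_data
kcbert	Comment data used for training the KcBERT model	https://github.com/Beomi/KcBERT
korean_hate_speech	Korean hate speech dataset	https://github.com/kocohub/korean-hate-speech
korean_petitions	Petitions to the Blue House	https://github.com/lovit/petitions_archive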
kornli	Korean NLI	https://github.com/kakaobrain/KorNLUDatasets
korsts	Korean STS	https://github.com/kakaobrain/KorNLUDatasets
kowikitext	Korean Wikipedia text	https://github.com/lovit/kowikitext/
namuwikitext	Namuwiki text	https://github.com/lovit/namuwikitext
naver_changwon_ner	NAVER x Changwon National University NER dataset	https://github.com/naver/nlp-challenge/tree/master/missions/ner
nsmc	NAVER Sentiment Movie Corpus	https://github.com/e9t/nsmc
question_pair	Korean question pair dataset	https://github.com/songys/Question_pair
modu_news	Modu Corpus: Newspaper	https://corpus.korean.go.kr
modu_messenger	Modu Corpus: Messenger	https://corpus.korean.go.kr
modu_mp	Modu Corpus: Morphemes	https://corpus.korean.go.kr
modu_ne	Modu Corpus: Named Entity	https://corpus.korean.go.kr
modu_spoken	Modu Corpus: Spoken	https://corpus.korean.go.kr
modu_web	Modu Corpus: Web	https://corpus.korean.go.kr
modu_written	Modu Corpus: Written	https://corpus.korean.go.kr
aihub_translation	Korean-English translation corpus	https://aihub.or.kr/aidata/87
open_subtitles	Korean-English parallel corpus from movie subtitles	http://opus.nlpl.eu/OpenSubtitles-v2018.php
korean_parallel_koen_news	Korean-English parallel corpus	https://github.com/jungyeul/korean-parallel-corpora

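Information page

The page below explains in detail how to use Korpora. It is written in both Korean and English. We thank Han Kyul Kim (@hank110) and Won Ik Cho (@warnikchow) (in alphabetical order) for their work on the English translation.

https://ko-nlp.github.io/Korpora

If you want a quick look at the core features, see the Quick overview section below. For notes on execution and for adding or changing options, see the page above.

Quick overview

Installation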

From source

git clone https://github.com/ko-nlp/Korpora
python setup.py install

Using pip

pip install Korpora

Using in Python

Korpora is an open-source Python package. By default, it runs in a Python console. The following Python example lists the available corpora.

from Korpora import Korpora
Korpora.corpus_list()

{
   'kcbert': 'beomi@github 님이 만드신 KcBERT 학습데이터',
   'korean_chatbot_data': 'songys@github 님이 만드신 챗봇 문답 데이터',
   'korean_hate_speech': '{inmoonlight,warnikchow,beomi}@github 님이 만드신 혐오댓글데이터',
   'korean_petitions': 'lovit@github 님이 만드신 2017.08 ~ 2019.03 청와대 청원데이터',
   'kornli': 'KakaoBrain 에서 제공하는 Natural Language Inference (NLI) 데이터',
   'korsts': 'KakaoBrain 에서 제공하는 Semantic Textual Similarity (STS) 데이터',
   'kowikitext': "lovit@github 님이 만드신 wikitext 형식의 한국어 위키피디아 데이터",
   'namuwikitext': 'lovit@github 님이 만드신 wikitext 형식의 나무위키 데이터',
   'naver_changwon_ner': '네이버 + 창원대 NER shared task data',
   'nsmc': 'e9t@github 님이 만드신 Naver sentiment movie corpus v1.0',
   'question_pair': 'songys@github 님이 만드신 질문쌍(Paired Question v.2)',
   'modu_news': '국립국어원에서 만든 모두의 말뭉치: 뉴스 말뭉치',
   'modu_messenger': '국립국어원에서 만든 모두의 말뭉치: 메신저 말뭉치',
   'modu_mp': '국립국어원에서 만든 모두의 말뭉치: 형태 분석 말뭉치',
   'modu_ne': '국립국어원에서 만든 모두의 말뭉치: 개체명 분석 말뭉치',
   'modu_spoken': '국립국어원에서 만든 모두의 말뭉치: 구어 말뭉치',
   'modu_web': '국립국어원에서 만든 모두의 말뭉치: 웹 말뭉치',
   'modu_written': '국립국어원에서 만든 모두의 말뭉치: 문어 말뭉치',
   'aihub_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어 + 대화 + 뉴스 + 한국문화 + 조례 + 지자체웹사이트)",
   'aihub_spoken_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어)",
   'aihub_conversation_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (대화)",
   'aihub_news_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (뉴스)",
   'aihub_korean_culture_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (한국문화)",
   'aihub_decree_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (조례)",
   'aihub_government_website_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (지자체웹사이트)",
   'open_subtitles': 'Open parallel corpus (OPUS) 에서 제공하는 영화 자막 번역 병렬 말뭉치',
}

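The following Python example downloads the KcBERT training data from a Python console. The corpus is downloaded to a directory named Korpora under the user's home directory (~/Korpora). To download a different dataset, pass one of the corpus names listed above as the argument.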

from Korpora import Korpora
Korpora.fetch("kcbert")

To download every corpus that Korpora provides, run the following. The corpora are downloaded to ~/Korpora.

from Korpora import Korpora
Korpora.fetch('all')

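The following example loads the KcBERT training data in a Python console. If the data is not available locally, the corpus is first downloaded to ~/Korpora. The corpus data is then stored in the Python variable corpus. To load a different dataset, pass one of the corpus names listed above as the argument.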

from Korpora import Korpora
corpus = Korpora.load("kcbert")

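Using in a terminal

Korpora also works from a terminal through its command line interface (CLI), so you can use it without starting a Python console. The following example downloads only the KcBERT training data from a terminal. The corpus is downloaded to ~/Korpora.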

korpora fetch --corpus kcbert

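The following example downloads two datasets at once from a terminal: the KcBERT training data and the chatbot Q&A data. Three or more datasets can be downloaded at once in the same way. The corpora are downloaded to ~/Korpora.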

korpora fetch --corpus kcbert korean_chatbot_data

The following example downloads every corpus that Korpora provides from a terminal. The corpora are downloaded to ~/Korpora.

korpora fetch --corpus all

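You can build training data for a language model from a terminal. Here, building language model training data means extracting only the sentences from the corpora that Korpora provides and dumping them to a text file. A basic example follows. It processes all corpora provided by Korpora (all) in one batch into a language model training corpus, performing download and preprocessing together. If the data is not available locally, the corpora are downloaded to ~/Korpora. The result is a single file named all.train, created in output_dir.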

korpora lmdata \
  --corpus all \
  --output_dir ~/works/lmdata

License

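Korpora is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). This license applies only to the Korpora package and its accompanying materials.
Its users have the following rights.
- Share : They are free to reproduce, distribute, exhibit, perform, and publicly transmit the material (including changes in the format).
- Adapt : They can remix, transform, and create derivative works. Commercial use is also allowed.
Its users have the following obligations. The rights above remain valid as long as these obligations are met.
- Attribution : They must indicate that they have used Korpora.
- No additional restrictions : They cannot impose a license stricter than CC-BY on derivative works that use Korpora.
For example, if you have only downloaded and used Korpora, you need to fulfill only the 'attribution' obligation. If you create and distribute models, documents, or other derivative works using Korpora, you must fulfill both the 'attribution' and 'no additional restrictions' obligations.
The license of each corpus, on the other hand, applies separately per corpus. Be sure to check the license of the corpus you plan to use before using it!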

Korpora: Korean Corpora Archives

Due to the growing interest in natural language processing, governments, businesses, and individuals are increasingly releasing their data for free. However, because these datasets are scattered across many locations, even high-quality corpora often go unnoticed. Their file formats and storage layouts also differ, which makes them harder to use. As a result, individual users have to write download and preprocessing code from scratch each time.

Korpora is an open-source Python package that aims to minimize such inconvenience. The name Korpora comes from the word corpora, the plural form of corpus, and is short for Korean Corpora. We hope that Korpora will serve as a starting point that encourages more Korean datasets to be released and takes Korean natural language processing to the next level.

List of corpora

Korpora provides the following corpora.

corpus_name	description	link
korean_chatbot_data	Question and answer pairs for training a chatbot	https://github.com/songys/Chatbot_data
kcbert	Comment data used for training KcBERT model	https://github.com/Beomi/KcBERT
korean_hate_speech	Korean hate speech dataset	https://github.com/kocohub/korean-hate-speech
korean_petitions	Petitions to Blue House	https://github.com/lovit/petitions_archive
kornli	Korean NLI	https://github.com/kakaobrain/KorNLUDatasets
korsts	Korean STS	https://github.com/kakaobrain/KorNLUDatasets
kowikitext	Korean Wikipedia text	https://github.com/lovit/kowikitext/
namuwikitext	Namuwiki text	https://github.com/lovit/namuwikitext
naver_changwon_ner	NAVER x Changwon National University NER dataset	https://github.com/naver/nlp-challenge/tree/master/missions/ner
nsmc	NAVER Sentiment Movie Corpus	https://github.com/e9t/nsmc
question_pair	Korean question pair dataset	https://github.com/songys/Question_pair
modu_news	Modu Corpus: Newspaper	https://corpus.korean.go.kr
modu_messenger	Modu Corpus: Messenger	https://corpus.korean.go.kr
modu_mp	Modu Corpus: Morphemes	https://corpus.korean.go.kr
modu_ne	Modu Corpus: Named Entity	https://corpus.korean.go.kr
modu_spoken	Modu Corpus: Spoken	https://corpus.korean.go.kr
modu_web	Modu Corpus: Web	https://corpus.korean.go.kr
modu_written	Modu Corpus: Written	https://corpus.korean.go.kr
aihub_translation	Korean-English translation corpus	https://aihub.or.kr/aidata/87
open_subtitles	Korean-English parallel corpus from movie subtitles	http://opus.nlpl.eu/OpenSubtitles-v2018.php
korean_parallel_koen_news	Korean-English parallel corpus	https://github.com/jungyeul/korean-parallel-corpora

Information page

Detailed information on Korpora is available from the link below. The information page is written in both Korean and English. We would like to thank Han Kyul Kim (@hank110) and Won Ik Cho (@warnikchow) (in alphabetical order) for the English translation.

https://ko-nlp.github.io/Korpora

For those who would like to quickly go through the core functions, please refer to the Quick overview part below. For more information about notes on execution or option modifications, please refer to the information page linked above.

Quick overview

Installation

From source

git clone https://github.com/ko-nlp/Korpora
python setup.py install

Using pip

pip install Korpora

Using in Python

Korpora is an open-source Python package. By default, it runs in a Python console. You can check the list of available corpora with the following Python code.

from Korpora import Korpora
Korpora.corpus_list()

{
   'kcbert': 'beomi@github 님이 만드신 KcBERT 학습데이터',
   'korean_chatbot_data': 'songys@github 님이 만드신 챗봇 문답 데이터',
   'korean_hate_speech': '{inmoonlight,warnikchow,beomi}@github 님이 만드신 혐오댓글데이터',
   'korean_petitions': 'lovit@github 님이 만드신 2017.08 ~ 2019.03 청와대 청원데이터',
   'kornli': 'KakaoBrain 에서 제공하는 Natural Language Inference (NLI) 데이터',
   'korsts': 'KakaoBrain 에서 제공하는 Semantic Textual Similarity (STS) 데이터',
   'kowikitext': "lovit@github 님이 만드신 wikitext 형식의 한국어 위키피디아 데이터",
   'namuwikitext': 'lovit@github 님이 만드신 wikitext 형식의 나무위키 데이터',
   'naver_changwon_ner': '네이버 + 창원대 NER shared task data',
   'nsmc': 'e9t@github 님이 만드신 Naver sentiment movie corpus v1.0',
   'question_pair': 'songys@github 님이 만드신 질문쌍(Paired Question v.2)',
   'modu_news': '국립국어원에서 만든 모두의 말뭉치: 뉴스 말뭉치',
   'modu_messenger': '국립국어원에서 만든 모두의 말뭉치: 메신저 말뭉치',
   'modu_mp': '국립국어원에서 만든 모두의 말뭉치: 형태 분석 말뭉치',
   'modu_ne': '국립국어원에서 만든 모두의 말뭉치: 개체명 분석 말뭉치',
   'modu_spoken': '국립국어원에서 만든 모두의 말뭉치: 구어 말뭉치',
   'modu_web': '국립국어원에서 만든 모두의 말뭉치: 웹 말뭉치',
   'modu_written': '국립국어원에서 만든 모두의 말뭉치: 문어 말뭉치',
   'aihub_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어 + 대화 + 뉴스 + 한국문화 + 조례 + 지자체웹사이트)",
   'aihub_spoken_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (구어)",
   'aihub_conversation_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (대화)",
   'aihub_news_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (뉴스)",
   'aihub_korean_culture_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (한국문화)",
   'aihub_decree_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (조례)",
   'aihub_government_website_translation': "AI Hub 에서 제공하는 번역용 병렬 말뭉치 (지자체웹사이트)",
   'open_subtitles': 'Open parallel corpus (OPUS) 에서 제공하는 영화 자막 번역 병렬 말뭉치',
}

From the Python console, you can download the KcBERT training data with the following Python code. The corpus is downloaded to the Korpora directory within the user's home directory (~/Korpora). To download a different dataset, replace the argument with one of the corpus names from the list above.

from Korpora import Korpora
Korpora.fetch("kcbert")

If you want to download all corpora provided by Korpora, use the following Python code. All datasets are downloaded to ~/Korpora.

from Korpora import Korpora
Korpora.fetch('all')

Using the following code, you can load the KcBERT training dataset from your Python console. If the corpus does not exist in the local directory, it is first downloaded to ~/Korpora. The corpus data is then stored in the Python variable corpus. To load a different dataset, replace the argument with one of the corpus names from the list above.

from Korpora import Korpora
corpus = Korpora.load("kcbert")

Using in a terminal

You can also run Korpora from your terminal through its command line interface (CLI), without starting a Python console. You can download the KcBERT training dataset from your terminal with the following command. The dataset is downloaded to ~/Korpora.

korpora fetch --corpus kcbert

With the following command, you can simultaneously download the KcBERT training dataset and the chatbot Q&A pair dataset. With this command, you can also simultaneously download three or more datasets. Datasets are downloaded to ~/Korpora.

korpora fetch --corpus kcbert korean_chatbot_data

You can download all corpora provided by Korpora from your terminal with the following command. Datasets are downloaded to ~/Korpora.

korpora fetch --corpus all

From your terminal, you can also create a dataset for training a language model. Creating this training dataset for a language model refers to a process of extracting only the sentences from all corpora provided by Korpora and saving them in a text file. A sample command is as follows. It simultaneously processes all corpora provided by Korpora and creates a single training dataset for a language model. Downloading the corpus and preprocessing its text occur simultaneously as well. If the corpus does not exist in the local directory, it is downloaded to ~/Korpora. A single output file named all.train will be created. It is created within output_dir.

korpora lmdata \
  --corpus all \
  --output_dir ~/works/lmdata
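
To make the idea of "extracting only the sentences and saving them in a text file" concrete, here is a minimal sketch. It is not the code behind `korpora lmdata`; the helper name `dump_lm_data` and the toy in-memory corpora are assumptions made for illustration only.

```python
import os
import tempfile

def dump_lm_data(corpora, output_dir, filename="all.train"):
    """Write every non-empty sentence of every corpus to one text file.

    `corpora` maps a corpus name to a list of sentences (hypothetical layout).
    """
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    n_lines = 0
    with open(path, "w", encoding="utf-8") as f:
        for name, sentences in corpora.items():
            for sentence in sentences:
                sentence = sentence.strip()
                if sentence:  # skip empty lines
                    f.write(sentence + "\n")
                    n_lines += 1
    return path, n_lines

# Toy stand-ins for downloaded corpora.
toy_corpora = {
    "corpus_a": ["first sentence", "  second sentence  "],
    "corpus_b": ["", "third sentence"],
}
out_dir = tempfile.mkdtemp()
lm_path, lm_count = dump_lm_data(toy_corpora, out_dir)
```

One sentence per line in a single file is a common input layout for language model training scripts, which is presumably why the command produces one merged file.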

License

Korpora is licensed under the Creative Commons Attribution 4.0 International license (CC BY 4.0). This license applies only to the Korpora package and its accompanying materials.
Its users have the following rights.
- Share : They are free to reproduce, distribute, exhibit, perform, and publicly transmit the material (including changes in the format).
- Adapt : They can remix, transform, and build upon the material for any purpose, even commercially.
Its users have the following obligations. As long as these obligations are fulfilled, the user rights listed above are valid.
- Attribution : They must indicate that they have used Korpora.
- No additional restrictions : For all derivative works of Korpora, they cannot impose a stricter license than CC-BY.
For example, if you have only downloaded and used Korpora, you need to fulfill only the 'attribution' obligation. However, if you are creating and distributing models, documents, or any other derivative works of Korpora, you must fulfill both the 'attribution' and 'no additional restrictions' obligations.
Each corpus is governed by its own license. Please check the license of the corpus before using it!

korpora's Issues

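Modu Corpus application status

Overview

Summarizes the status of our applications for the National Institute of Korean Language's Modu Corpus.

KorNLI

The KorNLI data mentioned in #6
repository: https://github.com/kakaobrain/KorNLUDatasets/tree/master/KorNLI
The data has a (sent1, sent2, label) triplet structure. To reuse the labeled sentence pair format, create the following data class and inherit from it.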

(data snapshot)

sentence1	sentence2	gold_label
그리고 그가 말했다, "엄마, 저 왔어요."	그는 학교 버스가 그를 내려주자마자 엄마에게 전화를 걸었다.	neutral
그리고 그가 말했다, "엄마, 저 왔어요."	그는 한마디도 하지 않았다.	contradiction
그리고 그가 말했다, "엄마, 저 왔어요."	그는 엄마에게 집에 갔다고 말했다.	entailment

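from typing import List, Union

# KorpusData is the base data class defined in the Korpora package.
class LabeledSentencePairKorpusData(KorpusData):
    pairs : List[str]
    labels : List[Union[str, int, float]]  # Optional accepts a single type, so Union is used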

    def __init__(self, texts, pairs, labels):
        if not (len(texts) == len(pairs) == len(labels)):
            raise ValueError('All length of `texts`, `pairs`, `labels` should be same')
        self.texts = texts
        self.pairs = pairs
        self.labels = labels

class KorNLIKorpusData(LabeledSentencePairKorpusData):
    def __init__(self, texts, pairs, labels):
        super().__init__(texts, pairs, labels)

Show license and description when loading or fetching corpus

@ratsgo @hungry-wook

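Acknowledging the effort of those who released their corpora is something we should keep doing, and checking the license when using a corpus is truly important. So how about printing a description that includes the license and reference whenever a corpus is fetched or a Korpus class instance is created?
I also think this feature should be included from 0.1.0.
I believe one goal of the Korpora project is to acknowledge the effort of those who built corpora for the advancement of the NLP community, and I would like this intent to be visible in the package from the first release.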

Korean Hate Speech Corpus

The data mentioned in #6
Paper reference : https://arxiv.org/abs/2005.12503
https://github.com/kocohub/korean-hate-speech

{'comments': '2,30대 골빈여자들은 이 기사에 다 모이는건가ㅋㅋㅋㅋ 이래서 여자는 투표권 주면 안된다. 엠넷사전투표나 하고 살아야지 계집들은',
 'contain_gender_bias': True,
 'bias': 'gender',
 'hate': 'hate',
 'news_title': '"“8년째 연애 중”…‘인생술집’ 블락비 유권♥전선혜, 4살차 연상연하 커플"'}

from typing import List

class KoreanHateSpeechKorpusData(KorpusData):
    gender_biases : List[bool]
    biases : List[str]
    hates : List[str]
    news_titles : List[str]

    def __init__(self, texts, news_titles, gender_biases, biases, hates):
        if not (len(texts) == len(news_titles) == len(gender_biases) == len(biases) == len(hates)):
            raise ValueError('All 5 arguments must be same length')
        self.texts = texts
        self.news_titles = news_titles
        self.gender_biases = gender_biases
        self.biases = biases
        self.hates = hates

Question about how data is managed

Taking the KoreanPetitions data as an example:

Currently, a single instance of the corpus is implemented as KoreanPetition and the data as the KoreanPetitionsData class, and it seems that __getitem__ of KoreanPetitionsData creates the KoreanPetition dataclass on the fly.

If KorpusData held the data as List[KoreanPetition] (or List[CorpusSpecificSingleData]), there would be no need to define a new *Data class for each corpus, which seems a bit easier to maintain. Is there a particular reason for the current implementation?
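
The two layouts being compared can be sketched as follows. This is a simplified illustration, not the package's code: `Petition`, `ColumnStore`, and `RecordStore` are hypothetical names, and the real classes carry more fields.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Petition:  # hypothetical single-example record
    title: str
    text: str

class ColumnStore:
    """Column-wise storage; builds a record on the fly in __getitem__."""
    def __init__(self, titles: List[str], texts: List[str]):
        self.titles = titles
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index: int) -> Petition:
        return Petition(self.titles[index], self.texts[index])

class RecordStore:
    """Row-wise storage; one generic container works for any record type."""
    def __init__(self, records: list):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index: int):
        return self.records[index]

columns = ColumnStore(["t1", "t2"], ["body1", "body2"])
rows = RecordStore([Petition("t1", "body1"), Petition("t2", "body2")])
```

Both containers return equal records for the same index. The column-wise layout makes it cheap to hand out all texts at once, while the row-wise layout avoids writing a container class per corpus; which trade-off the maintainers had in mind is not stated in the issue.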

Function for printing the corpus list

Implement the following feature. Work on it after the current PRs (#66, #67, #68) are merged.

Korpora.corpus_names()

Korean Chat Data

Overview

Add support for the Korean Chat Data released by Youngsook Song
This data is also included in Youngsook Song's awesome korean data

valid test code

Overview

Make all test code on master run correctly

Issues that require modifying every file in the Korpora package

Unify the description format (PR #61)
Remove get_all_xxx functions that can be replaced with __getitem__
Separate the fetch function from __init__ and provide Korpora.fetch('all') (PR #63)
Unify the naming format of files and classes

Functions that call utils.fetch with positional arguments need to be fixed.

A method argument was recently added in the middle of the fetch function's argument list, which broke functions that had been calling fetch with positional arguments.

(now) in korpora_nsmc.py

fetch(info['url'], local_path, 'nsmc', force_download)

In the situation above, either spell out every argument so that what is passed to fetch can be checked,

(desired)

fetch(info['url'], local_path, 'nsmc', info['method'], force_download)

or, more safely, call it with keyword arguments.

(desired)

fetch(
  url=info['url'],
  local_path=local_path,
  ..
)
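
The failure mode described above can be reproduced with stand-in functions. The signatures below are assumptions modeled on the calls shown in this issue, not the real `utils.fetch`; the keyword-only variant (`*` in the signature) is one possible safeguard, not something the issue prescribes.

```python
def fetch_new(url, local_path, corpus_name, method="download", force_download=False):
    """Stand-in signature after `method` was inserted before `force_download`."""
    return {"method": method, "force_download": force_download}

def fetch_keyword_only(url, local_path, corpus_name, *, method="download", force_download=False):
    """Same parameters, but everything after `*` must be passed by keyword."""
    return {"method": method, "force_download": force_download}

# A call written for the old 4-argument order: True was meant as force_download,
# but it now lands silently in `method`.
shifted = fetch_new("https://example.com/data.txt", "data.txt", "nsmc", True)

# Keyword arguments keep working regardless of parameter order.
explicit = fetch_keyword_only(
    "https://example.com/data.txt", "data.txt", "nsmc", force_download=True
)

# With keyword-only parameters, the stale positional call fails loudly instead.
try:
    fetch_keyword_only("https://example.com/data.txt", "data.txt", "nsmc", True)
    rejected = False
except TypeError:
    rejected = True
```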

Print a warning and ask for consent before loading large data

Some datasets are gigabytes in size, which makes them hard to work with in memory. How about having Korpora.load() warn the user and load such data only after the user agrees?
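
A rough sketch of such a consent step is shown below. The function name `load_with_consent`, the 1 GB threshold, and the injectable `ask` callable are all assumptions for illustration; the issue does not specify how the prompt should be implemented.

```python
def load_with_consent(size_bytes, loader, threshold_bytes=1_000_000_000, ask=input):
    """Call `loader()` only if the data is small or the user agrees."""
    if size_bytes >= threshold_bytes:
        size_gb = size_bytes / 1_000_000_000
        answer = ask(f"About {size_gb:.1f} GB will be loaded into memory. Continue? [y/n] ")
        if answer.strip().lower() not in ("y", "yes"):
            return None  # the user declined, so nothing is loaded
    return loader()

# `ask` is replaced by canned answers here so the example runs without a prompt.
small = load_with_consent(10, lambda: ["a"], ask=lambda prompt: "n")
declined = load_with_consent(5_000_000_000, lambda: ["a"], ask=lambda prompt: "n")
accepted = load_with_consent(5_000_000_000, lambda: ["a"], ask=lambda prompt: "y")
```

Passing the prompt function as a parameter also lets test code bypass the interactive question.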

NAVER x Changwon National University NER

The data from the NER competition among the links in #6
https://github.com/naver/nlp-challenge/tree/master/missions/ner
A data loader is provided, and the data has the format below
We expect this data to help us agree on a format for word-level tagging data

1	비토리오	PER_B
2	양일	DAT_B
3	만에	-
4	영사관	ORG_B
5	감호	CVL_B
6	용퇴,	-
7	항룡	-
8	압력설	-
9	의심만	-
10	가율	-

1	이	-
2	음경동맥의	-
3	직경이	-
4	8	NUM_B
5	19mm입니다	NUM_B
6	.	-

1	9세이브로	NUM_B
2	구완	-
3	30위인	NUM_B
4	LG	ORG_B
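
As an illustration of the format above, the following sketch parses blank-line-separated blocks of `index<TAB>word<TAB>tag` lines into sentences. It assumes the fields are tab-separated, as in the snapshot, and the function name `parse_tagged_sentences` is hypothetical; the loader shipped with the dataset may work differently.

```python
def parse_tagged_sentences(raw):
    """Split blank-line-separated blocks into lists of (word, tag) pairs."""
    sentences, current = [], []
    for line in raw.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        _index, word, tag = line.strip().split("\t")
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences

sample = (
    "1\t이\t-\n2\t음경동맥의\t-\n3\t직경이\t-\n4\t8\tNUM_B\n5\t19mm입니다\tNUM_B\n6\t.\t-\n"
    "\n"
    "1\t9세이브로\tNUM_B\n2\t구완\t-\n3\t30위인\tNUM_B\n4\tLG\tORG_B\n"
)
parsed = parse_tagged_sentences(sample)
```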

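CI with a Docker container that includes the data

Overview

As the number of provided datasets grows, CI tests take longer
Run CI in a Docker container in which all data has already been downloaded

Youngsook Song's awesome korean data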

https://github.com/songys/AwesomeKorean_Data

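Implement KcBERT.__getitem__

This feature has not been implemented.

Keep data download information inside the Korpus class

Overview

Data download information currently lives in the DATA_LOCATIONS variable in fetch.py
Change this so that the information lives in the classes that inherit from Korpus
Download information is unique to each dataset, so keeping and managing it in the Korpus class causes no major problem
The advantage is that an external PR adding a dataset only needs to inherit from Korpus and define the download location, preprocessing/cleansing, and so on

Describe the project overview and usage in the README

Write the README for Korpora==0.1.0 (#48)

Provide a custom dataclass for NLI-type data

So that users can load their own NLI-type data other than KorNLI, provide a custom dataclass.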

class CustomLabeledSentencePairKorpus(LabeledSentencePairKorpus):
    def __init__(self, files, ... ):
        # create LabeledSentencePairData class instances according to the file names or prefixes in `files`

Question pair data

Data mentioned in #6; the data format from #10 can be used as is
https://github.com/songys/Question_pair

question1	question2	is_duplicate
밤만 되면 미치겟네.	밥먹기 참 힘드네	1
나 같이 헤어진 경우도 있을까?	나 잘하는 거 맞을까?	1
매일 아침 피곤해	매일 아침 피곤해	0
정말 힘드네	정말. 정말 쉽지가 않네. 이럴 땐 어떡해야 할까	1

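A description of the labels needs to be added to the dataset
- 0 : same meaning
- 1 : different meaning

[Corpus] Korean Parallel Corpora

The parallel corpus mentioned in #6
https://github.com/jungyeul/korean-parallel-corpora

Provide a fetch-only feature

Besides loading data easily, we also need a feature that only downloads the data.
When training a language model, large files are sometimes only downloaded and then used for training directly from disk. For this, features like the following are needed in addition to Korpora.load.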

Korpora.fetch('all')
Korpora.fetch('namuwikitext')

[Corpus] Fix namuwikitext typo

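Hello,
First of all, thank you for starting such a great project.

While fetching the Namuwiki data, I tried to read the data directly from the namuwiki path,
but got a not found error and then noticed that the directory is named namiwiki. (link)

I am not sure whether this was intended, but I am raising it in case namuwiki was misspelled.
I apologize that this issue is not about an actual error.

Thank you again.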

NAMUWIKI_FETCH_INFORMATION = [
    {
        'url': 'https://github.com/lovit/namuwikitext/releases/download/v0.1/namuwikitext_20200302.v0.1.train.zip',
        'destination': 'namiwiki/namuwikitext_20200302.train.zip',
        'method': 'download & unzip'
    },
    {
        'url': 'https://github.com/lovit/namuwikitext/releases/download/v0.1/namuwikitext_20200302.v0.1.test.zip',
        'destination': 'namiwiki/namuwikitext_20200302.test.zip',
        'method': 'download & unzip'
    },
    {
        'url': 'https://github.com/lovit/namuwikitext/releases/download/v0.1/namuwikitext_20200302.v0.1.dev.zip',
        'destination': 'namiwiki/namuwikitext_20200302.dev.zip',
        'method': 'download & unzip'
    }
]

                    dirname = os.path.abspath(f'{root_dir}/namiwiki')
                    self.train = f'Namuwikitext corpus is downloaded. Open local directory {dirname}'
                    print('Continue to load `dev` and `test`')

Sample data loading feature

It would be good to provide a way to load only a sample by limiting the number of items, for when you want to look at just part of a large dataset.
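
One way to load only a sample is to stop reading after a fixed number of lines. The sketch below assumes a line-oriented text file; the helper name `load_sample` and its parameters are hypothetical and not part of the package.

```python
import itertools
import os
import tempfile

def load_sample(path, num_samples):
    """Read at most `num_samples` lines without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in itertools.islice(f, num_samples)]

# Build a throwaway file that stands in for a large corpus.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("\n".join(f"line {i}" for i in range(1000)))

head = load_sample(sample_path, 3)
os.remove(sample_path)
```

Because `itertools.islice` consumes the file lazily, the rest of the file is never read into memory.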

Release Korpora=0.1.0

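Corpora that still need to be added for `0.1.0`

Korean Hate Speech Corpus
- download from github
Question pair data
- download from github
NAVER x Changwon National University NER
- download from github
Namuwiki in wikitext format
- work completed at https://github.com/lovit/namuwikitext
KcBERT training data
- not downloaded from kaggle; the author shared the files as a split archive in a release

Issues that still need to be resolved for `0.1.0`

custom dataclass
Manage the remote download path per corpus
Run the fetch function per file
- check the file size and overwrite the file if it differs from the known size

Modu Corpus

Application status
- Because of the raw corpus usage agreement of the National Institute of Korean Language, the number of data downloads is limited (confirmed by @ratsgo)
- Our application to use the data for dataset downloads was rejected for the reason above
- Downloading from a Python environment after login was also rejected because the web server cannot support it
Only loading through a class, assuming the data has already been downloaded locally, can be supported
Decided to support this in 0.2.0 (@lovit , @ratsgo )

Fix the indentation of items in CORPUS_INFORMATION

In some cases, the items inside xxx_CORPUS_INFORMATION in korpora_xxx.py are not indented by 4 spaces. Fix the indentation across all files.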

Description and license in `Korpus` and `KorpusData`

@ratsgo

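It seems necessary to describe, for each dataset, the characteristics of its attributes, its reference, and so on.
Also, since the license may differ per dataset, I think it is good to state it explicitly.
How about creating properties named description and license and having each class fill them in?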

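from typing import Optional

class KorpusData:
    # description about each train / dev / test data including size ...
    def __init__(self, description: str):
        # stored under a private name; returning self.description from the
        # property itself would recurse forever
        self._description = description

    @property
    def description(self) -> str:
        return self._description

class Korpus:
    # description about all train / dev / test data including reference, size, composition of [train/dev/test] ...
    def __init__(self, description: str, license: Optional[str] = None):
        self._description = description
        self._license = license

    @property
    def description(self) -> str:
        return self._description

    @property
    def license(self) -> Optional[str]:
        return self._license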

Provide a data split feature

For corpora that ship only train data, provide a feature that takes (random seed, ratio) as input and creates subsets
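
A seeded split of this kind could look like the sketch below. The function name `split_data` and the choice to return two lists are assumptions; the issue only states that a random seed and a ratio are the inputs.

```python
import random

def split_data(items, ratio, seed):
    """Shuffle indices with a fixed seed and cut them at `ratio`."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # local generator; global state is untouched
    cut = int(len(items) * ratio)
    first = [items[i] for i in indices[:cut]]
    second = [items[i] for i in indices[cut:]]
    return first, second

texts = [f"sentence {i}" for i in range(10)]
train, dev = split_data(texts, ratio=0.8, seed=42)
```

Using the same seed and ratio reproduces the same subsets, which keeps experiments comparable across runs.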

KcBERT Pre-Training Corpus (Korean News Comments)

https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments
See the kaggle api: https://github.com/Kaggle/kaggle-api

Make KorpusData be subscriptable

usage scenario

from Korpora import NSMC

nsmc = NSMC()
text, label = nsmc.train[0]

Make KorpusData as Iterable

usage scenario

from Korpora import NSMC

nsmc = NSMC()
for text, label in nsmc:
    print(type(text))  # str
    print(type(label))  # int

[Corpus] Wikipedia (kowiki)

https://dumps.wikimedia.org/kowiki/20200901/

Unify the NSMCData format

The NSMCData class does not inherit from the KorpusData class. This needs to be discussed and fixed.

KorSTS

Data mentioned in #6; available from the same repository as #10
In addition to (sent1, sent2, label), the data has genre information

genre	filename	year	id	score	sentence1	sentence2
main-captions	MSRvid	2012test	0000	5.000	안전모를 가진 한 남자가 춤을 추고 있다.	안전모를 쓴 한 남자가 춤을 추고 있다.
main-captions	MSRvid	2012test	0002	4.750	어린아이가 말을 타고 있다.	아이가 말을 타고 있다.
main-captions	MSRvid	2012test	0003	5.000	한 남자가 뱀에게 쥐를 먹이고 있다.	남자가 뱀에게 쥐를 먹이고 있다.

After implementing #10, I propose adding a genre attribute and creating the data class as follows

class KorSTSKorpusData(LabeledSentencePairKorpusData):
    genres : List[str]

    def __init__(self, texts, pairs, labels, genres):
        super().__init__(texts, pairs, labels)
        if len(labels) != len(genres):
            raise ValueError('All length of `texts`, `pairs`, `labels`, `genres` should be same')
        self.genres = genres

Conventions

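Over several days of working together, the package structure has changed several times, and because we work on several corpora at once, conflicts are becoming more likely.
It would be good to discuss some simple conventions, both for code management and for collaborating with others later.
I will open the items for which conventions would be useful as comments below. Let's refine them through edits.

commit conventions

Create a related issue before every commit, and explain and discuss the issue in issue/comments.
Reference the issue in the commit message.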
```
Implement __len__ (#123)
```

branch conventions

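master is used only for deployment and release; dev is used for development.
- For feature work, branch off dev and send a PR as dev < feature.
- On version updates and releases, merge as master < dev.
- Therefore, typo fixes in documents such as the README are also done on dev or a branch under it.
Within a branch, modify only the files related to the issue that branch addresses.
On the petitions#5 branch, only KoreanPetitions should be modified, not NSMC.
If NSMC needs changes, work on them separately on an nsmc branch.
Use rebase to keep the commit history in order.

PR conventions

Write the related issue in the PR template. If two or more issues are related, list them all.
- If there is no issue, create one and record the details there.
- For things that do not need an issue, such as typos, no issue is required; in that case briefly describe the change in the template.
Merge a PR once at least one person has approved it.

Description convention

In the description passed to a Korpus, write the author on the first line, the repo on the second line, and, if there is a related reference, that reference on the third line. If there are several references or they are long, leave a blank line and list them separated by the - symbol. After another blank line, describe the corpus freely. Also, write description and license using 4-space indentation as the baseline.
(KorSTS)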

    Author : KakaoBrain
    Repository : https://github.com/kakaobrain/KorNLUDatasets
    References : 
        - Ham, J., Choe, Y. J., Park, K., Choi, I., & Soh, H. (2020). KorNLI and KorSTS: New Benchmark
           Datasets for Korean Natural Language Understanding. arXiv preprint arXiv:2004.03289.
           (https://arxiv.org/abs/2004.03289)

Readme typo

In the usage examples, Korpus is missing from the names of the imported modules.
KoreanHateSpeech -> KoreanHateSpeechKorpus
KoreanPetitions -> KoreanPetitionsKorpus
KorNLI -> KorNLIKorpus
KorSTS -> KorSTSKorpus
NSMC -> NSMCKorpus

Change NSMCExample to LabeledSentence

(now)

from dataclasses import dataclass

@dataclass
class NSMCExample:
    text: str
    label: int

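A class of this form can be used for both sentence and document classification, so it is appropriate to create LabeledSentence in korpora.py and replace NSMCExample with it

Continuous Integration

Set up the test environment to run on PRs

Write the package license

@ratsgo

We never stated the license of the package itself, which is what really matters. We need to check whether we must follow the licenses of the corpora as is when we do not redistribute them directly but share their original locations and download from there.
Let's choose the least restrictive license, in line with the intent of the package.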

Implement NSMC __getitem__

(now)

nsmc = Korpora.load('nsmc')
nsmc[0]

~/git/Korpora/Korpora/korpora.py in __getitem__(self, index)
     12 
     13     def __getitem__(self, index):
---> 14         raise NotImplementedError('Implement __getitem__')
     15 
     16     def __iter__(self):

NotImplementedError: Implement __getitem__

Provide get_all_pairs and get_all_labels for SentencePair and LabeledSentencePair

Models that use LabeledSentencePair (e.g. KorNLI) are trained on ((sent1, sent2), label) units, so get_all_pairs provides [(sent1, sent2), (sent1, sent2), ... ] and get_all_labels provides [label, label, ...].
- get_all_texts could be misread as returning a list of str, so the method is named get_all_pairs
Similarly, SentencePair also provides get_all_pairs
Classes that inherit from LabeledSentencePairKorpus, like KorNLI, provide get_all_pairs and get_all_labels
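
The accessors described above can be sketched with a small stand-in container. The class below is hypothetical and only assumes, as in the KorNLI proposal, that the first sentences live in `texts` and the second sentences in `pairs`.

```python
class LabeledSentencePairData:
    """Hypothetical container holding sent1 in `texts` and sent2 in `pairs`."""

    def __init__(self, texts, pairs, labels):
        if not (len(texts) == len(pairs) == len(labels)):
            raise ValueError("`texts`, `pairs`, `labels` must have the same length")
        self.texts = texts
        self.pairs = pairs
        self.labels = labels

    def get_all_pairs(self):
        # [(sent1, sent2), (sent1, sent2), ...]
        return list(zip(self.texts, self.pairs))

    def get_all_labels(self):
        # [label, label, ...]
        return list(self.labels)

data = LabeledSentencePairData(["p1", "p2"], ["h1", "h2"], ["entailment", "neutral"])
all_pairs = data.get_all_pairs()
all_labels = data.get_all_labels()
```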

Provide a feature that builds a merged corpus for language model training

It would be good to provide, as a CLI, a feature that takes only the texts from several kinds of Korpus, merges them, and refines them into data usable for training a language model.

Handle the user-consent prompt when loading large data in test code

namuwikitext and kcbert ask the user to agree to loading a large file before loading the text file. Because of this, their test code has to be written differently from that of other korpus.

SentencePair & LabeledSentencePair class

Create a general class usable for both (question, answer) and (sent1, sent2, label) data, then use it in classes that inherit from KorpusData

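Per-file download

For a corpus made up of several files, the whole corpus is downloaded again if even one file is missing.
How about changing the fetch feature so that it re-downloads file by file?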
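
A per-file check could be as simple as filtering the fetch list down to the files that are not on disk yet. The sketch below reuses the `url` / `destination` keys seen in the fetch information elsewhere in these issues; the helper name `missing_files` and the example URLs are made up for illustration.

```python
import os
import tempfile

def missing_files(fetch_infos, root_dir):
    """Return only the entries whose destination file does not exist yet."""
    return [
        info for info in fetch_infos
        if not os.path.isfile(os.path.join(root_dir, info["destination"]))
    ]

root = tempfile.mkdtemp()
infos = [
    {"url": "https://example.com/train.txt", "destination": "toy/train.txt"},
    {"url": "https://example.com/test.txt", "destination": "toy/test.txt"},
]

# Pretend the train file was downloaded earlier.
os.makedirs(os.path.join(root, "toy"))
with open(os.path.join(root, "toy", "train.txt"), "w", encoding="utf-8") as f:
    f.write("already here\n")

to_fetch = missing_files(infos, root)
```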

Allow root_dir=None for NSMC

The policy was changed so that ~/Korpora/ is used when a Korpus is given root_dir=None, but NSMC does not reflect this.

Blue House national petitions archive

https://github.com/lovit/petitions_archive

Unify __iter__ across all classes that inherit from KorpusData

(now)
Every class that inherits from KorpusData has to implement __iter__ itself.

(desired)
In KorpusData, loop len(self) times using __getitem__ and yield each result
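
The desired behavior can be sketched as follows. `KorpusData` here is a stand-in base class written for this example, not the package's actual class, and `ToyLabeledData` is a hypothetical subclass.

```python
class KorpusData:
    """Stand-in base class; subclasses supply __len__ and __getitem__."""

    def __len__(self):
        raise NotImplementedError("Implement __len__")

    def __getitem__(self, index):
        raise NotImplementedError("Implement __getitem__")

    def __iter__(self):
        # One shared implementation: walk the indices and delegate to __getitem__.
        for index in range(len(self)):
            yield self[index]

class ToyLabeledData(KorpusData):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        return self.texts[index], self.labels[index]

toy = ToyLabeledData(["good", "bad"], [1, 0])
items = list(toy)
```

With this in the base class, a subclass only has to define `__len__` and `__getitem__` to become iterable.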

namuwikitext

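wikitext-2 and wikitext-103 are datasets in a multiline text format, built by cleaning Wikipedia text and category names (see reference)
Namuwiki contains a wide variety of documents that include the writing style used online. Process them into wikitext-format data and redistribute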

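Handling downloads cancelled midway

When a data download is cancelled midway,
the file is not downloaded again from the beginning; instead, fetch moves on to the next file.
It seems we should check whether a file has already been downloaded based on its size.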
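
A size-based check might look like the sketch below. It assumes the expected size of each file is known in advance (where that number comes from is not specified in the issue), and the helper name `needs_download` is hypothetical.

```python
import os
import tempfile

def needs_download(path, expected_size):
    """True if the file is absent or its size differs from the expected size."""
    return not os.path.isfile(path) or os.path.getsize(path) != expected_size

work_dir = tempfile.mkdtemp()
partial_path = os.path.join(work_dir, "corpus.txt")

# Pretend the download stopped after 5 of 10 bytes.
with open(partial_path, "wb") as f:
    f.write(b"12345")
partial = needs_download(partial_path, expected_size=10)

# After a complete download the sizes match.
with open(partial_path, "wb") as f:
    f.write(b"1234567890")
complete = needs_download(partial_path, expected_size=10)

absent = needs_download(os.path.join(work_dir, "missing.txt"), expected_size=10)
```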

Remove URLs from namuwikitext

The same issue was raised in lovit/namuwikitext#8. Once that issue is handled, the namuwikitext remote url in Korpora needs to be updated.
