textmining-tutorial's Introduction

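(Korean) A tutorial for text mining

These are materials for studying text mining. They include natural language processing and machine learning materials that apply regardless of language, as well as materials specific to Korean text analysis.

  • These materials are a work in progress and include slides and Jupyter notebook example code.
  • These materials use the soynlp package, a natural language processing library for Korean text analysis. soynlp is also a work in progress.
  • Texts related to the slide contents are being posted on the blog.
  • The practice code is in the code repository.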

Contents

  1. Python basic
    1. jupyter tutorial
  2. From text to vector (KoNLPy)
    1. n-gram
    2. from text to vector using KoNLPy
  3. Word extraction and tokenization (Korean)
    1. word extractor
    2. unsupervised tokenizer
    3. noun extractor
    4. dictionary-based POS tagger
  4. Document classification
    1. Logistic Regression and Lasso regression
    2. SVM (linear, RBF)
    3. k-nearest neighbors classifier
    4. Feed-forward neural network
    5. Decision Tree
    6. Naive Bayes
  5. Sequential labeling
    1. Conditional Random Field
  6. Embedding for representation
    1. Word2Vec / Doc2Vec
    2. GloVe
    3. FastText (word embedding using subword)
    4. FastText (supervised word embedding)
    5. Sparse Coding
    6. Nonnegative Matrix Factorization (NMF) for topic modeling
  7. Embedding for vector visualization
    1. MDS, ISOMAP, Locally Linear Embedding, PCA, Kernel PCA
    2. t-SNE
    3. t-SNE (detailed)
  8. Keyword / Related words analysis
    1. co-occurrence based keyword / related word analysis
  9. Document clustering
    1. k-means is good for document clustering
    2. DBSCAN, hierarchical, GMM, BGMM are not appropriate for document clustering
  10. Finding similar documents (neighbor search)
    1. Random Projection
    2. Locality Sensitive Hashing
    3. Inverted Index
  11. Graph similarity and ranking (centrality)
    1. SimRank & Random Walk with Restart
    2. PageRank, HITS, WordRank, TextRank
    3. kr-wordrank keyword extraction
  12. String similarity
    1. Levenshtein / Cosine / Jaccard distance
  13. Convolutional Neural Network (CNN)
    1. Introduction to CNN
    2. Word-level CNN for sentence classification (Yoon Kim)
    3. Character-level CNN (LeCun)
    4. BOW-CNN
  14. Recurrent Neural Network (RNN)
    1. Introduction to RNN
    2. LSTM, GRU
    3. Deep RNN & ELMo
    4. Sequence to sequence & seq2seq with attention
    5. Skip-thought vector
    6. Attention mechanism for sentence classification
    7. Hierarchical Attention Network (HAN) for document classification
    8. Transformer & BERT
  15. Applications
    1. soyspacing: heuristic Korean space correction
    2. CRF-based Korean space correction
    3. HMM & CRF-based part-of-speech tagger (morphological analyzer)
    4. semantic movie search using IMDB
  16. TBD

Thanks to

Many kind colleagues review these materials and discuss them with me. I am especially grateful to 태욱, who has put a great deal of time and care into helping.

textmining-tutorial's People

Contributors

lovit

textmining-tutorial's Issues

Problem while following the tutorial

Hello,
I came here because I want to analyze Korean text.
Thank you for the great package.
I am following the nounextractor-v2_usage.ipynb tutorial,
and the third cell,
"%%time
noun_extractor = LRNounExtractor_v2(verbose=True, extract_compound=True)
noun_extractor.train(sents)
nouns = noun_extractor.extract()"
produces the following error:
[Noun Extractor] use default predictors
[Noun Extractor] num features: pos=3929, neg=2321, common=107
[Noun Extractor] counting eojeols
local variable 'f' referenced before assignment
What should I do?
Thank you.
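
For context, the last log line has the wording that older Python versions use for an `UnboundLocalError`: a function reads a local name on a code path where that name was never assigned. The sketch below reproduces that class of error in isolation. The functions `count_lines` and `count_lines_fixed` are hypothetical and written only for illustration; they are not soynlp code, and the log above does not show which line inside the library is responsible.

```python
def count_lines(lines):
    """Hypothetical function: 'f' is bound only inside the loop body."""
    for line in lines:
        f = len(line)
    # If 'lines' is empty the loop body never runs, so 'f' has no value here.
    return f


def count_lines_fixed(lines):
    """Same logic, but 'f' is given a value on every code path."""
    f = 0
    for line in lines:
        f = len(line)
    return f


try:
    count_lines([])
except UnboundLocalError as err:
    # The exact wording of the message depends on the Python version.
    print(type(err).__name__, "-", err)

print(count_lines_fixed([]))
```

The general remedy for this class of error is the one shown in `count_lines_fixed`: give the name a value before any branch or loop that might be skipped. Whether that applies here depends on the library code path, which would need a full traceback to identify.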
