Project-Thematic-Investments

Predicting high-potential stock themes from news data using Python and NLP

Summary

  • Used news data from Naver
  • Identified meaningful stock themes mentioned in daily news over the past year
  • Collected 200 news articles for each stock theme
  • Trained a Word2Vec model on the collected articles
  • Used cosine similarity to predict the theme of a news article

1. Background

What is thematic investing?

  • Thematic investing is a form of investment which aims to identify macro-level trends, and the underlying investments that stand to benefit from the materialisation of those trends
  • A stock theme is a particular group of stocks that share a similar trend or trait

Hypothesis

  1. Individual investors have limited access to information and therefore rely on the news for it. As a result, stock themes and the news are strongly correlated.
  2. When the news reports an event related to a particular stock theme, some time passes before that theme's price changes noticeably.
  3. If a particular stock theme is mentioned too often in the news, its price has already been affected.

2. Data preprocessing

Crawling Naver news data using BeautifulSoup

  1. Stock themes: Searched for stock themes that were frequently mentioned in the news. Removed themes that were rarely referenced or too specific in meaning to be useful. A total of 168 themes were finalized. Each theme consists of a list of corporations belonging to that theme.
  2. News data: 200 news articles were crawled for each of the 168 themes. Unusable items (photo news, video news) were manually removed from the dataset.

3. Model training

KoBERT tokenizer + Word2Vec + cosine-similarity
Model train file (wv_model_train.ipynb)

KoBERT tokenizer: Developed by SKTBrain
Word2Vec: Represents words as vectors

Model parameters:
vector dimension = 300, window = 8

Model (Architecture.ipynb)

  • Summed all word vectors in a news article to form a representation of that article
  • Generated each theme representation by summing the representations of all 200 of its news articles
  • The summed vectors were not normalized, on the reasoning that more information leads to more accurate representations
  • When a news article is given as input, the model vectorizes it and uses cosine similarity to find and return the most similar theme
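The representation and lookup steps above can be sketched in plain Python. This is a minimal illustration, not the code from Architecture.ipynb: the function names, the toy 3-dimensional vectors, and the theme names `ev` and `bio` are all hypothetical, and the real model uses 300-dimensional Word2Vec vectors over KoBERT tokens.

```python
import math


def sum_vectors(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def article_vector(tokens, word_vectors):
    """Sum the vectors of all known tokens; None if no token is known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return sum_vectors(known) if known else None


def most_similar_theme(tokens, word_vectors, theme_vectors):
    """Return (theme, similarity) for the theme closest to the article."""
    vec = article_vector(tokens, word_vectors)
    if vec is None:
        return None, 0.0
    scores = {name: cosine_similarity(vec, tv) for name, tv in theme_vectors.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy data standing in for trained Word2Vec vectors.
TOY_WORD_VECTORS = {
    "battery": [1.0, 0.1, 0.0],
    "electric": [0.9, 0.2, 0.1],
    "vaccine": [0.0, 1.0, 0.2],
    "clinic": [0.1, 0.9, 0.0],
}

# Each theme vector is the sum of its article vectors (two articles per theme here).
TOY_THEME_VECTORS = {
    "ev": sum_vectors([
        article_vector(["battery", "electric"], TOY_WORD_VECTORS),
        article_vector(["electric"], TOY_WORD_VECTORS),
    ]),
    "bio": sum_vectors([
        article_vector(["vaccine", "clinic"], TOY_WORD_VECTORS),
        article_vector(["vaccine"], TOY_WORD_VECTORS),
    ]),
}

theme, score = most_similar_theme(["battery", "launch"], TOY_WORD_VECTORS, TOY_THEME_VECTORS)
print(theme, round(score, 3))
```

Cosine similarity depends only on the direction of the vectors, so scaling a summed theme vector by any positive factor leaves the comparison unchanged.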

4. Testing

Model

  • Input: News from the current day (approximately 2,000 articles each for the IT, economy, society, lifestyle, international, and politics sections)
  • Output: A list of themes, with the corporations belonging to each, that are considered to have high potential
  • The model finds the most similar theme for each news article and counts how often each theme appears. A match is counted only when the similarity is higher than 95%.
  • Once all of the input data is processed, the model generates a list of the themes whose count is less than 5 (hypothesis 3).
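The counting and filtering rule above might look like the following sketch. The function name and the `(theme, similarity)` input format are assumptions made for illustration. The sketch also assumes that a theme must be matched at least once to appear in the output, which the description does not state explicitly.

```python
from collections import Counter


def select_candidate_themes(matches, threshold=0.95, max_mentions=5):
    """Pick themes that are matched confidently but not too often.

    matches: iterable of (theme, similarity) pairs, one per news article.
    A match is counted only if its similarity is strictly above `threshold`;
    a theme is kept only if its count is strictly below `max_mentions`.
    """
    counts = Counter(theme for theme, sim in matches if sim > threshold)
    return sorted(theme for theme, n in counts.items() if n < max_mentions)


# Hypothetical matches: "ev" appears twice, "bio" six times, "game" below threshold.
example_matches = [("ev", 0.97)] * 2 + [("bio", 0.99)] * 6 + [("game", 0.90)]
print(select_candidate_themes(example_matches))
```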

Market testing

  • For each theme, select the corporation with the highest market capitalization among those whose price fluctuation is less than 5%.
  • Calculate profit with the following rules.
    • Sell when a stock's price increases by more than 10%
    • Sell when a stock's price decreases by more than 5%
    • If neither condition is met, sell 5 days after purchase
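The selling rules can be expressed as a small simulation. This is a sketch under stated assumptions: the rules are checked once per day against a daily closing price, the function and parameter names are hypothetical, and transaction costs are ignored. The original test may have evaluated prices differently.

```python
def simulate_trade(buy_price, daily_closes, take_profit=0.10, stop_loss=0.05, max_days=5):
    """Return the fractional return of one trade under the three selling rules.

    daily_closes: closing prices for the days after purchase, in order.
    Sells on the first day the gain exceeds `take_profit` or the loss
    exceeds `stop_loss`; otherwise sells on day `max_days` (or on the
    last available day if fewer prices are given).
    """
    if not daily_closes:
        raise ValueError("daily_closes must contain at least one price")
    change = 0.0
    for close in daily_closes[:max_days]:
        change = (close - buy_price) / buy_price
        if change > take_profit or change < -stop_loss:
            break  # take-profit or stop-loss triggered
    return change


# Hypothetical price paths for a stock bought at 100.
print(simulate_trade(100, [102, 111, 90]))            # sold on day 2 (gain above 10%)
print(simulate_trade(100, [98, 94, 120]))             # sold on day 2 (loss above 5%)
print(simulate_trade(100, [101, 102, 103, 104, 105, 200]))  # held for 5 days
```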

Result

News data from 7/1

  • Profit after 5 days: 2.23%
