
knowledge-graph-tutorials-and-papers's Introduction

Research Topics in Knowledge Graph


Papers and Materials from All Areas

  • Note 1: The papers from the database/data science communities are marked with 🌟.
  • Note 2: In Nov 2023, I started adding a new section related to LLMs under each topic. 🔥

1. Knowledge Extraction and Integration to Construct KGs

  • Knowledge Base Construction (Demo or System) [link]

  • About Domain-Specific Knowledge Bases [link]

  • About Multi-Modal Knowledge Graph (MMKG) [link]

  • Named Entity Recognition, Entity Extraction and Entity Typing [link]

  • Coreference Resolution [link]

  • Entity Linking and Entity Disambiguation [link]

  • Entity Resolution, Entity Matching and Entity Alignment [link]

  • General Relation Extraction [link]

  • General Information Extraction and Open Information Extraction [link]

  • Relation Linking and Relation Disambiguation [link]

2. Mining and Refinement of KGs

  • Knowledge Graph Embedding, Learning, Reasoning, Rule Mining, and Path Finding [link]

  • Knowledge Base Refinement (Incompleteness, Incorrectness, and Freshness) [link]

  • Knowledge Fusion, Cleaning, Evaluation and Truth Discovery [link]

3. Applications Supported by KGs

  • Knowledge Graph Question Answering (KGQA) [link]

  • Knowledge Graph Recommendation [link]

  • Knowledge Graph Enhanced Machine Learning [link]

  • Knowledge Graphs and Large Language Models (LLMs) [link] 🔥🔥🔥

4. Schema and Query of KGs

  • Knowledge Graph Representation (RDF and Property Graph), Schema and Query [link]
  • Knowledge Graph Taxonomy Construction and Improvement [link] (TBD)

5. Others

Papers and Materials from the Database Communities

Note: Papers from SIGMOD/VLDB/ICDE/KDD/TKDE/VLDBJ

Tutorials and Notes from Talented People

Tutorials, Discussion, Courses and Communities

  1. An introduction to knowledge graph and knowledge extraction from unstructured text. [Link]
  2. Information Extraction by Niranjan Balasubramanian {Slides in my Mac}
  3. CS 520 - Knowledge Graphs (seminar) - provided by Stanford
  4. OpenKG.cn

GitHub Repos that Summarize the Papers/Projects/Data related to Knowledge Graphs

  1. A Collection of KG Surveys, Papers (WWW+ACL+AAAI) and Data [GitHub]
  2. KG SOTA [GitHub]
  3. Awesome KG tutorials/papers/projects/communities [GitHub]
  4. Knowledge Graph Construction (from zero to everything, in Chinese) [GitHub]
  5. KG SOTA (Chinese) [Zhihu]
  6. Tracking Progress in Natural Language Processing [GitHub]
  7. KG Embedding SOTA [GitHub]
  8. Entity Related Papers [GitHub]
  9. Information Extraction Resources [GitHub]
  10. KGQA [Giters]
  11. Open-Environment Knowledge Graph Construction and Reasoning: Challenges, Approaches, and Opportunities [GitHub]
  12. KG-LLM-Papers [Link]
  13. Awesome LLM-KGs [Link]

Tutorials and Notes of Other Related Insightful Topics

  1. Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing [Slides]
  2. Introduction to Conditional Random Fields [Blog]
  3. Network Community Detection: A Review and Visual Survey [Paper]
  • Section 2.3. Community Detection Techniques
    1. Fast unfolding of communities in large networks [Paper]
  4. A compendium of NP optimization problems [Paper]
  5. [Notes about LSH]
  6. [Survey about Min Hash Sketch]
  7. MinHash Tutorial with Python Code: [Notes] [Code]
  8. Must-read papers on GNN [GitHub]
  9. Graph-based deep learning literature [GitHub]
  10. Data Management for Machine Learning Applications [Course site]
  11. Stanford CS224W: Machine Learning with Graphs [Course site]
  12. Explainability for Natural Language Processing (AAAI 2020 tutorial) [Link] [Video]
  13. Graph Mining & Learning (NeurIPS 2020 tutorial) [Link]
  14. Discussion about GNN (Chinese) [Link]
  15. Stanford CS224n: Natural Language Processing with Deep Learning [Course site]
  16. Clique Relaxation Models in Networks: Theory, Algorithms, and Applications [Slides]
  17. KG Applications in Baidu (Chinese) [Link]
  18. Paper Digest (Database area) [Link]
  19. Complex Network (Collection of Notes and Tutorials) [GitHub]
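As a companion to the LSH/MinHash items above, here is a minimal MinHash sketch in pure Python (stdlib only; the salted-MD5 hash family and the example token sets are illustrative choices, not taken from any linked tutorial):

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: for each salted hash function, keep the
    minimum hash value over the token set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"knowledge", "graph", "entity", "linking"}
b = {"knowledge", "graph", "entity", "resolution"}
est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(round(est, 2))  # close to the true Jaccard similarity 3/5 = 0.6
```

The estimate's accuracy improves with `num_hashes`; LSH then bands these signatures to find candidate pairs without all-pairs comparison.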

Useful Tools or APIs

Named Entity Recognition and Entity Linking

  1. TagMe [Python API] [API] [GitHub1] [GitHub2]
  2. Stanford NER [Link]
  3. DBpedia Spotlight [Link]
  4. NLTK Tagger [Link]
  5. SpaCy [Link1] [Link2]
  6. EARL (including Relation Linking) [Link]
  7. Falcon (including Relation Linking) [DBpedia version] [Wikidata version]
  8. MonkeyLearn [Link]
  9. GERBIL - General Entity Annotator Benchmark [Link]
  10. PIKES [Link]

Benchmark Datasets

  1. Entity Disambiguation:
  • MSNBC and ACE2004 [Link]
  2. QA:
  • WebQuestions
  • QA datasets summary [GitHub]
  3. Entity Resolution [GitHub]
  4. KGE, KBC and KG Reasoning

Other Useful Tools

  1. From Freebase to Wikidata: The Great Migration [Paper and useful links]
  2. SPARQL tutorial [Link]
  3. Installing and running ElasticSearch [Link]
  4. Open KG on COVID-19 [Link]
  5. BOOKNLP [Link] (pronominal coreference resolution; a natural language processing pipeline that scales to books and other long documents, in English)
  6. Wikidata Integrator [GitHub]
  7. OpenTapioca [Link]
  8. Grakn KGLIB (Knowledge Graph Library) [GitHub]
  9. SPASQL server on Freebase [GitHub] [About VOS]
  10. LATEX Code Search [Link]

knowledge-graph-tutorials-and-papers's People

Contributors

heathersherry, liangke23, zxlzr


knowledge-graph-tutorials-and-papers's Issues

Improving Entity Linking through Semantic Reinforced Entity Embeddings (ACL 2020)

Source

Motivation

  • Fine-grained semantic types of entities let linking models learn contextual commonality in semantic relatedness.

Contribution

  • FGS2EE (the proposed method) uses the word embeddings of semantic words that represent the hallmarks of entities (e.g., writer, carmaker) to generate semantic embeddings.
  • Training converges faster when using semantic reinforced entity embeddings.

Key Idea in Method

  • (i) creating a dictionary of fine-grained semantic words; (ii) extracting semantic type words from each entity’s Wikipedia article; (iii) generating a semantic embedding for each entity; (iv) combining semantic embeddings with existing embeddings through linear aggregation.
  • Heuristically extract fine-grained semantic types from the first sentence of Wikipedia articles.
  • Fine-grained semantic words appear frequently as appositions (e.g., "Defense contractor Raytheon"), coreferences (e.g., "the company"), or anonymous mentions (e.g., "American defense firms"). These fine-grained entity types help capture local contexts and relations of entities.
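The averaging and linear-aggregation steps could look like the following sketch (pure Python; the function names and the mixing weight `alpha` are assumptions for illustration, not the paper's exact coefficients):

```python
def semantic_embedding(semantic_words, word_vecs):
    """Average the word embeddings of an entity's semantic type words
    (e.g., 'writer', 'carmaker') to form its semantic embedding."""
    dim = len(next(iter(word_vecs.values())))
    vecs = [word_vecs[w] for w in semantic_words if w in word_vecs]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def reinforce(entity_vec, sem_vec, alpha=0.5):
    """Linear aggregation of the existing entity embedding with the
    semantic embedding; alpha is an assumed mixing weight."""
    return [(1 - alpha) * e + alpha * s for e, s in zip(entity_vec, sem_vec)]

word_vecs = {"writer": [1.0, 0.0], "carmaker": [0.0, 1.0]}
sem = semantic_embedding(["writer", "carmaker"], word_vecs)  # [0.5, 0.5]
print(reinforce([0.2, 0.8], sem))  # approximately [0.35, 0.65]
```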

[Notes] Text Tokenization in LLM

Text tokenization can be divided into three stages:

  • Stage 1: the original subword-based algorithms (e.g., BPE, WordPiece, Unigram, SentencePiece; note that earlier BPE was already byte-level);

  • Stage 2: the refined subword-based algorithms (mainly the new BPE method proposed in February 2024 by Karpathy after he left OpenAI: https://github.com/karpathy/minbpe . At a quick look, minbpe reads more like a tutorial, offering minimal and clean code);

  • Let's build the GPT Tokenizer (Video by Karpathy): https://www.youtube.com/watch?v=zduSFxRajkE
  • A larger vocabulary means fewer tokens (note: the vocabulary size is a hyperparameter, roughly 100,000 in GPT-4) and a larger compression ratio
  • Forced splits using re (as in GPT-2, to avoid vocabulary entries such as dog., dog?, dog!)
  • tiktoken, officially provided by OpenAI; you can choose gpt2 (does not merge spaces) or cl100k_base (GPT-4, merges spaces)
  • There are special tokens, e.g., <|endoftext|> in GPT-2; if you like, you can also add special tokens such as FIM_PREFIX, FIM_SUFFIX, ENDOFPROMPT...
  • minbpe: you can follow [exercise.md] for the 5 steps required to build a tokenizer.
  • SentencePiece: it can both train and run inference for BPE tokenizers (used in the Llama and Mistral series) - note: it has a lot of training arguments
  • Note the SentencePiece options byte_fallback = True and add_dummy_prefix = True
  • About vocab_size: a very important hyperparameter
  • Learning to Compress Prompts with Gist Tokens (Stanford, NeurIPS 2023) - introduces new tokens that compress prompts into a smaller set of "gist" tokens while keeping model performance the same as with the full long prompts.
  • Other modalities also benefit from this tokenization (so you can use the same transformer architecture on them)! e.g., (1) VQGAN - both hard tokens (integers) and soft tokens (which do not have to be discrete, but pass through bottlenecks as in autoencoders)
  • (Sora) you can process either discrete tokens with an autoregressive model or soft tokens with diffusion models.

  • Stage 3: even more advanced text tokenization, represented by MEGABYTE, https://github.com/lucidrains/MEGABYTE-pytorch (a new tokenization method from Meta. In their own words: MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling)
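The byte-level BPE procedure described in these notes can be sketched minimally (a toy version in the spirit of minbpe; function names and the tiny training corpus are illustrative):

```python
from collections import Counter

def get_pair_counts(ids):
    """Count every adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Byte-level BPE: start from raw bytes (ids 0..255) and repeatedly
    merge the most frequent adjacent pair into a new token id."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges

def encode(text, merges):
    """Apply learned merges in training order (dicts keep insertion order)."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

merges = train_bpe("low lower lowest low low", num_merges=5)
print(encode("low low", merges))  # far fewer ids than the 7 raw bytes
```

A larger `num_merges` grows the vocabulary and shrinks the encoded sequence, which is exactly the vocabulary-size/compression-ratio trade-off noted above.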

[Notes] A discussion about KG and LLM (Chinese)

A long-form discussion of the relationship between large models and knowledge graphs (Chinese) [Link]

Q3: Now that we have these large models, are knowledge graphs still meaningful?

  • Knowledge graphs remain meaningful, because neural networks currently cannot, or can only with great difficulty, guarantee factual accuracy (Google LaMDA, DeepMind)
  • Expert knowledge in vertical domains
  • The freer, GPT-style interaction can be fused more deeply with in-domain expert knowledge. In the future it may also become possible to encode knowledge graphs directly into the large model's representation space.
  • Large models have huge parameter counts, so they cannot be retrained repeatedly and suffer from poor freshness

Q4: A known problem of large models is that their answers lack factual verification. Can knowledge graphs help solve this, just as search engines once used knowledge linking to assist search and provide citations?

  • One option is to offload knowledge: there is no need to cram large amounts of factual knowledge into the model itself. Facts can live in an external store, paired with the model via something like a vector database or a nearest-neighbor-retrieval reader, so that the system as a whole keeps the same capability while freshness and verification are handled entirely by the external database system. Updating a large model, or performing neural knowledge editing, is still very expensive; which part to edit remains a black box, and edits may hurt performance on other tasks. It is therefore better to keep the knowledge external, in a database (or data warehouse, or data lake), and rely on accumulated data governance and knowledge verification for support. From an engineering-deployment perspective, this is more feasible and its costs are more controllable.
  • When controllable knowledge or controllable logic is required, knowledge graphs play the larger role; when freer interaction, task understanding, and generation are needed, large models work better.

Q5: How can knowledge graphs be used in large models? (Summary)

1. Convert the knowledge graph into text and use it as part of the training corpus (or as a particularly high-quality corpus).
	○ Drawback: for today's large models the training data is already enormous, so augmenting the pre-training data with knowledge from a KG is unnecessary; the data is already large enough to contain that knowledge.
2. Preserve the structured information of the knowledge graph itself and add it to the large model as special content or structure during training.
	○ Drawback: not particularly easy, since it imposes requirements on the model architecture; the GPT series, at least, is unlikely to adopt such structure. Converting the KG into part of the text corpus in some way is probably more suitable.
3. Fine-tune:
	○ Drawback: fine-tuning is very hard at the scale of GPT-4; its own technical report notes that fine-tuning is costly.
4. External retrieval (like the langchain + neo4j/networkx approach discussed this morning)
	○ Motivation: you do not need to machine-learn Obama's birthday every time you need it; that costs a lot and does not guarantee correctness. Just query it.
	○ Google LaMDA / New Bing / Meta: use the knowledge graph as an external knowledge base for retrieval, a pure plug-in. When the model needs to state facts and reason over them, the fact-stating part consults the knowledge graph, and the reasoning part uses the language model.
	○ External retrieval has some query disambiguation issues, which can be addressed with traditional graph-QA methods.
5. Retrieval-augmented language modeling (named among the top-ten breakthrough directions in NLP for 2021 and 2022)
	○ Quite similar to 4. As large-model parameter counts keep growing, an external knowledge graph helps address factual consistency and freshness.
	○ Examples: DeepMind RETRO, OpenAI WebGPT, Google LaMDA, FAIR Blender
6. Other smaller directions: (1) fuse the chain relations of the knowledge graph with the pre-trained model; (2) add the knowledge graph as a constraint on training by introducing an extra task, e.g., keep pre-training as an independent task, separately train a representation-learning model on the knowledge graph, and let the two models interact. The KG-based representation-learning task then serves as an auxiliary task that improves the pre-trained model's performance.
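The external-retrieval idea in item 4 above can be sketched as a toy pipeline (every name here — the dict-based triple store standing in for Neo4j/networkx, the keyword router, the template standing in for the LLM — is a hypothetical illustration):

```python
# A toy triple store standing in for an external KG such as Neo4j or networkx.
KG = {
    ("Barack Obama", "born_on"): "1961-08-04",
    ("Barack Obama", "born_in"): "Honolulu",
}

def retrieve_fact(subject, relation):
    """Fact lookup is offloaded to the external store, not the model."""
    return KG.get((subject, relation))

def answer(question):
    """Hypothetical router: map the question to (subject, relation),
    fetch the fact from the KG, and verbalize it with a template
    (a real system would use an LLM for parsing and generation)."""
    if "born" in question and "Obama" in question:
        fact = retrieve_fact("Barack Obama", "born_on")
        if fact is not None:
            return f"Barack Obama was born on {fact}."
    return "I don't know."

print(answer("When was Obama born?"))  # Barack Obama was born on 1961-08-04.
```

The design point is the one made in the discussion: the fact never lives in model weights, so updating it is a single write to the store, with no retraining or knowledge editing.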
