yuchenlin / rebiber

A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

Home Page: https://yuchenlin.xyz/

License: MIT License

Python 96.40% Shell 3.60%

bibtex latex publication research-paper bibliography natural-language-processing machine-learning

rebiber's Introduction

Rebiber: A tool for normalizing bibtex with official info.

We often cite papers using their arXiv versions without noticing that they have already been PUBLISHED at a conference. Such unofficial bib entries may violate the submission or camera-ready rules of some conferences. We introduce Rebiber, a simple Python tool that fixes them automatically. It is based on the official conference information from DBLP or the ACL anthology (for NLP conferences)! You can check the list of supported conferences here. Apart from handling outdated arXiv citations, Rebiber also normalizes citations in a unified way (DBLP-style), supporting abbreviation and value selection.

Demo on Huggingface Space https://huggingface.co/spaces/yuchenlin/Rebiber (recommended)

Colab notebook: here

Changelog

2023.06.01 New demo ready to use on Huggingface's Space via Gradio. Also, a few conferences are added.
2021.09.06 We fixed a few minor bugs and added features such as sorting and urls to arXiv (if the paper is not in any conferences; thanks to @nicola-decao). We also updated the ACL anthology bib/json to the latest version as well as other conferences.
2021.05.30 We built a beta version of our web app for Rebiber, added new conferences to our dataset, and fixed a few minor bugs. (The web app no longer works. Please use the new Huggingface Space demo.)
2021.02.08 We now support multiple useful features: 1) turning off certain values, e.g., "-r url,pages,address" for removing those values from the output, 2) using abbr. to shorten the booktitle values, e.g., Proceedings of the .* Annual Meeting of the Association for Computational Linguistics --> Proc. of ACL. More examples are here.
2021.01.30 We built a colab notebook as a simple web demo. link

Installation

# pip install rebiber -U # for the stable version
pip install -e git+https://github.com/yuchenlin/rebiber.git#egg=rebiber -U
# rebiber --update  # (optional) update the bib data and the abbr. info  (using wget)

git clone https://github.com/yuchenlin/rebiber.git
cd rebiber/
pip install -e .

If you would like to use the latest github version with more bug fixes, please use the second installation method.

Usage (v1.1.3)

Normalize your bibtex file with the official conference information:

rebiber -i /path/to/input.bib -o /path/to/output.bib

You can find a pair of example input and output files in rebiber/example_input.bib and rebiber/example_output.bib.

argument	usage
`-i`	or `--input_bib`. The path to the input bib file that you want to update
`-o`	or `--output_bib`. The path to the output bib file that you want to save. If you don't specify any `-o` then it will be the same as the `-i`.
`-r`	or `--remove`. A comma-separated list of value names that you want to remove, such as "-r pages,editor,volume,month,url,biburl,address,publisher,bibsource,timestamp,doi". Empty by default.
`-s`	or `--shorten`. A bool argument that is `"False"` by default. When set to `"True"`, `booktitle` is replaced with its abbreviation from the `-a` file. Used as `-s True`.
`-d`	or `--deduplicate`. A bool argument that is `"True"` by default, used for removing the duplicate bib entries sharing the same key. Used as `-d True`.
`-l`	or `--bib_list`. The path to the list of the bib json files to be loaded. Check rebiber/bib_list.txt for the default file. Usually you don't need to set this argument.
`-a`	or `--abbr_tsv`. The list of conference abbreviation data. Check rebiber/abbr.tsv for the default file. Usually you don't need to set this argument.
`-u`	or `--update`. Update the local bib-related data with the latest Github version.
`-v`	or `--version`. Print the version of current Rebiber.
`-st`	or `--sort`. A bool argument that is `"False"` by default, which keeps the original order of the bib entries in the input file. Setting it to `"True"` orders the bib entries alphabetically in the output file. Used as `-st True`.
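
For scripted use, the flags in the table above can be assembled into a command line from Python. This is only a minimal sketch and not part of Rebiber: `build_rebiber_cmd` and `run_rebiber` are hypothetical helper names, and the sketch assumes the `rebiber` executable is on your PATH.

```python
import subprocess


def build_rebiber_cmd(input_bib, output_bib=None, remove=(), shorten=False, sort=False):
    """Assemble a rebiber command line from the flags documented in the table."""
    cmd = ["rebiber", "-i", input_bib]
    if output_bib is not None:
        cmd += ["-o", output_bib]          # without -o, rebiber overwrites the input
    if remove:
        cmd += ["-r", ",".join(remove)]    # comma-separated list of value names
    if shorten:
        cmd += ["-s", "True"]
    if sort:
        cmd += ["-st", "True"]
    return cmd


def run_rebiber(*args, **kwargs):
    """Run rebiber and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_rebiber_cmd(*args, **kwargs), check=True)
```

Calling `run_rebiber("refs.bib", "refs_clean.bib", remove=["url", "pages"], shorten=True)` would then correspond to `rebiber -i refs.bib -o refs_clean.bib -r url,pages -s True`.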

Example Input and Output

An example input entry with the arXiv information (from Google Scholar or somewhere):

@article{lin2020birds,
	title={Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models},
	author={Lin, Bill Yuchen and Lee, Seyeon and Khanna, Rahul and Ren, Xiang},
	journal={arXiv preprint arXiv:2005.00683},
	year={2020}
}

An example normalized output entry with the official information:

@inproceedings{lin2020birds,
    title = "{B}irds have four legs?! {N}umer{S}ense: {P}robing {N}umerical {C}ommonsense {K}nowledge of {P}re-{T}rained {L}anguage {M}odels",
    author = "Lin, Bill Yuchen  and
      Lee, Seyeon  and
      Khanna, Rahul  and
      Ren, Xiang",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.557",
    doi = "10.18653/v1/2020.emnlp-main.557",
    pages = "6862--6868",
}

Supported Conferences

The bib_list.txt contains a list of converted json files of the official bib data. In this repo, we now support the full ACL anthology, i.e., all papers that are published at *CL conferences (ACL, EMNLP, NAACL, etc.) as well as workshops. Also, we support any conference proceedings that can be downloaded from DBLP, for example, ICLR2020.

Note that DBLP only allows you to download in batches of 1000 using &h=1000&f=0, where f=0|1000|2000 specifies the starting index. So we have to manually download the bib files of each conference and concatenate them together. add_conf.sh takes care of that, too.
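
The batching described above can be sketched as a list of paged URLs. This is an illustration, not what add_conf.sh does: `dblp_batch_urls` is a hypothetical helper, and the query form `q=conf/{conf}/{year}` with `format=bib` is an assumption taken from a user-contributed script in the issues, not from official documentation.

```python
def dblp_batch_urls(conf, year, max_batches=5, batch_size=1000):
    """Build one DBLP search URL per batch, advancing the start index f each time."""
    base = "https://dblp.org/search/publ/api"
    return [
        f"{base}?q=conf/{conf}/{year}&h={batch_size}&f={i * batch_size}&format=bib"
        for i in range(max_batches)
    ]


# Print the first three batch URLs for one conference year.
for url in dblp_batch_urls("aaai", 2020, max_batches=3):
    print(url)
```

Each URL would be fetched in turn and the responses concatenated, stopping once a batch comes back empty.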

The following conferences are supported and their bib/json files are in our data folder. You can turn each item on/off in bib_list.txt. Please feel free to create PR for adding new conferences following this!

Name	Years
ACL Anthology	(until 2023-06)
AAAI	2010 -- 2020
AISTATS	2013 -- 2020
ALENEX	2010 -- 2020
ASONAM	2010 -- 2019
BigDataConf	2013 -- 2019
BMVC	2010 -- 2020
CHI	2010 -- 2020
CIDR	2009 -- 2020
CIKM	2010 -- 2020
COLT	2000 -- 2020
CVPR	2000 -- 2020
ICASSP	2015 -- 2020
ICCV	2003 -- 2019
ICLR	2013 -- 2020
ICML	2000 -- 2020
IJCAI	2011 -- 2020
INTERSPEECH	2016 -- 2021
KDD	2010 -- 2020
MLSys	2019 -- 2020
MM	2016 -- 2020
NeurIPS	2000 -- 2020
RECSYS	2010 -- 2020
SDM	2010 -- 2020
SIGIR	2010 -- 2020
SIGMOD	2010 -- 2020
SODA	2010 -- 2020
STOC	2010 -- 2020
UAI	2010 -- 2020
WSDM	2008 -- 2020
WWW (The Web Conf)	2001 -- 2020

Thanks to Anton Tsitsulin for his great work on collecting such a complete set of bib files!

Adding a new conference

You can manually add any conference from DBLP by downloading its bib files to our raw_data folder and running the prepared script add_conf.sh.

Take ICLR2020 and ICLR2019 as an example:

Step 1: Go to DBLP
Step 2: Download the bib files, and put them here as raw_data/iclr2020.bib and raw_data/iclr2019.bib (file names must follow the format {conf_name}{year}.bib)
Step 3: Run script

bash add_conf.sh iclr 2019 2020
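
Before running the script, it can help to confirm that the downloaded files follow the {conf_name}{year}.bib naming. The sketch below is not part of the repository: `raw_bib_paths` and `missing_raw_files` are hypothetical helpers, and the default `raw_data` directory is assumed to be relative to the current working directory.

```python
from pathlib import Path


def raw_bib_paths(conf_name, years, raw_dir="raw_data"):
    """Return the file paths expected for each year, named {conf_name}{year}.bib."""
    return [Path(raw_dir) / f"{conf_name}{year}.bib" for year in years]


def missing_raw_files(conf_name, years, raw_dir="raw_data"):
    """List the expected files that have not been downloaded yet."""
    return [p for p in raw_bib_paths(conf_name, years, raw_dir) if not p.is_file()]
```

For the ICLR example, `missing_raw_files("iclr", [2019, 2020])` returns an empty list once both files are in place.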

In particular, to update the *CL conferences, run:

python bib2json.py -i raw_data/anthology.bib -o data/acl.json

Contact

Please email [email protected] or create Github issues here if you have any questions or suggestions.

rebiber's Issues

The way `abbr.tsv` is loaded removes entries from the file

abbr.tsv has two entries for ICML:

Proc. of ICML | Proceedings of the .* International Conference on Machine Learning
Proc. of ICML | Machine Learning, Proceedings of the .* International Conference

but they are not both loaded, because in load_abbr_tsv() a dictionary is used such that the second entry overwrites the first one:

ls = line.split("|")
if len(ls) == 2:
    abbr_dict[ls[0].strip()] = ls[1].strip()

I see two solutions here: either load the file into a list of tuples instead of a dictionary, or allow regex alternation (i.e., (pattern1|pattern2)), which would require a separator other than | between the left- and right-hand sides in abbr.tsv.
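
The first option could look roughly like this. It is a sketch under assumptions, not the actual `load_abbr_tsv()`: `load_abbr_pairs` and `abbreviate` are hypothetical names, and the sketch takes an iterable of lines rather than a file path.

```python
import re


def load_abbr_pairs(lines):
    """Parse 'abbr | pattern' lines into a list so repeated abbreviations survive."""
    pairs = []
    for line in lines:
        ls = line.split("|")
        if len(ls) == 2:
            pairs.append((ls[0].strip(), ls[1].strip()))
    return pairs


def abbreviate(booktitle, pairs):
    """Return the first abbreviation whose regex pattern matches, else the input."""
    for abbr, pattern in pairs:
        if re.search(pattern, booktitle):
            return abbr
    return booktitle


lines = [
    "Proc. of ICML | Proceedings of the .* International Conference on Machine Learning",
    "Proc. of ICML | Machine Learning, Proceedings of the .* International Conference",
]
pairs = load_abbr_pairs(lines)  # both ICML patterns are kept
```

Because the pairs are tried in file order, more specific patterns should come before more general ones.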

[feature request] "rebiber --update" to update the list of bibjsons

With this feature, users do not need to upgrade via pip if the code doesn't change at all.

Some references are filtered by `load_bib_file`

It's a great tool, but when I try to convert my .bib file, which is generated by the BibDesk application, some references are filtered out. Here is a minimal example from my bib file.

@inproceedings{zhang2019heterogeneous,
        author = {Zhang, Chuxu and Song, Dongjin and Huang, Chao and Swami, Ananthram and Chawla, Nitesh V},
        booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
        date-added = {2021-04-03 01:39:20 +0800},
        date-modified = {2021-04-03 01:44:13 +0800},
        keywords = {Recommender system, Graph Neural Network},
        pages = {793--803},
        title = {Heterogeneous graph neural network},
        year = {2019},
        Bdsk-Url-1 = {https://doi.org/10.1145/3292500.3330961}}

I think this is due to load_bib_file. The last line of this reference contains {, so load_bib_file skipped this reference.

However, in the BibtexParser, this kind of bib file can be recognized.
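
One layout-independent way to split entries is to track brace depth instead of inspecting individual lines. The following is a minimal sketch of that idea, not how `load_bib_file` works; `split_bib_entries` is a hypothetical name, and the sketch ignores escaped braces such as `\{` and `%` comments.

```python
def split_bib_entries(text):
    """Split BibTeX text into entries by tracking brace depth, not line layout."""
    entries, depth, start = [], 0, None
    for i, ch in enumerate(text):
        if ch == "@" and depth == 0:
            start = i                      # a new entry begins outside any braces
        elif ch == "{" and start is not None:
            depth += 1
        elif ch == "}" and start is not None:
            depth -= 1
            if depth == 0:                 # the entry's outermost brace just closed
                entries.append(text[start:i + 1])
                start = None
    return entries
```

With this approach, an entry whose final field and closing brace share a line (`...}}`) is treated the same as one with the closing brace on its own line.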

Confusing behavior with some author names

The ImageNet paper has its last author listed as Li Fei-Fei, which is how she publishes in general, both on the paper and in the IEEE metadata; their .bib has her as Li Fei-Fei in the author.

The DBLP record lists her as Li Fei{-}Fei.

And yet rebiber/data/cvpr2009.bib.json has her as Fei{-}Fei Li, and so running either through rebiber incorrectly changes it to that ordering.

The same is true for most (but not all) of her papers in the database. No idea why this would be, since DBLP consistently has her as Li Fei{-}Fei.

cc @pranav-ust

Hope add a command for batch files execution

I have multiple bib files for several research fields and hope to convert them all in one click. I've written a bat file that automatically processes the bib files in the working directory:

@echo off
for %%i in (*.bib) do echo "%%i"
for %%i in (*.bib) do rebiber -i %%i -o Pub%%i
pause
exit

But a built-in command would be easier to use. Would you like to add this?
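
A cross-platform equivalent of the batch file can be sketched in Python. This is not a Rebiber feature: `plan_batch` and `run_batch` are hypothetical helpers, the `Pub` prefix mirrors the batch file above, and the sketch assumes `rebiber` is on your PATH.

```python
import subprocess
from pathlib import Path


def plan_batch(paths, prefix="Pub"):
    """Pair each input path with an output path whose file name gets the prefix."""
    return [(p, p.with_name(prefix + p.name)) for p in paths]


def run_batch(directory=".", prefix="Pub"):
    """Run rebiber once per .bib file found directly inside `directory`."""
    inputs = sorted(Path(directory).glob("*.bib"))
    for src, dst in plan_batch(inputs, prefix):
        subprocess.run(["rebiber", "-i", str(src), "-o", str(dst)], check=True)
```

Like the batch file, a second run would also pick up the `Pub*.bib` outputs of the first, so you may want to write outputs to a separate directory.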

Incomplete bib entry for conference

Hello, I find that some papers accepted by some conferences (e.g. AAAI 2020) cannot be indexed. The reason might be that only the first 1,000 entries can be downloaded from DBLP when a conference has more than 1,000 accepted papers. Is there any way to address this problem? Thanks very much!

Add some medical image conferences

I can work on this soon. The main one is MICCAI.

Comments in bib file are transformed into `@comments`

I come across two issues here:

Somehow the tool transforms my comments (ones starting with %) in bib file into @comment{} and places them at the head of the file;
All the bib entries are sorted alphabetically by default. Is there an option to keep their original order (and thus keep the comments where they are)?

Great tool by the way :)

Question about month

Hi Yuchen,
It seems you try to ignore the month field in a bib entry in is_contain_var() and build_json(). Can you please explain why is that necessary?
You also ignore '@string' entry. Why not just let bibtexparser parse the entire bib file?
Thank you!

Deleted entry after using rebiber

Hi,

Thanks for the great tool! I faced an issue where an entry was deleted after using rebiber though. It's this one:

@article{loon,
  title={Autonomous navigation of stratospheric balloons using reinforcement learning.},
  author={Marc G. Bellemare and Salvatore Candido and P. S. Castro and J. Gong and Marlos C. Machado and Subhodeep Moitra and Sameera S. Ponda and Ziyu Wang},
  journal={Nature},
  year={2020},
  volume={588 7836},
  pages={
          77-82
        }
}

Do you have any idea what could be wrong?

fix for windows

rebiber --update fails on Windows because it relies on a series of Linux commands such as wget.

ModuleNotFoundError: No module named 'bibtexparser'

When trying to run the code in README.

Not keeping the @software entries

This will be ignored.

@software{spacy,
  author = {Honnibal, Matthew and Montani, Ines and Van Landeghem, Sofie and Boyd, Adriane},
  title = {{spaCy: Industrial-strength Natural Language Processing in Python}},
  year = 2020,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.1212303},
  url = {https://doi.org/10.5281/zenodo.1212303}
}

Handle @string

Nice tool!
It seems that currently it doesn't handle @string in BibTeX. Any plan to add this feature?

Example:

@string{emnlp = "Empirical Methods in Natural Language Processing (EMNLP)"}

@inproceedings{li2020efficient,
 title={Efficient One-Pass End-to-End Entity Linking for Questions},
 author={Li, Belinda Z. and Min, Sewon and Iyer, Srinivasan and Mehdad, Yashar and Yih, Wen-tau},
 booktitle=emnlp,
 year={2020}
}
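
Until @string is supported, the macros could be expanded in a preprocessing step. The sketch below is a rough regex-based illustration and not a full BibTeX parser: `expand_strings` is a hypothetical name, it only handles double-quoted `@string` definitions, and it could misfire on text of the form `word = macro,` inside a field value.

```python
import re

# Matches: @string{name = "value"}
STRING_DEF = re.compile(r'@string\s*\{\s*(\w+)\s*=\s*"([^"]*)"\s*\}', re.IGNORECASE)
# Matches a field whose value is a bare identifier, e.g. booktitle=emnlp,
MACRO_USE = re.compile(r'(\w+)\s*=\s*([A-Za-z]\w*)[ \t]*(?=[,}\n])')


def expand_strings(bib_text):
    """Inline @string macros into the fields that reference them."""
    macros = {name.lower(): value for name, value in STRING_DEF.findall(bib_text)}
    body = STRING_DEF.sub("", bib_text)

    def substitute(match):
        field, name = match.group(1), match.group(2).lower()
        if name not in macros:
            return match.group(0)          # e.g. month = nov stays untouched
        return f"{field} = {{{macros[name]}}}"

    return MACRO_USE.sub(substitute, body).lstrip()
```

Applied to the example above, the `booktitle=emnlp` field becomes `booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}` and the `@string` line is dropped.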

The booktitle contains too much information

I found that the booktitle of many papers in DBLP has too many names and information.

For example:

@inproceedings{seo-etal-2016-bidirectional,
 author = {Min Joon Seo and
Aniruddha Kembhavi and
Ali Farhadi and
Hannaneh Hajishirzi},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/conf/iclr/SeoKFH17.bib},
 booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
 publisher = {OpenReview.net},
 timestamp = {Thu, 25 Jul 2019 01:00:00 +0200},
 title = {Bidirectional Attention Flow for Machine Comprehension},
 url = {https://openreview.net/forum?id=HJ0UKP9ge},
 year = {2017}
}

The booktitle here contains the full name and abbreviation of ICLR, as well as their location.
Can you keep only the first one of this information?

For example:
booktitle = "5th International Conference on Learning Representations"
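
One simple heuristic for this request is to keep only the text before the first comma. This is a sketch of that heuristic, not an existing Rebiber option; `short_booktitle` is a hypothetical name, and the rule would truncate any conference name that itself contains a comma.

```python
def short_booktitle(booktitle):
    """Keep only the text before the first comma, with whitespace collapsed."""
    flat = " ".join(booktitle.split())     # join multi-line values into one line
    return flat.split(",", 1)[0].strip()


full = ("5th International Conference on Learning Representations, {ICLR} 2017,\n"
        "Toulon, France, April 24-26, 2017, Conference Track Proceedings")
print(short_booktitle(full))
```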

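How could Zotero be supported?

Zotero is a popular open-source reference manager that can export to bib format.
Perhaps a Zotero plugin could be built on top of Rebiber to batch-update the arXiv citations of entries in Zotero.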

A more complete set of conferences

Hi! It's a great tool. I have downloaded some other conferences of personal interest, and wanted to share them with others. There are some gaps in downloads here and there (specifically, for multi-volume conferences), but it's more complete than the one present in the repo. I have included more years as well.

The data is >400 MB uncompressed, and 30Mb compressed. I leave a link here for those who may find it useful.
http://tsitsul.in/data/confdata.zip

Matching Heuristic causes Mismatches, e.g. "Deep Learning". Check for >=1 same author?

Big thanks to the people behind rebiber. It is a very helpful tool.

I noticed some peculiarities with the matching heuristic (see example below).
TLDR: rebiber turned the entries for the Deep Learning book (Goodfellow) and the Deep Learning Nature article (LeCun, Bengio, Hinton) into entries referring to a SIGKDD paper (Salakhutdinov) with the same title, without any warning.

Based on this behaviour, I assume the matching is done only based on title.

I suggest additionally checking for at least one common author. Or (less invasive): emit a clear warning when the new entry does not share a single author name with the original entry.

Input

@article{deeplearning,
  title = {Deep Learning},
  author = {LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey},
  year = {2015},
  month = may,
  journal = {Nature},
  volume = {521},
  number = {7553},
  pages = {436--444},
  publisher = {{Nature Publishing Group}},
  issn = {1476-4687},
  doi = {10.1038/nature14539},
  copyright = {2015 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.},
}

@book{goodfellow_deep_2016,
	series = {Adaptive computation and machine learning},
	title = {Deep {Learning}},
	isbn = {978-0-262-03561-3},
	url = {http://www.deeplearningbook.org/},
	publisher = {MIT Press},
	author = {Goodfellow, Ian J. and Bengio, Yoshua and Courville, Aaron C.},
	year = {2016},
}

Cmd: rebiber -i main.bib -o main.bib -r editor -d True -s True

Output

@inproceedings{deeplearning,
 author = {Ruslan Salakhutdinov},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/conf/kdd/Salakhutdinov14.bib},
 booktitle = {Proc. of KDD},
 doi = {10.1145/2623330.2630809},
 pages = {1973},
 publisher = {{ACM}},
 timestamp = {Tue, 06 Nov 2018 00:00:00 +0100},
 title = {Deep learning},
 url = {https://doi.org/10.1145/2623330.2630809},
 year = {2014}
}

@inproceedings{goodfellow_deep_2016,
 author = {Ruslan Salakhutdinov},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/conf/kdd/Salakhutdinov14.bib},
 booktitle = {Proc. of KDD},
 doi = {10.1145/2623330.2630809},
 pages = {1973},
 publisher = {{ACM}},
 timestamp = {Tue, 06 Nov 2018 00:00:00 +0100},
 title = {Deep learning},
 url = {https://doi.org/10.1145/2623330.2630809},
 year = {2014}
}

The command line output of rebiber only states

Converted. ID: deeplearning ; Title: Deep Learning
Converted. ID: goodfellow_deep_2016 ; Title: Deep {Learning}

Desired output

a) Do not have the entries replaced, as there is not a single author name in common

or b) Emit a major warning on the command line, when replacing without a single author name in common
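
The proposed author check can be sketched as a surname overlap test. This illustrates the suggestion only and is not Rebiber's matching code: `surnames` and `share_author` are hypothetical names, and the sketch assumes the surname is either the part before a comma or the last whitespace-separated token, which fails for names written family-name-first without a comma.

```python
import re


def surnames(author_field):
    """Extract lowercased surnames from a BibTeX author field."""
    names = set()
    for author in re.split(r"\s+and\s+", author_field.strip()):
        author = author.replace("{", "").replace("}", "").strip()
        if not author:
            continue
        if "," in author:
            last = author.split(",", 1)[0]   # "Last, First" form
        else:
            last = author.split()[-1]        # "First Last" form
        names.add(last.strip().lower())
    return names


def share_author(original, candidate):
    """True if the two author fields have at least one surname in common."""
    return bool(surnames(original) & surnames(candidate))
```

A replacement would then be applied, or at least reported without a warning, only when `share_author(original_entry_authors, matched_entry_authors)` is true.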

[feature request] add a follow-up bib post-processor

so that one can choose to turn on/off some entries such as date/year/URL/DOI stuff.

Wrapping it as a pip module

Computer Vision Bib Files + Scripts

Thank you for your wonderful tool ! Saved me a lot of time !

I used the following scripts to extract bibtex files from DBLP using the API, in case that is useful:

import requests

confs = ['ecml', 'wacv']  # full list: ['eccv', 'iccv', 'bmvc', 'cvpr', 'accv', 'neurips', 'miccai', 'ecml']
years = list(range(2000, 2024))

for conf in confs:
    for year in years:
        cites = ''
        # DBLP returns at most 1000 entries per request, so page through with f=0, 1000, ...
        for step in range(5):
            print(step, conf, year)
            s = f'https://dblp.org/search/publ/api?q=conf/{conf}/{year}&h=1000&f={step*1000}&format=bib'
            cs = requests.get(s).text
            if cs == '':
                print('stop')
                break
            cites += cs
        # write all batches for this conference and year into one file
        with open(f'/rebiber/rebiber/raw_data/{conf}{year}.bib', 'w') as f:
            f.write(cites)

For journals you can use

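journals = ['ijcv']  # or 'pami'
editions = range(100, 132)

for journal in journals:
    for edition in editions:
        for step in range(5):
            s = f'https://dblp.org/search/publ/api?q=toc:db/journals/{journal}/{journal}{edition}.bht:&h=1000&f={step*1000}&format=bib'
            # fetch s and save the result as in the conference script above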

I also attached a zip with the classical computer vision conferences / journals.
CV-bib-files.zip

Would you consider providing a Python API?

Although a scripting approach is provided, would you consider providing a Python API?

for example

import rebiber
bib_str = '''@article{lin2020birds,
	title={Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models},
	author={Lin, Bill Yuchen and Lee, Seyeon and Khanna, Rahul and Ren, Xiang},
	journal={arXiv preprint arXiv:2005.00683},
	year={2020}
}'''
res = rebiber.trans(bib_str)  # trans() is the function name proposed in this request
print(res)

Double-closed braces generate an extra @comment block

For @article and @book entries, a double closing brace at the end of the last line generates an extra @comment block in the output bib file.

Input example:

@article{Ando2005,
	Acmid = {1194905},
	Author = {Ando, Rie Kubota and Zhang, Tong},
	Issn = {1532-4435},
	Pages = {1817--1853},
	Publisher = {JMLR.org},
	Title = {A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data},
	Volume = {6},
	Year = {2005}}

Output:

@comment{}}

@article{Ando2005,
 acmid = {1194905},
 author = {Ando, Rie Kubota and Zhang, Tong},
 issn = {1532-4435},
 issue_date = {12/1/2005},
 journal = {Journal of Machine Learning Research},
 numpages = {37},
 pages = {1817--1853},
 publisher = {JMLR.org},
 title = {A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data},
 volume = {6},
 year = {2005}
}

Some refs disappear when generating in Hugging Face

Try this one in the Hugging Face demo; it generates an empty output.

@inproceedings{LocalAlgorithmFinding2013zhu,
title = {A Local Algorithm for Finding Well-Connected Clusters},
booktitle = {International {{Conference}} on {{Machine Learning}}},
author = {Zhu, Zeyuan Allen and Lattanzi, Silvio and Mirrokni, Vahab},
year = {2013},
pages = {396--404},
publisher = {{PMLR}}
}

Online web demo

Work in progress and suggestions are welcome!

Add LREC + Automatically sync

The LREC Sign Language workshop has this website -
https://www.sign-lang.uni-hamburg.de/lrec/index.html

Which links to two bib files:
without abstracts: https://www.sign-lang.uni-hamburg.de/lrec/sign-lang_lrec.bib
with abstracts: https://www.sign-lang.uni-hamburg.de/lrec/sign-lang_lrec_a.bib

While one can add them manually to this repo, I was wondering if there is a setting somewhere to just put this link, and whenever someone runs an "update" script it will re-fetch the bib file and process it?