rebiber's Introduction

Rebiber: A tool for normalizing bibtex with official info.

We often cite papers using their arXiv versions without noting that they have already been published at a conference. These unofficial bib entries may violate the submission or camera-ready rules of some conferences. We introduce Rebiber, a simple Python tool that fixes them automatically. It is based on the official conference information from DBLP or the ACL Anthology (for NLP conferences). You can check the list of supported conferences here. Apart from handling outdated arXiv citations, Rebiber also normalizes citations in a unified (DBLP-style) way, supporting booktitle abbreviation and value selection.

Demo on Huggingface Space https://huggingface.co/spaces/yuchenlin/Rebiber (recommended)

Colab notebook: here

Changelog

  • 2023.06.01 New demo ready to use on Huggingface's Space via Gradio. Also, a few conferences are added.

  • 2021.09.06 We fixed a few minor bugs and added features such as sorting and urls to arXiv (if the paper is not in any conferences; thanks to @nicola-decao). We also updated the ACL anthology bib/json to the latest version as well as other conferences.

  • 2021.05.30 We built a beta version of our web app for Rebiber, added new conferences to our dataset, and fixed a few minor bugs. (The web app no longer works. Please use the new Hugging Face Space demo.)

  • 2021.02.08 We now support several useful features: 1) turning off certain values, e.g., "-r url,pages,address" removes those values from the output; 2) using abbreviations to shorten the booktitle values, e.g., Proceedings of the .* Annual Meeting of the Association for Computational Linguistics --> Proc. of ACL. More examples are here.

  • 2021.01.30 We built a Colab notebook as a simple web demo. link

Installation

# pip install rebiber -U # for the stable version
pip install -e git+https://github.com/yuchenlin/rebiber.git#egg=rebiber -U
# rebiber --update  # (optional) update the bib data and the abbr. info  (using wget)

OR

git clone https://github.com/yuchenlin/rebiber.git
cd rebiber/
pip install -e .

If you would like to use the latest github version with more bug fixes, please use the second installation method.

Usage (v1.1.3)

Normalize your bibtex file with the official conference information:

rebiber -i /path/to/input.bib -o /path/to/output.bib

You can find a pair of example input and output files in rebiber/example_input.bib and rebiber/example_output.bib.

argument usage
-i or --input_bib. The path to the input bib file that you want to update
-o or --output_bib. The path to the output bib file that you want to save. If you don't specify -o, the output path is the same as the -i path, so the input file is overwritten.
-r or --remove. A comma-separated list of value names that you want to remove, such as "-r pages,editor,volume,month,url,biburl,address,publisher,bibsource,timestamp,doi". Empty by default.
-s or --shorten. A bool argument that is "False" by default. When set to "True", booktitle values are replaced with the abbreviations from the -a file. Used as -s True.
-d or --deduplicate. A bool argument that is "True" by default, used for removing the duplicate bib entries sharing the same key. Used as -d True.
-l or --bib_list. The path to the list of the bib json files to be loaded. Check rebiber/bib_list.txt for the default file. Usually you don't need to set this argument.
-a or --abbr_tsv. The list of conference abbreviation data. Check rebiber/abbr.tsv for the default file. Usually you don't need to set this argument.
-u or --update. Update the local bib-related data with the latest Github version.
-v or --version. Print the version of current Rebiber.
-st or --sort. A bool argument that is "False" by default, which keeps the original order of the bib entries of the input file. When set to "True", the bib entries are ordered alphabetically in the output file. Used as -st True.
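
To show how the flags above combine, here is a minimal sketch that assembles a rebiber command line from Python. The helper name build_rebiber_command and the file names are made up for illustration; only the flags themselves come from the table above.

```python
import shlex


def build_rebiber_command(input_bib, output_bib=None, remove=(),
                          shorten=False, deduplicate=True, sort=False):
    """Assemble a rebiber CLI call (hypothetical helper, not part of rebiber)."""
    cmd = ["rebiber", "-i", input_bib]
    if output_bib:
        cmd += ["-o", output_bib]
    if remove:
        # -r takes a comma-separated list of value names
        cmd += ["-r", ",".join(remove)]
    # Bool flags are passed as the literal strings "True"/"False"
    cmd += ["-s", str(shorten), "-d", str(deduplicate), "-st", str(sort)]
    return cmd


cmd = build_rebiber_command("refs.bib", "refs_clean.bib",
                            remove=["url", "pages", "address"], shorten=True)
print(shlex.join(cmd))
# To actually run it (rebiber must be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.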

Example Input and Output

An example input entry with the arXiv information (from Google Scholar or somewhere):

@article{lin2020birds,
	title={Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models},
	author={Lin, Bill Yuchen and Lee, Seyeon and Khanna, Rahul and Ren, Xiang},
	journal={arXiv preprint arXiv:2005.00683},
	year={2020}
}

An example normalized output entry with the official information:

@inproceedings{lin2020birds,
    title = "{B}irds have four legs?! {N}umer{S}ense: {P}robing {N}umerical {C}ommonsense {K}nowledge of {P}re-{T}rained {L}anguage {M}odels",
    author = "Lin, Bill Yuchen  and
      Lee, Seyeon  and
      Khanna, Rahul  and
      Ren, Xiang",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.557",
    doi = "10.18653/v1/2020.emnlp-main.557",
    pages = "6862--6868",
}

Supported Conferences

The bib_list.txt contains a list of converted json files of the official bib data. In this repo, we now support the full ACL anthology, i.e., all papers that are published at *CL conferences (ACL, EMNLP, NAACL, etc.) as well as workshops. Also, we support any conference proceedings that can be downloaded from DBLP, for example, ICLR2020.

Note that DBLP only allows you to download in batches of 1000 using &h=1000&f=0, where f=0|1000|2000 specifies the starting index. So we have to manually download the bib files of each conference and concatenate them together. add_conf.sh takes care of that, too.
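
The batching described above can be sketched as a list of paged query URLs. This is an illustration only: the function name dblp_page_urls is hypothetical, and the URL pattern is the one quoted in the "Computer Vision Bib Files + Scripts" issue further down, not necessarily what add_conf.sh uses.

```python
def dblp_page_urls(conf, year, max_pages=3, page_size=1000):
    """Return one DBLP search URL per batch of `page_size` entries.

    Hypothetical helper; h is the batch size and f the starting index.
    """
    base = "https://dblp.org/search/publ/api"
    return [
        f"{base}?q=conf/{conf}/{year}&h={page_size}&f={page * page_size}&format=bib"
        for page in range(max_pages)
    ]


for url in dblp_page_urls("iclr", 2020):
    print(url)
```

Each URL would be fetched in turn and the responses concatenated into a single {conf_name}{year}.bib file.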

The following conferences are supported, and their bib/json files are in our data folder. You can turn each item on or off in bib_list.txt. Please feel free to create a PR for adding new conferences following this!

Name Years
ACL Anthology (until 2023-06)
AAAI 2010 -- 2020
AISTATS 2013 -- 2020
ALENEX 2010 -- 2020
ASONAM 2010 -- 2019
BigDataConf 2013 -- 2019
BMVC 2010 -- 2020
CHI 2010 -- 2020
CIDR 2009 -- 2020
CIKM 2010 -- 2020
COLT 2000 -- 2020
CVPR 2000 -- 2020
ICASSP 2015 -- 2020
ICCV 2003 -- 2019
ICLR 2013 -- 2020
ICML 2000 -- 2020
IJCAI 2011 -- 2020
INTERSPEECH 2016 -- 2021
KDD 2010 -- 2020
MLSys 2019 -- 2020
MM 2016 -- 2020
NeurIPS 2000 -- 2020
RECSYS 2010 -- 2020
SDM 2010 -- 2020
SIGIR 2010 -- 2020
SIGMOD 2010 -- 2020
SODA 2010 -- 2020
STOC 2010 -- 2020
UAI 2010 -- 2020
WSDM 2008 -- 2020
WWW (The Web Conf) 2001 -- 2020

Thanks to Anton Tsitsulin for his great work on collecting such a complete set of bib files!

Adding a new conference

You can manually add any conference from DBLP by downloading its bib files to our raw_data folder and running the prepared script add_conf.sh.

Take ICLR2020 and ICLR2019 as an example:

  • Step 1: Go to DBLP
  • Step 2: Download the bib files, and put them here as raw_data/iclr2020.bib and raw_data/iclr2019.bib (file names should follow the format {conf_name}{year}.bib)
  • Step 3: Run script
bash add_conf.sh iclr 2019 2020
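
Before running the script, it can help to check that the downloaded files follow the expected naming. The sketch below is only an illustration; expected_raw_files and add_conf_command are hypothetical helpers, and the way add_conf.sh interprets its year arguments is assumed from the single example above.

```python
import os


def expected_raw_files(conf_name, years, raw_dir="raw_data"):
    """File names in the {conf_name}{year}.bib format described in Step 2."""
    return [f"{raw_dir}/{conf_name}{year}.bib" for year in years]


def add_conf_command(conf_name, years):
    """Command line mirroring the example in Step 3 (assumed argument layout)."""
    return ["bash", "add_conf.sh", conf_name, *[str(y) for y in years]]


years = [2019, 2020]
missing = [p for p in expected_raw_files("iclr", years) if not os.path.exists(p)]
if missing:
    print("Missing raw files:", ", ".join(missing))
else:
    print("Ready to run:", " ".join(add_conf_command("iclr", years)))
```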

In particular, to update the *CL conferences, we can run

python bib2json.py -i raw_data/anthology.bib -o data/acl.json

Contact

Please email [email protected] or create Github issues here if you have any questions or suggestions.

rebiber's Issues

The way `abbr.tsv` is loaded removes entries from the file

abbr.tsv has two entries for ICML:

Proc. of ICML | Proceedings of the .* International Conference on Machine Learning
Proc. of ICML | Machine Learning, Proceedings of the .* International Conference

but they are not both loaded, because in load_abbr_tsv() a dictionary is used such that the second entry overwrites the first one:

ls = line.split("|")
if len(ls) == 2:
    abbr_dict[ls[0].strip()] = ls[1].strip()

I see two solutions here: either don't load the file into a dictionary (but just a list of tuples), or allow specifying "or" regex patterns (i.e., (pattern1|pattern2)), which would require using a character other than | to separate the left- and right-hand sides in abbr.tsv.
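
The first suggestion could look roughly like the sketch below. It is not the project's load_abbr_tsv(); the function names load_abbr_pairs and shorten_booktitle are hypothetical, and the matching step is an assumption about how the patterns are meant to be used.

```python
import re


def load_abbr_pairs(lines):
    """Keep every (abbreviation, pattern) row, even when abbreviations repeat."""
    pairs = []
    for line in lines:
        ls = line.split("|")
        if len(ls) == 2:
            pairs.append((ls[0].strip(), ls[1].strip()))
    return pairs


def shorten_booktitle(booktitle, pairs):
    """Return the abbreviation of the first pattern that matches, if any."""
    for abbr, pattern in pairs:
        if re.search(pattern, booktitle):
            return abbr
    return booktitle


rows = [
    "Proc. of ICML | Proceedings of the .* International Conference on Machine Learning",
    "Proc. of ICML | Machine Learning, Proceedings of the .* International Conference",
]
pairs = load_abbr_pairs(rows)
print(len(pairs))
print(shorten_booktitle(
    "Machine Learning, Proceedings of the Twenty-First International Conference", pairs))
```

With a list of tuples both ICML rows survive, so a booktitle in either word order can be matched.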

Some references are filtered by `load_bib_file`

It's a great tool, but when I try to convert my .bib file, which is generated by the application BibDesk, some references are filtered out. Here is a minimal example from my bib file.

@inproceedings{zhang2019heterogeneous,
        author = {Zhang, Chuxu and Song, Dongjin and Huang, Chao and Swami, Ananthram and Chawla, Nitesh V},
        booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
        date-added = {2021-04-03 01:39:20 +0800},
        date-modified = {2021-04-03 01:44:13 +0800},
        keywords = {Recommender system, Graph Neural Network},
        pages = {793--803},
        title = {Heterogeneous graph neural network},
        year = {2019},
        Bdsk-Url-1 = {https://doi.org/10.1145/3292500.3330961}}

I think this is due to load_bib_file. The last line of this reference contains {, so load_bib_file skipped this reference.

However, BibtexParser can recognize this kind of bib file.
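
One way to avoid depending on where braces fall on a line is to split entries by counting brace depth. The sketch below illustrates that idea only; it is not rebiber's load_bib_file, the name split_entries is hypothetical, and it assumes that every "@" outside an entry starts a new entry.

```python
def split_entries(text):
    """Split BibTeX text into entries by tracking brace depth."""
    entries = []
    pos = 0
    while True:
        start = text.find("@", pos)
        if start == -1:
            break
        open_brace = text.find("{", start)
        if open_brace == -1:
            break
        depth = 0
        for j in range(open_brace, len(text)):
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
                if depth == 0:
                    # Matching close found: the entry ends here, wherever it sits
                    entries.append(text[start:j + 1])
                    pos = j + 1
                    break
        else:
            break  # unbalanced braces: stop instead of guessing
    return entries


sample = """@inproceedings{a,
  title = {X},
  Bdsk-Url-1 = {https://doi.org/10.1145/3292500.3330961}}

@article{b,
  pages={
          77-82
        }
}
"""
for entry in split_entries(sample):
    print(entry.splitlines()[0])
```

Because the split depends only on balanced braces, an entry closed with }} on its last line and a field value spread over several lines are both kept.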

Confusing behavior with some author names

The ImageNet paper has its last author listed as Li Fei-Fei, which is how she publishes in general, both on the paper and in the IEEE metadata; their .bib has her as Li Fei-Fei in the author.

The DBLP record lists her as Li Fei{-}Fei.

And yet rebiber/data/cvpr2009.bib.json has her as Fei{-}Fei Li, and so running either through rebiber incorrectly changes it to that ordering.

The same is true for most (but not all) of her papers in the database. No idea why this would be, since DBLP consistently has her as Li Fei{-}Fei.

cc @pranav-ust

Please add a command for batch execution over multiple files

I have multiple bib files for several research fields and would like to convert them all in one click. I've written a bat file that automatically processes the bib files in the working directory:

@echo off
for %%i in (*.bib) do echo "%%i"
for %%i in (*.bib) do rebiber -i %%i -o Pub%%i
pause
exit

But a built-in command would be easier to use. Would you like to add this?
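
A cross-platform equivalent of the batch file above can be sketched in Python. The helper batch_commands is hypothetical; it only builds the same "rebiber -i X -o PubX" calls and leaves running them to the caller.

```python
from pathlib import Path


def batch_commands(bib_paths, prefix="Pub"):
    """Build one rebiber call per bib file, writing to <prefix><name>."""
    commands = []
    for bib in sorted(Path(p) for p in bib_paths):
        out = bib.with_name(prefix + bib.name)
        commands.append(["rebiber", "-i", str(bib), "-o", str(out)])
    return commands


# All bib files in the working directory, as in the batch file
for cmd in batch_commands(Path(".").glob("*.bib")):
    print(" ".join(cmd))
    # To run each one: import subprocess; subprocess.run(cmd, check=True)
```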

Incomplete bib entry for conference

Hello, I find that some papers accepted at certain conferences (e.g., AAAI 2020) cannot be found. The reason might be that only the first 1000 entries can be downloaded from DBLP when a conference has more than 1,000 accepted papers. Is there any way to address this problem? Thanks very much!

Comments in bib file are transformed into `@comments`

I came across two issues here:

  • Somehow the tool transforms my comments (the ones starting with %) in the bib file into @comment{} and places them at the head of the file;
  • All the bib entries are sorted alphabetically by default. Is there a way (an option) to preserve the original order of the entries (and thus keep the comments where they are)?

Great tool by the way :)

Question about month

Hi Yuchen,
It seems you try to ignore the month field in a bib entry in is_contain_var() and build_json(). Can you please explain why is that necessary?
You also ignore '@string' entry. Why not just let bibtexparser parse the entire bib file?
Thank you!

Deleted entry after using rebiber

Hi,

Thanks for the great tool! I faced an issue where an entry was deleted after using rebiber though. It's this one:

@article{loon,
  title={Autonomous navigation of stratospheric balloons using reinforcement learning.},
  author={Marc G. Bellemare and Salvatore Candido and P. S. Castro and J. Gong and Marlos C. Machado and Subhodeep Moitra and Sameera S. Ponda and Ziyu Wang},
  journal={Nature},
  year={2020},
  volume={588 7836},
  pages={
          77-82
        }
}

Do you have any idea what could be wrong?

fix for windows

rebiber --update fails on Windows because it relies on a series of Linux commands such as wget.

Not keeping the @software entries

The following entry will be ignored:

@software{spacy,
  author = {Honnibal, Matthew and Montani, Ines and Van Landeghem, Sofie and Boyd, Adriane},
  title = {{spaCy: Industrial-strength Natural Language Processing in Python}},
  year = 2020,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.1212303},
  url = {https://doi.org/10.5281/zenodo.1212303}
}

Handle @string

Nice tool!
It seems that currently it doesn't handle @string in BibTeX. Any plan to add this feature?

Example:

@string{emnlp = "Empirical Methods in Natural Language Processing (EMNLP)"}

@inproceedings{li2020efficient,
 title={Efficient One-Pass End-to-End Entity Linking for Questions},
 author={Li, Belinda Z. and Min, Sewon and Iyer, Srinivasan and Mehdad, Yashar and Yih, Wen-tau},
 booktitle=emnlp,
 year={2020}
}
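
As a rough illustration of what handling @string would involve, the sketch below expands macros with regular expressions before any further processing. It is an assumption-laden sketch, not rebiber behavior: expand_strings is a hypothetical name, only double-quoted @string values are handled, and a real implementation would more likely rely on a full BibTeX parser.

```python
import re

# Matches definitions of the form @string{name = "value"}
STRING_DEF = re.compile(r'@string\s*\{\s*(\w+)\s*=\s*"([^"]*)"\s*\}', re.IGNORECASE)
# Matches a bare identifier used as a field value, e.g. booktitle=emnlp,
BARE_VALUE = re.compile(r'(=\s*)([A-Za-z_]\w*)(?=\s*[,}\n])')


def expand_strings(text):
    """Inline @string macros into the fields that use them."""
    macros = {name.lower(): value for name, value in STRING_DEF.findall(text)}
    body = STRING_DEF.sub("", text)

    def replace(match):
        value = macros.get(match.group(2).lower())
        if value is None:
            return match.group(0)  # e.g. month = nov stays untouched
        return f"{match.group(1)}{{{value}}}"

    return BARE_VALUE.sub(replace, body).strip()


sample = '''@string{emnlp = "Empirical Methods in Natural Language Processing (EMNLP)"}

@inproceedings{li2020efficient,
 title={Efficient One-Pass End-to-End Entity Linking for Questions},
 booktitle=emnlp,
 year={2020}
}'''
print(expand_strings(sample))
```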

The booktitle contains too much information

I found that the booktitle of many DBLP entries contains too many names and too much information.

For example:

@inproceedings{seo-etal-2016-bidirectional,
 author = {Min Joon Seo and
Aniruddha Kembhavi and
Ali Farhadi and
Hannaneh Hajishirzi},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/conf/iclr/SeoKFH17.bib},
 booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
 publisher = {OpenReview.net},
 timestamp = {Thu, 25 Jul 2019 01:00:00 +0200},
 title = {Bidirectional Attention Flow for Machine Comprehension},
 url = {https://openreview.net/forum?id=HJ0UKP9ge},
 year = {2017}
}

The booktitle here contains the full name and the abbreviation of ICLR, as well as the conference location.
Could you keep only the first piece of this information?

For example:
booktitle = "5th International Conference on Learning Representations"
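
A naive version of this request is to keep everything before the first comma. The sketch below does just that; short_booktitle is a hypothetical helper, and the rule is an assumption that breaks for venues whose names themselves contain commas (for example the "Machine Learning, Proceedings of the ..." style seen in abbr.tsv).

```python
def short_booktitle(booktitle):
    """Keep only the part of a booktitle before the first comma."""
    flat = " ".join(booktitle.split())  # undo line breaks inside the value
    return flat.split(",")[0].strip("{} ")


full = """5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings"""
print(short_booktitle(full))
```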

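How about supporting Zotero?

Zotero is a popular open-source reference manager that can export to bib format.
Perhaps a Zotero plugin could be written on top of rebiber to batch-update the arXiv citations of the entries in Zotero.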

A more complete set of conferences

Hi! It's a great tool. I have downloaded some other conferences of personal interest, and wanted to share them with others. There are some gaps in downloads here and there (specifically, for multi-volume conferences), but it's more complete than the one present in the repo. I have included more years as well.

The data is >400 MB uncompressed and 30 MB compressed. I leave a link here for those who may find it useful.
http://tsitsul.in/data/confdata.zip

Matching Heuristic causes Mismatches, e.g. "Deep Learning". Check for >=1 same author?

Big thanks to the people behind rebiber. It is a very helpful tool.

I noticed some peculiarities with the matching heuristic (see example below).
TL;DR: rebiber turned the entries for the Deep Learning book (Goodfellow) and the Deep Learning Nature article (LeCun, Bengio, Hinton) into entries referring to a SIGKDD paper (Salakhutdinov) with the same title, without any warning.

Based on this behaviour, I assume the matching is done only based on title.

I suggest additionally checking for at least one common author, or (less invasive) emitting a clear warning when the new entry does not share a single author name with the original entry.

Input

@article{deeplearning,
  title = {Deep Learning},
  author = {LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey},
  year = {2015},
  month = may,
  journal = {Nature},
  volume = {521},
  number = {7553},
  pages = {436--444},
  publisher = {{Nature Publishing Group}},
  issn = {1476-4687},
  doi = {10.1038/nature14539},
  copyright = {2015 Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.},
}

@book{goodfellow_deep_2016,
	series = {Adaptive computation and machine learning},
	title = {Deep {Learning}},
	isbn = {978-0-262-03561-3},
	url = {http://www.deeplearningbook.org/},
	publisher = {MIT Press},
	author = {Goodfellow, Ian J. and Bengio, Yoshua and Courville, Aaron C.},
	year = {2016},
}

Cmd: rebiber -i main.bib -o main.bib -r editor -d True -s True

Output

@inproceedings{deeplearning,
 author = {Ruslan Salakhutdinov},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/conf/kdd/Salakhutdinov14.bib},
 booktitle = {Proc. of KDD},
 doi = {10.1145/2623330.2630809},
 pages = {1973},
 publisher = {{ACM}},
 timestamp = {Tue, 06 Nov 2018 00:00:00 +0100},
 title = {Deep learning},
 url = {https://doi.org/10.1145/2623330.2630809},
 year = {2014}
}

@inproceedings{goodfellow_deep_2016,
 author = {Ruslan Salakhutdinov},
 bibsource = {dblp computer science bibliography, https://dblp.org},
 biburl = {https://dblp.org/rec/conf/kdd/Salakhutdinov14.bib},
 booktitle = {Proc. of KDD},
 doi = {10.1145/2623330.2630809},
 pages = {1973},
 publisher = {{ACM}},
 timestamp = {Tue, 06 Nov 2018 00:00:00 +0100},
 title = {Deep learning},
 url = {https://doi.org/10.1145/2623330.2630809},
 year = {2014}
}

The command line output of rebiber only states

  • Converted. ID: deeplearning ; Title: Deep Learning
  • Converted. ID: goodfellow_deep_2016 ; Title: Deep {Learning}

Desired output

a) Do not have the entries replaced, as there is not a single author name in common

or b) Emit a major warning on the command line, when replacing without a single author name in common
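
The author check suggested above could be sketched as follows. This is not rebiber's matching code; last_names and share_author are hypothetical helpers, and the last-name extraction is a simple heuristic that will misjudge names such as "Li Fei-Fei" where the family name comes first.

```python
def last_names(author_field):
    """Extract lower-cased last names from a BibTeX author field."""
    names = set()
    for author in " ".join(author_field.split()).split(" and "):
        author = author.strip()
        if not author:
            continue
        if "," in author:
            last = author.split(",")[0]   # "Last, First" format
        else:
            last = author.split()[-1]     # "First Last" format
        names.add(last.strip("{} ").lower())
    return names


def share_author(original_authors, candidate_authors):
    """True if the two author fields have at least one last name in common."""
    return bool(last_names(original_authors) & last_names(candidate_authors))


original = "LeCun, Yann and Bengio, Yoshua and Hinton, Geoffrey"
candidate = "Ruslan Salakhutdinov"
if not share_author(original, candidate):
    print("Warning: matched entry shares no author with the original; keeping the original.")
```

A check like this would have kept both Deep Learning entries from the example untouched, or at least flagged them.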

Computer Vision Bib Files + Scripts

Thank you for your wonderful tool! It saved me a lot of time!

I used the following script to download BibTeX files from DBLP via its API, in case that is useful:

import requests

confs = ['ecml', 'wacv']  # others: 'eccv', 'iccv', 'bmvc', 'cvpr', 'accv', 'neurips', 'miccai'
years = list(range(2000, 2024))

for conf in confs:
    for year in years:
        cites = ''
        # DBLP returns at most 1000 entries per request, so page with f=0, 1000, ...
        for step in range(5):
            print(step, conf, year)
            s = f'https://dblp.org/search/publ/api?q=conf/{conf}/{year}&h=1000&f={step*1000}&format=bib'
            cs = requests.get(s).text
            if cs == '':
                print('stop')
                break
            cites += cs
        # One output file per conference and year, named {conf}{year}.bib
        with open(f'/rebiber/rebiber/raw_data/{conf}{year}.bib', 'w') as f:
            f.write(cites)

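For journals, you can build the query URL per volume in the same way:

journals = ['ijcv']  # or 'pami'
editions = range(100, 132)

for journal in journals:
    for edition in editions:
        step = 0  # page index, as in the conference loop
        s = f'https://dblp.org/search/publ/api?q=toc:db/journals/{journal}/{journal}{edition}.bht:&h=1000&f={step*1000}&format=bib'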

I also attached a zip with the classical computer vision conferences / journals.
CV-bib-files.zip

Would you consider providing a Python API?

Although a command-line approach is provided, would you consider providing a Python API?

For example:

import rebiber

# rebiber.trans() is the function requested in this issue (a proposed name)
bib_str = '''@article{lin2020birds,
	title={Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models},
	author={Lin, Bill Yuchen and Lee, Seyeon and Khanna, Rahul and Ren, Xiang},
	journal={arXiv preprint arXiv:2005.00683},
	year={2020}
}'''
res = rebiber.trans(bib_str)
print(res)

Double-closed braces generate an extra @comment block

For @article and @book entries, double closing braces at the end of an entry generate an extra @comment block in the output bib file.

Input example:

@article{Ando2005,
	Acmid = {1194905},
	Author = {Ando, Rie Kubota and Zhang, Tong},
	Issn = {1532-4435},
	Pages = {1817--1853},
	Publisher = {JMLR.org},
	Title = {A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data},
	Volume = {6},
	Year = {2005}}

Output:

@comment{}}

@article{Ando2005,
 acmid = {1194905},
 author = {Ando, Rie Kubota and Zhang, Tong},
 issn = {1532-4435},
 issue_date = {12/1/2005},
 journal = {Journal of Machine Learning Research},
 numpages = {37},
 pages = {1817--1853},
 publisher = {JMLR.org},
 title = {A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data},
 volume = {6},
 year = {2005}
}

Some refs disappear when generating in Hugging Face

Try this entry in the Hugging Face demo; it generates an empty output.

@inproceedings{LocalAlgorithmFinding2013zhu,
title = {A Local Algorithm for Finding Well-Connected Clusters},
booktitle = {International {{Conference}} on {{Machine Learning}}},
author = {Zhu, Zeyuan Allen and Lattanzi, Silvio and Mirrokni, Vahab},
year = {2013},
pages = {396--404},
publisher = {{PMLR}}
}

Add LREC + Automatically sync

The LREC Sign Language workshop has this website -
https://www.sign-lang.uni-hamburg.de/lrec/index.html

Which links to two bib files:
without abstracts: https://www.sign-lang.uni-hamburg.de/lrec/sign-lang_lrec.bib
with abstracts: https://www.sign-lang.uni-hamburg.de/lrec/sign-lang_lrec_a.bib

While one can add them manually to this repo, I was wondering if there is a setting somewhere to just put this link, and whenever someone runs an "update" script it will re-fetch the bib file and process it?

Auto add DOIs from bib file

Hi,

Is there any way to auto-add DOIs from the BibTeX files? This would be highly useful. I see a similar thing in one of the forks but idk if this is useful since it's only for arxiv I think.
