hazyresearch / data-centric-ai

Resources for Data Centric AI

License: Apache License 2.0

TeX 100.00%

ai artificial-intelligence data-centric-ai machine-learning

data-centric-ai's Issues

[Area Page]

Use this template to propose a new area page. A good rule-of-thumb to decide whether a topic merits an area page is whether it would pass muster as a workshop at a machine learning venue (e.g. ICML).

Please address the following questions when raising this issue:

what story you might tell about the topic's importance to data-centric AI
whether this topic is related to other areas in data-centric AI, and why existing discussions may not be sufficient
what subtopics, resources and related work you may discuss in the area page

If your request to add the new area is approved, you can submit a PR with the following changes:

add an additional area in README.md
add an area page with relevant discussion

Dataframe preprocessing platform:

Automunge is a library for encoding dataframes for supervised learning. Originally built for missing data infill, it has evolved to include an API for engineering sets of feature transformations, including some novel approaches for integrating stochasticity into supervised learning training or inference. The library has been shared in workshops at venues like NeurIPS, ICML, and ICLR. We suggest the tutorials folder in the GitHub repository as a starting point; comprehensive documentation is provided in the README file. Further write-ups are available in the arXiv papers authored by the developer (Nicholas J. Teague) and linked from automunge.com.

[Area Page] Data Selection

Data selection methods, such as active learning and core-set selection, are useful and important tools for machine learning on large datasets. Major AI/ML conferences such as NeurIPS and ICML have consistently featured workshops and tutorials on these topics:

SubSetML: Subset Selection in Machine Learning: From Theory to Practice, ICML 2021
Workshop on Dataset Curation and Security, NeurIPS 2020
Active Learning From Theory to Practice, ICML 2019

what story you might tell about the topic's importance to data-centric AI
Large-scale unlabeled datasets can contain millions or billions of examples covering a wide variety of underlying concepts. Yet these massive datasets often skew towards a relatively small number of common concepts, for example ‘cats’, ‘dogs’, and ‘people’. Rare concepts, such as ‘harbor seals’, tend to appear in only a small fraction of the data (usually less than 1%). However, performance on these rare concepts is critical in many settings. For example, harmful or malicious content may comprise only a small percentage of user-generated content, but it can have a disproportionate impact on the overall user experience. Similarly, when debugging model behavior for safety-critical applications like autonomous vehicles, or when dealing with representational biases in models, obtaining data that captures rare concepts allows machine learning practitioners to combat blind spots in model performance. Even a simple task, such as stop sign detection by an autonomous vehicle, can be difficult due to the diversity of real-world data. Stop signs may appear in a variety of conditions (e.g., on a wall or held by a person), can be heavily occluded, or can have modifiers (e.g., “Except Right Turn”). Large-scale datasets are essential but not sufficient; finding the relevant examples for these long-tail tasks is challenging. Data selection methods, such as active learning, active search, and core-set selection, have the potential to automate the process of identifying these rare, high-value data points. (See "Similarity Search for Efficient Active Learning and Search of Rare Concepts" for more detail.)

whether this topic is related to other areas in data-centric AI, and why existing discussions may not be sufficient
All of the other areas focus on how we process data, not on which data we should process.

what subtopics, resources and related work you may discuss in the area page

Active learning
Active search
Core-set selection
Cooperative learning
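
To make the core-set selection subtopic concrete, here is a minimal sketch of one common heuristic, greedy k-center selection, which repeatedly picks the point farthest from everything chosen so far. It is an illustration only, not the method of the paper cited above; the function name `k_center_greedy`, its parameters, and the toy points are all hypothetical.

```python
import math


def k_center_greedy(points, budget, seed_index=0):
    """Pick `budget` indices so that every point is close to some picked point.

    Hypothetical helper for illustration: starts from `seed_index` and
    repeatedly adds the point farthest from the current selection.
    """
    if not 0 < budget <= len(points):
        raise ValueError("budget must be between 1 and len(points)")

    selected = [seed_index]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[seed_index]) for p in points]

    while len(selected) < budget:
        # The least-covered point is the one farthest from the selection.
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        # Adding a point can only shrink each nearest-selected distance.
        for i, p in enumerate(points):
            d = math.dist(p, points[next_index])
            if d < min_dist[i]:
                min_dist[i] = d

    return selected


# Toy data (made up): a dense "common concept" cluster near the origin,
# plus a few isolated points standing in for rare concepts.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5), (5.1, 5), (10, 0)]
chosen = k_center_greedy(points, budget=3)
```

With a budget of three, the selection starts in the dense cluster and then moves to the isolated points rather than spending more of the budget on near-duplicates of the first pick. That matches the motivation above: coverage of the long tail matters more than proportional sampling of common concepts.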
