data-centric-ai's People

Contributors

ad12, aisecure, alvinmingwisc, anarayan, bhancock8, codyaustun, danfu09, eashanadhikarla, hendrycks, huaizhengzhang, ivan-zhou, jdunnmon, khaledsaab, krandiash, lorr1, luisoala, mattvilim, maya124, mayeechen, michael-aloys, mleszczy, mzio, nelsoncardenas, oliviasheng, rengglic, rlnsanz, robiriondo, seyuboglu, w4nderlust, zhangce

data-centric-ai's Issues

[Area Page]

Use this template to propose a new area page. A good rule of thumb for deciding whether a topic merits an area page is whether it would pass muster as a workshop at a machine learning venue (e.g., ICML).

Please address the following questions when raising this issue:

  • what story you might tell about the topic's importance to data-centric AI
  • whether this topic is related to other areas in data-centric AI, and why existing discussions may not be sufficient
  • what subtopics, resources and related work you may discuss in the area page

If your request to add the new area is approved, you can submit a PR with the following changes:

  • add an additional area in README.md
  • add an area page with relevant discussion

Dataframe preprocessing platform:

Automunge is a library for encoding dataframes for supervised learning. Originally built as a resource for missing data infill, the library evolved to include a unique and refined API for engineering sets of feature transformations, including some novel approaches for integrating stochasticity into supervised learning training or inference. The library has been shared in workshops at venues like NeurIPS, ICML, and ICLR. We suggest the tutorials folder in the GitHub repository as a starting point; full, comprehensive documentation is provided in the README file. Further write-ups are available in the arXiv papers authored by the developer (Nicholas J. Teague) and linked from automunge.com.

[Area Page] Data Selection

Data selection methods, such as active learning and core-set selection, are useful and important tools for machine learning on large datasets. Major AI/ML conferences such as NeurIPS and ICML have consistently featured workshops and tutorials on these topics.

what story you might tell about the topic's importance to data-centric AI
Large-scale unlabeled datasets can contain millions or billions of examples covering a wide variety of underlying concepts. Yet these massive datasets often skew towards a relatively small number of common concepts, for example ‘cats’, ‘dogs’, and ‘people’. Rare concepts, such as ‘harbor seals’, tend to appear in only a small fraction of the data (usually less than 1%). However, performance on these rare concepts is critical in many settings. For example, harmful or malicious content may comprise only a small percentage of user-generated content, but it can have a disproportionate impact on the overall user experience. Similarly, when debugging model behavior for safety-critical applications like autonomous vehicles, or when dealing with representational biases in models, obtaining data that captures rare concepts allows machine learning practitioners to combat blind spots in model performance. Even a simple task, such as stop sign detection by an autonomous vehicle, can be difficult due to the diversity of real-world data. Stop signs may appear in a variety of conditions (e.g., on a wall or held by a person), can be heavily occluded, or have modifiers (e.g., “Except Right Turn”). Large-scale datasets are essential but not sufficient; finding the relevant examples for these long-tail tasks is challenging. Data selection methods, such as active learning, active search, and core-set selection, have the potential to automate the process of identifying these rare, high-value data points. (See "Similarity Search for Efficient Active Learning and Search of Rare Concepts" for more detail.)
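To make the core-set selection idea mentioned above a little more concrete, here is a minimal sketch of greedy k-center (farthest-first) selection, one common way to pick a small, diverse subset of examples. It is an illustration only and is not taken from the paper cited above: the function name `k_center_greedy`, the choice of Euclidean distance, and the toy 2-D "embeddings" are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def k_center_greedy(points: Sequence[Sequence[float]], k: int, first: int = 0) -> List[int]:
    """Pick k indices so that every point is close to some selected point.

    Greedy farthest-first traversal: start from `first`, then repeatedly add
    the point whose distance to its nearest selected point is largest.
    (Hypothetical helper written for illustration.)
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    selected = [first]
    # min_dist[i] = distance from point i to its nearest selected point so far.
    min_dist = [math.dist(p, points[first]) for p in points]

    while len(selected) < k:
        # The farthest point is the one the current selection covers worst.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy data (assumed): a dense "common" cluster near the origin plus two rare outliers.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (-4.0, 3.0)]
chosen = k_center_greedy(embeddings, k=3)
print(chosen)
```

In this toy setup the selection reaches the two outlying points right after the starting point, which is the behavior that makes diversity-based selection attractive for surfacing rare concepts. Real pipelines would typically run on learned embeddings and use approximate nearest-neighbor search rather than this quadratic loop.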

whether this topic is related to other areas in data-centric AI, and why existing discussions may not be sufficient
All of the other areas focus on how we process data, not on which data we should process.

what subtopics, resources and related work you may discuss in the area page

[Awesome List] Tools

Catalogue data-centric tools that are related to existing area pages, e.g. tools for data programming, weak supervision, data cleaning, data privacy, robustness, evaluation, monitoring, etc.

[Awesome List] papers :)

[Describe an awesome list that you would like to add or contribute to: what does the contribution require, and what would you add]

《The Re-Label Method For Data-Centric Machine Learning》
《Learning From How Humans Correct》
《Automatic Label Error Correction》
《Re-Label By Data Pattern For Controllable Deep Learning》

Unable to join the Discourse

Dear Concern,
I was trying to access the Discourse for data-centric-ai, but when I click "click to join if you are a new user", I get this:
[image]

[Awesome List] Startups

Catalogue startups that are working to develop data-centric AI solutions, including MLOps tools, ML platforms, data management solutions for ML, etc.
