ner-furniture-names's Introduction

Veridion Challenge 2

Project: Furniture Stores Extraction

Goal

Develop a model capable of extracting product names from furniture store websites.

Inputs

  • A list of URLs from furniture store sites.

Outputs

  • A list of product names extracted from each URL.

Insights

Veridion describes its product as the most comprehensive database of company data, gathered by AI with human precision.

Upon downloading a data sample, I needed to clarify whether the product names to extract were specific ("Hamar Plant Stand") or generic ("Plant Stand"). Inspection of the data sample led to the conclusion that "Plant Stand" is the target.

Veridion Data Sample: Data Dictionary - Product & Services

Veridion Data Sample: Products & Services Sample

This challenge offers an opportunity to improve the extraction process, as some product names are currently not captured correctly.

Veridion Data Sample: Products & Services Sample - Wrong product name example

Veridion Entity Recognizers: the basis for building the model to identify 'PRODUCT' entities.

Guidelines

  1. Create an NER (Named Entity Recognition) model.
  2. Train the NER model to find 'PRODUCT' entities.
  3. Use ~100 pages from the URLs list for training data.
  4. Develop a method to tag sample products.
  5. Use the model to extract product names from unseen pages.
  6. Showcase the solution.
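
Guideline 4 asks for a method to tag sample products. One possible approach, shown here only as a minimal sketch, is to match a list of known product names against the page text and emit BIO tags for a single PRODUCT entity type. The function names (`tag_tokens`, `make_example`), the label order in `LABELS`, and the whitespace tokenization are assumptions made for illustration; they are not taken from the project's own scripts.

```python
from typing import Dict, List

# Assumed label scheme: one PRODUCT entity type in BIO format.
LABELS = ["O", "B-PRODUCT", "I-PRODUCT"]


def tag_tokens(tokens: List[str], product_names: List[str]) -> List[str]:
    """Assign BIO tags by matching known product names against the tokens."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Try longer names first so "plant stand" wins over "stand".
    candidates = sorted(
        (name.lower().split() for name in product_names), key=len, reverse=True
    )
    for name_tokens in candidates:
        n = len(name_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            # Skip windows that overlap an already tagged span.
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if window_free and lowered[start:start + n] == name_tokens:
                tags[start] = "B-PRODUCT"
                for i in range(start + 1, start + n):
                    tags[i] = "I-PRODUCT"
    return tags


def make_example(text: str, product_names: List[str]) -> Dict[str, List]:
    """Build one record with parallel `tokens` and integer `ner_tags` lists."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    tags = tag_tokens(tokens, product_names)
    return {"tokens": tokens, "ner_tags": [LABELS.index(t) for t in tags]}


example = make_example("Hamar Plant Stand in solid oak", ["Plant Stand", "Stand"])
print(example)
```

Note that in this sketch the generic name ("Plant Stand") is tagged while the brand-like prefix ("Hamar") stays outside the entity, which matches the target chosen in the Insights section. Whitespace splitting will miss names that are attached to punctuation, so a real pipeline would likely need a more careful tokenizer.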

The Process

  1. URL Verification:

  2. Data Scraping:

  3. Data Cleaning:

  4. Data Organization:

  5. Text Annotation:

    • Annotated the text with ner_tags.py, using product_names.txt and extracted_product_data.csv as inputs and following a structure inspired by the wnut17 dataset.
  6. Data Splitting:

  7. Model Training:

  8. Model Testing and Solution Showcase:

    • Used the fine-tuned model to extract product names from the valid URLs and plotted graphs of the extracted products (testing_ner.ipynb).
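
The last step turns per-token predictions into product names and summarizes them. The sketch below assumes the model's output has already been reduced to parallel lists of tokens and BIO tag strings ("O", "B-PRODUCT", "I-PRODUCT"); the `decode_products` helper, the example URLs, and the hard-coded predictions are hypothetical stand-ins, not the contents of testing_ner.ipynb.

```python
from collections import Counter
from typing import List


def decode_products(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect PRODUCT spans from parallel token and BIO-tag lists."""
    products: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-PRODUCT":
            # A new span starts; close the previous one if it is still open.
            if current:
                products.append(" ".join(current))
            current = [token]
        elif tag == "I-PRODUCT" and current:
            current.append(token)
        else:
            # "O", or a stray "I-PRODUCT" with no open span, closes any span.
            if current:
                products.append(" ".join(current))
            current = []
    if current:
        products.append(" ".join(current))
    return products


# Hypothetical predictions for two pages; real ones would come from the model.
pages = {
    "https://example.com/a": (
        ["Oak", "Plant", "Stand", "and", "Sofa"],
        ["O", "B-PRODUCT", "I-PRODUCT", "O", "B-PRODUCT"],
    ),
    "https://example.com/b": (["Sofa", "sale"], ["B-PRODUCT", "O"]),
}

names_by_url = {url: decode_products(toks, tags) for url, (toks, tags) in pages.items()}
# Case-insensitive frequency table, usable as input for a bar chart.
counts = Counter(name.lower() for names in names_by_url.values() for name in names)

print(names_by_url)
print(counts.most_common())
```

The `names_by_url` mapping has the shape described under Outputs (a list of product names per URL), and `counts` is the kind of aggregate a frequency graph could be drawn from.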

The Model and the Dataset

The model and the dataset are both published on Hugging Face.

Takeaways

  • created my first dataset from scratch;
  • fine-tuned my first LLM;
  • deployed both on Hugging Face;
  • applied to my first machine learning internship;
  • gained confidence working regularly with bash, vim, hf, and different types of data;
  • understood how fine-tuning works for NER;
  • understood how LLMs process data.

Contributors

cetusian
