

jobbuzz's Issues

Scrape company information

Currently only the company name is saved.

Change it so that the following information is also scraped and saved.

  • Logo
  • General information
  • Contact
  • etc
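
As a rough sketch of where this could go, the Company model might grow along these lines (every field beyond the name is hypothetical, not existing code):

type Company struct {
	gorm.Model
	Name        string `json:"name"`
	LogoURL     string `json:"logo_url"`    // scraped logo image URL
	Description string `json:"description"` // general information
	Contact     string `json:"contact"`     // contact details
}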

JSON returned is capitalised

If we use the following:

type Job struct {
	gorm.Model
	Title    string
	Company  string
	Salary   string
	Location string
}

It will return the following JSON from the API:

{
  "ID": 1,
  "CreatedAt": "foo",
  "UpdatedAt": "foo",
  "DeletedAt": "foo",
  "Title": "foo",
  "Company": "foo",
  "Salary": "foo",
  "Location": "foo"
}

Annotating the struct with json:"field" tags won't lowercase the fields embedded from gorm.Model:

type Job struct {
	gorm.Model
	Title    string `json:"title"`
	Company  string `json:"company"`
	Salary   string `json:"salary"`
	Location string `json:"location"`
}

It will return the following JSON from the API:

{
  "ID": 1,
  "CreatedAt": "foo",
  "UpdatedAt": "foo",
  "DeletedAt": "foo",
  "title": "foo",
  "company": "foo",
  "salary": "foo",
  "location": "foo"
}

To lowercase all the fields, we can't embed gorm.Model; we have to declare its fields ourselves with JSON tags:

type Job struct {
	ID        uint           `gorm:"primarykey" json:"id"`
	CreatedAt time.Time      `json:"created_at"`
	UpdatedAt time.Time      `json:"updated_at"`
	DeletedAt gorm.DeletedAt `gorm:"index" json:"deleted_at"`
	Title     string         `json:"title"`
	Company   string         `json:"company"`
	Salary    string         `json:"salary"`
	Location  string         `json:"location"`
}

It will return the following JSON from the API:

{
  "id": 1,
  "created_at": "foo",
  "updated_at": "foo",
  "deleted_at": "foo",
  "title": "foo",
  "company": "foo",
  "salary": "foo",
  "location": "foo"
}

I prefer the above JSON fields since that's the common convention in API responses. What do you think?

Job notification subscription feature

  • User should be able to specify filter parameters

    • keywords
    • location
    • salary
  • Notify daily?

  • Notification medium

    • web push
    • email
    • app push (our own standalone app or something like Pushover?)
  • Create database schema

  • Create system architecture

  • Create sequence diagram

  • Implement
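
For the database schema item above, a rough first sketch of a subscription model, assuming GORM as in the rest of the project (every field here is hypothetical):

type Subscription struct {
	gorm.Model
	UserID    uint   `json:"user_id"`    // owner of the subscription
	Keywords  string `json:"keywords"`   // e.g. comma-separated search terms
	Location  string `json:"location"`
	MinSalary int    `json:"min_salary"` // lower bound for the salary filter
	Medium    string `json:"medium"`     // "web_push", "email" or "app_push"
}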

Refactor scraper to be modular and testable

Separate logic from external dependencies.
External dependencies should have an interface.

Have a top-level function which handles the state and calls smaller functions to fetch more data at each stage.

  • Get job links and company links and add to map
  • Get job details
  • Get company details
  • Return results

WaitGroup details should be hidden in the implementation.
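
A minimal sketch of the shape this could take, assuming the Job type above and a sync import; the Fetcher interface and both function names are hypothetical, and the company-details stage would follow the same pattern:

// Fetcher abstracts the external dependency (HTTP, go-colly, ...)
// so the orchestration logic can be tested against a fake.
type Fetcher interface {
	FetchJobLinks() ([]string, error)
	FetchJobDetails(link string) (Job, error)
}

// ScrapeAll is the top-level function: it owns the state and the
// WaitGroup, so callers never see the concurrency details.
func ScrapeAll(f Fetcher) ([]Job, error) {
	links, err := f.FetchJobLinks()
	if err != nil {
		return nil, err
	}

	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		jobs []Job
	)
	for _, link := range links {
		wg.Add(1)
		go func(link string) {
			defer wg.Done()
			job, err := f.FetchJobDetails(link)
			if err != nil {
				return // per-link errors could be collected here instead
			}
			mu.Lock()
			jobs = append(jobs, job)
			mu.Unlock()
		}(link)
	}
	wg.Wait()
	return jobs, nil
}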

Changing scraper logic to be more resilient

Currently, with go-colly, there is a potential issue where if one page fails to load the whole data set is considered corrupted, because we need the complete set of data to determine which job listings are active or inactive.

go-colly has no retry functionality, and its error handling is not very useful.

I think it might be better for us to fetch the HTML as a string (where we can have our own retry logic) and then use an HTML parser to process the data instead.

This will be more similar to the logic of the scraper in the .NET version.

Get HTML nodes in Go with CSS selectors: https://github.com/PuerkitoBio/goquery

Retry: https://github.com/avast/retry-go
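
A hedged sketch of how those two libraries could fit together; fetchDocument is a hypothetical name, and the caller would pass whatever URL and CSS selectors the scraper needs:

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
	"github.com/avast/retry-go"
)

// fetchDocument downloads one page with our own retry logic, then
// hands the HTML to goquery for CSS-selector based processing.
func fetchDocument(url string) (*goquery.Document, error) {
	var doc *goquery.Document
	err := retry.Do(func() error {
		resp, err := http.Get(url)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		d, err := goquery.NewDocumentFromReader(resp.Body)
		if err != nil {
			return err
		}
		doc = d
		return nil
	}, retry.Attempts(3))
	return doc, err
}

A page that still fails after all retries can then be handled explicitly instead of silently corrupting the whole data set.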

Job uniqueness

When we run the scraper cmd, it automatically creates new jobs based on the jobs returned by the scrapers. Sometimes a job already exists in the DB; how do we prevent it from being inserted again?

One idea: store the job links in the DB as well, then query by link before inserting the job.
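
A sketch of that idea with GORM, assuming a hypothetical Link field on the Job model:

// saveJob only inserts when no row with the same link exists, so
// re-running the scraper won't duplicate jobs already in the DB.
func saveJob(db *gorm.DB, job Job) error {
	return db.Where(Job{Link: job.Link}).FirstOrCreate(&job).Error
}

Pairing this with a unique index on the link column (gorm:"uniqueIndex") would also guard against races between concurrent inserts.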

Removal of inactive job listings

The current implementation adds new job listings but does not remove old ones.

There are two ways to do this:

  1. If the job page becomes inaccessible once the listing is no longer valid, then checking each stored page regularly and marking it appropriately should be fine.
  2. During the scraper job, scrape all job listings, compare them with all entries in the database, and mark the missing ones as inactive (see the sketch below).
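
For option 2, a sketch of the comparison step, assuming hypothetical Link and Active fields on the Job model and that scrapedLinks holds the complete set from the latest run:

// markInactive flags every stored job whose link did not appear in
// the latest full scrape. This only works when scrapedLinks is the
// complete set, which is why a partially failed scrape is unusable.
func markInactive(db *gorm.DB, scrapedLinks []string) error {
	return db.Model(&Job{}).
		Where("link NOT IN ?", scrapedLinks).
		Update("active", false).Error
}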

DB Seeder

It's not an enjoyable development experience to keep scraping the websites, especially with an unreliable internet connection. The suggestion is to have another cmd programme that seeds data into the DB. We can export an SQL file from existing data and create a new cmd programme that imports the SQL file.
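
A minimal sketch of such a seeder cmd, assuming GORM with a MySQL driver (the driver, DSN, and file path are placeholders for whatever the project actually uses):

package main

import (
	"log"
	"os"

	"gorm.io/driver/mysql"
	"gorm.io/gorm"
)

func main() {
	// Placeholder DSN; multiStatements=true lets a single Exec call
	// replay a dump containing many statements.
	dsn := "user:pass@tcp(localhost:3306)/jobbuzz?multiStatements=true"
	db, err := gorm.Open(mysql.Open(dsn), &gorm.Config{})
	if err != nil {
		log.Fatal(err)
	}

	// Read the exported SQL dump and replay it against the DB.
	dump, err := os.ReadFile("seed.sql")
	if err != nil {
		log.Fatal(err)
	}
	if err := db.Exec(string(dump)).Error; err != nil {
		log.Fatal(err)
	}
	log.Println("seeding complete")
}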
