node start.js
starter.js
: deals with the display part. It uses cli-table library to display aggregation data as disired tableaggregator.js
: retrieves job description and location and aggregate them per category and continentdata_provider.js
: acts as a repository but read from given csv fileslocator.js
: provides a function to get continent from latitude and longitude, as well as an array containing all the continents.
npm test
Unit tests are written using Jest library.
I started naively parsing the csv separating new line symbols and comas but some of the entries requires a further parsing (escaping some characters).
A possible solution, would be to use a parser module, example: CSV module
The locator service currently returns a random continent, which allowed me to implement the display part easily, obviously this need to be changed. Two possibilities for this:
- using an API like geoapify.com or opencagedata.com
- using a library like Geojson-places
I started exploring the second part and pushed it on another branch, but this is not a complete job. Here is a quick look at the changes involved:
locator.js
const geo_json_places = require("geojson-places");
const continents_mapping = {
'EU': 'europe',
'AS': 'asie',
'AF': 'afrique',
'NA': 'amérique du sud',
'SA': 'amérique du nord',
'OC': 'océanie',
'AN': 'antarctique'
};
const get_continent = (lat, lon) => {
const lookup = geo_json_places.lookUp(lat, lon)
if (lookup) {
return continents_mapping[lookup.continent_code]
}
}
aggregator.js
const jobs_per_category_and_continent = async () => {
return Promise.all([
data_provider.read_jobs(),
data_provider.read_professions()
]).then(promises => {
const jobs = promises[0];
const professions = promises[1];
const infos = [];
for (const job of jobs) {
const continent = locator.get_continent(job.office_latitude, job.office_longitude)
if (continent) {
infos.push({
continent: continent,
profession: find_profession(job, professions)
});
} else {
console.error('cannot find location for job: ', JSON.stringify(job));
}
}
const result = {};
infos.forEach(info => {
if (info.profession) {
const category = info.profession.category_name;
if (!result[category]) {
result[category] = {};
}
if (!result['total']) {
result['total'] = {
total: 0
};
}
increment(result[category], 'total');
increment(result[category], info.continent);
increment(result['total'], 'total');
increment(result['total'], info.continent);
}
})
return result;
})
}
Since we are calling a third party service to find the location and this operation can take some times, these calls will eventually have to be parallelized and maybe done in batch to make this more performant and robust.
100 000 000 job offers in our database
1000 new job offers per second
One way would be to delegate the job of finding the continent to a database designed for this purpose (ex. PostGIS)
Then we can have 2 different apps (or worker/thread if we stay in the same application):
- the first one storing jobs in database
- the second one reading jobs in each different regions
For this purpose, we would have to configure:
- the boundaries of the continents in PostGIS
- maybe adding indexes on latitude/longitude would be required (I never worked intensively with this kind of Db)
Another way would be to have 2 applications communicating through a message broker (ex. Kafka, RabbitMQ)
The 1st one would:
- receive new job
- get the location (from an API/module)
- send a message containing job info and location over the broker
The 2nd one would:
- get messages from the broker
- store the info in Db
- provide an aggregation query: this one is simpler since there is no location to find and can be done with a dedicated SQL query.
The 2nd app write and read should be done in 2 different worker/thread to make it more performant, since the load on write part will increase.
With this Db model, we can even have many Db writer and consider this job table as an append-only table.
We could also separate reading and writing in 2 different apps:
The 1st would:
- receive new jobs
- find its location
- store in Db
The 2nd would:
- read Db with a dedicated query
With this architecture, we can scale our applications regarding its usage:
- if we get way more jobs: increase the number of job-receiver app running in parallel
- if we get more queries: increase the number of job-research app running in parallel
This aggregation could be a basis for job research and refinement.
The first aspect we need to design/define is the data we want to show our users:
- it can be people looking for a job
- it can also be an internal team looking for a way to visualize our current data
Depending on the user, we may show more/less info in our JSON model. I will assume it is aimed at people looking for a job.
Our base API endpoint would be:
GET /jobs
But it would not be really RESTful to present an aggregation on this endpoint: it would be more a list of jobs. For example the latest jobs with these default filters:
GET /jobs?page=0&issue_date=desc
===
{
data: [], // array of jobs
page: {
number: 0,
page_count: n,
element_per_page: m
}
}
As the number of jobs is supposed to increase, it seems logical to paginate the results.
We still want to provide a way to show data aggregation:
GET /jobs?view_type=aggregation
As well, we could offer a closer look, per country of a continent for example:
GET /jobs?view_type=aggregation&continent=europe
As we were aggregating per continent, we could also provide jobs listing per location:
GET /jobs?continent=europe
GET /jobs?continent=europe&country=germany
GET /jobs?continent=europe&country=germany&city=berlin
Job category is a useful filter as well:
GET /jobs?category=Tech
The name of the job would also be interesting to compare offer that are similar:
GET /jobs?name="bras droit"
This one would require a match on the name string name ilike 'bras droit'
. We should also be careful at properly escaping characters to avoid SQL injections.