anasaito / skillner

A (smart) rule based NLP module to extract job skills from text

Home Page: https://skillner.vercel.app/

License: MIT License

Python 79.97% Jupyter Notebook 20.03%

ner skills nlp rule-based python skillner spacy

skillner's Introduction

Live demo | Documentation | Website

Just looking to test out SkillNer? Check out our demo.

SkillNer is an NLP module that automatically extracts skills and certifications from unstructured job postings, texts, and applicants' resumes.

SkillNer uses the EMSI database (an open source skill database) as a knowledge base linker to prevent skill duplications.

Installation

It is easy to get started with SkillNer and take advantage of its features.

First, install SkillNer with pip:

pip install skillNer

Next, run the following command to install the spaCy model en_core_web_lg, which is one of SkillNer's main plugins. Thanks to its modular design, you can customize SkillNer's behavior simply by plugging or unplugging modules. Don't worry about these details for now; we discuss them in the upcoming Tutorial section.

python -m spacy download en_core_web_lg

Note: This last installation may take a while, since the spaCy en_core_web_lg model is fairly large (800 MB). However, you only need to do it once.

Example of usage

With these initial steps done, let’s dive a bit deeper into SkillNer through a worked example.

Let’s say you want to extract skills from the following job posting:

“You are a Python developer with a solid experience in web development and can manage projects. 
You quickly adapt to new environments and speak fluently English and French”

Annotating skills

We start by importing the required modules, in particular spacy and SkillExtractor. Note that if you are using SkillNer for the first time, it might take a while to download SKILL_DB.

SKILL_DB is SkillNer's default skills database. It was built upon the EMSI skills database.

# imports
import spacy
from spacy.matcher import PhraseMatcher

# load default skills data base
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

# extract skills from job_description
job_description = """
You are a Python developer with a solid experience in web development
and can manage projects. You quickly adapt to new environments
and speak fluently English and French
"""

annotations = skill_extractor.annotate(job_description)

Exploit annotations

Voilà! Now you can inspect results by rendering the text with the annotated skills. You can achieve that through the .describe method. Note that the output of this method is literally an HTML document that gets rendered in your notebook.

Besides, you can use the raw result of the annotations. Below is the value of the annotations variable from the code above.

# output
{
    'text': 'you are a python developer with a solid experience in web development and can manage projects you quickly adapt to new environments and speak fluently english and french',
    'results': {
        'full_matches': [
            {
                'skill_id': 'KS122Z36QK3N5097B5JH', 
                'doc_node_value': 'web development', 
                'score': 1, 'doc_node_id': [10, 11]
            }
        ],
        'ngram_scored': [
            {
                'skill_id': 'KS125LS6N7WP4S6SFTCK', 
                'doc_node_id': [3], 
                'doc_node_value': 'python', 
                'type': 'fullUni', 
                'score': 1, 
                'len': 1
            }, 
        # the other annotated skills
        # ...
        ]
    }
}
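
The raw dictionary groups matches under separate keys, so a small helper can merge them into one flat list. The following is a minimal sketch, not part of SkillNer: `flatten_annotations`, `min_score`, and the `match_group` key are hypothetical names introduced here, and the sample data is copied from the output above.

```python
def flatten_annotations(annotations, min_score=0.0):
    """Merge all match groups into one list, tagging each match with its group name."""
    flat = []
    for group, matches in annotations["results"].items():
        for match in matches:
            if match["score"] >= min_score:
                # 'match_group' is a hypothetical key added by this sketch
                flat.append({**match, "match_group": group})
    # order matches by the position of their first token
    return sorted(flat, key=lambda m: m["doc_node_id"][0])


# sample copied from the output shown above (only the 'results' part is needed)
sample = {
    "results": {
        "full_matches": [
            {"skill_id": "KS122Z36QK3N5097B5JH", "doc_node_value": "web development",
             "score": 1, "doc_node_id": [10, 11]},
        ],
        "ngram_scored": [
            {"skill_id": "KS125LS6N7WP4S6SFTCK", "doc_node_id": [3],
             "doc_node_value": "python", "type": "fullUni", "score": 1, "len": 1},
        ],
    }
}

skills = flatten_annotations(sample)
print([s["doc_node_value"] for s in skills])  # ['python', 'web development']
```

With a real run, the same helper would be called on the `annotations` variable returned by `skill_extractor.annotate`.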

Contribute

SkillNer is the first open-source skill extractor. It is a tool dedicated to the community and relies on community contributions to evolve.

We did our best to make SkillNer usable and have fixed many of its bugs, so we believe its key features make it ready for a variety of use cases. However, it is not yet fully stable. SkillNer needs the community's help to be adapted further and to broaden its usage.

You can contribute to SkillNer by:

Reporting issues. If you encounter a problem while using SkillNer, do not hesitate to report it in the issue section of our GitHub repository. You can also open an issue to suggest new features.
Pushing code to our repository through pull requests, if you have fixed an issue or want to extend SkillNer's features.
A third (friendly, non-technical) way to contribute to SkillNer will be released soon. So, stay tuned...

Finally, make sure to read our guidelines carefully before contributing. They specify the standards to follow so that we can understand what you want to say.

They will also help you set up SkillNer on your local machine if you want to push code.

Useful links

Visit our website to learn about SkillNer features, how it works, and particularly explore our roadmap
Get started with SkillNer and get to know its API by visiting the Documentation
Test our Demo to see some of SkillNer capabilities

skillner's Issues

add displaCy render method for skill extractor class

How to train to adapt to other languages?

For skill texts in other languages supported by spaCy, how should the training be done?

Normalize form of extracted skills

Normalize the form of extracted skills

when calling SkillExtractor on an input text, we get a dict with the following form:

{
    "full_match": [],  # arr of skills
    "ngram_full_match": [],  # arr of skills
    # ...
}

It would be better to have a unified form for the extracted skills. In particular, I suggest the following form:

# example
{
  'skill_id': 'KS1218P5Y0HGBD3Z4L3Q',
  'doc_node_value': 'doc value',
  'score': 0.58,
  'doc_node_id': [12, 13]
}

Following that, a full_match would have a score of 1

new scorer that manages full_uni_matcher, low_form_matcher, and token_matcher conflicts

merge the results of the three matchers into a single result set so that span conflicts can be identified
the span is given to the skill that needs more tokens
example: doc: manage project / skills: project management, management => project management
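
The rule described in this issue (on overlap, the skill covering more tokens wins) can be sketched as follows. This is an illustration only, not SkillNer's actual scorer: `resolve_span_conflicts`, the `prefer` parameter, and the `skill_name` key are hypothetical names, and the candidates are assumed to be the merged output of the three matchers.

```python
def resolve_span_conflicts(candidates, prefer=None):
    """Keep non-overlapping candidates; when spans overlap, the preferred one wins.

    candidates: list of dicts holding 'doc_node_id' (token indices of the span).
    prefer: ranking function; by default a span with more tokens ranks higher.
    """
    if prefer is None:
        prefer = lambda cand: len(cand["doc_node_id"])
    taken = set()  # token indices already claimed by a kept candidate
    kept = []
    for cand in sorted(candidates, key=prefer, reverse=True):
        tokens = set(cand["doc_node_id"])
        if tokens.isdisjoint(taken):
            kept.append(cand)
            taken |= tokens
    # return winners in document order
    return sorted(kept, key=lambda cand: cand["doc_node_id"][0])


# doc: "manage project" -> token 0 is "manage", token 1 is "project"
candidates = [
    {"skill_name": "management", "doc_node_id": [0]},
    {"skill_name": "project management", "doc_node_id": [0, 1]},
]
winners = resolve_span_conflicts(candidates)
print([w["skill_name"] for w in winners])  # ['project management']
```

Passing a different `prefer` function (for example one that ranks by score) would change which skill takes a contested span.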

Span conflict detector should neglect stop words in span construction

Is your feature request related to a problem? Please describe.

Stop words in the sentence fragment potential spans, which makes detecting long skills impossible

Actual behavior
Wanted behavior

Describe the solution you'd like

the solution resides in making the span detector ignore stop words when creating spans
this could be an optional prop when initializing the skillExtractor (span_detector_ignore_stop_words=True)

Is it possible to extract all the skill names into a list? as well as if it's a hard, soft skill or certification?

annotations = skill_extractor.annotate(job_description)
skill_extractor.describe(annotations)

These two lines of code annotate the text visually. I tried the following to extract all the skills into a list:
doc_node_values = list(set([entry['doc_node_value'] for entry in annotations['results']['ngram_scored']]))
print(doc_node_values)

is there a way to also extract the skill label? (e.g., soft, hard, certification etc.)?

thank you
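
One possible approach, sketched under assumptions: elsewhere on this page SKILL_DB is indexed by skill id and its entries carry `skill_name` and `skill_type` fields, so the label can be looked up from each match's `skill_id`. The helper `list_skills` is a hypothetical name, and the toy database below holds a single entry quoted from another issue on this page; ids missing from the database are simply skipped.

```python
def list_skills(annotations, skill_db):
    """Return sorted unique (skill name, skill type) pairs for the annotated skills."""
    found = set()
    for matches in annotations["results"].values():
        for match in matches:
            entry = skill_db.get(match["skill_id"])
            if entry is not None:  # skip ids that are not in the database
                found.add((entry["skill_name"], entry["skill_type"]))
    return sorted(found)


# toy stand-ins so the sketch runs without spaCy or the real SKILL_DB
toy_db = {
    "KS1201Q70VWZPS6KTMFR": {
        "skill_name": "3GPP2 (Telecommunication)",
        "skill_type": "Hard Skill",
    }
}
toy_annotations = {
    "results": {
        "full_matches": [
            {"skill_id": "KS1201Q70VWZPS6KTMFR", "doc_node_value": "3gpp2",
             "score": 1, "doc_node_id": [0]},
        ],
        "ngram_scored": [
            {"skill_id": "NOT_IN_DB", "doc_node_value": "foo",
             "score": 1, "doc_node_id": [2]},
        ],
    }
}

print(list_skills(toy_annotations, toy_db))
```

With the real objects, the call would be `list_skills(annotations, SKILL_DB)`.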

Update documentation after strictly requiring "en_core_web_lg"

enhance readme

see https://github.com/KennethEnevoldsen/augmenty/blob/master/readme.md for ideas

Make text cleaning optional.

Is your feature request related to a problem? Please describe.
The cleaning of the text makes it impossible to link annotated spans to the character indices of the original text. This in turn makes it impossible to compare the performance of this model to other ner models.

Describe the solution you'd like
Make the text cleaning step optional. When the cleaning step is omitted, then abv_text == immutable_text.

Describe alternatives you've considered
Provide additional metadata containing the start and end character indices of each annotated span linked to the original text, in addition to the boundaries linked to the cleaned text.
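
To illustrate the alternative, here is a rough sketch of how token indices could be mapped back to character offsets in the original text. It assumes, loudly, that token indices line up with the whitespace-delimited tokens of the original string, which only holds if cleaning neither adds nor removes tokens. `token_offsets` and `span_to_chars` are hypothetical helpers, not part of SkillNer.

```python
import re


def token_offsets(text):
    """Return (token, start, end) for each whitespace-delimited token of the original text."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def span_to_chars(offsets, token_ids):
    """Character range covering a list of consecutive token indices."""
    return offsets[token_ids[0]][1], offsets[token_ids[-1]][2]


original = "Strong Web Development skills"
offsets = token_offsets(original)
start, end = span_to_chars(offsets, [1, 2])
print(start, end, original[start:end])  # 7 22 Web Development
```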

Use all unique words for ngram ratio score

Use one hot vector to get words then set on them

span conflict extractor should skip stop words in corpus construction

having stop words as tokens in corpus construction may create different spans for the same skill
example: Business-To-Business Marketing -> Business | Business Marketing

track uni/2_gram to score them

Rename general params (Upper case) to avoid confusion with usual variables

move lemmed full match to full match module for output normalization

add a stopword-less, lemmatized/stemmed as surface forms for 2>gram skills

Overview
It would probably be reasonable to expect a skill titled "Working as a team" to be extracted from "work in a team" or "work in teams", but it's not, and there are no low surface forms for this skill at all.

Proposal
for skills with Len > 2, can we add a stopword-less, lemmatized/stemmed form of the skill title as a low form? e.g. "work team" will be a low surface form for the skill, and hopefully matched with the stopword-less, lemmatized/stemmed text?
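
A toy sketch of the proposal follows. The stop-word list and the suffix-stripping stemmer are deliberately naive stand-ins (assumptions of this example) for a real stop-word list and lemmatizer; `low_form` is a hypothetical helper, not an existing SkillNer function.

```python
# tiny hypothetical stop-word list, for illustration only
STOP_WORDS = {"a", "an", "as", "in", "of", "the", "to"}


def naive_stem(word):
    """Strip a trailing 'ing' or 's' when at least three characters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def low_form(phrase):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = [t for t in phrase.lower().split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)


print(low_form("Working as a team"))  # work team
print(low_form("work in teams"))      # work team
```

Because the skill title and the text reduce to the same string, a matcher comparing low forms would link them.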

Switch from rule-based to lookup lemmatization

Switch from rule-based to lookup lemmatization in the spaCy pipeline to guarantee matching between db skills and text entities
see : spacy doc / lemmatizer config section

visualise bug when annotation object is empty

Add new feature to Text class: start/end position of words

better score for uni_token using weigh_ratio

SkillNER usage with own db with skill name and aliases

Hi!

Thank you for such a great library!

I have a question - Is it possible to specify aliases of each skill in the current SkillNER implementation? I'd like to configure SkillNER with my own database, where I have root skill names and aliases for them.

Thanks!

Add little README for pip install skillNer

must include:

how to pip install it
how to specify version

BUG - unable to import ``skillner`` in a newly created python environment

Describe the bug
IPython is listed in requirements.txt but not setup.py. This causes import errors when installing skillNer via pip.

To Reproduce
Install skillNER and spacy

python -m pip install skillNER spacy
python -m spacy download en_core_web_lg

Import SkillExtractor

>>> from skillNer.skill_extractor_class import SkillExtractor

which produces the following exception:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/scottgbarnes/venvs/personifi-etl/lib/python3.10/site-packages/skillNer/skill_extractor_class.py", line 11, in <module>
    from skillNer.visualizer.html_elements import DOM, render_phrase
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
  File "/Users/scottgbarnes/venvs/personifi-etl/lib/python3.10/site-packages/skillNer/visualizer/html_elements.py", line 5, in <module>
    from IPython.core.display import HTML
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_import_hook.py", line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
ModuleNotFoundError: No module named 'IPython'

Expected behavior
Should import without error.

Screenshots
N/A

Desktop (please complete the following information):

Python 3.10

Smartphone (please complete the following information):
N/A

Additional context
Installing IPython manually fixes this issue, but dependencies should be handled within the skillNER package. Generally, your requirements.txt and setup.py dependencies list should be identical.
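
One common way to keep the two lists identical is to have setup.py read requirements.txt. The sketch below assumes a simple requirements file (one specifier per line, optional `#` comments, no URLs containing `#`); `parse_requirements` is a hypothetical helper and the sample file content is made up for illustration.

```python
def parse_requirements(text):
    """Return requirement specifiers, skipping blanks, comments, and pip options."""
    requirements = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line and not line.startswith("-"):  # skip options such as "-r other.txt"
            requirements.append(line)
    return requirements


# made-up requirements.txt content
sample = """
# runtime dependencies
spacy
ipython  # needed by the visualizer
"""
print(parse_requirements(sample))  # ['spacy', 'ipython']

# In setup.py this could be wired up as (not executed here):
#   from pathlib import Path
#   setup(..., install_requires=parse_requirements(Path("requirements.txt").read_text()))
```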

Add translation layer

use string distance instead of similarity if empty vectors

spacy similarity : Evaluating Token.similarity based on empty vectors.
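
A minimal sketch of the idea, using plain Python lists in place of spaCy vectors: when either vector is all zeros, cosine similarity is undefined, so the function falls back to a string ratio from the standard library. `robust_similarity` is a hypothetical name, and `difflib.SequenceMatcher` is swapped in here as one possible string measure, not necessarily the one the project would choose.

```python
import math
from difflib import SequenceMatcher


def cosine(a, b):
    """Cosine similarity, or None when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return None
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def robust_similarity(text_a, vec_a, text_b, vec_b):
    """Vector similarity when available, string similarity otherwise."""
    score = cosine(vec_a, vec_b)
    if score is None:  # empty vector: fall back to character-level similarity
        score = SequenceMatcher(None, text_a, text_b).ratio()
    return score


# first token has an empty vector, so the string ratio is used
print(robust_similarity("reactjs", [0.0, 0.0], "react", [0.3, 0.4]))
```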

disable unused modules in nlp spacy [ner component]

test pip packaging for skillner

IndexError: list index out of range

Some strings make the annotate function crash:

import spacy
from spacy.matcher import PhraseMatcher

# load default skills data base
from skillNer.general_params import SKILL_DB
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

# init params of skill extractor
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

skill_extractor.annotate("Learn how to become a professional wedding makeup artist")

If you run the code above you should get the following error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[69], line 1
----> 1 skill_extractor.annotate("Learn how to become a professional wedding makeup artist")

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/skill_extractor_class.py:129, in SkillExtractor.annotate(self, text, tresh)
    123 skills_abv, text_obj = self.skill_getters.get_abv_match_skills(
    124     text_obj, self.matchers['abv_matcher'])
    126 skills_uni_full, text_obj = self.skill_getters.get_full_uni_match_skills(
    127     text_obj, self.matchers['full_uni_matcher'])
--> 129 skills_low_form, text_obj = self.skill_getters.get_low_match_skills(
    130     text_obj, self.matchers['low_form_matcher'])
    132 skills_on_token = self.skill_getters.get_token_match_skills(
    133     text_obj, self.matchers['token_matcher'])
    134 full_sk = skills_full + skills_abv

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/matcher_class.py:332, in SkillsGetter.get_low_match_skills(self, text_obj, matcher)
    329 for match_id, start, end in matcher(doc):
    330     id_ = matcher.vocab.strings[match_id]
--> 332     if text_obj[start].is_matchable:
    333         skills.append({'skill_id': id_+'_lowSurf',
    334                        'doc_node_value': str(doc[start:end]),
    335                        'doc_node_id': list(range(start, end)),
    336                        'type': 'lw_surf'})
    338 return skills, text_obj

File ~/anaconda3/envs/skillner/lib/python3.9/site-packages/skillNer/text_class.py:304, in Text.__getitem__(self, index)
    277 def __getitem__(
    278     self,
    279     index: int
    280 ) -> Word:
    281     """To get the word at the specified position by index
    282 
    283     Parameters
   (...)
    302     english
    303     """
--> 304     return self.list_words[index]

IndexError: list index out of range
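
Until the root cause is fixed, a batch job can be protected with a wrapper that records failing texts instead of stopping. This is a workaround sketch, not a fix: `safe_annotate` is a hypothetical helper, and `_StubExtractor` is a stand-in that imitates the crash so the example runs without spaCy.

```python
def safe_annotate(skill_extractor, texts):
    """Annotate each text, collecting failures instead of stopping at the first IndexError."""
    annotated, failed = [], []
    for text in texts:
        try:
            annotated.append(skill_extractor.annotate(text))
        except IndexError as err:
            failed.append((text, repr(err)))
    return annotated, failed


class _StubExtractor:
    """Stand-in for SkillExtractor that imitates the reported crash."""

    def annotate(self, text):
        if "makeup" in text:
            raise IndexError("list index out of range")
        return {"text": text.lower(), "results": {}}


ok, bad = safe_annotate(
    _StubExtractor(),
    ["Python developer", "Learn how to become a professional wedding makeup artist"],
)
print(len(ok), len(bad))  # 1 1
```

The `failed` list keeps the offending inputs, which is convenient for attaching reproducible examples to a bug report.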

Adapt skillNer for remote use (when installing it via pip)

add surface form support for skills in db

Some skills have unique tokens that can be used as full matches, which reduces the number of sub-matches that need to be scored

Externalize Conflict solver rules as user given config

Is your feature request related to a problem? Please describe.

By default, SkillNer uses internal rules to resolve conflicts between skills that match the same tokens in the user's text. For example, the token management could match both management as a skill and project management; in that case, the internal rule assigns it to the longest skill.

Describe the solution you'd like

It would be preferable to let users resolve this conflict by choosing their own rules

No such file or directory: 'skill_db_relax_20.json' when "from skillNer.skill_extractor_class import SkillExtractor"

I am trying to use skillNer on Ubuntu with Python 3.10.
I have installed the skillNer package and en_core_web_md
using - "pip3 install skillNer"
- "python3 -m spacy download en_core_web_md"

When I tried to import SkillExtractor with "from skillNer.skill_extractor_class import SkillExtractor"

I got the following error:

I have tried putting "skill_db_relax_20.json" in the same folder as general_params.py and changing its permissions.
The error is still the same.

Thanks for helping

abbreviation skills extractor

add module to detect skill abbreviations: CATIA, IT, JS ...

STY - add gh-actions for code linting and docstyle

Issue in extracting skills in long text

Hi All,
I'm trying to extract the skills from a (quite long) job description. The issue is that the extraction takes a couple of minutes. Is there a way to speed it up, maybe by setting some parameters? Or is SkillNER meant to be used only with short texts?

Here is my sample:

import spacy
from spacy.matcher import PhraseMatcher
from skillNer.general_params import SKILL_DB
from skillNer.skill_extractor_class import SkillExtractor
nlp = spacy.load("en_core_web_sm")
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)
job_description = """
Supply Chain Lead Manager - Logistic Warehouse Skip to main content Skip to footer Insights 5G Artificial Intelligence Blockchain Cloud Customer Experience Cybersecurity Digital Engineering & Manufacturing Digital Transformation Edge Computing Future of Work Supply Chain Sustainability Podcasts Blogs Services Application Services Artificial Intelligence Automation Business Process Outsourcing Business Strategy Change Management Cloud Customer Experience Data & Analytics Digital Commerce Digital Engineering & Manufacturing Ecosystem Services Finance Consulting Infrastructure Marketing Mergers & Acquisitions (M&A) Metaverse Operating Models Security Supply Chain Management Sustainability Technology Consulting Technology Innovation Zero-Based Transformation Industries Aerospace and Defense Automotive Banking Capital Markets Chemicals Communications and Media Consumer Goods and Services Energy Health High Tech Industrial Insurance Life Sciences Natural Resources Public Service Retail Software and Platforms Travel Utilities Careers Careers Home Careers Home Search Jobs Search Jobs Join Us Search and Apply Experienced Professionals Entry Level Jobs Students Training & Development Work Environment Executive Leaders Explore Jobs Search Jobs by Areas of Expertise Consulting Jobs Corporate Jobs Digital Jobs Operations Jobs Strategy Jobs Technology Jobs About Accenture Who We Are About Accenture Leadership How We Work with Clients Case Studies Newsroom Investor Relations Inclusion & Diversity Sustainability How We're Organized Strategy & Consulting Song Technology Operations Industry X In Netherlands Accenture Training Centre in the Netherlands Contact Us linkedin twitter facebook instagram-outline return to previous button NAN Current Country: Netherlands Job Search Supply Chain Lead Manager - Logistic Warehouse Multiple Locations +View All JOB NO. R00131305 APPLY NOW Register for Job Alerts SHARE SHARE Job link: Copy Error Thank you for your interest. 
If you wish to apply for a position outside of India, please reach out to your referrer to start a new referral process by referring you for the position in the desired country. You will now be redirected to India Jobs Portal to explore other opportunities within India. Close Job Description Excited to join us at Accenture Belgium and work with around 150 brightest minds, most capable individuals and technologists? Are you keen to work closely and learn from global experts, drive innovation hands-on and shape the future together with a diverse team and our clients to improve the way how Luxembourg works and lives?Accenture is a leading global professional services company, providing a broad range of services in strategy and consulting, interactive, technology and operations, with digital capabilities across all of these services.From transformation planning through blueprinting to implementation, there’s never been a better time to join our SAP team. Accenture global SAP practice grows rapidly by delivering digital roadmap and business case through functional and industry transformation, design thinking, agile development and the latest tech and platforms, including cloud, S/4HANA, SAP Leonardo, IoT, artificial intelligence, big data and mobile to build next-generation solutions.YOUR RESPONSABILITIESDecompose key business problems to identify value areas and structure and implement complex technology solutions for the clientxa0Create innovative and differentiated offerings, staying relevant and in sync with market demandDeliver advisory services in support of technology enabled business change, e.g.: package & third party integrator selection, vendor selection, application studies, solution architecture studies and/or program diagnostics and recovery.Deliver advisory services to the IT function: CIO/CTO/CDOGrow market share by leveraging relationships, winning work, and being integral to delivery of on-shore consulting engagementsSell on an individual basis or as a 
part of a team Show more Show less Qualifications Master’s degree from a leading university or business school or equal by experienceAt least 2 years of hands-on SAP experience in any SAP Supply Chain module (TM/LE, PP/QM, (E)WM, Ariba, MM, PM, …)At least 1 full life-cycle implementation experience: including estimating, planning and delivering a solution end-to-endExperience in delivering an end-to-end S/4HANA or SAP ECC project (S/4HANA implementation is considered as a key plus)Stakeholder management within complex SAP programsPassionate about SAP, innovation and delivery the latest technologies to your clients.An SAP certification in your areaYou shine with your deep business & industry experience in one or more of the industries, such as: mining & metal, energy & utilities, public services, health, retail & consumer goods, aviation, media, manufacturingEagerness to shape and work in a team-oriented environmentMotivated, persistent, eager to optimize, drive for excellence Locations Amsterdam,Brussels,London,Madrid Equal Employment Opportunity Statement All employment decisions shall be made without regard to age, race, creed, color, religion, sex, national origin, ancestry, disability status, veteran status, sexual orientation, gender identity or expression, genetic information, marital status, citizenship status or any other basis as protected by federal, state, or local law. Job candidates will not be obligated to disclose sealed or expunged records of conviction or arrest as part of the hiring process. Accenture is committed to providing veteran employment opportunities to our service men and women. Please read Accenture’s Recruiting and Hiring Statement for more information on how we process your data during the Recruiting and Hiring process. APPLY NOW Register for Job Alerts COVID-19 update:xa0 The safety and well-being of our candidates, our people and their families continues to be a top priority. 
Until travel restrictions change, interviews will continue to be conducted virtually.xa0 Share Related Jobs Multiple Locations SAP Data Migration Senior Consultant Business & Technology Integration Posted 1 day ago Netherlands Amsterdam Industry X - Experienced Consultant Business & Technology Integration Posted 2 days ago Netherlands Amsterdam Cloud Solution Architect Business & Technology Integration Posted 2 days ago View More Jobs Life at Accenture Work where you're inspired to explore your passions and where your talents are nurtured and cultivated. Innovate with leading-edge technologies on some of the coolest projects you can imagine. Training and Development Take time away to learn and learn all the time in our regional learning hubs, connected classrooms, online courses and learning boards. LEARN MORE Work Environment Be your best every day in a work environment that helps drive innovation in everything you do. LEARN MORE View All View Less Learn more about Accenture Our more than 700,000 people in more than 120 countries, combine unmatched experience and specialized skills across more than 40 industries. We embrace the power of change to create value and shared success for our clients, people, shareholders, partners and communities. Our Expertise See how we embrace the power of change to create value and shared success for our clients, people, shareholders, partners and communities. FIND OUT MORE Meet Our People From entry-level to leadership, across all business and industry segments, get to know our people harnessing technology to make a difference, every day. FIND OUT MORE View All View Less Stay connected Join Our Team Search open positions that match your skills and interest. We look for passionate, curious, creative and solution-driven team players. SEARCH ACCENTURE JOBS Keep Up to Date Stay ahead with careers tips, insider perspectives, and industry-leading insights you can put to use today–all from the people who work here. 
READ CAREERS BLOG Job Alert Emails Personalize your subscription to receive job alerts, latest news and insider tips tailored to your preferences. See what exciting and rewarding opportunities await. REGISTER FOR JOB ALERTS View All View Less About Us Contact Us Alumni PRIVACY STATEMENT instagram linkedin twitter facebook youtube Recruiting and Hiring Privacy Statement Terms & Conditions Cookie Policy Accessibility Statement Sitemap Global Meritocracy 2022 Accenture. All Rights Reserved. Okay Cancel Close View Transcript Close First Name field empty Valid Entry The first name is required and cannot be empty Last Name field empty validation error Valid Entry The last name is required and cannot be empty E-mail Address field empty validation error This email address is already in use Valid Entry This value is not valid This value is not valid This email address is already in use. Comments 2000 characters Field Empty Input text here Valid Entry Invalid Entry 2000 characters This value is not valid This value is not valid Send E-mail Cancel Close There is already a separate, active Accenture Careers account with the same email address as your LinkedIn account email address. Please try logging in with your registered email address and password. You can then update your LinkedIn sign-in connection through the Edit Profile section. Continue Cancel
"""

annotations = skill_extractor.annotate(job_description)
print(annotations)
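
One thing that could be tried is splitting the text into smaller pieces and annotating each piece separately. Whether this actually reduces the run time has not been verified here, and a skill that straddles a chunk boundary would be missed. `chunk_words` and `annotate_in_chunks` are hypothetical helpers; only `annotate` comes from the code above.

```python
def chunk_words(text, max_words=100):
    """Split text into consecutive chunks of at most max_words whitespace-delimited words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def annotate_in_chunks(skill_extractor, text, max_words=100):
    """Annotate chunk by chunk; token ids in each result are relative to its own chunk."""
    return [skill_extractor.annotate(chunk) for chunk in chunk_words(text, max_words)]


print(chunk_words("a b c d e", max_words=2))  # ['a b', 'c d', 'e']
```

With the objects defined above, the call would be `annotate_in_chunks(skill_extractor, job_description)`.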

IndexError while annotate certain skill

Describe the bug
Calling skill_extractor.annotate(text) on text containing certain skills raises an error.

To Reproduce
Here are some examples to reproduce the behavior:

text= "IDS/IPS sensors are collecting"
skill_extractor.annotate(text)

The error mainly results from the skill "IDS"

Remove entirely loading nlp in skillner

French Language support

Does it support descriptions written in french ?

uni tool skills should use links metadata for context instead of similarity

update README for launch

Issue when creating skill db

Following how_new_db.md, I created a custom skill_db_relax_20.json.
But there are slight differences between my version and yours.
For example, in my version I got
"KS1201Q70VWZPS6KTMFR": {
    "skill_name": "3GPP2 (Telecommunication)",
    "skill_type": "Specialized Skill",
    "skill_len": 2,
    "high_surfce_forms": { "full": "3gpp2 telecommun" },
    "low_surface_forms": [ "3gpp2 telecommun", "telecommun 3gpp2" ],
    "match_on_tokens": false
}
while in your version:
"KS1201Q70VWZPS6KTMFR": {
    "skill_name": "3GPP2 (Telecommunication)",
    "skill_type": "Hard Skill",
    "skill_len": 1,
    "high_surfce_forms": { "full": "3gpp2" },
    "low_surface_forms": [],
    "match_on_tokens": false
}
With the classic stemming approach, high_surfce_forms comes out as in my version, whereas your version has the correct abbreviation form. I also see that you use abv = SKILL_DB[key]['abbreviation'], but the EMSI endpoint never returns any abbreviation information.
I'm wondering whether you retrieve this information from somewhere else, or whether it is manual work.

cleaning bug change token order

a more robust similarity scorer

The current similarity module uses the average of the doc's word vectors to compute similarity; we need a more robust similarity scorer
Check sen2vec to train a personalized similarity scorer

2_gram tools should use word distance instead of unmatch sim in scoring

update skills db to support unique token identifier

2gram distribution data should be cached

unify full matchers in one matcher (full name, stemmed full name, abrv, unique token matcher)

some skills are mis-cleaned, which makes them undetectable

react.js -> reactjs / node.js -> nodejs
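
One way the cleaning step could avoid this is to strip punctuation while keeping dots that sit between two word characters. The following is only a sketch with a hypothetical function name; it does not reproduce SkillNer's cleaner, and whether the skill database would then match forms such as "react.js" is an open assumption. Note that it would also keep the dot in text like "end.Next" where a space is missing after a sentence.

```python
import re


def clean_keep_inner_dots(text):
    """Lowercase and drop punctuation, but keep dots surrounded by word characters."""
    text = text.lower()
    # replace everything except word characters, whitespace, and dots
    text = re.sub(r"[^\w\s.]", " ", text)
    # replace dots that are not between two word characters
    text = re.sub(r"(?<!\w)\.|\.(?!\w)", " ", text)
    # collapse repeated whitespace
    return " ".join(text.split())


print(clean_keep_inner_dots("Experience with React.js, Node.js and SQL."))
# experience with react.js node.js and sql
```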

add module for ngram and unigram conflicts

Support for Extracting Skills from Custom Skill Lists

Hi 👋 Thanks for this great repo--I really liked how smart the tool is, especially being able to extract "Project Management" from the phrase "manage projects". I'd love to hear what you think about the following use case:

Is your feature request related to a problem? Please describe.
I am looking to use your tool with a custom skill list other than EMSI, e.g. O*NET skill lists

Describe the solution you'd like

It would be great to have API support for a custom skill list. However, I understand that this could involve a lot of work.
Alternatively, an instruction on how to create the skill_db_relax_20.json and token_dist.json files for custom skill lists would also be much appreciated.

Describe alternatives you've considered
I have traced the code a little bit, and found that we would probably need

skill_db_relax_20.json, which seems to be generated with skills_processor/create_surf_db.py based on token_dist.json and skills_processed.json, and
token_dist.json, which seems to be generated with skills_processor/create_token_dist.py based on skill_db_relax_20.json

My questions are:

Could you provide more description/script on how skills_processed.json is generated? More specifically, what are the rules (or data sources) that determine the following fields: unique_token, match_on_stemmed?
Per my previous observation, the files required for generating skill_db_relax_20.json and token_dist.json seem to be circular: each requires the other to be generated. What should the correct order be?
- Correct me if I'm wrong, it looks like token_dist.json could be generated first, with n_grams in this line being a list of strings of lowered, lemmatized skill titles (only if skill title is more than 1 word; otherwise it's the lowered skill title without the parenthesis).

Additional context
Once the two questions are resolved, I would be happy to write a modularized script that generates skills_processed.json, skill_db_relax_20.json, and token_dist.json from any given skill list/table, and create a pull request for it.

Looking forward to hearing from you 😃
