finstem's Introduction

finstem - simple tool for command-line Finnish stemming

Stems Finnish words. Takes any kind of word you can throw at it. Even has its own tiny REPL!

📹 Video - silent install, 2023.12.07

output-fast.webm

The video above is sped up 10x to give a feel for how the commands work.

Normal-speed video: https://youtu.be/85qwsrGdwZs

Quickstart

On Ubuntu 22.04

Tested on a totally fresh Vagrant install of Ubuntu 22.04. You probably already have some or all of these installed.

# Install the prerequisites
sudo apt update
sudo apt install -y pip python-is-python3
sudo apt install -y voikko-fi python-libvoikko python3-click

# Clone the repo and run the command!
git clone https://github.com/hiAndrewQuinn/finstem
cd finstem

python finstem.py --help
python finstem.py 'Näin' 'tervetuloa' 'kiltti' 'kissa' 'Nimeni' 'on' 'Jeff'

For scripters

finstem has experimental support for CSV, TSV and JSON output formats.

CSV format example

python finstem.py 'Näin' 'tervetuloa' 'kiltti' 'kissa' '.' 'Nimeni' 'on' 'Jeff' --format CSV | csvlook

TSV format example

python finstem.py 'Näin' 'tervetuloa' 'kiltti' 'kissa' '.' 'Nimeni' 'on' 'Jeff' --format TSV | awk '{print $3 " <~> " $2 " <~> " $1}'

JSON format example

python finstem.py 'hyvää' 'huomenta' --format JSON | \
while IFS= read -r line; do
    echo "$line" | jq .
done

Use these formats with caution. The output is not generated with proper CSV, TSV or JSON libraries yet.
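
If you would rather consume the JSON output from Python than from jq, here is a minimal sketch. It assumes, as the shell loop above suggests, that finstem prints one JSON document per line. The helper name parse_json_lines is made up for this example and is not part of finstem, and the sketch makes no assumption about which fields each record contains.

```python
import json


def parse_json_lines(text):
    """Parse text holding one JSON document per line into a list of objects.

    Blank lines are skipped. A malformed line raises json.JSONDecodeError,
    which is useful here because the JSON output is still experimental.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
    return records
```

To use it, save the finstem output to a file (or read it from sys.stdin) and pass the text to parse_json_lines, then inspect the records to see what fields your version of finstem emits.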

Advanced

Passing a list of words in a text file

echo 'sana' > words.txt
echo 'vaimonille' >> words.txt
echo 'kirjoja' >> words.txt

# Pass each line as an argument to finstem.py
cat words.txt | xargs -n 1 python finstem.py
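
The xargs call above starts one Python process per word. Since finstem.py accepts several words in one invocation (see the Quickstart), a Python wrapper can pass the whole file at once. This is only a sketch: the function names read_words, build_command and run_finstem are invented for illustration, and it assumes finstem.py sits in the current directory.

```python
import subprocess
import sys
from pathlib import Path


def read_words(path):
    """Return every whitespace-separated word in the file, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [word for line in text.splitlines() for word in line.split()]


def build_command(words, script="finstem.py"):
    """Build the argument list for a single finstem invocation."""
    return [sys.executable, script, *words]


def run_finstem(path, script="finstem.py"):
    """Run finstem once on all words in the file and return its stdout."""
    words = read_words(path)
    if not words:
        return ""
    result = subprocess.run(
        build_command(words, script),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if finstem exits non-zero
    )
    return result.stdout
```

Calling print(run_finstem("words.txt")) from the finstem directory should give the same stems as the xargs version, in one process.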

Interactive mode

Requires fzf.

echo '' | fzf --print-query \
	--preview-window='bottom:50%' \
	--preview "echo {q} | tr ' ' '\n' | xargs -I _ python finstem.py _" \
	--bind "enter:execute(echo {q} | tr ' ' '\n' | xargs -I _ python finstem.py _)+abort"

If you don't feel like typing out all that, just run finstem-interactive.sh.

For use with finfreq10k when reading a book

finfreq10k is an Anki deck containing the 10,000 most common Finnish words in order, made by yours truly. Using it in combination with finstem creates a powerful way to target your vocabulary practice to the words you have actually read that day.

finstem's Issues

Add --thick mode

Trying to script around this when it doesn't print the original word on any line is annoying, even if it makes a good first impression visually. Let's add a --thick flag, and turn it on by default for non-pretty printing, so that we don't frustrate developers trying to work around us.

Add subcommand: annotate

finstem annotate file.txt is an idea I had for a mode where, instead of taking words at the command line, we feed in a file and run finstem on each individual word. Where the lemmatizer gives two or more entries, we run fzf with a preview window so the user can pick the one that seems most obvious to them.
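
A rough sketch of how that loop could look, purely as an illustration of the idea. Nothing here exists in finstem today: annotate, lemmatize and choose are hypothetical names. The lemmatizer and the picker are passed in as plain functions so that the picker could later shell out to fzf without changing the loop.

```python
def annotate(words, lemmatize, choose):
    """Pair each word with one base form.

    lemmatize: hypothetical callable, word -> list of candidate base forms.
    choose:    hypothetical callable, (word, candidates) -> the chosen candidate;
               only called when there are two or more candidates.
    Words with no candidates are paired with None.
    """
    annotated = []
    for word in words:
        candidates = lemmatize(word)
        if not candidates:
            annotated.append((word, None))
        elif len(candidates) == 1:
            annotated.append((word, candidates[0]))
        else:
            annotated.append((word, choose(word, candidates)))
    return annotated
```

Keeping the user prompt behind choose also makes the mode easy to test with a fake picker.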

Strip punctuation from words

People can copy and paste sentences into this program, but only if no punctuation gets in the way of the word parsing. I can probably fix this easily.
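
One possible approach, sketched here as an assumption rather than the planned fix: trim punctuation from both ends of each whitespace-separated token before stemming, and leave characters inside a word untouched. The names strip_punctuation and clean_words are invented for this example.

```python
import unicodedata


def _is_punctuation(char):
    # Every Unicode punctuation category starts with "P" (Po, Pd, Pi, Pf, ...).
    return unicodedata.category(char).startswith("P")


def strip_punctuation(word):
    """Remove leading and trailing punctuation; keep anything inside the word."""
    start, end = 0, len(word)
    while start < end and _is_punctuation(word[start]):
        start += 1
    while end > start and _is_punctuation(word[end - 1]):
        end -= 1
    return word[start:end]


def clean_words(text):
    """Split pasted text on whitespace and drop tokens that were only punctuation."""
    stripped = (strip_punctuation(token) for token in text.split())
    return [word for word in stripped if word]
```

Note that this drops a standalone '.' argument entirely, so it would need an opt-out if a lone period is meant to be passed through on purpose.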
