chozelinek / europarl
Toolkit to compile a comparable/parallel corpus from European Parliament proceedings
License: MIT License
Produce sentence alignment with HunAlign and InterText.
Sometimes, the annotation of the source language is inconsistent in the HTML files.
Provided that several languages have been processed, compare the sl values across the language versions. If the versions agree, it is OK; if not:
Many foreign expressions (e.g. words in other languages or Latinisms, sometimes used to stress irony) are set in italics in the source HTML files. It might be useful to keep this information through the pipeline.
After running proceedings_xml.py, some values of the role attribute of intervention contain not only the relevant information but also stray punctuation and other residue. Examples:
<intervention speaker_id="photo_generic" name="Algirdas Šemeta" is_mep="True" mode="spoken" role="Member of the Commission.">
<intervention speaker_id="photo_generic" name="László Kovács" is_mep="True" mode="spoken" role="Member of the Commission. −">
<intervention speaker_id="photo_generic" name="Vladimír Špidla" is_mep="True" mode="spoken" role="Member of the Commission. – (CS)">
Solution?
Modify proceedings_xml.py before line 341 (self.intervention_to_xml(x_section, s_intervention)) to clean s_intervention['role'].
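A minimal sketch of what such a cleaning step might look like, based only on the three examples above. The helper name clean_role is hypothetical and not part of proceedings_xml.py. The sketch assumes the noise is always trailing and consists of whitespace, dash-like characters, and two-letter language codes in parentheses. Real data may contain other patterns.

```python
import re

# Dash-like characters seen in the examples: hyphen, Unicode dashes, minus sign.
_DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2015\u2212"

# Trailing run of whitespace, dashes, and parenthesised two-letter language
# codes such as "(CS)". This pattern is an assumption drawn from the examples.
_TRAILING_NOISE = re.compile(r"(?:\s|[" + _DASHES + r"]|\([A-Z]{2}\))+$")


def clean_role(role):
    """Return role without trailing dashes, language codes, or whitespace."""
    return _TRAILING_NOISE.sub("", role).strip()
```

It could be applied just before the call mentioned above, for example as s_intervention['role'] = clean_role(s_intervention['role']). The final full stop is kept on purpose; whether it should also be dropped is a separate decision.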
Produce a schema or DTD to validate the XML files and use it as documentation.
Format Python scripts properly and write docstrings.
There are a number of codes (mainly references to documents) which degrade automatic linguistic preprocessing (e.g. tokenization, lemmatization, PoS tagging, sentence splitting).
One idea is to annotate them so that the automatic linguistic preprocessing can handle them properly.
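The annotation idea could be sketched roughly as follows. Everything here is an assumption for illustration: the function name annotate_codes, the element name code, and the pattern, which only targets session-document style references such as "A6-0123/2008". Other kinds of codes in the proceedings would need their own patterns.

```python
import re

# Hypothetical pattern: one or two capital letters, a digit, a hyphen,
# four digits, a slash, and a four-digit year (e.g. "A6-0123/2008").
DOC_CODE = re.compile(r"\b[A-Z]{1,2}\d-\d{4}/\d{4}\b")


def annotate_codes(text, tag="code"):
    """Wrap each matched code in <tag>...</tag> so later tools can treat it
    as a single unit or skip it. The tag name is an assumption."""
    def wrap(match):
        return "<{0}>{1}</{0}>".format(tag, match.group(0))
    return DOC_CODE.sub(wrap, text)
```

A tokenizer or tagger downstream could then be configured to treat the annotated spans as atomic tokens.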
Use the README.md for documentation of the scripts, but create a tutorial illustrating the pipeline (a commented/explained version of the compile.sh script).
Add to <p> an attribute called trans to make explicit if the paragraph is:
Or