Slide

1. WP5 Tasks & Deliverables
2. Overview of parallel technology tools
3. Parallel corpora requirements
4. Survey of language resources
5. Work plan for t4-t14
6. Questions

Input: parallel corpora produced in WP4
Output: language resources for MT in WP7/WP8

WP5.1 Sub-sentential alignment (DCU, ELDA, ILSP)
WP5.2 Bilingual dictionary extraction (DCU, ILSP)

D5.1 (t06): Report describing the inventory of parallel technology tools to be developed and integrated in PANACEA and the characteristics of the resources to be produced.
D5.2 (t14): Aligners integrated into the platform, and documentation
D5.3 (t22): Parallel, sententially aligned texts, cleaned and prepared for training/building translational models (20-50 million words)
D5.4 (t30): Final version of the Bilingual Dictionary Extractor
D5.5 (t30): Sample of bilingual dictionaries produced: EN-FR and
D5.6 (t30): Final version of the integrated Transfer Rules module, and
D5.7 (t30): Sample of transfer rules produced for EN-DE.

• Bilingual dictionary extraction (WP5.2)

Align bilingual corpus (existing or output from WP4)
– Sentence
– Word
– Chunk / Syntactic
• GIZA++, berkeleyaligner
• word packing (“compound rich” languages,
• Marker hypothesis: Marclator
• Syntactic: TreeAligner
– Integrate models: generative, syntactic,
– Extend range of language pairs
– Tune to text type, domain and genre
– Check/filter corpora acquired (comparability
– Baseline: phrase alignment in Moses
– Extrinsic evaluation (SMT in WP7)

Task: to derive bilingual dictionaries from aligned parallel corpus
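
As a rough illustration of this task, the sketch below trains IBM Model 1 with Expectation-Maximisation on a toy sentence-aligned corpus and keeps the most probable translation per source word. It is a minimal sketch, not the PANACEA extractor: the function names `train_ibm1` and `extract_dictionary`, the toy corpus, the 0.5 threshold and the omission of a NULL source word are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=20):
    """Estimate t[(f, e)] = P(target word f | source word e) with EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Simplification (assumption): no NULL source word is modelled.
    """
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise the expected counts
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def extract_dictionary(t, threshold=0.5):
    """Keep the best translation per source word if its probability
    reaches the (illustrative) threshold."""
    best = {}
    for (f, e), p in t.items():
        if p >= threshold and p > best.get(e, ("", 0.0))[1]:
            best[e] = (f, p)
    return best


corpus = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]
dictionary = extract_dictionary(train_ibm1(corpus))
for source_word, (target_word, prob) in sorted(dictionary.items()):
    print(f"{source_word} -> {target_word} ({prob:.2f})")
```

On real data the raw word correspondences are noisy, which is why the methodology adds further filtering on top of them.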
Methodology
– Expectation-Maximisation algorithm
– Additional techniques on top of word correspondences → precision, fine-cleaning → reduce human intervention
– Go beyond word level: MW translations (NPs, MWEs)
– Baseline: word alignment in Moses
– Evaluation?

• Find criteria for lexical transfer selection
• structural transfer (Probst, Sánchez-Martínez, et al.)
– (matching of POS-sequences
– independent of lexical material)
• bilingual term extraction (Cabré 2001, Gamallo 2007)

– structural transfer
– lexical transfer
• simple lexical
• contextual lexical <- this is the task!

conditions for transfer selection
– with domain / subject area information („MEDICAL“)
– with locale / variant („EN_UK“ „DE_CH“)
– use information on local nodes (gender, number)
– use structural contexts (arguments, prepositions, subcategorisation frames & fillers) (main means of RMT)
– use conceptual environment for disambiguation
• using word sense disambiguation, statistical word alignment
• supervised learning of most important disambiguation

1. domain tag assignment
2. morphosyntactic tests
• local features on gender / number
• subcategorisation: Prepositions (for nouns and verbs)
• presence / absence of verb arguments (trans./intrans.)
• (relational Adj <-> compound specifier)
• source language concept clusters (SMT uses target

– Selection of disambiguation candidates (N, V, A)
– Creation of parallel corpora
– Creation of subcorpora for each translation
1. domain tags: do subcorpora differ in domain?
2. morphosyntactic:
• gender: do they differ in gender? in number?
• arguments: do they differ in transitivity? in subcategorised prepositions?
3. conceptual: Can different SL concept clusters be built to
• Verification with additional candidates or data

– Sentence Segmentiser, Tokeniser, Dictionary Lookup
• Parser to extract annotated subtrees
• Tree matching component
• target-sensitive word sense disambiguation
– similar for the target side …) (if time permits)

Quality:
– truly parallel (not comparable) corpora, aligned at sentence level
– translation quality of aligned sentence pairs is essential for MT output
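
One crude proxy for spotting misaligned sentence pairs is a length-ratio check, sketched below. This is only a heuristic under stated assumptions: the function name `keep_pair` and the thresholds `max_ratio=3.0` and `max_len=100` are illustrative choices, not project values, and the check says nothing about actual translation quality.

```python
def keep_pair(src, tgt, max_ratio=3.0, max_len=100):
    """Return True if a sentence pair passes simple length checks.

    src, tgt: whitespace-tokenised sentences as plain strings.
    max_ratio, max_len: illustrative thresholds (assumptions).
    """
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return False  # empty side: cannot be a valid pair
    if s > max_len or t > max_len:
        return False  # overly long sentences are often segmentation errors
    return max(s, t) / min(s, t) <= max_ratio


pairs = [
    ("the house is small", "das Haus ist klein"),
    ("yes", "this is a very long unrelated sentence indeed"),
    ("", "x"),
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept), "of", len(pairs), "pairs kept")
```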
Linguistic pre-processing:
– tokenized plain text (plain PB-SMT)
– POS tagging, lemmatization (factored PB-SMT, EBMT)
– constituency and dependency parsing (syntax motivated PB-SMT)

– for a baseline system: at least 1M sentence pairs (~20M words)
– for domain adaptation: 20K-200K sentence pairs (~400K-4M words)

EuroParl *
JRC Acquis *
News Commentary
United Nations
English-French
OpenSubtitles
- numbers in millions of words from English to the target language
- in corpora denoted by * all language pairs available

News (WMT)
Gigaword
ILSP EL corpus
- numbers in millions of words
- monolingual parts of the parallel corpora also available

• A number of standard monolingual and parallel corpora available for all language pairs of sufficient size & quality
• Parliamentary proceedings and debates can be considered
• Monolingual web-crawled corpora available for English, French, German, Italian (WaCky) – unspecified domain
• No web-crawled parallel data available at all (Resnik's Strand is only a list of URLs, but quite outdated) – no fallback strategy

• EuroParl for baseline systems
– parliamentary proceedings and debates
– quite general domain suitable for adaptation
• Evaluation data to be selected as a subset from web-crawled in-domain data (including 500-2000 sentence pairs for test set and dev test set)
• Focus on translation from English to other languages

Official deadlines:
– t6 Report on parallel technology tools (D5.1)
– t14 Aligners integrated in the platform (D5.2)
Internal deadlines:
– t6 decision on MT language pairs and domains
– t12 resources to be included in the first evaluation produced (D4.3)

Assumption: general and in-domain monolingual and
Possible approaches:
– one system built from a mixture of the data
– two systems and a domain classifier (for sentences)
– two systems and system combination based on their n-

• Distribution of webservices across partners?
• Software requirements for webservices?
• Hardware specifications (no HW budget)?
• Example webservice wrapper?

• Rich text format support?
• Duplicate document/sentence detection?

• Distribution of webservices?
– TPC tools for one language on one site?
• MT tools integrated into the platform?
– alignment OK
– language modelling?
– phrase table extraction?
– Decoding?
– tuning?

• Only extrinsic automatic evaluation feasible
• Only extrinsic (MT) evaluation feasible
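
On the open question of duplicate sentence detection raised above, one simple option is exact-match detection after light normalisation, sketched below. The helper names `normalise` and `deduplicate` and the normalisation steps (lowercasing, whitespace collapsing) are assumptions for illustration. Near-duplicates would need a fuzzier method.

```python
import hashlib
import re


def normalise(sentence):
    """Lowercase and collapse whitespace (illustrative normalisation)."""
    return re.sub(r"\s+", " ", sentence.strip().lower())


def deduplicate(pairs):
    """Drop sentence pairs whose normalised form was already seen.

    pairs: iterable of (source_sentence, target_sentence) strings.
    The first occurrence of each pair is kept, in input order.
    """
    seen = set()
    unique = []
    for src, tgt in pairs:
        key_text = normalise(src) + "\t" + normalise(tgt)
        key = hashlib.sha1(key_text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique


sample = [
    ("The house", "Das Haus"),
    ("the  house ", "das haus"),  # duplicate after normalisation
    ("A book", "Ein Buch"),
]
unique_pairs = deduplicate(sample)
print(len(unique_pairs), "unique pairs")
```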

Source: http://panacea-lr.eu/images/PANACEA-Athens_2010-WP5.pdf
