1. WP5 Tasks & Deliverables
2. Overview of parallel technology tools
3. Parallel corpora requirements
4. Survey of language resources
5. Work plan for t4-t14
6. Questions
Input: parallel corpora produced in WP4
Output: language resources for MT in WP7/WP8
WP5.1 Sub-sentential alignment (DCU, ELDA, ILSP)
WP5.2 Bilingual dictionary extraction (DCU, ILSP)
• D5.1 (t06): Report describing the inventory of parallel technology
tools to be developed and integrated in PANACEA and the
characteristics of the resources to be produced.
• D5.2 (t14): Aligners integrated into the platform, and documentation
• D5.3 (t22): Parallel, sententially aligned texts, cleaned and prepared
for training/building translational models (20-50 million words)
• D5.4 (t30): Final version of the Bilingual Dictionary Extractor
• D5.5 (t30): Sample of bilingual dictionaries produced: EN-FR and
• D5.6 (t30): Final version of the integrated Transfer Rules module, and
• D5.7 (t30): Sample of transfer rules produced for EN-DE.
• Bilingual dictionary extraction (WP5.2)
Align bilingual corpus (existing or output from WP4)
– Sentence
– Word
– Chunk / Syntactic
• GIZA++, berkeleyaligner
• word packing (“compound rich” languages,
• Marker hypothesis: Marclator
• Syntactic: TreeAligner
– Integrate models: generative, syntactic,
– Extend range of language pairs
– Tune to text type, domain and genre
– Check/filter corpora acquired (comparability
– Baseline: phrase alignment in Moses
– Extrinsic evaluation (SMT in WP7)
Task: to derive bilingual dictionaries from aligned parallel corpus
Methodology
– Expectation-Maximisation algorithm
– Additional techniques on top of word correspondences → precision, fine-cleaning → reduce human intervention
– Go beyond word level: MW translations (NPs, MWEs)
– Baseline: word alignment in Moses
– Evaluation?
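The slides name only the Expectation-Maximisation algorithm and the Moses word-alignment baseline. As an illustration of how EM can yield word correspondences for a dictionary, here is a minimal sketch of IBM Model 1 (without a NULL word). The choice of Model 1, the function names `train_ibm1` and `extract_dictionary`, the 0.5 threshold and the toy sentence pairs are all assumptions made for this example; they are not taken from the project.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=20):
    """pairs: list of (e_tokens, f_tokens). Returns t[(f, e)] = P(f | e)."""
    f_vocab = {f for _, f_sent in pairs for f in f_sent}
    # Start from a uniform distribution over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e being linked at all
        # E-step: distribute each foreign word over the words of its sentence pair.
        for e_sent, f_sent in pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: re-estimate translation probabilities from expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def extract_dictionary(t, threshold=0.5):
    """Keep, for each e, its most probable f if P(f | e) reaches the threshold."""
    best = {}
    for (f, e), p in t.items():
        if p >= threshold and p > best.get(e, ("", 0.0))[1]:
            best[e] = (f, p)
    return {e: f for e, (f, _) in best.items()}

# Hypothetical toy corpus, for illustration only.
toy_pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
print(extract_dictionary(train_ibm1(toy_pairs)))
```

The threshold is a crude stand-in for the "additional techniques" mentioned above: raising it trades recall for precision, which is one way to reduce the manual cleaning a raw alignment table would need.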
• Find criteria for lexical transfer selection
• structural transfer (Probst, Sánchez-Martínez, et al.)
– (matching of POS-sequences, independent of lexical material)
• bilingual term extraction (Cabré 2001, Gamallo 2007)
– structural transfer
– lexical transfer
• simple lexical
• contextual lexical <- this is the task!
conditions for transfer selection
– with domain / subject area information („MEDICAL“)
– with locale / variant („EN_UK“ „DE_CH“)
– use information on local nodes (gender, number)
– use structural contexts (arguments, prepositions, subcategorisation frames & fillers) (main means of RMT)
– use conceptual environment for disambiguation
• using word sense disambiguation, statistical word alignment
• supervised learning of most important disambiguation
1. domain tag assignment
2. morphosyntactic tests
• local features on gender / number
• subcategorisation: Prepositions (for nouns and verbs)
• presence / absence of verb arguments (trans./intrans.)
• (relational Adj <-> compound specifier)
• source language concept clusters (SMT uses target
– Selection of disambiguation candidates (N, V, A)
– Creation of parallel corpora
– Creation of subcorpora for each translation
1. domain tags: do subcorpora differ in domain?
2. morphosyntactic:
• gender: do they differ in gender? in number?
• arguments: do they differ in transitivity? in subcategorised prepositions?
3. conceptual: Can different SL concept clusters be built to
• Verification with additional candidates or data
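The first test above (do the per-translation subcorpora differ in domain?) can be sketched as follows. This is an illustrative sketch only: the data layout (target tokens plus one domain tag per sentence pair), the function names, the 0.8 dominance share and the example word "bank" with its German translations are assumptions, not part of the project specification.

```python
from collections import Counter, defaultdict

def subcorpora_by_translation(examples, candidates):
    """Group domain tags by the candidate translation found on the target side.

    examples: iterable of (target_tokens, domain_tag) for sentence pairs whose
    source side contains the ambiguous word; candidates: its known translations.
    """
    groups = defaultdict(list)
    for target_tokens, domain in examples:
        found = [c for c in candidates if c in target_tokens]
        if len(found) == 1:          # skip pairs with no or several candidates
            groups[found[0]].append(domain)
    return groups

def domain_profile(groups):
    """Count domain tags inside each translation's subcorpus."""
    return {tr: Counter(domains) for tr, domains in groups.items()}

def differs_in_domain(profile, min_share=0.8):
    """True if every translation has a dominant domain and these are all distinct."""
    if len(profile) < 2:
        return False
    dominant = []
    for counts in profile.values():
        domain, n = counts.most_common(1)[0]
        if n / sum(counts.values()) < min_share:
            return False
        dominant.append(domain)
    return len(set(dominant)) == len(dominant)

# Hypothetical data for English "bank" translated into German.
examples = [
    (["die", "Bank", "gewährt", "Kredit"], "FINANCE"),
    (["die", "Bank", "erhöht", "Zinsen"], "FINANCE"),
    (["am", "Ufer", "des", "Flusses"], "GEOGRAPHY"),
    (["das", "Ufer", "ist", "steil"], "GEOGRAPHY"),
]
profile = domain_profile(subcorpora_by_translation(examples, {"Bank", "Ufer"}))
print(differs_in_domain(profile))
```

If the test succeeds, the domain tag is a usable condition for transfer selection for that candidate; if not, the morphosyntactic and conceptual tests would be tried in the same way on other features.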
– Sentence Segmentiser, Tokeniser, Dictionary Lookup
• Parser to extract annotated subtrees
• Tree matching component
• target-sensitive word sense disambiguation
– similar for the target side …) (if time permits)
Quality:
– truly parallel (not comparable) corpora aligned at sentence level
– translation quality of aligned sentence pairs is essential for MT output
Linguistic pre-processing:
– tokenized plain text (plain PB-SMT)
– POS tagging, lemmatization (factored PB-SMT, EBMT)
– constituency and dependency parsing (syntax motivated PB-SMT)
– for a baseline system: at least 1M sentence pairs (~20M words)
– for domain adaptation: 20K-200K sentence pairs (~400K-4M words)
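The size figures above all imply roughly 20 words per sentence pair (1M pairs ≈ 20M words, 20K pairs ≈ 400K words). A small sketch of that arithmetic follows; the constant and the function names are assumptions for illustration, and the slides do not say whether words are counted on one side of the corpus or on both.

```python
# Ratio implied by the figures above (assumption: it holds across corpora).
WORDS_PER_SENTENCE_PAIR = 20

# Minimum sizes taken from the requirements above, in sentence pairs.
MIN_SENTENCE_PAIRS = {
    "baseline": 1_000_000,
    "domain_adaptation": 20_000,
}

def estimated_words(sentence_pairs, words_per_pair=WORDS_PER_SENTENCE_PAIR):
    """Rough word count for a corpus of the given number of sentence pairs."""
    return sentence_pairs * words_per_pair

def meets_requirement(sentence_pairs, purpose):
    """Check a corpus size against the minimum for 'baseline' or 'domain_adaptation'."""
    return sentence_pairs >= MIN_SENTENCE_PAIRS[purpose]

print(estimated_words(200_000))                    # upper end of the adaptation range
print(meets_requirement(150_000, "baseline"))      # too small for a baseline system
```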
EuroParl * JRC Acquis * News Commentary United Nations English-French OpenSubtitles
- numbers in millions of words from English to the target language
- in corpora denoted by * all language pairs available
News (WMT) Gigaword ILSP EL corpus
- numbers in millions of words
- monolingual parts of the parallel corpora also available
• A number of standard monolingual and parallel corpora available
for all language pairs, of sufficient size & quality
• Parliamentary proceedings and debates can be considered
• Monolingual web-crawled corpora available for English, French,
German, Italian (WaCky) – unspecified domain
• No web-crawled parallel data available at all (Resnik's Strand is
only a list of URLs, but quite outdated) – no fallback strategy
• EuroParl for baseline systems
– parliamentary proceedings and debates
– quite general domain suitable for adaptation
• Evaluation data to be selected as a subset from
webcrawled in-domain data (including 500-2000 sentence pairs for test set and dev test set)
• Focus on translation from English to other languages
Official deadlines:
– t6 Report on parallel technology tools (D5.1)
– t14 Aligners integrated in the platform (D5.2)
Internal deadlines:
– t6 decision on MT language pairs and domains
– t12 resources to be included in the first evaluation produced (D4.3)
Assumption: general and in-domain monolingual and
Possible approaches:
– one system built from a mixture of the data
– two systems and a domain classifier (for sentences)
– two systems and system combination based on their n-
• Distribution of webservices across partners?
• Software requirements for webservices?
• Hardware specifications (no HW budget)?
• Example webservice wrapper?
• Rich text format support?
• Duplicate document/sentence detection?
• Distribution of webservices?
– TPC tools for one language on one site?
• MT tools integrated into the platform?
– alignment OK
– language modelling?
– phrase table extraction?
– Decoding?
– tuning?
• Only extrinsic automatic evaluation feasible
• Only extrinsic (MT) evaluation feasible
Médicos y juristas, servidores de la vida y de la libertad La vida de las sociedades contemporáneas esta atravesada por dos grandes corrientes políticas tradicionales: la corriente socialista y la corriente liberal. La corriente socialista pone de relieve la importancia de la sociedad con respecto a los individuos; recomienda la intervención del Estado para promover la igualdad
ESTATUTO Artículo I Costitución El Consorcio, nombrado Get Export, con actividad externa está constituido según los términos del artículo 2612 y siguientes del Código Civil. El Consorcio tiene domicilio legal in Teramo. La Junta de los Consorciados puede instituir o cerrar domicilios secundarios, filiales o sucursales sea en Italia como al extranjero. La duración del Consorcio