MultiMeteo also knows Basque
2001/11/01 Díaz de Ilarraza, Arantza | Sarasola, Kepa | Mayor, Aingeru | Loinaz, Miel | Chevreau, Karine | Coch, José. Source: Elhuyar aldizkaria
A human translator's work will undoubtedly remain better and richer, but today documents in a specific, technical field such as meteorology can be produced with automatic techniques. In this article we present MultiMeteo, an interactive system for multilingual text generation in the field of meteorology, together with the adaptation we have made for generation in Basque. The system offers daily weather forecasts at the following web address: http://www.ingurumena.net/udala //www.inm.es/wwi/Multimeteo/Multimeteo.html
Background
Although it does not use automatic text generation, one system that automatically translates weather forecasts must be mentioned here. The METEO system, created by the TAUM group in Montreal, has been the most successful machine translation system of all time. It was hard to find translators for such tedious translations, which looked much the same every day, so Canada's official weather service began investigating automatic approaches. The resulting METEO system has been translating weather bulletins from English into French since 1977, and 80% of its translations are fully correct. However, the success achieved in meteorology has not carried over to other fields: although the system has been adapted to other domains, no results of equal quality have been obtained. The field of weather forecasts appears to be particularly well suited to this type of automatic processing.
The Forecast Generator (FoG) work environment was also launched in Canada in 1993. In this system, the meteorologist uses a graphical editor to adapt the map showing the weather data and subsequently the system automatically generates the weather forecast in English and French for the region.
History of the MultiMeteo system
In 1995 the French Meteorological Service (Meteo France) promoted the MultiMeteo project to publish weather forecasts in several languages. It contacted the National Meteorological Institute (INM) of Spain, the Royal Meteorological Institute (RMI) of Belgium, the Zentralanstalt für Meteorologie und Geodynamik (ZAMG) of Austria and two companies specialized in language generation: Lexiquest, based in Paris, and CL Language Services in Madrid. The German Meteorological Service (DWD) also joined initially, but subsequently withdrew.
These partners submitted a project called “Multilingual Production of Weather Forecasts” and obtained European Community funding. The system was developed in four languages: French, English, Spanish and German. The results of the evaluation carried out in February 1999 were very positive.
In 2000 INM and Lexiquest reached an agreement to extend the system to four more languages: Dutch, Catalan, Galician and Basque. The Ixa Group of the Faculty of Computer Science of San Sebastian and the UZEI Terminology Center have been in charge of the extension to Basque, and we are now about to finish the development phase of the project.
Usual procedure for creating weather predictions
Meteorological data are collected from two sources: the surface and space. Surface data are taken at meteorological observatories, where the physical variables describing the state of the atmosphere are measured and recorded continuously. The other data come from space, from meteorological satellites: the geostationary METEOSAT satellites and the polar satellites of the TIROS-NOAA series, which send information without interruption.
All the numerical data obtained are processed by complex mathematical models. Automatic processes simulate the evolution of the physical variables over the coming days, generating data matrices for the weather forecasts. The meteorologist then has the opportunity to retouch these data matrices, that is, to complete and round off the forecast based on experience. As a result, as shown in Table 1, the matrices contain data on temperature (Te), wind direction (DD) and force (FF), clouds, rain, etc. for different hours (periods of 3 hours in the case of the INM system). One matrix of this type is obtained for each point of the map.
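To make the shape of these matrices concrete, here is a minimal Python sketch, assuming one matrix per map point with one row per 3-hour period. The field names te, dd and ff mirror the Te, DD and FF columns mentioned above. The class and function names, the units and all numeric values are invented for illustration and are not taken from Table 1 or from the INM system.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PeriodForecast:
    start_hour: int  # start of the 3-hour period (e.g. 6, 9, 12)
    te: float        # temperature (Te); degrees Celsius is an assumption
    dd: str          # wind direction (DD)
    ff: int          # wind force (FF); the scale is an assumption


def build_matrix(rows):
    """Build the matrix for one map point, ordered by period."""
    matrix = sorted((PeriodForecast(*row) for row in rows),
                    key=lambda p: p.start_hour)
    # Consecutive rows must describe consecutive 3-hour periods.
    for earlier, later in zip(matrix, matrix[1:]):
        if later.start_hour - earlier.start_hour != 3:
            raise ValueError("periods must be 3 hours apart")
    return matrix


def retouch(matrix, start_hour, **changes):
    """Return a copy in which the meteorologist has adjusted one period."""
    return [replace(p, **changes) if p.start_hour == start_hour else p
            for p in matrix]


# Hypothetical values for a single map point.
point_matrix = build_matrix([(9, 15.5, "NW", 3),
                             (6, 12.0, "N", 2),
                             (12, 18.0, "NW", 3)])
retouched = retouch(point_matrix, 12, ff=4)
```

The retouch function returns a new matrix instead of changing the model output in place, so the original simulation results stay available for comparison.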
With these data, meteorologists write the weather forecasts by hand. This work is long and expensive, especially when several versions of a single forecast have to be produced in different languages or styles (general forecasts, beaches, sea, mountain, by community, by province...).
This is where MultiMeteo comes in. The aim is not to replace the work of meteorologists but to support their tasks interactively, so that forecasts can be distributed in different languages and styles. In addition, it makes it possible to produce forecasts for different places on the map.
A support tool: interactive multilingual creation
In this technique, automatic generation first produces a draft from input data that may be incomplete. Although the system can generate text in several languages, the draft is shown to the meteorologist, who acts as a corrector, only in his or her native language. To correct a fragment of the text, the meteorologist clicks on the part to be modified. A pop-up menu then offers a number of options and alternative modifiers, and choosing one of them makes the correction easy. Taking the changes into account, the system then generates the forecast texts in all languages.
The advantages of this technique are its speed (producing each text in each language takes about 2 seconds, whereas a human translator needs about 10 minutes); the ability to generate text even when some data have not yet been collected; the high quality of the texts produced (sometimes with human touches); the ease of maintenance and adaptation; and finally, acceptance by human users (meteorologists are not asked to write in foreign languages).
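The correction loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the draft is modelled as a language-independent list of concepts, the meteorologist's menu as a table of alternatives, and the final step as regeneration in every language. The names LEXICON, ALTERNATIVES, menu_for, apply_correction and generate_all, and the small word lists, are hypothetical and do not come from the MultiMeteo system.

```python
# Hypothetical concept-to-term table for three languages.
LEXICON = {
    "Overcast": {"en": "overcast", "es": "cubierto", "fr": "couvert"},
    "VeryCloudy": {"en": "very cloudy", "es": "muy nuboso", "fr": "très nuageux"},
}

# Hypothetical alternatives offered in the pop-up menu for each concept.
ALTERNATIVES = {
    "Overcast": ["VeryCloudy"],
    "VeryCloudy": ["Overcast"],
}


def menu_for(plan, index):
    """Options offered when the meteorologist clicks on fragment `index`."""
    return ALTERNATIVES.get(plan[index], [])


def apply_correction(plan, index, choice):
    """Return a new plan with one fragment replaced by a menu option."""
    if choice not in menu_for(plan, index):
        raise ValueError(f"{choice!r} is not offered for this fragment")
    corrected = list(plan)
    corrected[index] = choice
    return corrected


def generate_all(plan, languages=("en", "es", "fr")):
    """Regenerate the text in every language from the same plan."""
    return {lang: ", ".join(LEXICON[concept][lang] for concept in plan)
            for lang in languages}


draft_plan = ["Overcast"]
corrected_plan = apply_correction(draft_plan, 0, "VeryCloudy")
texts = generate_all(corrected_plan)
```

Because the correction is applied to the shared plan and not to one language's text, a single click by the meteorologist changes the output in all languages at once.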
Automatic bulletin creation
MultiMeteo generates text in two ways:
- For the title of each paragraph, a fixed text with the name of the province is used, and for the header of the bulletins (see figure 1) a template with several internal variables is used, for example:
Weather forecast *IS *CO. *MO *FD.
Local time: *FP.
Forecast validity: *TT.
where:
- IS can be "by provinces", "by islands" or empty.
- CO is the name of the community (for example, "for the Autonomous Community of Galicia").
- MO is the month ("June").
- FD is the date, expressed in figures.
- FP is the time.
- TT is the prediction period (e.g., “today from 06:00 to 12:00 midnight”).
- A much more complex method is used to write the body of the paragraphs. The following sections explain the architecture and modules needed for automatic generation at this level.
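The header template of the first method can be illustrated with a short Python sketch. It assumes that each variable is written as an asterisk followed by two capital letters, as in the example template above. The function name fill_template and the date value "14" are invented for the illustration.

```python
import re

HEADER_TEMPLATE = "Weather forecast *IS *CO. *MO *FD."


def fill_template(template, values):
    """Replace every *XX variable with its value from `values`."""
    def substitute(match):
        # A missing variable raises KeyError rather than being left blank.
        return values[match.group(1)]

    filled = re.sub(r"\*([A-Z]{2})", substitute, template)
    # An empty variable (IS may be empty) leaves a double space; collapse it.
    return re.sub(r" {2,}", " ", filled).strip()


header = fill_template(HEADER_TEMPLATE, {
    "IS": "by provinces",
    "CO": "for the Autonomous Community of Galicia",
    "MO": "June",
    "FD": "14",  # hypothetical date
})

header_without_is = fill_template(HEADER_TEMPLATE, {
    "IS": "",
    "CO": "for the Autonomous Community of Galicia",
    "MO": "June",
    "FD": "14",
})
```

Template filling of this kind is enough for fixed headers, but it cannot choose between phrasings or handle agreement, which is why the body of each paragraph needs the fuller generation architecture.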
General system architecture
The generation engine used by the system was developed in 1994, for French, for the automatic generation of commercial letters. In 1995 it was extended to English when it was integrated into a prototype for translating technical manuals. In the same year it was also integrated into the project “Multilingual Production of Weather Forecasts”, in order to incorporate new languages and new functionalities for producing weather bulletins (interactive generation and management of stylistic knowledge).
The system architecture can be seen in figure 2. The first phase consists of obtaining the meteorological data and reformatting them into a database that the generation modules can use. After that, the task of the generation module is divided into two parts: planning and execution.
Planning module
Planning uses knowledge bases (KB) of concepts and styles and is divided into two phases:
- General planning: the bulletin is organized into several paragraphs (header, one paragraph for each province, etc.)
- Weather planning: the content of each paragraph is determined from the input data. The events (Event) that must appear in the paragraph and the relations between them are collected in a list using an interlingua, so that the description is independent of any language. The subsequent modules are run once for each language.
An event is a conceptual object associated with a meteorological situation or with the evolution of a situation. Events are of two types: atomic and molecular.
An atomic event represents a meteorological parameter without evolution, with a single associated value (Value attribute). For example, the atomic event representing an overcast sky is:
Event_CloudCovering4: Event{
Value= Class CloudCovering_code4;
Time_Representation= Time<unk>Mod{};}
Class CloudCovering_code4 is a set of simple concepts: Overcast, NoSun and VeryCloudy-Overcast. Each of these concepts is associated with a term in each language.
A molecular event involves more than one parameter. For example, when we talk about wind we may have data on its strength, direction and evolution. Molecular events can carry several values (Value0, Value1, etc. attributes), as well as an operator (Operator attribute) that specifies how these values are combined. For example, the molecular event describing a clear sky becoming overcast is:
Cloudier_Min0: Event_mol{
Value0= Event_CloudCovering0;
Value1= Event_CloudCovering4;
Operator= Class <unk> Cloudier_Min0;
Time_Representation= Time<unk>Mod{};}
This molecular event is made up of two atomic events and an operator. The Time_Representation attribute situates the events in time (present, past or future) and indicates the period (day, morning, afternoon, night...).
At the output of the planning module, one concept has been selected for each atomic event and for each class in the Operator attribute of the molecular events. In addition, other attributes can be added (automatically or in interaction with the meteorologist): probability index, phase, period...
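The event structures and the concept selection described in this section can be approximated with the following Python sketch. It is not the system's own formalism: the dataclasses only mirror the Value, Value0/Value1 and Operator attributes shown above. The concept "Clear", the operator concept "Cloudier" and the policy of picking the first concept in each class are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AtomicEvent:
    name: str
    value_class: Tuple[str, ...]  # simple concepts of the Value class


@dataclass(frozen=True)
class MolecularEvent:
    name: str
    values: Tuple[AtomicEvent, ...]  # Value0, Value1, ...
    operator_class: Tuple[str, ...]  # simple concepts of the Operator class


covering4 = AtomicEvent(
    "Event_CloudCovering4",
    ("Overcast", "NoSun", "VeryCloudy-Overcast"),  # listed in the article
)
covering0 = AtomicEvent("Event_CloudCovering0", ("Clear",))  # assumed concept
cloudier = MolecularEvent(
    "Cloudier_Min0", (covering0, covering4), ("Cloudier",)  # assumed operator
)


def select_concepts(event, choose=lambda options: options[0]):
    """Pick one concept per atomic event and one per operator class."""
    if isinstance(event, AtomicEvent):
        return [choose(event.value_class)]
    selected = []
    for value in event.values:
        selected.extend(select_concepts(value, choose))
    selected.append(choose(event.operator_class))
    return selected


plan = select_concepts(cloudier)
```

The choose parameter stands in for whatever stylistic knowledge or meteorologist interaction decides between the concepts of a class; here it simply takes the first one.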
Execution module
[Table: a simple concept and its semantic representation (Rsem); the surviving example is Usem = Estali1Sem.]
The module that realizes the selected concepts linguistically in each language is based on Meaning-Text Theory (Mel’cuk 1988, Polguère 1988). This phase uses a linguistic knowledge base and is divided into five stages: predenotation, semantics, deep syntax, surface syntax and morphology.
- Predenotation. At this stage, a term in the target language is selected for each simple concept coming from planning. For example, for the simple concept Overcast of the aforementioned class CloudCovering_code4, one of the alternative terms meaning “sky, covered” will be selected. These terms are broken down into semantic units (Usem), from which the semantic representation (Rsem) is built (see the example above).
- Semantics. From the semantic representation Rsem, the deep-syntax graph, made up of nodes and relations, is built; to do this, the lexical unit corresponding to each semantic unit is selected.
- Deep syntax. A graph is built whose nodes contain all the words of the phrase to be generated.
- Surface syntax. The nodes are ordered to determine the position each word will occupy in the phrase.
- Morphology. The word form that matches the morphosyntactic information of each node is taken from the dictionary. All inflected forms are stored in the dictionary, so that no morphological generation is needed.
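The morphology stage, as described in the last item, amounts to a lookup in a full-form dictionary. The Python sketch below shows the idea. Only the absolutive singular form "euria" appears in this article; the other two forms, the key layout (lemma, case, number) and the function name inflect are assumptions added for illustration.

```python
# Full-form dictionary: every inflected form is stored explicitly.
FULL_FORMS = {
    ("euri", "ABS", "SG"): "euria",       # form quoted in the article
    ("euri", "SOZ", "SG"): "euriarekin",  # assumed sociative form
    ("zeru", "ABS", "SG"): "zerua",       # assumed absolutive form
}


def inflect(lemma, case, number="SG"):
    """Return the stored word form; nothing is generated by rule."""
    try:
        return FULL_FORMS[(lemma, case, number)]
    except KeyError:
        raise LookupError(
            f"no stored form for {lemma!r} with case {case} and number {number}"
        ) from None


rain_absolutive = inflect("euri", "ABS")
```

The trade-off of this design is that a form missing from the dictionary is an error, not something the system can derive, so every case needed by a bulletin style has to be entered in advance.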
Adaptation to Basque
The computational work of extending the MultiMeteo system to Basque has been carried out by the IXA group, and the terminological work by UZEI. The adaptations to Galician and Catalan were made from the Spanish version and mainly involved work on the lexicon, since no major changes in syntax or morphology were required. For Basque, although we started from Spanish (and sometimes French), most of the sentence structures have been modified and we have had to work especially on the declension case markers.
We organized our work in three phases:
- collection and analysis of a corpus of weather texts in Basque,
- study of the MultiMeteo system and its architecture, and
- adaptation of the system.
The adaptation was carried out in three subphases: first we dealt with the atomic events (for example, “sky, covered”), then with the molecular events that were easy (for example, “wind, weak, from the north”), and finally with the molecular events that presented special difficulties (for example, “sky, initially covered, with rain, later very temporarily covered”).
Each adaptation subphase comprised a preliminary linguistic analysis, the analysis and design of the information to be included in the knowledge base, the entry and testing of the information for one representative example of each event and, finally, the entry and testing of all the possibilities for each type of event.
The main characteristics of this adaptation are:
- Given that the forecasts generated by the system had to follow the telegraphic style of the INM, we decided to omit the verbs. Also, the modifiers of the noun that is the head of the phrase are separated by commas, as attributive phrases. For example, instead of “weak north wind”, the system will generate “wind, weak, from the north”.
- The meteorological evolutions that French and Spanish express with a gerund are expressed differently in Basque. For example, "Clear sky becoming cloudy" will be generated in Basque as follows: “The sky, at first clear, then cloudy.”
- In the dictionary we have entered all the forms of the words (sometimes multi-word units) that can be used in the bulletins. Two cases are used in the bulletins: absolutive and sociative. The lemma of the word can also be used.
If the system were later extended to other styles, more declension cases would be needed, and these would have to be entered in the dictionary. Let us look, for example, at the dictionary entry for the word euri (rain):
BA_Euri1: LexemeNomBA{
CatMorph = NOM; SsCatMorph = COMMUN;
UMorph= [ morpho{Cas= ABS; Name= SINGULIER; UMG= "euria"},
morpho{...}, ... ];}
- The head of the phrase takes the absolutive case by default, and the case of the head's modifiers is determined in the definition of the concept or term. For example, the concept that generates "The sky, covered, with rain" must specify that the term for covered takes the absolutive singular and the term for rain the sociative singular. The term zeru (sky) appears in the absolutive singular because it is the head of the phrase.
- In Basque, the case marker of a phrase attaches to the last word of the phrase, and the system did not offer an elegant way to handle this. We therefore had to add a series of rules: at the conceptual level, the system attaches the case marker to all the words of each phrase, and then, when the words are ordered in the surface syntax stage, it removes the case from every word that is not the last one. For example, to generate the phrase “The sky, covered, with general rains and storms”, a concept indicates that the whole phrase for general rains and storms must carry the sociative case; to do this, all the terms are first marked with the case, rain(soz)+general(soz)+ekaitz(soz); later, the case mark is removed from the terms for rain and general, which are flagged as “preceding”.
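The two-step rule in the last item (mark every word with the case, then keep the mark only on the last word) can be written out as a small Python sketch. The function names are hypothetical, and the word glosses rain, general and ekaitz are reused from the example in the text; the real rules operate inside the system's knowledge base, not on plain lists.

```python
def mark_phrase(words, case):
    """Conceptual level: attach the case mark to every word of the phrase."""
    return [(word, case) for word in words]


def keep_case_on_last(marked):
    """Surface syntax: remove the case from all words except the last."""
    if not marked:
        return []
    unmarked = [(word, None) for word, _ in marked[:-1]]
    return unmarked + [marked[-1]]


def render(marked):
    """Show a phrase in the word(case) notation used in the text."""
    return "+".join(word if case is None else f"{word}({case})"
                    for word, case in marked)


marked = mark_phrase(["rain", "general", "ekaitz"], "soz")
before = render(marked)
after = render(keep_case_on_last(marked))
```

Here before is "rain(soz)+general(soz)+ekaitz(soz)" and after is "rain+general+ekaitz(soz)". Marking everything first and stripping later means the word order can be decided freely in the surface syntax stage without knowing in advance which word will end up last.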
Table 3 shows how several atomic concepts are realized in Basque (with the Spanish and French versions for reference).
Table 4 shows the realization of several molecular concepts. Where they appear, the variables stand for the values of the event: N is the state of the sky (clear, few clouds, covered...); DD is the wind direction (north, southwest, etc.); FF is the wind force (moderate, strong...); TS is the precipitation (rain, drizzle...); PER is the period (mornings...), and so on.
Future work
The project is currently in the last stages of development. The next step is a large-scale test to look for possible system errors. After that, the necessary changes and the final evaluation will be made. However, the adaptation is already integrated into the INM system, and the weather forecasts for the autonomous communities of Spain are offered every day on the web at http://www.inm.es/wwi/MultiMeteo/Multimeteo.html.
Beyond the general-purpose telegraphic style, producing special-purpose forecasts (for beaches, mountaineers, skiers...) and writing richer texts (for example, complete sentences with verbs) would be feasible steps in the medium term. Complete versions of this type have been built for French and are currently in use. For the moment it would be enough to assess the usefulness of the system developed for Basque; if a need is detected later, the improvements mentioned above should then be planned.