Posts Tagged ‘natural language’

Recognizing entities in a text: not as easy as you might think!

12 December, 2013

Entity recognition: the engineering problem

As in every engineering endeavor, when you face the problem of automating the identification of entities (proper names: people, places, organizations, etc.) mentioned in a particular text, you should look for the right balance between quality (in terms of precision and recall) and cost, from the perspective of your goals. You may be tempted to compile a simple list of such entities and apply straightforward pattern-matching techniques to identify a predefined set of entities appearing “literally” in a particular piece of news, in a tweet or in a (transcribed) phone call. If this solution is enough for your purposes (you can achieve high precision at the cost of low recall), it is clear that quality was not among your priorities. However… what if you could add a bit of excellence to your solution without technological burden, and for free? If this proposition interests you, skip the detailed technological discussion that follows and go directly to the final section.

Where do difficulties come from?

Now, I will summarize some of the difficulties that may arise when designing an automatic system for “Named Entity Recognition” (NER for short, in the technical literature). Difficulties may come from several fronts:

  • Do you deal with texts in several languages? Do you know the language of each text in advance?
  • What is the source of the documents or items of text that you have to manage? Do they come from a professional newsroom? Did you ingest them from OCR (Optical Character Recognition) or ASR (Automatic Speech Recognition) systems? Did you catch them with the API of your favorite social network?
  • Do your texts follow strict academic conventions regarding spelling and typography (i.e. do you always deal with well-written text)? Did users generate them with their limited and error-prone devices (smartphones)? Did second-language speakers or learners produce them?

Designing the perfect NER system: the language nightmare

The previous questions translate into a set of complex challenges:

1. Translingual equivalence:
Problem: When you deal with multilingual content, you are interested in recognizing not language-dependent names, but entities that are designated differently in different languages.
Example: Eiffel Tower (EN), Tour Eiffel (FR) and Torre Eiffel (ES) refer to the very same object.
Solution: You need to use semantic processing to identify meanings, relative to a consistent, language-independent world model (e.g. using ontologies or referring to linked data sources).
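
To make the idea concrete, here is a minimal sketch of this kind of linking against the public DBpedia SPARQL endpoint. It assumes DBpedia carries the corresponding multilingual labels; a production system would add caching, candidate ranking and error handling.

```python
# Minimal sketch: resolve language-specific surface forms to a shared,
# language-independent identifier using the public DBpedia SPARQL endpoint.
import requests

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def link_entity(surface_form: str, lang: str) -> list[str]:
    """Return DBpedia resource URIs whose label matches the surface form."""
    query = f"""
    SELECT DISTINCT ?resource WHERE {{
        ?resource rdfs:label "{surface_form}"@{lang} .
    }}
    LIMIT 5
    """
    response = requests.get(
        DBPEDIA_SPARQL,
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=10,
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return [b["resource"]["value"] for b in bindings]

if __name__ == "__main__":
    # All three calls should converge on the same resource
    # (dbpedia.org/resource/Eiffel_Tower), assuming DBpedia carries those labels.
    for form, lang in [("Eiffel Tower", "en"), ("Tour Eiffel", "fr"), ("Torre Eiffel", "es")]:
        print(form, "->", link_entity(form, lang))
```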

2. Intralingual or intratext equivalence:
Problem: For a particular language, texts usually refer to the same entities in different forms (to avoid repetition, for stylistic reasons or for communication purposes).
Example: Nelson Mandela, Dr. Mandela (depending on the context) and Madiba are recognized by English speakers as the same entity.
Solution: Again, in the general case, you need to link multiword strings (tokens) to meanings (representing real world objects or concepts).
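
As an illustration only, the following sketch links shorter mentions to a fuller form seen earlier in the same text by matching surnames; the tiny nickname table is an invented stand-in for a real lexical resource.

```python
import re

# Minimal sketch: within one document, link shorter mentions ("Dr. Mandela", "Madiba")
# to the fullest form seen in the text ("Nelson Mandela") via surname matching.
# The nickname table is an illustrative assumption, not a real lexical resource.
NICKNAMES = {"Madiba": "Mandela"}

def resolve_mentions(text: str) -> dict[str, str]:
    """Map shorter mentions to the full name that shares their surname."""
    full_names = re.findall(r"[A-Z][a-z]+ [A-Z][a-z]+", text)
    by_surname = {name.split()[-1]: name for name in full_names}
    resolved = {}
    short_mentions = re.findall(r"(?:Dr\.|Mr\.|Ms\.) [A-Z][a-z]+|\b[A-Z][a-z]+\b", text)
    for mention in short_mentions:
        surname = NICKNAMES.get(mention.split()[-1], mention.split()[-1])
        if surname in by_surname and mention != by_surname[surname]:
            resolved[mention] = by_surname[surname]
    return resolved

text = "Nelson Mandela spoke in Soweto. Dr. Mandela, known as Madiba, was cheered by the crowd."
print(resolve_mentions(text))
# Expected: {'Mandela': 'Nelson Mandela', 'Dr. Mandela': 'Nelson Mandela', 'Madiba': 'Nelson Mandela'}
```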

3. Transliteration ambiguity:
Problem: Transliteration of names between different alphabets.
Example: Gaddafi, Qaddafi, Qadhdhafi can refer to the same person.
Solution: It is always difficult to decide on a strategy for attaching a sense to an unknown word. Should you apply phonetic rules to find equivalents from Arabic or from Chinese? Put another way: is the unknown word just a typo, a cognitive mistake, a spelling variant or even an intended transformation? Only when context information is available can you rely on specific disambiguation strategies. For example, if you know or can deduce that you are dealing with a well-written piece of news about Libya, you should surely try to find alternative transliterations from Arabic. This problem is usually treated at the dictionary level, by incorporating the most widespread variants of foreign names.
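
A toy version of that dictionary-level treatment might look like this; both the normalization rules and the variant table are illustrative assumptions, far from the coverage a real system needs.

```python
# Minimal sketch of the dictionary-level treatment mentioned above: normalize a
# romanized Arabic name and look it up in a table of widespread variants.
# The normalization rules and the variant table are illustrative assumptions.
VARIANTS = {
    "gadafi": "Muammar Gaddafi",
    "gadhafi": "Muammar Gaddafi",
    "kadafi": "Muammar Gaddafi",
}

def normalize(name: str) -> str:
    """Collapse common transliteration differences before the lookup."""
    s = name.lower().replace("'", "").replace("q", "g").replace("dh", "d")
    # Collapse doubled letters ("dd" -> "d", ...)
    out = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

for form in ("Gaddafi", "Qaddafi", "Qadhdhafi"):
    print(form, "->", VARIANTS.get(normalize(form), "unknown"))
```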

4. Homonym disambiguation:
Problem: Proper names usually have more than one bearer.
Example: Washington may refer to more or less well-known people (starting with George Washington), the state on the Pacific coast of the USA, the capital of the USA (Washington, D.C.) and quite a few other cities, institutions and installations in the same and other countries. It can even be a metonym for the Federal government of the United States.
Solution: Semantic and contextual clues are needed for proper disambiguation. Are there any other references to the same name (maybe in a more complete form) along the piece of text under scrutiny? Can semantic analysis tell us whether we are dealing with a person (producing human actions) or a place (where things happen)? Can we establish with confidence a geographical context for the text? This could also lead us to favor particular interpretations.
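
For illustration, here is a minimal sketch of this kind of contextual disambiguation: each candidate referent gets a small keyword profile (invented for the example) and the candidate sharing the most words with the surrounding context wins.

```python
# Minimal sketch: pick the most plausible referent of an ambiguous name by
# counting overlaps between the surrounding words and a keyword profile per
# candidate. The profiles below are illustrative assumptions, not a real resource.
SENSE_PROFILES = {
    "George Washington (person)": {"president", "general", "born", "he", "his"},
    "Washington, D.C. (capital)": {"capital", "congress", "white", "house", "federal"},
    "Washington (state)": {"state", "seattle", "pacific", "governor", "coast"},
}

def disambiguate(context: str) -> str:
    """Return the candidate whose profile shares the most words with the context."""
    words = set(context.lower().split())
    scores = {sense: len(words & profile) for sense, profile in SENSE_PROFILES.items()}
    return max(scores, key=scores.get)

print(disambiguate("Washington was elected president and led the army as a general"))
print(disambiguate("Washington borders the Pacific coast and its largest city is Seattle"))
```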

5. Fuzzy recognition and disambiguation:
Problem: In the general case, how do you deal with unknown words when you rely on (possibly huge) multilingual dictionaries plus (possibly smart) tokenizers and morphological analyzers?
Example: If you find the word “Genva” in an English text, should you interpret it as Geneva (in French Genève) or as Genoa (in Italian Genova)?
Solution: The presence of unknown words is most often linked to the source of the piece of text you are analyzing. When the text has been typed on a keyboard, the writer may have failed to hit the right keys. When the text comes from a scanned image through OCR, the result can be erroneous depending on image resolution, font type and size, etc. Something similar occurs when you get a text through ASR. The strategy for interpreting the unknown word correctly (identifying the meaning intended by the author) involves using distance metrics between the unknown word and other words that you can recognize as correct. In our example, if the text was typed on a QWERTY keyboard, the distance between Genva and Geneva involves a single deletion, while the distance between Genva and Genoa involves a substitution by a letter that is quite far away on the keyboard. So, using such distance metrics, Geneva should be preferred. But contextual information is equally important for disambiguation. If our text includes mentions of places in Switzerland, or Switzerland can be established as the right geographical context, then Geneva gains plausibility. Otherwise, if the text is about Mediterranean cruises, Genoa seems the natural choice.
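
To see why the keyboard matters: plain Levenshtein distance scores Genva at 1 from both Geneva and Genoa. The following sketch weights substitutions by the distance between keys on a simplified QWERTY layout (an assumption, as are the exact costs), which breaks the tie in favor of Geneva.

```python
# Minimal sketch: Levenshtein distance where the substitution cost grows with the
# physical distance between keys on a simplified QWERTY layout (an assumption);
# insertions and deletions cost 1. Handles lowercase a-z only.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

def sub_cost(a: str, b: str) -> float:
    """Nearby keys cost little; distant keys cost up to 2."""
    if a == b:
        return 0.0
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return min(2.0, 0.5 + 0.5 * (abs(r1 - r2) + abs(c1 - c2)))

def weighted_levenshtein(s: str, t: str) -> float:
    s, t = s.lower(), t.lower()
    prev = [j * 1.0 for j in range(len(t) + 1)]
    for i, cs in enumerate(s, 1):
        curr = [float(i)]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                      # deletion
                            curr[j - 1] + 1,                  # insertion
                            prev[j - 1] + sub_cost(cs, ct)))  # substitution
        prev = curr
    return prev[-1]

for candidate in ("Geneva", "Genoa"):
    print(candidate, weighted_levenshtein("Genva", candidate))
# Geneva scores 1.0 (one deletion); Genoa scores 2.0 (v -> o is a distant substitution).
```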

Textalytics: semantic technology at your fingertips

Systems and platforms for Content Management (CMS), Customer Relationship Management (CRM), Business Intelligence (BI) or Market Surveillance incorporate information retrieval functionality that allows searching for individual tokens (typically alphanumeric strings) or literals in unstructured data. However, they are very limited in terms of recognition of semantic elements (entities, concepts, relationships, topics, etc.). This kind of text analytics is very useful not only for indexing and search purposes, but also for content enrichment. The final aim of these processes is to add value in terms of higher visibility and findability (e.g. for SEO purposes), content linkage and recommendation (related content), ad placement (contextual advertising), customer experience analysis (Voice of the Customer, VoC analytics), social media analysis (reputation analysis), etc.

To facilitate the integration of semantic functionality into any software application, Daedalus opened its multilingual semantic APIs to the community through the cloud-based service Textalytics. On the client side, you send a request to our service in order to process one item of text (a piece of news, a tweet, etc.); what you get back is the result of our processing in an interchange format (XML or JSON). A minimal sketch of such a call appears after the list below. Textalytics APIs offer natural language processing functionality in two flavors:
  • Core APIs: one API call for each single process (entity extraction, text classification, spell checking, sentiment analysis, content moderation, etc.). Fine tuning is achieved through multiple parameters. Besides core natural language processing, audio-to-text transcription is also available, as well as auxiliary functions. Auxiliary APIs are useful, for example, to link entities to open linked data repositories such as DBpedia/Wikipedia, or to guess crucial demographic features (type, gender, age) of a given social media user.
  • Vertical APIs (Media Analysis, Semantic Publishing): one API call provides highly aggregated results (e.g. extraction of entities and topics, plus classification, plus sentiment analysis…), convenient for standard use in a vertical market (media industry, publishing industry…)
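
As a rough idea of what such a call looks like from the client side, here is a minimal sketch in Python; the endpoint URL and parameter names are placeholders, not the real ones, so check the Textalytics documentation for the actual values.

```python
# Minimal sketch of a client-side call to a semantic API such as Textalytics.
# The endpoint URL and parameter names below are illustrative placeholders
# (assumptions); consult the Textalytics documentation for the actual values.
import requests

API_URL = "https://api.example.com/entities"   # placeholder, not the real endpoint
API_KEY = "your-personal-key"                  # obtained when you register

def extract_entities(text: str) -> dict:
    """Send one piece of text and return the parsed JSON response."""
    response = requests.post(
        API_URL,
        data={"key": API_KEY, "txt": text, "lang": "en", "of": "json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = extract_entities("The Eiffel Tower closed while Dr. Mandela visited Paris.")
    print(result)
```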

To end this post, let me stress other benefits of selecting Textalytics for semantic processing:

  • SDKs (Java, Python, PHP and Visual Basic) are offered for quick integration. Software developers need no more than half an hour to read the documentation and integrate our semantic capabilities into any environment.
  • You can register with Textalytics, subscribe to the API or APIs of your choice, get your personal key and send as many requests as you want for free, up to a maximum of 500,000 words processed per month, whether for research, academic or commercial use.
  • If you need to process higher volumes of text (exceeding the free basic plan), or if you need to launch more than five API calls per second, you can subscribe at affordable prices. No long-term commitment; pay per month. Check out our pricing plans.

Curious? Try our demo!
Interested? Contact us!
Believer? Follow us!

José C. González (@jc_gonzalez)

Language Technology and the Future of the Content Industry

A few days ago I had the opportunity to participate as a speaker at a conference organized by LT-Innovate (the European Language Technology Industry Forum) oriented to the publishing and media industries. This initiative is part of the focus groups that LT-Innovate is organizing in order to boost and expand the activity of companies providing products and services based on language technology (intelligent content processing, speech technology and automatic translation). Representatives of around thirty European companies, both customers and suppliers, attended the forum.

In my presentation I emphasized the transformation of the content industry as a result of a crisis with numerous facets: changes in the way users consume content, the abandonment of traditional media formats and the rapid shift to the Internet environment, the abundance of free content (an enormous volume of it produced and published directly and instantly by users), and the fall in advertising income. It is a scenario that is causing the failure of business models that were successful until recently, and the rise of others whose outcome is still unpredictable.

Until not long ago, solutions based on language technology had little place in content management tools, or were limited to isolated applications in the production environment. Nevertheless, progressive digitalization, the growth of the segment of the Internet devoted to content consumption, the urgent need to reduce costs and time, the integration of media newsrooms regardless of the delivery format, etc. have progressively expanded our clients’ needs. Thus, gradually and over fifteen years, at Daedalus we have been covering those needs by enlarging our catalog of solutions, which includes the following:

  • Spell, grammar and style checking oriented to the professional environment, which requires accuracy and uniform criteria.
  • Semantic publishing, including the automatic identification of entities (people, organizations, places, facilities, concepts, time or currency references…) and significant concepts, and the classification or grouping of texts according to journalistic or documentary standards.
  • Moderation or automatic filtering of forums and immediate revision of user-generated content.
  • Indexing and search of multilingual and multimedia content.
  • Approximate and natural language search interfaces.
  • Search in multilingual content by incorporating automatic translation systems.
  • Transcription of multimedia content and automatic video subtitling.
  • Automatic analysis of opinions, feelings and reputation in social media.

All these applications find use in the increasingly diversified processes of the content industry:

  • Delivery of content and contextual advertising adapted to users’ interest profiles.
  • Production of transmedia content (simultaneous, complementary and synchronized distribution through multiple platforms: TV, Internet, tablets, smartphones).
  • Support for documentary research and data journalism, based on the analysis and advanced exploration of heterogeneous information sources.
  • Support for Search Engine Optimization and online marketing.
  • Support for new business models based on the sale of single pieces of content, or of stories built up by aggregating content produced over time about a subject, an event, a public figure, etc.

As we can see, language technology has moved from a marginal to a central position in all areas of this industry. At Daedalus we are proud to have served a good number of companies and groups in this industry throughout this process, and we feel closely committed to them.

We invite you to check out our presentation at the Publishing/Media Industry Forum organized by LT-Innovate (Berlin, April 12th, 2013).

Jose C. Gonzalez
@jc_gonzalez
@jgonzalez_es

How can automatic proofreading help publishing professionals? (2nd part)

22 December, 2011

As we showed in the first part, automatic text verification systems aim to become useful resources. However, these applications are by definition tools that assist in writing, and they should never replace the human proofreader, especially if the goal is publishing. Even now, there are many questions that technology cannot address.

Where should we focus our attention?

We cannot trust technology when text revision involves comprehensive and careful reading in order to find ambiguous sentences or inconsistencies on the author’s part (e.g. a character’s name changing over the course of a story), or to decide whether a footnote is necessary, etc.

Apart from this, we must pay attention to another type of revision. It is called conceptual or technical revision, and it consists of examining the text to see whether it conforms to the terminological conventions typical of the subject in question. In fact, this task should not be assigned to a specialist in spelling and style, but rather to a specialist in the given subject (a physician for a handbook of medicine, an engineer for a technical text, etc.).

Despite these facts, we must note that language technology specialists have begun to handle information on a semantic basis. Examples of this are anaphora and coreference resolution. We believe that, in the near future, there will be major advances in the detection of certain lexical ambiguities and misuses.

Why should publishing professionals make use of automatic proofreading?

We know that revising a text is a time-consuming task. Thus, we believe that publishing professionals can go a step further, and not just confine themselves to looking up information in dictionaries, grammars and other reference books. The new automatic proofreading systems are certainly helpful:

  • You can save time on tedious tasks that the automatic proofreader can perform easily.
  • You can focus your efforts on activities that involve human processing.
  • You can improve the quality of the final revision.
  • You will have more time left to meet the tight deadlines imposed by the publisher.

In conclusion, you can be more productive, increase your profits and, at the same time, maintain the quality of your work.

Try STILUS, our proofreading software.

[English version of  ¿Qué aporta la corrección automática al profesional de la edición? (parte 2)]

How can automatic proofreading help publishing professionals? (1st part)

22 December, 2011

A human proofreader is a professional in charge of revising material written by an author, trying to ensure that readers receive the message clearly and free from errors.

The editing process typically comprises several different levels of textual revision: spelling and typographical checking, style checking, conceptual revision and, where applicable, revision of translated texts. All publishing houses are aware of this process, but only a few put it into practice. In reality, it is not common for a publishing house to properly assign each type of revision to specialized proofreaders. Usually, the proofreader of a given text gets far too much work, as he carries out all the revision work that three or four specialists should have done. He stands as a mediating demiurge who turns ideas into something legible. How much is he paid for this? €0.72 per 1,000 matrices (characters including spaces) for proofreading on screen, and around €0.50 for second galleys (proofreading on paper). In short, proofreaders are working for five or six euros per hour in the most profitable cases.

Thus, these editorial demiurges may want to explore ways of increasing productivity while, at the same time, protecting the quality of their work.

How can the automatic text verification technology contribute to the proofreading process?

Granted, philologists and some other language professionals are very wary of anything related to “automatic proofreading”. However, we want to make clear that prejudging a last-generation software tool is somewhat unfair. Language lovers might rather congratulate themselves on the new Natural Language Processing technologies that make it possible to proofread a text automatically. These automatic proofreaders are able to check, with a high degree of linguistic precision and recall, many issues regarding spelling and typography (according to the application’s depth of processing), and they can equally help a text conform to spelling and grammar rules. On the other hand, most of these applications do not rewrite the text automatically; rather, they let the user choose among the different proposals that the application makes.

What issues can be addressed by automatic proofreading?

  • Spelling and typographic checking. An optimum level of orthographic recall can be reached if the system has a good lexical base. This avoids false warnings on existing words (even infrequent ones) and also makes it possible to check the spelling of national and foreign proper nouns (e.g. toponyms, people’s names, institutions, brand names, etc.). In addition, many tools include personal dictionaries to which new words can be added, thereby expanding the lexical base. These applications are also becoming context-sensitive, so that homophone and diacritic errors can be found. Finally, a proofreading application considers further spelling and typography issues: it can advise on the use of italics (e.g. for foreign words), verify the opening and closing of pairs of signs, warn of wrong sequences of punctuation marks, verify the correct use of upper and lower case letters, check the spacing (double spaces, required spaces or joins between typographic signs and words), etc. (a minimal sketch of a few such checks appears after this list).
  • Grammar checking. Last-generation proofreading applications are able to disambiguate parts of speech and word senses, which allows them to find many agreement errors at different sentence levels, as well as other syntactic violations such as mismatched verb tenses or errors in prepositional government.
  • Style checking. These applications are able to make suggestions about preferred spelling variants, lexical misuses or overly colloquial registers. They can also provide alternatives to foreign words, and warn of phenomena that can make reading confusing (excessive use of prepositions, word repetition, overly long sentences, redundancy, unwanted technical terms, etc.).
  • Revision of translations. These applications are able to find loan translations between the source and the target language. They can also warn of false friends or wrong transliterations.
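
As promised above, here is a minimal sketch of a few of those typographic checks; the rules and messages are illustrative only, and a real proofreader such as STILUS covers far more ground.

```python
import re

# Minimal sketch of a few of the typographic checks listed above: double spaces,
# a space before a punctuation mark, and unbalanced pairs of signs.
# The rules and messages are illustrative assumptions, not a real rule set.
CHECKS = [
    (re.compile(r"  +"), "double space"),
    (re.compile(r" [,;:.!?]"), "space before punctuation mark"),
]
PAIRS = [("(", ")"), ("[", "]"), ("«", "»")]

def check_typography(text: str) -> list[str]:
    """Return a warning message for each rule the text violates."""
    warnings = [msg for pattern, msg in CHECKS if pattern.search(text)]
    warnings += [f"unbalanced pair {o}{c}" for o, c in PAIRS if text.count(o) != text.count(c)]
    return warnings

print(check_typography("The report (still unpublished  was delayed ."))
# Expected: ['double space', 'space before punctuation mark', 'unbalanced pair ()']
```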

Try STILUS, our proofreading software.

What is left for a human proofreader?

Read it on the next post.

[English version of  ¿Qué aporta la corrección automática al profesional de la edición? (parte 1)]
