Archive for the ‘Language Technology’ Category

Recognizing entities in a text: not as easy as you might think!

12 December 2013

Entity recognition: the engineering problem

As in every engineering endeavor, when you face the problem of automating the identification of entities (proper names: people, places, organizations, etc.) mentioned in a particular text, you should look for the right balance between quality (in terms of precision and recall) and cost, from the perspective of your goals. You may be tempted to compile a simple list of such entities and apply straightforward pattern-matching techniques to identify a predefined set of entities appearing “literally” in a piece of news, a tweet or a (transcribed) phone call. If this solution is enough for your purposes (you can achieve high precision at the cost of low recall), it is clear that quality was not among your priorities. However, what if you could add a bit of excellence to your solution, without any technological burden, for free? If you are interested in this proposition, skip the following detailed technological discussion and go directly to the final section.
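To make the list-based approach concrete, here is a minimal sketch of literal matching against a hand-compiled list. The list contents and the function name are invented for illustration; the point is only to show why such a matcher is precise but misses anything that is not spelled exactly as listed.

```python
import re

# Hypothetical hand-compiled list of entity names (illustration only).
GAZETTEER = ["Eiffel Tower", "Nelson Mandela", "Washington"]

def find_literal_entities(text, names=GAZETTEER):
    """Return (name, start, end) for every literal, case-sensitive match."""
    # Try longer names first so they win over any shorter name they contain.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b"
    )
    return [(m.group(0), m.start(), m.end()) for m in pattern.finditer(text)]

hits = find_literal_entities(
    "Nelson Mandela visited the Eiffel Tower; Madiba smiled."
)
# "Madiba" is not in the list, so that mention is silently missed.
```

Every hit is a real entry of the list (high precision), but aliases, translations and misspellings go unnoticed (low recall).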

Where do the difficulties come from?

Now, I will summarize some of the difficulties that may arise when designing an automatic system for “Named Entity Recognition” (NER for short in technical papers). Difficulties may come from several fronts:

  • Do you deal with texts in several languages? Do you know the language of each text in advance?
  • What is the source of the documents or items of text that you have to manage? Do they come from a professional newsroom? Did you ingest them from OCR (Optical Character Recognition) or ASR (Automatic Speech Recognition) systems? Did you catch them with the API of your favorite social network?
  • Do your texts follow strict academic conventions regarding spelling and typography (i.e. do you always deal with well-written text)? Did users generate them with their limited and error-prone devices (smartphones)? Did second-language speakers or learners produce them?

Designing the perfect NER system: the language nightmare

The previous questions end up in a set of complex challenges:

1. Translingual equivalence:
Problem: When you deal with multilingual content, you want to recognize not language-dependent names but the entities themselves, which are designated differently in different languages.
Example: Eiffel Tower (EN), Tour Eiffel (FR) and Torre Eiffel (ES) refer to the very same object.
Solution: You need to use semantic processing to identify meanings, relative to a consistent, language-independent world model (e.g. using ontologies or referring to linked data sources).
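As a rough illustration of linking names to a language-independent model, the sketch below maps labels in several languages to one shared identifier. The identifier scheme and the tiny knowledge base are assumptions made up for this example; a real system would refer to an ontology or a linked data source instead.

```python
# Hypothetical mini knowledge base: one language-independent identifier
# per entity, with the label used in each language (illustration only).
KNOWLEDGE_BASE = {
    "entity:eiffel_tower": {
        "en": "Eiffel Tower",
        "fr": "Tour Eiffel",
        "es": "Torre Eiffel",
    },
}

# Reverse index: case-folded label -> entity identifier.
LABEL_INDEX = {
    label.casefold(): entity_id
    for entity_id, labels in KNOWLEDGE_BASE.items()
    for label in labels.values()
}

def link(surface_form):
    """Return the identifier for a known label, or None if it is unknown."""
    return LABEL_INDEX.get(surface_form.casefold())
```

With this index, `link("Tour Eiffel")` and `link("Torre Eiffel")` resolve to the same identifier, which is what downstream processing should work with.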


2. Intralingual or intratext equivalence:
Problem: Within a single language, texts usually refer to the same entity in different ways (to avoid repetition, for reasons of style or for communication purposes).
Example: Nelson Mandela, Dr. Mandela (depending on the context) and Madiba are recognized by English speakers as the same entity.
Solution: Again, in the general case, you need to link multiword strings (tokens) to meanings (representing real world objects or concepts).

3. Transliteration ambiguity:
Problem: Names are transliterated in different ways between different alphabets.
Example: Gaddafi, Qaddafi and Qadhdhafi can refer to the same person.
Solution: It is always difficult to decide on a strategy for attaching a sense to an unknown word. Should you apply phonetic rules to find equivalents from Arabic or from Chinese? To put it another way: is the unknown word just a typo, a cognitive mistake, a spelling variant or even an intended transformation? Only when context information is available can you rely on specific disambiguation strategies. For example, if you know or deduce that you are dealing with a well-written piece of news about Libya, you should surely try to find alternative transliterations from Arabic. This problem is usually treated at dictionary level, incorporating the most widespread variants of foreign names.
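One way to group spelling variants before a dictionary lookup is to reduce each name to a crude comparison key. The rules below are toy rules invented for this example (they are not a real romanization standard and would over-merge unrelated names), so treat this as a sketch of the idea rather than a usable normalizer.

```python
import re

def transliteration_key(name):
    """Reduce a name to a crude key so that spelling variants collide.

    The three rules are assumptions for illustration only.
    """
    key = name.casefold()
    key = key.replace("dh", "d")          # treat "dh" like "d"
    key = re.sub(r"^[qg]", "k", key)      # treat initial q / g / k alike
    key = re.sub(r"(.)\1+", r"\1", key)   # collapse doubled letters
    return key

variants = ["Gaddafi", "Qaddafi", "Qadhdhafi"]
keys = {transliteration_key(v) for v in variants}
```

All three variants end up with the same key, so a dictionary indexed by key would return one entry for any of them. In practice such a key only makes sense when the context already suggests which source language to assume.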

4. Homonym disambiguation:
Problem: Proper names usually have more than one bearer.
Example: Washington may refer to more or less well-known people (starting with George Washington), the state on the Pacific coast of the USA, the capital of the USA (Washington, D.C.) and quite a few other cities, institutions and installations in the same and other countries. It can even be a metonym for the federal government of the United States.
Solution: Semantic and contextual clues are needed for proper disambiguation. Are there any other references to the same name (maybe in a more complete form) elsewhere in the text under scrutiny? Can semantic analysis tell us whether we are dealing with a person (producing human actions) or a place (where things happen)? Can we establish with confidence a geographical context for the text? This could also lead us to favor particular interpretations.
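The use of contextual clues can be sketched as a simple overlap score between the words of the text and a set of clue words per candidate sense. The senses and clue words below are invented for illustration, and real systems rely on much richer semantic analysis than word overlap.

```python
import re

# Hypothetical clue words for each possible bearer of the name.
CANDIDATES = {
    "George Washington (person)": {"president", "general", "george", "born"},
    "Washington (state)": {"seattle", "pacific", "state", "olympia"},
    "Washington, D.C. (city)": {"capitol", "congress", "potomac", "district"},
}

def disambiguate(text, candidates=CANDIDATES):
    """Pick the sense whose clue words overlap most with the text.

    Returns None when no clue word appears at all. Ties are resolved
    in favor of the sense listed first.
    """
    tokens = set(re.findall(r"\w+", text.casefold()))
    scores = {sense: len(clues & tokens) for sense, clues in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For a sentence such as "Washington lies on the Pacific coast, and Seattle is its largest city", only the clue words of the state appear, so that sense is selected.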

5. Fuzzy recognition and disambiguation:
Problem: In the general case, how do you deal with unknown words when you rely on (maybe huge) multilingual dictionaries plus (maybe smart) tokenizers and morphological analyzers?
Example: If you find the word “Genva” in an English text, should you interpret it as Geneva (in French Genève) or as Genoa (in Italian Genova)?
Solution: The presence of unknown words is linked most of the time to the source of the text that you are analyzing. When the text has been typed with a keyboard, the writer may have hit the wrong keys. When the text comes from a scanned image through OCR, the result can be erroneous depending on image resolution, font type and size, etc. Something similar occurs when you get a text through ASR. The strategy for interpreting the unknown word correctly (identifying the meaning intended by the author) involves measuring the distance between the unknown word and other words that you can recognize as correct. In our example, if the text has been typed on a QWERTY keyboard, Genva differs from Geneva by a single omitted letter, while it differs from Genoa by a single substitution between two keys (v and o) that are far apart on the keyboard, which makes that slip less likely. So, using distance metrics, Geneva should be preferred. But contextual information is equally important for disambiguation. If our text mentions places in Switzerland, or Switzerland can be established as the geographical context, then Geneva becomes more likely. On the other hand, if the text is about Mediterranean cruises, Genoa seems to be the natural choice.
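The Genva example can be reproduced with an edit distance whose substitution cost grows with the physical distance between keys. This is a minimal sketch under stated assumptions: the key coordinates, the row offsets and the cost values (1.0 per inserted or deleted letter, up to 2.0 per substitution) are choices made for this example, not parameters of any particular product.

```python
from math import hypot

# Approximate QWERTY layout; the horizontal row offsets are assumptions.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OFFSETS = [0.0, 0.25, 0.75]
KEY_POS = {
    ch: (r, c + ROW_OFFSETS[r])
    for r, row in enumerate(ROWS)
    for c, ch in enumerate(row)
}

def substitution_cost(a, b):
    """Cheap for neighboring keys, capped at 2.0 for distant ones."""
    if a == b:
        return 0.0
    if a not in KEY_POS or b not in KEY_POS:
        return 1.0  # fallback for characters outside the layout
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return min(2.0, 0.5 * hypot(r1 - r2, c1 - c2))

def keyboard_distance(typed, candidate):
    """Levenshtein distance with keyboard-aware substitution costs."""
    s, t = typed.casefold(), candidate.casefold()
    prev = [float(j) for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [float(i)]
        for j in range(1, len(t) + 1):
            curr.append(min(
                prev[j] + 1.0,        # delete a typed letter
                curr[j - 1] + 1.0,    # insert a missing letter
                prev[j - 1] + substitution_cost(s[i - 1], t[j - 1]),
            ))
        prev = curr
    return prev[-1]

def best_candidate(typed, candidates):
    """Return the known word closest to the typed one."""
    return min(candidates, key=lambda word: keyboard_distance(typed, word))
```

With these costs, Genva is at distance 1.0 from Geneva (one missing letter) and 2.0 from Genoa (v and o are far apart), so `best_candidate("Genva", ["Genoa", "Geneva"])` picks Geneva. Contextual evidence would still have to be combined with this score before a final decision.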

Meaning as a Service


Textalytics: semantic technology at your fingertips

Systems or platforms for Content Management (CMS), Customer Relationship Management (CRM), Business Intelligence (BI) or Market Surveillance incorporate information retrieval functionality that allows searching for individual tokens (typically alphanumeric strings) or literals in unstructured data. However, they are very limited when it comes to recognizing semantic elements (entities, concepts, relationships, topics, etc.). This kind of text analytics is very useful not only for indexing and search purposes, but also for content enrichment. The final aim of these processes is to add value in terms of higher visibility and findability (e.g. for SEO purposes), content linkage and recommendation (related content), ad placement (contextual advertising), customer experience analysis (Voice of the Customer, VoC analytics), social media analysis (reputation analysis), etc.

To facilitate the integration of semantic functionality in any software application, Daedalus opened its multilingual semantic APIs to the community through the cloud-based service Textalytics. On the client side, you send a request to our service in order to process one item of text (a piece of news, a tweet, etc.); what you get back is the result of our processing in an interchange format (XML or JSON). Textalytics APIs offer natural language processing functionality in two flavors:
  • Core APIs: one API call for each single process (extraction of entities, text classification, spell checking, sentiment analysis, content moderation, etc.). Fine-tuning is achieved through multiple parameters. Besides core natural language processing, audio transcription to text is also available, as well as auxiliary functions. Auxiliary APIs are useful, for example, to link entities with open linked data repositories, such as DBpedia/Wikipedia, or to guess crucial demographic features (type, gender, age) for a given social media user.
  • Vertical APIs (Media Analysis, Semantic Publishing): one API call provides highly aggregated results (e.g. extraction of entities and topics, plus classification, plus sentiment analysis…), convenient for standard use in a vertical market (media industry, publishing industry…)

To end this post, let me stress other benefits of selecting Textalytics for semantic processing:

  • SDKs (Java, Python, PHP and Visual Basic) are offered for quick integration. Software developers need no more than half an hour to read the documentation and integrate our semantic capabilities in any environment.
  • You can register in Textalytics, subscribe to the API or APIs of your choice, get your personal key and send as many requests as you want for free, up to a maximum of 500,000 words processed per month, whether for research, academic or commercial use.
  • If you need to process higher volumes of text (exceeding the free basic plan) or you need to launch more than five API calls per second, you can subscribe at affordable prices. No long-term commitment. Pay per month. Check out our pricing plans.

Curious? Try our demo!
Interested?  Contact us!
Believer? Follow us!

José C. González (@jc_gonzalez)

Semantic Analysis and Big Data to understand Social TV

25 November 2013

We recently participated in the Big Data Spain conference with a talk entitled “Real time semantic search engine for social TV streams”. This talk describes our ongoing experiments on Social TV and combines our most recent developments on using semantic analysis on social networks and dealing with real-time streams of data.

Social TV, which exploded with the use of social networks while watching TV programs, is a growing and exciting phenomenon. Twitter reported that more than a third of its firehose during primetime discusses TV (at least in the UK), while Facebook claimed five times more comments behind its private wall. Recently, Facebook also started to offer hashtags and the Keywords Insight API to selected partners as a means of providing aggregated statistics on Social TV conversations inside the wall.

As more users have turned to social networks to comment with friends and other viewers, broadcasters have looked for ways to be part of the conversation. They use official hashtags, let actors and anchors tweet live and have even started to offer companion apps with social sharing functionality.

While the concept of socializing around TV is not new, the possibility of measuring and distilling the information around these interactions opens up brand new possibilities for users, broadcasters and brands alike. User interest already fuels Social TV, as it fulfills their need to start conversations with friends, other viewers and the aired program. Chatter around TV programs may help to recommend other programs or to serve contextually relevant information about actors, characters or whatever appears on TV. Moreover, better ways to access and organize public conversations will attract new users to a TV program and engage current ones.

On the other hand, understanding the global conversation about a program is definitely useful for broadcasters and brands seeking insights. Broadcasters and TV producers can measure their viewers’ preferences and reactions, or those of their competitors’ audiences, and acquire complementary information beyond plain audience numbers. Brands are also interested in finding the most appropriate programs to reach their target users, as well as in understanding the impact and acceptance of their ads. Finally, new TV and ad formats are already being created based on interaction and participation, which again bolsters engagement.

In our talk, we describe a system that combines natural language processing components from our Textalytics API with a scalable semi-structured database/search engine, SenseiDB, to provide semantic and faceted search, real-time analytics and support for visualizations in this kind of application.

Using the Textalytics API, we are able to include interesting features for Social TV, like analyzing the sentiment around an entity (a program, actor or sportsperson). Besides, entity recognition and topic extraction allow us to produce trending topics for a program that correlate well with whatever happens on-screen. They work as an effective way to organize the conversation in real time when combined with the online facets provided by SenseiDB. Other functionalities, like language recognition and text classification, help us to clean the noisy streams of comments.
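The idea of trending topics derived from recognized entities can be sketched as a count over a sliding time window. This is not the system described in the talk (which relies on SenseiDB facets); the class below is a hypothetical, in-memory illustration that assumes entities have already been extracted from each comment and that timestamps arrive in non-decreasing order.

```python
from collections import Counter, deque

class TrendingEntities:
    """Count entity mentions seen during the last `window_seconds`."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.events = deque()      # (timestamp, entity), oldest first
        self.counts = Counter()

    def add(self, timestamp, entities):
        """Register the entities found in one comment, then drop old ones."""
        for entity in entities:
            self.events.append((timestamp, entity))
            self.counts[entity] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Remove mentions that have fallen out of the time window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, n=3):
        """Return the n most mentioned entities in the current window."""
        return self.counts.most_common(n)
```

Feeding it the entities of each incoming comment keeps `top()` aligned with what viewers are talking about at that moment, since older mentions expire as new ones arrive.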

SenseiDB is the second pillar of our system: a semi-structured distributed database that helps us ingest streams and make them available for search in real time, with low query and indexing times. It includes a large number of facet types that enable navigation over a range of semantic information. With the help of histogram and range facets, it can even be stretched to cover simple analytics tasks. It is rounded off with a simple and elegant query language, BQL, which helps us speed up the development of visualizations on top.

If you find it interesting, check out our presentation for more detail or even the video of the event.

Offensive comments from readers in European online media have come to a full stop: Media will be responsible. What’s next?

28 October 2013

The European Court of Human Rights issued on October 10th a judgment that is highly relevant for European media companies. The case was brought by the Estonian news website Delfi, which had been held liable by the courts of its country for publishing offensive reader comments against the director of a company that acted as a source of information. The news item in question was published on January 24th, 2006, and a few weeks later, on March 9th, the lawyers of the offended party requested the withdrawal of 20 offensive comments and compensation for moral damages. The news website removed the comments on the same day and rejected the claim for compensation. The following month, a civil lawsuit was filed before the Estonian courts. This lawsuit reached the highest national court, which upheld the finding of liability and ordered the media company to pay 320 euros in compensation to the plaintiff.

Delfi, the company that owns the news portal, appealed to Strasbourg (seat of the European Court of Human Rights), arguing that the judgment violated the principle of freedom of expression, protected by article 10 of the Convention for the Protection of Human Rights and Fundamental Freedoms.


Now, this European court has ruled against the media company, despite the fact that Delfi had an automatic (rudimentary) system to filter out comments that included certain keywords (insults or other problematic words). In addition, Delfi had a mechanism with which readers could mark a comment as inappropriate. The judgment considers that this filter was insufficient to prevent damage to the honor of third parties and that the media company should have taken more effective action to prevent these situations.

The court considers it reasonable to hold the publisher responsible, since its function is to publish information and give visibility to readers’ comments, and it profits from the traffic generated by those comments.

What now? In an entry of this blog, entitled “Moderating participation in the media” [in Spanish] and published a couple of years ago, we summed up the difficulties and the keys of our approach to help solve a problem that is not trivial.

Difficulties are manifold. On the one hand, detecting isolated offensive words is not enough: it is necessary to filter expressions, sometimes taking into account their context and inflected forms. On the other hand, it is also necessary to interpret abbreviated language and texts with typographic errors, which are noticeably frequent in comments and user-generated content sections. These “errors” can arise from the limitations of devices, the impulsive nature of commenting, or the intention of users who, knowing that automatic filters exist, try to outsmart them by all means (sometimes in really witty ways).

In addition to this problem related to the Variety of texts, we find the other two recurring features in “big data” applications (forming the famous 3Vs): Volume of the comments to be processed and Velocity of response required.

At Daedalus, we have been addressing these problems for the media industry for years and lately also for other sectors, like banking and insurance.

As regards the integration architecture of our solutions, we currently offer them in SaaS (Software as a Service) mode, from Textalytics, our new cloud-based API platform, as well as through traditional licensing to run on-premises.

With automatic filtering systems, we cannot guarantee 100% accuracy for any filtering task. Different companies or media, and different sections within the same outlet, require different strategies. It seems clear that it makes no sense to apply the same filter criteria to the comments on a brilliant feature article and to the contributions that emerge during the live broadcast of a football match or a reality show. In this sense, our systems assess the severity of each expression, allowing our customers to flexibly set their acceptability threshold. In addition, we provide customization tools to facilitate the incorporation of new problematic expressions. Finally, we also permanently monitor the operation of these systems for customers who wish it, within their plans for continuous quality assurance and improvement.
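The combination of text normalization, severity scoring and a per-section threshold can be sketched as follows. This is a toy illustration and not the Daedalus moderation system: the lexicon, the severity values, the character substitutions and the thresholds are all invented for this example.

```python
import re

# Hypothetical lexicon: expression -> severity between 0 and 1.
SEVERITY = {"idiot": 0.4, "scum": 0.8}

# Hypothetical character swaps that users apply to dodge filters.
LEET = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "@": "a", "$": "s"})

def normalize(comment):
    """Undo some common obfuscation tricks before the lexicon lookup."""
    text = comment.casefold().translate(LEET)
    text = re.sub(r"(?<=\w)[.\-_*](?=\w)", "", text)  # "i.d.i.o.t" -> "idiot"
    return re.sub(r"(.)\1{2,}", r"\1", text)          # "idiooot" -> "idiot"

def severity(comment):
    """Return the highest severity among the words of the comment."""
    tokens = re.findall(r"\w+", normalize(comment))
    return max((SEVERITY.get(token, 0.0) for token in tokens), default=0.0)

def is_acceptable(comment, threshold):
    """Accept the comment only if its severity stays below the threshold."""
    return severity(comment) < threshold

# Each section sets its own tolerance (values are assumptions).
THRESHOLDS = {"feature_article": 0.3, "live_football": 0.9}
```

With these values, a comment such as "You 1d10t!" would be published in the live football section but held back on a feature article, which is the kind of flexibility described above. A word-level lookup like this one still ignores context and multiword expressions.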

Are you interested? Feel free to contact Daedalus.

Discover our solutions for the media industry.

Jose C. Gonzalez

Offensive comments from readers in online media come to a full stop: the media will be responsible. And now what?

24 October 2013

The European Court of Human Rights, the same court that has just ruled against the retroactive application of the so-called “Parot doctrine”, issued on October 10th a judgment that is highly relevant for European media.

The case in question was brought by the Estonian news website Delfi, which had been held liable by the courts of its country for publishing offensive reader comments against the director of a company that acted as the source of a news item. The news item in question was published on January 24th, 2006, and a few weeks later, on March 9th, the lawyers of the offended party requested the withdrawal of 20 offensive comments and compensation for moral damages. The news website removed the comments on the same day and rejected the claim for compensation. The following month, a civil lawsuit was filed before the Estonian courts. This lawsuit reached the highest national court, which upheld the finding of liability and ordered the media company to pay 320 euros in compensation to the plaintiff.


The company that owns the news portal, Delfi, appealed to Strasbourg (seat of the European Court of Human Rights), arguing that the judgment violated the principle of freedom of expression, protected by article 10 of the Convention for the Protection of Human Rights and Fundamental Freedoms.

Now, this European court has ruled against the media company, despite the fact that Delfi had an automatic (rudimentary) system to filter out comments that included certain keywords (insults or other problematic words). In addition, Delfi had a mechanism with which readers themselves could mark a comment as inappropriate. The judgment considers that this filter was insufficient to prevent damage to the honor of third parties and that the media company should have taken more effective measures to prevent these situations.

The Court considers it reasonable to hold the publisher responsible, since its function is to publish information and give visibility to readers’ comments, and it profits from the traffic generated by those comments.

And now, what is to be done? In a text on this same blog, entitled “Moderar la participación en los medios” (Moderating participation in the media) and published a couple of years ago, we summed up the difficulties and the keys of our approach to help solve a problem that is not trivial.

The difficulties are manifold. On the one hand, detecting isolated offensive words is not enough: it is necessary to filter expressions, sometimes taking into account the context of the expression and its inflected variants. On the other hand, it is necessary to interpret abbreviated language and texts with the spelling and typographic errors that are so frequent in participation sections and user-generated content. These “errors” can arise from the limitations of devices, from the impulsive nature of comments, or from the masking intention of users who, knowing that automatic filters exist, try to get around them by all means (sometimes very ingeniously).

In addition to this problem related to the Variety of texts, we find the other two recurring features of “big data” applications (forming the famous 3Vs): the Volume of comments to be processed and the Velocity of response required.

At Daedalus, we have been addressing these problems for years for the media sector, and lately also in other sectors, such as banking and insurance.

As for the integration architecture of these solutions, we currently offer them as a service in SaaS (Software as a Service) mode, from Textalytics, our cloud-based API platform, in addition to traditional licensing to run on-premises.

With automatic systems, we cannot guarantee 100% accuracy for any filtering task. Different companies or media, and different sections within the same outlet, require different strategies. It seems evident that it makes no sense to apply the same filtering criteria to the comments on a weighty feature article as to the contributions that emerge during the live broadcast of a football match or a reality show. In this sense, our systems characterize the severity of each expression, giving our customers the flexibility to set the right threshold for their case. In addition, we provide customization tools to facilitate the incorporation of new problematic expressions. Finally, we also permanently monitor the operation of these systems for customers who wish it, within their plans for continuous quality assurance and improvement.

Interested? Do not hesitate to contact Daedalus.

Discover our solutions for the media sector.

José Carlos González

Sentiment Analysis in Spanish: TASS corpus released

The corpus used in TASS, the Workshop on Sentiment Analysis in Spanish organized by Daedalus, has been made freely available to the research community after the workshop. With the creation and release of this corpus, we aim to provide a common benchmark dataset that enables researchers to compare their algorithms and systems. Results from participants in TASS 2012 and TASS 2013 are already available to compare.

The corpus is divided into a General corpus and a Politics corpus. Both are written in XML following the same schema.

General corpus

The General corpus contains 68 017 Twitter messages, written in Spanish by 154 well-known celebrities of the world of politics, communication and culture, between November 2011 and March 2012. Although the context of extraction has a Spain-focused bias, the diverse nationality of the authors, including people from Spain, Mexico, Colombia, Puerto Rico, USA and many other countries, makes the corpus reach a global coverage in the Spanish-speaking world.

Each message has been tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. Five levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N) and strong negative (N+), plus one additional no-sentiment tag (NONE). In addition, the sentiment agreement level within the content has been classified into two possible values: AGREEMENT and DISAGREEMENT. This makes it possible to tell whether a neutral sentiment comes from neutral keywords or whether the text contains positive and negative sentiments at the same time.

Moreover, the polarity values at entity level, i.e., the polarity values related to the entities that are mentioned in the text, have also been included. These values are similarly divided into five levels and include the level of agreement as related to each entity.

On the other hand, a selection of a set of 10 topics has been made based on the thematic areas covered by the corpus, such as “politics”, “soccer”, “literature” or “entertainment”. Each message has been assigned to one or several of these topics.


The General corpus has been divided into two sets: training (7 219 tweets) and test (60 798 tweets). The training set has been manually tagged. The tagging of the test set has been generated by pooling all submissions from participants in the TASS tasks with a voting scheme, followed by an extensive human review of the ambiguous decisions, which unfortunately is subject to errors. In the case of polarity at entity level, the tagging has only been done for the training set, due to the high volume of data to check and the lack of participants in the task.

In addition, the political tendency of users has been manually identified and assigned one of four possible values: LEFT, RIGHT, CENTRE and UNDEFINED. The aim of Task 4 in TASS 2013 was in fact to estimate the political tendency of each user based on his or her tweets.
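The pooling step can be pictured as a majority vote over the labels submitted by the participating systems for each tweet, with weak or tied votes set aside for human review. The exact voting scheme used for the corpus is not detailed here, so the function, its agreement threshold and its tie rule are assumptions made for illustration.

```python
from collections import Counter

def pool_labels(submissions, min_agreement=0.5):
    """Combine the labels proposed by several systems for one tweet.

    `submissions` is a non-empty list of labels such as "P+" or "NEU".
    Returns (winning_label, needs_review). A decision is flagged for
    review when the vote is tied or the winning share is not above
    `min_agreement` (an assumed threshold).
    """
    counts = Counter(submissions)
    ranked = counts.most_common()
    top_label, top_votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    share = top_votes / len(submissions)
    return top_label, tied or share <= min_agreement
```

A clear majority such as `["P", "P", "P+", "P"]` is accepted automatically, while a split vote such as `["P", "N", "NEU"]` is flagged as ambiguous and left to a human annotator.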


Politics corpus

The Politics corpus contains 2 500 tweets, gathered during the electoral campaign of the 2011 general elections in Spain (Elecciones a Cortes Generales de 2011), from Twitter messages mentioning any of the four main national-level political parties: Partido Popular (PP), Partido Socialista Obrero Español (PSOE), Izquierda Unida (IU) and Unión, Progreso y Democracia (UPyD).

As in the General corpus, the global polarity and the polarity at entity level for those four entities have been manually tagged for all messages. However, in this case only three levels are used: positive (P), neutral (NEU) and negative (N), plus one additional no-sentiment tag (NONE). Moreover, to simplify the identification of the named entities, a “source” attribute is assigned to each tagged entity, indicating the political party to which the entity refers.


All the information is available in the TASS 2013 Corpus page. If you are interested, please send an email to tass AT with your email, affiliation and a brief description of your research objectives, and you will be given a password to download the files in the password protected area.

Join us at TASS-2013 – Workshop on Sentiment Analysis in Spanish – Sept. 20th, 2013

TASS is an experimental evaluation workshop for sentiment analysis and online reputation analysis focused on the Spanish language, organized by Daedalus, Universidad Politécnica de Madrid and Universidad de Jaén as a satellite event of the annual SEPLN Conference. After a successful first edition in 2012, TASS 2013 is going to be held on Friday, September 20th, 2013 at Universidad Complutense de Madrid, Madrid, Spain. Attendance is free and you are all welcome to participate.


The long-term objective of TASS is to foster research in the field of reputation analysis, which is the process of tracking, investigating and reporting an entity’s actions and other entities’ opinions about those actions. The rise of social media such as blogs and social networks, and the increasing amount of user-generated content in the form of reviews, recommendations, ratings and any other form of opinion, has led to an emerging trend towards online reputation analysis, i.e., the use of technologies to calculate the reputation value of a given entity based on the opinions that people express in social media about that entity. All of these are becoming promising topics in the field of marketing and customer relationship management.

As a first approach, reputation analysis has two technological aspects: sentiment analysis and text classification (or categorization). Sentiment analysis is the application of natural language processing and text analytics to identify and extract subjective information from texts. Automatic text classification is used to guess the topic of the text, among those of a predefined set of categories or classes, so as to be able to break down the reputation level of the company into different facets, axes or points of view of analysis.

The setup of the workshop is based on a series of challenge tasks built on two provided corpora, specifically focused on the Spanish language, which are intended to promote the application of existing state-of-the-art and newly proposed algorithms and techniques in these fields, and to provide a benchmark forum for comparing the latest approaches. In addition, with the creation and release of the fully tagged corpus, we aim to provide a benchmark dataset that enables researchers to compare their algorithms and systems.

Two corpora were provided:

  • The General corpus contains over 68 000 Twitter messages, written in Spanish by about 150 well-known personalities and celebrities of the world of politics, economy, communication, mass media and culture, between November 2011 and March 2012.

  • The Politics corpus contains 2 500 tweets, gathered during the electoral campaign of the 2011 general elections in Spain (Elecciones a Cortes Generales de 2011), from Twitter messages mentioning any of the four main national-level political parties: Partido Popular (PP), Partido Socialista Obrero Español (PSOE), Izquierda Unida (IU) and Unión, Progreso y Democracia (UPyD).

Each message is tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. Five levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N) and strong negative (N+), plus one additional no-sentiment tag (NONE). In addition, there is also an indication of the level of agreement or disagreement of the sentiment expressed within the content, with two possible values: AGREEMENT and DISAGREEMENT. Moreover, a set of topics has been selected based on the thematic areas covered by the corpus, such as politics, soccer, literature or entertainment, and each message has been assigned to one or several of these topics. More information on these corpora will be included in future posts.

Four tasks were proposed for the participants, covering different aspects of sentiment analysis and automatic text classification:

  • Task 1: Sentiment Analysis at Global Level. This task consists of performing an automatic sentiment analysis to determine the global polarity (using 5 levels) of each message in the test set of the General corpus.
  • Task 2: Topic Classification. The technological challenge of this task is to build a classifier to automatically identify the topic of each message in the test set of the General corpus.
  • Task 3: Sentiment Analysis at Entity Level. This task consists of performing an automatic sentiment analysis, similar to Task 1, but determining the polarity at entity level (using 3 polarity levels) of each message in the Politics corpus.
  • Task 4: Political Tendency Identification. This task moves one step forward towards reputation analysis. The objective is to estimate the political tendency of each user in the test set of the General corpus, with four possible values: LEFT, RIGHT, CENTRE and UNDEFINED. Participants could use whatever strategy they chose, but a first approach could be to aggregate the results of the previous tasks by author and topic.
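The first approach suggested for Task 4, aggregating earlier results by author and topic, might look like the sketch below. The tuple layout and the sample users are assumptions made for illustration; they are not part of the TASS data format, and the step that maps such counts to LEFT, RIGHT, CENTRE or UNDEFINED is deliberately left open.

```python
from collections import Counter, defaultdict


def aggregate_by_author_and_topic(predictions):
    """Count polarity labels per (author, topic) pair.

    `predictions` is an iterable of (author, topic, polarity) tuples,
    a hypothetical format for the output of Tasks 1 and 2.
    """
    summary = defaultdict(Counter)
    for author, topic, polarity in predictions:
        summary[(author, topic)][polarity] += 1
    return summary


# Invented sample predictions.
predictions = [
    ("user_a", "politics", "P"),
    ("user_a", "politics", "N"),
    ("user_a", "politics", "P"),
    ("user_b", "soccer", "NEU"),
]

summary = aggregate_by_author_and_topic(predictions)
for (author, topic), counts in sorted(summary.items()):
    print(author, topic, dict(counts))
```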

Thirty-one groups registered (compared to 15 groups in TASS 2012) and 14 groups (9 last year) sent their submissions. Participants were invited to submit a paper to the workshop describing their experiments and to discuss the results with the audience in the regular workshop session.


If you are curious about the approaches adopted by the different groups and the results achieved in each task, you are very welcome to attend the session on Friday, September 20th, 2013 at Universidad Complutense de Madrid!

Or stay tuned for future posts that will provide valuable information and conclusions.

Are social media monitoring tools reliable?

Many people emphasize the importance of monitoring what is being discussed in social networks (Rappaport), but it is not always clear how to do it (as Seth Grimes tries to explain in this article). Oddly enough, there are plenty of tools for that purpose; you can find dozens of lists of the best or the most popular ones, but what should we demand of these tools to ensure that the information we obtain is useful? This question is important, because there is no shortage of people who question the reliability of share of voice (SoV) metrics in social networks (at least judging by the myriad of blog entries that try to convince the supporters of such metrics). So, who do we listen to? Should we care about what is being said in social networks? Will those tools be of any use to us? The answer to these last two questions is clearly yes, but we must bear in mind several aspects before drawing up a social network monitoring plan and, above all, before selecting the tools we will work with. Among those aspects we must consider:

What do we want to monitor?

Obviously, we are not interested in everything that appears on the web. Even if we wanted to, it would be impossible. But we can focus on some subjects: for example, our brand, our company, global warming, financial products, savings, etc. But how do we refer to these issues? That is, how do we tell a tool which issues we are interested in? Depending on the application we are working with, either we indicate the different ways in which the subject that interests us might appear in a text (for example, “climate change”, “greenhouse effect”, “ozone hole”) or the application itself suggests related terms. It would be preferable to have mechanisms that relate some terms to others. To do that, ontologies or other semantic resources are used (for example Wikipedia, Freebase, etc.), or the users themselves can indicate how to relate those terms. At this point, the tool’s text analytics capabilities and, more specifically, its Natural Language Processing capabilities are fundamental. It should not matter whether a term appears in its singular or plural form (“financial deposit” vs. “financial deposits”) or whether it is ambiguous (“Santander” as a financial institution vs. “Santander” as a city). Something similar happens with the treatment of languages: how do you translate “greenhouse effect” into Spanish? Can the translation be ambiguous? Taking advantage of existing ontologies can be of great help when it comes to linking equivalent concepts in different languages.
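As a rough illustration of why text analytics matters here, the sketch below matches a list of topic terms while tolerating singular and plural forms. It is a deliberately naive approach that assumes English plurals ending in "s"; a real tool would rely on lemmatization and on semantic resources, and this sketch does nothing to resolve ambiguity such as the two senses of “Santander”. The function names are hypothetical.

```python
import re


def build_topic_pattern(terms):
    """Compile one case-insensitive pattern for a list of topic terms.

    An optional trailing "s" is allowed on the last word of each term,
    so "financial deposit" also matches "financial deposits".
    """
    alternatives = []
    for term in terms:
        words = [re.escape(word) for word in term.split()]
        words[-1] += "s?"  # naive plural handling, English only
        alternatives.append(r"\s+".join(words))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


TOPIC_TERMS = ["climate change", "greenhouse effect", "ozone hole", "financial deposit"]
TOPIC_PATTERN = build_topic_pattern(TOPIC_TERMS)


def mentions_topic(text):
    """Return True if any topic term (singular or plural) appears in the text."""
    return TOPIC_PATTERN.search(text) is not None


print(mentions_topic("New financial deposits announced today"))
print(mentions_topic("The Greenhouse Effect is getting worse"))
print(mentions_topic("Nothing relevant in this message"))
```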

Where do we want to monitor?

The answer to this question may seem simple (in social networks), but are we interested in all networks? Are we present in all of them? Is it possible that people talk about us in a network where we have no presence? We should also consider whether there are blogs that we should monitor (blogs are not part of the so-called “social networks”). The answers to these questions depend, obviously, on the type of company or brand that we are considering. Is it possible that people talk on Facebook about a company that manufactures lamps when it has no presence in this social network?

From the point of view of text analysis, it is important to be able to work both with well-formed text, on which a complete syntactic analysis can be done, and with “incorrect” text, as it usually appears in messages published in social networks. This is also an important factor when selecting a tool.

In which languages?

We have briefly mentioned the problem of languages, but it is a fundamental element. Does the social media analysis application that has caught our attention cover the languages that matter to us? Many of them apparently do, but it is necessary to verify how thoroughly they do it. It is advisable to take into account aspects such as: how many of the entities that appear in a text they recognize; whether they classify them under the appropriate type (person, place, organization); whether, besides entities, they recognize other structures such as URLs, hashtags, etc.; and whether, in the case of sentiment analysis applications, they process negations or can assign polarity to entities or attributes.
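One way to check how thoroughly a tool covers a language is to annotate a small sample by hand and compare it with the tool's output in terms of precision and recall. The sketch below assumes that both annotations are available as sets of (entity text, type) pairs; the type labels and the sample data are invented for illustration.

```python
def precision_recall(predicted, gold):
    """Compare predicted (text, type) pairs with hand-annotated ones.

    A prediction only counts as correct when both the entity text and
    its type match the gold annotation.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented sample: the tool finds "Santander" but assigns the wrong type,
# and misses the hashtag entirely.
gold = {("Santander", "ORGANIZATION"), ("Madrid", "PLACE"), ("#elecciones", "HASHTAG")}
predicted = {("Santander", "PLACE"), ("Madrid", "PLACE")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Requiring the type to match is a strict choice; relaxing it to compare entity text alone would measure recognition separately from classification.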

What do we gain by monitoring? Can we measure the results of our monitoring activity?

A recurrent doubt is whether the investment required to monitor what happens in social networks will be recouped. Is it enough to know how many people retweet our posts or indicate that they like our Facebook page? Clearly it is not, but if we could ask the question “how many people speak well of my product and how many criticize it?” and get a reliable answer, chances are that we would not hesitate so much. The same applies if we could interact with the customer who has voiced harsh criticism. How much is a satisfied customer worth, and how much does it cost?

Today, there is technology that can answer these questions, even for English. What do you think? Do you trust the technology for social media monitoring? If you use a specific tool, are you satisfied with it?

If you would like to have more information on how Daedalus can help you improve your way of monitoring social media, please contact us.

Language Technology and the Future of the Content Industry

A few days ago I had the opportunity to participate as a speaker at a conference organized by LT-Innovate (the European Industry of Language Technology Forum) aimed at the publishing and media industries. This initiative is one of the focus groups that LT-Innovate is organizing in order to boost and expand the activity of companies providing products and services based on language technology (intelligent content processing, speech technology and automatic translation). Representatives of around thirty European companies, both customers and suppliers, attended the forum.

In my presentation I emphasized the transformation of the content industry as a result of a crisis with numerous facets: changes in the way users consume content, the departure from traditional media and the rapid shift to the Internet, the abundance of free content (with an enormous volume produced and published directly and instantly by users) and the fall in advertising income. This scenario is causing the failure of business models that were successful until recently and the rise of others whose outcome is still unpredictable.

Until not long ago, solutions based on language technology had little room in content management tools or were limited to isolated applications in the production environment. Nevertheless, progressive digitalization, the growth of the Internet segment dedicated to content consumption, the urgent need to reduce costs and time, the integration of media newsrooms regardless of medium, etc. have steadily increased our clients’ needs. Thus, gradually over fifteen years, at Daedalus we have been covering those needs by expanding our catalog of solutions, which includes the following:

  • Spelling, grammar and style checking aimed at professional environments, which require accuracy and uniform criteria.
  • Semantic publication, including the automatic identification of entities (people, organizations, places, facilities, concepts, time or currency references…) and significant concepts, and the classification or grouping of texts according to journalistic or documentary standards.
  • Moderation or automatic filtering of forums and the immediate revision of user generated content.
  • Indexing and search of multilingual and multimedia content.
  • Approximate and natural language search interfaces.
  • Search in multilingual content by incorporating automatic translation systems.
  • Transcription of multimedia content and automatic video subtitling.
  • Automatic analysis of opinions, feelings and reputation in social media.

All these applications are useful in the increasingly diversified processes of the content industry:

  • Delivery of content and contextual advertising adapted to the users’ interest profiles.
  • Production of transmedia content (simultaneous, complementary and synchronized distribution through multiple platforms: TV, Internet, tablets, smartphones).
  • Support for documentary research and data journalism, starting from the analysis and advanced investigation of heterogeneous information sources.
  • Support for Search Engine Optimization features and online marketing.
  • Support for new business models based on the sale of single pieces of content or stories built up by aggregating content produced over time on a subject, an event, a public figure, etc.

As we can see, language technology has moved from marginal to central positions in all areas of this industry. At Daedalus we are proud of having served a good number of companies and groups in this industry for years throughout this process, and we feel closely committed to them.

We invite you to check out our presentation in the Publishing/Media Industry Forum organized by LT-Innovate (Berlin, April 12th, 2013).

Jose C. Gonzalez
