Archive for the ‘Semantic processing’ Category

Recognizing entities in a text: not as easy as you might think!

12 December 2013

Entity recognition: the engineering problem

As in every engineering endeavor, when you face the problem of automating the identification of entities (proper names: people, places, organizations, etc.) mentioned in a particular text, you should look for the right balance between quality (in terms of precision and recall) and cost from the perspective of your goals. You may be tempted to compile a simple list of such entities and apply straightforward pattern-matching techniques to identify a predefined set of entities appearing “literally” in a particular piece of news, in a tweet or in a (transcribed) phone call. If this solution is enough for your purposes (you can achieve high precision at the cost of low recall), it is clear that quality was not among your priorities. However… what if you could add a bit of excellence to your solution, without any technological burden, for… free? If this proposition interests you, skip the following detailed technological discussion and go directly to the final section.

Where do difficulties come from?

Now I will summarize some of the difficulties that may arise when designing an automatic system for Named Entity Recognition (NER, as it is known in the technical literature). Difficulties may come from several fronts:

  • Do you deal with texts in several languages? Do you know the language of each text in advance?
  • What is the source of the documents or items of text that you have to manage? Do they come from a professional newsroom? Did you ingest them from OCR (Optical Character Recognition) or ASR (Automatic Speech Recognition) systems? Did you catch them with the API of your favorite social network?
  • Do your texts follow strict academic conventions regarding spelling and typography (i.e. do you always deal with well-written text)? Or did users generate them with their limited and error-prone devices (smartphones)? Or were they produced by second-language speakers or learners?

Designing the perfect NER system: the language nightmare

The previous questions end up in a set of complex challenges:

Eiffel Tower

1. Translingual equivalence:
Problem: When you deal with multilingual content, you are interested in recognizing not language-dependent names, but entities that are designated differently in different languages.
Example: Eiffel Tower (EN), Tour Eiffel (FR) and Torre Eiffel (ES) refer to the very same object.
Solution: You need to use semantic processing to identify meanings, relative to a consistent, language-independent world model (e.g. using ontologies or referring to linked data sources).
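A minimal sketch of this idea in Python, assuming a Wikidata-style language-independent identifier (Q243 is used here purely as an illustration of a linked-data key) and a hand-made lookup table standing in for a real ontology:

```python
# Map language-dependent surface forms to one language-independent ID.
# "Q243" is an illustrative linked-data key, not part of our pipeline.
CANONICAL = {
    "eiffel tower": "Q243",   # EN
    "tour eiffel": "Q243",    # FR
    "torre eiffel": "Q243",   # ES
}

def resolve(surface_form):
    """Return the language-independent ID for a mention, if known."""
    return CANONICAL.get(surface_form.lower())

print(resolve("Tour Eiffel"))  # Q243
```

A real system would of course populate this mapping from an ontology or a linked-data source rather than by hand.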


Nelson Mandela

2. Intralingual or intratext equivalence:
Problem: For a particular language, texts usually refer to the same entities in different flavors (to avoid repetition, due to style considerations or communication purposes).
Example: Nelson Mandela, Dr. Mandela (depending on the context) and Madiba are recognized by English speakers as the same entity.
Solution: Again, in the general case, you need to link multiword strings (tokens) to meanings (representing real world objects or concepts).

3. Transliteration ambiguity:
Problem: the transliteration of names between different alphabets produces multiple spellings.
Example: Gaddafi, Qaddafi, Qadhdhafi can refer to the same person.
Solution: It is always difficult to decide on a strategy for attaching a sense to an unknown word. Should you apply phonetic rules to find equivalents from Arabic or from Chinese? Put otherwise: is the unknown word just a typo, a cognitive mistake, a spelling variant or even an intended transformation? Only when context information is available can you rely on specific disambiguation strategies. For example, if you know or deduce that you are dealing with a well-written piece of news about Libya, you should certainly try to find alternative transliterations from Arabic. This problem is usually treated at the dictionary level, by incorporating the most widespread variants of foreign names.
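At dictionary level, this can be as simple as a variant table mapping known transliterations to one canonical entry; a toy sketch (the variant list and canonical key below are illustrative, not exhaustive):

```python
# Dictionary-level handling of transliteration variants: every known
# spelling points to one canonical entry. The table is an illustration.
VARIANTS = {
    "gaddafi": "muammar_gaddafi",
    "qaddafi": "muammar_gaddafi",
    "qadhdhafi": "muammar_gaddafi",
}

def normalize(name):
    """Return the canonical form of a name, or the name itself if unknown."""
    return VARIANTS.get(name.lower(), name.lower())

print(normalize("Qadhdhafi"))  # muammar_gaddafi
```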

George Washington

4. Homonym disambiguation:
Problem: Proper names usually have more than one bearer.
Example: Washington may refer to more or less well-known people (starting with George Washington), the state on the Pacific coast of the USA, the capital of the USA (Washington, D.C.) and quite a few other cities, institutions and installations in the same and other countries. It can even be a metonym for the Federal government of the United States.
Solution: Semantic and contextual clues are needed for proper disambiguation. Are there any other references to the same name (maybe in a more complete form) in the piece of text under scrutiny? Can semantic analysis tell us whether we are dealing with a person (performing human actions) or a place (where things happen)? Can we establish with confidence a geographical context for the text? This could also lead us to favor particular interpretations.
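One hedged way to picture such contextual disambiguation is a bag-of-cues scorer: each candidate sense gets a score for how many of its associated context words appear nearby. The cue lists below are illustrative assumptions, not a real model:

```python
# Toy context-based disambiguation for "Washington": count context words
# associated with each candidate sense. Cue sets are illustrative only.
SENSES = {
    "George Washington (person)": {"president", "general", "born", "died"},
    "Washington, D.C. (city)":    {"capital", "congress", "white", "house"},
    "Washington (state)":         {"seattle", "pacific", "state"},
}

def disambiguate(context_words):
    """Pick the sense whose cue set overlaps most with the context."""
    words = {w.lower() for w in context_words}
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

print(disambiguate(["The", "president", "was", "born", "in", "1732"]))
# George Washington (person)
```

A production system would use far richer evidence (syntactic role, named-entity type, document-level geography), but the scoring principle is the same.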

5. Fuzzy recognition and disambiguation:
Problem: In the general case, how do you deal with unknown words when you rely on (maybe huge) multilingual dictionaries plus (maybe smart) tokenizers and morphological analyzers?
Example: If you find the word “Genva” in an English text, should you interpret it as Geneva (French Genève) or Genoa (Italian Genova)?
Solution: The presence of unknown words is linked most of the time to the source of the piece of text that you are analyzing. When the text has been typed on a keyboard, the writer may have failed to hit the right keys. When the text comes from a scanned image through OCR, the result can be erroneous depending on image resolution, font type and size, etc. Something similar occurs when you get a text through ASR. The strategy for interpreting the unknown word correctly (identifying the meaning intended by the author) implies using metrics for the distance between the unknown word and other words that you can recognize as correct. In our example, if the text has been typed on a QWERTY keyboard, the distance between Genva and Geneva involves a single deletion, while the distance between Genva and Genoa involves a substitution using a letter whose key is quite far away. So, using such distance metrics, Geneva should be preferred. But contextual information is equally important for disambiguation. If our text mentions places in Switzerland, or Switzerland can be established as the right geographical context, then Geneva gains chances. Otherwise, if the text is about Mediterranean cruises, Genoa seems the natural choice.
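Note that plain Levenshtein distance actually ties the two candidates (both are one edit away from “Genva”), which is exactly why a keyboard-aware substitution cost helps. A sketch of such a weighted edit distance; the QWERTY adjacency table is a tiny illustrative fragment, not a complete keyboard model:

```python
# Weighted edit distance: substituting adjacent QWERTY keys is cheap
# (a likely typo), distant keys expensive. Adjacency table is a fragment.
QWERTY_NEIGHBORS = {
    "v": set("cfgb"),
    "o": set("iklp"),
}

def weighted_levenshtein(a, b):
    """Dynamic-programming edit distance with keyboard-aware substitution."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif cb in QWERTY_NEIGHBORS.get(ca, set()):
                sub = 0.5   # adjacent keys: plausible typo
            else:
                sub = 2.0   # distant keys: unlikely typo
            curr.append(min(prev[j] + 1.0,        # deletion
                            curr[j - 1] + 1.0,    # insertion
                            prev[j - 1] + sub))   # substitution
        prev = curr
    return prev[-1]

print(weighted_levenshtein("genva", "geneva"))  # 1.0 (one insertion)
print(weighted_levenshtein("genva", "genoa"))   # 2.0 (v and o are far apart)
```

With these weights Geneva wins, matching the intuition in the text; contextual evidence would then confirm or override the choice.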

Meaning as a Service


Textalytics: semantic technology at your fingertips

Systems or platforms for Content Management (CMS), Customer Relationship Management (CRM), Business Intelligence (BI) or Market Surveillance incorporate information retrieval functionality allowing the search of individual tokens (typically alphanumeric strings) or literals in unstructured data. However, they are very limited in terms of recognizing semantic elements (entities, concepts, relationships, topics, etc.). This kind of text analytics is very useful not only for indexing and search purposes, but also for content enrichment. The final aim of these processes is to add value in terms of higher visibility and findability (e.g. for SEO purposes), content linkage and recommendation (related content), ad placement (contextual advertising), customer experience analysis (Voice of the Customer, VoC analytics), social media analysis (reputation analysis), etc.

To facilitate the integration of semantic functionality into any software application, Daedalus opened its multilingual semantic APIs to the community through the cloud-based service Textalytics. On the client side, you send a request to our service to process one item of text (a piece of news, a tweet, etc.); what you get back is the result of our processing in an interchange format (XML or JSON). Textalytics APIs offer natural language processing functionality in two flavors:
  • Core APIs: one API call for each single process (extraction of entities, text classification, spell checking, sentiment analysis, content moderation, etc.). Fine tuning is achieved through multiple parameters. Besides core natural language processing, audio-to-text transcription is also available, as well as auxiliary functions. Auxiliary APIs are useful, for example, to link entities with open linked data repositories such as DBpedia/Wikipedia, or to guess crucial demographic features (type, gender, age) of a given social media user.
  • Vertical APIs (Media Analysis, Semantic Publishing): one API call provides highly aggregated results (e.g. extraction of entities and topics, plus classification, plus sentiment analysis…), convenient for standard use in a vertical market (media industry, publishing industry…)
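As a sketch of what a client-side call might look like, using only the Python standard library. The endpoint URL and parameter names below are placeholders (assumptions for illustration), not the actual Textalytics interface; consult the API documentation for the real ones:

```python
# Sketch of a client-side call to a cloud NLP API. URL and parameter
# names are placeholders, not the real Textalytics interface.
import json
import urllib.parse
import urllib.request

API_URL = "https://api.example.com/entities"  # placeholder endpoint

def build_request_body(text, api_key, output_format="json"):
    """Encode the POST body for an entity-extraction request."""
    return urllib.parse.urlencode(
        {"key": api_key, "txt": text, "of": output_format}
    ).encode("utf-8")

def extract_entities(text, api_key):
    """Send one item of text and return the parsed JSON response."""
    req = urllib.request.Request(API_URL, data=build_request_body(text, api_key))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```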

To end this post, let me stress other benefits of selecting Textalytics for semantic processing:

  • SDKs (Java, Python, PHP and Visual Basic) are offered for quick integration. Software developers need no more than half an hour to read the documentation and integrate our semantic capabilities into any environment.
  • You can register with Textalytics, subscribe to the API or APIs of your choice, get your personal key and send as many requests as you want for free, up to a maximum of 500,000 words processed per month, whether your usage is research, academic or commercial.
  • If you need to process higher volumes of text (exceeding the free basic plan), or you require more than five API calls per second, you can subscribe at affordable prices. No long-term commitment. Pay per month. Check out our pricing plans.

Curious? Try our demo!
Interested?  Contact us!
Believer? Follow us!

José C. González (@jc_gonzalez)

Semantic Analysis and Big Data to understand Social TV

25 November 2013

We recently participated in the Big Data Spain conference with a talk entitled “Real time semantic search engine for social TV streams”. This talk describes our ongoing experiments on Social TV and combines our most recent developments in applying semantic analysis to social networks and in dealing with real-time data streams.

Social TV, which exploded with the use of social networks while watching TV programs, is a growing and exciting phenomenon. Twitter reported that, at least in the UK, more than a third of its prime-time firehose discusses TV, while Facebook claimed five times more comments behind its private wall. Recently Facebook also started to offer hashtags and the Keywords Insight API to selected partners as a means of providing aggregated statistics on Social TV conversations inside the wall.

As more users have turned to social networks to comment with friends and other viewers, broadcasters have looked into ways to be part of the conversation. They use official hashtags, let actors and anchors tweet live, and have even started to offer companion apps with social sharing functionality.

While the concept of socializing around TV is not new, the possibility of measuring and distilling the information around these interactions opens up brand-new possibilities for users, broadcasters and brands alike. Users’ interest has already fueled Social TV, as it fulfills their need to start conversations with friends, other viewers and the aired program. Chatter around TV programs may help to recommend other programs or to serve contextually relevant information about actors, characters or whatever appears on TV. Moreover, better ways to access and organize public conversations will drive new users to a TV program and engage current ones.

On the other hand, understanding the global conversation about a program is definitely useful for acquiring insights for broadcasters and brands. Broadcasters and TV producers may measure their viewers’ preferences and reactions, or those of their competitors, and acquire complementary information beyond plain audience numbers. Brands are also interested in finding the most appropriate programs to reach their target users, as well as in understanding the impact and acceptance of their ads. Finally, new TV and ad formats are already being created based on interaction and participation, which again bolsters engagement.

In our talk, we describe a system that combines natural language processing components from our Textalytics API and a scalable semi-structured database/search engine, SenseiDB, to provide semantic and faceted search, real-time analytics and support visualizations for this kind of applications.

Using the Textalytics API we are able to include interesting features for Social TV, like analyzing the sentiment around an entity (a program, actor or sportsperson). Besides, entity recognition and topic extraction allow us to produce trending topics for a program that correlate well with whatever happens on screen. They work as an effective way to organize the conversation in real time when combined with the online facets provided by SenseiDB. Other functionalities, like language identification and text classification, help us to clean the noisy streams of comments.

SenseiDB is the second pillar of our system: a semi-structured distributed database that helps us ingest streams and make them available for search in real time with low query and indexing latency. It includes a large number of facet types that enable navigation using a range of semantic information. With the help of histogram and range facets, it can even be pressed into service for simple analytics tasks. It is rounded out by a simple and elegant query language, BQL, which helps us speed up the development of visualizations on top.

If you find it interesting, check out our presentation for more detail or even the video of the event.

Trends in data analysis from Big Data Spain 2013

19 November 2013

logo Big Data Spain

The second edition of Big Data Spain took place in Madrid on November 7 and 8 and proved to be a landmark event on technologies and applications of big data processing. The event attracted more than 400 participants, doubling last year’s number, and reflected the growing interest in these technologies in Spain and across Europe. Daedalus participated with a talk that illustrated the use of natural language processing and Big Data technologies to analyze the buzz around Social TV in real time.

Big Data technology has matured as we are about to celebrate its 10th birthday, marked by the publication of the MapReduce computing abstraction that gave rise to the field.

Rubén Casado, in one of the most useful talks for making sense of the vast number of Big Data and NoSQL projects, outlined the recent history of the technology in three eras:

  • Batch processing (2003 – ), with examples like Hadoop or Cassandra.
  • Real-time processing (2010 – ), represented by recent projects like Storm, Kafka or Samza.
  • Hybrid processing (2013 – ), which attempts to combine both worlds in a unified programming model, like Summingbird or Lambdoop.

Without any doubt, the first era of solutions is enterprise-ready, with several Hadoop-based distributions like Cloudera, MapR or HortonWorks. Likewise, the number of companies integrating them or providing consultancy in this field is expanding, reaching every sector from finance and banking to telecommunications or marketing.

Some other technological trends clearly emerged from talk topics and panels:

  • a growing number of alternatives for online analysis of large data volumes (Spark, Impala, SploutSQL or SenseiDB)
  • the comeback of SQL, or at least of SQL dialects on top of existing systems, which make applications easier to develop and maintain
  • the importance of visualization as a tool to communicate Big Data results effectively.

However, adopting Big Data as a philosophy inside your company is not merely a matter of technology. It requires a clear vision of the benefits that grounding all your processes in data may bring, and of the value and knowledge you may obtain by integrating internal and also external data. Another important factor is being able to find the right people to bridge the chasm between the technical and business sides. In this sense the role of the data scientist is very important, and Sean Owen from Cloudera defined it as “a person who is better at statistics than any software engineer and better at software engineering than any statistician”. We may add to the wish list a deep knowledge of your business domain and the ability to ask the right questions.

While not everybody agreed, it seems that the best way to start “doing Big Data” is one step at a time, with a project with clear business goals. If you want to test the technology, good candidates are those business processes that have already become a bottleneck using standard databases. On the other hand, innovation may also be an important driver, by using external open data or if you need to design data-centric products. A good example of that sort is the Open Innovation challenge from Centro de Innovacion BBVA, providing aggregate information on credit card transactions.


Finally, going back to the theme of our talk, one of the external sources generating the most value is social network data. Due to their heterogeneity, social networks are intrinsically difficult to analyze but, fortunately, text analytics tools like the Textalytics API enable you to make sense of unstructured data. Implemented into your Big Data toolset, they open the door to the intelligent integration of quantitative and qualitative data, with all the valuable insights you would obtain.

If you want to dive into the Big Data world, videos of the talks and experts panel are available at the Big Data Spain site.

The Citizen Sensor: the citizen as a sensor in the city of the future

One of our most promising lines of work in the Ciudad2020 R&D project (INNPRONTA Program, funded by CDTI, Technological and Industrial Development Center) focuses on the concept that we have defined as the Citizen Sensor: the log of events relating citizens to their municipality.

By applying Textalytics’ semantic technologies, we can analyze the citizen’s voice in detail, extracting heterogeneous, high-level information. Since this information is highly descriptive and carries high added value, it is useful for modeling citizens’ urban behavior and their relationship with the city of the future. In this way the citizen becomes one more sensor integrated into the network of sensors of the city’s systems.

The Citizen Sensor can provide data in different ways:

  • Mobile phone.- For example, to detect noise pollution, the user could start an application on his/her smartphone to record the noise level and send it to the city servers. This would give us a map of the city’s most significant noise sources, one that evolves over time (roadworks in the mornings, parties on weekends…).
  • Citizen’s events.- For example, the user validates a train ticket on the way to work. This, added to the events generated by the rest of the users who take the train, gives us an idea of the density of commuters who travel to work by train each morning and which routes they take.
  • Social networks.- Our systems can analyze the flow of tweets in a geographic area to find out what users are talking about, and if it is something relevant (a car crash that provokes traffic jams, a fire, a music festival…) we can use those data to develop a precise model with much more accurate predictions. We can also collect citizens’ thinking or opinions with respect to policies adopted by the local authority (for example, a policy of reducing air-conditioning consumption in public transport).

As preliminary work, we have built an ontology that defines the different dimensions which will guide the semantic analysis. We are currently collecting information from Twitter; in particular, our aim is to identify in each tweet the location where the user is (a public building like the city hall or a hospital, parks, transportation facilities, places of leisure or work, etc.), the concept (city services, supplies, signage, etc.), or the specific event referred to (concerts or sporting events, or problematic situations such as breakdowns, traffic jams, accidents or fires), as well as the subject area of the message (politics, economy, quality of life, tourism, sport, social interest…). This analysis is complemented by a sentiment analysis able to detect the polarity of the message (very positive, positive, negative, very negative and neutral).
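Purely as an illustration of this multi-dimensional tagging, here is a toy keyword-cue classifier; the cue tables below are assumptions standing in for the real ontology-driven semantic analysis:

```python
# Toy multi-dimensional tweet analysis: keyword cues per dimension stand
# in for ontology-driven semantics. Tables are illustrative only.
DIMENSIONS = {
    "place": {"park": "park", "station": "transport", "hospital": "hospital"},
    "event": {"fire": "fire", "crash": "accident", "concert": "concert"},
    "theme": {"traffic": "quality of life", "mayor": "politics"},
}

def analyze_tweet(text):
    """Tag a tweet with a value per dimension when a cue word is found."""
    words = text.lower().split()
    result = {}
    for dim, cues in DIMENSIONS.items():
        for word in words:
            if word in cues:
                result[dim] = cues[word]
    return result

print(analyze_tweet("Huge traffic jam after a crash near the station"))
# {'place': 'transport', 'event': 'accident', 'theme': 'quality of life'}
```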


The aim is to merge the semantic analysis with the user’s geoposition in order to obtain, in real time, interesting results on what citizens talk and opine about, in the manner of a city management console. This type of analysis could serve, for example, for the early detection of risk situations such as accidents or supply breakdowns on public roads, fights in leisure areas, or the condition (cleanliness, security, services) of public parks or beaches.

For this analysis we use our APIs for language identification (able to process Spanish, English, French, Italian, Portuguese and Catalan), entity extraction, automatic classification, sentiment analysis and demographic classification of users, all included in Textalytics Core.


At the moment we are researching temporal analysis, trying to detect tendencies in citizens’ behavior and opinion over time. This research consists of comparing the state of the city at different moments in time, analyzing and interpreting differences that will be due either to the daily life of the city (for example, the natural increase of public activity as the morning advances) or to unexpected situations that might be predicted.

You can find more information, documentation and demos on our web page. If you have any questions or comments, please do not hesitate to contact us.

[Translation by Luca de Filippis]

Offensive reader comments in European online media come to a full stop: media will be held responsible. What’s next?

28 October 2013

The European Court of Human Rights issued on October 10th a highly relevant ruling for European media companies. The case was brought by the Estonian news website Delfi, sued in its country’s courts for having published offensive reader comments against the director of a company which acted as a source of information. The news item in question was published on January 24th, 2006, and a few weeks later, on March 9th, the victim’s lawyers requested the withdrawal of 20 offensive comments and compensation for moral damages. The news website removed the comments the same day and rejected the economic claim. The following month, a civil lawsuit was filed before the Estonian courts. The lawsuit reached the country’s highest court, which upheld the guilty verdict and ordered the media company to pay 320 euros in compensation to the plaintiff.

Delfi, the company that owns the news portal, appealed to Strasbourg (seat of the European Court of Human Rights), arguing that the national ruling violated the principle of freedom of expression, protected by Article 10 of the Convention for the Protection of Human Rights and Fundamental Freedoms.


Now, this European court has ruled against the media company, despite the fact that Delfi had an automatic (if rudimentary) system to filter out comments containing certain keywords (insults or other problematic words). In addition, Delfi had a mechanism with which readers could flag a comment as inappropriate. The ruling considers that this filter was insufficient to prevent damage to the honor of third parties and that the media company should have taken more effective measures to prevent these situations.

The court considers it reasonable to hold the publisher responsible, since its function is to publish information and give visibility to readers’ comments, and it profits from the traffic those comments generate.

What now? In an entry of this blog, entitled “Moderating participation in the media” [in Spanish] and published a couple of years ago, we summed up the difficulties and the keys to our approach to help solve a problem that is far from trivial.

Difficulties are manifold. On the one hand, detecting isolated offensive words is not enough: it is necessary to filter expressions, sometimes taking into account their context and inflected forms. On the other hand, it is also necessary to interpret abbreviated language and texts with typographic errors, which are noticeably frequent in comments and user-generated content sections. These “errors” can arise from the limitations of devices, the impulsive nature of commenting, or users’ determination to cheat the automatic filters by any means (sometimes in really witty ways).
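A sketch of the kind of normalization such a filter might apply before matching a blacklist: collapsing repeated letters and undoing common character substitutions that users employ to slip past filters. The substitution table and word list are illustrative assumptions, not our production rules:

```python
# Normalize obfuscated tokens (leet-speak substitutions, stretched
# letters) before blacklist matching. Tables are illustrative only.
import re

SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "$": "s", "@": "a"}
)
BLACKLIST = {"idiot", "stupid"}  # placeholder offensive terms

def is_offensive(comment):
    """Return True if any token normalizes to a blacklisted word."""
    for token in comment.lower().split():
        normalized = re.sub(r"(.)\1+", r"\1", token.translate(SUBSTITUTIONS))
        if normalized in BLACKLIST:
            return True
    return False

print(is_offensive("you are such an 1d10t"))  # True
```

Real moderation also needs inflected forms, multiword expressions and a severity score, as discussed below, but normalization of this kind is a common first step.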

In addition to this problem related to the Variety of texts, we find the other two recurring features in “big data” applications (forming the famous 3Vs): Volume of the comments to be processed and Velocity of response required.

At Daedalus, we have been addressing these problems for the media industry for years and lately also for other sectors, like banking and insurance.

As regards the integration architecture of our solutions, we currently offer them in SaaS (Software as a Service) mode from Textalytics, our new cloud API platform, as well as through traditional licensing to run on-premises.

With automatic filtering systems we cannot guarantee 100% accuracy for any filtering task. Different companies or media, and different sections within the same medium, require different strategies. It seems clear that it makes no sense to apply the same filter criteria to the comments on a brilliant feature article and to the interventions that emerge during the live broadcast of a football match or a reality show. In this sense, our systems assess the severity of each expression, allowing our customers to set their acceptability threshold flexibly. On the other hand, we provide customization tools to facilitate the incorporation of new problematic expressions. Finally, we also permanently monitor the operation of these systems for customers who request it, within their plans for continuous quality assurance and improvement.

Are you interested? Feel free to contact Daedalus.

Discover our solutions for the media industry.


Jose C. Gonzalez

See you at Big Data Spain!

The second edition of Big Data Spain, one of the landmark events in Europe on technologies and business applications of big data, will take place in Madrid on 7 and 8 November.

Some analysts’ estimates give an idea of the importance of the big data phenomenon. As reported in a survey by Gartner, 49% of organizations are already investing in these technologies or hope to do it over the next year. And according to IDC forecasts, big data will shape a market that will reach 16,900 million dollars in 2015.

Big Data Spain

At Daedalus, big data is one of our key technologies in multiple client solutions (usually combined with semantic processing), and so we submitted a talk to the conference, which fortunately was selected.

Our colleague César de Pablo (@zdepablo) will be presenting, under the title “Real-time social semantic search engine for TV streams”, how we solve the problems of search and real-time analysis in social TV applications.

While TV viewers turn to social media in search of shared experiences while watching programs, TV channels and brands do the same to get real-time insights about their programs and audiences. This requires real-time processing of large amounts of social content, something that cannot be solved with traditional storage, analysis and retrieval technologies. More information about the presentation is available here.

If you plan to attend the conference it will be an excellent opportunity to talk in person. If not, we invite you to stay tuned to this channel, where we will soon publish our impressions of the event.


Sentiment Analysis in Spanish: TASS corpus released

The corpus used in TASS, the Workshop on Sentiment Analysis in Spanish organized by Daedalus, has been made freely available to the research community after the workshop. With the creation and release of this corpus, we aim to provide a common benchmark dataset that enables researchers to compare their algorithms and systems. Results from participants in TASS 2012 and TASS 2013 are already available for comparison.

The corpus is divided into a General corpus and a Politics corpus. Both are written in XML following the same schema.

General corpus

The General corpus contains 68 017 Twitter messages, written in Spanish by 154 well-known personalities from the worlds of politics, communication and culture between November 2011 and March 2012. Although the extraction context has a Spain-focused bias, the diverse nationalities of the authors, including people from Spain, Mexico, Colombia, Puerto Rico, the USA and many other countries, give the corpus global coverage of the Spanish-speaking world.

Each message has been tagged with its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. Five polarity levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N) and strong negative (N+), plus an additional no-sentiment tag (NONE). In addition, the level of sentiment agreement within the content has been classified into two possible values: AGREEMENT and DISAGREEMENT. This makes it possible to tell whether a neutral sentiment comes from neutral keywords or from the text containing positive and negative sentiments at the same time.
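For illustration, reading the polarity tags back out of such an XML corpus might look like the sketch below; the element names (“tweet”, “sentiment”, “polarity”, “value”) are an assumption about the schema made purely for this example, so check the actual TASS schema before relying on them:

```python
# Sketch: extract global-polarity tags from a corpus XML file.
# Element names are assumed for illustration, not the real TASS schema.
import xml.etree.ElementTree as ET

SAMPLE = """<tweets>
  <tweet><content>Great match!</content>
    <sentiment><polarity><value>P+</value></polarity></sentiment>
  </tweet>
</tweets>"""

def polarities(xml_text):
    """Return the polarity value of every tweet element."""
    root = ET.fromstring(xml_text)
    return [t.findtext("sentiment/polarity/value") for t in root.iter("tweet")]

print(polarities(SAMPLE))  # ['P+']
```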

Moreover, the polarity values at entity level, i.e., the polarity values related to the entities mentioned in the text, have also been included. These values are similarly divided into 5 levels and include the agreement level for each entity.

On the other hand, a set of 10 topics has been selected based on the thematic areas covered by the corpus, such as “politics”, “soccer”, “literature” or “entertainment”. Each message has been assigned to one or several of these topics.


The General corpus has been divided into a training set (7 219 tweets) and a test set (60 798 tweets). The training set was tagged manually. The tags in the test set were generated by pooling all submissions from participants in the TASS tasks with a voting scheme, followed by an extensive human review of the ambiguous decisions; this process is, unfortunately, still subject to errors. In the case of polarity at entity level, only the training set has been tagged, due to the high volume of data to check and the lack of participants in the task.
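The pooling of participant submissions can be illustrated with a simple majority vote; the agreement threshold and the example label sets below are illustrative choices, not the actual procedure used by the organizers:

```python
from collections import Counter

def pooled_label(labels, min_agreement=0.5):
    """Majority vote over participant labels for one tweet.

    Returns the winning label, or None when no label exceeds the
    agreement threshold, flagging the tweet as one of the
    'ambiguous decisions' that need human review.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    if votes / len(labels) > min_agreement:
        return label
    return None  # ambiguous: send to a human annotator

# Hypothetical submissions from five systems for one tweet:
print(pooled_label(["P", "P", "P+", "P", "N"]))   # clear majority: P
print(pooled_label(["P", "N", "NEU", "NONE"]))    # no majority: None
```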

In addition, the political tendency of users has been manually identified and assigned one of four possible values: LEFT, RIGHT, CENTRE and UNDEFINED. The aim of Task 4 in TASS 2013 was in fact to estimate each user's political tendency based on their tweets.


Politics corpus

The Politics corpus contains 2 500 tweets, gathered during the electoral campaign of the 2011 general elections in Spain (Elecciones a Cortes Generales de 2011), from Twitter messages mentioning any of the four main national-level political parties: Partido Popular (PP), Partido Socialista Obrero Español (PSOE), Izquierda Unida (IU) and Unión, Progreso y Democracia (UPyD).

As in the General corpus, the global polarity and the polarity at entity level for those four entities have been manually tagged for all messages. However, in this case only 3 polarity levels are used: positive (P), neutral (NEU) and negative (N), plus one additional no-sentiment tag (NONE). Moreover, to simplify the identification of the named entities, a “source” attribute is assigned to each tagged entity, indicating the political party to which the entity refers.


All the information is available on the TASS 2013 Corpus page. If you are interested, please send an email to tass AT with your email, affiliation and a brief description of your research objectives, and you will be given a password to download the files from the password-protected area.

Join us at TASS-2013 – Workshop on Sentiment Analysis in Spanish – Sept. 20th, 2013

TASS is an experimental evaluation workshop for sentiment analysis and online reputation analysis focused on the Spanish language, organized by Daedalus, Universidad Politécnica de Madrid and Universidad de Jaén as a satellite event of the annual SEPLN Conference. After a successful first edition in 2012, TASS 2013 will be held on Friday, September 20th, 2013 at Universidad Complutense de Madrid, Madrid, Spain. Attendance is free and everyone is welcome to participate.


The long-term objective of TASS is to foster research in the field of reputation analysis, which is the process of tracking, investigating and reporting an entity's actions and other entities' opinions about those actions. The rise of social media such as blogs and social networks, and the increasing amount of user-generated content in the form of reviews, recommendations, ratings and any other form of opinion, have led to an emerging trend towards online reputation analysis, i.e., the use of technologies to calculate the reputation value of a given entity based on the opinions that people express in social media about that entity. These are becoming promising topics in the fields of marketing and customer relationship management.

As a first approach, reputation analysis has two technological aspects: sentiment analysis and text classification (or categorization). Sentiment analysis is the application of natural language processing and text analytics to identify and extract subjective information from texts. Automatic text classification is used to guess the topic of a text among a predefined set of categories or classes, so as to be able to break down the reputation level of the company into different facets, axes or points of view of analysis.

The workshop is organized as a series of challenge tasks based on two provided corpora, specifically focused on the Spanish language, which are intended to promote the application of existing state-of-the-art algorithms and techniques, as well as new proposals, and to provide a benchmark forum for comparing the latest approaches. In addition, with the creation and release of the fully tagged corpora, we aim to provide a benchmark dataset that enables researchers to compare their algorithms and systems.

Two corpora were provided:

  • The General corpus contains over 68 000 Twitter messages, written in Spanish by about 150 well-known personalities from the worlds of politics, economy, communication, mass media and culture, between November 2011 and March 2012.


  • The Politics corpus contains 2 500 tweets, gathered during the electoral campaign of the 2011 general elections in Spain (Elecciones a Cortes Generales de 2011), from Twitter messages mentioning any of the four main national-level political parties: Partido Popular (PP), Partido Socialista Obrero Español (PSOE), Izquierda Unida (IU) and Unión, Progreso y Democracia (UPyD).


All messages are tagged with their global polarity, indicating whether the text expresses a positive, negative or neutral sentiment, or no sentiment at all. Five levels have been defined: strong positive (P+), positive (P), neutral (NEU), negative (N) and strong negative (N+), plus one additional no-sentiment tag (NONE). In addition, there is also an indication of the level of agreement or disagreement of the expressed sentiment within the content, with two possible values: AGREEMENT and DISAGREEMENT. Moreover, a set of topics was selected based on the thematic areas covered by the corpus, such as politics, soccer, literature or entertainment, and each message has been assigned to one or several of these topics. More information on these corpora will be included in future posts.

Four tasks were proposed for the participants, covering different aspects of sentiment analysis and automatic text classification:

  • Task 1: Sentiment Analysis at Global Level. This task consists of automatically determining the global polarity (using the 5 levels) of each message in the test set of the General corpus.
  • Task 2: Topic Classification. The technological challenge of this task is to build a classifier that automatically identifies the topic of each message in the test set of the General corpus.
  • Task 3: Sentiment Analysis at Entity Level. This task consists of performing an automatic sentiment analysis, similar to Task 1, but determining the polarity at entity level (using 3 polarity levels) of each message in the Politics corpus.
  • Task 4: Political Tendency Identification. This task moves one step forward towards reputation analysis: the objective is to estimate the political tendency of each user in the test set of the General corpus, with four possible values: LEFT, RIGHT, CENTRE and UNDEFINED. Participants could use whatever strategy they chose, but a first approach could be to aggregate the results of the previous tasks by author and topic.
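The "first approach" suggested for Task 4, aggregating polarity results by author, can be sketched as follows. The mapping of entities to a left/right side, the numeric polarity scores and the decision thresholds are all illustrative assumptions, not part of the task definition:

```python
from collections import defaultdict

# Illustrative numeric scores for the 5-level polarity tags.
SCORE = {"P+": 2, "P": 1, "NEU": 0, "NONE": 0, "N": -1, "N+": -2}

def political_tendency(tweets):
    """Toy aggregation for Task 4: average each user's polarity towards
    left-leaning and right-leaning entities and compare the two sums.

    `tweets` is a list of (user, entity_side, polarity) triples, where
    entity_side is 'left' or 'right' (derived, e.g., from the party an
    entity belongs to). Names and thresholds here are hypothetical.
    """
    sums = defaultdict(lambda: {"left": 0.0, "right": 0.0})
    for user, side, polarity in tweets:
        sums[user][side] += SCORE[polarity]
    tendency = {}
    for user, s in sums.items():
        diff = s["left"] - s["right"]
        if diff > 1:
            tendency[user] = "LEFT"
        elif diff < -1:
            tendency[user] = "RIGHT"
        elif s["left"] == 0 and s["right"] == 0:
            tendency[user] = "UNDEFINED"
        else:
            tendency[user] = "CENTRE"
    return tendency

data = [("ana", "left", "P+"), ("ana", "right", "N"),
        ("luis", "left", "NONE"), ("luis", "right", "NONE")]
print(political_tendency(data))
```

A real submission would of course feed this from the outputs of Tasks 1 and 3 instead of hand-written triples.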

31 groups registered (compared to 15 groups in TASS 2012) and 14 groups (9 last year) sent their submissions. Participants were invited to submit a paper to the workshop describing their experiments and to discuss the results with the audience in the regular workshop session.


If you feel curious about the approaches adopted by the different groups and the results achieved in each Task, you are very welcome to attend the session on Friday September 20th, 2013 at Universidad Complutense de Madrid!

Or stay tuned for future posts that will provide valuable information and conclusions.

Are social media monitoring tools reliable?

Many emphasize the importance of monitoring what is being discussed in social networks (Rappaport), but it is not always clear how to do it (as Seth Grimes tries to explain in this article). Oddly enough, there are plenty of tools for that purpose; you can find dozens of lists of the best or most popular ones. But what should we demand from these tools to ensure that the information we obtain is useful? This question matters, because those who question the reliability of share-of-voice (SoV) metrics in social networks are no fewer than the supporters, at least judging by the myriad of blog entries trying to convince one side or the other. So, who do we listen to? Should we worry about what is being said in social networks? Will those tools actually serve us? The answer to these last two questions is clearly yes, but we must bear in mind several aspects before elaborating a social network monitoring plan and, above all, before selecting the tools we will work with. Among those aspects we must consider:


What do we want to monitor?

Obviously, we are not interested in everything that appears on the web. Even if we were, it would be impossible to track. But we can focus on particular subjects: for example, our brand, our company, global warming, financial products, savings, etc. How do we refer to these issues, though? That is, how do we instruct a tool about the topics we care about? Depending on the application, we will either list the different ways in which the subject might appear in a text (for example, “climate change”, “greenhouse effect”, “ozone hole”) or the application itself will suggest related terms. It is preferable to have mechanisms that relate some terms to others. For that, ontologies or other semantic resources are used, such as Wikipedia or Freebase, or the users themselves can indicate how terms are related. At this point, the tool's capabilities in text analytics and, more specifically, Natural Language Processing are fundamental. It should not matter whether a term appears in singular or plural form (“financial deposit” vs. “financial deposits”), and ambiguity must be handled (“Santander” as a financial institution vs. “Santander” as a city). Something similar happens with the treatment of languages: how do you translate “greenhouse effect” into Spanish? Can it lead to an ambiguous term? Taking advantage of existing ontologies can be of great help when linking equivalent concepts across languages.
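A minimal sketch of the first, naive level described above (matching a topic through a list of term variants) might look like this in Python. The topic name, the variant list and the plural-tolerant pattern are illustrative assumptions; a real tool would rely on lemmatization, ontologies and disambiguation rather than regular expressions:

```python
import re

# Hypothetical variant list for one monitored topic; in a real tool these
# variants would come from an ontology or be suggested by the application.
TOPIC_TERMS = {
    "climate change": ["climate change", "greenhouse effect", "ozone hole"],
}

def mentions_topic(text, topic):
    """Naive variant matching: case-insensitive, tolerating a plural 's'.

    This cannot resolve ambiguity (e.g. 'Santander' the bank vs.
    'Santander' the city); that requires real NLP, not pattern matching.
    """
    for term in TOPIC_TERMS[topic]:
        pattern = r"\b" + re.escape(term) + r"s?\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False

print(mentions_topic("New report on the Greenhouse Effect", "climate change"))
print(mentions_topic("Weekend trip to Santander", "climate change"))
```

The gap between this sketch and proper semantic matching is exactly the quality difference discussed throughout this section.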

Where do we want to monitor?

The answer to this question may seem simple (in social networks), but are we interested in all networks? Are we present in all of them? Could people be talking about us in a network where we have no presence? We should also consider whether there are blogs we should monitor (which are not part of the so-called “social networks”). The answers to these questions depend, obviously, on the type of company or brand we are considering. Could people talk on Facebook about a company that manufactures lamps when it has no presence in that social network?

From the point of view of text analysis, it is important to be able to work both with well-formed text, on which a complete syntactic analysis can be done, and with “incorrect” text, as it typically appears in messages published in social networks. This is also an important factor when selecting a tool.

In which languages?

We have briefly mentioned the problem of languages, but it is a fundamental element. Does the social media analysis application that has caught our attention cover the languages that matter to us? Many apparently do, but it is necessary to verify how thoroughly. It is advisable to check, for example: how many of the entities appearing in a text they recognize; whether they classify them into the appropriate type (person, place, organization); whether, besides entities, they recognize other structures such as URLs, hashtags, etc.; and whether, in the case of sentiment analysis applications, they process negation or can assign polarity to entities or attributes.

What do we gain by monitoring? Can we measure the results of our monitoring activity?

The doubt about whether the investment that monitoring social networks implies will be recouped is a recurrent one. Is it enough to know how many people retweet our posts or like our Facebook page? Clearly not. But if we could ask “how many people speak well of my product and how many criticize it?” and get a reliable answer, chances are we would not hesitate so much. Or if we could even interact with the customer who has voiced harsh criticism. How much is a satisfied customer worth, and how much does it cost?

Today, there is technology that can answer these questions, and not only for English. What do you think? Do you trust technology for social media monitoring? If you use a specific tool, are you satisfied with it?

If you would like to have more information on how Daedalus can help you improve your way of monitoring social media, please contact us.
