
Archive for the ‘Social media’ Category

Semantic Analysis and Big Data to understand Social TV

25 November 2013

We recently participated in the Big Data Spain conference with a talk entitled “Real time semantic search engine for social TV streams”. This talk describes our ongoing experiments on Social TV and combines our most recent developments on using semantic analysis on social networks and dealing with real-time streams of data.

Social TV, which exploded with the use of social networks while watching TV programs, is a growing and exciting phenomenon. Twitter reported that more than a third of its firehose during primetime discusses TV (at least in the UK), while Facebook claimed five times more comments behind its private wall. Recently Facebook also started to offer hashtags and the Keywords Insight API for selected partners as a means to offer aggregated statistics on Social TV conversations inside the wall.

As more users have turned to social networks to comment with friends and other viewers, broadcasters have looked for ways to be part of the conversation. They use official hashtags, let actors and anchors tweet live and have even started to offer companion apps with social sharing features.

While the concept of socializing around TV is not new, the possibility to measure and distill the information around these interactions opens up brand new possibilities for users, broadcasters and brands alike. User interest has already fueled Social TV, as it fulfills viewers’ need to start conversations with friends, other viewers and the aired program. Chatter around TV programs may help to recommend other programs or to serve contextually relevant information about actors, characters or whatever appears on TV. Moreover, better ways to access and organize public conversations will draw new users to a TV program and engage current ones.

On the other hand, understanding the global conversation about a program is definitely useful for broadcasters and brands seeking insights. Broadcasters and TV producers may measure the preferences and reactions of their own viewers or of their competitors’ viewers, and acquire complementary information beyond plain audience numbers. Brands are also interested in finding the most appropriate programs to reach their target users, as well as in understanding the impact and acceptance of their ads. Finally, new TV and ad formats are already being created based on interaction and participation, which again bolster engagement.

In our talk, we describe a system that combines natural language processing components from our Textalytics API and a scalable semi-structured database/search engine, SenseiDB, to provide semantic and faceted search, real-time analytics and support for visualizations in this kind of application.

Using the Textalytics API we are able to include interesting features for Social TV, like analyzing the sentiment around an entity (a program, actor or sportsperson). Besides, entity recognition and topic extraction allow us to produce trending topics for a program that correlate well with whatever happens on-screen. They work as an effective way to organize the conversation in real time when combined with the online facets provided by SenseiDB. Other functionalities like language recognition and text classification help us to clean the noisy streams of comments.
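
To make the idea of trending topics and per-entity sentiment more concrete, here is a minimal Python sketch. It does not call the Textalytics API: it assumes the comments have already been annotated, and the field names (`minute`, `entities`, `polarity`) and the sample data are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical input: comments already annotated by a text-analysis step.
# The field names and values are assumptions made for this sketch.
comments = [
    {"minute": 0, "entities": ["host"], "polarity": 1},
    {"minute": 0, "entities": ["host", "contestant"], "polarity": 1},
    {"minute": 1, "entities": ["jury"], "polarity": -1},
    {"minute": 1, "entities": ["jury", "host"], "polarity": -1},
    {"minute": 1, "entities": ["jury"], "polarity": 0},
]


def trending_by_minute(annotated, top_n=2):
    """Return the top_n most-mentioned entities for each minute of the stream."""
    counts = defaultdict(Counter)
    for comment in annotated:
        # Count each entity at most once per comment.
        counts[comment["minute"]].update(set(comment["entities"]))
    return {minute: c.most_common(top_n) for minute, c in sorted(counts.items())}


def entity_sentiment(annotated, entity):
    """Average polarity of the comments that mention the given entity."""
    scores = [c["polarity"] for c in annotated if entity in c["entities"]]
    return sum(scores) / len(scores) if scores else None


print(trending_by_minute(comments))
print(entity_sentiment(comments, "jury"))
```

Grouping by a time window is what lets the trending entities follow what happens on-screen; a real system would do the same counting incrementally over the stream rather than over a list in memory.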

SenseiDB is the second pillar of our system. It is a semi-structured distributed database that helps us to ingest streams and make them available for search in real time, with low query and indexing times. It includes a large number of facet types that enable navigation based on a range of semantic information. With the help of histogram and range facets, it can even be stretched to cover simple analytics tasks. It is rounded off with a simple and elegant query language, BQL, which helps us to speed up the development of visualizations on top.
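
For readers unfamiliar with faceted search, the following Python sketch illustrates the concept over a small in-memory list. It is not SenseiDB code and does not use BQL; the documents, field names and helper functions are hypothetical and only show what value facets and histogram facets return.

```python
from collections import Counter

# Hypothetical indexed documents; the field names are assumptions for illustration.
docs = [
    {"program": "news", "language": "es", "sentiment": "positive", "minute": 3},
    {"program": "news", "language": "en", "sentiment": "negative", "minute": 7},
    {"program": "quiz", "language": "es", "sentiment": "positive", "minute": 12},
    {"program": "news", "language": "es", "sentiment": "neutral", "minute": 14},
]


def search(documents, **selections):
    """Keep the documents whose fields match every selected facet value."""
    return [d for d in documents if all(d[f] == v for f, v in selections.items())]


def facet_counts(documents, field):
    """Count how many matching documents fall under each value of a facet."""
    return Counter(d[field] for d in documents)


def histogram_facet(documents, field, width):
    """Bucket a numeric field into fixed-width ranges, keyed by bucket start."""
    return Counter((d[field] // width) * width for d in documents)


hits = search(docs, program="news")
print(facet_counts(hits, "language"))
print(histogram_facet(docs, "minute", 10))
```

The histogram facet is the piece that makes simple analytics possible: counting comments per time bucket gives an activity curve without a separate analytics system.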

If you find it interesting, check out our presentation for more detail or even the video of the event.

Trends in data analysis from Big Data Spain 2013

19 November 2013


The second edition of Big Data Spain took place in Madrid on November 7 and 8 and proved to be a landmark event on technologies and applications of big data processing. The event attracted more than 400 participants, doubling last year’s number, and reflected the growing interest in these technologies in Spain and across Europe. Daedalus participated with a talk that illustrated the use of natural language processing and Big Data technologies to analyze the buzz around Social TV in real time.

Big Data technology has matured just as we are about to celebrate its 10th birthday, counted from the publication of the MapReduce computing abstraction that later gave rise to the field.

Rubén Casado, in one of the most useful talks for making sense of the vast number of Big Data and NoSQL projects, outlined the recent history of the technology in three eras:

  • Batch processing (2003 – ), with examples like Hadoop or Cassandra.
  • Real-time processing (2010 – ), represented by recent projects like Storm, Kafka or Samza.
  • Hybrid processing (2013 – ), which attempts to combine both worlds in a unified programming model like Summingbird or Lambdoop.

Without any doubt, the first era of solutions is enterprise-ready, with several Hadoop-based distributions like Cloudera, MapR or Hortonworks. Likewise, the number of companies that are integrating them or providing consultancy in this field is expanding and reaching every sector, from finance and banking to telecommunications or marketing.

Some other technological trends clearly emerged from talk topics and panels:

  • a growing number of alternatives for online analysis of large volumes of data (Spark, Impala, SploutSQL or SenseiDB)
  • the comeback of SQL, at least in the form of dialects on top of existing systems, which makes it easier to develop and maintain applications
  • the importance of visualization as a tool to communicate Big Data results effectively.

However, adopting Big Data as a philosophy inside your company is not merely a matter of technology. It requires a clear vision of the benefits that grounding all your processes in data may bring, and of the value and knowledge that you may obtain by integrating internal and external data. Another important factor is being able to find the right people to bridge the chasm between the technical and business sides. In this sense, the role of the data scientist is very important, and Sean Owen from Cloudera defined it as “a person who is better at statistics than any software engineer and better at software engineering than any statistician”. We may add to the wish list a deep knowledge of your business domain and the ability to ask the right questions.

While not everybody agreed, it seems that the best way to start “doing Big Data” is one step at a time, with a project that has clear business goals. If you want to test the technology, good candidates are those business processes that have already become a bottleneck using standard databases. On the other hand, innovation may also be an important driver, whether by using external open data or because you need to design data-centric products. A good example of that sort is the Open Innovation challenge from Centro de Innovacion BBVA, which provides aggregate information on credit card transactions.


Finally, going back to the theme of our talk, one of the external sources that is generating the most value is social network data. Due to their heterogeneity, social networks are intrinsically difficult to analyze but, fortunately, text analytics tools like the Textalytics API enable you to make sense of unstructured data. Once integrated into your Big Data toolset, they open the door to the intelligent integration of quantitative and qualitative data, with all the valuable insights that follow.

If you want to dive into the Big Data world, videos of the talks and expert panels are available at the Big Data Spain site.

The Citizen Sensor: the citizen as a sensor in the city of the future

One of our most promising lines of work in the Ciudad2020 R&D project (INNPRONTA Program, funded by CDTI, Technological and Industrial Development Center) focuses on the concept that we have defined as Citizen Sensor: the log of events in relation with citizens and their municipality.

By applying Textalytics’ semantic technologies, we can analyze the citizen’s voice in detail, extracting heterogeneous, high-level information. Because this information is highly descriptive and carries high added value, it is useful for modeling the citizen’s urban behavior and his/her relationship with the city of the future. In this way, the citizen becomes a sensor integrated into the city’s network of sensors.

The Citizen Sensor can provide data in different ways:

  • Mobile phone: For example, to detect noise pollution, the user could start an application on his/her smartphone to record the noise level and send it to the city servers. This will give us a map of the most significant sources of noise in the city, which evolves over time (roadworks in the mornings, parties on weekends…).
  • Citizen’s events: For example, the user validates a train ticket to go to work. This, added to the events generated by the rest of the users who take the train, will give us an idea of the density of travelers who use the train to go to work each morning and which routes they take.
  • Social networks: Our systems can analyze the flow of tweets in a geographic area to know what users are talking about, and if it is something relevant (a car crash that causes traffic jams, a fire, a music festival…) we can use those data to develop a precise model with much better adjusted predictions. We can also collect citizens’ thoughts or opinions with respect to policies adopted by the local authority (for example, the policy of reducing air-conditioning consumption in public transport).

As preliminary work, we have built an ontology that defines the different dimensions which are going to guide the semantic analysis. We are currently collecting information from Twitter, and in particular, our aim is to identify in each tweet the location where the user is (a public building like the city hall or a hospital, parks, transportation facilities, places of leisure or work, etc.), the concept (city services, supplies, sign posts, etc.), or the specific event it refers to (concerts or sporting events, or problematic situations such as breakdowns, traffic jams, accidents, fires), as well as the subject area of the message (politics, economy, quality of life, tourism, sport, social interest…). This analysis is complemented by a sentiment analysis able to detect the polarity of the message (very positive, positive, negative, very negative and neutral).
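
As a rough illustration of how the output of such an analysis could be represented, here is a small Python sketch of one annotated tweet. The class name, field names and the `is_incident` rule are hypothetical and are not taken from the project’s ontology; only the dimensions and the five polarity levels come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# The five polarity levels mentioned in the text.
POLARITIES = ("very positive", "positive", "neutral", "negative", "very negative")


@dataclass(frozen=True)
class CitizenReport:
    """One analyzed tweet; every field name here is an assumption for this sketch."""
    text: str
    location: Optional[str]   # e.g. "hospital", "park"
    concept: Optional[str]    # e.g. "city services", "supplies"
    event: Optional[str]      # e.g. "traffic jam", "concert"
    subject: Optional[str]    # e.g. "quality of life", "tourism"
    polarity: str = "neutral"

    def __post_init__(self):
        if self.polarity not in POLARITIES:
            raise ValueError(f"unknown polarity: {self.polarity!r}")

    def is_incident(self):
        """Hypothetical rule: an event reported with negative polarity."""
        return self.event is not None and self.polarity in ("negative", "very negative")


report = CitizenReport(
    text="Huge traffic jam next to the hospital",
    location="hospital",
    concept=None,
    event="traffic jam",
    subject="quality of life",
    polarity="negative",
)
print(report.is_incident())
```

Keeping each dimension as a separate optional field reflects the fact that a given tweet will often mention only some of them.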


The aim is to merge the semantic analysis with the user’s geopositioning in order to show, in real time, what citizens are talking about and what they think, in the form of a city management console. This type of analysis could serve, for example, for early detection of risk situations such as accidents or supply breakdowns on public roads, fights in leisure areas, the condition (cleaning, security, services) of public parks or beaches, etc.

For this analysis we use our APIs for language detection (which can process Spanish, English, French, Italian, Portuguese and Catalan), entity extraction, automatic classification, sentiment analysis and demographic classification of users, all included in Textalytics Core.


At the moment we are researching temporal analysis, to try to detect trends in citizens’ behavior and opinion over the period analyzed. This research consists of comparing the condition of the city at different moments in time in order to analyze and interpret the differences, which will be due either to the daily life of the city (for example, the natural increase in public activity as the morning advances) or to unexpected situations that might be predicted.
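
One simple way to separate the normal daily rhythm from unexpected situations is to compare each hour with the same hour on previous days. The Python sketch below shows that idea only; it is not the method used in the project, and the function name, the data layout and the threshold of three standard deviations are assumptions.

```python
from statistics import mean, pstdev


def unusual_hours(history, today, threshold=3.0):
    """Flag hours whose activity today is far above the usual level for that hour.

    history: dict mapping hour -> list of message counts on previous days
    today:   dict mapping hour -> message count observed today
    """
    flagged = []
    for hour, count in sorted(today.items()):
        past = history.get(hour, [])
        if len(past) < 2:
            continue  # not enough data to estimate a baseline
        baseline = mean(past)
        spread = pstdev(past) or 1.0  # avoid dividing by zero on flat history
        if (count - baseline) / spread > threshold:
            flagged.append(hour)
    return flagged


# Hypothetical counts of geolocated messages per hour in one district.
history = {8: [100, 110, 90], 9: [200, 210, 190]}
today = {8: 105, 9: 400}
print(unusual_hours(history, today))
```

Because each hour is compared with its own history, the natural increase of activity during the morning does not trigger an alert, while a sudden spike at a given hour does.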

You can find more information, documentation, and demos on our web page: http://www.daedalus.es/ciudad2020/Sensor_Ciudadano. If you have any questions or comments, please do not hesitate to contact us.

[Translation by Luca de Filippis]

Offensive comments from readers in European online media have come to a full stop: Media will be responsible. What’s next?

28 October 2013

The European Court of Human Rights issued a highly relevant ruling for European media companies on October 10th. The case was brought by the Estonian news website Delfi, which had been sued in its country’s courts for publishing offensive reader comments against the director of a company that had acted as a source of information. The news item in question was published on January 24th, 2006, and a few weeks later, on March 9th, the victim’s lawyers requested the withdrawal of 20 offensive comments and compensation for moral damages. The news website removed the comments the same day and rejected the claim for compensation. The following month, a civil lawsuit was filed before the Estonian courts. The lawsuit reached the country’s highest court, which upheld the verdict against the media company and ordered it to pay 320 euros in compensation to the plaintiff.

Delfi, the company that owns the news portal, appealed to Strasbourg (seat of the European Court of Human Rights), stating that the ruling violated the principle of freedom of expression, protected by article 10 of the Convention for the Protection of Human Rights and Fundamental Freedoms.


Now, this European court has ruled against the media company, despite the fact that Delfi had an automatic (if rudimentary) system to filter out comments that included certain keywords (insults or other problematic words). In addition, Delfi had a mechanism with which readers could mark a comment as inappropriate. The ruling considers that this filter was insufficient to prevent damage to the honor of third parties and that the media company should have taken more effective action to prevent these situations.

The court considers it reasonable to hold the publisher responsible, since its function is to publish information and give visibility to readers’ comments, and it profits from the traffic generated by those comments.

What now? In an entry on this blog, entitled “Moderating participation in the media” [in Spanish] and published a couple of years ago, we summed up the difficulties and the keys to our approach to help solve a problem that is far from trivial.

Difficulties are manifold. On the one hand, detecting isolated offensive words is not enough: it is necessary to filter expressions, sometimes taking into account their context and inflected forms. On the other hand, it is also necessary to interpret abbreviated language and texts with typographic errors, which are noticeably frequent in comments and user-generated content sections. These “errors” can arise from the limitations of devices, the impulsive nature of commenting, or users’ deliberate attempts to outsmart the automatic filters (sometimes in really witty ways).
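
To give a flavor of what handling these “errors” involves, here is a deliberately naive Python sketch that normalizes some common obfuscations before matching against a word list. It is not our production filter: the substitution table, the placeholder word list and the crude plural handling are assumptions, and real systems also need context and multi-word expressions.

```python
import re
import unicodedata

# Assumed character substitutions often used to disguise words.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

# Placeholder list; a real deployment would use a curated, customizable lexicon.
BLOCKED = {"idiot", "stupid"}


def normalize(text):
    """Undo some simple tricks: case, accents, digit swaps, separators, letter runs."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(LEET)
    # Drop separators placed between characters, e.g. "i.d.i.o.t" -> "idiot".
    # Side effect: "end.start" also becomes "endstart"; acceptable for a sketch.
    text = re.sub(r"(?<=\w)[.\-_*]+(?=\w)", "", text)
    # Collapse runs of three or more identical characters, e.g. "stuuuupid".
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return text


def flagged_terms(text):
    """Return the normalized words that match the list, including simple plurals."""
    words = re.findall(r"[a-z]+", normalize(text))
    return sorted({w for w in words
                   if w in BLOCKED or (w.endswith("s") and w[:-1] in BLOCKED)})


print(flagged_terms("what an 1d1ot, stuuuupid i.d.i.o.t.s"))
```

Each normalization step closes one loophole and opens the door to false positives, which is why a fixed list of rules like this one is only a starting point.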

In addition to this problem related to the Variety of texts, we find the other two recurring features in “big data” applications (forming the famous 3Vs): Volume of the comments to be processed and Velocity of response required.

At Daedalus, we have been addressing these problems for the media industry for years and lately also for other sectors, like banking and insurance.

As regards the integration architecture of our solutions, we currently offer them in SaaS (Software as a Service) mode from Textalytics, our new API platform in the cloud, as well as through traditional licensing to run on-premises.

With automatic filtering systems, we cannot guarantee 100% accuracy for any filtering task. Different companies or media, and different sections within the same medium, require different strategies. It clearly makes no sense to apply the same filter criteria to the comments on a brilliant feature article and to the remarks that emerge during the live broadcast of a football match or a reality show. For this reason, our systems assess the severity of each expression, allowing our customers to set their acceptability threshold flexibly. In addition, we provide customization tools to facilitate the incorporation of new problematic expressions. Finally, we also permanently monitor the operation of these systems for customers who wish it, as part of their continuous quality assurance and improvement plans.
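
The combination of severity levels and per-section thresholds can be pictured with the following Python sketch. The severity scale, the section names, the sample expressions and the function names are hypothetical and do not describe our actual configuration format.

```python
# Hypothetical severity scores (higher is worse) and per-section limits.
SEVERITY = {"fool": 1, "idiot": 2, "scum": 3}
THRESHOLDS = {"opinion": 1, "live_sports": 3}
DEFAULT_THRESHOLD = 2


def add_expression(expression, severity):
    """Customization hook: register a new problematic expression."""
    SEVERITY[expression.lower()] = severity


def moderate(comment, section):
    """Reject a comment when its worst expression reaches the section's limit."""
    words = (w.strip(".,!?") for w in comment.lower().split())
    worst = max((SEVERITY.get(w, 0) for w in words), default=0)
    limit = THRESHOLDS.get(section, DEFAULT_THRESHOLD)
    return "reject" if worst >= limit else "accept"


print(moderate("What an idiot!", "opinion"))
print(moderate("What an idiot!", "live_sports"))
```

The same comment is rejected under a feature article and accepted in a live match thread, which is exactly the kind of flexibility that a single global filter cannot offer.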

Are you interested? Feel free to contact Daedalus.

Discover our solutions for the media industry.


Jose C. Gonzalez
