- Pavel Serdyukov (Yandex, Russia)
- Peter Mika (Yahoo! Labs, Spain)
- Roberto Cornacchia (Spinque, The Netherlands)
- Alessandro Benedetti and Antonio Perez (Zaizi, United Kingdom)
- Jakub Zavrel (Textkernel, The Netherlands)
- Joaquin Delgado (Intel Media, United States)
- Mikhail Khludnev (Grid Dynamics, United States)
- Erik van Oosten (eBay, The Netherlands)
Abstract Yandex is one of the largest internet companies in Europe, operating Russia’s most popular search engine and generating 62% of all search traffic in Russia, which means processing about 220 million queries from about 22 million users daily. Clearly, the amount and variety of user behavioral data that we can monitor at search engines is rapidly increasing. Still, we do not always recognize its potential to help us solve the most challenging search problems, and we do not immediately know how to use it most effectively, both for evaluating search quality and for improving it. My talk will focus on various practical challenges arising from the need to “grok” search engine users and do something useful with the data they most generously, though almost unconsciously, share with us. I will also present some answers by reviewing our latest research on user-model-based retrieval quality evaluation, implicit feedback mining and personalization.
I will also summarize the experience we gained from organizing three data mining challenges in the series of workshops on using search click data (WSCD) held alongside the WSDM 2012 – 2014 conferences. These challenges provided a unique opportunity to consolidate and scrutinize the work of search engines’ industrial labs on analyzing behavioral data. Each year we publicly shared a fully anonymized dataset extracted from Yandex query logs and asked participants to predict editorial relevance labels of documents using search logs (in 2011), to detect search engine switches in search sessions (in 2012), and to personalize web search using long-term (user-history-based) and short-term (session-based) user context (in 2013).
Bio Pavel Serdyukov is the Head of Research Projects at Yandex, where he manages a team of researchers working in the field of web search and data mining. He has published extensively in top-tier conferences on topics related to web search, personalization, enterprise/entity search, query log analysis, location-specific retrieval and recommendation. He co-organized a number of workshops at SIGIR, was a co-organizer of the Entity track at TREC 2009-2011, and co-organized a series of workshops at WSDM in 2012 – 2014. Recently, he was also the General Chair of ECIR 2013 in Moscow. Before joining Yandex in 2011, he was a postdoc at Delft University of Technology; he received his PhD from the University of Twente (2009) and his MSc from the Max Planck Institute for Computer Science (2005).
Abstract Starting in 2008 with the introduction of “rich” snippets that incorporate structured data from result pages, Yahoo has been a pioneer in extending search functionality from pure document retrieval to search over structured data or “knowledge”. In this talk, we review the history of semantic search at Yahoo and developments across the broader industry. We also look ahead to highlight the potential of semantic search and discuss the research challenges that have surfaced and remain unsolved.
Bio Peter Mika is a Senior Research Scientist at Yahoo!, based in Barcelona, Spain. Peter works on applications of semantic technology to Web search. He received his MSc and PhD in computer science (summa cum laude) from Vrije Universiteit Amsterdam. He is the author of the book ‘Social Networks and the Semantic Web’ (Springer, 2007). In 2008 he was selected as one of “AI’s Ten to Watch” by the editorial board of the IEEE Intelligent Systems journal. Peter is a regular speaker at both academic and technology conferences and serves on the advisory board of a number of public and private initiatives. He represents Yahoo! in the leadership of the schema.org collaboration with Google, Bing and Yandex.
Abstract Bibliographic data have always represented an interesting case for Information Retrieval. Books have authors, title, editions, publishers, identification codes and so on; they can cite other publications and be held by a number of libraries. Digital humanities and the cultural heritage domain invest an increasing effort in the preservation, valorisation and exploitation of bibliographic data, with an emphasis on open data. This not only means that larger volumes of data are available, but also that such data sets are more and more linked together, with consequent challenges about their integration. So, even though “books” and their archival records have not changed for decades, the scale of the problem is changing rapidly.
Secondly, the spectrum of information needs to be satisfied is growing larger. The increase in available (open) data demands that innovative services be developed, whether they target researchers, librarians or end users, and whether the context is an academic, cultural or commercial setting. The associated information retrieval challenge is no longer just about finding a book by its author’s last name. Full-text search combined with a few facets may address more complex needs, but it does not help to exploit the linked nature of today’s open data to its full potential. The key problem is how to effectively use the full wealth of linked data being made available online, growing day by day, and turn this rich source of information into novel search scenarios: what are the most prestigious academic publishers, based on scientific citations, online consumer reviews and ratings? How can a search system tailor the quest for a book to the age of the expected reader?
We discuss how Spinque addresses these challenges of rich interlinked book data, using its core Search by Strategy concept to separate concerns about modelling the various types of data and their interrelations, and customizing the ranking of information objects accordingly. Here, search processes are modelled on top of structured and unstructured data, with integrated support for probabilistic reasoning in order to deal transparently with both exact and missing or vague information. We discuss this case of book records in the specific context of the EU-funded project COMSODE (Components Supporting the Open Data Exploitation). The envisioned Open Data Node platform aims at the effective reuse of integrated data sources, with a strong emphasis on data quality.
Bio Roberto Cornacchia received his PhD from TU Delft based on research work carried out at CWI, and then co-founded Spinque to further develop these insights on the integration of information retrieval and databases.
Alessandro Benedetti and Antonio Perez (Zaizi, United Kingdom) – Content Discovery Through Entity Driven Search
Abstract Leveraging enterprise information is no easy task, especially when unstructured information represents more than 80% of enterprise content. Meaningfully structuring content is critical for companies, and Natural Language Processing and Semantic Enrichment are becoming increasingly important for improving the quality of information retrieval tasks.
With the Semantic Web moving towards full realisation thanks to the Linked Data initiative and with the interest of major search engines in structured data, the enterprise search world is finding it more attractive to make its information machine readable and exploit that information to improve search over its content.
In this scenario, three trends are transforming the face of search:
- Entity-oriented search. Searching not by keyword, but by entities that represent specific concepts in a certain domain.
- Knowledge graphs. Leveraging relationships amongst entities: Linked Data datasets (Freebase, DBpedia, …) or companies’ custom knowledge bases.
- Search assistance. Autocomplete and spellchecking are now common features, but making use of semantic data makes it possible to offer smarter features, guiding the users to what they want, in a natural way.
Sometimes, the proper resources for building such features are not easy to obtain. In order to generate these, our approach includes a number of unstructured data processing mechanisms the goal of which is to automatically extract semantic information:
- Extract content from heterogeneous data sources
- Extract domain information and enrich the content through different NLP processes like Named Entity Recognition, Coreference Resolution, Entity Linking and Disambiguation, and Topic Annotation
- Create specialised indexes to store the extracted semantic information
Currently there are a number of well-developed uses of extracted semantic information, such as faceting and concept indexing; however, further ways of exploiting it are emerging in the industry.
Semantic Autocomplete
The goal of this feature is to automatically complete users’ phrases with entity names and properties, helping them find the desired documents through exploration of the domain Knowledge Graph. As the user keys in a phrase, the system proposes a set of named entities and/or a set of entity types. When the user accepts a suggestion, the system dynamically adapts subsequent suggestions to the chosen context.
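The adaptation described above can be sketched in a few lines. This is a minimal illustration (not Zaizi's implementation); the entity names, types and the `suggest` function are hypothetical:

```python
# Toy entity-driven autocomplete: suggestions are entities whose names
# match the typed prefix; accepting a suggestion fixes an entity type
# ("the chosen context") that narrows subsequent suggestions.

ENTITIES = [
    {"name": "Barcelona", "type": "City"},
    {"name": "Barcelona FC", "type": "SportsTeam"},
    {"name": "Bartok", "type": "Composer"},
]

def suggest(prefix, context_type=None):
    """Return entity-name suggestions for a prefix, optionally
    restricted to a previously accepted entity type."""
    prefix = prefix.lower()
    return [e["name"] for e in ENTITIES
            if e["name"].lower().startswith(prefix)
            and (context_type is None or e["type"] == context_type)]

print(suggest("bar"))                       # all entities matching "bar"
print(suggest("bar", context_type="City"))  # narrowed by accepted type
```

A production system would of course back this with an index over the Knowledge Graph rather than an in-memory list.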
The accuracy delivered by entity-driven search brings increased satisfaction among users: they see documents that are about a specific semantic concept with concrete properties, not about a keyword that can be interpreted ambiguously.
Semantic More Like This
A feature to find documents similar to a given input document, based on the underlying knowledge in the documents instead of raw tokens.
By implementing a semantic distance function, we can provide a degree of similarity between documents based on the concepts and entities they contain. These techniques take into account the actual relations between entities, not only within a document but also in the whole Knowledge Base. So even if two documents do not share any entity, they can still share a common topic (the same semantic background) because their entities are close to each other in the Knowledge Base graph (locality).
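The graph-locality idea above can be illustrated with a toy sketch (the graph and distance definition are our assumptions, not the authors' actual function): documents are entity sets, and their distance is the average shortest-path length between their entities in the knowledge base.

```python
# Two documents sharing no entity can still be semantically close if
# their entities are few hops apart in the knowledge-base graph.
from collections import deque

# Hypothetical undirected knowledge-base graph over entities.
KB = {
    "Mozart": ["Classical music", "Vienna"],
    "Beethoven": ["Classical music", "Vienna"],
    "Classical music": ["Mozart", "Beethoven"],
    "Vienna": ["Mozart", "Beethoven"],
}

def hops(a, b):
    """Shortest-path length between two entities (BFS)."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in KB.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return float("inf")

def semantic_distance(doc_a, doc_b):
    """Average entity-to-entity graph distance between two documents,
    each represented by its set of extracted entities."""
    pairs = [(a, b) for a in doc_a for b in doc_b]
    return sum(hops(a, b) for a, b in pairs) / len(pairs)

# The two documents share no entity, yet are only 2 hops apart
# through "Classical music", so they rank as similar.
print(semantic_distance({"Mozart"}, {"Beethoven"}))
```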
Bio Alessandro Benedetti is a search expert and semantic technology enthusiast working in the R&D division of Zaizi. His focus and favourite work is R&D on information retrieval, information extraction, natural language processing and machine learning, with a strong emphasis on data structures, algorithms and probability theory. Alessandro obtained his Master’s degree in Computer Science with full marks in December 2009, after which he collaborated with Università degli Studi Roma Tre on the subject of his master’s thesis: “Outdesit – A new approach to improve/support semantic search in the web”. He then spent three years working across Italy with Sourcesense as a Search and Open Source consultant and developer, before moving to the UK in 2013 and joining Zaizi last September.
Jakub Zavrel (Textkernel, The Netherlands) – Can semantic search & match equal or enhance a recruiter’s common sense?
Abstract Textkernel is working on tools for recruiters and job seekers to connect supply and demand in the job market. This means matching unstructured documents (CVs and jobs) in a domain-specific semi-structured retrieval model. The recruitment domain is characterised by a high number of different fields and very domain-specific terminology. Some fields are textual (job title, skills), others coded (sector or education level), numerical (salary or years of experience) or geographical (location), and the ranking problem is to combine and weight these fields into a balance that resembles the common-sense judgement of a good recruiter. Also typical for the domain is a large keyword-level gap between the language of the job seeker (CV) and the recruiter (job advertisement). This gap needs to be overcome by incorporating domain-specific synonyms, taxonomies, geolocations, concept relations (X years in job Y) and more into the ranking function.
The first part of this talk presents the main research challenges that complicate ranking CVs and Jobs, how we can mine knowledge from semi-structured data that helps improve retrieval quality, and how learning to rank techniques can be used to improve the ranking. We will also show how candidate status from actual applications in recruitment systems (rejected, invited for interview, second interview, hired) can be used as a feedback signal.
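The candidate-status feedback signal mentioned above can be sketched as a simple mapping from application outcomes to graded relevance labels for a learning-to-rank trainer. The grade values and function names here are our assumptions for illustration, not Textkernel's scheme:

```python
# Turn candidate statuses from recruitment systems into graded
# relevance labels usable as learning-to-rank training data.

STATUS_GRADES = {
    "rejected": 0,
    "invited for interview": 1,
    "second interview": 2,
    "hired": 3,
}

def to_training_tuples(applications):
    """Convert (job_id, cv_id, status) events into
    (job_id, cv_id, grade) tuples for an LTR trainer;
    unknown statuses are skipped."""
    return [(job, cv, STATUS_GRADES[status])
            for job, cv, status in applications
            if status in STATUS_GRADES]

pairs = to_training_tuples([
    ("job42", "cv1", "hired"),
    ("job42", "cv2", "rejected"),
    ("job42", "cv3", "withdrawn"),  # not a known status: dropped
])
print(pairs)
```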
The second part of the talk will discuss the UI challenges facing advanced search capabilities in this domain. The multi-field search in recruitment lends itself well to natural language queries in a single query box that can be parsed into structured queries using domain knowledge.
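A toy sketch of such single-box parsing follows; the lexicons, field names and `parse_query` function are hypothetical illustrations of the idea, not Textkernel's parser:

```python
# Map a free-text recruitment query onto structured fields using
# domain lexicons (job titles, locations).
import re

JOB_TITLES = {"java developer", "data scientist"}
LOCATIONS = {"amsterdam", "rotterdam"}

def parse_query(q):
    """Parse a single-box query into structured field constraints."""
    q = q.lower()
    fields = {}
    for title in JOB_TITLES:
        if title in q:
            fields["job_title"] = title
    match = re.search(r"\bin (\w+)", q)
    if match and match.group(1) in LOCATIONS:
        fields["location"] = match.group(1)
    return fields

print(parse_query("Java developer in Amsterdam"))
```

A real system would add synonym expansion and taxonomy lookups on top of this, as the abstract describes.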
Bio Jakub Zavrel is the founder and CEO of Textkernel, an Amsterdam-based R&D and engineering company focusing on semantic search and matching for the job market, and European market leader in this area. Prior to Textkernel, Jakub worked as a researcher at the Universities of Tilburg and Antwerp in the areas of Machine Learning and Natural Language Processing.
Joaquin Delgado (Intel Media, United States) – Scalable recommender systems and their similarity to advertising systems
Abstract In this presentation I will talk about the design of scalable recommender systems and their similarity to advertising systems. The problem of generating and delivering recommendations of content/products to appropriate audiences, and ultimately to individual users at scale, is largely similar to the matching problem in computational advertising, especially in the context of dealing with self- and cross-promotional content. In this analogy with online advertising, a display opportunity triggers a recommendation. The actors are the publisher (website/medium/app owner) and the advertiser (content owner or promoter), whereas the ads or creatives represent the items being recommended, which compete for the display opportunity and may have different monetary value to the actors. To effectively control what is recommended to whom, targeting constraints need to be defined over an attribute space, typically grouped by type (Audience, Content, Context, etc.), where some associated values are not known until decisioning time. In addition to constraints, there are business objectives (e.g. delivery quotas) defined by the actors. Both constraints and objectives can be encapsulated in and expressed as campaigns. Finally, there is the concept of relevance, directly related to the prediction of users’ responses, which is computed using the same attribute space as signals.
As in advertising, recommendation systems require a serving platform where decisioning happens in real time (a few milliseconds), typically selecting an optimal set of items to display to the user from hundreds, sometimes thousands or millions, of candidates. User actions are then taken as feedback and used to learn models that dynamically adjust in order to meet business objectives. Most of the targeting and real-time decisioning capability of scalable ad systems has been inspired by information retrieval (IR) techniques, which can be directly applied to recommender systems.
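The two-stage decisioning described above (targeting filter, then relevance ranking) can be sketched minimally; the field names and the `decide` function are hypothetical, and a real system would evaluate richer constraints and campaign objectives:

```python
# Ad-style serving for recommendations: filter candidates by a
# targeting constraint, then rank survivors by predicted relevance.

ITEMS = [
    {"id": 1, "audience": "sports", "relevance": 0.9},
    {"id": 2, "audience": "news",   "relevance": 0.7},
    {"id": 3, "audience": "sports", "relevance": 0.4},
]

def decide(request, candidates, k=2):
    """Real-time decisioning: targeting filter + relevance ranking,
    returning the top-k eligible item ids."""
    eligible = [c for c in candidates
                if c["audience"] == request["audience"]]
    eligible.sort(key=lambda c: c["relevance"], reverse=True)
    return [c["id"] for c in eligible[:k]]

print(decide({"audience": "sports"}, ITEMS))  # item 2 fails targeting
```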
This is a radical departure from the traditional item-based and user-based collaborative filtering approach to recommender systems, which fails to factor in context, such as time of day, geo-location or the category of the surrounding content, to generate more accurate recommendations. Traditional approaches also fail to recognize that recommendations don’t happen in a vacuum and as such may require the evaluation of business constraints and objectives. All this should be considered when designing and developing commercial recommender/advertising systems.
Bio Joaquin A. Delgado is currently Director of Advertising Technology at Intel Media – OnCue (an Intel Corp subsidiary recently acquired by Verizon Communications), working on disruptive technologies in the Internet TV space. Prior to that he held CTO positions at AdBrite, Lending Club and TripleHop Technologies (acquired by Oracle). He was also Director of Engineering and Sr. Principal Architect at Yahoo! His expertise lies in distributed systems, advertising technology, machine learning, recommender systems and search. He holds a PhD in computer science and artificial intelligence from the Nagoya Institute of Technology, Japan.
Abstract Our team is developing an eCommerce search platform for one of the major US retailers. The core engine of the platform is based on modified Lucene and Solr. During implementation, we faced many limitations of classical IR models such as Boolean Retrieval and the Vector Space Model as applied to eCommerce use cases. We developed several solutions that address those limitations and want to share and discuss our ideas with the community.
Obviously, an eCommerce site customer never uses “advanced” syntax to demarcate optional vs mandatory terms in her query. She neither quotes phrases nor provides the long search phrases that are so important for successful Boolean Retrieval, hence we had to reject it. Nor can we use any standard relevance formula or the Vector Space Model, because the assumptions made by the Vector Space Model do not hold well in eCommerce.
Thus, we established our own retrieval criterion, which recognizes concepts in user input. This retrieval model is biased towards precision rather than recall, which is adequate for eCommerce use cases. An interesting detail is that we do not introduce a separate index for concepts; we rather piggyback on the main index, which allows us to avoid separate requests and achieve high performance.
We also have a form of language model for query expansion. We do not use weighted query branches, which enumerate matching hypotheses as in IR textbooks, because they create a long tail of weakly relevant matches, polluting the faceted navigation experience, which is a critical feature for modern eCommerce. Instead, we enumerate possible matching hypotheses and search for each of them separately, relaxing precision requirements as we go. This allows us to balance between precision and recall dynamically for every user input and still perform well enough.
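The hypothesis-relaxation idea can be illustrated with a toy tiered search (our reading of the approach, not Grid Dynamics' code; the index structure and functions are hypothetical): try the strictest hypothesis first and relax only until enough documents match, so weakly relevant matches never enter the result set.

```python
# Tiered retrieval: each "hypothesis" requires a subset of the query
# terms; we search them separately from strict to relaxed.

def search(index, required_terms):
    """Toy conjunctive search: docs containing all required terms."""
    return [doc_id for doc_id, terms in index.items()
            if required_terms <= terms]

def tiered_search(index, query_terms, min_hits=1):
    """Require all query terms first, then drop trailing terms one at
    a time until at least min_hits documents match."""
    terms = list(query_terms)
    for n in range(len(terms), 0, -1):
        hits = search(index, set(terms[:n]))
        if len(hits) >= min_hits:
            return hits
    return []

INDEX = {
    "d1": {"red", "wool", "sweater"},
    "d2": {"red", "sweater"},
}
# The strict 3-term hypothesis matches nothing, so the 2-term
# hypothesis is tried next.
print(tiered_search(INDEX, ["red", "sweater", "cashmere"]))
```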
We also aggregate recognized matching patterns to obtain a compact representation of the returned search result. This representation is provided to other systems that track or react to user intention (or our understanding of it).
The last complication is that in eCommerce we deal with scoped entities, such as a few SKUs within a product, hence our platform should correctly model such relations, which are much more complex than classic ‘flat’ text models.
Bio Mikhail Khludnev is a Principal Engineer in eCommerce Search at Grid Dynamics. Mikhail used to develop backends for enterprises in Java, focusing on architecture and scalability. He joined the Apache Lucene community and has been working on an eCommerce search platform for one of the major US retailers for the last few years. He has made a few contributions to Lucene and Solr and spoke at Lucene Revolution about essential search algorithms. His intention is to share his team’s findings in practical eCommerce search and hear about the state of the art from the IR community.
Abstract Clustered markers displayed on a map often do not require the precision offered by existing well-known clustering methods such as GVM (Greedy Variance Minimization). This talk presents a much simpler algorithm to cluster map markers, which trades accuracy for speed. The algorithm is currently in use by Marktplaats, a Dutch eBay subsidiary.
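One common way to trade accuracy for speed in marker clustering is to snap markers to a coarse grid and merge per cell in a single O(n) pass; the sketch below illustrates that family of approach (the abstract does not specify the actual algorithm, and the cell size and function names here are our assumptions):

```python
# Grid-based marker clustering: each (lat, lon) marker is assigned to
# a grid cell; one pass over the markers yields per-cell counts and
# centroids, with no iterative variance minimization.
from collections import defaultdict

def grid_cluster(markers, cell_size=1.0):
    """Cluster markers by grid cell.

    Returns {cell_key: (count, (centroid_lat, centroid_lon))}.
    """
    cells = defaultdict(list)
    for lat, lon in markers:
        key = (int(lat // cell_size), int(lon // cell_size))
        cells[key].append((lat, lon))
    return {
        key: (len(pts),
              (sum(p[0] for p in pts) / len(pts),
               sum(p[1] for p in pts) / len(pts)))
        for key, pts in cells.items()
    }

# Two markers near Amsterdam land in one cell; one near Paris stands
# alone. Accuracy depends entirely on the (tunable) cell size.
clusters = grid_cluster([(52.1, 4.9), (52.4, 4.6), (48.8, 2.3)])
print(clusters)
```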
Bio Erik van Oosten is a software developer with 15 years of development, teaching and design experience in the Java environment. He is currently working as a member of technical staff for Marktplaats.nl, an eBay subsidiary. In this role he leads efforts on search and big data applications. Among other things, he also maintains the Scala API for a popular open source metrics library.
Industry track Co-Chairs
David Carmel (Yahoo! Labs)
Thijs Westerveld (WizeNoze)