État académique
Thèse soutenue le 2016-03-14
Sujet: Uncertainty over Intensional Data
Direction de thèse:
Ellipse bleue: doctorant, ellipse jaune: docteur, rectangle vert: permanent, rectangle jaune: HDR. Trait vert: encadrant de thèse, trait bleu: directeur de thèse, pointillé: jury d'évaluation à mi-parcours ou jury de thèse.
Productions scientifiques
Cross-Fertilizing Deep Web Analysis and Ontology Enrichment
International audience
VLDS (Very Large Data Search) VLDS (Very Large Data Search) https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-00737941 VLDS (Very Large Data Search), Aug 2012, Istanbul, Turkey. pp.21-24, 2012ARRAY(0x7f4f38df5b00) 2012-08
Finite Open-World Query Answering with Number Restrictions
International audience

Open-world query answering is the problem of deciding, given a set of facts, conjunction of constraints, and query, whether the facts and constraints imply the query. This amounts to reasoning over all instances that include the facts and satisfy the constraints. We study finite open-world query answering (FQA), which assumes that the underlying world is finite and thus only considers the finite completions of the instance. The major known decidable cases of FQA derive from the following: the guarded fragment of first-order logic, which can express referential constraints (data in one place points to data in another) but cannot express number restrictions such as functional dependencies, and the guarded fragment with number restrictions but on a signature of arity only two. In this paper, we give the first decidability results for FQA that combine both referential constraints and number restrictions for arbitrary signatures: we show that, for unary inclusion dependencies and functional dependencies, the finiteness assumption of FQA can be lifted up to taking the finite implication closure of the dependencies[5]. Our result relies on new techniques to construct finite universal models of such constraints, for any bound on the maximal query size.

LICS 2015 LICS 2015 https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01190580 LICS 2015, Jul 2015, Kyoto, Japan. pp.305-316, 2015ARRAY(0x7f4f388a84b0) 2015-07
On the Complexity of Mining Itemsets from the Crowd Using Taxonomies
International audience

We study the problem of frequent itemset mining in domains where data is not recorded in a conventional database but only exists in human knowledge. We provide examples of such scenarios, and present a crowdsourcing model for them. The model uses the crowd as an oracle to find out whether an itemset is frequent or not, and relies on a known taxonomy of the item domain to guide the search for frequent itemsets. In the spirit of data mining with oracles, we analyze the com- plexity of this problem in terms of (i) crowd complexity, that measures the number of crowd questions required to iden- tify the frequent itemsets; and (ii) computational complexity, that measures the computational effort required to choose the questions. We provide lower and upper complexity bounds in terms of the size and structure of the input taxonomy, as well as the size of a concise description of the output item- sets. We also provide constructive algorithms that achieve the upper bounds, and consider more efficient variants for practical situations.

ICDT (International Conference on Database Theory) ICDT (International Conference on Database Theory) https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-00986184 ICDT (International Conference on Database Theory), Mar 2014, Athens, Greece. pp.15-25, 2014ARRAY(0x7f4f389efa30) 2014-03
Get a Sample for a Discount: Sampling-Based XML Data Pricing
International audience
DEXA (Database and Expert Systems Applications) DEXA (Database and Expert Systems Applications) https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01069820 DEXA (Database and Expert Systems Applications), Sep 2014, Munich, Germany. pp.20-34, 2014ARRAY(0x7f4f38a94d18) 2014-09
Provenance Circuits for Trees and Treelike Instances
International audience

Query evaluation in monadic second-order logic (MSO) is tractable on trees and treelike instances, even though it is hard for arbitrary instances. This tractability result has been extended to several tasks related to query evaluation, such as counting query results or performing query evaluation on probabilistic trees. These are two examples of the more general problem of computing augmented query output, that is referred to as provenance. This article presents a provenance framework for trees and treelike instances, by describing a linear-time construction of a circuit provenance representation for MSO queries. We show how this provenance can be connected to the usual definitions of semiring provenance on relational instances, even though we compute it in an unusual way, using tree automata; we do so via intrinsic definitions of provenance for general semirings, independent of the operational details of query evaluation. We show applications of this provenance to capture existing counting and probabilistic results on trees and treelike instances, and give novel consequences for probability evaluation.

ICALP 2015 ICALP 2015 https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01178399 ICALP 2015, Jun 2015, Kyoto, Japan. pp.56-68, 2015ARRAY(0x7f4f38a948e0) 2015-06
On the Connections between Relational and XML Probabilistic Data Models
International audience
BNCOD (British National Conference on Databases) BNCOD (British National Conference on Databases) https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-00874445 BNCOD (British National Conference on Databases), Jul 2013, Oxford, United Kingdom. pp.121-134, 2013ARRAY(0x7f4f38dedc58) 2013-07
Combining Existential Rules and Description Logics
International audience

Query answering under existential rules — implications with existential quantifiers in the head — is known to be decidable when imposing restrictions on the rule bodies such as frontier-guardedness [Baget et al., 2010; Baget et al., 2011a]. Query answering is also decidable for description logics [Baader, 2003], which further allow disjunction and functionality constraints (assert that certain relations are functions); however, they are focused on ER-type schemas, where relations have arity two. This work investigates how to get the best of both worlds: having decidable existential rules on arbitrary arity relations, while allowing rich description logics, including functionality constraints, on arity-two relations. We first show negative results on combining such decidable languages. Second, we introduce an expressive set of existential rules (frontier-one rules with a certain restriction) which can be combined with powerful constraints on arity-two relations (e.g. GC2, ALCQIb) while retaining decidable query answering. Further, we provide conditions to add functionality constraints on the higher-arity relations.

IJCAI 2015 IJCAI 2015 https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01190601 IJCAI 2015, Jul 2015, Buenos Aires, Argentina. pp.2691-2697, 2015ARRAY(0x7f4f38dee300) 2015-07
Recent Topics of Research around the YAGO Knowledge Base
International audience

A knowledge base (KB) is a formal collection of knowledge about the world. In this paper, we explain how the YAGO KB is constructed. We also summarize our contributions to different aspects of KB management in general. One of these aspects is rule mining, i.e., the identification of patterns such as spouse(x,y) ∧ livesIn(x,z) ⇒livesIn(y,z). Another aspect is the incompleteness of KBs. We propose to integrate data from Web Services into the KB in order to fill the gaps. Further, we show how the overlap between existing KBs can be used to align them, both in terms of instances and in terms of the schema. Finally, we show how KBs can be protected by watermarking.

Asia-Pacific Web Conference (APWEB) Asia-Pacific Web Conference (APWEB) https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01190642 Asia-Pacific Web Conference (APWEB), Sep 2014, Changsha, China. pp.1-12, 2014ARRAY(0x7f4f38df0f78) 2014-09
Structurally Tractable Uncertain Data
International audience

Many data management applications must deal with data which is uncertain, incomplete, or noisy. However, on existing uncertain data representations, we cannot tractably perform the important query evaluation tasks of determining query possibility, certainty, or probability: these problems are hard on arbitrary uncertain input instances. We thus ask whether we could restrict the structure of uncertain data so as to guarantee the tractability of exact query evaluation. We present our tractability results for tree and tree-like uncertain data, and a vision for probabilistic rule reasoning. We also study uncertainty about order, proposing a suitable representation, and study uncertain data conditioned by additional observations.

SIGMOD/PODS Ph.D. Symposium SIGMOD/PODS Ph.D. Symposium https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01190610 SIGMOD/PODS Ph.D. Symposium, Jun 2015, Melbourne, Australia. pp.39-44, 2015ARRAY(0x7f4f38df0a20) 2015-06
IBEX: Harvesting Entities from the Web Using Unique Identifiers
International audience

In this paper we study the prevalence of unique entity identifiers on the Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs (for documents), email addresses, and others. We show how these identifiers can be harvested systematically from Web pages, and how they can be associated with humanreadable names for the entities at large scale.

Starting with a simple extraction of identifiers and names from Web pages, we show how we can use the properties of unique identifiers to filter out noise and clean up the extraction result on the entire corpus. The end result is a database of millions of uniquely identified entities of different types, with an accuracy of 73--96% and a very high coverage compared to existing knowledge bases. We use this database to compute novel statistics on the presence of products, people, and other entities on the Web.

WebDB WebDB https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01190629 WebDB, May 2015, Melbourne, Australia. 2015ARRAY(0x7f4f38df0d08) 2015-05
The Possibility Problem for Probabilistic XML
International audience

We consider the possibility problem of determining if a document is a possible world of a probabilistic document, in the setting of probabilistic XML. This basic question is a special case of query answering or tree automata evaluation, but it has specific practical uses, such as checking whether an user-provided probabilistic document outcome is possible or sufficiently likely. In this paper, we study the complexity of the possibility problem for probabilistic XML models of varying expressiveness. We show that the decision problem is often tractable in the absence of long-distance dependencies, but that its computation variant is intractable on unordered documents. We also introduce an explicit matches variant to generalize practical situations where node labels are unambiguous; this ensures tractability of the possibility problem, even under long-distance dependencies, provided event conjunctions are disallowed. Our results entirely classify the tractability boundary over all considered problem variants.

Alberto Mendelzon International Workshop on Foundations of Data Management Alberto Mendelzon International Workshop on Foundations of Data Management https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01190712 Alberto Mendelzon International Workshop on Foundations of Data Management, Jun 2014, Carthagène des Indes, Colombia. 2014ARRAY(0x7f4f38df0a08) 2014-06
Uncertainty in Crowd Data Sourcing under Structural Constraints
International audience

Applications extracting data from crowdsourcing platforms must deal with the uncertainty of crowd answers in two different ways: first, by deriving estimates of the correct value from the answers; second, by choosing crowd questions whose answers are expected to minimize this uncertainty relative to the overall data collection goal. Such problems are already challenging when we assume that questions are unrelated and answers are independent, but they are even more complicated when we assume that the unknown values follow hard structural constraints (such as monotonicity).

In this vision paper, we examine how to formally address this issue with an approach inspired by [2]. We describe a generalized setting where we model constraints as linear inequalities, and use them to guide the choice of crowd questions and the processing of answers. We present the main challenges arising in this setting, and propose directions to solve them.

UnCrowd UnCrowd https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01190716 UnCrowd, Apr 2014, Bali, Indonesia. 8505, pp.351-359, 2014ARRAY(0x7f4f38dee030) 2014-04
Intensional Data on the Web
International audience

We call data intensional when it is not directly available, but must be accessed through a costlyinterface. Intensional data naturally arises in a number of Web data management scenarios, suchas Web crawling or ontology-based data access. Such scenarios require us to model an uncertainview of the world, for which, given a query, we must answer the question “What is the best thingto do next?” Once data has been retrieved, the knowledge of the world is revised, and the wholeprocess is repeated, until enough knowledge about the world has been obtained for the particularapplication considered. In this article, we give an overview of the steps underlying all intensionaldata management scenarios, and illustrate them on three concrete applications: focused crawling,online influence maximization in social networks, and mining crowdsourced data.

ACM Sigweb Newsletter https://hal-institut-mines-telecom.archives-ouvertes.fr/hal-01191721 ACM Sigweb Newsletter, 2015, pp.12ARRAY(0x7f4f38dee5e8) 2015-07
Thèse: "Tirer parti de la sttructure des données incertaines"
Soutenance: 2016-03-14