QCM-BioChem Days

5 & 6 December 2018

Location: Room B013 (Wednesday 5) and Room C005 (Thursday 6) at LORIA.

Motivation: The QCM-BioChem consortium results from the merger of three Mastodons projects (QualiBioConsensus, HyQual and DECADE). It aims to tackle the challenges arising from the volume and variety of biological and chemical data, as well as from the range of methodologies used in knowledge discovery. The novelty of the approaches developed in QCM-BioChem lies in considering and making explicit quality criteria for the data and for the data mining and data analysis methods.

For further details: https://www.lri.fr/~cohen/QCM-BioChem.html

Program for 5 December:

14h00-15h00: Antti Kuusisto (Tampere University of Technology)

Title: Computational Logics – A Theoretical Perspective
Abstract: We survey recent work on theoretical aspects of computational logics, also discussing the interface of the related work with applications in, e.g., ontology-based data access, database theory and distributed computing. We begin by giving an overview of recent work on satisfiability and model counting problems of fragments of first-order logic. This includes results on first-order prefix classes, two-variable logic and its variants, and n-ary description logics. We show, inter alia, that the model counting problem of all typical n-ary description logics is in PTime (following directly from work published in LICS 2018) and that FO-rewritability of ontology-mediated UCQ-queries with the ontology between ALC and SHI is complete for 2NExpTime (ICDT 2017). We also discuss a recent program on descriptive complexity of distributed computing and its relations to modal logic. Highlights of this program include several characterizations of complexity classes of distributed computing and also logic-based separations of such classes.

15h15-15h45: Arnaud Soulet (LI, Université de Tours)

Title: Representativeness of knowledge bases with the generalized Benford’s Law
Abstract: Knowledge bases (KBs) such as DBpedia, Wikidata, and YAGO contain a huge number of entities and facts. Several recent works induce rules or calculate statistics on these KBs. Most of these methods are based on the assumption that the data is a representative sample of the studied universe. Unfortunately, KBs are biased because they are built from crowdsourcing and opportunistic agglomeration of available databases. This work aims at approximating the representativeness of a relation within a knowledge base. For this, we use the Generalized Benford’s law, which indicates the distribution expected of the facts of a relation. We then compute the minimum number of facts that have to be added in order to make the KB representative of the real world. Experiments show that our unsupervised method applies to a large number of relations. For numerical relations where ground truths exist, the estimated representativeness proves to be a reliable indicator.
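
As an illustration only (this is not the speaker's method), the sketch below computes the first-digit distribution predicted by the generalized Benford's law for an assumed power-law exponent alpha, and compares it with the first digits observed in a list of numerical facts. The function names, the exponent parameter, and the use of total variation distance as a gap measure are all assumptions made for this example.

```python
import math
from collections import Counter


def generalized_benford(alpha=1.0):
    """Expected probability of each leading digit d = 1..9.

    alpha = 1 gives the classical Benford's law, log10(1 + 1/d).
    Other values give the power-law generalization.
    """
    if abs(alpha - 1.0) < 1e-12:
        return {d: math.log10(1 + 1 / d) for d in range(1, 10)}
    e = 1.0 - alpha
    norm = 10 ** e - 1
    return {d: ((d + 1) ** e - d ** e) / norm for d in range(1, 10)}


def first_digit(x):
    """Leading significant digit of a non-zero number."""
    # Scientific notation with 17 significant digits avoids rounding up.
    return int(f"{abs(x):.16e}"[0])


def first_digit_frequencies(values):
    """Observed relative frequency of each leading digit (zeros are skipped)."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}


def total_variation(p, q):
    """Total variation distance between two digit distributions."""
    return 0.5 * sum(abs(p[d] - q[d]) for d in range(1, 10))


if __name__ == "__main__":
    # Hypothetical values of one numerical relation (e.g. populations).
    facts = [1200, 1850, 23000, 310, 1100, 47, 960, 15000, 2100, 130]
    expected = generalized_benford(alpha=1.0)
    observed = first_digit_frequencies(facts)
    print(round(total_variation(expected, observed), 3))
```

A small distance suggests the relation's values follow the expected law; a large one hints at missing or over-represented facts, which is the kind of bias the abstract sets out to quantify.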

15h45-16h15: Coffee break

16h15-16h45: Tatiana Makhalova (LORIA, Université de Lorraine)

Title: The Application of MDL in the Mining of Numerical Data
Abstract: The principle of Minimum Description Length (MDL) is now widely used in Data Mining and Knowledge Discovery. In Pattern Mining, MDL can be applied to select small and characteristic subsets of patterns describing data in a compact way. However, in the existing approaches, MDL is applied to nominal or binary data, while real-world data are usually more complex and very often numerical. Applying MDL to numerical data remains to be explored much more deeply. In our study, we propose an MDL-based approach to the selection of numerical patterns, where objects are described by numerical attributes. The approach is based on FCA and pattern structures.

17h00-17h45: Sylvie Hamel (Université de Montréal)

Title: Space reduction techniques for the median of permutations problem
Abstract: TBA

19h30: Meeting point at Place St Epvre

20h00: Restaurant La source

Program for 6 December:

10h00-10h45: Henry Soldano (LIPN, Université Paris-Nord)

Title: Bi-pattern mining of attributed two-mode and directed networks
Abstract: Applying closed pattern mining to attributed two-mode networks requires two conditions. First, as there are two kinds of vertices, each described with its own attribute set, we have to consider patterns made of two components that we call bi-patterns. The occurrences of such a bi-pattern then form an extension made of a pair of vertex subsets. Second, Formal Concept Analysis and Closed Pattern Mining were recently applied to networks by reducing the extensions of patterns to their cores, according to some core definition. To apply this methodology to two-mode networks, we need to consider two-mode cores and define abstract closed bi-patterns accordingly. We give in this article a general framework to define closed bi-pattern mining. We also show that the same methodology applies to cores of directed and undirected networks in which each vertex subset is associated with a specific role. We illustrate the methodology on a two-mode network of epistemological data, on a directed advice network of lawyers and on an undirected co-regulation network.

10h45-11h00: Coffee break

11h00-11h30: Lamine Diop (LI, Université de Tours)

Title: Sequential Pattern Sampling with Norm Constraints
Abstract: In recent years, the field of pattern mining has shifted to user-centered methods. In such a context, it is necessary to have a tight coupling between the system and the user, where mining techniques provide results at any time or within a short response time of only a few seconds. Pattern sampling is a non-exhaustive method for instantly discovering relevant patterns that ensures good interactivity while providing strong statistical guarantees due to its random nature. Curiously, such an approach, investigated for itemsets and subgraphs, has not yet been applied to sequential patterns, which are useful for a wide range of mining tasks and application fields. We propose the first method for sequential pattern sampling. In addition to addressing sequential data, the originality of our approach is to introduce a constraint on the norm to control the length of the drawn patterns and to avoid the pitfall of the "long tail", where the rarest patterns flood the user. We propose a new constrained two-step random procedure, named CSSampling, that randomly draws sequential patterns according to frequency with an interval constraint on the norm. We demonstrate that this method performs an exact sampling. Moreover, despite the use of rejection sampling, the experimental study shows that CSSampling remains efficient and that the constraint helps to draw general patterns of the "head". We also illustrate how to benefit from these sampled patterns to instantly build an associative classifier dedicated to sequences. This classification approach rivals state-of-the-art proposals, showing the interest of constrained sequential pattern sampling.
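
To make the two-step idea concrete, here is a minimal sketch for the simpler itemset setting that the abstract mentions as prior work. It is not CSSampling and does not handle sequences: it assumes transactions are plain sets, and the function names and the length bounds `lo` and `hi` are hypothetical. Step one draws a transaction with probability proportional to the number of its sub-itemsets whose size lies in the interval; step two draws one such sub-itemset uniformly. Each admissible itemset is then drawn with probability proportional to its frequency.

```python
import math
import random


def subsets_in_range(n, lo, hi):
    """Number of subsets of an n-item set whose size lies in [lo, hi]."""
    return sum(math.comb(n, k) for k in range(lo, min(hi, n) + 1))


def sample_itemset(transactions, lo, hi, rng=random):
    """Draw one itemset with probability proportional to its frequency,
    restricted to itemsets of size between lo and hi (inclusive)."""
    # Step 1: pick a transaction, weighted by how many admissible
    # itemsets it contains.
    weights = [subsets_in_range(len(t), lo, hi) for t in transactions]
    if sum(weights) == 0:
        raise ValueError("no itemset satisfies the length constraint")
    t = rng.choices(transactions, weights=weights, k=1)[0]

    # Step 2: pick a uniformly random admissible subset of that transaction,
    # by first choosing its size, then its items.
    sizes = list(range(lo, min(hi, len(t)) + 1))
    size_weights = [math.comb(len(t), s) for s in sizes]
    k = rng.choices(sizes, weights=size_weights, k=1)[0]
    return frozenset(rng.sample(sorted(t), k))


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"c"}]
    rng = random.Random(0)
    print([sorted(sample_itemset(data, 1, 2, rng)) for _ in range(5)])
```

In this itemset variant the length constraint is folded into the weights, so no rejection step is needed; the sequential case described in the abstract is harder because a subsequence can occur several times in the same sequence.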

11h45-12h15: Nyoman Juniarta (LORIA, Université de Lorraine)

Title: Application of Biclustering to the discovery of constant and gradual Patterns
Abstract: Biclustering plays a crucial role in many real-world applications. Related to clustering, which groups similar rows in a matrix (data table), biclustering aims at simultaneously grouping similar rows and columns, i.e. finding submatrices which exhibit a correlation among their respective cells. There are many types of biclustering, depending on the similarity criterion. In this paper we are interested in constant-column (CC) biclustering, where the objective is to discover submatrices whose columns have a constant value across all the rows. Then, we study an extension of CC biclustering to the so-called coherent-sign-changes (CSC) biclustering. The main goal of CSC biclustering is to find submatrices whose rows are jointly in a given "agreement" w.r.t. the columns. Finally, we present the application of CSC biclustering to solve the problem of mining gradual patterns.
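
The constant-column criterion can be illustrated with a short sketch. It assumes exact equality of values and a column subset fixed in advance, which is a simplification for illustration and not the approach presented in the talk; the function names are hypothetical.

```python
from collections import defaultdict


def is_constant_column(matrix, rows, cols):
    """True if, in the submatrix (rows x cols), every column holds a
    single value across all selected rows."""
    return all(len({matrix[r][c] for r in rows}) == 1 for c in cols)


def cc_biclusters_for_columns(matrix, cols, min_rows=2):
    """Group rows that share identical values on the given columns.

    Each group of at least min_rows rows, together with cols, forms a
    constant-column bicluster.
    """
    groups = defaultdict(list)
    for r, row in enumerate(matrix):
        groups[tuple(row[c] for c in cols)].append(r)
    return [(rows, list(cols)) for rows in groups.values() if len(rows) >= min_rows]


if __name__ == "__main__":
    table = [
        [1, 2, 3],
        [1, 5, 3],
        [4, 2, 3],
        [1, 9, 3],
    ]
    print(cc_biclusters_for_columns(table, [0, 2]))
```

On this small table, rows 0, 1 and 3 agree on columns 0 and 2, so they form one CC bicluster; adding column 1 would break the constant-column property.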

12h30-14h00: Lunch break

14h00-15h00: Jérôme Lang (LAMSADE, Université Paris-Dauphine)

Title: From social choice to preference learning
Abstract: Social choice theory is the field that studies the aggregation of individual preferences towards a collective choice. Among important collective choice problems studied, we find voting, fair division of resources, matching, coalition formation, and judgment aggregation. Computer science (and especially artificial intelligence and operations research) plays an increasing role in the field. More specifically, and more recently, machine learning (and especially, preference learning) is now playing an important role as well. Examples of meeting points of social choice and preference learning:
– How can we decide the outcome of an election from a small set of samples of the voters’ preferences?
– How can we learn the ‘preferential structure’ of a set of alternatives (such as single-peakedness or separability) from a set of votes?
– How can we design social choice mechanisms using machine learning principles?
– How can we interpret voting rules as maximum likelihood estimators?
– When can voting and aggregation rules be useful for preference learning?
The talk will present the basics of social choice, and then will focus on the connections between social choice and preference learning.
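
As a small illustration of preference aggregation (not taken from the talk), the sketch below applies two classical voting rules, plurality and Borda, to the same set of ranked ballots. The ballots, function names, and alphabetical tie-breaking are assumptions made for this example.

```python
from collections import defaultdict


def plurality_scores(ballots):
    """Each ballot gives one point to its top-ranked candidate."""
    scores = defaultdict(int)
    for ranking in ballots:
        for candidate in ranking:
            scores[candidate] += 0  # make sure every candidate appears
        scores[ranking[0]] += 1
    return dict(scores)


def borda_scores(ballots):
    """With m candidates, a ballot gives m-1 points to its first choice,
    m-2 to its second, and so on down to 0."""
    scores = defaultdict(int)
    for ranking in ballots:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += m - 1 - position
    return dict(scores)


def winner(scores):
    """Highest-scoring candidate; ties are broken alphabetically."""
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    # Seven hypothetical voters ranking three alternatives.
    ballots = (
        [["a", "b", "c"]] * 3
        + [["b", "c", "a"]] * 2
        + [["c", "b", "a"]] * 2
    )
    print(winner(plurality_scores(ballots)), winner(borda_scores(ballots)))
```

Here plurality elects "a" while Borda elects "b": the same individual preferences can lead to different collective choices depending on the rule.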

15h00-15h30: Coffee break

15h30-16h00: Guilherme Alves (LORIA, Université de Lorraine)

Title: A framework for online clustering based on evolving semi-supervision
Abstract: The huge amount of currently available data puts considerable constraints on the task of information retrieval. Clustering methods that organize data help with this task by reducing access time.
Semi-supervised approaches make use of some additional information on data attributes to guide the clustering performed. However, this extra information may change over time, thus imposing a shift in the manner by which data is organized.
In order to cope with this issue, we proposed a framework called CABESS (Cluster Adaptation Based on Evolving Semi-Supervision) for online clustering. This framework is able to deal with evolving semi-supervision obtained through binary user feedback. To validate our approach, experiments were run over hierarchically labeled data that consider clustering splits over time. This empirical study shows the potential of the proposed framework for dealing with evolving semi-supervision. Moreover, it also shows competitive results w.r.t. traditional semi-supervised clustering algorithms.

16h15-16h45: Kevin Dalleau (LORIA, Université de Lorraine)

Title: Tackling (some) preprocessing issues using unsupervised extremely randomized trees
Abstract: In many cases, preprocessing can be a very challenging task. In this talk, after a brief introduction to some common preprocessing issues, we present a method enabling the computation of pairwise similarities while limiting the need for preprocessing. This method is based on extremely randomized trees and is particularly interesting on heterogeneous data, where attributes describing the instances are of different types (continuous, categorical, and ordinal). We will discuss some empirical results both on synthetic and real-world datasets and on homogeneous and heterogeneous data.

17h00-17h30: Discussions