Learning representations in structured domains using neural networks and applications to social network analysis
Sujet proposé par
Directeur de thèse:
Unité de recherche UMR 7606 Laboratoire d'informatique de Paris 6
Domaine: Sciences et technologies de l'information et de la communication
Keywords : Machine Learning, Relational Learning, Social Networks , Representation Learning, Latent Models, Deep Neural Networks, Matrix Factorization, Big Data
The development of Information technology has led to the production of huge volumes of data. A recent report by IBM [IBM 2013] mentions that every day, quintillions of data are produced, and that the majority of these data has been produced in the last few years only. Managing such large volumes has become a major issue for modern Information Technology. This domain, known as Big Data Analysis, raises several important challenges including the capture, the storage, the search, and the analysis of these data. This thesis proposal focuses on this last aspect: the development of automatic tools for the analysis of large volumes of complex data based on Machine Learning algorithms and their application to the domain of social network analysis. Machine Learning has become a key technology for the analysis of data and is now a major component of the data processing platforms developed by major IT companies. The field has considerably developed over the last ten years together with the growth of data production. Data come now from everywhere: information-sensing mobile devices, wireless networks, cell phone GPS signals, internet, user logs, purchase transaction records, videos, social media sites, biology and many other sources. They come in many different forms and variety as could be inferred from the previous examples. They are often complex, exhibiting different relational structures among the data elements. On social network sites for example, the data may correspond to different sources like text, images, videos; the users or items of these networks may share different relations. To summarize, data analysis and in particular Machine Learning faces important challenges related to the volume, diversity and complexity of the data sources to be processed.
Practical applications in the field of data analysis require intensive pre-processing steps in order to provide a data representation amenable to machine learning. These pre-processing steps usually involve human expertise and knowledge about the task and the data to be handled. A major challenge is then to alleviate this burden by developing automatic techniques for learning good representations. This has become a field of study by itself in the machine learning community, and in 2013, a new international conference has been launched, and is entirely dedicated to this topic [ICLR 2013].
The thesis proposal concerns representation learning for complex relational data, typical of those that can be found on social sites. For many current applications, data are more and more complex. They are often multimodal (image, music, video, comments, etc.) , heterogeneous (different types of entities are present) and dynamic (they evolve with time). Besides, they are often highly relational (e.g. in biology or in social networks) with different types of relations linking the different entities. Classical data analysis techniques, and therefore machine learning algorithms, are not adapted to the processing of these complex data. One possible choice to circumvent this problem is then to learn data representations which can then be handled by classical algorithms. One way towards this direction is to map these complex data onto one or several latent representation spaces where machine learning techniques may then operate. If one is able to learn such representations, it will then be possible to define metrics or similarity functions onto these latent spaces, it will be possible to represent multi-modal and heterogeneous data onto common latent spaces and to consider relational information as constraints on these latent representations. Different families of techniques have been developed for learning such representations. Probabilistic techniques like PLSA or LDA [Hoffman 1999, Blei 2003] allow us learning latent distributions, matrix factorization methods like Non Negative Matrix Factorization [Lee 99] have been successfully used in different domains like recommendation or link prediction. More recently, deep neural networks [Socher 2011, Le 2012, Bordes 2001] have emerged as a powerful family of methods for learning complex transformation of the data.