Identity
Sahar CHANGUEL
Academic status
Thesis defended on 2011-05-03
Subject: Métadonnées pour la personalisation de l'accès à la connaissance et à l'information (Metadata for the personalization of access to knowledge and information)
Thesis director:
Thesis supervisor:
Laboratory:
Neighborhood
Blue ellipse: doctoral student; yellow ellipse: PhD graduate; green rectangle: permanent staff member; yellow rectangle: HDR holder. Green line: thesis supervisor; blue line: thesis director; dotted line: mid-term evaluation committee or thesis jury.
Scientific publications
oai:hal.archives-ouvertes.fr:hal-00577127
Automatic Web Pages Author Extraction
This paper addresses the problem of automatically extracting the author from heterogeneous HTML resources, as a subproblem of automatic metadata extraction from (Web) documents. We take a supervised machine learning approach to the problem, using a C4.5 Decision Tree algorithm. The particularity of our approach is that it focuses on both structural and contextual information. A semi-automatic approach was used for corpus expansion in order to help annotate the dataset with less human effort. This paper shows that our method can achieve good results (more than 80% in terms of F1-measure) despite the heterogeneity of our corpus.
Automatic Web Pages Author Extraction. Proc. of the 8th Int. Conf. on Flexible Query Answering Systems, FQAS. Proceeding with peer review, 2009.
oai:hal.archives-ouvertes.fr:hal-00577126
A General Learning Method for Automatic Title Extraction from HTML Pages
This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have proposed similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that allows learning textual titles based on style information and image format titles based on image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.
A General Learning Method for Automatic Title Extraction from HTML Pages. Proc. of the 6th Int. Conf. on Machine Learning and Data Mining, MLDM. Proceeding with peer review, 2009.
oai:hal.archives-ouvertes.fr:hal-00577128
Automatic Concept Type Identification from learning Resources
The objective of any tutoring system is to provide meaningful learning to the learner. An automated tutoring system should therefore be able to tell whether a concept mentioned in a document is a prerequisite for studying that document or can be learned from it. This paper addresses the problem of identifying defined concepts and prerequisite concepts in learning resources in HTML format. A supervised machine learning approach was taken to address the problem, based on linguistic features that encode contextual information and on stylistic features such as font size and font weight. This paper shows that contextual information in addition to format information can give better results when used with the SVM classifier than with the (LP)2 algorithm.
Automatic Concept Type Identification from learning Resources. Proc. of the 2010 International Joint Conference on Neural Networks, IJCNN. Proceeding with peer review, 2010.
edite:1332792352170
Distinguishing defined concepts from prerequisite concepts in learning resources
IEEE Symposium on Computational Intelligence and Data Mining, SSCI 2011. Conference, 2011.
Defense
Thesis: Métadonnées pour la personalisation de l'accès à la connaissance et à l'information (Metadata for the personalization of access to knowledge and information)
Defense date: 2011-05-03
Reviewers: Bruno CRÉMILLEUX, Florence SÈDES