2019
Journal Articles
1. Ruiz, Pablo; Poibeau, Thierry: Mapping the Bentham Corpus: Concept-based Navigation. In: Journal of Data Mining and Digital Humanities, 2019. URL: https://hal.archives-ouvertes.fr/hal-01915730v2
Abstract: British philosopher and reformer Jeremy Bentham (1748–1832) left over 60,000 folios of unpublished manuscripts. The Bentham Project, at University College London, is creating a TEI version of the manuscripts, via crowdsourced transcription verified by experts. We present here an interface to navigate these largely unedited manuscripts, and the language technologies the corpus was enriched with to facilitate navigation, i.e. Entity Linking against the DBpedia knowledge base and keyphrase extraction. The challenges of tagging a historical, domain-specific corpus with a contemporary knowledge base are discussed. The extracted concepts were used to create interactive co-occurrence networks that serve as a map for the corpus and help navigate it, along with a search index. These corpus representations were integrated into a user interface. The interface was evaluated by domain experts with satisfactory results; e.g., they found the distributional semantics methods exploited here useful for retrieving related passages for scholarly editing of the corpus.
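The co-occurrence networks this article describes can be illustrated with a minimal sketch. This is not the authors' pipeline; it assumes the networkx library and hypothetical per-folio concept lists merely to show how pairwise co-occurrence counts become a weighted network.

# Minimal sketch: building a concept co-occurrence network from
# per-folio concept annotations (illustrative only; the paper's
# actual extraction and network construction are not reproduced).
from itertools import combinations
import networkx as nx

# Hypothetical input: one list of extracted concepts per manuscript folio.
folios = [
    ["punishment", "legislation", "utility"],
    ["utility", "pleasure", "pain"],
    ["legislation", "utility", "codification"],
]

G = nx.Graph()
for concepts in folios:
    # Each pair of concepts co-occurring in a folio increments an edge weight.
    for a, b in combinations(sorted(set(concepts)), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

# The strongest co-occurrences serve as a rough "map" of the corpus.
for a, b, d in sorted(G.edges(data=True), key=lambda e: -e[2]["weight"]):
    print(f"{a} -- {b}: {d['weight']}")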
2017
PhD Theses
2. Ruiz, Pablo: Concept-based and relation-based corpus navigation: applications of natural language processing in digital humanities. PhD Thesis, PSL Research University, 2017. URL: https://tel.archives-ouvertes.fr/tel-01575167v2
Abstract: Social sciences and humanities research is often based on large textual corpora that would be unfeasible to read in detail. Natural Language Processing (NLP) can identify important concepts and actors mentioned in a corpus, as well as the relations between them. Such information can provide an overview of the corpus useful for domain experts, and help identify corpus areas relevant for a given research question. To automatically annotate corpora relevant for Digital Humanities (DH), the NLP technologies we applied are, first, Entity Linking, to identify corpus actors and concepts. Second, the relations between actors and concepts were determined based on an NLP pipeline which provides semantic role labeling and syntactic dependencies, among other information. Part I outlines the state of the art, paying attention to how the technologies have been applied in DH. Generic NLP tools were used. As the efficacy of NLP methods depends on the corpus, some technological development was undertaken, described in Part II, in order to better adapt to the corpora in our case studies. Part II also shows an intrinsic evaluation of the technology developed, with satisfactory results. The technologies were applied to three very different corpora, as described in Part III. First, the manuscripts of Jeremy Bentham, an 18th–19th-century corpus in political philosophy. Second, the PoliInformatics corpus, with heterogeneous materials about the American financial crisis of 2007–2008. Finally, the Earth Negotiations Bulletin (ENB), which covers international climate summits since 1995, where treaties like the Kyoto Protocol or the Paris Agreement get negotiated. For each corpus, navigation interfaces were developed. These user interfaces (UIs) combine networks, full-text search, and structured search based on NLP annotations. As an example, in the ENB corpus interface, which covers climate policy negotiations, searches can be performed based on relational information identified in the corpus: the negotiation actors having discussed a given issue using verbs indicating support or opposition can be searched, as well as all statements where a given actor has expressed support or opposition. Relation information is employed, beyond simple co-occurrence between corpus terms. The UIs were evaluated qualitatively with domain experts, to assess their potential usefulness for research in the experts' domains. First, we paid attention to whether the corpus representations we created correspond to experts' knowledge of the corpus, as an indication of the sanity of the outputs we produced. Second, we tried to determine whether experts could gain new insight on the corpus by using the applications, e.g. whether they found evidence previously unknown to them or new research ideas. Examples of insight gain were attested with the ENB interface; this constitutes a good validation of the work carried out in the thesis. Overall, the applications' strengths and weaknesses were pointed out, outlining possible improvements as future work.
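The Entity Linking step the thesis describes can be sketched with the public DBpedia Spotlight service, one commonly used linking tool against DBpedia. This is an assumption for illustration: the thesis does not necessarily use this exact service, endpoint, or confidence threshold.

# Hedged sketch: linking text mentions to DBpedia entities via the
# public DBpedia Spotlight endpoint (one possible tool for this step;
# not necessarily the one used in the thesis).
import requests

def link_entities(text, confidence=0.5):
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    # "Resources" is absent when nothing is linked above the threshold.
    return [
        (r["@surfaceForm"], r["@URI"])
        for r in resp.json().get("Resources", [])
    ]

print(link_entities("Jeremy Bentham proposed the panopticon prison design."))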
2016
Inproceedings
3. Ruiz, Pablo; Plancq, Clément; Poibeau, Thierry: Climate Negotiation Analysis. In: Digital Humanities 2016, pp. 663–666, 2016. URL: https://hal.archives-ouvertes.fr/hal-01423299
Abstract: Text analysis methods based on word co-occurrence have yielded useful results in humanities and social sciences research. Whereas these methods provide a useful overview of a corpus, they cannot determine the predicates relating co-occurring elements to each other. For instance, if France and the phrase "binding commitments" co-occur within a sentence, how are both elements related? Is France in favour of, or against, binding commitments? Different natural language processing (NLP) technologies can identify related elements in text, and the predicates relating them. We are developing a workflow to analyze the Earth Negotiations Bulletin, which summarizes international climate negotiations. A sentence in this corpus can contain several verbal or nominal predicates indicating support and opposition. Results were uneven when applying Open Relation Extraction tools to this corpus. To address these challenges, we developed a workflow with a domain model, and analysis rules that exploit annotations for semantic roles and pronominal anaphora, provided by an NLP pipeline.
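A much-simplified stand-in for the support/opposition rules this abstract describes can be sketched with spaCy's dependency parser. The paper's actual workflow relies on semantic role labeling, a domain model, and anaphora resolution; the verb lexicon, example sentence, and dependency-based rule below are illustrative assumptions only.

# Simplified sketch of support/opposition relation extraction using
# dependency parsing instead of the paper's SRL-based rules.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

SUPPORT = {"support", "favour", "favor", "back", "welcome"}
OPPOSE = {"oppose", "reject", "resist", "object"}

nlp = spacy.load("en_core_web_sm")

def stance_relations(text):
    relations = []
    for token in nlp(text):
        if token.lemma_ in SUPPORT or token.lemma_ in OPPOSE:
            actors = [c for c in token.children if c.dep_ == "nsubj"]
            issues = [c for c in token.children if c.dep_ == "dobj"]
            if actors and issues:
                label = "SUPPORT" if token.lemma_ in SUPPORT else "OPPOSE"
                issue = " ".join(t.text for t in issues[0].subtree)
                relations.append((actors[0].text, label, issue))
    return relations

print(stance_relations("France supported binding commitments."))
# [('France', 'SUPPORT', 'binding commitments')]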
2015
Inproceedings
4. Poibeau, Thierry; Ruiz, Pablo: Generating navigable semantic maps from social sciences corpora. In: Digital Humanities 2015, 2015. URL: https://hal.archives-ouvertes.fr/hal-01173963 (preprint: arXiv:1507.02020)
Abstract: It is now commonplace to observe that we are facing a deluge of online information. Researchers have of course long acknowledged the potential value of this information, since digital traces make it possible to directly observe, describe and analyze social facts, and above all the co-evolution of ideas and communities over time. However, most online information is expressed through text, which means it is not directly usable by machines, since computers require structured, organized and typed information in order to be able to manipulate it. Our goal is thus twofold: 1. provide new natural language processing techniques aiming at automatically extracting relevant information from texts, especially in the context of social sciences, and connect these pieces of information so as to obtain relevant socio-semantic networks; 2. provide new ways of exploring these socio-semantic networks, thanks to tools allowing one to dynamically navigate these networks, de-construct and re-construct them interactively, from different points of view following the needs expressed by domain experts.
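As a minimal sketch of how such a socio-semantic network could be handed to an interactive navigation tool: the snippet below serializes a toy network with networkx to GEXF (openable in Gephi) and to node-link JSON (consumable by web front ends such as d3.js). The paper's own navigation interface is not reproduced here; the toy nodes and file names are assumptions.

# Hedged sketch: serializing a socio-semantic network for
# interactive exploration (e.g. Gephi or a d3.js front end).
import json
import networkx as nx
from networkx.readwrite import json_graph

G = nx.Graph()
G.add_edge("France", "binding commitments", weight=3)
G.add_edge("EU", "binding commitments", weight=5)

nx.write_gexf(G, "network.gexf")       # openable in Gephi
with open("network.json", "w") as f:   # consumable by d3.js etc.
    json.dump(json_graph.node_link_data(G), f)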
LIST OF SCIENTIFIC WORKS THAT HAVE USED CORTEXT MANAGER
(Sources: Google Scholar, HAL, Scopus, Web of Science, and search engines)
We are grateful that you have found CorText Manager useful. Over the years, more than 360 authors have trusted CorText for their publicly accessible analyses, which represents a little less than 10% of the CorText Manager user community. So, thank you!
Below are listed the most active CorText Manager authors over the past four years.
Top authors
Jiming Hu
Aristotle T. Ubando
Allison Loconto
Alvin B. Culaba
Wei-Hsin Chen
Hongxiu Li
Sophie Le Perchec
Marla C. Maniquiz-Redillas
Christophe Gauld
Cecilia Rikap
What types of documents?
76 journal articles
31 conference proceedings
12 PhD theses
11 book chapters
11 reports
8 online articles
6 master's theses
5 conference papers (not in proceedings)
4 miscellaneous
2 workshop papers
1 book