Synopsis of some recent work and thoughts. I welcome any feedback, critiques, etc.

WYSIWII (What You See Is What It Is):
Challenges in approximating relational data from texts and validating results
Jana Diesner, September 21st, 2008

It has become fast, cheap, and easy to gather and store large amounts of mainly unstructured text data, such as scientific and news articles, legal and governmental documents, web pages, emails, and interviews. These data may represent relational information, such as who says what to whom through what channel (part of Lasswell's (1948) formula). While such information is sometimes referred to as social network data, it is herein called relational data to avoid overpromising or mislabeling data structures, which might affect the selection of theory, methods, and metrics for further work. Transforming sequential, natural language text data into non-linear, relational data and further analyzing the resulting networks has helped people to answer questions like:

  1. How do ideas or memes emerge, spread and vanish on the internet?
  2. Who are the key players in socio-technical systems, where are they located, what tasks are they assigned to, what resources or knowledge can they access? What benefits or risks may result from the observed network configuration for a certain system (Carley, Diesner, Reminga, & Tsvetovat, 2007)?

Scholars from a broad range of disciplines have been engaging in the extraction of relational data from texts (relation extraction) by deploying, advancing, and developing a plethora of theories, methods, and metrics. In the simplest case, relation extraction involves the identification of nodes and edges. Extended approaches also facilitate the classification of nodes (multi-mode networks) and/or edges (multi-relational networks) according to a pre-defined, user-defined, or data-induced categorization schema (ontology) (McCallum, 2005). I consider the resulting relational data to be concise reductions and abstractions of the original material, and I argue that this type of transformation lets us communicate more clearly those aspects of the data that reflect the linkages among data points of interest (who says what to whom…). My research aims to span the boundary between computational linguistics and network analysis by interfacing probabilistic relation extraction with network analytical investigations of the distilled data. My goal with this work is to contribute to a better understanding of the co-evolution and interplay of the semantics and mechanics of real-world networks, or put more simply: how does what people say relate to what they do? If we do not know in advance what the relevant instances of node and edge classes in the text data are, extracting them with non-deterministic techniques involves two levels of uncertainty, which lie in the methods and the data:
First and in all honesty, computational techniques for relation extraction only approximate network structure: what we reveal by using cutting-edge techniques and tools is not the truth, but an approximation of what is represented in the data. More specifically, one approach to this task is constructing or applying models μ that, for pairs of sequences (x, y), where x holds the tokens of a sequence and y the corresponding labels for a feature of interest, predict a label sequence y = μ(x) for any x, including new data (on the general approach see Dietterich, 2002; for an example see Diesner & Carley, 2008). Applying μ to text data may not return the correct solution, but rather the most likely one given the model, the method, and the data. What does that imply? First, the well-informed design and application of supervised learners is crucial for the controlled extraction of network structure. Second, the probabilistic computational solutions (tools, data, or metrics) handed over to analysts and other end-users carry along decisions already made by the constructors of the transformation engines, and some of these decisions may impact what people observe in their network data. Do these impacts matter? To address this question, people need to investigate the sensitivity of supervised machine learning techniques with respect to their impact on relational structures distilled from texts, something that we are currently working on in the CASOS lab.
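To make the idea of y = μ(x) concrete, here is a minimal sketch of a sequence labeler that returns the most likely label sequence under a toy model. Everything in it is an assumption for illustration only: the two labels (`O`, `AGENT`), the hand-set probabilities, the capitalization-based emission rule, and the example sentence. It is not the conditional random field model from Diesner & Carley (2008); it only shows Viterbi-style decoding, in which the output is the best-scoring sequence given the model rather than a verified truth.

```python
import math

LABELS = ["O", "AGENT"]

# Hypothetical, hand-set probabilities; a real model would learn these from data.
START = {"O": math.log(0.8), "AGENT": math.log(0.2)}
TRANS = {
    ("O", "O"): math.log(0.8), ("O", "AGENT"): math.log(0.2),
    ("AGENT", "O"): math.log(0.6), ("AGENT", "AGENT"): math.log(0.4),
}

def emit(label, token):
    """Toy emission score: capitalized tokens are assumed more likely to be agents."""
    p_agent = 0.9 if token[:1].isupper() else 0.05
    return math.log(p_agent if label == "AGENT" else 1.0 - p_agent)

def mu(tokens):
    """Return the most likely label sequence under the toy model (Viterbi decoding)."""
    if not tokens:
        return []
    # best[i][label] = (log-score of the best path ending in label at i, previous label)
    best = [{lab: (START[lab] + emit(lab, tokens[0]), None) for lab in LABELS}]
    for tok in tokens[1:]:
        row = {}
        for lab in LABELS:
            prev, score = max(
                ((p, best[-1][p][0] + TRANS[(p, lab)]) for p in LABELS),
                key=lambda pair: pair[1],
            )
            row[lab] = (score + emit(lab, tok), prev)
        best.append(row)
    # Backtrack from the best final label to recover the full path.
    lab = max(LABELS, key=lambda l: best[-1][l][0])
    path = [lab]
    for row in reversed(best[1:]):
        lab = row[lab][1]
        path.append(lab)
    return path[::-1]

tokens = "Yesterday Alice emailed Bob about the merger".split()
labels = mu(tokens)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Under these toy settings the sentence-initial word "Yesterday" is tagged as an agent alongside "Alice" and "Bob", simply because it is capitalized. The decoder has returned the most likely sequence given the model, not the correct one, and any network built from these labels would inherit that extra node.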
Second, one frequently asked question is: How can relational data extracted from texts be verified or validated in the sense of contrasting it against ground truth (the true, underlying network)? That is a crucial question. The first part of my answer, so far, is WYSIWII (What You See Is What It Is) for the case of non-deterministic relation extraction. By using this term in this context I refer to relational data for which validation against ground truth is infeasible, and to networks that are nothing more than the gathered data traces. Using this definition, WYSIWII applies, for example, to:

  1. Covert networks (e.g. terrorist groups, money laundering systems, drug networks)
  2. Ephemeral networks (e.g. bankrupt corporations, the flow of people through a street)
  3. Networks that simply lack a true, underlying social network (e.g. blogs)

The second part of my answer to the validation question is that WYSIWIIs require rigorous investigation of the behavior of our transformation engines, so that all involved parties can understand the degree to which a technique or tool is loaded with assumptions and decisions that impact outcomes. In summary, some of the challenges that I see with the process of harvesting texts for instances of certain node and edge classes are:

  1. Identifying the set and interdependencies of variables that impact the resulting network structure.
  2. Determining the strength and boundaries of this impact.
  3. Communicating non-deterministic behavior of a relation extraction mechanism to the end-user in an understandable fashion.
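The first two challenges can be illustrated with a small sensitivity check. The sketch below is only an illustration under stated assumptions: the co-occurrence extractor, its `window` parameter, the node list, and the example sentence are all hypothetical stand-ins for a real transformation engine. It varies one extraction setting and reports how much the resulting edge set overlaps with a baseline, using Jaccard similarity as one possible measure.

```python
from itertools import combinations

def extract_edges(tokens, nodes, window):
    """Link two distinct node terms when they occur within `window` tokens of each other."""
    positions = [(i, t) for i, t in enumerate(tokens) if t in nodes]
    edges = set()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i <= window:
            edges.add(frozenset((a, b)))  # undirected edge
    return edges

def jaccard(e1, e2):
    """Share of edges that two extracted networks have in common."""
    if not e1 and not e2:
        return 1.0
    return len(e1 & e2) / len(e1 | e2)

# Hypothetical example text and node list.
text = "Alice emailed Bob about the merger before Carol briefed Dave and Bob"
tokens = text.split()
nodes = {"Alice", "Bob", "Carol", "Dave"}

baseline = extract_edges(tokens, nodes, window=2)
for window in (2, 4, 8):
    edges = extract_edges(tokens, nodes, window)
    print(f"window={window} edges={len(edges)} "
          f"overlap_with_baseline={jaccard(baseline, edges):.2f}")
```

Even in this tiny example, widening the window adds edges that the baseline setting does not produce, so the observed network depends on a decision made inside the extractor. Reporting such overlap figures next to the extracted network is one simple way to convey this dependence to an end-user.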

Carley, K. M., Diesner, J., Reminga, J., & Tsvetovat, M. (2007). Toward an interoperable dynamic network analysis toolkit. Decision Support Systems. Special Issue Cyberinfrastructure for Homeland Security, 43(4), 1324-1347.
Diesner, J., & Carley, K. M. (2008). Conditional Random Fields for Entity Extraction and Ontological Text Coding. Journal of Computational and Mathematical Organization Theory, 14, 248-262.
Dietterich, T. G. (2002). Machine Learning for Sequential Data: A Review. Paper presented at the Joint IAPR International Workshops SSPR 2002 and SPR 2002, August 6-9, 2002, Windsor, Ontario, Canada.
Lasswell, H. (1948). The structure and function of communication in society. In L. Bryson (Ed.), The Communication of Ideas (pp. 37-51). New York: Institute for Religious and Social Studies.
McCallum, A. (2005). Information extraction: distilling structured data from unstructured text. ACM Queue, 3(9), 48-57.

very basic updates...

I am co-teaching an interdisciplinary undergraduate course on Social Networks at CMU in Spring '10.
