robin_anne_reid (robin_anne_reid) wrote,

Internet Corpus Update

I haven't posted in a while--our house flooded in April, and the summer was made even busier by dealing with ALL the stuff (new flooring installed in July!). But I've not stopped working on my projects relating to fandom, imbroglios, and digital methodologies.

I have been working on a number of smaller projects the past five years that focus on the rhetorics (written and visual) of racism in debates that have been occurring in online media fandoms located primarily on the social networking sites of LiveJournal and Dreamwidth. The larger project involves collaborative work with colleagues at my university in computer science and linguistics and is the subject of a grant we submitted earlier this year to the National Endowment for the Humanities' (NEH) Digital Humanities program. This grant was not funded, but next fall (2012) we will submit a version reflecting the feedback we received, as well as submitting grants relating to the same project to the National Science Foundation (NSF).
This interdisciplinary digital project will focus on a collaborative effort between computer scientists and humanities scholars to apply new digital technologies to the search, retrieval, and categorization of written and graphic materials on the internet: an internet corpus. Our corpus in the early years will focus on online fandom communities. Since fans were early adopters of new technologies, their interactions can show how people employ new technologies to participate in political and social action relating to identity.

While my original projects focused on constructions of race and racism, the database that results from our work can be analyzed from any of a variety of critical methodologies, and can also provide the basis for later, intersectional work. The majority of scholarship on how people use the internet has tended to ignore the medium of the internet as a context which shapes communication and to focus primarily on it as a tool for accessing subjects to study.
Social science methodologies identify user demographics; focus groups are asked for their perceptions; the "digital divide" has been analyzed, as have consumer buying patterns. In the majority of computer science, social science, and business scholarship, internet users are treated as objects of study in ways that reflect traditional disciplinary assumptions. The growth of Web 2.0, which has seen an explosion of user-generated content published not solely for marketing or educational purposes, shows just how great the gaps in past scholarship are.
Humanities research concerned with topics such as the mapping of cultural constructions of individual and group identities exists, but the work has rarely been informed by the inexhaustible supply of freely available online texts by internet users. Scholars are daunted by the sheer amount of information available on the internet and the ways it can be gathered. This practical problem is crucial: How can we aggregate the huge variety of texts created by groups of users on the internet? How can this very messy data be processed, i.e. cleaned and converted, so that standard corpus tools can be used to analyze the textual messages?
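To make the "cleaned and converted" step a little more concrete, here is a minimal sketch in Python of turning a saved web page into plain text that a corpus tool could read. It is only an illustration under stated assumptions, not the tool our computer science colleagues are building: the class name `TextExtractor`, the function `clean_post`, and the sample HTML are all made up for this example, and real journal pages would need far more handling (comment threads, icons, cut tags, and so on).

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, skipping script and style blocks."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "br", "div"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            # Treat block-level tags as line breaks in the output.
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_post(raw_html):
    """Return the visible text of raw_html, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical sample input, invented for this sketch.
sample = (
    "<div><p>Hello &amp; <b>welcome</b></p>"
    "<script>var x = 1;</script><p>Second   line</p></div>"
)
print(clean_post(sample))
```

The output of a step like this could be saved as a .txt file, which is the format most standard corpus tools expect as input.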

My part of the larger project is primarily that of a fan studies expert who has done a number of pilot presentations drawing on sociolinguistic methodologies applied to fairly small (in internet terms) amounts of text. This work is necessary to generate terminology for tagging and parsing that can be integrated into the programs developed by the computer scientists. My colleagues (a psychologist and linguists) bring their own disciplinary knowledge to bear in a collaborative process that expands the possible research questions that can be considered. I'm also trying to organize a series of presentations/workshops on our campus concerning digital methodologies.

My scholarship has focused on analyzing discourses of minority and majority fans in online communities, specifically fandom communities. The first conflicts I became aware of took place during 2006 and 2007, occurring in specific media fandom communities (StarGate: Atlantis, Dr. Who, and Life On Mars); one Harry Potter community (Daily_Deviants); and an annual gift exchange focusing on rare fandoms, Yuletide. Fans hotly debated racial and class stereotypes in fan fiction, racial and class stereotypes in the canon texts of the fandom, racist terminology being used by fans that embodied histories and etymology not widely known outside the United States, and, finally, ignorance of Jewish religious practices. Additional levels of conflict occurred because of the international demographic of online fandom, with debates over the history and contemporary racial attitudes in the United States compared to the United Kingdom (Dr. Who is a British-produced show) as well as other countries (Canada, Australia), and disagreements on anti-racist strategies and practices, including the issue of what "tone" can or should be taken when noting the existence of racist language, imagery, or characterizations.

In all cases, a single event (a fic, a post, an announcement) initiated major debate, news of which rapidly moved outside the individual fandom communities because of cross-fandom communities dedicated to posting news and linking to posts across fandoms. While many white fans see these debates as something new and unusual in their fandoms, often insisting that the conflicts are "harshing their squee," there is widespread agreement among some fans of color that the most recent events were simply the latest in an on-going pattern of white privilege, including a range of racist behaviors that institutionalized marginalization and discrimination against fans of color. In this view, the problems had always been there but were only becoming visible on the internet in ways that could not be seen in the offline con cultures or, alternately, in the earlier period of online fan culture, which was predominantly book oriented and existed on listservs and archives that tended to act as centralizing foci of online communication in a fandom.

LiveJournal (started in 1999), the later iterations of it (InsaneJournal), and the most recent fork (Dreamwidth) changed fandom structure from centralized to a web of individual journals. While social networks allow communities to form around specific fan texts, or specific types of fan production, fans who were active in the listservs note distinct differences between having a few listservs for a fandom, listservs which were moderated by a fan or a group, and the current system, which makes it easy for someone to leave a community and start another, or multiple others. With less centralized authority, discussions move rapidly across a number of individual journals, branching off along the way. While there are lists (often called "Linkspams," made by fandom newsletters or individual fans) for various debates, there is no guarantee that all posts relevant to the discussion were linked. While discussions about race and racism in fandom and in the book and media texts are not new, the connections between fans of color and anti-racist actions in LJ can be relatively easily viewed (compared to earlier print 'zines and listservs which may have been locked and are not easily accessed). In the last few years, the migration of fans to other social networking sites (Twitter, Facebook, Tumblr) has added additional electronic spaces to fandom networks that raise new challenges to track how discussions migrate across platforms.

In my work, I define "racist" and "racism" as the institutionalized and ideological patterns of behaviors that have been established for generations in the United States and that affect all people born within the culture. While online fandoms are international in nature, the predominance of US fans as well as my own situatedness in the US culture leads me to focus primarily on the constructions of race in mainstream American culture; the specific outgrowth of Racefail that I focus on in this paper is particularly relevant to highlighting elements of the white American culture which constructs some groups of "immigrants" in racist ways.

The immediate focus of the larger work is the debate known as Racefail 09, although once I decided to do a corpus, the project grew to include a wider range of discussions and might continue to grow in future to consider a range of internet communities as well as online media fandom. The methodology I'm using in the project is a stylistics corpus: A corpus is basically a searchable database of text which has been annotated, or marked up in XML. That means that certain elements of the text have been tagged and can be quantified. Stylistics is the application of linguistic methods to literary texts, although for this project I'm extending the definition to any written text. Linguistics corpora are databases of various transcribed texts (for linguistic study, it’s often collections of spoken utterances around dialects, or a specific language variety, i.e. not copyrighted material). Many of these are huge (millions of words), but there are also more specialized, or narrow, corpora. Linguistic corpora are large collections that are often available online, or can be purchased for a fee.
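For readers who have not seen an annotated corpus, here is a small sketch in Python of what "tagged and can be quantified" can mean in practice. Everything in it is invented for illustration: the element names (`corpus`, `post`, `seg`), the `type` labels ("apology", "claim"), and the sample sentences are hypothetical and are not the annotation scheme used in my pilot projects or the format the UAM Corpus Tool produces.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotated sample; tag names and labels are assumptions.
SAMPLE = """<corpus>
  <post id="1" author="fan_a">
    <seg type="apology">I am sorry for what I wrote.</seg>
    <seg type="claim">The story relies on a stereotype.</seg>
  </post>
  <post id="2" author="fan_b">
    <seg type="claim">This pattern is not new.</seg>
  </post>
</corpus>"""


def count_annotations(xml_text):
    """Count how many tagged segments of each type the corpus contains."""
    root = ET.fromstring(xml_text)
    return Counter(seg.get("type") for seg in root.iter("seg"))


def segments_of_type(xml_text, seg_type):
    """Return (post id, segment text) pairs for one annotation type."""
    root = ET.fromstring(xml_text)
    return [
        (post.get("id"), seg.text)
        for post in root.iter("post")
        for seg in post.iter("seg")
        if seg.get("type") == seg_type
    ]


print(count_annotations(SAMPLE))
print(segments_of_type(SAMPLE, "apology"))
```

Once text is marked up this way, counting a feature across thousands of posts is the same operation as counting it across two, which is the point of building the corpus in the first place.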

The methodology analyzes patterns in writing rather than intent, resulting in a pattern analysis of aggregated data. My pilot projects involved laborious copying of publicly available materials from internet sites to .txt files and hand annotation of elements through the use of a free program (the UAM Corpus Tool). Marking each textual element by hand can take a hundred or more hours (depending on how many layers, i.e. specific types of textual element, I mark). The goal in our collaborative project is to develop digital tools that can replace some of the time-consuming and human error-prone preparation of text.

Mapping Racial Constructions and Identities on the Internet: Creating a Conceptual Search Engine and Multimodal Corpus for Humanities Research

Keyword-based search engines retrieve millions of internet pages but many are not relevant to the requested search. New conceptual search engines increase the percentage of relevant documents retrieved and allow humanities scholars to analyze texts in-depth. This Level I grant will fund an innovative collaboration between computer scientists and humanities scholars to apply new digital technologies to the search, retrieval, and categorization of texts. The grant will support an on-site workshop and conference and the staff needed to create a conceptual search engine alpha-level prototype used in conjunction with a multimodal corpus (searchable database of spoken and written discourse). Our corpus will focus on online fandom communities as venues enabling new forms of participatory democracy. Fan groups were early adopters of new technologies, and their interactions show how people employ new technologies to participate in political and social action relating to race and identity.


This grant was to fund a group project between me, two linguists in my department, and a computer science professor (his main project is conceptual search engines).
We were not funded. Given that it was a fairly quickly assembled proposal with two of the people involved being FT administrators, and that the majority of first-time grant proposals are not funded, I was not surprised. We got some excellent feedback from the readers' reports and will be revising, though possibly with a different group and a different program. See below!

Title of Project: Identities & Imbroglios: Conflicts in Internet Fandoms

This project focuses on the space where language, identity, and the internet intersect by melding the quantitative methods of Psychology with the linguistic and rhetorical methodologies of English to analyze discourses of minority and majority fans in online communities. Through a partnership with [name redacted] and students in his computer science laboratory we will develop a web spider to crawl the web and collect millions of written texts of interactions between members of online fan groups. These texts will then be analyzed by researchers (the PIs, graduate and undergraduate students) from both English and Psychology to explore online fan group behavior. This study is important to expand the scientific knowledge of group behavior in online computer-mediated environments.


We were given a grant of $15,000 which is being used to pay the computer science graduate students to design our web spider. There are a number of web spiders out there (I've heard about commercial ones; I imagine there are open source ones as well), but none of them are designed to do what we want to do. This project was a collaboration between me and a Psychology faculty member (he could not participate in the Digital Humanities one above, because NEH does not allow any social science methodologies!), and he has a graduate student in his department who is doing his work on how communities form on the internet, and what happens to them over time.
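For anyone wondering what a web spider does at its simplest, here is a rough sketch in Python. It is not the spider our graduate students are designing; it only shows the basic idea of starting from one post, following links, and staying inside a chosen set of sites. The names `crawl`, `LinkExtractor`, and `FAKE_SITE`, and the example.org addresses, are all hypothetical. The `fetch` function is passed in as a parameter so the sketch runs without touching the network; a real spider would also need to respect robots.txt, rate limits, and locked or private posts.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, allowed_hosts, max_pages=100):
    """Breadth-first crawl from start_url, staying within allowed_hosts.

    fetch(url) must return the page HTML as a string, or None on failure.
    Returns a dict mapping each visited URL to its HTML.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        pages[url] = page
        extractor = LinkExtractor()
        extractor.feed(page)
        for href in extractor.links:
            # Resolve relative links and drop #fragment parts.
            target = urljoin(url, href).split("#")[0]
            if urlparse(target).hostname in allowed_hosts and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Hypothetical in-memory "site" standing in for real journals.
FAKE_SITE = {
    "http://a.example.org/post1": (
        '<a href="/post2">next</a> '
        '<a href="http://b.example.org/reply">reply</a> '
        '<a href="http://elsewhere.example.org/x">off-site</a>'
    ),
    "http://a.example.org/post2": '<a href="post1#c1">back</a>',
    "http://b.example.org/reply": "no links here",
}

pages = crawl(
    "http://a.example.org/post1",
    FAKE_SITE.get,
    allowed_hosts={"a.example.org", "b.example.org"},
)
print(list(pages))
```

The hard part of our project is everything this sketch leaves out: deciding which linked posts belong to a given discussion, and storing what is collected in a form that can be annotated later.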
We met yesterday with the Library Director and some of the librarians (digital archives, archivist) and one of the Technology staff to discuss the possibility of setting up an Institutional Repository.

Wikipedia's entry on Institutional Repository looks fairly accurate (i.e. does not contradict anything our Library Director told us yesterday!):

The IR here will be much larger than our project, but our project sort of initiated the need to set this up (as with many things, as a small rural university, we're rather behind the curve, but while the Research 1 universities have funding and staff, there are smaller regional universities doing more than we have).

The important elements of an IR are: meeting data management and open access standards (since we plan to write federal grants, we need to have a data management plan and meet certain open access requirements, if we're funded). I'd want that open access anyway because I believe that universities should make research (and other materials) freely available.

The 'product' of our grant will be a database that will be in two parts: a publicly accessible part (accessed through our university library interface), and a part accessible only to researchers (at first, here; I hope, later, to scholars at other institutions).

This process will be a multi-year one--our pilot project with the library will take a year. During that time, we'll be submitting other grants (and while we'll continue to do this work, without funding, it will be a lot slower). We'll be presenting on various 'parts' of the project (I'm talking at the 2012 MLA in January, for example!).

The most exciting aspect is the interdisciplinary work (putting a psychologist, an English teacher, and two computer science majors in a room leads to lots of confusion--we've been meeting every Friday since classes started, and yesterday, I think, we had a major breakthrough of understanding). The linguists and I have closer ties, but they haven't been able to make all the meetings (other duties). The recent inclusion of our digital archivist/librarian and a systems engineer tech person seemed to push things to a whole new level of excitement: that is, as far as they knew, the sort of project that we're doing is innovative in terms of the technology (i.e. the nature of our spider-bot) and in terms of database and collections issues. This aspect will grow as we build the repository, and are able to invite others to get involved, and see what results.

Some articles I found that are going to be useful, I think:

Sebastian Hoffmann. "Processing Internet-derived Text--Creating a Corpus of Usenet Messages." Literary and Linguistic Computing. 22.2. 2007. 151-165.


Jonas Sjöbergh. "The Internet as a Normative Corpus: Grammar Checking with a Search Engine."

This entry was originally posted at Comments are enabled at both sites but anonymous comments will be screened for moderation.
