Computer program detects author gender
Simple algorithm suggests words and syntax bear sex and genre stamp.
18 July 2003
PHILIP BALL


A. S. Byatt confuses the computer; will it see through George Eliot?




A new computer program can tell whether a book was written by a man or a woman. The simple scan of key words and syntax is around 80% accurate on both fiction and non-fiction [1,2].

The program's success seems to confirm the stereotypical perception of differences in male and female language use. Crudely put, men talk more about objects, and women more about relationships.

Female writers use more pronouns (I, you, she, their, myself), say the program's developers, Moshe Koppel of Bar-Ilan University in Ramat Gan, Israel, and colleagues. Males prefer words that identify or determine nouns (a, the, that) and words that quantify them (one, two, more).
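
As a rough illustration of what such a scan might measure, the sketch below counts how often each class of word appears in a text. The word lists are only the examples quoted above, and the function name feature_rates is made up for this sketch; the researchers' actual feature set is not described here.

```python
import re
from collections import Counter

# Word lists are just the examples quoted in the article; they are
# illustrative assumptions, not the researchers' real feature set.
PRONOUNS = {"i", "you", "she", "their", "myself"}
DETERMINERS = {"a", "the", "that"}
QUANTIFIERS = {"one", "two", "more"}


def feature_rates(text):
    """Return the share of tokens that fall into each word class."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    counts = Counter(tokens)  # missing words count as 0
    return {
        "pronouns": sum(counts[w] for w in PRONOUNS) / total,
        "determiners": sum(counts[w] for w in DETERMINERS) / total,
        "quantifiers": sum(counts[w] for w in QUANTIFIERS) / total,
    }


# Hypothetical usage on a single sentence.
print(feature_rates("I think you know the answer"))
```

On this picture, a text with a relatively high pronoun rate would lean towards the pattern the researchers associate with female authors, and one with high determiner and quantifier rates towards the male pattern.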

So this article has probably already betrayed its author as male through sentences such as this one: there is a prevalence of plural pronouns (they, them), indicating the male tendency to categorize rather than personalize.

If I were female, the researchers imply, I'd be more likely to write sentences like this, which assume that you and I share common knowledge or engage us in a direct relationship. These differing styles have previously been called 'informational' and 'involved', respectively.

Koppel and colleagues trained their algorithm on a few test cases to identify the most prevalent fingerprints of gender and of fiction and non-fiction. They then set it searching for these fingerprints in 566 English-language works in a variety of genres, ranging from A Guide to Prague to A. S. Byatt's novel Possession - which, intriguingly, the program misclassified by gender, along with Kazuo Ishiguro's The Remains of the Day.
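
The train-then-classify idea can be sketched with a perceptron, a very simple linear classifier used here purely as a stand-in: the article does not say which learning method the team used. The feature vectors are made-up toy numbers, and the function names are hypothetical.

```python
def train_perceptron(samples, epochs=20):
    """Learn weights from (feature_vector, label) pairs, label in {+1, -1}."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in samples:
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: nudge the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the boundary x falls."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


# Made-up toy data: (pronoun rate, determiner rate) in percent.
# +1 stands for the "involved" style, -1 for the "informational" style.
training_set = [
    ([9.0, 4.0], 1),
    ([8.0, 5.0], 1),
    ([3.0, 10.0], -1),
    ([4.0, 9.0], -1),
]
weights, bias = train_perceptron(training_set)

# Classify an unseen, equally made-up text profile.
print(predict(weights, bias, [7.0, 3.0]))
```

A real system would use many more features and far more training texts; the point is only that a handful of labelled examples fixes a decision boundary, which is then applied to works the program has not seen.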

Strikingly, the distinctions between male and female writers are much the same as those that, even more clearly, differentiate non-fiction and fiction. The program can tell these two genres apart with 98% accuracy. This is perhaps unsurprising, given that non-fiction is more informational and fiction more involved.

Most of the works studied were published after 1975. The Israeli team now intends to probe whether the differences extend further back in time - and so whether George Eliot was wasting her time disguising herself with a male nom de plume - and also whether they occur in other languages.


References
1. Koppel, M., Argamon, S. & Shimoni, A. R. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, in the press (2003).
2. Argamon, S., Koppel, M., Fine, J. & Shimoni, A. R. Gender, genre, and writing style in formal written texts. Text, in the press (2003).



http://www.nature.com/nsu/030714/030714-13.html