Processing a book in Python

Fayd

Diamond Member
Jun 28, 2001
7,970
2
76
www.manwhoring.com
I am trying to build a social network of the characters in a book. The edges of the graph are defined by coincidence (appearing at the same point) in the text. I'm trying to split the book into chapter-lets (smaller than a chapter, i.e. break on chapter, then break every 5-6 paragraphs). I'll then run a regex search on each chapter-let to establish whether a character is there (return 1 for yes, 0 for no).

I still have to work this out, but I'm thinking of assembling the character incidence per chapter-let into a matrix X, so that the lower triangle of XX' is the matrix of character coincidence, which gives the edge weights for my network (like I said, I'm working it out). The incidence part is fairly easy, just bool(re.search()). I need to create custom rules for each character, which will be somewhat painful.

Edit: just confirmed my suspicions, XX' will do what I wanted. Now I just need to establish character coincidence.
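A minimal sketch of the XX' idea in plain Python, assuming one row per character and one column per chapter-let. The character names and the 0/1 values are made up for illustration.

```python
# Rows are characters, columns are chapter-lets; 1 means the character appears there.
# Names and values are invented for this example.
characters = ["Dantes", "Villefort", "Mercedes"]
X = [
    [1, 1, 0, 1],  # Dantes
    [0, 1, 1, 1],  # Villefort
    [1, 0, 0, 1],  # Mercedes
]

def cooccurrence(X):
    """Return X times its transpose: entry [i][j] counts chapter-lets shared by i and j."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in X] for row_i in X]

C = cooccurrence(X)

# Keep only the strict lower triangle: one weight per unordered pair of characters.
edges = {
    (characters[i], characters[j]): C[i][j]
    for i in range(len(characters))
    for j in range(i)
    if C[i][j] > 0
}
```

The diagonal of C is just each character's own appearance count, which is why only the entries below it are used as edge weights.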

The regex searches themselves will have some problems. Some characters' names are substrings of other characters' names, e.g. Villefort is a different character from Madame Villefort. Other characters go by several different names or are referred to in a variety of ways. The way I intend to get around this is to search for the longest string first and then delete it from the text. I'll have to read up on text cleaning to figure out how to do this.
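One way the longest-first idea could look, as a sketch only. The `aliases` dictionary and the alias "the procureur" are hypothetical placeholders; the real lists would have to come from reading the book.

```python
import re

# Hypothetical alias lists, for illustration only.
aliases = {
    "Madame Villefort": ["Madame Villefort"],
    "Villefort": ["Villefort", "the procureur"],
}

def characters_present(text, aliases):
    """Return the set of characters found, matching longer names before shorter ones."""
    # Flatten to (alias, character) pairs, longest alias first.
    pairs = sorted(
        ((alias, name) for name, names in aliases.items() for alias in names),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    found = set()
    for alias, name in pairs:
        pattern = r"\b" + re.escape(alias) + r"\b"
        if re.search(pattern, text):
            found.add(name)
            # Blank out the matched name so its substrings cannot match later.
            text = re.sub(pattern, " ", text)
    return found
```

Because the deletion happens on a local copy of the text, the original chapter-let is left untouched for any other processing.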

The first problem I have is that the book I'm looking at is VERY long (700 pages, around 120 chapters), and some of these chapters are themselves very long. I want to split some of the chapters by length into smaller chapter-lets, but I am unsure how to do this.

The initial read-in of the full text gives me a list of chapters when I do readin.split('Chapter'), but I don't know how to then cut each chapter down by length into segments. Anyone have a suggestion?


Edit: Hmm... chapters are between 1500 and 9000 words long. It would be nice to separate them into groups of paragraphs. I can count paragraphs per chapter by counting the number of double line breaks '\n\n'. At worst, I could just make every paragraph its own cell, though that would make a rather large matrix, like 200 x 20000 or more.
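A rough sketch of the two-level split, assuming chapter headings sit on their own lines starting with the word "Chapter" and paragraphs are separated by blank lines. Both function names and the default of 5 paragraphs per chapter-let are illustrative choices.

```python
import re

def split_chapters(book_text):
    """Split on lines that start with 'Chapter', dropping anything before the first one."""
    parts = re.split(r"(?m)^Chapter\b.*$", book_text)
    return [part.strip() for part in parts[1:]]

def chunk_paragraphs(chapter_text, per_chunk=5):
    """Group a chapter's paragraphs into chapter-lets of at most per_chunk paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", chapter_text) if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + per_chunk])
        for i in range(0, len(paragraphs), per_chunk)
    ]
```

Anchoring the pattern to the start of a line avoids splitting on the word "Chapter" when it appears mid-sentence, which a bare readin.split('Chapter') would do.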
 

Markbnj

Elite Member
Moderator Emeritus
Moderator
Sep 16, 2005
15,682
14
81
www.markbetz.net
Interesting problem, man. One of the first things that comes to mind is that you can't always take the linear sequence of chapters to record forward motion in time, and the mention of a character in a chapter doesn't mean the character is present at that moment in that timeline. But perhaps all you mean by "coincidence" is inhabiting the same chunk of text.

If that's the case then I would first look for scene breaks within a chapter. These are usually denoted by an additional blank line, or a symbol string, or a header, or something else that is always consistent within a text. You know at least that most scenes have a strong correlation to a single place and time.
 

Fayd


Yes, you are correct that the coincidence I am referring to is two characters being in the same scene together. I realize that this is somewhat limited, since going by the text alone a person may be merely mentioned or thought of by another character... there are several shortcomings to this method. I rationalize it like this: if the character is mentioned or thought of in the scene, then he has an effect on the scene; ergo he is tied to the scene. Admittedly this would tend to inflate the importance of the main character, but there's not much we can do about that. One way I was thinking of getting around this is a log transform of the scene coincidence counts, because otherwise the titular character is going to be *extremely* popular.
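A small sketch of the log transform, assuming the raw edge weights are stored in a dictionary keyed by character pair. Using log(1 + count) rather than log(count) is my assumption so that a count of 0 maps to 0 instead of being undefined.

```python
import math

def damp_weights(edge_counts):
    """Map raw coincidence counts to log(1 + count) so large counts are compressed."""
    return {pair: math.log1p(count) for pair, count in edge_counts.items()}
```

With this, a pair that shares 99 scenes ends up with a weight of about 4.6 rather than 99, so the main character's edges no longer dwarf everyone else's.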

When I was theorizing about this initially, one thing I thought of was going through and deleting everything between quotation marks, i.e. getting rid of all spoken text so that I don't have to worry about characters being mentioned in the third person. The problem with this method is that characters may still be thought of in the third person, as the book does a lot of inner monologue. I can't get rid of that.
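A sketch of stripping spoken text, assuming dialogue is wrapped in straight or curly double quotes and that each quotation closes within its own paragraph. Quotations that run across several paragraphs without a closing mark would slip through.

```python
import re

# Matches straight-quoted or curly-quoted spans that do not cross a line break.
QUOTED = re.compile(r'"[^"\n]*"|\u201c[^\u201d\n]*\u201d')

def strip_dialogue(paragraph):
    """Replace every quoted span with a space, leaving only the narration."""
    return QUOTED.sub(" ", paragraph)
```

For example, `strip_dialogue('"Where is Villefort?" asked Dantes.')` keeps "asked Dantes." and drops the mention of Villefort.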

Unfortunately, the text I'm working with lacks scene breaks. But in reading through it, one thing I am seeing is that there are two types of paragraphs: conversational and exposition. These are distinguished chiefly by length, with conversational ones being very short (1-2 sentences, since each change of speaker starts a new paragraph).

If I could break on paragraph length, with each sufficiently large paragraph denoting its own scene and all the conversational paragraphs in between the large ones stuffed together, then that would be sufficient.
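A sketch of that length-based split, assuming the paragraphs are already in a list. The 60-word threshold is a placeholder; the real cutoff would need tuning against the book.

```python
def split_scenes(paragraphs, min_exposition_words=60):
    """Long paragraphs become their own scene; runs of short ones are merged."""
    scenes = []
    dialogue_run = []
    for paragraph in paragraphs:
        if len(paragraph.split()) >= min_exposition_words:
            # Flush any conversation collected since the last long paragraph.
            if dialogue_run:
                scenes.append("\n\n".join(dialogue_run))
                dialogue_run = []
            scenes.append(paragraph)
        else:
            dialogue_run.append(paragraph)
    # Flush a trailing run of short paragraphs.
    if dialogue_run:
        scenes.append("\n\n".join(dialogue_run))
    return scenes
```

One consequence of this rule is that two long paragraphs in a row become two separate scenes, which may or may not be what is wanted.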

Right now I have broken the text into individual paragraphs and am setting up the word matching (I've put the paragraph-splitting problem on the back burner). I'm trying to define a regex for each of the characters. Anyone know a way to look for matches of 'Villefort' that won't trigger on 'Madame Villefort'?

EDIT: The magic of lookbehind assertions has saved my ass. I still haven't figured out how to split into scenes, but I'm making progress on the other nuts and bolts. I'm going to store the regex string for every character as the value in a dictionary, then build the incidence matrix as a list of lists. I'll pass that into a pandas DataFrame, then multiply it by its transpose to get the coincidence matrix. I'm hoping all this works; I've never done matrix multiplication in Python, and I don't know if pandas data frames can do it. (I know R's data frames can.)
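A sketch of the lookbehind approach with a per-character pattern dictionary. The exact patterns are my guess at what was meant, and `incidence_matrix` is an illustrative name. Python's `re` only allows fixed-width lookbehinds, which "Madame " satisfies.

```python
import re

# One compiled pattern per character; the negative lookbehind rejects
# any "Villefort" that is immediately preceded by "Madame ".
patterns = {
    "Villefort": re.compile(r"(?<!Madame )\bVillefort\b"),
    "Madame Villefort": re.compile(r"\bMadame Villefort\b"),
}

def incidence_matrix(chunks, patterns):
    """One row per character, one entry per chunk: 1 if the pattern matches, else 0."""
    return {
        name: [int(bool(pattern.search(chunk))) for chunk in chunks]
        for name, pattern in patterns.items()
    }
```

If pandas is used for the next step, `pandas.DataFrame.from_dict(rows, orient="index")` should turn this dictionary into a frame with one row per character, and `df.dot(df.T)` then gives the coincidence matrix.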

EDIT: I found this idea: split at the first paragraph break after n characters. This should work for my purpose if I set the length somewhere between that of a conversational paragraph and an exposition paragraph.

http://stackoverflow.com/questions/...ng-in-python-after-a-specific-character-count
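My own sketch of the split-after-n-characters idea, not necessarily what the linked answer does. It assumes paragraphs are separated by '\n\n'; the function name and the 1500-character default are placeholders.

```python
def split_after(text, min_chars=1500, sep="\n\n"):
    """Cut at the first paragraph break found at least min_chars into each chunk."""
    chunks = []
    start = 0
    while start < len(text):
        cut = text.find(sep, start + min_chars)
        if cut == -1:
            # No further break far enough along: the rest is the last chunk.
            chunks.append(text[start:])
            break
        chunks.append(text[start:cut])
        start = cut + len(sep)
    return chunks
```

Every chunk except possibly the last is at least min_chars long, and no paragraph is ever cut in half.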
 