I am trying to build a social network of characters from a book. The edges of the graph are established by coincidence (characters existing at the same point in the text). I'm trying to split the book into chapterlets (smaller than a chapter, i.e. break on chapter, then break every 5-6 paragraphs), and then run a regex search on each chapterlet to establish whether a character is there (return 1 for yes, 0 for no).
I still have to work this out, but I'm thinking that if I assemble the character incidence per chapterlet into a matrix X (rows = characters, columns = chapterlets), the lower triangle of XX' gives the matrix of character coincidence, i.e. the edge weights for my network (like I said, I'm working it out). The incidence part is fairly easy, just bool(re.search()). I'll need to create custom rules for each character, which will be somewhat painful.
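For example, one chapterlet's row of X could be built like this (the patterns below are placeholders, not my final per-character rules):

    import re

    # Placeholder patterns; the real ones need custom rules per character.
    patterns = {
        'Dantes': r'Dant[eè]s',
        'Villefort': r'Villefort',
    }
    chapterlet_text = "Dantès glanced at Villefort."
    # One row of X: 1 if the character's pattern matches this chapterlet, else 0.
    row = [int(bool(re.search(p, chapterlet_text))) for p in patterns.values()]
    # row == [1, 1]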
Edit: just confirmed my suspicions, XX' will do what I wanted. Now I just need to establish character coincidence.
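In numpy that looks something like this (toy numbers, and I'm assuming rows of X are characters and columns are chapterlets, so that XX' comes out characters x characters):

    import numpy as np

    # Toy incidence matrix: rows = characters, columns = chapterlets,
    # X[i, j] = 1 if character i appears in chapterlet j.
    X = np.array([[1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 0, 1]])

    # (XX')[i, j] counts the chapterlets where characters i and j co-occur.
    # The strict lower triangle drops the diagonal (each character's own
    # appearance count) and the symmetric duplicate entries.
    coincidence = X @ X.T
    edge_weights = np.tril(coincidence, k=-1)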
The regex searches themselves will have some problems. Some characters' names are substrings of other characters' names, e.g. Villefort is a different character from Madame Villefort. Other times, characters use several different names or are referred to in a variety of ways. The way I intend to get around this is to search for the longest string first and then delete it from the text. I'll have to check out some chapter on text cleaning to figure out how to do this.
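Something like this sketch is what I have in mind (the names and variants here are purely illustrative):

    import re

    def character_presence(text, name_variants):
        # name_variants: dict mapping character -> list of name strings.
        # Check the longest names first, deleting each match so that
        # 'Villefort' can't fire on the remains of 'Madame Villefort'.
        variants = sorted(
            ((char, v) for char, vs in name_variants.items() for v in vs),
            key=lambda cv: len(cv[1]),
            reverse=True,
        )
        present = dict.fromkeys(name_variants, 0)
        for char, variant in variants:
            pattern = re.compile(re.escape(variant))
            if pattern.search(text):
                present[char] = 1
                text = pattern.sub('', text)  # delete so substrings can't rematch
        return present

    # Both characters detected, with no false positive from the substring:
    character_presence(
        "Madame Villefort entered, followed by Villefort himself.",
        {'Villefort': ['Villefort'], 'Madame Villefort': ['Madame Villefort']},
    )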
The first problem I have is that the book I'm looking at is VERY long (700 pages, around 120 chapters), and some of these chapters are themselves very long. I want to split the longer chapters by length into smaller chapterlets, but I am unsure how to do this.
The initial read-in of the full text gives a list of chapters when I do readin.split('Chapter'), but I don't know how to then cut each chapter down into segments by length. Anyone have a suggestion?
Edit: Hmm... chapters are between 1,500 and 9,000 words long. It would be nice to separate them into groups of paragraphs. I can count paragraphs per chapter by counting the number of double line breaks '\n\n'. At worst, I could just make every paragraph a separate cell, though that would make a rather large matrix, like 200 x 20000 or more.
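A rough sketch of what that splitting could look like (the filename is hypothetical, and I'm assuming paragraphs are separated by blank lines):

    # Split each chapter into chapterlets of at most 5 paragraphs.
    def chapterlets(chapter, paras_per_chunk=5):
        paragraphs = [p for p in chapter.split('\n\n') if p.strip()]
        return ['\n\n'.join(paragraphs[i:i + paras_per_chunk])
                for i in range(0, len(paragraphs), paras_per_chunk)]

    with open('book.txt') as f:  # hypothetical filename
        readin = f.read()

    chapters = readin.split('Chapter')[1:]  # drop front matter before the first heading
    segments = [seg for ch in chapters for seg in chapterlets(ch)]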