An author of no particular popularity

Jay Lake
Date: 2013-10-20 07:28
Subject: [writing|tech] Looking for a reader with some scripting chops
Recent discussions online have gotten me interested in producing a complete lexicon of my own fiction output. I'd like to find a fan with the scripting chops to feed about four or five hundred .doc and .docx files (plus a few .pdf, .html and .txt files) through a scripting engine and pull out a list of each distinct word I have ever used in my fiction output, ideally with frequency.
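For anyone tempted to take this on, the counting step could look roughly like the sketch below. It assumes the .doc, .docx, .pdf and .html files have already been converted to plain text by some other tool (that conversion is not shown), and the names `WORD_RE` and `count_words` are made up for illustration. The tokenizing rule (letters, with apostrophes or hyphens allowed inside a word) is only one possible choice.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally joined by
# internal apostrophes or hyphens ("don't", "re-invented").
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def count_words(texts):
    """Return a Counter of lowercased words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

# Stand-in for the converted manuscripts.
sample = [
    "The airship walked. The airship waited.",
    "Don't walk; the clockwork walks.",
]
lexicon = count_words(sample)
print(len(lexicon), "distinct words")
for word, n in lexicon.most_common(3):
    print(word, n)
```

Lowercasing merges "The" and "the", which is probably what a lexicon wants, though it also flattens proper nouns.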

There's a second part to this: I'd also need someone with the linguistics chops to filter that list for words which are forms of the same stem, e.g. "walk", "walked" and "walking".
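As a very rough illustration of that grouping, the sketch below strips a handful of English suffixes and buckets words by what is left. This is a deliberately crude stand-in, not real stemming: it knows nothing about irregular forms ("ran" vs. "run") or doubled consonants ("stopped"), which is exactly where the linguistics chops come in. The names `SUFFIXES`, `crude_stem` and `group_by_stem` are hypothetical.

```python
from collections import defaultdict

# Assumption: only these suffixes are stripped, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one known suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_stem(counts):
    """Map each crude stem to a dict of {word form: frequency}."""
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups[crude_stem(word)][word] = n
    return dict(groups)

counts = {"walk": 4, "walked": 2, "walking": 1, "airship": 3}
groups = group_by_stem(counts)
print(groups["walk"])
```

The output of something like this would still want a human pass, since a suffix rule will happily lump unrelated words together.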

I'm curious what my demonstrated written vocabulary is, and secondarily how many words I've coined, re-invented or backformed.

Anybody interested in grinding this for me?

User: klwilliams
Date: 2013-10-20 18:23 (UTC)
Subject: (no subject)
Sorry, I deleted my earlier comment. I'll email you.
User: blue_23
Date: 2013-10-20 19:54 (UTC)
Subject: (no subject)
Somewhat more involved metrics might be interesting. Easy examples: flagging whether each word is a dictionary word (so you can find YOUR words), and frequency by work instead of just overall (so you can see ties between different works). If you also had a chronology of the works, you could track word usage over time. Ah, and the length of each work, so you can work out not only the number of occurrences within a work but also the word's share as a % of total length, to know whether it's an often-occurring word or a rarity.
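A minimal sketch of the per-work idea, assuming each work is already plain text keyed by title and that some word list is available as a Python set. The titles, texts, `dictionary` contents and function names here are all invented for the example.

```python
import re
from collections import Counter

# Assumption: same simple tokenizing rule as a plain word count would use.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")

def per_work_stats(works):
    """Map title -> (word counts, total word count) for title -> text."""
    stats = {}
    for title, text in works.items():
        words = WORD_RE.findall(text.lower())
        stats[title] = (Counter(words), len(words))
    return stats

def share(stats, title, word):
    """Occurrences of word in one work, as a % of that work's length."""
    counts, total = stats[title]
    return 100.0 * counts[word] / total if total else 0.0

def not_in_dictionary(stats, dictionary):
    """Words used anywhere in the corpus that the dictionary lacks."""
    seen = set()
    for counts, _ in stats.values():
        seen.update(counts)
    return sorted(seen - dictionary)

works = {
    "Story A": "brass brass gears and brass voidship",
    "Story B": "gears turn and turn",
}
dictionary = {"brass", "gears", "and", "turn"}  # hypothetical word list
stats = per_work_stats(works)
print(share(stats, "Story A", "brass"))
print(not_in_dictionary(stats, dictionary))
```

Adding a date to each title would be enough to sort the works and plot a word's share over time.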

One thought - afterwards would you be interested in releasing the output with an open license?
User: blue_23
Date: 2013-10-20 23:49 (UTC)
Subject: (no subject)
Another random thought - if you are worried about your entire corpus being out in soft copy, have someone take a copy of the files and sort the words in each one alphabetically. The word counts survive, but the stories can no longer be read.

Though another thought for the scripting inclined would be to train Markov chains on them to have a Jay Lake sentence maker. Unfortunately, this is incompatible with the above, since a Markov chain learns from the original word order and sorting throws that away.
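For the curious, a first-order word-level Markov chain fits in a few lines. This is only a toy sketch: the one-line `corpus` is invented, the names `build_chain` and `generate` are hypothetical, and a real sentence maker would likely want longer contexts and proper sentence boundaries.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, max_words=12, rng=None):
    """Walk the chain from start, picking each next word at random."""
    if rng is None:
        rng = random.Random()
    word, out = start, [start]
    # Stop at the length limit or when a word has no recorded follower.
    while len(out) < max_words and chain.get(word):
        word = rng.choice(chain[word])
        out.append(word)
    return " ".join(out)

corpus = "the airship rose over the city and the city slept"
chain = build_chain(corpus)
print(generate(chain, "the", rng=random.Random(0)))
```

Because followers are stored with repeats, a word that often follows another is proportionally more likely to be picked.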
