Lakeshore
An author of no particular popularity

Jay Lake
Date: 2007-07-09 15:17
Subject: [tech] So many of you must be smarter than me
Security: Public
Tags: help, tech, writing
Is there a way to construct a wildcard search in Microsoft Word to find identical words within xx characters of each other? For example, if I have the sentence, "The house was as big as a house!", I'd like to be able to flag that.

I can see a lot of problems with the algorithm -- substrings, plurals, punctuation, different variations on a common stem (houses vs housing, frex) -- but I'm curious if it can be done at all.






Autopope
User: autopope
Date: 2007-07-09 22:35 (UTC)
Subject: (no subject)
I could do it in Perl regular expressions, but I'm not sure Turd supports the extended syntax necessary to do so; Microsoft re-invented a perfectly good language and botched it in the process, leaving some necessary bits out.

What you need is:

pattern_representing_word_delimited_by_non-word_characters
random_wildcard_constrained_to_MAXNUMBER_characters
backreference_to_earlier_pattern_match

Note that the first pattern can be as complex as you like (including sub-patterns for suffixes and tenses) but it needs to be grouped by a pattern delimiter, and this is usually a headache in systems that don't provide a full extended regular expression syntax.
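
A minimal sketch of that three-part pattern, written in Python rather than Perl so it can be run on a plain-text copy of the manuscript. The function name `repeated_word_pattern` and the defaults `min_len=4` and `max_gap=40` are assumptions made for illustration, not anything from Word or from the comment above.

```python
import re

def repeated_word_pattern(min_len=4, max_gap=40):
    """Build a regex: a whole word, a bounded gap, then the same word again."""
    return re.compile(
        r"\b(\w{%d,})\b"   # 1. a word delimited by non-word characters
        r".{1,%d}?"        # 2. wildcard constrained to max_gap characters
        r"\b\1\b"          # 3. backreference to the word captured in part 1
        % (min_len, max_gap),
        re.IGNORECASE | re.DOTALL,
    )

sentence = "The house was as big as a house!"
match = repeated_word_pattern().search(sentence)
if match:
    # group(1) is the repeated word, group(0) the whole span containing both
    print(match.group(1), "->", match.group(0))
```

The trailing `\b` after the backreference is what keeps "house" from matching inside "household"; plurals and stem variants would still need a more elaborate first group.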

If you were working on a Mac I'd recommend grabbing a demo version of Nisus Writer Express, which has Perl as a built-in macro language, but I have a horrible feeling you're running on Virus-OS ...



Jay Lake
User: jaylake
Date: 2007-07-09 22:43 (UTC)
Subject: (no subject)
I have been using a Mac since 1985, sir.

(Not the very same one, of course.)

Currently working on an iBook G4 and contemplating a new MacBook.

Signed,

One of the good guys.



Peter Hollo
User: frogworth
Date: 2007-07-10 05:01 (UTC)
Subject: (no subject)
I run VirusOS and haven't had a crash or virus since I don't know when. But still, from my examination of Microsoft's VeryBasic, I have a feeling that the bit about constraining the wildcard to MAXNUM characters would be the tricky bit (i.e. I don't know if you could do it at all).

If you could get around that, it'd be easy to even write a macro that asked you what word you wanted to look for in that manner, and then showed you the results.

But yeah, if you really want to, just copy the text out into a text file and I'm sure you'll have no problem writing a regex to do it.



kathryn_ironic
User: kathryn_ironic
Date: 2007-07-09 23:28 (UTC)
Subject: (no subject)
I first thought that he could install Open Office (runs on Macs, yes?), learn RegEx, make the program, and days later he'll be a much better programmer and will have his answer.

But I don't want him to program. I want him to write*.

Jay,
1. make a copy of the doc as a text file
2. ask your wise readers to write a Perl script as per the structure above.
3. play with variations of character distance.
4. decide what to do with the 5, 10 or 20 odd results you find.
5. keep writing.

-------------
* Those free samples at Westercon? all GoHs should insist on that.



Autopope
User: autopope
Date: 2007-07-10 07:53 (UTC)
Subject: (no subject)
OpenOffice's search/replace regexps are descended from a clone of MS Turd's, and thus are less than useful. Although they might have upgraded them lately. (Daydreams about a WP with libpcre linked in ...)



User: deangc
Date: 2007-07-10 12:52 (UTC)
Subject: when you have a hammer...
Everything looks like a nail. I could do this in SQL. Mind you, it isn't as wacky as you might think. SQL excels at dealing with sets of things, and in this case you are looking for the set of things that are within a certain distance of another set of things. (This is merely a restatement of what Charlie said.) This could be words, phrases... what have you.
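
The set-based idea can be sketched outside a database too. The Python below is only an illustration of the "things within a certain distance of each other" framing, roughly what a self-join of a (word, position) table on the word column would give; the function name `repeats_within` and its defaults are assumptions, and distance here is counted in words rather than characters.

```python
import re
from collections import defaultdict

def repeats_within(text, max_distance=10, min_len=4):
    """Return (word, first_pos, second_pos) for repeats within max_distance words."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]

    # Build the "table": every position at which each word occurs.
    positions = defaultdict(list)
    for index, word in enumerate(words):
        if len(word) >= min_len:
            positions[word].append(index)

    # "Join" each word's positions against themselves, keeping close pairs.
    hits = []
    for word, where in positions.items():
        for a, b in zip(where, where[1:]):  # consecutive occurrences only
            if b - a <= max_distance:
                hits.append((word, a, b))
    return sorted(hits, key=lambda hit: hit[1])

print(repeats_within("The house was as big as a house!"))
```

Swapping the word for a phrase or a normalized stem only changes what goes into the `positions` key.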

Regarding 'house' and 'housing', SQL Server has a SOUNDEX function that allows you to compare words based on how similar the algorithm thinks they sound.
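
For anyone who wants to see what a Soundex comparison would actually do, here is a rough Python sketch of the commonly described American Soundex rules. It is an assumption that this matches SQL Server's SOUNDEX closely enough to be useful; the function name `soundex` and the `CODES` table are just for this example.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y have no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return a 4-character Soundex code, or '' if the word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same digit
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated digit is kept
    return (result + "000")[:4]

for w in ("house", "houses", "housing"):
    print(w, soundex(w))
```

In this sketch "house" comes out as H200 and "housing" as H252, so Soundex on its own may not treat them as the same word; it would probably need to be combined with some kind of stemming.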

I may take a stab at this over the next little while. Not that this is likely to yield anything useful to anybody but me, but still. It's an interesting problem.



Josh English
User: joshenglish
Date: 2007-07-10 18:24 (UTC)
Subject: Re: when you have a hammer...
OOOh. Great idea. I once wrote a program that counted the vocabulary of each word in my stories. I didn't go so far as to count "word" and "words" as the same thing. Maybe the Soundex algorithm could help.
Somewhere I have a Python script that separated text into sentences and searched through them. Python exists in OSX and the script is pretty simple. I'll see if I can't whip one up tonight.



User: (Anonymous)
Date: 2007-07-10 13:51 (UTC)
Subject: (no subject)
Here's how I handled a similar task recently:

1) I wrote a macro within Word that reformatted the text so that every word was on its own line, stripped of punctuation
2) I copied the resultant text into Excel, so that each word ended up in a separate cell
3) I wrote Excel formulas to do the necessary searching — in your case this would be a trivially easy formula based on the MATCH function

It takes a few minutes to set this up, but once you've done that, you have the functionality all there and you can do various analysis in Excel.
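
The same three steps can be approximated without Word or Excel. This Python sketch is an assumption about what the macro and formula were doing, not a copy of them: `one_word_per_row` stands in for the reformatting macro, and `rows_until_repeat` is a rough analogue of a MATCH lookup over the next few rows. Both names and the `window` default are made up for the example.

```python
import string

def one_word_per_row(text):
    """Split text into lower-cased words, one per row, stripped of punctuation."""
    rows = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word:
            rows.append(word)
    return rows

def rows_until_repeat(rows, row, window=20):
    """Like MATCH over the next `window` rows: 1-based offset of the next
    identical word, or None if it does not recur in that range."""
    following = rows[row + 1 : row + 1 + window]
    try:
        return following.index(rows[row]) + 1
    except ValueError:
        return None

rows = one_word_per_row("The house was as big as a house!")
for number, word in enumerate(rows):
    print(number, word, rows_until_repeat(rows, number))
```

Printing `"\n".join(rows)` instead gives text that can be pasted straight into a spreadsheet column for further analysis.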

In my case I wanted to extract all the proper nouns from stories because I was getting concerned that I was unconsciously recycling the same character names across multiple stories (e.g. in walk-on parts).

--Ian Creasey



kathryn_ironic
User: kathryn_ironic
Date: 2007-07-10 20:24 (UTC)
Subject: (no subject)
A decade ago I'd have done something like this:
1. Written a VBA program in Excel to scan along a Word document, and for each word and the first half of each word, scan ahead for a match.
2. If there's a hit, note the location and match in Excel, and add a flag at the location in the document.
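
The "first half of each word" trick in those two steps is a crude form of stemming, and it can be sketched in a few lines of Python. This is a guess at the logic, not the original VBA; `prefix_repeats`, `lookahead` and `min_len` are names and defaults invented for the illustration.

```python
import re

def prefix_repeats(text, lookahead=10, min_len=4):
    """For each word, scan ahead for a word starting with its first half.

    Returns (position, word, match_position, matched_word) tuples.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    hits = []
    for i, word in enumerate(words):
        if len(word) < min_len:
            continue  # short words would match almost everything
        stem = word[: (len(word) + 1) // 2]  # first half, rounded up
        for j in range(i + 1, min(i + 1 + lookahead, len(words))):
            if words[j].startswith(stem):
                hits.append((i, word, j, words[j]))
                break  # note only the nearest hit for this word
    return hits

print(prefix_repeats("The houses near the housing estate"))
```

A half-word prefix is loose (a four-letter word is reduced to two letters), so it will flag false positives along with "houses" and "housing"; raising `min_len` is one way to trade recall for fewer spurious hits.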

Then came a database of a million spam.

I learned Perl really quickly.

What took 200 lines and 3 hours of VBA coding could be done in 1/2 hour and 3 lines of Perl. Sure, 3 lines of alien semaphore that due to Algernon moments can't be read the next day, but it worked.

All to say that Excel is useful for analysis, sure. Annoying as it is, it has capabilities that I don't yet see in other analysis programs (that I can afford). I've got a corner of my computer thinking it's a Windows box, all to run Excel in.

But Perl: mighty powers, that. It also adds a +1 to resumes (I've found, and I'm not a programmer by trade).



Elf M. Sternberg
User: elfs
Date: 2007-07-11 05:29 (UTC)
Subject: (no subject)
Totally off the seat of my pants, two words of between 5 and 20 letters length within 400 characters of each other:

perl -e 'while(<STDIN>){chomp;$a.=lc("$_ ");};while($a=~s/\b(\w{5,20})\b.{1,400}?\b\1\b//i){print "$1, $&\n\n"}' < file.txt

The output is the word found, and the string in which it appears twice, for every instance in the file, in the order in which they appear in the file. The strings will be uniformly lower-cased to make sure that case sensitivity is ignored during the search.

That was fun. I don't usually get to write much perl these days.



kathryn_ironic
User: kathryn_ironic
Date: 2007-07-11 08:20 (UTC)
Subject: There ya' go, Jay.
To answer the spirit of your original question...
Can I use my toothbrush as a woodchipper? No.
Can I have a woodchipper? Yes.

Now all you have to do is toss texts into a wordchipper device similar to the Elf-built one above, and any odd repeating fragments ought to zip out.



Jay Lake
User: jaylake
Date: 2007-07-11 13:06 (UTC)
Subject: Re: There ya' go, Jay.
Elf-Built: The brand of woodchippers that chipper woodchippers would prefer to chip with.



Elf M. Sternberg
User: elfs
Date: 2007-07-12 16:47 (UTC)
Subject: Re: There ya' go, Jay.
The problem here is that I've taken my chainsaw and used it as a woodchipper. It works, but the methods and results are incomprehensible to the uninitiated. I love Perl, but if I can't explain what I've done clearly to someone who needs the product, what's the point? There ought to be plugins to MS products (just as there are for Emacs-related products) that allow you to write your solution in your language of choice.

I often think I ought to publish my aft (Almost Free Text) & Python toolchain that goes from "text file containing the story" to "Web, e-book, plain text, and PDF ready documents", but the damn thing is so idiosyncratic I wonder if anyone else would ever get any use out of it.



Jay Lake
User: jaylake
Date: 2007-07-11 13:01 (UTC)
Subject: (no subject)
Thank you!


