The Weird Word Project
Tuesday, April 3rd, 2012 05:26 pm
I’ve been thinking about the various suggestions on how to manage this project, and I want to thank everyone who participated in the discussion. My thoughts follow.
1. What I really, truly need is a list of Weird Words (including all “foreign” words, be they Liaden, Terran, Delgadan, Vandese, etc.), and Names (including ship names, planet names, city names) for each book. One book = One list. It does, after thought, seem best that the WW and Names be combined into one, very long list.
2. Ideally, the words should be in the order they appear in the book in question — which was the idea behind the page numbers.
3. We do have to work within the constraints of my abilities. I am not a database whiz — not even close. I can keep the dern things, but multi-level/multi-directional sorts are beyond my skill level. Telling me that database wrangling is simple has in the past produced. . .no discernible improvement in my ability to do advanced sorts of any kind. In fact, my ability to deal actually plummets.
3A. The commodity I’m very short on for the foreseeable future is time. I cannot myself accept bits of databases to chain onto a master database or anything of that sort. Even if it *is* dead easy (which, believe me, it’s not), it will still take time. This project has a deadline associated with it; I’m not entirely sure when, but I don’t think it impossible that the lists for at least the first two books would be needed by early June.
4. It seems to me that, if the word harvesting process is completely automated, someone is going to have to check the lists against the books anyway, to make sure the automagic didn’t miss something. (Why, yes, I have had programs fail me. Why do you ask?) That said, it seems like it would be very useful to run a script as a check, in case the humans missed something.
5. In terms of volunteers, I think we’re going to need:
*At least one Word Wrangler for each book, to receive the information from volunteers, or to tend the database/wiki, check the list against a software-produced list (if any), and to compile the final list that will be forwarded to me
*At least two Word Harvesters for each book, who will compile the words and the page numbers and either forward them to the Wrangler, or enter them in a database/wiki/form
*An Automagician to automagically generate a list from each book to be checked against the list produced by the Word Harvesters. This is a luxury.
Note: I don’t really care about the process used to produce the lists. All I really care about is having accurate lists. That means that if three or four folks want to pool resources as the Gathering Team on Book A, I don’t care how they bag the words, only that all the words are bagged, and that I receive them in a simple, and understandable, format.
Does all of the above make sense (or is, at least, clear; I don’t actually expect anybody to enter into the Database-Free Zone that is my brain)?
Discussion open until tomorrow, Wednesday, mid-morning.
Thanks!
Originally published at Sharon Lee, Writer. You can comment here or there.
no subject
Date: 2012-04-03 09:37 pm (UTC)
The project fascinates me, and I'd like to be a part of the gathering.
no subject
Date: 2012-04-03 09:46 pm (UTC)
Anything that's not Everyday English ought to be on the list(s) -- Miri's swear words, words that were made up to describe how non-existent machines are constructed. Also, we do sometimes use archaic English words -- if it's weird, we want it, even if it looks easy to pronounce (aka "delm").
The names are maybe going to be tough ("Kamele" for instance, we thought was pretty straightforward, until we heard someone pronounce it "Ka-Meel-ie"). Might just be the way to go is to harvest all the names...
no subject
Date: 2012-04-03 10:03 pm (UTC)
I guess it seems to me that the exponential work reduction from automating is worth the potential cost of missing a few words (though I am biased, since I don't think automation is likely to miss words). Because doing it by hand with people seems like it will take man-weeks, and automating it will take man-minutes.
So, given the text files of the books, I can produce lists of words and their frequency in a few minutes (maybe seconds, but I'd have to go look at my notes), and then exclude from them words found in a dictionary (your choice of dictionary; I used a handy text file of English words I had lying around). For the files I had handy last time I looked at this, these were the top ten:
1 2131 daav
2 2119 *
3 1390 miri
4 1380 aelliana
5 1358 shan
6 1294 jethri
7 1209 jela
8 1101 cantra
9 1097 thom
10 971 korval
(Yes, I know that * is not a word. But better to include it than to avoid such things.)
Anyhow, let me know if you want me to run that operation on any and all individual books.
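For anyone curious what such a pass might look like, here is a minimal Python sketch of the idea described above: count every word in a book's text, then drop the ones found in a dictionary list. It is not the script used to produce the numbers above. The function name `weird_word_counts`, the word pattern, and the tiny sample text and dictionary are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters that may contain internal straight
# apostrophes or hyphens (yos'Galan, jump-wire). Curly apostrophes and leading
# apostrophes ('fresher) would need extra handling.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def weird_word_counts(text, dictionary_words):
    """Count words in `text` that are not in `dictionary_words` (case-insensitive)."""
    known = {w.lower() for w in dictionary_words}
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    return Counter({w: n for w, n in counts.items() if w not in known})

# Hypothetical sample text and dictionary, standing in for a book file and a word list.
sample = "Daav bowed to the delm. Miri laughed, and Daav smiled."
dictionary = ["bowed", "to", "the", "laughed", "and", "smiled"]

# Print rank, count, word, most frequent first.
for rank, (word, n) in enumerate(weird_word_counts(sample, dictionary).most_common(), 1):
    print(rank, n, word)
```

Lowercasing everything keeps "Daav" and "daav" together, at the cost of losing the original capitalization in the output.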
no subject
Date: 2012-04-03 10:04 pm (UTC)
I would also be willing to help with this project. I have most of your work in both electronic and paper form, so I could do some word sorting on the electronic copies, search to find approximately where each word appears, and then double-check the page numbers against the paper copies.
I also have modest skills as a database wrangler, especially MySQL. Which may or may not be useful for this project.
weird words
Date: 2012-04-04 12:14 am (UTC)
With your multitude of fans, might the task of reading thru each book be divided into chapters? This division could be the responsibility of the particular book's 'Wrangler'.
*I* am willing to do a whole book :)
I actually like to do this sort of thing.
no subject
Date: 2012-04-04 06:41 am (UTC)
Just wondering why? My guess is that you're thinking of context, in which case a KWIC sort (keyword in context) might be just as useful. I seem to remember this being a standard sort on Unix, but I may be wrong. Ask Jayhawk, he can probably produce one at the drop of a script.
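To make the KWIC idea concrete, here is a rough Python sketch that lists every occurrence of one keyword with a little text on either side. It is only an illustration, not any particular Unix tool. The function name `kwic`, the `width` parameter, and the sample sentence are assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence of `keyword`, with up to `width`
    characters of surrounding context on each side."""
    # Lookarounds instead of \b so keywords that start with an apostrophe
    # ('fresher, 'prentice) still match as whole words.
    pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE)
    flat = text.replace("\n", " ")  # keep each context snippet on one line
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Hypothetical sample text.
for line in kwic("The delm spoke. No delm would agree.", "delm", width=12):
    print(line)
```

Running this once per word on a harvested list would give a context listing for the whole list, which could then be sorted by keyword.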
no subject
Date: 2012-04-04 12:29 pm (UTC)
In order that it be Most Useful to the possibly several sets of people who may need to use it. Admittedly, I'm guessing; it could be that an alphabetical list will be easiest, but producing an alpha list from a random list is within my skill-set.
no subject
Date: 2012-04-04 02:23 pm (UTC)
If desired, I can also provide the word number (e.g. this is word #14531 within the book), the word frequency (e.g. 'Miri' is in this book 4132 times) and/or the paragraph in which the word is found. I can also provide multiple versions -- say, one sorted in the order in which it's found, one sorted alphabetically. Or it might be better to make one formatted for easy reading, another formatted for importing into Excel.
I'm going on vacation tonight straight from work, but I can do this next Wednesday after I return. (Heck: I can do this on vacation if husband gets internet on his laptop; I *like* wrangling Perl scripts and after jumping the gun earlier I don't think there's more than twenty minutes of work(*) left.) (*) Caveat: false positives would be left in place.
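As a rough illustration of the outputs described above (word number, frequency, order-of-appearance and alphabetical versions, and something Excel can import), here is a small Python sketch. It is not the Perl script mentioned in this comment. The names `word_report` and `to_csv`, the CSV column headers, and the sample sentence are assumptions, and no dictionary filtering is done here.

```python
import csv
import io
import re
from collections import Counter

# Assumption: same simple word pattern as a basic tokenizer; straight apostrophes only.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def word_report(text):
    """Return one row per distinct word, in order of first appearance:
    (word, position of first occurrence counted in words, frequency)."""
    words = WORD_RE.findall(text)
    freq = Counter(words)
    first_seen = {}
    for position, word in enumerate(words, start=1):
        first_seen.setdefault(word, position)  # keep only the first position
    return [(w, pos, freq[w]) for w, pos in first_seen.items()]

def to_csv(rows):
    """Render rows as CSV text, which spreadsheet programs can open."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["word", "first_position", "frequency"])
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical sample text.
rows = word_report("Miri saw Val Con. Val Con saw Miri.")
by_appearance = rows                                      # order found in the book
alphabetical = sorted(rows, key=lambda r: r[0].lower())   # alphabetical version
print(to_csv(by_appearance))
```

This version is case-sensitive and treats "Val" and "Con" as separate words; joining multi-part names would need a rule of its own.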
no subject
Date: 2012-04-04 08:52 pm (UTC)
For clarification, do you want only the first occurrence of a word noted when a word occurs more than once in a book?
no subject
Date: 2012-04-04 09:35 pm (UTC)
Is this what you're looking for? I'm particularly interested in whether first and last names should be considered one item and whether my personal filter for weirdness is too liberal.
Book Counter Word
Intelligent Design 1 Er Thom
Intelligent Design 2 yos'Galan
Intelligent Design 3 Korval
Intelligent Design 4 Ezern
Intelligent Design 5 pak'Ora
Intelligent Design 6 Wal Tor
Intelligent Design 7 delm
Intelligent Design 8 Ranvit
Intelligent Design 9 Ban Del
Intelligent Design 10 Code
Intelligent Design 11 Anne
Intelligent Design 12 Davis
Intelligent Design 13 Terran
Intelligent Design 14 Liaden
Intelligent Design 15 melant'i
Intelligent Design 16 clan
Intelligent Design 17 Code-wise
Intelligent Design 18 Val Con
Intelligent Design 19 yos'Phelium
Intelligent Design 20 Rising
Intelligent Design 21 Solcintra
Intelligent Design 22 necessity
Intelligent Design 23 Nova
Intelligent Design 24 Anthora
Intelligent Design 25 Service Houses
Intelligent Design 26 Balance
Intelligent Design 27 Ring
Intelligent Design 28 dea'Gauss
Intelligent Design 29 pel'Kana
Intelligent Design 30 Jelaza
Intelligent Design 31 Kazone
Intelligent Design 32 Shan
Intelligent Design 33 Liad
Intelligent Design 34 'fresher
Intelligent Design 35 Master Trader
Intelligent Design 36 Merlin
Intelligent Design 37 'prentice
Intelligent Design 38 Luken
Intelligent Design 39 relumma
Intelligent Design 40 duty-list
Intelligent Design 41 Pomerloo
Intelligent Design 42 Ken Rik
Intelligent Design 43 cargo master
Intelligent Design 44 Roderick Spode
Intelligent Design 45 IAMM
Intelligent Design 46 datagram
Intelligent Design 47 Pomerlooport
Intelligent Design 48 Scout
Intelligent Design 49 kerb
Intelligent Design 50 Independent Armed Military Modules
Intelligent Design 51 Standard Year
Intelligent Design 52 Prael
Intelligent Design 53 Anusta Hayn
Intelligent Design 54 'bots
Intelligent Design 55 Complex Logic Laws
Intelligent Design 56 sleep-learn
Intelligent Design 57 Standards
Intelligent Design 58 Glondinport
Intelligent Design 59 Kayzin
Intelligent Design 60 Ne-Zame
Intelligent Design 61 Wilberforce Warehouse
Intelligent Design 62 Port Rule One
Intelligent Design 63 Healers
Intelligent Design 64 Monix
Intelligent Design 65 jump-wire
Intelligent Design 66 data-jacks
Intelligent Design 67 whisker-wires
Intelligent Design 68 tri-spatial
Intelligent Design 69 voder
Intelligent Design 70 voice-box
Intelligent Design 71 databox
Intelligent Design 72 ambient network
Intelligent Design 73 Command Prime
Intelligent Design 74 Scout Commander
Intelligent Design 75 Ivdra
Intelligent Design 76 sen'Lora
Intelligent Design 77 Tralla
Intelligent Design 78 Gantrol
Intelligent Design 79 Jeeves
no subject
Date: 2012-04-04 09:44 pm (UTC)
Which means that you can:
1) Ask someone to make a webform for each book with two fields: word and page number (maybe a comment, too?). Each form gets a really long, bizarre URL.
2) Ask that same person to configure a report for each book, which is essentially a read-only Excel table, visible on a web page.
3) Give those URLs to the Harvesters.
4) Check once in a while to see how things are flowing.
I am a big fan of stable, out-of-the-box software.
:-)