
Re: Documentation Metrics

"David C. Merrill, Ph.D." wrote:
> I am working on the set of metrics to be used in reviewing our documents.

One thing I'd wonder about is whether any useful metrics can be
generated automatically.

There's a whole literature on readability indexes based on statistical
analysis of things like words per sentence and letters per word. Some
of the key work was done by Lorinda Cherry and others in the Writer's
Workbench project at Bell Labs.
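
A minimal sketch of those two statistics, in Python. The function name
surface_stats and the crude sentence and word splitting are assumptions
made for illustration. The third figure is the Automated Readability
Index, one published index built from exactly these two ratios; it is
not necessarily what Writer's Workbench computed.

```python
import re

def surface_stats(text):
    """Return words per sentence, letters per word, and ARI for plain text."""
    # Crude sentence split: any run of '.', '!' or '?' ends a sentence.
    # Abbreviations such as "vs." will be miscounted as sentence ends.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, optionally joined by apostrophes.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not sentences or not words:
        raise ValueError("text has no sentences or no words")
    letters = sum(len(w.replace("'", "")) for w in words)
    wps = len(words) / len(sentences)
    lpw = letters / len(words)
    # Automated Readability Index: 4.71 * chars/word + 0.5 * words/sentence - 21.43
    ari = 4.71 * lpw + 0.5 * wps - 21.43
    return {"words_per_sentence": wps, "letters_per_word": lpw, "ari": ari}
```

Feeding it the plain text of a HowTo (markup stripped first) gives
three numbers per document that can be compared across a collection.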

There was a Reader's Workbench project at one point, U of Utah I
think, with an ex-Bell Labs person from the Programmer's Workbench
(make and ancestors of CVS) project involved. Anyone know where that
went? Is the software available somewhere? Did they publish papers?

There are other things one could measure.

Frequency of technical terms (first cut at a definition is words not
in some general-purpose dictionary) or of such terms minus a standard
list (Linux, ipchains, RFC, ...), or terms neither on list nor in
glossary (oops!).
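
Those three counts could be sketched roughly as follows, in Python.
The function name term_report and its parameters are hypothetical; the
three word sets are assumed to be supplied by the caller, lowercased,
from whatever dictionary, standard list and glossary are at hand.

```python
import re
from collections import Counter

def term_report(text, dictionary, standard_terms, glossary):
    """Rate of technical terms, and which ones are defined nowhere."""
    # Tokens start with a letter and may contain digits or hyphens.
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z0-9-]*", text)]
    # First cut: "technical" means "not in the general-purpose dictionary".
    unknown = Counter(w for w in words if w not in dictionary)
    # Minus the standard list (Linux, ipchains, RFC, ...).
    beyond_standard = {w: n for w, n in unknown.items()
                       if w not in standard_terms}
    # Neither on the list nor in the glossary: the "oops" cases.
    undefined = {w: n for w, n in beyond_standard.items()
                 if w not in glossary}
    total = len(words) or 1
    return {
        "technical_rate": sum(unknown.values()) / total,
        "nonstandard_rate": sum(beyond_standard.values()) / total,
        "undefined": undefined,
    }
```

Swapping in a learner's dictionary for the dictionary argument changes
only the input set, not the code.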

Another variant would use not a standard English dictionary, but one
of the dictionaries developed for use with non-native speakers.

Frequency of words which indicate rhetorical structure -- therefore,
however, whereas, except, ... -- or of constructions that reference
other parts of text -- either pronouns such as 'it' or 'this', or
non-specific nouns that refer back to more exact descriptions. In
many contexts, phrases like 'the device' or 'the interrupt' function
this way.
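
A rough sketch of such a count, in Python, reporting occurrences per
1000 words. The word lists contain only the examples named above and
would need extending; the function name marker_rates and the back_refs
parameter are made up for illustration. Note that it counts every 'it'
and 'this' without checking what, if anything, the word refers to.

```python
import re
from collections import Counter

CONNECTIVES = {"therefore", "however", "whereas", "except"}
PRONOUNS = {"it", "this"}

def marker_rates(text, back_refs=("the device", "the interrupt")):
    """Per-1000-word rates of connectives, pronouns and back-references."""
    # Lowercase and collapse whitespace so phrases match across line breaks.
    lowered = " ".join(text.lower().split())
    words = re.findall(r"[a-z']+", lowered)
    total = len(words) or 1
    counts = Counter(words)

    def per_1000(n):
        return 1000.0 * n / total

    # Multi-word back-references are matched as whole phrases.
    phrase_hits = sum(
        len(re.findall(r"\b" + re.escape(p) + r"\b", lowered))
        for p in back_refs
    )
    return {
        "connectives": per_1000(sum(counts[w] for w in CONNECTIVES)),
        "pronouns": per_1000(sum(counts[w] for w in PRONOUNS)),
        "back_references": per_1000(phrase_hits),
    }
```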

Frequency of various whateverML tags, and their level of nesting.
Nested lists inside a table structure under a level six heading?
Methinks I see a problem. One H1 tag followed by 14 K of text with
only two links in it? That's problematic too.
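
For HTML, at least, the standard Python library is enough for a first
cut. This sketch counts tags, tracks the deepest nesting, and reports
characters of text per link. The class and function names are
invented for illustration, and the handling of unclosed tags is
deliberately forgiving rather than correct; SGML or XML sources would
need a different parser.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they add no nesting.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class TagStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()   # occurrences of each tag name
        self.stack = []           # currently open elements
        self.max_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        if tag in VOID:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed <p>, <li>.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        self.text_chars += len(data)

def tag_stats(markup):
    parser = TagStats()
    parser.feed(markup)
    parser.close()
    links = parser.counts["a"]
    return {
        "counts": dict(parser.counts),
        "max_depth": parser.max_depth,
        "text_chars": parser.text_chars,
        # None when there are no links at all.
        "chars_per_link": parser.text_chars / links if links else None,
    }
```

A page with a large chars_per_link figure, or a max_depth well above
that of its neighbours, is a candidate for a closer look by hand.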

Measuring such things precisely and figuring out all the implications
is a big project. I'd guess there are half a dozen potential theses
in it. On the other hand, an afternoon of Perl hacking might be enough
to provide some interesting results.

My guess would be that at least some of the objectively, automatically
measurable statistical properties of text would correlate with some
of the judgements we make -- clear vs. confusing, basic vs. advanced.

I'd love to have a tool that tells me that, compared to some sample
that covers related docs (say, HowTos for administrators) and that
users rate as well-written, my docs are measurably different in
specific ways.
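
One simple way to sketch that comparison, in Python: given a dict of
metrics for my document and a list of such dicts for the well-rated
sample, report each metric whose value lies more than some number of
standard deviations from the sample mean. The function name, the
metric names and the threshold of 2.0 are all assumptions; a real tool
would want a larger sample and probably a better statistic.

```python
from statistics import mean, stdev

def compare_to_reference(doc_metrics, reference_metrics, threshold=2.0):
    """Return {metric: z-score} for metrics that differ from the sample."""
    report = {}
    for name, value in doc_metrics.items():
        # Collect this metric from every reference doc that has it.
        sample = [m[name] for m in reference_metrics if name in m]
        if len(sample) < 2:
            continue  # cannot estimate spread from fewer than two docs
        mu, sigma = mean(sample), stdev(sample)
        if sigma == 0:
            continue  # every reference doc has the same value
        z = (value - mu) / sigma
        if abs(z) >= threshold:
            report[name] = z
    return report
```

The sign of each reported score says which way the document differs:
positive for more of something than the sample, negative for less.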
