Processing a List of Characters and States
with Definitions:
A Perl Script to Convert the List to XML
Purpose: For Openkey, we have a Perl script that processes a list of character groups, characters, and states (with definitions) that someone wants for their collection. The script creates one XML file for each individual state, character, and character group.
For instance, our botanists in the Prairie Plant and South East Trees projects created such a list. Our script was written specifically for their list, but if you create your own list using the same format, our processing script should easily handle it.
We have a template in rich text format that may help people write their own list.
What is Needed to Convert a Definitions List to XML:
- Perl or ActivePerl
- A Definitions List (see formatting instructions)
- split_definitions.pl - the Perl script (copyleft)
- A directory structure for the XML files to go into. For instance, our directories were:
- ../project_name/characterGroup/xml/,
- ../project_name/character/xml/,
- ../project_name/state/xml/
- A text editor.
- An example of each type of XML you want to create. (see note 1 below)
- A person who is comfortable editing the .pl file to change the hard-coded XML element names.
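The directory layout listed above can be created ahead of time. Here is a minimal sketch in Python (not the Perl script itself); the function name `make_project_dirs` and the `base` argument are made up for illustration.

```python
from pathlib import Path

# The three kinds of XML file described in this document.
KINDS = ("characterGroup", "character", "state")

def make_project_dirs(base, project_name):
    """Create <base>/<project_name>/<kind>/xml/ for each kind and return the paths."""
    created = []
    for kind in KINDS:
        path = Path(base) / project_name / kind / "xml"
        # parents=True builds the intermediate directories;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For example, `make_project_dirs("..", "prairieplant")` would create the three prairieplant directories next to the current one.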
The General Format for the List
(see the template for the best instructions, but here are the main guidelines)
- The project name should be on the first line, followed by two blank lines before the first major character group begins.
- The file should be a text file or a rich text file.
- Two blank lines between major character groups (i.e. "Leaves" is a major character group, "Stems" is another).
- Within a "major character group", there should be a single blank line between everything except related states. For example, there is a single blank line between "Plant Habit and Life Style" and "Life Span", but no blank line between "Annual" and "Biennial", because both are states under "Life Span."
- After each state name, there should be a space, a hyphen, a space, and then the definition (see "Annual" below). A state without a definition still needs the hyphen (see "Bunching" below).
- Between two states in the same character, there should be a hard return, but no blank line. Here is a quick example:
Project: prairieplant


Plant Habit and Life Style

Life Span

Annual - Normally living one year or less; growing, reproducing, and dying within one cycle of seasons. [K&P, p. 15]
Biennial - Normally living two years; germinating or forming and growing vegetatively during one cycle of seasons, then reproducing sexually and dying during the following one. [K&P, p. 21]
Perennial - Normally living more than two years, with no definite limit to its life span. [K&P, p. 79]

Woodiness

Herbaceous - Having little or no living portion of the shoot persisting aboveground from one growing season to the next, the aboveground portion being composed of relatively soft, non-woody tissue. [K&P, p. 56, modified]
Woody - With an aboveground shoot composed of relatively hard tissue that persists from one growing season to the next.

Herbaceous Plant Growth Form

Bunching -
Single upright stem -
- Only the states should be defined in the first portion of the file.
- Capitalization does not matter because our processing lower-cases everything.
- State definitions are allowed to be complicated, for example:
Subshrub - 1) A shrub-like plant but with only the base composed of woody tissue, the herbaceous branches dying back at the end of each growing season. [K&P, pp. 106-107, modified] 2) A very low shrub that sprawls on the ground; a trailing shrub. (Compare with shrub.) [L, p. 772, modified]
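To make the formatting rules above concrete, here is a minimal Python sketch of how such a list could be read. It is not the actual split_definitions.pl. The names `STATE_LINE`, `to_name`, and `parse_list` are hypothetical, and the sketch rests on two assumptions taken from the guidelines: a line is a state if it has a hyphen with a space before it, and a heading is a major character group only if two or more blank lines precede it.

```python
import re

# A state line is "Name - definition"; the definition may be empty ("Bunching -").
STATE_LINE = re.compile(r"^(.*?)\s-(?:\s+(.*))?$")

def to_name(label):
    """Lower-case a label and join its words with underscores (for file names)."""
    return "_".join(label.lower().split())

def parse_list(text):
    """Return (project, groups) parsed from a definitions list."""
    lines = text.splitlines()
    # The first line holds the project name, e.g. "Project: prairieplant".
    project = lines[0].split(":", 1)[-1].strip().lower()
    groups = []
    blanks = 0  # blank lines seen since the last non-blank line
    for raw in lines[1:]:
        line = raw.strip()
        if not line:
            blanks += 1
            continue
        state = STATE_LINE.match(line)
        if state:
            # Split only at the first spaced hyphen, so hyphens inside the
            # definition (e.g. "pp. 106-107") are left alone.
            name = state.group(1).strip().lower()
            definition = (state.group(2) or "").strip()
            groups[-1]["characters"][-1]["states"].append((name, definition))
        elif blanks >= 2 or not groups:
            # Two blank lines (or the very first heading) start a major group.
            groups.append({"name": line.lower(), "characters": []})
        else:
            groups[-1]["characters"].append({"name": line.lower(), "states": []})
        blanks = 0
    return project, groups
```

Everything is lower-cased on the way in, as the guidelines describe, and `to_name("Life Span")` gives `life_span`, the form used in the XML file names. The sketch does no error handling: a state line that appears before any character heading would raise an IndexError.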
Examples of the resulting XML files:
The script produces three types of files: character group files, character files, and state files.
File Name: plant_habit_and_life_style.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE characterGroup SYSTEM "http://www.isrl.uiuc.edu/~openkey/shared/characterGroup.dtd">
<!-- created by split_definitions.pl 2003_9_22_12_1 -->
<CharacterGroup>
<CharacterGroupName name="plant_habit_and_lifestyle" file="/home/openkey/public_html/prairieplant/characterGroup/xml/plant_habit_and_lifestyle.xml">plant habit and lifestyle</CharacterGroupName>
<LegalValue name="life_span" file="/home/openkey/public_html/prairieplant/character/xml/life_span.xml">life span</LegalValue>
<LegalValue name="woodiness" file="/home/openkey/public_html/prairieplant/character/xml/woodiness.xml">woodiness</LegalValue>
<LegalValue name="growth_habit" file="/home/openkey/public_html/prairieplant/character/xml/growth_habit.xml">growth habit</LegalValue>
<LegalValue name="herbaceous_plant_growth_form" file="/home/openkey/public_html/prairieplant/character/xml/herbaceous_plant_growth_form.xml">herbaceous plant growth form</LegalValue>
<LegalValue name="nutrition" file="/home/openkey/public_html/prairieplant/character/xml/nutrition.xml">nutrition</LegalValue>
<LegalValue name="carnivory" file="/home/openkey/public_html/prairieplant/character/xml/carnivory.xml">carnivory</LegalValue>
<Image>none yet</Image>
<Definition>plant habit and lifestyle</Definition>
<Synonym></Synonym>
<BroaderTerm></BroaderTerm>
<NarrowerTerm></NarrowerTerm>
<RelatedTerm></RelatedTerm>
<DisplayBefore></DisplayBefore>
<DisplayFor><strong>Plant habit and lifestyle</strong>: </DisplayFor>
<DisplayAfter>.<BR></DisplayAfter>
</CharacterGroup>
File Name: life_span.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Character SYSTEM "http://www.isrl.uiuc.edu/~openkey/shared/character.dtd">
<!-- created by split_definitions.pl 2003_9_29_0_12 -->
<Character>
<CharacterName name="life_span" file="/home/openkey/public_html/prairieplant/character/xml/life_span.xml">life span</CharacterName>
<LegalValue name="annual" file="/home/openkey/public_html/prairieplant/state/xml/annual.xml">annual</LegalValue>
<LegalValue name="biennial" file="/home/openkey/public_html/prairieplant/state/xml/biennial.xml">biennial</LegalValue>
<LegalValue name="perennial" file="/home/openkey/public_html/prairieplant/state/xml/perennial.xml">perennial</LegalValue>
<Image>none yet</Image>
<Definition>life span</Definition>
<Synonym></Synonym>
<BroaderTerm></BroaderTerm>
<NarrowerTerm></NarrowerTerm>
<RelatedTerm></RelatedTerm>
<DisplayBefore></DisplayBefore>
<DisplayFor></DisplayFor>
<DisplayAfter>,</DisplayAfter>
</Character>
File Name: annual.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE State SYSTEM "http://www.isrl.uiuc.edu/~openkey/shared/state.dtd">
<!-- created by split_definitions.pl 2003_9_29_0_12 -->
<State>
<StateName name="annual" file="/home/openkey/public_html/prairieplant/state/xml/annual.xml">annual</StateName>
<Definition><em>(plant habit and lifestyle )</em> Normally living one year or less; growing, reproducing, and dying within one cycle of seasons. [K&P, p. 15] </Definition>
<Image>none yet</Image>
<Example></Example>
<Synonym></Synonym>
<Synonym></Synonym>
<BroaderTerm></BroaderTerm>
<NarrowerTerm></NarrowerTerm>
<RelatedTerm> </RelatedTerm>
<Prevelence></Prevelence>
<Certainty></Certainty>
<DisplayBefore></DisplayBefore>
<DisplayFor>annual </DisplayFor>
<DisplayAfter>,</DisplayAfter>
</State>
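As an illustration of how a state file like annual.xml could be assembled from a parsed entry, here is a minimal Python sketch. It is not the original Perl code, and the function name `state_xml` and its arguments are hypothetical. The element names and their order are copied from the sample above, with "Prevelence" spelled as it appears there. The sketch leaves out the "created by" timestamp comment. It also escapes characters such as "&" in the definition, which is an assumption on our part: the sample output above shows "K&P" unescaped.

```python
from xml.sax.saxutils import escape, quoteattr

DTD = "http://www.isrl.uiuc.edu/~openkey/shared/state.dtd"
# Elements left empty in the sample state file, in the same order.
EMPTY = ["Example", "Synonym", "Synonym", "BroaderTerm", "NarrowerTerm",
         "RelatedTerm", "Prevelence", "Certainty", "DisplayBefore"]

def state_xml(project_root, group_label, state_label, definition):
    """Build the text of one state file (hypothetical helper)."""
    label = " ".join(state_label.lower().split())  # e.g. "annual"
    name = label.replace(" ", "_")                 # file-name form
    path = f"{project_root}/state/xml/{name}.xml"
    out = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<!DOCTYPE State SYSTEM "{DTD}">',
        "<State>",
        # quoteattr adds the surrounding quotes and escapes the value.
        f"<StateName name={quoteattr(name)} file={quoteattr(path)}>"
        f"{escape(label)}</StateName>",
        f"<Definition><em>({escape(group_label.lower())})</em> "
        f"{escape(definition)}</Definition>",
        "<Image>none yet</Image>",
    ]
    out += [f"<{tag}></{tag}>" for tag in EMPTY]
    out += [
        f"<DisplayFor>{escape(label)} </DisplayFor>",
        "<DisplayAfter>,</DisplayAfter>",
        "</State>",
    ]
    return "\n".join(out) + "\n"
```

Character and character group files could be built the same way, with LegalValue lines generated from the child names.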
Notes, Comments, Hopes, Wishes, ...
Note 1: Because I frequently ran this script on my home computer, which does not have the XML modules installed, I hard-coded the XML elements and tags. In other words, the script must be altered if the XML elements change. It does not read in the three types of XML files.
Someday we'll have an upload area where people can process their own files on our server.
Please take and improve the script if you are so inclined. We would love to put your better code up for others to use. GNU copyleft licensing applies.
We encourage the "borrowing" of lists between projects.
For complicated character group relationships, we would like to allow numbering such as 1, 1.1, 1.2, ... 1.10.1; otherwise even the humans get confused. A sample of my ideal input file would be something like ideal list.
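Numbers of this kind need a little care when sorting, because "1.10" must come after "1.2". The following Python sketch shows one way they might be handled. This numbering is only a wish at present, so the helper names `outline_key` and `parent_of` and the dotted format itself are assumptions.

```python
def outline_key(number):
    """Turn '1.10.1' into (1, 10, 1) so that numbers sort numerically."""
    return tuple(int(part) for part in number.split("."))

def parent_of(number):
    """Return the enclosing level ('1.10' for '1.10.1'), or None at the top."""
    head, sep, _ = number.rpartition(".")
    return head if sep else None

numbers = ["1.10.1", "1.2", "1", "1.1", "1.10"]
ordered = sorted(numbers, key=outline_key)
```

Sorting by the tuple puts each level directly before its children, and `parent_of` gives the link needed to attach a character to its group.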
Adding character groups currently means reprocessing the entire updated list. (Otherwise we would have to check whether each character group already existed and update the legal values for each level.)
I'd like to replace all the repeated things in the XML files with the minimum amount of information we need to access the related files. For instance, file="/home/openkey/public_html/prairieplant/character/xml/ should only need to be the word "character". The only thing keeping us from doing this is the XSL files that transform the XML to other XML and/or HTML.
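The idea is that the full path can be rebuilt from the kind of file and its name. Here is a minimal Python sketch of that lookup; the function `resolve_file` and its arguments are hypothetical, and the XSL files would need an equivalent rule.

```python
# The three kinds of file, matching the directory names used by the project.
KINDS = {"characterGroup", "character", "state"}

def resolve_file(project_root, kind, name):
    """Rebuild the full path from a short kind word and a name."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind}")
    return f"{project_root.rstrip('/')}/{kind}/xml/{name}.xml"
```

With the project root stored once, a LegalValue element would only need to carry the word "character" and the name "life_span".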