Sunday, April 4, 2010

Forced return, non-breaking hyphens &...

Forced return (linefeed), non-breaking hyphens and spaces, suppress hyphenation.

None of these are preserved when exporting from FrameMaker to XML, since they do not seem to be represented by a Unicode code point. My question to you is: How do you deal with those things in XML?

The last issues (non-breaking and suppress hyphen) have recently been brought up in:

http://forums.adobe.com/thread/450363

http://forums.adobe.com/thread/459503

Forced return is a common thing in many FrameMaker documents to improve readability of certain phrases.

For export of FM to XML, there have been proposals to use a special element with a prefix just to capture the special FrameMaker symbol. It works, and may be a valid useful path in certain cases, but I think that it is a questionable way of dealing with it for several reasons:


Harald,

The linebreaks, hyphenation etc. are solely a function of the publishing design. If the same content is to be poured into a 3-column layout or a 1-column full-page width layout or any dynamic, resizeable container (like for portable devices), these are going to be different and really have not much purpose in the datastream.

In my case, XML is run through FM's publishing engine via Miramo to generate output and the FM files are simply transient - meant to be tweaked for visual display and discarded once published. So I really don't give much thought to preserving formatting characteristics in XML, simply because it is NOT relevant or useful data.

Perhaps if you could provide a clear answer (playing devil's advocate here and not meaning to antagonize) as to what purpose is served by trying to include application- and/or layout-specific attributes in an XML datastream, then the issue of how one deals with them may be germane. Stretching this even further, one could suppose that these formatting characteristics really should be a property of the publishing container and it should be the container's responsibility to format the datastream accordingly.


Arnis, thanks for your views.

Reflowing for different output may very well call for different hyphenation and line breaks, at least sometimes, but the way you describe it, it means that someone has to use some formatting software and redo *everything* in this area, *every* time the data stream is to be output! No reuse whatsoever of decisions about proper or improper hyphenation, or of phrases that may not be split (such as the voltage unit after the value: "12 V", which may not be split across two lines regardless of output format). Although doable, it is hard to think of this as good reuse. Any save to XML means the work is lost, which is only okay if the information was meant to be used only once.

To me, an XML data stream is crippled if it can't preserve the non-breaking space in, e.g., "12 V" (although I personally often use a thin space there instead of a regular space).

Arnis Gubins wrote:

Perhaps if you could provide a clear answer..., as to what purpose is served by trying to include application and/or layout specific attributes in an XML datastream...

Harald,

The non-breaking space is a perfectly valid Unicode character (U+00A0), as are the non-breaking hyphen (U+2011) and the hyphenation point (U+2027); a whole slew of other "formatting" characters is available in the Unicode General Punctuation block (U+2000 to U+206F). If these are included as part of the datastream, then they should be properly processed by the application (at least mine do). They also should round-trip now that FM supports the Unicode character set.
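As a quick check of the code points named above, here is a minimal Python sketch that looks them up in the Unicode character database. The names `FORMATTING_CHARS` and `describe` are illustrative only and not part of any FrameMaker workflow.

```python
import unicodedata

# Code points mentioned in the post above.
FORMATTING_CHARS = ["\u00A0", "\u2011", "\u2027"]


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


descriptions = [describe(ch) for ch in FORMATTING_CHARS]

# The hyphen and the hyphenation point fall inside U+2000..U+206F;
# the non-breaking space (U+00A0) lives in a different block.
in_general_punctuation = [0x2000 <= ord(ch) <= 0x206F for ch in FORMATTING_CHARS]

for line in descriptions:
    print(line)
```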

I also make use of regex (regular expression) processing to trap special characters or word combinations (like your "12 V", for example, if required) and apply formatting rules as required in the processing application (which is what I was alluding to by having the "smarts" in the container) when creating the output. I too do not like to have to manually insert all of the same things all of the time. However, for a final visual layout, there will always be some degree of human intervention required. I just try to get the content to a point where there is not too much for the human to do.
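The kind of regex rule described above might look roughly like the following Python sketch. It is only an illustration: the pattern, the unit list, and the function name `protect_units` are assumptions, not the rules actually used in the poster's publishing application.

```python
import re

NBSP = "\u00A0"  # non-breaking space

# Hypothetical rule: a digit, one plain space, then a unit symbol.
# The unit list is illustrative; a real rule set would be tuned to the content.
VALUE_UNIT = re.compile(r"(\d) (?=(?:V|A|W|Hz|kg|mm)\b)")


def protect_units(text):
    """Replace the space between a value and its unit with a non-breaking space."""
    return VALUE_UNIT.sub(lambda m: m.group(1) + NBSP, text)


sample = protect_units("Supply: 12 V DC, 50 Hz")
untouched = protect_units("12 Volts")  # 'Volts' is not in the unit list
```

A rule like this runs at output time, so the XML itself can stay free of layout-specific characters when they were never entered at the source.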

Harald,

Structured documents saved as XML can preserve all these characters.

Arnis,

In practice, XML is not always completely format-independent. Suppose, for instance, that documents are to be archived in XML or written to XML so they can be run through XSLT or other XML processing and then opened back in FM. Such workflows are often impractical if forced returns or non-breaking hyphens are lost. Conditional text or attributes on special elements can identify whether the characters pertain to all possible outputs or only some of them.

--Lynne

Lynne A. Price wrote:

Structured documents saved as XML can preserve all these characters.

Harald,

Your either/or missed the third case: that an app can save the special white-space and hyphen characters, but it takes some configuration to do so.

You are correct that by default FM exports a forced return as a regular line break which it then reads as a space. There are a few techniques you can use to preserve the forced returns:

1. If you are concerned only with forced returns within elements of particular types, you can use the Preserve Line Breaks rule. When it imports an element that uses this rule, a line break in the XML document comes in as a forced return. To avoid breaking lines within the content other than for forced returns, you can use a writer line break rule to extend the length of lines FM will generate (by default, it breaks at about 70 characters). For example:

element "ProgramListing" {
  preserve line breaks;
  writer line break is 999 characters;
}

2. You can declare an entity in your DTD to be used for the forced return character and use a read/write rule to map the entity to that character. For example, add:

<!ENTITY ForcedReturn "FM forced return">

to the DTD and use the r/w rule:

entity "ForcedReturn" is fm char 0x09;

3. In FM 7.2, I found that the entity approach was not reliable within text range elements. Haven't tested with later versions. I avoided the problem by defining an empty element with a prefix of a forced return and used that element instead of entering the character directly.

The problem I mentioned with entities for forced return may be specific to that particular character. I just did some testing in FM 9 writing the suppress hyphenation character as an entity. I found that my r/w rule, which I entered into an FM document rather than a text file, had to identify the character by numeric character code rather than as a single character string.

--Lynne

Lynne,

I appreciate the response and understand the use of XML "documents". I usually only deal with database output where both the XML output stream and FM documents are transient, so round-tripping is not an issue nor even a consideration (product catalogs, directories, specification sheets, statements, etc.). If format characters or entities have been entered into the database, then I have to honour them, but in many cases this info isn't available. Processing rules have to be added to the publishing application to try to compensate for the lack of some formatting rules.

First of all an embarrassing correction: At the top of my original post I said that all those things mentioned in the title are not preserved when exporting from FM. That is wrong! Some of them are preserved, some are not. Those that (without tricks) are not preserved are: forced return, suppress hyphenation and discretionary hyphen. (This is with FM 8 or 9.)

Lynne,

Ahaa! Now we are talking!

That entity thing works! What puzzles me is that you mapped it to 0x09, which is the tab character!? Indeed, looking into the FM binary, there really is a tab character in these places! Strange.

Also, just as you experienced with an earlier FM version, the entity approach does not work within text range elements in FM 9. If that could be solved, it would at least be a "small step for mankind".

I wonder what other XML tools would do when they come across that entity declaration, which they obviously cannot understand? Will they just bail out, or show "&ForcedReturn;", or a space? (I made a little experiment, and in that case it showed &ForcedReturn; and barked at it.)

Is there a way to declare an entity such that, if not understood, it will use an alternative mapping?

So, how about suppress hyphenation and discretionary hyphens? I just checked the FM binary, and it uses 0x05 and 0x04 for those (i.e., ENQ and EOT). Are there more such things that would need to be mapped? I have verified that things like non-breaking space, non-breaking hyphen, thin space, en space, em space, numeric space, en dash, and em dash are all preserved in FM 8 and 9.

I don't quite understand what you meant by identifying the character by numeric code rather than as a character string. Wouldn't you do it the same way as with forced returns, i.e., something like: entity "SuppressHyphen" is fm char 0x05; ?

--Harald

Harald,

1) Indulge me in a point of vocabulary: the features that you refer to as "tricks" are really configuration options. FM has a default way of representing all these characters, but also allows you to change that default.

2) Note that the r/w rule refers to FrameMaker's internal character codes which do not quite match those of standardized character sets. See the online Adobe FrameMaker 9 Character Sets manual (Character_Sets.pdf). FM uses \x08 for a tab and \x09 for a forced return. Hence, the rule is:

entity "ForcedReturn" is fm char 0x09;
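To keep the entity names and FM character codes in one place, the rules and the matching DTD declarations could be generated from a small table. The following Python sketch assumes the codes quoted in this thread (0x09, 0x05, 0x04); the entity name `DiscretionaryHyphen`, the replacement strings other than "FM forced return", and the helper names are hypothetical.

```python
# FrameMaker internal character codes as quoted in this thread.
# Entity names and replacement strings are illustrative choices.
FM_SPECIAL_CHARS = {
    "ForcedReturn": (0x09, "FM forced return"),
    "SuppressHyphen": (0x05, "FM suppress hyphenation"),
    "DiscretionaryHyphen": (0x04, "FM discretionary hyphen"),
}


def rw_rule(entity, code):
    """Format one read/write rule in the style shown above."""
    return f'entity "{entity}" is fm char 0x{code:02X};'


def dtd_declaration(entity, replacement):
    """Format the matching entity declaration for the DTD."""
    return f'<!ENTITY {entity} "{replacement}">'


rules = [rw_rule(name, code) for name, (code, _) in FM_SPECIAL_CHARS.items()]
declarations = [dtd_declaration(name, text) for name, (_, text) in FM_SPECIAL_CHARS.items()]

print("\n".join(rules))
print("\n".join(declarations))
```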

3) When an entity is mapped to a special character in FM, FM ignores the entity's replacement text. FM results will be the same if the DTD uses any of

<!ENTITY ForcedReturn "">

<!ENTITY ForcedReturn "xyz">

<!ENTITY ForcedReturn "FM forced return">

A content-driven XML processor (one that does not consider the markup used within its input) will not distinguish characters entered directly from those that are included via an entity reference. As you've noted, to such software (which includes XSLT), an entity reference is equivalent to its replacement text (the part of the declaration between quotes).

When I intend to process XML with XSLT, I use a declaration such as the third form because the string "FM forced return" is unlikely to occur in the contents of my users' documents. The XSLT code can therefore trap that string and perform whatever processing is needed.

If you do not need to be able to distinguish the entity reference from other methods of entering a newline or other representation of a line break, go ahead and declare the entity the way you want it processed. Remember that a DTD can declare an entity multiple times; the first declaration is used. You may be able to define entity search paths to make use of this convention.
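The behaviour described in point 3 can be observed outside FM with any ordinary XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` rather than XSLT, purely as an illustration; the element name `para` and the sample text are made up. It shows that the application sees only the replacement text, that a second declaration of the same entity is ignored, and that a distinctive replacement string can be trapped afterwards.

```python
import xml.etree.ElementTree as ET

MARKER = "FM forced return"  # replacement text of the first declaration

# Internal DTD subset with the entity declared twice; per the XML spec
# the first declaration is the binding one.
DOC = """<!DOCTYPE para [
  <!ENTITY ForcedReturn "FM forced return">
  <!ENTITY ForcedReturn "ignored second declaration">
]>
<para>first line&ForcedReturn;second line</para>"""

para = ET.fromstring(DOC)

# The parser has already expanded the entity reference; the application
# only sees the replacement text, not the reference itself.
expanded = para.text

# Trap the marker string and split on it (what to substitute is up to
# the processing application).
lines = expanded.split(MARKER)
print(lines)
```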

4) The Is FM Char rule can identify the character mapped to an entity by its character code (in decimal, hexadecimal, or octal) or as a string consisting of a single character. So the following rules are equivalent:

entity "lettera" is fm char 97;

entity "lettera" is fm char 0x61;

entity "lettera" is fm char "a";
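For anyone unsure why these forms are equivalent, the arithmetic can be checked in a couple of lines of Python. Note that the octal literal below uses Python's own syntax; how octal is written in an FM rule is not shown in this thread.

```python
# Decimal, hexadecimal, octal, and single-character forms of the same code.
codes = [97, 0x61, 0o141, ord("a")]
all_equal = len(set(codes)) == 1
print(codes, all_equal)
```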

Hope this helps,

--Lynne

Thanks everyone for all the excellent info here.

I am wondering if there is also a way to round-trip a "keep with next paragraph" setting -- the one on the Pagination tab of the para designer. Is there a way to represent that in XML? I haven't found anything that might represent that...

Thanks,

Shelley

Shelley,

Pagination is left to the application for how it treats the element. The standard way of representing that in Structured FM or XML is a pagination attribute in the element that may have a value like "Keep With Next Element" (or similar).

In my primary EDD I have several elements with the attribute Pagination that may have values like "Keep With Next Element", "Keep With Previous Element" or "Keep Together". The EDD can easily control the relevant setting of the paragraph (the same as in the pagination tab).

Especially useful is the value "Keep Together" that I can have in lists. It can ensure that the whole list is kept together on the same page. The EDD can do this since it can distinguish the first paragraph in the element from the other paragraphs.

All these things will of course round-trip since it is an explicit attribute.
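Because the setting is an ordinary attribute, any XML tool will carry it along unchanged. A minimal Python sketch of that round trip follows; the element name `Para` and the sample text are hypothetical, while the attribute name and value are the ones mentioned above.

```python
import xml.etree.ElementTree as ET

# Build an element carrying the pagination hint as a plain attribute.
para = ET.Element("Para", {"Pagination": "Keep With Next Element"})
para.text = "Introductory paragraph"

# Serialize to XML text and parse it back, as an archive or XSLT step would.
xml_text = ET.tostring(para, encoding="unicode")
restored = ET.fromstring(xml_text)

print(xml_text)
print(restored.get("Pagination"))
```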

Lynne,

As anyone can see, entity declarations have not really been in the domain of my expertise (understatement), but I might just learn bits and pieces...

Thanks! I have now done some more reading in various PDFs, such as structapps etc., but also the W3C XML spec, all of which I have "read" before, but reading is not always the same as understanding...

I have now used the "entity approach" to make both discretionary hyphen and suppress hyphen work, so that they are represented in the XML export and round-trip just fine.

Forced return still puzzles me somewhat though: I have a working solution for that using the same entity approach, but it will not work for text ranges (although it is a bit odd to have forced returns in text range elements). Moreover, I am not certain that it really is the best way. In reading other posts on this issue, I have seen that some people want forced returns to really be linefeeds in the XML output, with no other linefeeds, and conversely, when importing XML that is "formatted" in lines, such as code, they want to preserve that.

One way to achieve that is to use:

reader line break is forced return;

writer line break is 1000 characters; (or more characters)

Then it is completely symmetric (i.e., it also works if some other app has generated the XML with intentional line breaks). This method is global for all elements, and it works for me.

Another way to achieve this is to use the following, but it is only valid inside an element:

preserve line breaks;

writer line break is 1000 characters;

Then it is also completely symmetric (in the above sense), and it works for me.

The latter method has the drawback that it only works for text in the element; it is not inherited to its children.

But if we look at only one element, what is the difference between the two methods?

As far as I have been able to see, the results are identical! Why two methods? The first one seems more powerful and general(?)

Scrutinizing the documentation though, it says that 'preserve line breaks' will add an attribute 'xml:space', but I can't see that in my output?

Also, concerning the documentation for line break, it says that FM would ignore line breaks when exporting, but it does NOT do that (fortunately). Forced returns always result in linefeeds, no matter what.

I would be thankful if these issues could be explained, or elaborated.

Also, would you recommend against any of these methods?

With my preferred method of using the global 'reader line break is forced return' instead of a local (or many local) 'preserve line breaks', there is of course the danger that when importing xml, it is absolutely necessary that the xml really only has line breaks where they are intended, and not generated every 80 characters or so!
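The danger can be made concrete with a small simulation. The Python sketch below is not FrameMaker's actual export algorithm; it only assumes a writer that wraps text at a given width while keeping intentional breaks, and a reader that treats every line break as a forced return. The function names are made up for illustration.

```python
import textwrap

FORCED_RETURN = "\n"  # stand-in for a forced return in this simulation


def write_text(paragraph, width):
    """Simulated writer: keep intentional breaks, wrap long lines at `width`."""
    out = []
    for segment in paragraph.split(FORCED_RETURN):
        out.extend(textwrap.wrap(segment, width=width) or [""])
    return "\n".join(out)


def read_text(text):
    """Simulated reader: every line break becomes a forced return."""
    return text.split("\n")


# One long line, one intentional forced return, one short line.
paragraph = " ".join(["word"] * 30) + FORCED_RETURN + "second line"

narrow = read_text(write_text(paragraph, width=70))
wide = read_text(write_text(paragraph, width=1000))

print(len(narrow), len(wide))
```

With the narrow width, the wrapped lines come back as extra forced returns; with a width longer than any line in the document, only the intentional break survives the round trip.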

Is there anything that says it is "ugly" or "bad style" to honor line breaks, and generate long lines?

Finally, why do we need to limit the number of characters written in one line these days? Why 80, why 1000, why not an unlimited number of characters!?

Will software these days really crash if given a very long line (assuming that it doesn't exceed total ram memory of course)?

Sorry about so many questions, but I can't find any answers to them in the docs I have read and searched.
