Modeling biz docs in XML

By Jon Udell
November 29, 2002


THE GOOD NEWS is that Office 11 supports XML Schema. The bad news is that XML Schema has been described even by XML experts as "confusing," "impenetrable," "fuzzy," and "as user-friendly as a stick in the eye." A successor to the SGML/XML DTD (Standard Generalized Markup Language/XML document type definition), XML Schema is a language for writing rules that constrain the kinds of elements that can appear in documents and the ways in which they can be sequenced, grouped, and nested.

XML Schema is still a relatively new specification. The W3C Recommendation for XML Schema was published in May 2001. XML parsers that support XML Schema haven't done so for very long, and there is not yet much experience using it. Most people who are adept at defining document structure learned how to do so by writing DTDs. Some of the allergic reaction to XML Schema can, therefore, be chalked up to normal reluctance to learn new skills.

Of course, it's hard to work up a lot of nostalgia for the DTD legacy. Adjectives such as "confusing" and "impenetrable" were also flung at SGML DTD. Back in the day, more than a few large document management projects -- like too many modern ERP systems -- produced a lot of sound and fury, signifying nothing. The fact is that, although sets of documents do exhibit databaselike properties that we can usefully formalize and exploit, this kind of information management is still in its infancy.

Boeing, one notable exception, has always understood that documentation is integral to its business. The company likes to joke that a jet is "five million parts flying in formation." The documents that describe that inventory are themselves part of the inventory, and are engineered accordingly. Applying that same discipline to routine business documents such as résumés, expense reports, and purchase orders, though, was never a serious option. Sure, it would be nice to tag all this stuff for intelligent search, aggregation, and data mining. But there were no general-purpose tools for tagging documents that are individually low-value (albeit collectively high-value), and no business case could be made for creating special-purpose tools to do that instrumentation. Office 11, which aims to bring special-purpose capability to general-purpose tools, is arguably one of the most disruptive technologies in the pipeline.

"Got a question?" writes Phil Windley, CIO of the State of Utah, on his Weblog. "Somewhere, on some government computer, the information you need is probably available. Information you paid for and the government would gladly share with you -- if only they could find it." Upgrading the word processors and spreadsheets on those government computers to versions that not only can read and write XML, but, more crucially, can enforce rules about datatypes and structures, is part of the solution. Assuming, of course, that such rules can be written, deployed, and unobtrusively applied and maintained over time. "Therein," observes Windley, "lies the rub."

There is very little extant knowledge about how to model unstructured and semistructured data in XML. Unlike SGML, the XML DTD was always optional, because the framers of XML knew there was enormous value in documents that were merely well-formed, even if not valid with respect to a DTD. RSS (Rich Site Summary), for example, the wildly popular XML format for content syndication, has no DTD or schema. "Many on this list will find it shocking," wrote XML co-inventor Tim Bray on the xml-dev mailing list (http://lists.xml.org), "but lots of important XML dialects don't have any DTDs or schemas. People e-mail back and forth some examples, they cut some code, and then everything's working and they're too busy to go back and write a schema." Although he doesn't wholly approve of this practice, Bray is a realist who understands that it happens often, and it can yield good results. But if schemas don't exist, applications can't enforce them. So where are the schemas going to come from?

One possibility is to infer schemas from example documents. Tools can do this, but so far, not with much sophistication. Microsoft, for example, offers a .Net namespace (Microsoft.XsdInference) that will infer a schema from an XML document, and even refine that schema based on further examples. The results make a useful starting point, and inferencing is a promising technology that can and should evolve, but the fact is that modeling XML data is a complex subject that even the best human experts have yet to codify. XML Schema delivers a much richer set of modeling tools than were available to DTD authors. Learning to use them well is going to be a challenge.
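
To make the idea of inference concrete, here is a minimal sketch in Python. It is not Microsoft's tool and does not emit XML Schema; it only records which child elements and attributes were seen under each element and marks them required or optional, refining that guess as more samples arrive. The function name infer_structure and the expense samples are hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def infer_structure(documents):
    """Record, per element name, the children and attributes observed.

    A child or attribute counts as required only if every observed
    instance of the parent element carried it; otherwise it is optional.
    """
    seen = defaultdict(int)  # element name -> number of instances
    child_counts = defaultdict(lambda: defaultdict(int))
    attr_counts = defaultdict(lambda: defaultdict(int))

    for text in documents:
        root = ET.fromstring(text)
        for elem in root.iter():
            seen[elem.tag] += 1
            # Count each distinct child name once per parent instance.
            for child_tag in {child.tag for child in elem}:
                child_counts[elem.tag][child_tag] += 1
            for attr in elem.attrib:
                attr_counts[elem.tag][attr] += 1

    model = {}
    for tag, total in seen.items():
        model[tag] = {
            "children": {c: ("required" if n == total else "optional")
                         for c, n in child_counts[tag].items()},
            "attributes": {a: ("required" if n == total else "optional")
                           for a, n in attr_counts[tag].items()},
        }
    return model

# Hypothetical sample documents: the second one omits <note>.
samples = [
    "<expense id='1'><amount>12.50</amount><note>taxi</note></expense>",
    "<expense id='2'><amount>40.00</amount></expense>",
]
model = infer_structure(samples)
```

With these two samples the sketch concludes that amount is required and note is optional. It says nothing about datatypes, ordering, or repetition, which is where the real modeling decisions begin.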

One of the great strengths of XML Schema, for example, is its support for regular expressions, the protean pattern-matching technology that helped Perl dominate the first-generation dynamic Web. However, what is true for Perl and other regular-expression-savvy languages will also be true for XML Schema: Although it's tempting to use complex patterns, simple ones are best for maintenance and reuse.
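
As a rough illustration of a simple pattern, the Python sketch below checks a value against the kind of expression a schema author might put in a pattern facet. The purchase-order number format is a hypothetical example, and Python's re module only approximates XML Schema's regular-expression dialect.

```python
import re

# Hypothetical format: two capital letters, a hyphen, four digits.
# In a schema this could be written as a pattern facet, for example:
#   <xs:pattern value="[A-Z]{2}-[0-9]{4}"/>
PO_NUMBER = re.compile(r"[A-Z]{2}-[0-9]{4}")

def is_valid_po_number(value):
    # An XML Schema pattern must match the whole value,
    # so fullmatch is the closest analogue in Python.
    return PO_NUMBER.fullmatch(value) is not None
```

A pattern this short can be read at a glance and reused in other schemas, which is the point of keeping patterns simple.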

RDBMS experts who approach XML Schema will need to adapt their thinking in a number of ways. For example, in XML Schema, uniqueness constraints can apply at any level of a nested structure. XPath expressions are used to bind those constraints to their targets.
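
The scoping idea can be sketched in Python. In XML Schema, a uniqueness constraint declared on an element uses one XPath expression (the selector) to pick the nodes to compare and another (the field) to pick the value that must be unique. The code below mimics that with ElementTree's limited path support. The function duplicate_keys and the order data are hypothetical, and this is an approximation of the behavior, not a schema validator.

```python
import xml.etree.ElementTree as ET

def duplicate_keys(root, scope_path, selector_path, field_attr):
    """Return (scope element, value) pairs for values repeated within one scope."""
    duplicates = []
    for scope in root.findall(scope_path):
        seen = set()  # reset per scope: uniqueness is local to each one
        for node in scope.findall(selector_path):
            value = node.get(field_attr)
            if value is None:
                continue  # nodes without the field are not checked, as with xs:unique
            if value in seen:
                duplicates.append((scope, value))
            seen.add(value)
    return duplicates

# Hypothetical data: sku must be unique within each order, not across orders.
doc = ET.fromstring("""
<orders>
  <order id="1"><item sku="A1"/><item sku="B2"/></order>
  <order id="2"><item sku="A1"/><item sku="A1"/></order>
</orders>
""")

dups = duplicate_keys(doc, "order", "item", "sku")
report = [(scope.get("id"), value) for scope, value in dups]
```

Here sku A1 may appear in both orders, but its repetition inside order 2 is flagged. A relational unique index has no direct equivalent of that nested scope.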

Object-oriented programmers will appreciate the way in which XML Schema permits the derivation of specific types from more general ones. But they will also find, as elsewhere, that there are limits to the use of inheritance, and that design by composition -- rather than by derivation -- is often the better strategy.

XML Schema arguably ought to have been simpler. James Clark, who was technical lead of the XML working group and editor of the XPath and XSLT Recommendations, clearly thinks so. He has championed an alternative schema language, RELAX NG (Regular Language Description for XML, Next Generation), which is now on the Organization for the Advancement of Structured Information Standards (OASIS) and ISO standards tracks. RELAX NG aims to simplify the description of XML structures, but relies on XML Schema for the definition of datatypes.

There is a real danger that enterprises, seeing too many approaches to XML data modeling, will wait for the dust to settle. That would be a shame. Yes, it's a hard problem, but we'll have to tackle it sooner or later. Web services won't fly until we can usefully model real business documents. That's something we can only learn by doing in a hands-on laboratory such as Office 11.




  BOTTOM LINE
Putting XML Schema to work
EXECUTIVE SUMMARY
Unlocking Office 11's XML features means coming to grips with its data definition language, XML Schema. That won't be easy, but the sooner we start, the better. The future of Web services depends on our ability to model business documents in XML.

TEST CENTER PERSPECTIVE
Yes, XML Schema is complex, but some of the issues are more general. Even experts disagree on the best practices for object-oriented data modeling. Office 11 creates an environment in which we can start to codify those best practices as they apply to ordinary business documents.

