Chapter 1. Introduction to JBoss DNA

1.1. Use cases for JBoss DNA
1.2. What is metadata?
1.3. What is JCR?
1.4. Project roadmap
1.5. Development methodology
1.6. JBoss DNA modules

The JBoss DNA project is building a unified metadata repository system that is JCR-compliant and capable of federating information from a variety of back-end systems. To client applications, JBoss DNA looks and behaves like a regular JCR repository that they search, navigate, version, and listen for changes. But under the covers, JBoss DNA gets its content by federating multiple back-end systems (like databases, services, other repositories, etc.), allowing those systems to continue "owning" the information while ensuring the unified repository stays up-to-date and in sync. JBoss DNA also analyzes the content you put into the repository and turns it into information you can use more effectively.

This document goes into detail about JBoss DNA and its capabilities, features, architecture, components, extension points, security, configuration, and testing. So whether your a developer on the project, or you're trying to learn the intricate details of how JBoss DNA works, this document hopefully serves a good reference for developers on the project.

1.1. Use cases for JBoss DNA

JBoss DNA repositories can be used in a variety of applications. One of the most obvious ones is in provisioning and management, where it's critical to understand and keep track of the metadata for models, database, services, components, applications, clusters, machines, and other systems used in an enterprise. Governance takes that a step farther, by also tracking the policies and expectations against which performance can be verified. In these cases, a repository is an excellent mechanism for managing this complex and highly-varied information. But a JBoss DNA repository doesn't have to be large and complex: it could just manage configuration information for an application, or it could just provide a JCR interface on top of a couple of non-JCR systems.

1.2. What is metadata?

Before we dive into more detail about JBoss DNA and metadata repositories, it's probably useful to explain what we mean by the term "metadata." Simply put, metadata is the information you need to manage something. For example, it's the information needed to configure an operating system, or the description of the information in an LDAP tree, or the topology of your network. It's the configuration of an application server or enterprise service bus. It's the steps involved in validating an application before it can go into production. It's the description of your database schemas, or of your services, or of the messages going in and coming out of a service. JBoss DNA is designed to be a repository for all this (and more).

There are a couple of important things to understand about metadata. First, the majority of metadata is either found in or managed by other systems: databases, applications, file systems, source code management systems, services, and content management systems, and even other repositories. We can't pull the information out and duplicate it, because then we risk having multiple copies that are out-of-sync. But we do want to access it through a homogenous API, since that will make our lives significantly easier.

The answer to this apparent dichotomy is federation. We can connect to these back-end systems to dynamically access the content and project it into a single, unified repository. We can also cache it for faster access, as long as the cache can be invalidated based upon time or event. But we also need to maintain a clear picture of where all the bits come from, so users can be sure they're looking at the right information. And we need to make it as easy as possible to write new connectors, since there are a lot of systems out there that have information we want to federate.

The second important characteristic of the metadata is that a lot of it is represented as files, and there are a lot of different file formats. These include source code, configuration files, web pages, database schemas, XML schemas, service definitions, policies, documents, spreadsheets, presentations, images, audio files, workflow definitions, business rules, and on and on. And so even though information is added to the repository through files like these, the repository should be able to automatically extract the most useful content and place it in the repository where it can be much more easily used, searched, related, and analyzed. This process of extracting content and storing it in the repository is what JBoss DNA calls sequencing, and it's an important part of a metadata repository.

The third important characteristic of metadata is that it rarely stays the same. Different consumers of the information need to see different views of it. Metadata about two similar systems is not always the same. The metadata often needs to be tagged or annotated with additional information. And the things being described often change over time, meaning the metadata has to change, too. As a result, the way in which we store and manage the metadata has to be flexible and able to adapt, and the object model we use to interact with the repository must accommodate these needs. The graph-based nature of the JCR API provides this flexibility while also giving us the ability to constrain information when it needs to be constrained.

1.3. What is JCR?

There are a lot of choices for how applications can store information persistently so that it can be accessed at a later time and by other processes. The challenge developers face is how to use an approach that most closely matches the needs of their application. This choice becomes more important as developers choose to focus their efforts on application-specific logic, delegating much of the responsibilities for persistence to libraries and frameworks.

Perhaps one of the easiest techniques is to simply store information in files . The Java language makes working with files relatively easy, but Java really doesn't provide many bells and whistles. So using files is an easy choice when the information is either not complicated (for example property files), or when users may need to read or change the information outside of the application (for example log files or configuration files). But using files to persist information becomes more difficult as the information becomes more complex, as the volume of it increases, or if it needs to be accessed by multiple processes. For these situations, other techniques often have more benefits.

Another technique built into the Java language is Java serialization , which is capable of persisting the state of an object graph so that it can be read back in at a later time. However, Java serialization can quickly become tricky if the classes are changed, and so it's beneficial usually when the information is persisted for a very short period of time. For example, serialization is sometimes used to send an object graph from one process to another. Using serialization for longer-term storage of information is more risky.

One of the more popular and widely-used persistence technologies is the relational database. Relational database management systems have been around for decades and are very capable. The Java Database Connectivity (JDBC) API provides a standard interface for connecting to and interacting with relational databases. However, it is a low-level API that requires a lot of code to use correctly, and it still doesn't abstract away the DBMS-specific SQL grammar. Also, working with relational data in an object-oriented language can feel somewhat unnatural, so many developers map this data to classes that fit much more cleanly into their application. The problem is that manually creating this mapping layer requires a lot of repetitive and non-trivial JDBC code.

Object-relational mapping libraries automate the creation of this mapping layer and result in far less code that is much more maintainable with performance that is often as good as (if not better than) handwritten JDBC code. The new Java Persistence API (JPA) provide a standard mechanism for defining the mappings (through annotations) and working with these entity objects. Several commercial and open-source libraries implement JPA, and some even offer additional capabilities and features that go beyond JPA. For example, Hibernate is one of the most feature-rich JPA implementations and offers object caching, statement caching, extra association mappings, and other features that help to improve performance and usefulness. Plus, Hibernate is open-source (with support offered by JBoss).

While relational databases and JPA are solutions that work well for many applications, they are more limited in cases when the information structure is highly flexible, the structure is not known a priori, or that structure is subject to frequent change and customization. In these situations, content repositories may offer a better choice for persistence. Content repositories are almost a hybrid with the storage capabilities of relational databases and the flexibility offered by other systems, such as using files. Content repositories also typically provide other capabilities as well, including versioning, indexing, search, access control, transactions, and observation. Because of this, content repositories are used by content management systems (CMS), document management systems (DMS), and other applications that manage electronic files (e.g., documents, images, multi-media, web content, etc.) and metadata associated with them (e.g., author, date, status, security information, etc.). The Content Repository for Java technology API provides a standard Java API for working with content repositories. Abbreviated "JCR", this API was developed as part of the Java Community Process under JSR-170 and is being revised under JSR-283.

The JBoss DNA project is building unified metadata repository system that is compliant with JCR. Nearly all of these capabilities are to be hidden below the JCR API and involve automated processing of the information in the repository. Thus, JBoss DNA can add value to existing repository implementations. For example, JCR repositories offer the ability to upload files into the repository and have the file content indexed for search purposes. JBoss DNA also defines a library for "sequencing" content - to extract meaningful information from that content and store it in the repository, where it can then be searched, accessed, and analyzed using the JCR API.

JBoss DNA has other features as well. You can create federated repositories that dynamically merge the information from multiple databases, services, applications, and other JCR repositories. JBoss DNA also will allow you to create customized views based upon the type of data and the role of the user that is accessing the data. And yet another is to create a REST-ful API to allow the JCR content to be accessed easily by other applications written in other languages.

1.4. Project roadmap

The roadmap for JBoss DNA is managed in the project's JIRA instance . The roadmap shows the different tasks, requirements, issues and other activities that have been targeted to each of the upcoming releases. (The roadmap report always shows the next three releases.)

By convention, JIRA issues not immediately targeted to a release will be reviewed periodically to determine the appropriate release where they can be targeted. Any issue that is reviewed and that does not fit in a known release will be targeted to the Future Releases bucket.

At the start of a release, the project team reviews the roadmap, identifies the goals for the release, and targets (or retargets) the issues appropriately.

1.5. Development methodology

Rather than use a single formal development methodology, the JBoss DNA project incorporates those techniques, activities, and processes that are practical and work for the project. In fact, the committers are given a lot of freedom for how they develop the components and features they work on.

Nevertheless, we do encourage familiarity with several major techniques, including:

Agile software development includes those software methodologies (e.g., Scrum) that promote development iterations and open collaboration. While the JBoss DNA project doesn't follow these closely, we do emphasize the importance of always having running software and using running software as a measure of progress. The JBoss DNA project also wants to move to more frequent releases (on the order of 4-6 weeks)
Test-driven development (TDD) techniques encourage first writing test cases for new features and functionality, then changing the code to add the new features and functionality, and finally the code is refactored to clean-up and address any duplication or inconsistencies.
Behavior-driven development (BDD) is an evolution of TDD, where developers specify the desired behaviors first (rather than writing "tests"). In reality, this BDD adopts the language of the user so that tests are written using words that are meaningful to users. With recent test frameworks (like JUnit 4.4), we're able to write our unit tests to express the desired behavior. For example, a test class for sequencer implementation might have a test method shouldNotThrowAnErrorWhenStreamIsNull(), which is very easy to understand the intent. The result appears to be a larger number of finer-grained test methods, but which are more easily understood and easier to write. In fact, many advocates of BDD argue that one of the biggest challenges of TDD is knowing what tests to write in the beginning, whereas with BDD the shift in focus and terminology make it easier for more developers to enumerate the tests they need.
Lean software development is an adaptation of lean manufacturing techniques, where emphasis is placed on eliminating waste (e.g., defects, unnecessary complexity, unnecessary code/functionality/features), delivering as fast as possible, deferring irrevocable decisions as much as possible, continuous learning (continuously adapting and improving the process), empowering the team (or community, in our case), and several other guidelines. Lean software development can be thought of as an evolution of agile techniques in the same way that behavior-driven development is an evolution of test-driven development. Lean techniques help the developer to recognize and understand how and why features, bugs, and even their processes impact the development of software.

1.6. JBoss DNA modules

JBoss DNA consists of the following modules:

dna-common is a low-level library of common utilities and frameworks, including logging, progress monitoring, internationalization/localization, text translators, component management, and class loader factories.
dna-graph defines the graph Application Programming Interface (API) and Service Provider Interface (SPI) for DNA, including the repository connectors, sequencers, graph interfaces, and MIME type detectors.
dna-repository is the main module and provides the repository-oriented services, including the Repository Service, Sequencing Service, Observation Service, and Rules Service.
dna-jcr provides the JBoss DNA implementation of the JCR API, which relies upon a repository connector, such as the Federation Connector (see dna-connector-federation ).
dna-integration-tests provides a home for all of the integration tests that involve more components that just unit tests. Integration tests are often more complicated, take longer, and involve testing the integration and functionality of many components (whereas unit tests focus on testing a single class or component and may use stubs or mock objects for other components).

The following modules are optional extensions that may be used selectively and as needed (and are located in the source under the extensions/ directory):

dna-maven-classloader is a small library that provides a ClassLoaderFactory implementation that can create java.lang.ClassLoader instances capable of loading classes given a Maven Repository and a list of Maven coordinates. The Maven Repository can be managed within a JCR repository.
dna-connector-federation is a DNA repository connector that federates, integrates and caches information from multiple sources (via other repository connectors).
dna-connector-inmemory is a simple DNA repository connector that manages content within memory. This can be used as a simple cache or as a transient repository.
dna-connector-jbosscache is a DNA repository connector that manages content within a JBoss Cache instance. JBoss Cache is a powerful cache implementation that can serve as a distributed cache and that can persist information. The cache instance can be found via JNDI or created and managed by the connector.
dna-sequencer-zip is a DNA sequencer that extracts from ZIP archives the files (with content) and folders.
dna-sequencer-images is a DNA sequencer that extracts the image metadata (e.g., size, date, etc.) from PNG, JPEG, GIF, BMP, PCS, IFF, RAS, PBM, PGM, and PPM image files.
dna-sequencer-mp3 is a DNA sequencer that extracts metadata (e.g., author, album name, etc.) from MP3 audio files.
dna-sequencer-java is a DNA sequencer that extracts the package, class/type, member, documentation, annotations, and other information from Java source files.
dna-sequencer-msoffice is a DNA sequencer that extracts metadata and summary information from Microsoft Office documents. For example, the sequencer extracts from a PowerPoint presentation the outline as well as thumbnails of each slide. Microsoft Word and Excel files are also supported.
dna-sequencer-cnd is a DNA sequencer that extracts JCR node definitions from JCR Compact Node Definition (CND) files.
dna-mimetype-detector-aperture is a DNA MIME type detector that uses the Aperture library to determine the best MIME type from the filename and file contents.

There are also documentation modules (located in the source under the docs/ directory):

docs-getting-started is the project with the DocBook source for the JBoss DNA Getting Started document.
docs-getting-started-examples is the project with the Java source for the example application used in the JBoss DNA Getting Started document.
docs-reference-guide is the project with the DocBook source for this document, the JBoss DNA Reference Guide document.

Finally, there is a module that represents the whole JBoss DNA project:

dna is the parent project that aggregates all of the other projects and that contains some asset files to create the necessary Maven artifacts during a build.

Each of these modules is a Maven project with a group ID of org.jboss.dna . All of these projects correspond to artifacts in the JBoss Maven 2 Repository .