Modelling the Real World: How to Avoid an Implicit Schema

Front matter: Who cares about schema?

The world is getting more schemaless. Databases with no schema have become more and more common, particularly since 2009 with the launch of MongoDB. Relational databases, which generally have a strict schema, are over forty years old; MongoDB, at seven, is already the fourth most popular database management system behind Oracle, MySQL, and SQL Server.

[Figure: Google Trends chart for the search term “mongodb”]

Mongo’s popularity may be explained by its schemalessness, or more likely its scalability, and the general trend toward online business models that require a large scale. Users of MongoDB include LinkedIn, Craigslist, and eBay, all of which could be classified as Big Data companies, or at least ones with a lot of data where traditional databases don’t make sense.

Regardless, MongoDB may be the most popular way to write applications that don’t have schemas either. And, while schemalessness may be an excellent quality for a database, it is usually not a good quality for an application.

Side matter: What is a schemaless database?

A schemaless database allows you to write data with any set of fields without defining them in advance and without migrating old data in the same collection. Applications written on schemaless databases can persist complex structures quickly and adjust to new fields without affecting old ones. This lets an application get off the ground and keep up with changes in its environment. Whether it is storing complex legal documents, 3D models, or the output from the Twitter API, a schemaless database will persist the data and let you write code on it quickly.
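As a rough sketch of that behaviour, the Java below uses an in-memory list of maps as a stand-in for a schemaless collection. The class name SchemalessCollection, its methods, and the sample fields ("title", "pages") are hypothetical and invented for illustration; this is not the API of MongoDB or any other database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a schemaless collection.
class SchemalessCollection {
    private final List<Map<String, Object>> documents = new ArrayList<>();

    // Accepts any set of fields; nothing is declared in advance.
    void insert(Map<String, Object> document) {
        documents.add(new HashMap<>(document));
    }

    int size() {
        return documents.size();
    }

    Map<String, Object> get(int index) {
        return documents.get(index);
    }

    // Demo: an old document and a newer one with an extra field coexist.
    static SchemalessCollection demo() {
        SchemalessCollection legalDocs = new SchemalessCollection();

        Map<String, Object> older = new HashMap<>();
        older.put("title", "Lease agreement");
        legalDocs.insert(older);

        Map<String, Object> newer = new HashMap<>();
        newer.put("title", "Service contract");
        newer.put("pages", 12); // new field; the older document is untouched
        legalDocs.insert(newer);

        return legalDocs;
    }
}
```

Nothing in the collection records that "pages" was ever introduced, which is exactly the convenience being described.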

Schemaless applications and schemaless databases

Let’s look at the difference between a truly schemaless application built on a schemaless database and one that has a schema that is neither written down nor checked against the data.

A truly schemaless application would need to be agnostic toward the fields it uses. The effectiveness of this application does not depend on the specific data it’s storing. Take a transaction log warehousing application, for instance. It makes sense for a log-storing app to be truly schemaless. It doesn’t matter what the data looks like, only that it gets saved in a structured way. Customers are going to send structured data and retrieve and view it through their own viewer, or in raw form. Certain common fields might be used directly, and the whole thing might be put into a search index.

This is a rare application, however. Most applications use fields explicitly and depend on them to be effective.

The cake decorating machine

Suppose you have a birthday cake machine that writes greetings on cakes in an assembly line. The machine gets a greeting field and a border field. The data comes out of a schemaless database. A sample document looks like this:

{
  "greeting": "Happy Birthday Mikey",
  "border": "squiggly"
}

One day the greeting field, which has always been there, is missing, and the cake decorating machine stops decorating, unable to go forward because the next cake is already under the nozzle.

It’s fair to say the cake assembly line had a schema and it included what to write on the cake. The fact that the data was stored in a schemaless database doesn’t change the incomputability of a missing field. In other words, the fact that someone chose a schemaless database doesn’t mean the developer’s intention was “anything goes.” Chances are information was intended to be there and the application runs out of viable options without it.

This is a description of a system that most likely has an implicit schema. It’s a schema that’s not checked explicitly by the system in advance of using the data. It is implied by the code and is “checked” at any given time, sometimes at the last moment. This “last moment” effect really means that in practice it is not being checked by the code, but by the domain. It’s being checked by the real world.
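A minimal sketch of that last-moment check, assuming the machine reads each cake document as a plain map. The class CakeMachine, its methods, and the upper-casing step are all hypothetical and stand in for whatever the real machine does with the greeting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical decorating machine with an implicit schema.
class CakeMachine {
    final List<String> decorated = new ArrayList<>();

    // The schema lives only here, at the moment of use.
    void decorate(Map<String, Object> cake) {
        String greeting = (String) cake.get("greeting"); // null if the field is absent
        String border = (String) cake.get("border");
        // Throws NullPointerException when greeting is missing:
        // by now the cake is already under the nozzle.
        decorated.add(greeting.toUpperCase() + " [" + border + "]");
    }

    // Runs the line and returns how many cakes were finished before it stopped.
    int runLine(List<Map<String, Object>> cakes) {
        try {
            for (Map<String, Object> cake : cakes) {
                decorate(cake);
            }
        } catch (NullPointerException e) {
            // The "schema check" happened in the real world, mid-run.
        }
        return decorated.size();
    }

    // Helper for building sample documents; a null greeting leaves the field out.
    static Map<String, Object> cake(String greeting, String border) {
        Map<String, Object> doc = new HashMap<>();
        if (greeting != null) {
            doc.put("greeting", greeting);
        }
        doc.put("border", border);
        return doc;
    }
}
```

Feeding the line three cakes where the second has no greeting finishes one cake and then stops. The third cake, which was perfectly valid, is never reached.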

The Schema in the Code

The root of implicit schema is the schemaless data source. And this makes perfect sense. If a developer is going to represent a data structure that could change, the tendency is to represent it — the changing data — with a changing data structure. The problem is that ideally we’re not representing the data itself. We’re representing something in the world. And whether or not your domain-agnostic database cares that there’s a greeting field, your domain probably does.

Mapping the data to an internal representation with an Object Document Mapping (ODM) library is one way to solve this, but there are issues with ODM libraries too. They will allow you to code a truly schemaless application like a log warehouse (as they should). There’s nothing to prevent a developer from binding schemaless data through an ODM library to schemaless collections like Maps (associative arrays, hashmaps, dictionaries, etc.) in the code. In this case the problem just gets more subtle.

Suppose you want your cake business to move into round cakes and you need a radius field. There’s no need to change your database schema or your application’s schema. Simply put the new field into the database. The mapper will populate the map and you’re free to use it. Nowhere, except in the line where it’s used to run the machine, do you need to write the word “radius.” You could have 100,000 lines of code on the round cake business and only one mention of “radius.”

Suppose we decided to use an ODM library because of the cake greeting disaster, but we’ve decided it’s too much to list all the information involved in the round cake business. In this case, it may not be intuitive that the schema is still implicit. Let’s write a cake class for the ODM and add a subdocument:

import java.util.HashMap;

class Cake {

  // Whew. Learned my lesson here.
  // Explicit greeting variable.
  String greeting;

  // Trusty border. Always there.
  String border;

  // Work stoppages. Got to move into
  // the circle cake business quick.
  // The mapper puts the whole subdocument in here, unchecked.
  HashMap<String, Object> circleCakeStuff; // <-- Implicitness forming here.
}

Anything that’s in the “circleCakeStuff” subdocument in the database will go into the map, whether it’s nothing, “radius,” several other subdocuments, or the contents of Moby Dick with one field per word.

The value of circleCakeStuff:

{
  "Call": "",
  "me": "",
  "Ishmael.": "",
  "Can": "",
  "you": "",
  "make": "",
  "cakes": "",
  "with": "",
  "this?": ""
}

There is nothing to say that the intention was to populate the map with a radius value. And the only way to tell if a radius is there is to start decorating cakes.
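One way to make the intention explicit, sketched under the assumption that the raw subdocument arrives as a map, is to bind it to a small typed class when the document is loaded. The class CircleCakeSpec and its fromDocument method are hypothetical names, not part of any ODM library.

```java
import java.util.Map;

// Hypothetical explicit replacement for the circleCakeStuff map.
class CircleCakeSpec {
    final double radius;

    private CircleCakeSpec(double radius) {
        this.radius = radius;
    }

    // Binds the raw subdocument once, at load time, and rejects anything
    // that does not carry a usable radius.
    static CircleCakeSpec fromDocument(Map<String, Object> circleCakeStuff) {
        Object raw = circleCakeStuff.get("radius");
        if (!(raw instanceof Number)) {
            throw new IllegalArgumentException(
                "circleCakeStuff.radius must be a number, got: " + raw);
        }
        double radius = ((Number) raw).doubleValue();
        if (radius <= 0) {
            throw new IllegalArgumentException(
                "circleCakeStuff.radius must be positive, got: " + radius);
        }
        return new CircleCakeSpec(radius);
    }
}
```

With this in place the Moby Dick document is rejected when it is read, with an error that names the missing field, rather than when a cake is waiting to be decorated.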

Absent != null

A null value in a database has meaning. It is a valid and useful value. Among other properties, null can’t be represented in the underlying data type. There is no integer null or string null the way there is an integer -1 or 0, or a string “” or “%%seriously, don’t use this%%.”

In an object relational mapper, and in a programming language, null also has a meaning (usually it means your code’s going to break), and it’s a different meaning from the data value null. In schemaless object document mapping the puzzle is even greater, because a field will be null in the code if it is null in the database or if it is absent altogether. A null field and an absent field are two very different conditions. If you have a survey in which null in the data means the person didn’t answer the question, you wouldn’t want a completely empty document with no questions to represent someone not filling in the height field. The source of a blank survey is probably a photocopier, not a person who doesn’t want to fill in particular answers. A missing field (the survey question isn’t in the data at all) suggests the application doesn’t know about the field. It doesn’t suggest a known field with a valid value.

It is also hard to represent absence in a programming language. (It was already hard to represent null, the data value, other than with null, the pointer.) With an implicit schema you have twice as many ways to achieve a null object in your code, and fewer ways to tell the difference.
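The distinction can be kept if the code looks at the raw document before it is flattened into fields. The sketch below assumes the survey arrives as a java.util.Map; SurveyReader and FieldState are hypothetical names introduced for illustration.

```java
import java.util.Map;

// Hypothetical three-way reading of a survey field.
class SurveyReader {
    enum FieldState { ABSENT, NULL, PRESENT }

    // containsKey separates "the key is missing" from "the key maps to null";
    // Map.get alone returns null in both cases.
    static FieldState stateOf(Map<String, Object> document, String field) {
        if (!document.containsKey(field)) {
            return FieldState.ABSENT;  // the survey never asked this question
        }
        if (document.get(field) == null) {
            return FieldState.NULL;    // asked, but not answered
        }
        return FieldState.PRESENT;     // asked and answered
    }
}
```

Once the document has been copied into an object whose height field is simply null, this information is gone, which is why the check has to happen at the boundary.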

Reasons schemas make coding easier

  1. Compile time errors for misspelled or misplaced fields.
  2. Explicitness. One place to check what your data looks like.
  3. Fail fast architecture.
  4. Stack traces from unhandled exceptions will have the name of the missing field instead of a null pointer and a line number.
  5. Encapsulation. Functions will not “leak” into each other.
  6. Fewer lines of code. Accessing a field in an object is generally shorter and easier to read.
  7. Reasoning. The type of an object can tell you about how it should be used.
  8. Performance. Checking for a field once is faster than checking it several times.

To further the point that schemaless databases shift a responsibility somewhere else rather than obviating it, imagine someone is explaining document databases to you. They say, “As your needs change you can store data with an extra field without error.” Great, sounds good. Now imagine you’ve crafted a document about toasters and someone adds a field called tail_color. That’s insanity! Suppose they do this without error. The same phrase suddenly seems frightening, but these two situations are the same from the database’s perspective, and it will indeed not complain about toasters with tails.

In a schemaless stack, something has to take over the responsibility of keeping track of what goes in and out of the database. It could be something that logs unintended data; it could go all the way to an Apple golden master hard drive. The choice is application-specific. (A ten-line application doesn’t need a schema.) Once your code is too large to glance at, it might be time to start thinking schema. If not, you might be composing your epitaph, in radiusless frosting.
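At the lightweight end of that range, the "something that logs unintended data" could be as small as the sketch below. FieldAudit and unintendedFields are hypothetical names, and the set of known fields is assumed to be maintained by the application; what to do with the result (log it, reject the write, alert someone) is left open.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical guard that reports fields the application does not know about.
class FieldAudit {
    // Returns the document's field names that are not in the known set, sorted.
    static Set<String> unintendedFields(Map<String, Object> document,
                                        Set<String> knownFields) {
        Set<String> extras = new TreeSet<>(document.keySet());
        extras.removeAll(knownFields);
        return extras;
    }
}
```

For a toaster document whose known fields are, say, slots and wattage, an added tail_color comes back as the only unintended field.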

Solutions

Solutions for checking and transmitting schema outside of a database include Avro, Thrift, and Protocol Buffers from Apache, Facebook, and Google respectively. Another project, called Variety, analyzes what fields are stored in MongoDB and gives the percentage of documents that have each field. And Twitter has Diffy, a regression testing app that will tell you if changes to your code significantly change the data in your endpoints.

Source: Washington Developer Blog
