Published on September 2nd, 2009 at 8:32

How to write a LINQ provider - the simple way, Part II

This is the second part of a two-part series of posts. Read the first part for a very short introduction to re-linq; for more background, read Stefan Wenig’s post or my whitepaper.

As promised, here’s an introduction to the steps that need to be taken to implement a LINQ provider using re-linq.

Interfaces to start with

First, let’s take a look at the classes and interfaces LINQ and re-linq require you to implement. To start with, you need to provide an implementation of IQueryable<T>. That’s LINQ’s main query interface, and all of LINQ’s query methods, such as Queryable.Where, Queryable.OrderBy, or Queryable.Select, are written against it. re-linq provides a base class, QueryableBase<T>, from which you can derive to implement this interface. Doing so is fairly trivial: it only requires adding two constructors – one used by your provider’s clients, one used by the LINQ infrastructure in the .NET framework.

Then, you need an implementation of IQueryProvider. LINQ query methods use this interface to create new queries around an existing IQueryable<T> and to actually execute queries. For example, a call to Queryable.Where will take an existing query and wrap its expression so that it now represents a query with a where clause. A call to Queryable.Single will use the IQueryProvider.Execute method to actually execute the query. Enumerating queries will also delegate to IQueryProvider.Execute.
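The wrapping behavior described above can be observed without writing a provider at all. The following is a minimal sketch that uses the in-memory queryable returned by AsQueryable as a stand-in for a real provider; the integer data and the QueryWrappingDemo class are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

static class QueryWrappingDemo
{
    // Returns the name of the outermost query method recorded in the query's expression tree.
    public static string OutermostMethod(IQueryable query)
    {
        return ((MethodCallExpression) query.Expression).Method.Name;
    }

    public static void Main()
    {
        // Illustrative in-memory source; a real provider would hand out its own IQueryable<T>.
        IQueryable<int> source = new[] { 5, 12, 20 }.AsQueryable();

        // Queryable.Where filters nothing yet. It asks the provider to create a new query
        // whose expression is a call to Where wrapped around the source's expression.
        IQueryable<int> filtered = source.Where(n => n > 10);
        var whereCall = (MethodCallExpression) filtered.Expression;

        Console.WriteLine(OutermostMethod(filtered));
        Console.WriteLine(ReferenceEquals(whereCall.Arguments[0], source.Expression));

        // Queryable.First builds a First(...) call expression and passes it to
        // IQueryProvider.Execute, which is the point where the query actually runs.
        Console.WriteLine(filtered.First());
    }
}
```

A custom provider sees exactly these nested method call expressions; turning them into something more manageable is the job re-linq takes over.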

re-linq provides an abstract base class, QueryProviderBase, and a default implementation, DefaultQueryProvider, which implement the IQueryProvider interface. Usually, DefaultQueryProvider is completely sufficient, so QueryableBase<T> uses that implementation by default.
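As a rough sketch of what deriving from QueryableBase<T> can look like: the code below assumes the constructor signatures of the current Remotion.Linq NuGet package, which may differ from the build this post was written against. SampleQueryable<T> is a hypothetical name.

```csharp
using System.Linq;
using System.Linq.Expressions;
using Remotion.Linq;
using Remotion.Linq.Parsing.Structure;

// Hypothetical queryable for an imaginary target system.
public class SampleQueryable<T> : QueryableBase<T>
{
    // Used by client code: supplies the query parser and the provider-specific executor.
    // QueryableBase<T> wraps both in a DefaultQueryProvider.
    public SampleQueryable(IQueryParser queryParser, IQueryExecutor executor)
        : base(queryParser, executor)
    {
    }

    // Used by the LINQ infrastructure whenever a query method such as Where
    // creates a new query around an existing one.
    public SampleQueryable(IQueryProvider provider, Expression expression)
        : base(provider, expression)
    {
    }
}
```

Under the same assumption, client code would create the query root with something like new SampleQueryable<Order> (QueryParser.CreateDefault(), executor), where executor is your IQueryExecutor implementation.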

While DefaultQueryProvider implements the query creation part of IQueryProvider, it of course cannot pre-implement the actual execution of a query against the target query system. Instead, it does the following:

- First, it parses the query which is to be executed into a QueryModel. That’s a structured, interlinked object model defined by re-linq, which is much easier to understand and to transform than the native LINQ expression trees. If you’re interested in how the parsing works, take a look at the QueryParser class and the expression node parsers.
- Then, it passes the QueryModel on to an implementation of IQueryExecutor.

IQueryExecutor is an interface representing the details of executing a query against a target queryable system. This means it needs to be implemented by you, of course, since you are the one who knows how to build queries for that system.

IQueryExecutor and result operators

When you take a look at IQueryExecutor, you can see that it has three methods: ExecuteScalar, ExecuteSingle, and ExecuteCollection.

Let’s start with ExecuteCollection, since that is the simplest of the three methods. Take a look at the following code:

var query = from o in QueryFactory.CreateLinqQuery<Order>()
            where o.OrderNumber > 10
            select o;

foreach (var order in query)
{
  Console.WriteLine (order.OrderNumber);
}

When you execute that code, the query is enumerated and expected to return a collection (or sequence) of items. That’s why IQueryExecutor.ExecuteCollection() is called for that query (at least when the object returned by QueryFactory.CreateLinqQuery<T>() is based on QueryableBase<T>). ExecuteCollection is passed a QueryModel that has exactly one MainFromClause, one WhereClause, and one SelectClause. In short, the QueryModel directly corresponds to the LINQ query written above.

Now, what about ExecuteSingle and ExecuteScalar? Take a look at the following two queries:

var count = (from o in QueryFactory.CreateLinqQuery<Order>()
             where o.OrderNumber > 10
             select o).Count();

var item = (from o in QueryFactory.CreateLinqQuery<Order> ()
            where o.OrderNumber > 10
            select o).First ();

These two queries are different in that they are not expected to return collections. Instead, they are expected to return scalar, calculated values and single items from the sequence, respectively. Their QueryModels have operators attached to them that represent the calculation or single item selection. re-linq calls those ResultOperators.

The first query has a CountResultOperator, which represents a scalar value calculated from the query’s result sequence; therefore, IQueryExecutor.ExecuteScalar is called to execute it. Other scalar operators are LongCountResultOperator, ContainsResultOperator, SumResultOperator, and AverageResultOperator.

The second query has a FirstResultOperator, which represents a single item that is selected from the result sequence; therefore, IQueryExecutor.ExecuteSingle is called to execute it. Other single operators are SingleResultOperator, LastResultOperator, MinResultOperator, and MaxResultOperator. All of those choose a single item from the query sequence, so all of them are treated the same way. Note that even when those operators return a scalar value because the query returns a sequence of scalar values, they still invoke ExecuteSingle because a single item is chosen from the list rather than calculated.
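To show how the three methods relate, here is a sketch of an executor that funnels everything through ExecuteCollection. It assumes the IQueryExecutor signatures of the current Remotion.Linq package, and it assumes that the translated query already applies the result operator in the target system, so scalar and single queries come back as a one-element sequence. RunTargetQuery is a hypothetical hook, not part of re-linq.

```csharp
using System.Collections.Generic;
using System.Linq;
using Remotion.Linq;

// Sketch of an executor whose scalar and single-item methods delegate to ExecuteCollection.
public abstract class DelegatingExecutorBase : IQueryExecutor
{
    // Hypothetical hook: translate the QueryModel (including its result operators)
    // and run the resulting query against the target system.
    protected abstract IEnumerable<T> RunTargetQuery<T>(QueryModel queryModel);

    public IEnumerable<T> ExecuteCollection<T>(QueryModel queryModel)
    {
        return RunTargetQuery<T>(queryModel);
    }

    public T ExecuteSingle<T>(QueryModel queryModel, bool returnDefaultWhenEmpty)
    {
        // The target query is expected to yield at most one item here.
        var items = ExecuteCollection<T>(queryModel);
        return returnDefaultWhenEmpty ? items.SingleOrDefault() : items.Single();
    }

    public T ExecuteScalar<T>(QueryModel queryModel)
    {
        // The target query is expected to yield exactly one calculated value.
        return ExecuteCollection<T>(queryModel).Single();
    }
}
```

Whether this shortcut is appropriate depends on the target system; a provider that cannot express Count or First natively would need to treat the three methods differently.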

Translating queries

For many target queryable systems it will be possible to simply implement ExecuteCollection and just delegate to that from ExecuteSingle or ExecuteScalar. For others, it might be important to take note of the semantic differences. Whichever path you follow, you’ll eventually have to ask one important question: “How the heck do I create a query in my target system’s format from a QueryModel?”

And the answer is, of course, “That depends on your target system!” :)

However, re-linq gives you two important tools to do so: IQueryModelVisitor and RelinqExpressionVisitor.

The first of those two visitors operates on a large scale: it provides a way to execute specific code for each clause within a QueryModel, allowing you to translate one clause at a time. You can collect the partial results of your translations, and finally make one query for your target system from those parts.

The simplest way to make use of IQueryModelVisitor is to derive from QueryModelVisitorBase. That class implements the interface by automatically iterating over sub-clauses and collections, dispatching to the correct visitor methods for every element of the query. It’s also hardened against modifications of the QueryModel being iterated, but more about this later. Simply override its Visit... methods for the query components you want to handle, and generate your target query parts accordingly. Note that you need to handle all the clauses, result operators, and so on defined by re-linq. If you don’t at least throw an exception for those constructs you simply cannot translate, you’ll get invalid query translations.
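A clause-by-clause visitor along these lines could be sketched as follows. The method signatures are those of the current Remotion.Linq package and may differ in older builds; the output is a made-up textual description rather than a real target query, and DescribingQueryModelVisitor is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using Remotion.Linq;
using Remotion.Linq.Clauses;

// Collects one text fragment per clause and rejects everything it cannot handle.
public class DescribingQueryModelVisitor : QueryModelVisitorBase
{
    private readonly List<string> _parts = new List<string>();

    public static string Describe(QueryModel queryModel)
    {
        var visitor = new DescribingQueryModelVisitor();
        visitor.VisitQueryModel(queryModel);
        return string.Join(" ", visitor._parts);
    }

    public override void VisitMainFromClause(MainFromClause fromClause, QueryModel queryModel)
    {
        _parts.Add("from " + fromClause.ItemName + " in " + fromClause.ItemType.Name);
        base.VisitMainFromClause(fromClause, queryModel);
    }

    public override void VisitWhereClause(WhereClause whereClause, QueryModel queryModel, int index)
    {
        // A real provider would translate the predicate with an expression visitor here.
        _parts.Add("where " + whereClause.Predicate);
        base.VisitWhereClause(whereClause, queryModel, index);
    }

    public override void VisitSelectClause(SelectClause selectClause, QueryModel queryModel)
    {
        _parts.Add("select " + selectClause.Selector);
        base.VisitSelectClause(selectClause, queryModel);
    }

    // Constructs this sketch cannot translate are rejected explicitly instead of being
    // silently skipped. Joins and group joins would need the same treatment.
    public override void VisitAdditionalFromClause(AdditionalFromClause fromClause, QueryModel queryModel, int index)
    {
        throw new NotSupportedException("Additional from clauses are not supported.");
    }

    public override void VisitOrderByClause(OrderByClause orderByClause, QueryModel queryModel, int index)
    {
        throw new NotSupportedException("Ordering is not supported.");
    }

    public override void VisitResultOperator(ResultOperatorBase resultOperator, QueryModel queryModel, int index)
    {
        throw new NotSupportedException("Result operator not supported: " + resultOperator);
    }
}
```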

While you’re visiting the clauses and result operators, you’ll notice that some of them contain LINQ Expressions. For example, WhereClause.Predicate contains an Expression, SelectClause.Selector does, and even MainFromClause.FromExpression is an expression tree. Now, didn’t I say earlier that LINQ expressions are inherently complex and hard to understand?

They are, but the expressions you can find in re-linq’s clauses have already been simplified. In them,

- references to outer variables (closures) and other evaluatable expressions have already been pre-evaluated into constants,
- sub-queries have been parsed and replaced by QueryModels wrapped in SubQueryExpressions, and, most importantly,
- transparent identifiers have been removed and references to query sources (from clauses, joins) have been replaced by QuerySourceReferenceExpressions, which link back to the respective query source.

Therefore, the expressions you find in re-linq’s clauses are usually quite straightforward to translate to the target query system. Depending on the target query system, of course.
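To see why the pre-evaluation of closures matters, consider what a captured variable looks like in a raw expression tree. The sketch below uses only System.Linq.Expressions; the Evaluate helper is a crude illustration of the idea and not re-linq’s actual partial evaluator.

```csharp
using System;
using System.Linq.Expressions;

static class ClosureDemo
{
    // Builds a predicate that captures a variable, the way a typical LINQ query does.
    public static Expression<Func<int, bool>> BuildPredicate(int threshold)
    {
        return n => n > threshold;
    }

    // Evaluates a sub-expression that has no free parameters and wraps the result in a constant.
    public static ConstantExpression Evaluate(Expression expression)
    {
        object value = Expression.Lambda(expression).Compile().DynamicInvoke();
        return Expression.Constant(value, expression.Type);
    }

    public static void Main()
    {
        var predicate = BuildPredicate(10);
        var comparison = (BinaryExpression) predicate.Body;

        // The right-hand side is not a constant: it is a member access on a
        // compiler-generated closure object that holds the captured value.
        Console.WriteLine(comparison.Right.NodeType);

        // After evaluation, a translator only has to deal with a plain constant.
        Console.WriteLine(Evaluate(comparison.Right).Value);
    }
}
```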

To implement the translation of expressions, you derive a class from ExpressionTreeVisitor or, better, ThrowingExpressionTreeVisitor. Both of them are meant to iterate over an expression tree and to visit each of the nodes in the tree, but ThrowingExpressionTreeVisitor throws an exception for unsupported node types by default.

Simply override the Visit... methods for those node types you want to support, and generate a semantically equivalent query element for your target query system. Then, from your IQueryModelVisitor, take the elements and integrate them into the current query part.
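The pattern can be sketched without re-linq by using the BCL’s System.Linq.Expressions.ExpressionVisitor as a stand-in for re-linq’s visitor base classes, whose method names differ. The filter syntax, the Order class, and FilterTextVisitor are invented for illustration; unlike a re-linq translator, this sketch sees a lambda parameter where re-linq would supply a QuerySourceReferenceExpression.

```csharp
using System;
using System.Linq.Expressions;
using System.Text;

class Order
{
    public int OrderNumber { get; set; }
}

// Translates a small subset of expression nodes into a made-up textual filter
// syntax and throws for every node type it does not explicitly support.
sealed class FilterTextVisitor : ExpressionVisitor
{
    private readonly StringBuilder _text = new StringBuilder();

    public static string Translate(Expression expression)
    {
        var visitor = new FilterTextVisitor();
        visitor.Visit(expression);
        return visitor._text.ToString();
    }

    public override Expression Visit(Expression node)
    {
        if (node == null)
            return base.Visit(node);

        switch (node.NodeType)
        {
            case ExpressionType.GreaterThan:
            case ExpressionType.Equal:
            case ExpressionType.AndAlso:
            case ExpressionType.MemberAccess:
            case ExpressionType.Parameter:
            case ExpressionType.Constant:
                return base.Visit(node);
            default:
                throw new NotSupportedException("Cannot translate node type " + node.NodeType);
        }
    }

    protected override Expression VisitBinary(BinaryExpression node)
    {
        _text.Append("(");
        Visit(node.Left);
        _text.Append(OperatorText(node.NodeType));
        Visit(node.Right);
        _text.Append(")");
        return node;
    }

    protected override Expression VisitMember(MemberExpression node)
    {
        Visit(node.Expression);
        _text.Append(".").Append(node.Member.Name);
        return node;
    }

    protected override Expression VisitParameter(ParameterExpression node)
    {
        _text.Append(node.Name);
        return node;
    }

    protected override Expression VisitConstant(ConstantExpression node)
    {
        // No quoting or escaping: good enough for numeric constants only.
        _text.Append(node.Value);
        return node;
    }

    private static string OperatorText(ExpressionType nodeType)
    {
        switch (nodeType)
        {
            case ExpressionType.GreaterThan: return " > ";
            case ExpressionType.Equal: return " = ";
            default: return " and ";
        }
    }
}

static class FilterTextDemo
{
    public static void Main()
    {
        Expression<Func<Order, bool>> predicate = o => o.OrderNumber > 10;
        Console.WriteLine(FilterTextVisitor.Translate(predicate.Body));
    }
}
```

Throwing by default is the important part: a predicate such as o => o.OrderNumber + 1 > 10 fails loudly instead of producing a wrong filter.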

All of this works just fine. Unless, of course, you encounter a construct that’s just way incompatible with your target query system. What now, throw a NotSupportedException? Realistically, you’ll have to do that, sometimes. But in other cases, it would actually be possible to support some of these constructs, although you’d have to simulate them using other query mechanisms… somehow…

Transforming queries

For example, your target query system might not support sub-queries in from clauses. But sometimes, sub-queries in from clauses can be flattened, thus turning the unsupported query into a supported one.

Or, in other scenarios, you might want to move a Where clause from one side of a join to the other side in order to avoid creating a dependent sub-query. Or you might want to detect group clauses with aggregates if those are well-translatable into your target query system.

While re-linq does not – and cannot – pre-implement all conceivable query model transformations, it does provide a lot of infrastructural support for them. Here’s a list of what we do in order to make transformations less difficult:

- Apart from QuerySourceReferenceExpressions, there are no ordering dependencies between clauses in a QueryModel. You can simply remove clauses from the model, move them around, or insert new ones without any problems. Only when QuerySourceReferenceExpressions reference those clauses do you need to be more careful: usually, referenced query sources must stay in the query, prior to the point where they are referenced, or the references must be updated (see below).
- All properties of clauses are settable, i.e., it’s easy to replace a WhereClause’s predicate or change an AdditionalFromClause’s item name.
- If both the original and the transformed QueryModel must be retained, the QueryModel.Clone() method provides a simple way of generating a deep copy (including clones of all query elements) of the QueryModel before it is transformed.
- QueryModel.TransformExpressions() provides an easy-to-use mechanism to transform all expressions held by a query model in one go.
- ReferenceReplacingExpressionTreeVisitor provides an easy-to-use mechanism to replace references to query sources after they were modified or removed, even across sub-queries. Use it in combination with QueryModel.TransformExpressions() whenever replacing a query source or moving a clause from one QueryModel to another.
- ExpressionTreeVisitor supports custom modification of the expression tree being visited. Simply return new nodes from any of its Visit... methods, and ExpressionTreeVisitor will automatically create an expression tree containing your new nodes.
- QueryModelVisitorBase is hardened against changes made to the QueryModel while it is being visited. This means that from any QueryModelVisitorBase.Visit... method, you can modify any element of the QueryModel without having to fear exceptions because you’ve just modified a collection being iterated.
- Whenever you need to get information about the data produced by a QueryModel or a result operator, you can use the GetOutputDataInfo() methods to calculate the kind (single item, scalar value, sequence) and type of the data being returned.
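The idea of rewriting a tree by returning new nodes from a Visit method, and of re-pointing references after moving a predicate, can be illustrated with the BCL’s ExpressionVisitor. This is only an analogy: the sketch replaces lambda parameters, whereas re-linq’s reference-replacing visitor works on QuerySourceReferenceExpressions. ParameterReplacer and RewriteDemo are hypothetical names.

```csharp
using System;
using System.Linq.Expressions;

// Returns a copy of a tree in which one parameter is replaced by another expression.
sealed class ParameterReplacer : ExpressionVisitor
{
    private readonly ParameterExpression _oldParameter;
    private readonly Expression _replacement;

    private ParameterReplacer(ParameterExpression oldParameter, Expression replacement)
    {
        _oldParameter = oldParameter;
        _replacement = replacement;
    }

    public static Expression Replace(Expression tree, ParameterExpression oldParameter, Expression replacement)
    {
        return new ParameterReplacer(oldParameter, replacement).Visit(tree);
    }

    // Returning a different node is enough: the base visitor rebuilds all parent nodes.
    protected override Expression VisitParameter(ParameterExpression node)
    {
        return node == _oldParameter ? _replacement : node;
    }
}

static class RewriteDemo
{
    // Moves the right predicate's body into the left lambda, updating its references.
    public static Expression<Func<T, bool>> AndAlso<T>(
        Expression<Func<T, bool>> left, Expression<Func<T, bool>> right)
    {
        var rightBody = ParameterReplacer.Replace(right.Body, right.Parameters[0], left.Parameters[0]);
        return Expression.Lambda<Func<T, bool>>(Expression.AndAlso(left.Body, rightBody), left.Parameters);
    }

    public static void Main()
    {
        Expression<Func<int, bool>> big = n => n > 10;
        Expression<Func<int, bool>> even = m => m % 2 == 0;

        var both = AndAlso(big, even).Compile();
        Console.WriteLine(both(12));
        Console.WriteLine(both(11));
    }
}
```

The original trees stay untouched because expression trees are immutable, which is the same reason a visitor has to return new nodes rather than modify existing ones.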

Writing custom extensions

Last, but not least, you may also run into situations where you’d like to have support for a certain feature that is not supported by re-linq or even LINQ. It happens quite often that LINQ providers define their own, target system-specific query methods; for example to implement full-text querying or query hinting.

For such scenarios, re-linq provides options on several levels. On the query method level, you can implement a custom IExpressionNode parser class. These classes are used to analyze the structure of a LINQ expression tree and to build the QueryModel corresponding to that tree. To make use of this extension point, derive from the MethodCallExpressionNodeBase or ResultOperatorExpressionNodeBase classes, depending on your scenario. Then, create a MethodCallExpressionNodeTypeRegistry instance and register your new parser classes. Pass that registry to the DefaultQueryProvider from your QueryableBase<T> implementation.

On the QueryModel level, you can provide custom IBodyClause implementations, derive from MainFromClause and SelectClause, or subclass ResultOperatorBase. How you integrate them into the QueryModel depends on your use case, but most often, you’ll integrate them from your expression node parser’s (see above) Apply methods.

Wrapping it up

Now, this text, which has turned out to be more of an article than a blog post, has given a short overview of the concepts and features of re-linq and how to use them when writing a LINQ provider.

All the options provided by re-linq may seem a little overwhelming, but actually, re-linq is quite straightforward. A basic LINQ provider only needs to implement a few interfaces to start with, as well as two visitors: one for the QueryModel, one for the expression trees. Sample code for this can be found at the Linq 2 HQL repository – the sample builds a LINQ provider for the open-source O/R mapper NHibernate based on the query language HQL.

As the LINQ provider evolves, it will need to support queries that are more difficult to translate to the target system, so it will start using query transformations. Transformations are incremental, so you can add new transformations on a feature-by-feature basis. Sophisticated LINQ providers will also want to provide their own query methods in addition to the standard query operators, and again, re-linq supports this in an incremental fashion.

All in all, I’m quite proud of re-linq’s architecture; I think we’ve managed to build a robust piece of framework code with great utility. So, as I said in part I:

Are you planning to write a LINQ provider? Try re-linq – it’s open-source (LGPL) – and it will save you a lot of headaches.

- Fabian

Comments

Stefan Wenig - September 4th, 2009 at 12:07

Since Fabian is off for two weeks of well-deserved vacation, I’ll just post the linq, er, link, to the code sample that matches this post:

re-linq|ishing the Pain: Using re-linq to Implement a Powerful LINQ Provider on the Example of NHibernate

Ray - September 19th, 2009 at 16:07

Fantastic stuff, been poring over the hql sample for the last few nights. Question- how would the IQueryExecutor materialize to anonymous types via query projections?

Fabian Schmied - September 20th, 2009 at 10:22

Ray,

Depending on whether you had a single query (Single, First, Last, Min, Max) or a collection query, ExecuteSingle<T> or ExecuteCollection<T> would be called with T being the anonymous type. In your SelectClause’s Selector, there’ll be a NewExpression (IIRC) that constructs the anonymous type.

For handling this within a specific LINQ provider, it doesn’t really matter whether the query projection constructs an anonymous type or an explicit constructor call, you’d handle both the same way.

You’d use an ExpressionTreeVisitor to analyze the Selector and to generate the actual projection in your target query language (e.g. SQL). At the same time, you’d construct a LambdaExpression that can take the result of your generated target query, pull out the required data, and put it into the right places in the constructor call.

Because that is rather abstract, I’ll try to write a blog post detailing the implementation some time next week.

Ray - September 21st, 2009 at 03:54

Fabian,

Good to hear that it just works. Will you also be addressing usage of the ‘backend’ infrastructure for sql generation? Related question- unless I’m missing something, both the SqlServer generation and the hql backend-generator don’t support grouping? If so, what’s the deal with that?

Fabian Schmied - September 21st, 2009 at 14:25

The backend is currently not a good place to look at for re-linq examples: it’s based on an old implementation of the QueryModel which was a lot more constrained and had fewer features.

It’s planned to rewrite the SQL-generating backend some time to form a better example of how to use re-linq, but I’m not sure about when we (= rubicon) will be able to schedule this. Once it’s done, I’ll of course blog about it.

About grouping support: re-linq does support grouping, of course, but LINQ-style grouping is much different from SQL-style grouping. If you take a look at Queryable.GroupBy (or the C# group keyword), you’ll see that it returns an IGrouping<TKey, TElement>, which is very hard to translate to SQL or HQL.

I guess Steve Strong, who’s currently implementing the real NH LINQ provider based on re-linq (http://blogs.imeta.co.uk/sstrong/archive/2009/09/15/756.aspx) will only implement a subset of grouping at first, where the IGrouping stuff is never enumerated, but only accessed via aggregate functions or via its Key property. This is then easily translatable to SQL or HQL (see How to support “group into” with aggregates about this). The rest of the grouping functionality can either be executed in memory (re-linq supports this) – or simply throw a NotSupportedException.

About the sample HQL provider: it was just out of scope for the CodeProject sample.

Ray - September 21st, 2009 at 20:10

OK, I definitely get the complexities of grouping. It just seems like, if one’s goal was to leverage re-linq for the purposes of generating sql, that using the backend stuff would be where you start, instead of starting at the querymodel and expression visitors. Because starting with nothing but visitors puts you more or less where you would be if you weren’t using re-linq to begin with, defeating much of the whole purpose. Just trying to grasp how you would get beyond the Frans Bouma ‘toy’ scenario to full on sql-generating linq provider, that’s all. I’ll stay tuned.

Fabian Schmied - September 22nd, 2009 at 08:46

Ray,

re-linq is much more than just a couple of visitors, even without its SQL-generating backend. I’ve tried to explain this in this post: More Than a Couple of Visitors.

BTW, we’ve a new Google Group dedicated to questions about re-motion, including re-linq: http://groups.google.com/group/re-motion-users.

René - December 13th, 2009 at 17:07

Can you provide some samples for ‘Writing custom extensions’ ?

Greetings
