Re: Dealing with large collections [Re: URI constraints]
On Mon, Oct 11, 2004 at 01:10:20PM -0700, Ezra Cooper wrote:
> On Oct 11, 2004, at 11:51 AM, Robert Sayre wrote:
> >>There is an open question for what to do with the blog that has
> >>100,000 posts in it. That would be a scary-huge PROPFIND result
> >>(legal, if a bit expensive; could be about a 20 meg response).
> >
> >What's stopping you from having child collections? But yes, in
> >general, we have a query problem.
>
> Robert, how are you suggesting child collections could be used to solve
> this problem?
The idea is to simply partition the result set into N subsets, each of
which has a manageable size.
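A minimal sketch of one such partitioning, assuming entries are split into
child collections by posting month. The function name, the (date, title)
tuples, and the "YYYY/MM" keys are hypothetical, and a busy month could still
be too large:

```python
from collections import defaultdict
from datetime import datetime


def partition_by_month(entries):
    """Group (posted, title) pairs into child collections keyed by 'YYYY/MM'."""
    children = defaultdict(list)
    for posted, title in entries:
        children[posted.strftime("%Y/%m")].append(title)
    return dict(children)


entries = [
    (datetime(2004, 9, 30, 8, 0), "a"),
    (datetime(2004, 10, 1, 9, 0), "b"),
    (datetime(2004, 10, 11, 13, 10), "c"),
]
children = partition_by_month(entries)
```

Each key would then name a child collection that a client can list on its own.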
> I'm having trouble picturing a use that would feel
> 'natural.'
How you present that to the user is a different question...
> Full blown search as in [1] seems a bit heavy for the Atom protocol.
Agreed, in part: the grammar specified in there is heavy. But a couple of
limited uses of the SEARCH method would already provide everything that
we're talking about (limits, ranges, and sorting).
Note that DASL provides a mechanism for a server to specify a query
grammar. It would be entirely reasonable to define an "Atom search
grammar" that is greatly simplified to our particular usage scenarios. If
a server also supported the DAV:basicsearch grammar, then fine. We would
just require a server to support the atom grammar.
> The current problem, as I see it, is not "how to search an Atom
> collection" but "how to specify and return subsets of a collection."
> Search might be a legitimate function, but out of scope for now, I
> think.
Yes, and yes, but I think there are ways to simplify.
>...
> On Oct 11, 2004, at 12:19 PM, Greg Stein wrote:
> >I'd think that two types of limits would be reasonable: max-count and
> >since-this-date.
>
> Would "since-this-date" indicate those items that have been added since
> that date, or modified? I can see use cases for both. Similarly, if
> max-count is n, we'd need to pin down which n are expected to be
> returned (what the sort order is, in other words).
Good points.
>...
> Instead, I propose that a collection-getting request simply includes a
> sort-field element which specifies some field of the item. And instead
> of "since-this-date" we use "max-value," "min-value," or "limit-value".
Seems reasonable. Essentially, a small search grammar. I think that
*whatever* marshalling or expressive protocol we provide, some kind of
limited "get me <these>" is going to be needed. Thus, what are the types
and parameters of those searches? Do we allow arbitrary field searching?
Or is it just "post date" and "last mod date", with an arbitrary count
limiter applied? Is ordering allowed on anything, or just the queried
date? IOW, I can't order by post title, can I?
To put a stake in the ground, I'd suggest that any query only deal with
one of the N date fields, ordering of that field, limiting to start/end
range of dates, and applying a maximum count limit.
One usage scenario not encapsulated by that is "get me the next 20 posts".
I'm not sure if that is required.
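The stake-in-the-ground query above (one date field, an ordering on that
field, a start/end range, and a maximum count) could be sketched roughly as
follows. This is only an in-memory illustration; the function name, the
parameter names, and the dict-shaped entries are assumptions, not anything
defined for Atom:

```python
from datetime import datetime


def query_entries(entries, date_field, start=None, end=None,
                  descending=True, max_count=None):
    """Filter on one date field, order by that same field, then cap the count."""
    selected = [
        e for e in entries
        if (start is None or e[date_field] >= start)
        and (end is None or e[date_field] <= end)
    ]
    selected.sort(key=lambda e: e[date_field], reverse=descending)
    return selected if max_count is None else selected[:max_count]


posts = [
    {"title": "one", "posted": datetime(2004, 9, 1), "updated": datetime(2004, 9, 5)},
    {"title": "two", "posted": datetime(2004, 9, 15), "updated": datetime(2004, 9, 16)},
    {"title": "three", "posted": datetime(2004, 10, 1), "updated": datetime(2004, 10, 11)},
]

# The single most recently updated post modified on or after 2004-09-10.
recent = query_entries(posts, "updated", start=datetime(2004, 9, 10), max_count=1)
```

Note that ordering by title is deliberately not expressible here, since the
sort key is always the queried date field.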
> Here's a sketchy example (whether the request is PROPFIND or something
> else is orthogonal):
>
> PROPFIND /my/silly/resource HTTP/1.1
> ...
> <A:limit>
> <A:sort-field>A:updated</A:sort-field>
> <A:limit-value>2004-09-01T12:00:00Z</A:limit-value>
> <A:sort-order>ascending</A:sort-order>
> </A:limit>
>
> Is there any precedent for using XML element names as the content of an
> element or attribute?
No. The application can't do the namespace translation, so it wouldn't
work. WebDAV's "prop" element can contain a list of other elements. Those
elements are the target fields. Note that a validating parser would choke
because the elements do not necessarily agree with the schema. Your
example would look like:
PROPFIND /my/silly/resource HTTP/1.1
...
<A:limit>
<A:sort-field><A:updated/></A:sort-field>
<A:limit-value>2004-09-01T12:00:00Z</A:limit-value>
<A:sort-order>ascending</A:sort-order>
</A:limit>
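To illustrate why the child-element form works, here is a rough sketch using
Python's xml.etree.ElementTree. The namespace URI and the parse_limit helper
are hypothetical; the example body above does not declare the "A" prefix, so
a declaration is assumed here:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://example.org/atom-protocol#"  # hypothetical namespace URI

BODY = """\
<A:limit xmlns:A="http://example.org/atom-protocol#">
  <A:sort-field><A:updated/></A:sort-field>
  <A:limit-value>2004-09-01T12:00:00Z</A:limit-value>
  <A:sort-order>ascending</A:sort-order>
</A:limit>
"""


def parse_limit(body):
    """Extract the sort field, limit value, and sort order from a limit element."""
    ns = {"A": ATOM_NS}
    root = ET.fromstring(body)
    sort_field = root.find("A:sort-field", ns)
    # The target field is the first child element; the parser has already
    # resolved its prefix into the "{namespace}localname" form.
    return {
        "sort_field": sort_field[0].tag,
        "limit_value": root.findtext("A:limit-value", namespaces=ns),
        "sort_order": root.findtext("A:sort-order", namespaces=ns),
    }


limit = parse_limit(BODY)
```

With the field written as text content instead, this parser would hand the
application the plain string "A:updated", with no namespace attached to it.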
> I can imagine a swamp of namespacing issues. Is
> there a better way to pick out an element?
Skip the validating parser (or at least skip validation within the
sort-field element); the above is WebDAV's approach. It has worked fine so
far, especially given that properties are generally "free form", so it is
hard to validate them anyway.
> Anyway, something like this could address client-specified limits. Is
> there a way for a server to legally return a subset of what was
> requested? For example, if the list contains 100,000 items, and the
> client requests a max-count of 100,000, can the server refuse to do
> that?
Returning a subset doesn't have a lot of precedent, but take a look at the
507 (Insufficient Storage) response code which is part of the WebDAV spec
(RFC 2518; [1]).
If you *do* return a partial result, then you're going to want some sort
of indicator to let the client know that.
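A minimal sketch of such an indicator, assuming the server applies its own
cap and flags the response when it returned fewer entries than the client
asked for. The names list_entries, server_cap, and partial are hypothetical:

```python
def list_entries(entries, requested_count, server_cap=1000):
    """Return up to requested_count entries, flagging a server-side truncation."""
    count = min(requested_count, server_cap)
    page = entries[:count]
    # Partial means: the client asked for more than it got, and more exists.
    partial = len(page) < min(requested_count, len(entries))
    return {"entries": page, "partial": partial}


all_entries = list(range(5))
capped = list_entries(all_entries, requested_count=5, server_cap=2)
complete = list_entries(all_entries, requested_count=100000, server_cap=10)
```

A client that sees the flag set knows it has to issue further requests to
build the full list.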
> Presumably there are cases where an Atom client wants to build a
> complete list, limits be damned. If the server returns a subset the
> client would need to make multiple requests in order to form a full
> model of the server's state. How does the client know when it's got
> everything? Do we need a method to query for the total number of items?
That might work. There are always race conditions, so the count could be
off; the client would need to deal with that.
The DeltaV spec introduced the REPORT method to provide a way to ask for
various reports about state from the server. I could easily see an Atom
report which provides information such as date of first post, date of the
most recent post, total count, blog metadata, a pointer to the author(s)
of the blog, etc.
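As a rough idea of what such a report might carry, here is a sketch that
computes a few of those values from an in-memory list. The function name,
the field names, and the dict-shaped entries are all assumptions; no report
format has been defined:

```python
from datetime import datetime


def collection_report(entries):
    """Summarize a collection: total count plus first and latest post dates."""
    dates = [e["posted"] for e in entries]
    return {
        "total_count": len(entries),
        "first_post": min(dates) if dates else None,
        "latest_post": max(dates) if dates else None,
    }


report = collection_report([
    {"posted": datetime(2004, 9, 1, 12, 0)},
    {"posted": datetime(2004, 10, 11, 13, 10)},
    {"posted": datetime(2004, 9, 20, 8, 30)},
])
```

The count is only a snapshot, so a client paging through the collection
would still need to tolerate entries appearing or disappearing in between.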
Cheers,
-g
[1] http://www.webdav.org/specs/rfc2518.html#STATUS_507