Added JSON Type
Closes #1251
fulmicoton committed Feb 23, 2022
1 parent d37633e commit a6c7fac
Showing 31 changed files with 2,109 additions and 435 deletions.
37 changes: 37 additions & 0 deletions benches/index-bench.rs
@@ -21,6 +21,11 @@ pub fn hdfs_index_benchmark(c: &mut Criterion) {
        schema_builder.add_text_field("severity", STRING | STORED);
        schema_builder.build()
    };
    let dynamic_schema = {
        let mut schema_builder = tantivy::schema::SchemaBuilder::new();
        schema_builder.add_json_field("json", TEXT);
        schema_builder.build()
    };

    let mut group = c.benchmark_group("index-hdfs");
    group.sample_size(20);
@@ -74,6 +79,38 @@ pub fn hdfs_index_benchmark(c: &mut Criterion) {
            index_writer.commit().unwrap();
        })
    });
group.bench_function("index-hdfs-no-commit-json-without-docstore", |b| {
b.iter(|| {
let index = Index::create_in_ram(dynamic_schema.clone());
let json_field = dynamic_schema.get_field("json").unwrap();
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for _ in 0..NUM_REPEATS {
for doc_json in HDFS_LOGS.trim().split("\n") {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
}
index_writer.commit().unwrap();
})
});
group.bench_function("index-hdfs-with-commit-json-without-docstore", |b| {
b.iter(|| {
let index = Index::create_in_ram(dynamic_schema.clone());
let json_field = dynamic_schema.get_field("json").unwrap();
let mut index_writer = index.writer_with_num_threads(1, 100_000_000).unwrap();
for _ in 0..NUM_REPEATS {
for doc_json in HDFS_LOGS.trim().split("\n") {
let json_val: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(doc_json).unwrap();
let doc = tantivy::doc!(json_field=>json_val);
index_writer.add_document(doc).unwrap();
}
}
index_writer.commit().unwrap();
})
});
}

criterion_group! {
93 changes: 93 additions & 0 deletions doc/src/json.md
@@ -0,0 +1,93 @@
# Json

As of tantivy 0.17, tantivy supports a JSON object type.
This type makes it possible to build a (partially) schema-less search index.

When indexing a json object, we "flatten" the JSON. This operation emits terms that represent a triplet `(json_path, value_type, value)`.

For instance, if `user` is a json field, the following document:

```json
{
"user": {
"name": "Paul Masurel",
"address": {
"city": "Tokyo",
"country": "Japan"
},
"created_at": "2018-11-12T23:20:50.52Z"
}
}
```

emits the following tokens:
- ("name", Text, "Paul")
- ("name", Text, "Masurel")
- ("address.city", Text, "Tokyo")
- ("address.country", Text, "Japan")
- ("created_at", Date, 15420648505)


# Bytes-encoding and lexicographical sort.

Like all other terms, these triplets are encoded into a binary format, as follows:
- `json_path`: the json path is a sequence of "segments". In the example above, `address.city`
is just a debug representation of the json path `["address", "city"]`.
The path is encoded by separating segments with the unicode char `\x01` and terminating the path with `\x00`.
- `value_type`: one byte encodes the `Value` type.
- `value`: the value is encoded using the regular `Value` representation.
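
As a hedged illustration, here is how such a triplet could be laid out in bytes. The type byte below is a made-up placeholder; tantivy's actual `Type` codes may differ.

```rust
/// Illustrative only: serializes (json_path, value_type, value) into the
/// layout described above. `type_byte` is a placeholder, not tantivy's
/// actual Type code.
fn encode_term(json_path: &[&str], type_byte: u8, value: &[u8]) -> Vec<u8> {
    let mut bytes = Vec::new();
    for (i, segment) in json_path.iter().enumerate() {
        if i > 0 {
            bytes.push(0x01); // segments are separated by \x01
        }
        bytes.extend_from_slice(segment.as_bytes());
    }
    bytes.push(0x00); // the path is terminated by \x00
    bytes.push(type_byte); // one byte for the value type
    bytes.extend_from_slice(value); // regular value representation
    bytes
}

fn main() {
    // ("address.city", Text, "Tokyo") -> b"address\x01city\x00sTokyo"
    let term = encode_term(&["address", "city"], b's', b"Tokyo");
    println!("{:?}", String::from_utf8_lossy(&term));
}
```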

This representation is designed to align the natural sort of terms with the lexicographical sort
of their binary representation (tantivy's term dictionary, whether fst or sstable, is sorted and uses prefix encoding).

In the example above, the terms will be sorted as
- ("address.city", Text, "Tokyo")
- ("address.country", Text, "Japan")
- ("created_at", Date, 15420648505)
- ("name", Text, "Masurel")
- ("name", Text, "Paul")

As seen in "Pitfalls" below, we may end up having to search for several different value types under the same path. Putting the type code after the path maximizes compression opportunities, and also increases the chance that these terms end up in the same term dictionary block.


# Pitfalls, limitations and corner cases.

JSON gives very little information about the type of the literals it stores:
all numeric values end up mapped to a single "Number" type, and there is no type for dates.

At ingestion time, tantivy will try to interpret numbers and strings as different types, following a
priority order.
Numbers will be interpreted as u64, i64 and f64, in that order.
Strings will be interpreted as RFC 3339 dates first, falling back to plain strings.

The first interpretation that works is picked, and only one type will be emitted for indexing.

Note that this interpretation happens on a per-document basis; no effort is made to sniff
a consistent field type at the scale of a segment.
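
A sketch of that priority order, assuming `serde_json` for numbers and `chrono` for RFC 3339 parsing (the `InferredValue` enum is illustrative, not tantivy's internal representation):

```rust
use chrono::DateTime;
use serde_json::Number;

/// Illustrative enum for the single type picked at ingestion time.
#[derive(Debug)]
enum InferredValue {
    U64(u64),
    I64(i64),
    F64(f64),
    Date(i64), // unix timestamp
    Str(String),
}

/// Numbers: try u64, then i64, then fall back to f64.
fn infer_number(n: &Number) -> InferredValue {
    if let Some(val) = n.as_u64() {
        InferredValue::U64(val)
    } else if let Some(val) = n.as_i64() {
        InferredValue::I64(val)
    } else {
        InferredValue::F64(n.as_f64().unwrap_or(f64::NAN))
    }
}

/// Strings: try an RFC 3339 date, then fall back to a plain string.
fn infer_string(s: &str) -> InferredValue {
    match DateTime::parse_from_rfc3339(s) {
        Ok(date) => InferredValue::Date(date.timestamp()),
        Err(_) => InferredValue::Str(s.to_string()),
    }
}

fn main() {
    println!("{:?}", infer_number(&Number::from(233u64))); // U64(233)
    println!("{:?}", infer_string("2018-11-12T23:20:50.52Z")); // Date(...)
    println!("{:?}", infer_string("Tokyo")); // Str("Tokyo")
}
```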

On the query parser side, on the other hand, we may end up emitting more than one type.
For instance, we do not even know whether a given literal was indexed as a number or as a string.

So the query

```
my_path.my_segment:233
```

will be interpreted as
`(my_path.my_segment, String, 233) or (my_path.my_segment, u64, 233)`

Likewise, we need to emit two tokens if the query contains an RFC 3339 date:
the date could actually have been indexed as a single text token inside a document at ingestion time. Generally speaking, query parsing will always emit at least a string token, and sometimes more.
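
To make the fan-out concrete, here is an illustrative sketch (not the actual query parser code) of how a single untyped query token can expand into several typed candidates that are then OR-ed together:

```rust
/// Illustrative only: enumerates the typed interpretations of one query
/// token. The real parser builds tantivy `Term`s; this just shows the fan-out.
fn candidate_terms(json_path: &str, token: &str) -> Vec<String> {
    let mut candidates = Vec::new();
    // A string interpretation is always emitted.
    candidates.push(format!("({json_path}, Str, {token})"));
    // A numeric interpretation is added when the token parses as a number.
    if token.parse::<u64>().is_ok() {
        candidates.push(format!("({json_path}, U64, {token})"));
    } else if token.parse::<i64>().is_ok() {
        candidates.push(format!("({json_path}, I64, {token})"));
    } else if token.parse::<f64>().is_ok() {
        candidates.push(format!("({json_path}, F64, {token})"));
    }
    candidates
}

fn main() {
    // "233" expands to a string term OR a u64 term.
    println!("{:?}", candidate_terms("my_path.my_segment", "233"));
}
```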

If, in addition, a json field is used as a default field, things get even more complicated.

If the schema contains a text field called `text` and a default json field,
`text:hello` could be interpreted as targeting the text field, or as targeting the default
json field with the json_path `text`.

When there is such an ambiguity, we decide to only search the `text` field.
In other words, the parser will not search in default json fields if there is a schema hit.


Json fields do not support range queries.
80 changes: 80 additions & 0 deletions examples/json_field.rs
@@ -0,0 +1,80 @@
// # Json field example
//
// This example shows how the json field can be used
// to make tantivy partially schemaless.

use tantivy::collector::{Count, TopDocs};
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, FAST, STORED, STRING, TEXT};
use tantivy::Index;

fn main() -> tantivy::Result<()> {
    // # Defining the schema
    //
    // We need three fields:
    // - a timestamp
    // - an event type
    // - a json object field
    let mut schema_builder = Schema::builder();
    schema_builder.add_date_field("timestamp", FAST | STORED);
    let event_type = schema_builder.add_text_field("event_type", STRING | STORED);
    let attributes = schema_builder.add_json_field("attributes", STORED | TEXT);
    let schema = schema_builder.build();

    // # Indexing documents
    let index = Index::create_in_ram(schema.clone());

    let mut index_writer = index.writer(50_000_000)?;
    let doc = schema.parse_document(
        r#"{
            "timestamp": "2022-02-22T23:20:50.53Z",
            "event_type": "click",
            "attributes": {
                "target": "submit-button",
                "cart": {"product_id": 103},
                "description": "the best vacuum cleaner ever"
            }
        }"#,
    )?;
    index_writer.add_document(doc)?;
    let doc = schema.parse_document(
        r#"{
            "timestamp": "2022-02-22T23:20:51.53Z",
            "event_type": "click",
            "attributes": {
                "target": "submit-button",
                "cart": {"product_id": 133},
                "description": "das keyboard"
            }
        }"#,
    )?;
    index_writer.add_document(doc)?;
    index_writer.commit()?;

    let reader = index.reader()?;
    let searcher = reader.searcher();

    let query_parser = QueryParser::for_index(&index, vec![event_type, attributes]);
    {
        let query = query_parser.parse_query("target:submit-button")?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(count_docs.len(), 2);
    }
    {
        let query = query_parser.parse_query("target:submit")?;
        let count_docs = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(count_docs.len(), 2);
    }
    {
        let query = query_parser.parse_query("cart.product_id:103")?;
        let count_docs = searcher.search(&*query, &Count)?;
        assert_eq!(count_docs, 1);
    }
    {
        let query = query_parser
            .parse_query("event_type:click AND cart.product_id:133")?;
        let hits = searcher.search(&*query, &TopDocs::with_limit(2))?;
        assert_eq!(hits.len(), 1);
    }
    Ok(())
}
4 changes: 2 additions & 2 deletions query-grammar/src/user_input_ast.rs
@@ -59,15 +59,15 @@ pub enum UserInputBound {
}

impl UserInputBound {
    fn display_lower(&self, formatter: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> {
    fn display_lower(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> {
        match *self {
            UserInputBound::Inclusive(ref word) => write!(formatter, "[\"{}\"", word),
            UserInputBound::Exclusive(ref word) => write!(formatter, "{{\"{}\"", word),
            UserInputBound::Unbounded => write!(formatter, "{{\"*\""),
        }
    }

    fn display_upper(&self, formatter: &mut fmt::Formatter<'_>) -> Result<(), fmt::Error> {
    fn display_upper(&self, formatter: &mut fmt::Formatter) -> Result<(), fmt::Error> {
        match *self {
            UserInputBound::Inclusive(ref word) => write!(formatter, "\"{}\"]", word),
            UserInputBound::Exclusive(ref word) => write!(formatter, "\"{}\"}}", word),
4 changes: 2 additions & 2 deletions src/aggregation/mod.rs
@@ -243,7 +243,7 @@ pub(crate) fn f64_from_fastfield_u64(val: u64, field_type: &Type) -> f64 {
        Type::U64 => val as f64,
        Type::I64 => i64::from_u64(val) as f64,
        Type::F64 => f64::from_u64(val),
        Type::Date | Type::Str | Type::Facet | Type::Bytes => unimplemented!(),
        Type::Date | Type::Str | Type::Facet | Type::Bytes | Type::Json => unimplemented!(),
    }
}

@@ -262,7 +262,7 @@ pub(crate) fn f64_to_fastfield_u64(val: f64, field_type: &Type) -> u64 {
        Type::U64 => val as u64,
        Type::I64 => (val as i64).to_u64(),
        Type::F64 => val.to_u64(),
        Type::Date | Type::Str | Type::Facet | Type::Bytes => unimplemented!(),
        Type::Date | Type::Str | Type::Facet | Type::Bytes | Type::Json => unimplemented!(),
    }
}

5 changes: 2 additions & 3 deletions src/core/segment_reader.rs
@@ -121,9 +121,8 @@ impl SegmentReader {
        self.fieldnorm_readers.get_field(field)?.ok_or_else(|| {
            let field_name = self.schema.get_field_name(field);
            let err_msg = format!(
                "Field norm not found for field {:?}. Was the field set to record norm during \
                 indexing?",
                field_name
                "Field norm not found for field {field_name:?}. Was the field set to record norm \
                 during indexing?"
            );
            crate::TantivyError::SchemaError(err_msg)
        })
9 changes: 6 additions & 3 deletions src/fastfield/multivalued/mod.rs
@@ -94,8 +94,11 @@ mod tests {
        assert_eq!(reader.num_docs(), 5);

        {
            let parser = QueryParser::for_index(&index, vec![date_field]);
            let query = parser.parse_query(&format!("\"{}\"", first_time_stamp.to_rfc3339()))?;
            let parser = QueryParser::for_index(&index, vec![]);
            let query = parser.parse_query(&format!(
                "multi_date_field:\"{}\"",
                first_time_stamp.to_rfc3339()
            ))?;
            let results = searcher.search(&query, &TopDocs::with_limit(5))?;
            assert_eq!(results.len(), 1);
            for (_score, doc_address) in results {
@@ -150,7 +153,7 @@
        {
            let parser = QueryParser::for_index(&index, vec![date_field]);
            let range_q = format!(
                "[{} TO {}}}",
                "multi_date_field:[{} TO {}}}",
                (first_time_stamp + Duration::seconds(1)).to_rfc3339(),
                (first_time_stamp + Duration::seconds(3)).to_rfc3339()
            );