Rules & query syntax

Create, read, update, and delete the queries attached to a tap, and learn the full query language they're written in.

A rule is a query attached to a tap, with an optional tag label. A page is delivered to the tap's stream if it matches any rule on the tap. All rule endpoints authenticate with a tap token (prefixed fh_).

Rule object

| Field | Type | Description |
|---|---|---|
| id | string | Rule identifier |
| value | string | The query (required) |
| tag | string | Optional label, max 255 chars |
| nsfw | boolean | Include adult content. Default false |
| quality | boolean | Apply quality filters. Default true |
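
As a rough illustration of the table above, the sketch below models the writable rule fields in Python and checks the two constraints the table states (value is required, tag is at most 255 characters). The RulePayload class and its to_body method are hypothetical names invented for this example; they are not part of any Firehose client library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RulePayload:
    """Hypothetical client-side model of the writable rule fields."""

    value: str                  # the query (required)
    tag: Optional[str] = None   # optional label, max 255 chars
    nsfw: bool = False          # include adult content (default false)
    quality: bool = True        # apply quality filters (default true)

    def to_body(self) -> dict:
        """Check the documented constraints and return a JSON-ready dict."""
        if not self.value:
            raise ValueError("value is required")
        if self.tag is not None and len(self.tag) > 255:
            raise ValueError("tag must be at most 255 characters")
        body = {"value": self.value, "nsfw": self.nsfw, "quality": self.quality}
        if self.tag is not None:
            body["tag"] = self.tag
        return body
```

Calling RulePayload(value="tesla", tag="ev").to_body() gives a dict that can be passed to json.dumps for a request body. Whether the server applies further validation is not covered here.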

List rules

curl -s https://api.firehose.com/v1/rules \
  -H "Authorization: Bearer $FIREHOSE_TAP_TOKEN"

Response:

{
  "data": [
    { "id": "1", "value": "tesla", "tag": "brand-mentions" },
    { "id": "2", "value": "\"site explorer\"", "tag": "product" }
  ],
  "meta": { "count": 2 }
}

Create a rule

curl -s -X POST https://api.firehose.com/v1/rules \
  -H "Authorization: Bearer $FIREHOSE_TAP_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"value": "tesla OR \"electric vehicle\"", "tag": "ev"}'

Returns 201 with the created rule.
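
For readers calling the API from Python rather than curl, the same request might be assembled with the standard library as sketched below. The URL, headers, and body mirror the curl example; the function name build_create_rule_request is an assumption made for this illustration, and the request is only built here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.firehose.com/v1"  # base URL taken from the curl examples


def build_create_rule_request(token, value, tag=None):
    """Build (but do not send) a POST /rules request. Hypothetical helper."""
    body = {"value": value}
    if tag is not None:
        body["tag"] = tag
    return urllib.request.Request(
        API_BASE + "/rules",
        data=json.dumps(body).encode("utf-8"),  # json.dumps handles quote escaping
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending it requires network access and a valid tap token:
# with urllib.request.urlopen(req) as resp:
#     created = json.load(resp)   # the docs state a 201 with the created rule
```

Letting json.dumps produce the body avoids hand-escaping the inner quotes around "electric vehicle" the way the shell example has to.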

Update a rule

Partial updates are supported: send only the fields you want to change.

curl -s -X PUT https://api.firehose.com/v1/rules/1 \
  -H "Authorization: Bearer $FIREHOSE_TAP_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tag": "new-tag", "nsfw": true}'

Delete a rule

curl -s -X DELETE https://api.firehose.com/v1/rules/1 \
  -H "Authorization: Bearer $FIREHOSE_TAP_TOKEN"

Returns 204 with no content.


Query syntax

A rule's value is written in Firehose query syntax, which is Lucene-compatible. Queries are evaluated against indexed fields extracted from each crawled page.

Indexed fields

| Field | Type | Case | Description |
|---|---|---|---|
| added | text | insensitive | Default field. Text from inserted diff chunks |
| removed | text | insensitive | Text from deleted diff chunks |
| added_anchor | text | insensitive | Anchor text from inserted links |
| removed_anchor | text | insensitive | Anchor text from deleted links |
| title | text | insensitive | Page title |
| url | keyword | sensitive | Full URL as one exact token |
| domain | keyword | sensitive | Domain extracted from the URL |
| publish_time | keyword | sensitive | ISO-8601 local datetime |
| page_category | keyword | sensitive | ML category label, e.g. /News |
| page_type | keyword | sensitive | ML type label, e.g. /Article/How_to |
| language | keyword | sensitive | ISO 639-1 code, e.g. en, fr, zh-cn |
| recent | filter | - | Recency filter (see below) |

Text fields are tokenized and lowercased (case-insensitive). Keyword fields are stored as a single exact, case-sensitive token. Null/empty fields are absent and never match. Multi-valued fields match if any value matches.

Terms and phrases

tesla                        # "tesla" anywhere in added content (default field)
title:tesla                  # "tesla" in the title
"quick brown fox"            # exact phrase in content
title:"breaking news"        # exact phrase in title

Boolean operators

java AND programming
title:tesla OR added:"electric vehicle"
NOT malware
title:tesla AND added:earnings
removed:"old feature"        # term appeared in deleted content

URL and domain filtering

url and domain are exact, case-sensitive tokens. You can match them three ways: exact, wildcard (*, ?), and regex (/pattern/). Forward slashes are special and must be escaped with \.

url:"https://example.com/news/article-1"   # exact
domain:techcrunch.com                       # exact domain
url:*\/category\/*                          # wildcard: contains /category/
url:/.*\/page\/[0-9]+.*/                     # regex: pagination URLs

Excluding junk URLs is the most common pattern:

title:tesla AND language:"en"
  AND NOT url:/.*\/page\/[0-9]+.*/
  AND NOT url:*\/category\/*
  AND NOT url:*\/tag\/*
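
When the list of excluded path segments grows, the clauses can be generated rather than typed by hand. The following is a minimal sketch, assuming the query parser treats a single-line query the same as the multi-line form above; escape_slashes and exclude_path_segments are hypothetical helper names, not Firehose functions.

```python
def escape_slashes(text):
    # Prefix every forward slash with a backslash, as the url syntax requires.
    return text.replace("/", "\\/")


def exclude_path_segments(base_query, segments):
    # Append one "AND NOT url:*<escaped /segment/>*" clause per path segment.
    clauses = []
    for segment in segments:
        pattern = escape_slashes("/" + segment + "/")
        clauses.append("AND NOT url:*" + pattern + "*")
    return " ".join([base_query] + clauses)


query = exclude_path_segments('title:tesla AND language:"en"', ["category", "tag"])
print(query)
```

The printed query contains single backslashes. It still needs JSON encoding before it goes into a request body.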

JSON double-escaping. Inside a JSON string, \/ is an escape sequence that decodes to a plain /, so the backslash never reaches the query parser. To send a literal backslash before a slash, write \\/ in the JSON body. For example, the query url:*\/abs\/* must be sent as "url:*\\/abs\\/*".
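
The double-escaping rule is easy to get wrong by hand, so it can help to let a JSON library do it. The short sketch below uses Python's standard json module to show both directions: serializing the query doubles the backslashes, while a hand-written body with single backslashes loses them on decoding.

```python
import json

# The query exactly as the parser should receive it (raw string: single backslashes).
query = r"url:*\/abs\/*"

# json.dumps escapes each backslash, producing the doubled form automatically.
body = json.dumps({"value": query})
print(body)  # {"value": "url:*\\/abs\\/*"}

# Decoding the body recovers the original query unchanged.
assert json.loads(body)["value"] == query

# A hand-written body with single backslashes decodes \/ to a plain slash.
hand_written = r'{"value": "url:*\/abs\/*"}'
assert json.loads(hand_written)["value"] == "url:*/abs/*"
```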

Filtering on url narrows which crawled pages match; it does not tell Firehose to crawl that URL. A tap only ever sees pages the crawler visits, on the crawler's own schedule, so a change to a specific page won't surface until the crawler re-crawls it, if it ever does. To monitor a specific page for changes on a cadence you control, use URL Watch instead.

Date ranges on publish_time

Colons in timestamps must be escaped with \\:

publish_time:[2025-01-01T00\\:00\\:00 TO 2025-12-31T23\\:59\\:59]   # inclusive
publish_time:{2025-01-01T00\\:00\\:00 TO 2025-12-31T23\\:59\\:59}   # exclusive

recent — recency filter

recent is a query-level filter, not an indexed field. Its value is a positive integer followed by a unit: h (hours), d (days), or mo (months).

recent:1h                      # published in the last hour
recent:7d                      # last 7 days
title:tesla AND recent:24h     # tesla in title, last 24 hours
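
A client could check a recent value against the stated format before submitting a rule. This is a sketch only: the regular expression encodes "a positive integer followed by h, d, or mo" and additionally assumes leading zeros are not allowed, which the documentation does not state. The name is_valid_recent is hypothetical.

```python
import re

# Positive integer followed by h, d, or mo.
# Assumption: no leading zeros (e.g. "07d" is rejected); the docs do not say.
RECENT_VALUE = re.compile(r"[1-9][0-9]*(?:h|d|mo)")


def is_valid_recent(value):
    """Return True if value looks like a valid recent: argument."""
    return RECENT_VALUE.fullmatch(value) is not None


for candidate in ["1h", "7d", "3mo", "0h", "24", "2w"]:
    print(candidate, is_valid_recent(candidate))
```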

nsfw — adult content

nsfw is a boolean on the rule object, not part of the query. false (default) excludes adult content; true includes it.

{ "value": "title:tesla", "nsfw": true }

quality — quality filter

quality is a boolean on the rule object (default true). When on, results are limited to pages published in the last 7 days, and pagination pages, tag/category index pages, and URLs with query parameters are excluded. This removes low-value and duplicate pages.

{ "value": "domain:\"example.com\"", "quality": false }

Category and type values

page_category and page_type accept a large fixed vocabulary (25 top-level categories with 700+ subcategories, and 110+ page types). The complete list lives in the canonical /skill.md reference. A few examples:

page_category:"/News"
page_category:"/Sports/Winter_Sports/Skiing_and_Snowboarding"
page_type:"/Article/How_to"
page_type:"/Document/White_Paper"
