Data Pipeline: Legacy Schema v2.2.5
Caution: Legacy Documentation
This documentation refers to an earlier version of the Parse.ly Data Pipeline (versions below 2.3.0). Customers using data after September 2019 should refer to the current Data Pipeline schema documentation. If you’re unsure of your schema version, check the `schema_version` column. If that column is not available, you are on the right page.
JSON Format
Whether you access your raw data via S3 (bulk) or Kinesis (streaming), you are going to be dealing with lines of JSON objects, aka JSONLines.
This is generally very easy to parse in every programming language, cloud SQL engine, and big data tool.
This page describes the schema of these JSON records (keys and values) so that you can interpret the raw events as they come in.
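As a minimal sketch of reading this format in Python: each line is decoded independently with the standard `json` module. The helper name `iter_events` and the two-record sample are hypothetical and only illustrate the shape of the input; real records carry many more keys.

```python
import io
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Hypothetical sample; any iterable of text lines (an open file, a stream) works.
SAMPLE = io.StringIO(
    '{"action": "pageview", "apikey": "mashable.com"}\n'
    '{"action": "heartbeat", "apikey": "mashable.com", "engaged_time_inc": 10}\n'
)
events = list(iter_events(SAMPLE))
```

Because the helper is a generator, it can be pointed at a large file without loading the whole file into memory.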
Example JSON Page View Record
We’ll start with an example pageview record, with the keys simply alphabetically sorted, from one of our sites.
You’ll notice that for the most part, these are straightforward key-value pairs, typically strings, but occasionally numbers, `null`, or booleans (`true` / `false`).
Base Event Fields
name | description | example value |
---|---|---|
action | event type identifier | “pageview” |
apikey | site identifier | “mashable.com” |
referrer | raw referring URL | “https://www.facebook.com/instantarticles#v1“ |
session_id | Session identifier | 1 |
user_agent | Raw User-Agent (UA) string | “Mozilla/5.0 (iPhone; CPU … Safari/601.1” |
url | Raw URL on which action occurred | “http://mashable.com/1234#d3d“ |
visitor_site_id | Visitor first-party site identifier | “0beabdd1-7b0c-423b-9fae-660101fc8953” |
engaged_time_inc | Engaged time in seconds; only available where action = ‘heartbeat’ or ‘vheartbeat’ | 10 |
These are the raw required fields we get from your integration with our data collection infrastructure, whether that’s:
- basic integration for standard web pages
- dynamic tracking of custom events/data
- mobile SDKs for iOS or Android
They will be present in every single event, regardless of event type or source. Note that excluding the `session_id` and the `visitor_ip` fields is possible, though all of our integrations attempt to support these fields to the best of their ability.
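As one illustration of working with the base fields, the sketch below totals `engaged_time_inc` per `url`. It assumes that each heartbeat carries an increment of engaged seconds that can be summed, which the field name suggests but this page does not state. The function name `engaged_seconds_by_url` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable

# Per the table above, engaged_time_inc is only available for these actions.
HEARTBEAT_ACTIONS = {"heartbeat", "vheartbeat"}


def engaged_seconds_by_url(events: Iterable[dict]) -> dict:
    """Sum engaged_time_inc per url across heartbeat-style events.

    Assumption: each value is an increment, so summing gives total seconds.
    """
    totals = defaultdict(int)
    for event in events:
        if event.get("action") in HEARTBEAT_ACTIONS:
            totals[event["url"]] += event.get("engaged_time_inc") or 0
    return dict(totals)
```

Events with any other `action` are ignored, so pageviews do not contribute to the totals.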
On One-Time Historical Imports
Customers often ask us whether it might be possible to do a one-time import of historical pageview (or other) event data from legacy web analytics systems. The answer to this question is “yes”, though it does require some custom work on Parse.ly’s side. We also need to have equivalents for the above “Base Event” fields in order to make sense of your historical data.
Timestamp Fields
We record two raw timestamps per event. One comes from our data collection servers and one comes from our client-side trackers. These are stored as numbers that represent milliseconds since the UNIX epoch (UNIX time, expressed in milliseconds). Our server clocks are in UTC. You may therefore need to pull data spanning multiple days for timezone conversions.
name | description | example value |
---|---|---|
timestamp_info | Flag to indicate if timestamp info is available | true |
timestamp_info_nginx_ms | The automatic server-side event timestamp | 1493598778000 |
timestamp_info_pixel_ms | The automatic client-side event timestamp | 1493598778538 |
timestamp_info_override_ms | A client-side override timestamp | 1493598778000 |
ts_action | Date/time of the event. This is a formatted date/time of the timestamp_info_nginx_ms in GMT | 2017-05-01 00:32:58 |
ts_session_current | Date/time of the current session derived from timestamp_info_pixel_ms | 2017-05-01 00:30:00 |
ts_session_last | Date/time of the previous session | 2017-04-14 20:22:47 |
In general, Parse.ly’s internal attitude is, “the client-side timestamp cannot be trusted”. However, there are situations in which it can make sense to trust it over the server timestamp.
Our `nginx` (server-side) timestamp is at second resolution, whereas our `pixel` (client-side) timestamp is at millisecond resolution. If a `pixel` timestamp is within a few seconds of the corresponding `nginx` timestamp, it is likely more accurate. It represents when the event was sent, at millisecond resolution, rather than when the event was received, at second resolution. With our standard JavaScript tracker, both `nginx` and `pixel` are always captured together, so combining them lets us make JavaScript tracker-based events as accurate as possible.
In mobile SDKs for iOS and Android, it is common to “batch” events if devices are offline. These are also known as “late-arriving” events. In these cases, neither the auto-generated server-side timestamp (in `nginx`) nor the auto-generated client-side timestamp (in `pixel`) can be trusted; instead, the client-side `override` timestamp may be a more accurate representation of reality. The mobile SDK populates this override by filling a `ts` field in the `data` key-value object sent with every event.
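The trade-offs above could be turned into a selection rule along the following lines. This is only a sketch of one possible policy, not how Parse.ly itself picks a timestamp. The function name `best_timestamp_ms` is hypothetical, and the 5-second threshold is an arbitrary stand-in for “a few seconds”.

```python
from typing import Optional

# Assumption: the text only says "a few seconds"; 5000 ms is an arbitrary choice.
MAX_SKEW_MS = 5_000


def best_timestamp_ms(event: dict, max_skew_ms: int = MAX_SKEW_MS) -> Optional[int]:
    """Pick the timestamp most likely to reflect when the event happened."""
    nginx = event.get("timestamp_info_nginx_ms")
    pixel = event.get("timestamp_info_pixel_ms")
    override = event.get("timestamp_info_override_ms")
    if override:
        # Batched ("late-arriving") events: prefer the client-side override.
        return override
    if nginx is not None and pixel is not None and abs(pixel - nginx) <= max_skew_ms:
        # Pixel agrees with the server clock, so use its millisecond resolution.
        return pixel
    # Otherwise fall back to the server-side value when there is one.
    return nginx if nginx is not None else pixel
```

For an event without an override whose `pixel` value is within the threshold of `nginx`, the function returns the `pixel` value; if the two disagree by more than the threshold, it returns the `nginx` value.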
On Timezones
Parse.ly’s JavaScript tracker populates the client-side timestamp using `new Date().getTime()`, which means that it is in UTC. Our server clocks are also in UTC. So, these should be comparable. However, note that UNIX time itself does not embed any timezone information. It simply represents the number of seconds since a specific UTC time in the past, the UNIX epoch. You could try to infer the user’s local timezone from their IP address, based on their estimated geography. If you combine these fields, you can interpret the user’s local time.
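A minimal sketch of combining the fields, assuming the event carries the GeoIP-derived `ip_timezone` field described under Geo IP Enrichments. The helper names `to_local` and `event_local_time` are hypothetical. `zoneinfo` needs Python 3.9+ and time zone data on the system.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from zoneinfo import ZoneInfo  # needs system tz data or the tzdata package

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_local(epoch_ms: int, tz: tzinfo) -> datetime:
    """Interpret a UTC epoch-millisecond value in the given timezone."""
    return (EPOCH + timedelta(milliseconds=epoch_ms)).astimezone(tz)


def event_local_time(event: dict) -> datetime:
    """Combine the server timestamp with the event's ip_timezone field."""
    return to_local(event["timestamp_info_nginx_ms"], ZoneInfo(event["ip_timezone"]))
```

The result is only as good as the GeoIP estimate, so treat the local time as approximate.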
Event ID
name | description | example value |
---|---|---|
event_id | unique event identifier string | “0xe6508eda93d5598367b18555ae9b828d” |
A unique, hex-encoded ID string is also generated for each Event. This property can be used to deduplicate events for easier ingestion and processing.
This unique ID is generated by hashing the values of `apikey`, `action`, `url`, `timestamp` (internal, generated property), `visitor_site_id`, and `timestamp_info_pixel_ms`. To ensure that each `event_id` is truly unique, make sure that all events sent to Parse.ly provide all of these required fields (excluding `timestamp`, which is generated on our side) at an appropriate level of cardinality and granularity.
For example, if `visitor_site_id` is not provided for a series of events, then the only properties able to generate unique values for those events are the event type and the timestamp.
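Deduplicating on `event_id` can be sketched as below. This is an in-memory illustration only, with a hypothetical function name `dedupe_events`. It does not reproduce the hashing itself, which this page does not specify.

```python
from typing import Iterable, Iterator


def dedupe_events(events: Iterable[dict]) -> Iterator[dict]:
    """Yield each event once, keeping the first occurrence of every event_id."""
    seen = set()
    for event in events:
        event_id = event.get("event_id")
        if event_id is None:
            yield event  # cannot deduplicate without an ID; pass it through
            continue
        if event_id not in seen:
            seen.add(event_id)
            yield event
```

For large volumes, the same idea is usually better expressed in the downstream store, for example as a uniqueness constraint or a `DISTINCT`-style query on `event_id`, since the set of seen IDs grows with the data.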
Visitors
name | description | example value |
---|---|---|
visitor | Flag to indicate if visitor info is available | true |
visitor_site_id | Visitor first-party site identifier | “0beabdd1-7b0c-423b-9fae-660101fc8953” |
visitor_network_id | Visitor third-party network identifier | “1acdecd1-8e0d-483c-7aef-660101fc9354” |
Note
As previously mentioned, this is legacy documentation. We no longer set third-party cookies.
The `visitor_site_id` is set by a first-party cookie and `visitor_network_id` is set by a third-party cookie.
Session Enrichments
Parse.ly’s JavaScript tracker automatically creates some useful session information that can help with user session analysis. For one thing, Parse.ly’s `session_id` also doubles as a “number of visits” value, since it’s an auto-incrementing integer that starts at `1` and moves up by one for every new visit by a visitor with the same `visitor_site_id`.
Note that these enrichments are done client-side by Parse.ly’s JavaScript tracker; they will not apply to events that arrive via other integrations.
The other fields stored with the session are described below:
name | description | example value |
---|---|---|
session_id | auto-incrementing session identifier, unique to visitor_site_id | 1 |
session_initial_referrer | the raw referring URL of the first pageview event of this session | “http://facebook.com“ |
session_initial_url | the raw URL of the first pageview event of this session | “http://mashable.com/1234#d3d“ |
session_last_session_timestamp | Timestamp of the last visit, or 0 if none | 0 |
session_timestamp | Timestamp of first pageview event of this session | 1466214847371 |
session | flag to indicate if session info is available | true |
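Because `session_id` doubles as a visit counter, a simple new-versus-returning analysis might look like the sketch below. The function names are hypothetical. Events without session enrichment (for example, those from non-JavaScript integrations) are skipped.

```python
from typing import Iterable


def visits_per_visitor(events: Iterable[dict]) -> dict:
    """Map visitor_site_id to the highest session_id seen for that visitor."""
    visits = {}
    for event in events:
        visitor = event.get("visitor_site_id")
        session_id = event.get("session_id")
        if visitor is None or session_id is None:
            continue  # no session enrichment on this event
        visits[visitor] = max(visits.get(visitor, 0), session_id)
    return visits


def is_returning(event: dict) -> bool:
    """A session_id above 1 means the visitor has visited before."""
    return (event.get("session_id") or 0) > 1
```

The highest `session_id` observed in a given window is the visitor's visit count as of their latest session in that window, not the number of sessions inside the window.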
Timestamp Enrichments
Based on the above timestamp fields, we also create an important field called `ts_action`. This is `timestamp_info_nginx_ms` (our server time) re-interpreted as a formatted date string that is highly compatible with a number of systems. For example, it is the same format expected by Amazon Redshift and Google BigQuery’s JSON value parsers.
`ts_action:"2016-06-18 02:03:24"`
The value above is derived from epoch time `1466215404000`; it also lacks timezone information but can be interpreted as a UTC time. It turns out that including timezone information, as one might for the “full” `ISO8601` standard, makes this string incompatible with some SQL engines, so we chose a maximally compatible format instead.
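The conversion between epoch milliseconds and this string format can be sketched as follows. The function names `format_ts_action` and `parse_ts_action` are hypothetical; the format string is inferred from the example value shown here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # inferred from the example value


def format_ts_action(epoch_ms: int) -> str:
    """Render epoch milliseconds as a UTC date/time string without a zone suffix."""
    return (EPOCH + timedelta(milliseconds=epoch_ms)).strftime(TS_FORMAT)


def parse_ts_action(value: str) -> datetime:
    """Parse a ts_action string back into a timezone-aware UTC datetime."""
    return datetime.strptime(value, TS_FORMAT).replace(tzinfo=timezone.utc)


print(format_ts_action(1466215404000))  # 2016-06-18 02:03:24
```

When reading `ts_action` back, attaching UTC explicitly, as `parse_ts_action` does, avoids the value being treated as local time by accident.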
Geo IP Enrichments
Based on the `visitor_ip` field, we enrich the following:
name | description | example value |
---|---|---|
ip_continent | Continent from GeoIP | “NA” |
ip_country | Country from GeoIP | “US” |
ip_city | City from GeoIP | “New York” |
ip_lat | Latitude from GeoIP (postal code granularity) | 40.676 |
ip_lon | Longitude from GeoIP (postal code granularity) | -73.963 |
ip_postal | Postal code from GeoIP | “11238” |
ip_subdivision | Subdivision (e.g. US state) from GeoIP | “NY” |
ip_timezone | Time Zone of visitor based on GeoIP | “America/New_York” |
ip_market_name | Nielsen DMA name (see note below) | “New York” |
ip_market_nielsen | Nielsen DMA ID (see note below) | “501” |
ip_market_doubleclick | Google DoubleClick DMA ID (see note below) | “3” |
On Nielsen Designated Market Areas (DMA)
`ip_market_name`, `ip_market_nielsen`, and `ip_market_doubleclick` all refer to Nielsen Designated Market Areas, which are only defined in the United States. This means these fields will only be populated for events that originate from U.S.-based IP addresses.
URL and Referrer Enrichments
Based on the `url`, `referrer`, `session_initial_url` and `session_initial_referrer` fields, we provide a number of enrichments. For the sake of illustration, we’ll assume the following values:
field | value |
---|---|
url | “https://www.example.com/article-1234?campaignid=1234#fragment“ |
referrer | “https://www.google.ca/“ |
session_initial_url | “https://www.example.com/article-1234?campaignid=1234#fragment“ |
session_initial_referrer | “https://www.google.ca/“ |
On URL Parsing
Attributes added to parsed URLs such as `fragment`, `netloc`, `params`, `query` and `scheme` adhere to RFC 1808.
name | description | example value |
---|---|---|
url_clean | Cleaned url (strip query/fragment) | “https://www.example.com/article-1234“ |
url_domain | url parsed domain, matched against TLD list | “example.com” |
url_fragment | Fragment portion of url | “fragment” |
url_netloc | Netloc portion of url | “www.example.com” |
url_params | Params portion of url | “” |
url_path | Path portion of url | “/article-1234” |
url_query | Query portion of url | “campaignid=1234” |
url_scheme | Scheme portion of url | “https” |
ref_category | referrer category (traffic source categorization) | “search” |
ref_clean | Clean referrer URL (strip query/fragment) | “https://www.google.ca/“ |
ref_domain | referrer parsed domain, matched against TLD list | “google.ca” |
ref_fragment | Fragment portion of referrer | “” |
ref_netloc | Netloc portion of referrer | “www.google.ca” |
ref_params | Params portion of referrer | “” |
ref_path | Path portion of referrer | “/” |
ref_query | Query portion of referrer | “” |
ref_scheme | Scheme portion of referrer | “https” |
surl_clean | Cleaned session_initial_url (strip query/fragment) | “https://www.example.com/article-1234“ |
surl_domain | session_initial_url parsed domain, matched against TLD list | “example.com” |
surl_fragment | Fragment portion of session_initial_url | “fragment” |
surl_netloc | Netloc portion of session_initial_url | “www.example.com” |
surl_params | Params portion of session_initial_url | “” |
surl_path | Path portion of session_initial_url | “/article-1234” |
surl_query | Query portion of session_initial_url | “campaignid=1234” |
surl_scheme | Scheme portion of session_initial_url | “https” |
sref_category | Session referrer category (traffic source categorization) | “search” |
sref_clean | Clean session referrer URL (strip query/fragment) | “https://www.google.ca/“ |
sref_domain | Referrer parsed domain, matched against TLD list | “google.ca” |
sref_fragment | Fragment portion of session_initial_referrer | “” |
sref_netloc | Netloc portion of session_initial_referrer | “www.google.ca” |
sref_params | Params portion of session_initial_referrer | “” |
sref_path | Path portion of session_initial_referrer | “/” |
sref_query | Query portion of session_initial_referrer | “” |
sref_scheme | Scheme portion of session_initial_referrer | “https” |
surl_utm_campaign | The utm_campaign specified in the session_initial_url | “subscriber_newsletter” |
surl_utm_content | The utm_content specified in the session_initial_url | “template_a” |
surl_utm_medium | The utm_medium specified in the session_initial_url | “email” |
surl_utm_source | The utm_source specified in the session_initial_url | “newsletter_2016-06-01” |
surl_utm_term | The utm_term specified in the session_initial_url | “footer” |
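Several of the fields in the table above can be approximated with Python's `urllib.parse`, whose `urlparse` splits a URL into the same six components (scheme, netloc, path, params, query, fragment). This is a sketch, not Parse.ly's actual enrichment code. The helper name `url_parts` is hypothetical, and the `_domain` and `_category` fields are left out because they depend on a TLD list and a traffic-source categorization that this page does not describe.

```python
from urllib.parse import urlparse


def url_parts(raw_url: str, prefix: str = "url") -> dict:
    """Split a raw URL into fields named like the enrichments above."""
    parsed = urlparse(raw_url)
    # "Clean" here means the URL with its query and fragment removed.
    clean = parsed._replace(query="", fragment="").geturl()
    return {
        f"{prefix}_clean": clean,
        f"{prefix}_fragment": parsed.fragment,
        f"{prefix}_netloc": parsed.netloc,
        f"{prefix}_params": parsed.params,
        f"{prefix}_path": parsed.path,
        f"{prefix}_query": parsed.query,
        f"{prefix}_scheme": parsed.scheme,
    }
```

Calling `url_parts(event["referrer"], prefix="ref")` yields the corresponding `ref_` keys, and the same applies to the `surl` and `sref` prefixes.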
Metadata
Whether metadata was crawled via JSON-LD or passed directly in pixels (as is the case in Parse.ly’s video integration), metadata associated with the `url` field is passed along in a series of `metadata_` fields:
name | description | example value |
---|---|---|
metadata | Flag to indicate if metadata is available | true |
metadata_authors | Array of authors for the post/video | [“Albert Einstein”, “Richard Feynman”] |
metadata_canonical_url | The canonical URL of a post, or in the case of videos, the video ID | “http://www.example.com/article-1234“ |
metadata_pub_date_tmsp | Publish date of the post in milliseconds since the UNIX epoch | 1471392000000 |
metadata_custom_metadata | String of optional custom metadata (for more information, see the integration docs) | “{“internal_post_id”: “2134”}” |
metadata_section | Section the post/video was published in | “Physics” |
metadata_tags | Array of tags associated with the post/video | [“science”, “physics”, “quantum mechanics”] |
metadata_title | Title of the post/video | “Thoughts on Quantum Electrodynamics” |
metadata_image_url | URL to image for the post/video | “https://www.evernote.com/l/AAFSrhKOoExCqKji3f9BS9YKfZEC-yerafgB/image.png“ |
metadata_full_content_word_count | Word count of the post (irrelevant for videos) | 1562 |
metadata_data_source | How the metadata was collected i.e. ‘crawl’, ‘pixel’, etc | “crawl” |
metadata_urls | The aliased URLs that the post lives on (e.g. Google AMP, http://m., main page) that reference the metadata_canonical_url | “https://m.google.com/article“ |
metadata_post_id | The post id of the article. This is the unique id of a post when the metadata exists | 99999 |
metadata_share_urls | The social share URLs of the post in a comma separated list. Share links are from: Facebook, LinkedIn, Pinterest, and Twitter | [“http://example.com/post”,”http://example.com/post”,”http://example.com/post”,”http://example.com/post”,”http://example.com/post”] |
metadata_page_type | Type of page (i.e. post, section, frontpage, etc) | “post” |
metadata_save_date_tmsp | Save date of the post in milliseconds (epoch format) | 1471392000000 |
metadata_thumb_url | The url of the thumbnail image for the post | https://images.example.com/imagelocation |
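Since `metadata_custom_metadata` arrives as a string, it may need a second decoding step before its contents can be queried. The sketch below assumes the string holds JSON, as the example value in the table suggests; that is an assumption, so the helper falls back to returning the raw string. The function name `decode_custom_metadata` is hypothetical.

```python
import json
from typing import Any


def decode_custom_metadata(event: dict) -> Any:
    """Decode metadata_custom_metadata if it holds JSON, else return it as-is."""
    raw = event.get("metadata_custom_metadata")
    if not raw:
        return None  # field missing, null, or empty
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        return raw  # not JSON after all; hand back the original value
```

With the example value from the table, the result is a dict whose `internal_post_id` key could be used to join against an internal CMS export.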
UA and Device Enrichments
Based on the `ua` field, we enrich the following:
name | description | example value |
---|---|---|
ua_browser | Browser derived from UA | “Mobile Safari” |
ua_browserversion | Browser version derived from UA | “9.1.2” |
ua_devicebrand | Device Brand derived from UA | “Apple” |
ua_devicemodel | Device Model derived from UA | “iPhone” |
ua_devicetouchcapable | Flag to indicate if device is touch capable | true |
ua_devicetype | Device Type (mobile/tablet/desktop) from UA | “mobile” |
ua_os | Device Operating System from UA | “iOS” |
ua_osversion | Device Operating System version From UA | “9.3” |
We also provide information regarding the display of the device:
name | description | example value |
---|---|---|
display | Flag to indicate if display info is available | true |
display_avail_height | available height of the display, in pixels (equivalent to JavaScript’s screen.availHeight property) | 877 |
display_avail_width | available width of the display, in pixels (equivalent to JavaScript’s screen.availWidth property) | 1436 |
display_pixel_depth | color resolution (in bits per pixel) | 24 |
display_total_height | total height of the display, in pixels | 900 |
display_total_width | total width of the display, in pixels | 1440 |
slot | flag to indicate if the slot position on the page is available | true |
UTM Parameter Enrichments
Based on the `url` field, we enrich the following from its query parameters. Note that “UTM parameters” are a web-wide de facto standard for campaign tracking that was first introduced by Urchin and Google Analytics. Google runs a free tool called the URL builder to build URLs with this format, but many tools will automatically add these parameters to allow for easier tracking, especially in places where HTTP referrers are not automatically set.
In this example, we take the above article URL, `http://mashable.com/1234`, and we assume that it was clicked from an email newsletter. It might then have had query parameters like the following (scroll to read):
http://mashable.com/1234?utm_source=newsletter_2016-06-01&utm_medium=email&utm_term=footer&utm_content=template_a&utm_campaign=subscriber_newsletter
Which would be parsed as follows:
name | description | example value |
---|---|---|
campaign_id | Campaign identifier or name | “subscribers_email” |
utm_campaign | Campaign identifier or name | “subscriber_newsletter” |
utm_content | Template or style (e.g. for A/B tests) | “template_a” |
utm_medium | Medium campaign ran on (e.g. email, social) | “email” |
utm_source | The specific identifier for the source content | “newsletter_2016-06-01” |
utm_term | A keyword or term associated with the click | “footer” |
UTM parameter tracking is powerful because it allows you to do grouping, rollup, and slice-and-dice of your campaigns, which often have associated costs and thus can be part of an ROI calculation. It also helps tremendously with decoding “direct” traffic; e.g. in many email service providers, the above click from an email newsletter would have no HTTP referrer set, and thus UTM parameters would be the only way to understand this traffic.
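The `utm_` values can be recovered from the example newsletter URL above with Python's standard library. This is a sketch rather than Parse.ly's own parser, and the function name `utm_params` is hypothetical. It does not cover `campaign_id`, whose source this page does not describe.

```python
from urllib.parse import parse_qs, urlparse

UTM_KEYS = ("utm_campaign", "utm_content", "utm_medium", "utm_source", "utm_term")


def utm_params(raw_url: str) -> dict:
    """Extract the UTM query parameters that are present in a URL."""
    query = parse_qs(urlparse(raw_url).query)
    # parse_qs returns a list per key; keep the first value when a key repeats.
    return {key: query[key][0] for key in UTM_KEYS if key in query}
```

A URL with no UTM parameters yields an empty dict, which makes it easy to tell tagged campaign traffic from untagged traffic.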
Extra Data
Arbitrary key-value pairs can be passed via Parse.ly’s dynamic tracking or our implementation for custom segments. Such custom data may include subscriber information, or IDs for use in joining to other data sources. In these situations, your key/value pairs will appear as a nested JSON object in the `extra_data` field.
As part of your own ETL, you can “flatten” these fields up into your root document format if you wish to include them in whatever downstream database you use to store Parse.ly raw data.
`"action": "_scroll"`
`"extra_data": {"_y": 1430}`
In this example, a custom event, `_scroll`, was sent to our data pipeline, and it had associated custom data, `{"_y": 1430}`, which represents 1,430 pixels on the y-axis of scroll-depth within the browser. This kind of raw data could be used to implement scroll depth tracking.
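The “flattening” step could be sketched as below. The function name `flatten_extra_data` and the `extra_` column prefix are hypothetical choices; the prefix simply keeps custom keys from colliding with the schema's own field names.

```python
def flatten_extra_data(event: dict, prefix: str = "extra_") -> dict:
    """Return a copy of the event with extra_data keys lifted to the root."""
    flat = {key: value for key, value in event.items() if key != "extra_data"}
    for key, value in (event.get("extra_data") or {}).items():
        flat[prefix + key] = value  # e.g. "_y" becomes "extra__y"
    return flat
```

Only one level is flattened. If your custom data is itself nested, you would need to decide whether to recurse or to store the inner objects as JSON strings.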
Other Possibilities
This raw data schema is already quite rich and allows for quite a large number of queries that are not supported in Parse.ly’s dashboard or APIs. Nonetheless, you may want some help thinking through the possibilities of “what else” to store in your raw data events. For example:
- subscriber identifiers, to do detailed loyalty analysis
- more granular information about on-page or in-app activities
- a specialized set of query parameters for social virality modeling
- ad impression or revenue data
- and anything else you can think up!
Next Steps
Read on for our Code Examples.
Or, get help from our team:
- If you are already a Parse.ly customer, get in touch with us, and we’ll be happy to consult with you on advanced use cases for your raw data.
- If you are not a Parse.ly customer, you’ll first need to go through our basic integration, but we are glad to schedule a demo where we can share some of the awesome things our existing customers have done with this unlimited flexibility.
Last updated: August 15, 2024