Data Pipeline: Legacy Schema v2.2.5

Caution: Legacy Documentation

This documentation refers to an earlier version of the Parse.ly Data Pipeline (versions below 2.3.0). Customers using data after September 2019 should refer to the current Data Pipeline schema documentation. If you’re unsure of your schema version, check the schema_version column in your data. If that column is not available, you are on the right page.

JSON Format

Whether you access your raw data via S3 (bulk) or Kinesis (streaming), you are going to be dealing with lines of JSON objects, aka JSONLines.

This format is straightforward to parse in most programming languages, cloud SQL engines, and big data tools.

This page describes the schema of these JSON records (keys and values) so that you can interpret the raw events as they come in.
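
As a minimal sketch of reading such records in Python, assuming the data is already available as uncompressed newline-delimited JSON text: the helper name `iter_events` and the two-record sample are illustrative only and are not part of Parse.ly's tooling.

```python
import io
import json
from typing import Iterator, TextIO


def iter_events(lines: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty JSONLines row."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical sample; real records carry many more keys.
sample = io.StringIO(
    '{"action": "pageview", "apikey": "mashable.com"}\n'
    '{"action": "heartbeat", "apikey": "mashable.com", "engaged_time_inc": 10}\n'
)
events = list(iter_events(sample))
print(len(events), events[0]["action"])
```

Because each line is a complete JSON object, the records can be processed one at a time without loading a whole file into memory.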

Example JSON Page View Record

We’ll start with an example pageview record, with the keys simply alphabetically sorted, from one of our sites.

You’ll notice that for the most part, these are straightforward key-value pairs, typically strings, but occasionally numbers, null, or booleans (true / false).

Base Event Fields

name	description	example value
action	event type identifier	“pageview”
apikey	site identifier	“mashable.com”
referrer	raw referring URL	“https://www.facebook.com/instantarticles#v1”
session_id	Session identifier	1
user_agent	Raw User-Agent (UA) string	“Mozilla/5.0 (iPhone; CPU … Safari/601.1”
url	Raw URL on which action occurred	“http://mashable.com/1234#d3d”
visitor_site_id	Visitor first-party site identifier	“0beabdd1-7b0c-423b-9fae-660101fc8953”
engaged_time_inc	Engaged time in seconds; only available where action = ‘heartbeat’ or ‘vheartbeat’	10

These are the raw required fields we get from your integration with our data collection infrastructure, whether that’s:

basic integration for standard web pages
dynamic tracking of custom events/data
mobile SDKs for iOS or Android

They will be present in every event, regardless of event type or source, with two possible exceptions: the session_id and visitor_ip fields can be omitted, though all of our integrations attempt to support these fields to the best of their ability.

On One-Time Historical Imports

Customers often ask us whether it might be possible to do a one-time import of historical pageview (or other) event data from legacy web analytics systems. The answer to this question is “yes”, though it does require some custom work on Parse.ly’s side. We also need to have equivalents for the above “Base Event” fields in order to make sense of your historical data.

Timestamp Fields

We record two raw timestamps per event: one from our data collection servers and one from our client-side trackers. These are stored as numbers that represent milliseconds since the UNIX epoch (note the _ms suffix on the field names), a millisecond-scaled form of UNIX time. Our server clocks are in UTC. You may therefore need to pull data spanning multiple UTC days to cover a full day in another timezone.

name	description	example value
timestamp_info	Flag to indicate if timestamp info is available	true
timestamp_info_nginx_ms	The automatic server-side event timestamp	1493598778000
timestamp_info_pixel_ms	The automatic client-side event timestamp	1493598778538
timestamp_info_override_ms	A client-side override timestamp	1493598778000
ts_action	Date/time of the event. This is a formatted date/time of the timestamp_info_nginx_ms in GMT	2017-05-01 00:32:58
ts_session_current	Date/time of the current session derived from timestamp_info_pixel_ms	2017-05-01 00:30:00
ts_session_last	Date/time of the previous session	2017-04-14 20:22:47

In general, Parse.ly’s internal attitude is, “the client-side timestamp cannot be trusted”. However, there are situations in which it can make sense to trust it over the server timestamp.

Our nginx (server-side) timestamp is at second resolution, whereas our pixel (client-side) timestamp is at millisecond resolution. If a pixel timestamp is within a few seconds of the corresponding nginx timestamp, it is likely more accurate: it represents when the event was sent, at millisecond resolution, rather than when the event was received, at second resolution. With our standard JavaScript tracker, the nginx and pixel timestamps are always captured together, so combining them makes JavaScript tracker-based events as accurate as possible.

In mobile SDKs for iOS and Android, it is common to “batch” events if devices are offline. These are also known as “late-arriving” events. In these cases, neither the auto-generated server-side timestamp (in nginx) nor the auto-generated client-side timestamp (in pixel) can be trusted; instead, the client-side override timestamp may be a more accurate representation of reality. The mobile SDK populates this override by filling a ts field in the data key-value object sent with every event.
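
The reasoning above could be turned into a selection rule along the following lines. This is one possible policy and not Parse.ly's own logic: the function name, the `prefer_override` flag, and the 5-second tolerance are all assumptions, since the text only says "within a few seconds". Whether the override field is populated for non-SDK events is not stated here, so the caller decides when to trust it.

```python
from typing import Optional

# Assumed tolerance; the documentation only says "within a few seconds".
MAX_SKEW_MS = 5_000


def best_timestamp_ms(
    event: dict,
    prefer_override: bool = False,
    max_skew_ms: int = MAX_SKEW_MS,
) -> Optional[int]:
    """Pick the timestamp (ms since the UNIX epoch) to treat as the event time."""
    override = event.get("timestamp_info_override_ms")
    nginx = event.get("timestamp_info_nginx_ms")
    pixel = event.get("timestamp_info_pixel_ms")

    # Batched ("late-arriving") events: the caller opts in to the override.
    if prefer_override and override is not None:
        return override

    # Pixel time is finer-grained; use it only when it agrees with nginx.
    if nginx is not None and pixel is not None and abs(pixel - nginx) <= max_skew_ms:
        return pixel

    # Otherwise fall back to the server-side timestamp (may be None).
    return nginx
```

A caller might set `prefer_override=True` only for events known to come from a mobile SDK, and leave the default for JavaScript tracker events.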

On Timezones

Parse.ly’s JavaScript tracker populates the client-side timestamp using new Date().getTime(), which means that it is in UTC. Our server clocks are also in UTC, so the two should be comparable. Note, however, that UNIX time itself does not embed any timezone information; it simply counts the time elapsed since a specific UTC instant in the past, the UNIX epoch. You could try to infer the user’s local timezone from their IP address, based on their estimated geography. Combining the timestamp with that inferred timezone lets you interpret the event in the user’s local time.
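
A minimal sketch of that conversion, assuming the timezone name comes from the ip_timezone field described under Geo IP Enrichments and that Python 3.9+ with a system timezone database is available. The helper name `to_local_time` is illustrative.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; relies on the system tz database


def to_local_time(epoch_ms: int, tz_name: str) -> datetime:
    """Convert milliseconds since the UNIX epoch to an aware local datetime."""
    utc_dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return utc_dt.astimezone(ZoneInfo(tz_name))


# Example values taken from the tables in this document.
local = to_local_time(1493598778000, "America/New_York")
print(local.strftime("%Y-%m-%d %H:%M:%S %Z"))  # 2017-04-30 20:32:58 EDT
```

Note that the event falls on 2017-05-01 in UTC but on 2017-04-30 in the visitor's local time, which is why a local-day report can require data from more than one UTC day.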

Event ID

name	description	example value
event_id	unique event identifier string	“0xe6508eda93d5598367b18555ae9b828d”

A unique, hex-encoded ID string is also generated for each Event. This property can be used to deduplicate events for easier ingestion and processing.

This unique ID is generated by hashing the values of apikey, action, url, timestamp (internal, generated property), visitor_site_id, and timestamp_info_pixel_ms. To ensure that each event_id is truly unique, make sure that all events sent to Parse.ly provide all of these required fields (excluding timestamp, which is generated on our side) at an appropriate level of cardinality and granularity.

For example, if visitor_site_id is not provided for a series of events, then the only properties able to generate unique values for those events are the event type and the timestamp.
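
As a sketch of how event_id might be used for deduplication during ingestion: the helper below keeps the first occurrence of each ID. The function name is illustrative, and passing through events that lack an event_id is an assumption about how you would want to treat them.

```python
from typing import Iterable, Iterator


def dedupe_events(events: Iterable[dict]) -> Iterator[dict]:
    """Yield each event once, keyed on event_id (first occurrence wins)."""
    seen = set()
    for event in events:
        event_id = event.get("event_id")
        if event_id is None:
            # Assumption: without an ID we cannot dedupe, so pass it through.
            yield event
            continue
        if event_id not in seen:
            seen.add(event_id)
            yield event
```

The in-memory set is only suitable for modest volumes; for larger datasets the same idea is usually expressed as a unique key or a DISTINCT-style query in the downstream store.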

Visitors

name	description	example value
visitor	Flag to indicate if visitor info is available	true
visitor_site_id	Visitor first-party site identifier	“0beabdd1-7b0c-423b-9fae-660101fc8953”
visitor_network_id	Visitor third-party network identifier	“1acdecd1-8e0d-483c-7aef-660101fc9354”

Note

As previously mentioned, this is legacy documentation. We no longer set third-party cookies.

The visitor_site_id is set by a first-party cookie and visitor_network_id is set by a third-party cookie.

Session Enrichments

Parse.ly’s JavaScript tracker automatically creates some useful session information that can help with user session analysis. For one thing, Parse.ly’s session_id also doubles as a “number of visits” value, since it’s an auto-incrementing integer that starts at 1 and moves up by one for every new visit by a visitor with the same visitor_site_id.

Note that these enrichments are done client-side by Parse.ly’s JavaScript tracker; they will not apply to events that arrive via other integrations.

The other fields stored with the session are described below:

name	description	example value
session_id	auto-incrementing session identifier, unique to visitor_site_id	1
session_initial_referrer	the raw referring URL of the first pageview event of this session	“http://facebook.com”
session_initial_url	the raw URL of the first pageview event of this session	“http://mashable.com/1234#d3d”
session_last_session_timestamp	Timestamp of the last visit, or 0 if none	0
session_timestamp	Timestamp of first pageview event of this session	1466214847371
session	flag to indicate if session info is available	true
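
Because session_id starts at 1 and increments with each visit, a simple new-versus-returning split can be sketched as follows. The function name is illustrative, and treating a missing session_id as "not returning" is an assumption, since these enrichments only apply to events from the JavaScript tracker.

```python
def is_returning_visit(event: dict) -> bool:
    """True when the event belongs to a visitor's second or later session."""
    session_id = event.get("session_id")
    # Assumption: events without a usable session_id count as not returning.
    return isinstance(session_id, int) and session_id > 1


print(is_returning_visit({"session_id": 1}))  # False
print(is_returning_visit({"session_id": 4}))  # True
```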

Timestamp Enrichments

Based on the above timestamp fields, we also create an important field called ts_action. This is timestamp_info_nginx_ms (our server time) re-interpreted as a formatted date string that is highly compatible with a number of systems. For example, it is the same format expected by Amazon Redshift and Google BigQuery’s JSON value parsers.

ts_action: "2016-06-18 02:03:24"

The value above is derived from epoch time 1466215404000; it lacks timezone information but can be interpreted as a UTC time. It turns out that including timezone information, as one might for the “full” ISO8601 standard, makes this string incompatible with some SQL engines, so we chose a maximally compatible format instead.
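
The relationship between the millisecond timestamp and the formatted string can be sketched in Python as below. The helper names are illustrative; the format string is inferred from the example value rather than taken from Parse.ly's code.

```python
from datetime import datetime, timezone

TS_ACTION_FORMAT = "%Y-%m-%d %H:%M:%S"  # inferred from the example value


def format_ts_action(epoch_ms: int) -> str:
    """Format milliseconds since the UNIX epoch as a UTC date/time string."""
    dt = datetime.fromtimestamp(epoch_ms // 1000, tz=timezone.utc)
    return dt.strftime(TS_ACTION_FORMAT)


def parse_ts_action(value: str) -> datetime:
    """Parse a ts_action string, attaching the UTC timezone it implies."""
    return datetime.strptime(value, TS_ACTION_FORMAT).replace(tzinfo=timezone.utc)


print(format_ts_action(1466215404000))  # 2016-06-18 02:03:24
```

When parsing the string back, attaching UTC explicitly avoids it being misread as local time.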

Geo IP Enrichments

Based on the visitor_ip field, we enrich the following:

name	description	example value
ip_continent	Continent from GeoIP	“NA”
ip_country	Country from GeoIP	“US”
ip_city	City from GeoIP	“New York”
ip_lat	Latitude from GeoIP (postal code granularity)	40.676
ip_lon	Longitude from GeoIP (postal code granularity)	-73.963
ip_postal	Postal code from GeoIP	“11238”
ip_subdivision	Subdivision (e.g. US state) from GeoIP	“NY”
ip_timezone	Time Zone of visitor based on GeoIP	“America/New_York”
ip_market_name	Nielsen DMA name (see note below)	“New York”
ip_market_nielsen	Nielsen DMA ID (see note below)	“501”
ip_market_doubleclick	Google DoubleClick DMA ID (see note below)	“3”

On Nielsen Designated Market Areas (DMA)

ip_market_name, ip_market_nielsen, and ip_market_doubleclick all refer to Nielsen Designated Market Areas, which are only defined in the United States. This means these fields will only be populated for events that originate from U.S.-based IP addresses.

URL and Referrer Enrichments

Based on the url, referrer, session_initial_url and session_initial_referrer fields, we provide a number of enrichments. For the sake of illustration, we’ll assume the following values:

field	value
url	“https://www.example.com/article-1234?campaignid=1234#fragment”
referrer	“https://www.google.ca/”
session_initial_url	“https://www.example.com/article-1234?campaignid=1234#fragment”
session_initial_referrer	“https://www.google.ca/”

On URL Parsing

Attributes added to parsed URLs such as: fragment, netloc, params, query and scheme adhere to RFC 1808.

name	description	example value
url_clean	Cleaned `url` (strip query/fragment)	“https://www.example.com/article-1234”
url_domain	`url` parsed domain, matched against TLD list	“example.com”
url_fragment	Fragment portion of `url`	“fragment”
url_netloc	Netloc portion of `url`	“www.example.com”
url_params	Params portion of `url`	“”
url_path	Path portion of `url`	“/article-1234”
url_query	Query portion of `url`	“campaignid=1234”
url_scheme	Scheme portion of `url`	“https”
ref_category	`referrer` category (traffic source categorization)	“search”
ref_clean	Clean `referrer` URL (strip query/fragment)	“https://www.google.ca/”
ref_domain	`referrer` parsed domain, matched against TLD list	“google.ca”
ref_fragment	Fragment portion of `referrer`	“”
ref_netloc	Netloc portion of `referrer`	“www.google.ca”
ref_params	Params portion of `referrer`	“”
ref_path	Path portion of `referrer`	“/”
ref_query	Query portion of `referrer`	“”
ref_scheme	Scheme portion of `referrer`	“https”
surl_clean	Cleaned `session_initial_url` (strip query/fragment)	“https://www.example.com/article-1234”
surl_domain	`session_initial_url` parsed domain, matched against TLD list	“example.com”
surl_fragment	Fragment portion of `session_initial_url`	“fragment”
surl_netloc	Netloc portion of `session_initial_url`	“www.example.com”
surl_params	Params portion of `session_initial_url`	“”
surl_path	Path portion of `session_initial_url`	“/article-1234”
surl_query	Query portion of `session_initial_url`	“campaignid=1234”
surl_scheme	Scheme portion of `session_initial_url`	“https”
sref_category	Session referrer category (traffic source categorization)	“search”
sref_clean	Clean `session_initial_referrer` URL (strip query/fragment)	“https://www.google.ca/”
sref_domain	`session_initial_referrer` parsed domain, matched against TLD list	“google.ca”
sref_fragment	Fragment portion of `session_initial_referrer`	“”
sref_netloc	Netloc portion of `session_initial_referrer`	“www.google.ca”
sref_params	Params portion of `session_initial_referrer`	“”
sref_path	Path portion of `session_initial_referrer`	“/”
sref_query	Query portion of `session_initial_referrer`	“”
sref_scheme	Scheme portion of `session_initial_referrer`	“https”
surl_utm_campaign	The utm_campaign specified in the `session_initial_url`	“subscriber_newsletter”
surl_utm_content	The utm_content specified in the `session_initial_url`	“template_a”
surl_utm_medium	The utm_medium specified in the `session_initial_url`	“email”
surl_utm_source	The utm_source specified in the `session_initial_url`	“newsletter_2016-06-01”
surl_utm_term	The utm_term specified in the `session_initial_url`	“footer”
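
The component names in the table above (scheme, netloc, path, params, query, fragment) match those returned by Python's standard `urllib.parse.urlparse`, so the url_ fields can be approximated as in the sketch below. Whether Parse.ly uses this exact function is an assumption, and the helper name is illustrative. The url_domain field is omitted because it requires matching against a TLD list, which the standard library does not provide.

```python
from urllib.parse import urlparse


def url_enrichments(url: str, prefix: str = "url") -> dict:
    """Approximate the parsed-URL fields for a raw URL."""
    parts = urlparse(url)
    return {
        f"{prefix}_clean": f"{parts.scheme}://{parts.netloc}{parts.path}",
        f"{prefix}_fragment": parts.fragment,
        f"{prefix}_netloc": parts.netloc,
        f"{prefix}_params": parts.params,
        f"{prefix}_path": parts.path,
        f"{prefix}_query": parts.query,
        f"{prefix}_scheme": parts.scheme,
    }


fields = url_enrichments(
    "https://www.example.com/article-1234?campaignid=1234#fragment"
)
print(fields["url_clean"])  # https://www.example.com/article-1234
```

Passing `prefix="ref"`, `"surl"`, or `"sref"` would produce the corresponding field names for the other three source fields.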

Metadata

Whether metadata was crawled via JSON-LD or passed directly in pixels (as is the case in Parse.ly’s video integration), metadata associated with the url field is passed along in a series of metadata_ fields:

name	description	example value
metadata	Flag to indicate if metadata is available	true
metadata_authors	Array of authors for the post/video	[“Albert Einstein”, “Richard Feynman”]
metadata_canonical_url	The canonical URL of a post, or in the case of videos, the video ID	“http://www.example.com/article-1234”
metadata_pub_date_tmsp	Publish date of the post in milliseconds since the UNIX epoch	1471392000000
metadata_custom_metadata	String of optional custom metadata (for more information, see the integration docs)	“{“internal_post_id”: “2134”}”
metadata_section	Section the post/video was published in	“Physics”
metadata_tags	Array of tags associated with the post/video	[“science”, “physics”, “quantum mechanics”]
metadata_title	Title of the post/video	“Thoughts on Quantum Electrodynamics”
metadata_image_url	URL to image for the post/video	“https://www.evernote.com/l/AAFSrhKOoExCqKji3f9BS9YKfZEC-yerafgB/image.png”
metadata_full_content_word_count	Word count of the post (irrelevant for videos)	1562
metadata_data_source	How the metadata was collected i.e. ‘crawl’, ‘pixel’, etc	“crawl”
metadata_urls	The aliased URLs that the post lives on (i.e. Google AMP, http://m., main page) that reference the metadata_canonical_url	“https://m.google.com/article”
metadata_post_id	The post id of the article. This is the unique id of a post when the metadata exists	99999
metadata_share_urls	The social share URLs of the post in a comma separated list. Share links are from: Facebook, LinkedIn, Pinterest, and Twitter	[“http://example.com/post”,”http://example.com/post”,”http://example.com/post”,”http://example.com/post”,”http://example.com/post”]
metadata_page_type	Type of page (i.e. post, section, frontpage, etc)	“post”
metadata_save_date_tmsp	Save date of the post in milliseconds (epoch format)	1471392000000
metadata_thumb_url	The URL of the thumbnail image for the post	“https://images.example.com/imagelocation”

UA and Device Enrichments

Based on the user_agent field, we enrich the following:

name	description	example value
ua_browser	Browser derived from UA	“Mobile Safari”
ua_browserversion	Browser version derived from UA	“9.1.2”
ua_devicebrand	Device Brand derived from UA	“Apple”
ua_devicemodel	Device Model derived from UA	“iPhone”
ua_devicetouchcapable	Flag to indicate if device is touch capable	true
ua_devicetype	Device Type (mobile/tablet/desktop) from UA	“mobile”
ua_os	Device Operating System from UA	“iOS”
ua_osversion	Device Operating System version From UA	“9.3”

We also provide information regarding the display of the device:

name	description	example value
display	Flag to indicate if display info is available	true
display_avail_height	available height of the display, in pixels (equivalent to JavaScript’s `screen.availHeight` property)	877
display_avail_width	available width of the display, in pixels (equivalent to JavaScript’s `screen.availWidth` property)	1436
display_pixel_depth	color resolution (in bits per pixel)	24
display_total_height	total height of the display, in pixels	900
display_total_width	total width of the display, in pixels	1440
slot	flag to indicate if the slot position on the page is available	true

UTM Parameter Enrichments

Based on the url field, we enrich the following from its query parameters. Note that “UTM parameters” are a web-wide de facto standard for campaign tracking, first introduced by Urchin and Google Analytics. Google runs a free tool called the URL builder to build URLs with this format, but many tools will automatically add these parameters to allow for easier tracking, especially in places where HTTP referrers are not automatically set.

In this example, we take the above article URL, http://mashable.com/1234, and assume that it was clicked from an email newsletter. It might then have had query parameters like the following:

http://mashable.com/1234?utm_source=newsletter_2016-06-01&utm_medium=email&utm_term=footer&utm_content=template_a&utm_campaign=subscriber_newsletter

Which would be parsed as follows:

name	description	example value
campaign_id	Campaign identifier or name	“subscribers_email”
utm_campaign	Campaign identifier or name	“subscriber_newsletter”
utm_content	Template or style (e.g. for A/B tests)	“template_a”
utm_medium	Medium campaign ran on (e.g. email, social)	“email”
utm_source	The specific identifier for the source content	“newsletter_2016-06-01”
utm_term	A keyword or term associated with the click	“footer”

UTM parameter tracking is powerful because it allows you to do grouping, rollup, and slice-and-dice of your campaigns, which often have associated costs and thus can be part of an ROI calculation. It also helps tremendously with decoding “direct” traffic; e.g. in many email service providers, the above click from an email newsletter would have no HTTP referrer set, and thus UTM parameters would be the only way to understand this traffic.
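
For reference, extracting the five utm_ parameters from a raw URL can be sketched with the Python standard library as below. The helper name is illustrative, and taking the first value when a parameter repeats is an assumption. The campaign_id field is not covered because this page does not say how it is derived.

```python
from urllib.parse import parse_qs, urlparse

UTM_KEYS = ("utm_campaign", "utm_content", "utm_medium", "utm_source", "utm_term")


def extract_utm(url: str) -> dict:
    """Return the UTM parameters present in a URL's query string."""
    query = parse_qs(urlparse(url).query)
    # Assumption: if a parameter repeats, keep its first value.
    return {key: query[key][0] for key in UTM_KEYS if key in query}


utm = extract_utm(
    "http://mashable.com/1234?utm_source=newsletter_2016-06-01"
    "&utm_medium=email&utm_term=footer&utm_content=template_a"
    "&utm_campaign=subscriber_newsletter"
)
print(utm["utm_campaign"])  # subscriber_newsletter
```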

Extra Data

Arbitrary key-value pairs can be passed via Parse.ly’s dynamic tracking or our implementation for custom segments. Such custom data may include subscriber information, or IDs for use in joining to other data sources. In these situations, your key/value pairs will appear as a nested JSON object in the extra_data field.

As part of your own ETL, you can “flatten” these fields into the root of each record if you want to include them in the downstream database where you store Parse.ly raw data.

"action": "_scroll"
"extra_data": {"_y": 1430}

In this example, a custom event, _scroll, was sent to our data pipeline, and it had associated custom data, {"_y": 1430}, which represents 1,430 pixels on the y-axis of scroll-depth within the browser. This kind of raw data could be used to implement scroll depth tracking.
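
A flattening step of the kind described above might look like the following sketch. The function name and the `extra_` column prefix are assumptions chosen to avoid collisions with existing root-level keys; use whatever naming suits your downstream schema.

```python
def flatten_extra_data(event: dict, prefix: str = "extra_") -> dict:
    """Copy an event, lifting extra_data keys to the root with a prefix."""
    flat = {key: value for key, value in event.items() if key != "extra_data"}
    # extra_data may be absent or null on events without custom data.
    for key, value in (event.get("extra_data") or {}).items():
        flat[f"{prefix}{key}"] = value
    return flat


scroll_event = {"action": "_scroll", "extra_data": {"_y": 1430}}
print(flatten_extra_data(scroll_event))  # {'action': '_scroll', 'extra__y': 1430}
```

The original event is left unmodified, so the same record can still be written in its nested form elsewhere.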

Other Possibilities

This raw data schema is already quite rich and allows for quite a large number of queries that are not supported in Parse.ly’s dashboard or APIs. Nonetheless, you may want some help thinking through the possibilities of “what else” to store in your raw data events. For example:

subscriber identifiers, to do detailed loyalty analysis
more granular information about on-page or in-app activities
a specialized set of query parameters for social virality modeling
ad impression or revenue data
and anything else you can think up!

Next Steps

Read on for our Code Examples.

Or, get help from our team:

If you are already a Parse.ly customer, get in touch with us, and we’ll be happy to consult you on advanced use cases for your raw data.
If you are not a Parse.ly customer, you’ll first need to go through our basic integration, but we are glad to schedule a demo where we can share some of the awesome things our existing customers have done with this unlimited flexibility.

Last updated: June 05, 2025