Data Pipeline: Legacy Schema v2.2.5

Caution: Legacy Documentation

This documentation refers to an earlier version of the Parse.ly Data Pipeline (versions below 2.3.0). Customers using data after September 2019 should refer to the current Data Pipeline schema documentation. If you’re unsure of your schema version, check the schema_version column. If that column is not available, you are on the right page.

JSON Format

Whether you access your raw data via S3 (bulk) or Kinesis (streaming), you are going to be dealing with lines of JSON objects, aka JSONLines.

This format is easy to parse in most programming languages, cloud SQL engines, and big data tools.

This page describes the schema of these JSON records (keys and values) so that you can interpret the raw events as they come in.
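As a minimal sketch of how JSONLines input can be read, the Python below parses one JSON object per line. The function name parse_jsonlines and the two sample records are illustrative assumptions; only the field names come from this page.

```python
import json
from typing import Iterable, Iterator


def parse_jsonlines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSONLines input."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        yield json.loads(line)


# Two made-up records that use only field names documented on this page.
sample = (
    '{"action": "pageview", "apikey": "mashable.com", "session_id": 1}\n'
    '{"action": "heartbeat", "apikey": "mashable.com", "engaged_time_inc": 10}\n'
)

events = list(parse_jsonlines(sample.splitlines()))
print(events[0]["action"], events[1]["engaged_time_inc"])
```

The same function accepts an open file object, since iterating over a text file also yields one line at a time.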

Example JSON Page View Record

We’ll start with an example pageview record, with the keys simply alphabetically sorted, from one of our sites.

You’ll notice that for the most part, these are straightforward key-value pairs, typically strings, but occasionally numbers, null, or booleans (true / false).

Base Event Fields

name | description | example value
action | event type identifier | "pageview"
apikey | site identifier | "mashable.com"
referrer | raw referring URL | https://www.facebook.com/instantarticles#v1
session_id | Session identifier | 1
user_agent | Raw User-Agent (UA) string | "Mozilla/5.0 (iPhone; CPU … Safari/601.1"
url | Raw URL on which action occurred | http://mashable.com/1234#d3d
visitor_site_id | Visitor first-party site identifier | "0beabdd1-7b0c-423b-9fae-660101fc8953"
engaged_time_inc | Engaged time in seconds; only available where action = 'heartbeat' or 'vheartbeat' | 10

These are the raw required fields we get from your integration with our data collection infrastructure.

They will be present in every single event, regardless of event type or source. Note that excluding the session_id and the visitor_ip fields is possible, though all of our integrations attempt to support these fields to the best of their ability.

On One-Time Historical Imports

Customers often ask us whether it might be possible to do a one-time import of historical pageview (or other) event data from legacy web analytics systems. The answer to this question is “yes”, though it does require some custom work on Parse.ly’s side. We also need to have equivalents for the above “Base Event” fields in order to make sense of your historical data.

Timestamp Fields

We record two raw timestamps per event. One comes from our data collection servers and one comes from our client-side trackers. These are stored as numbers that represent milliseconds since the UNIX epoch (UNIX time, expressed in milliseconds), as the _ms suffix on the field names indicates. Our server clocks are in UTC. You may therefore need to pull data spanning multiple days for timezone conversions.

name | description | example value
timestamp_info | Flag to indicate if timestamp info is available | true
timestamp_info_nginx_ms | The automatic server-side event timestamp | 1493598778000
timestamp_info_pixel_ms | The automatic client-side event timestamp | 1493598778538
timestamp_info_override_ms | A client-side override timestamp | 1493598778000
ts_action | Date/time of the event. This is a formatted date/time of the timestamp_info_nginx_ms in GMT | 2017-05-01 00:32:58
ts_session_current | Date/time of the current session derived from timestamp_info_pixel_ms | 2017-05-01 00:30:00
ts_session_last | Date/time of the previous session | 2017-04-14 20:22:47

In general, Parse.ly’s internal attitude is, “the client-side timestamp cannot be trusted”. However, there are situations in which it can make sense to trust it over the server timestamp.

Our nginx (server-side) timestamp is at second resolution, whereas our pixel (client-side) timestamp is at millisecond resolution. If a pixel timestamp is within a few seconds of the corresponding nginx timestamp, it is likely more accurate: it represents when the event was sent, at millisecond resolution, rather than when the event was received, at second resolution. With our standard JavaScript tracker, the nginx and pixel timestamps are always captured together, so combining them makes JavaScript tracker-based events as accurate as possible.

In mobile SDKs for iOS and Android, it is common to “batch” events if devices are offline. These are also known as “late-arriving” events. In these cases, neither the auto-generated server-side timestamp (in nginx) nor the auto-generated client-side timestamp (in pixel) can be trusted; instead, the client-side override timestamp may be a more accurate representation of reality. The mobile SDK populates this value by filling a ts field in the data key-value object sent with every event.
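The preference order described above could be sketched in Python as follows. This is an illustration, not Parse.ly's own logic: the function name best_timestamp_ms is hypothetical, the 5,000 ms threshold is an assumed reading of "within a few seconds", and a missing or zero override is assumed to mean "no override".

```python
from typing import Optional

# Assumption: "within a few seconds" is taken to mean 5,000 ms here.
MAX_SKEW_MS = 5_000


def best_timestamp_ms(event: dict, max_skew_ms: int = MAX_SKEW_MS) -> Optional[int]:
    """Pick the most plausible event time, in milliseconds since the UNIX epoch."""
    override = event.get("timestamp_info_override_ms")
    if override:  # late-arriving (batched) events carry their own time
        return override

    nginx = event.get("timestamp_info_nginx_ms")
    pixel = event.get("timestamp_info_pixel_ms")
    if nginx is not None and pixel is not None and abs(pixel - nginx) <= max_skew_ms:
        return pixel  # close to server time, so prefer millisecond resolution
    return nginx  # otherwise fall back to the server-side clock
```

With the example values from the table above and no override set, the pixel timestamp 1493598778538 is 538 ms away from the nginx timestamp, so it would be chosen.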

On Timezones

Parse.ly’s JavaScript tracker populates the client-side timestamp using new Date().getTime(), which means that it is in UTC. Our server clocks are also in UTC, so the two should be comparable. However, note that UNIX time itself does not embed any timezone information. It simply represents the time elapsed since a specific UTC moment in the past, the UNIX epoch. You could try to infer the user’s local timezone from their IP address, based on their estimated geography. If you combine these fields, you can interpret the user’s local time.
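One way to combine these fields is sketched below in Python, pairing the server timestamp with the ip_timezone enrichment documented in the Geo IP section of this page. The function name local_event_time is hypothetical, falling back to UTC when no timezone is present is an assumption, and the standard-library zoneinfo module needs Python 3.9+ plus a timezone database on the host.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_event_time(event: dict) -> datetime:
    """Return the event time in the visitor's estimated timezone (UTC if unknown)."""
    # The raw timestamp is in milliseconds, so convert to seconds first.
    utc_time = datetime.fromtimestamp(
        event["timestamp_info_nginx_ms"] / 1000, tz=timezone.utc
    )
    tz_name = event.get("ip_timezone")
    if not tz_name:
        return utc_time  # assumption: keep UTC when GeoIP gave no timezone
    return utc_time.astimezone(ZoneInfo(tz_name))
```

For an event with timestamp_info_nginx_ms of 1493598778000 and ip_timezone of "America/New_York", this yields 20:32:58 on 2017-04-30 local time, four hours behind the UTC value of 00:32:58 on 2017-05-01.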

Event ID

name | description | example value
event_id | unique event identifier string | "0xe6508eda93d5598367b18555ae9b828d"

A unique, hex-encoded ID string is also generated for each Event. This property can be used to deduplicate events for easier ingestion and processing.

This unique ID is generated by hashing the values of apikey, action, url, timestamp (internal, generated property), visitor_site_id, and timestamp_info_pixel_ms. To ensure that each event_id is truly unique, make sure that all events sent to Parse.ly provide all of these required fields (excluding timestamp, which is generated on our side) at an appropriate level of cardinality and granularity.

For example, if visitor_site_id is not provided for a series of events, then the only properties able to generate unique values for those events are the event type and the timestamp.
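A minimal deduplication sketch in Python, assuming events have already been parsed into dicts. The function name dedupe_events is hypothetical, and passing through events that lack an event_id is an assumption rather than documented behavior.

```python
from typing import Iterable, Iterator


def dedupe_events(events: Iterable[dict]) -> Iterator[dict]:
    """Yield each event once, keeping the first occurrence of every event_id."""
    seen = set()
    for event in events:
        event_id = event.get("event_id")
        if event_id is None:
            yield event  # no ID to compare on, so pass it through unchanged
            continue
        if event_id in seen:
            continue  # duplicate of an event already emitted
        seen.add(event_id)
        yield event
```

This keeps every seen ID in memory, which suits a single batch file; for larger volumes, the same idea is usually expressed as a distinct-by-event_id step in the downstream database.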

Visitors

name | description | example value
visitor | Flag to indicate if visitor info is available | true
visitor_site_id | Visitor first-party site identifier | "0beabdd1-7b0c-423b-9fae-660101fc8953"
visitor_network_id | Visitor third-party network identifier | "1acdecd1-8e0d-483c-7aef-660101fc9354"

Note

As previously mentioned, this is legacy documentation. We no longer set third-party cookies.

The visitor_site_id is set by a first-party cookie and visitor_network_id is set by a third-party cookie.

Session Enrichments

Parse.ly’s JavaScript tracker automatically creates some useful session information that can help with user session analysis. For one thing, Parse.ly’s session_id also doubles as a “number of visits” value, since it’s an auto-incrementing integer that starts at 1 and moves up by one for every new visit by a visitor with the same visitor_site_id.

Note that these enrichments are done client-side by Parse.ly’s JavaScript tracker; they will not apply to events that arrive via other integrations.

The other fields stored with the session are described below:

name | description | example value
session_id | auto-incrementing session identifier, unique to visitor_site_id | 1
session_initial_referrer | the raw referring URL of the first pageview event of this session | http://facebook.com
session_initial_url | the raw URL of the first pageview event of this session | http://mashable.com/1234#d3d
session_last_session_timestamp | Timestamp of the last visit, or 0 if none | 0
session_timestamp | Timestamp of first pageview event of this session | 1466214847371
session | flag to indicate if session info is available | true
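Because session_id counts visits per visitor_site_id, the highest value seen for a visitor approximates how many times that visitor has come to the site. The Python sketch below illustrates this; the function name visits_per_visitor is hypothetical, and events missing either field are simply skipped.

```python
from collections import defaultdict
from typing import Iterable


def visits_per_visitor(events: Iterable[dict]) -> dict:
    """Map each visitor_site_id to the highest session_id seen for it."""
    visits = defaultdict(int)
    for event in events:
        visitor = event.get("visitor_site_id")
        session_id = event.get("session_id")
        if visitor is None or session_id is None:
            continue  # e.g. events from integrations without session enrichment
        visits[visitor] = max(visits[visitor], session_id)
    return dict(visits)
```

A visitor whose only events carry a session_id of 1 is on a first visit, which gives a simple new-versus-returning split.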

Timestamp Enrichments

Based on the above timestamp fields, we also create an important field called ts_action. This is timestamp_info_nginx_ms (our server time) re-interpreted as a formatted date string that is highly compatible with a number of systems. For example, it is the same format expected by Amazon Redshift and Google BigQuery’s JSON value parsers.

  • ts_action: "2016-06-18 02:03:24"

The value above is derived from epoch time 1466215404000; it also lacks timezone information but can be interpreted as a UTC time. It turns out that including timezone information, as one might for the “full” ISO8601 standard, makes this string incompatible with some SQL engines, so we chose a maximally compatible format instead.
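The derivation can be reproduced with a few lines of Python. This is a sketch of the conversion only, with a hypothetical function name; it assumes the input is in milliseconds, as in the example above.

```python
from datetime import datetime, timezone


def format_ts_action(epoch_ms: int) -> str:
    """Format milliseconds since the UNIX epoch as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    seconds = epoch_ms // 1000  # drop sub-second precision
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )


print(format_ts_action(1466215404000))  # 2016-06-18 02:03:24
```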

Geo IP Enrichments

Based on the visitor_ip field, we enrich the following:

name | description | example value
ip_continent | Continent from GeoIP | "NA"
ip_country | Country from GeoIP | "US"
ip_city | City from GeoIP | "New York"
ip_lat | Latitude from GeoIP (postal code granularity) | 40.676
ip_lon | Longitude from GeoIP (postal code granularity) | -73.963
ip_postal | Postal code from GeoIP | "11238"
ip_subdivision | Subdivision (e.g. US state) from GeoIP | "NY"
ip_timezone | Time Zone of visitor based on GeoIP | "America/New_York"
ip_market_name | Nielsen DMA name (see note below) | "New York"
ip_market_nielsen | Nielsen DMA ID (see note below) | "501"
ip_market_doubleclick | Google DoubleClick DMA ID (see note below) | "3"

On Nielsen Designated Market Areas (DMA)

ip_market_name, ip_market_nielsen, and ip_market_doubleclick all refer to Nielsen Designated Market Areas, which are only defined in the United States. This means these fields will only be populated for events that originate from U.S.-based IP addresses.

URL and Referrer Enrichments

Based on the url, referrer, session_initial_url and session_initial_referrer fields, we provide a number of enrichments. For the sake of illustration, we’ll assume the following values:

field | value
url | https://www.example.com/article-1234?campaignid=1234#fragment
referrer | https://www.google.ca/
session_initial_url | https://www.example.com/article-1234?campaignid=1234#fragment
session_initial_referrer | https://www.google.ca/

On URL Parsing

Attributes added to parsed URLs such as: fragment, netloc, params, query and scheme adhere to RFC 1808.

name | description | example value
url_clean | Cleaned url (strip query/fragment) | https://www.example.com/article-1234
url_domain | url parsed domain, matched against TLD list | "example.com"
url_fragment | Fragment portion of url | "fragment"
url_netloc | Netloc portion of url | "www.example.com"
url_params | Params portion of url | ""
url_path | Path portion of url | "/article-1234"
url_query | Query portion of url | "campaignid=1234"
url_scheme | Scheme portion of url | "https"
ref_category | referrer category (traffic source categorization) | "search"
ref_clean | Clean referrer URL (strip query/fragment) | https://www.google.ca/
ref_domain | referrer parsed domain, matched against TLD list | "google.ca"
ref_fragment | Fragment portion of referrer | ""
ref_netloc | Netloc portion of referrer | "www.google.ca"
ref_params | Params portion of referrer | ""
ref_path | Path portion of referrer | "/"
ref_query | Query portion of referrer | ""
ref_scheme | Scheme portion of referrer | "https"
surl_clean | Cleaned session_initial_url (strip query/fragment) | https://www.example.com/article-1234
surl_domain | session_initial_url parsed domain, matched against TLD list | "example.com"
surl_fragment | Fragment portion of session_initial_url | "fragment"
surl_netloc | Netloc portion of session_initial_url | "www.example.com"
surl_params | Params portion of session_initial_url | ""
surl_path | Path portion of session_initial_url | "/article-1234"
surl_query | Query portion of session_initial_url | "campaignid=1234"
surl_scheme | Scheme portion of session_initial_url | "https"
sref_category | Session referrer category (traffic source categorization) | "search"
sref_clean | Clean session referrer URL (strip query/fragment) | https://www.google.ca/
sref_domain | Referrer parsed domain, matched against TLD list | "google.ca"
sref_fragment | Fragment portion of session_initial_referrer | ""
sref_netloc | Netloc portion of session_initial_referrer | "www.google.ca"
sref_params | Params portion of session_initial_referrer | ""
sref_path | Path portion of session_initial_referrer | "/"
sref_query | Query portion of session_initial_referrer | ""
sref_scheme | Scheme portion of session_initial_referrer | "https"
surl_utm_campaign | The utm_campaign specified in the session_initial_url | "subscriber_newsletter"
surl_utm_content | The utm_content specified in the session_initial_url | "template_a"
surl_utm_medium | The utm_medium specified in the session_initial_url | "email"
surl_utm_source | The utm_source specified in the session_initial_url | "newsletter_2016-06-01"
surl_utm_term | The utm_term specified in the session_initial_url | "footer"
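The scheme, netloc, path, params, query, and fragment attributes in the table match the components returned by Python's urllib.parse.urlparse, so the purely syntactic enrichments can be approximated as below. This is a sketch, not Parse.ly's implementation: the function name url_enrichments is hypothetical, and the _domain and _category fields are left out because they depend on a TLD list and a traffic-source categorization that are not reproduced here.

```python
from urllib.parse import urlparse, urlunparse


def url_enrichments(url: str, prefix: str = "url") -> dict:
    """Split a URL into the syntactic parts listed in the table above."""
    parts = urlparse(url)
    # "Clean" means the URL with its query string and fragment stripped.
    clean = urlunparse((parts.scheme, parts.netloc, parts.path, parts.params, "", ""))
    return {
        f"{prefix}_clean": clean,
        f"{prefix}_fragment": parts.fragment,
        f"{prefix}_netloc": parts.netloc,
        f"{prefix}_params": parts.params,
        f"{prefix}_path": parts.path,
        f"{prefix}_query": parts.query,
        f"{prefix}_scheme": parts.scheme,
    }


fields = url_enrichments("https://www.example.com/article-1234?campaignid=1234#fragment")
print(fields["url_clean"])  # https://www.example.com/article-1234
```

Passing prefix="ref", "surl", or "sref" produces the corresponding field names for the referrer and session URLs.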

Metadata

Whether metadata was crawled via JSON-LD or passed directly in pixels (as is the case in Parse.ly’s video integration), metadata associated with the url field is passed along in a series of metadata_ fields:

name | description | example value
metadata | Flag to indicate if metadata is available | true
metadata_authors | Array of authors for the post/video | ["Albert Einstein", "Richard Feynman"]
metadata_canonical_url | The canonical URL of a post, or in the case of videos, the video ID | http://www.example.com/article-1234
metadata_pub_date_tmsp | Publish date of the post in milliseconds since the UNIX epoch | 1471392000000
metadata_custom_metadata | String of optional custom metadata (for more information, see the integration docs) | "{\"internal_post_id\": \"2134\"}"
metadata_section | Section the post/video was published in | "Physics"
metadata_tags | Array of tags associated with the post/video | ["science", "physics", "quantum mechanics"]
metadata_title | Title of the post/video | "Thoughts on Quantum Electrodynamics"
metadata_image_url | URL to image for the post/video | https://www.evernote.com/l/AAFSrhKOoExCqKji3f9BS9YKfZEC-yerafgB/image.png
metadata_full_content_word_count | Word count of the post (irrelevant for videos) | 1562
metadata_data_source | How the metadata was collected, e.g. 'crawl', 'pixel', etc. | "crawl"
metadata_urls | The aliased URLs that the post lives on (e.g. Google AMP, http://m., main page) that reference the metadata_canonical_url | https://m.google.com/article
metadata_post_id | The post id of the article. This is the unique id of a post when the metadata exists | 99999
metadata_share_urls | The social share URLs of the post in a comma-separated list. Share links are from: Facebook, LinkedIn, Pinterest, and Twitter | ["http://example.com/post", "http://example.com/post", "http://example.com/post", "http://example.com/post", "http://example.com/post"]
metadata_page_type | Type of page (e.g. post, section, frontpage, etc.) | "post"
metadata_save_date_tmsp | Save date of the post in milliseconds (epoch format) | 1471392000000
metadata_thumb_url | The url of the thumbnail image for the post | https://images.example.com/imagelocation

UA and Device Enrichments

Based on the user_agent field, we enrich the following:

name | description | example value
ua_browser | Browser derived from UA | "Mobile Safari"
ua_browserversion | Browser version derived from UA | "9.1.2"
ua_devicebrand | Device Brand derived from UA | "Apple"
ua_devicemodel | Device Model derived from UA | "iPhone"
ua_devicetouchcapable | Flag to indicate if device is touch capable | true
ua_devicetype | Device Type (mobile/tablet/desktop) from UA | "mobile"
ua_os | Device Operating System from UA | "iOS"
ua_osversion | Device Operating System version from UA | "9.3"

We also provide information regarding the display of the device:

name | description | example value
display | Flag to indicate if display info is available | true
display_avail_height | available height of the display, in pixels (equivalent to JavaScript’s screen.availHeight property) | 877
display_avail_width | available width of the display, in pixels (equivalent to JavaScript’s screen.availWidth property) | 1436
display_pixel_depth | color resolution (in bits per pixel) | 24
display_total_height | total height of the display, in pixels | 900
display_total_width | total width of the display, in pixels | 1440
slot | flag to indicate if the slot position on the page is available | true

UTM Parameter Enrichments

Based on the url field, we enrich the following from its query parameters. Note that “UTM parameters” are a web-wide de facto standard for campaign tracking that was first introduced by Urchin and Google Analytics. Google runs a free tool called the URL builder to build URLs with this format, but many tools will automatically add these parameters to allow for easier tracking, especially in places where HTTP referrers are not automatically set.

In this example, we take the above article URL, http://mashable.com/1234, and we assume that it was clicked from an email newsletter. It might then have had query parameters like the following:

http://mashable.com/1234?utm_source=newsletter_2016-06-01&utm_medium=email&utm_term=footer&utm_content=template_a&utm_campaign=subscriber_newsletter

Which would be parsed as follows:

name | description | example value
campaign_id | Campaign identifier or name | "subscribers_email"
utm_campaign | Campaign identifier or name | "subscriber_newsletter"
utm_content | Template or style (e.g. for A/B tests) | "template_a"
utm_medium | Medium campaign ran on (e.g. email, social) | "email"
utm_source | The specific identifier for the source content | "newsletter_2016-06-01"
utm_term | A keyword or term associated with the click | "footer"
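The utm_ rows above can be reproduced from the example URL with Python's standard library, as sketched below. The function name utm_enrichments is hypothetical, returning None for absent parameters is an assumption, and campaign_id is omitted because this page does not say how it is derived.

```python
from urllib.parse import parse_qs, urlparse

UTM_KEYS = ("utm_campaign", "utm_content", "utm_medium", "utm_source", "utm_term")


def utm_enrichments(url: str) -> dict:
    """Extract the five standard UTM parameters from a URL's query string."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns a list per key; keep the first value, or None if absent.
    return {key: query[key][0] if key in query else None for key in UTM_KEYS}


example = (
    "http://mashable.com/1234?utm_source=newsletter_2016-06-01&utm_medium=email"
    "&utm_term=footer&utm_content=template_a&utm_campaign=subscriber_newsletter"
)
utm = utm_enrichments(example)
print(utm["utm_source"], utm["utm_medium"])
```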

UTM parameter tracking is powerful because it allows you to do grouping, rollup, and slice-and-dice of your campaigns, which often have associated costs and thus can be part of an ROI calculation. It also helps tremendously with decoding “direct” traffic; e.g. in many email service providers, the above click from an email newsletter would have no HTTP referrer set, and thus UTM parameters would be the only way to understand this traffic.

Extra Data

Arbitrary key-value pairs can be passed via Parse.ly’s dynamic tracking or our implementation for custom segments. Such custom data may include subscriber information, or IDs for use in joining to other data sources. In these situations, your key/value pairs will appear as a nested JSON object in the extra_data field.

As part of your own ETL, you can “flatten” these fields up into your root document format if you wish to include them in the downstream database where you store Parse.ly raw data.

  • "action": "_scroll"
  • "extra_data": {"_y": 1430}

In this example, a custom event, _scroll, was sent to our data pipeline, and it had associated custom data, {"_y": 1430}, which represents 1,430 pixels on the y-axis of scroll-depth within the browser. This kind of raw data could be used to implement scroll depth tracking.
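The flattening step mentioned earlier in this section might look like the following Python sketch. The function name flatten_extra_data and the "extra_" column prefix are illustrative choices, not part of the schema; the prefix is only there to avoid collisions with existing top-level keys.

```python
def flatten_extra_data(event: dict, prefix: str = "extra_") -> dict:
    """Copy an event, lifting extra_data's keys to the top level with a prefix."""
    flat = {key: value for key, value in event.items() if key != "extra_data"}
    # extra_data may be missing or null on events without custom data.
    for key, value in (event.get("extra_data") or {}).items():
        flat[prefix + key] = value
    return flat


row = flatten_extra_data({"action": "_scroll", "extra_data": {"_y": 1430}})
print(row)
```

Only one level is flattened here; nested objects inside extra_data would be kept as they are.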

Other Possibilities

This raw data schema is already quite rich and allows for quite a large number of queries that are not supported in Parse.ly’s dashboard or APIs. Nonetheless, you may want some help thinking through the possibilities of “what else” to store in your raw data events. For example:

  • subscriber identifiers, to do detailed loyalty analysis
  • more granular information about on-page or in-app activities
  • a specialized set of query parameters for social virality modeling
  • ad impression or revenue data
  • and anything else you can think up!

Next Steps

Read on for our Code Examples.

Or, get help from our team:

  • If you are already a Parse.ly customer, get in touch with us, and we’ll be happy to consult with you on advanced use cases for your raw data.
  • If you are not a Parse.ly customer, you’ll first need to go through our basic integration, but we are glad to schedule a demo where we can share some of the awesome things our existing customers have done with this unlimited flexibility.

Last updated: September 30, 2024