Data Pipeline: Commonly asked questions
How to query all data from one day in a specific timezone?
Querying for a single day of traffic for your local timezone usually requires loading data from multiple S3 buckets for the following reasons:
- Each S3 bucket name reflects UTC time
- Processing latency may delay the delivery of events so that they’re not received during the day they occurred
Typically, the events for a single day in your timezone should be confined to three S3 buckets: the day that matches your timezone day, the day before, and the day after. However, in rare cases, it is possible for events to be delivered up to 24 hours late. We recommend configuring your ETL process to continue importing these events and updating tables and data warehouses accordingly. This can be accomplished by using ts_action to represent event time instead of relying on the S3 bucket date.
Transforming data to reflect a specific timezone depends on the platform you are using. Amazon Athena uses the AT TIME ZONE syntax, and Google BigQuery has a timezone input parameter for transforming dates and timestamps.
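As a concrete illustration, the following query selects one local day of events using Athena's AT TIME ZONE syntax. This is a minimal sketch: parsely_events is a hypothetical table name, and ts_action is assumed to be stored as an ISO 8601 string, so adjust the table name, timestamp parsing, target date, and timezone to match how you registered the raw data.

```sql
-- Minimal sketch (Amazon Athena / Presto syntax); table name and timestamp
-- parsing are assumptions about your schema.
SELECT *
FROM parsely_events
WHERE date(from_iso8601_timestamp(ts_action) AT TIME ZONE 'America/New_York')
      = DATE '2024-10-01';
```

In practice you would also restrict the scan to the UTC day that matches your local day plus the adjacent days, rather than reading every S3 bucket. In BigQuery, the equivalent is to pass the timezone argument, for example DATE(your_timestamp_column, 'America/New_York').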
How to replicate a calculation from the Parse.ly dashboard?
The Data Pipeline provides a raw data feed of the same events that are used to power the Parse.ly dashboard. However, the dashboard calculations contain additional logic that must be replicated in your Data Pipeline queries to produce the same results. Here are some factors to consider when reconciling pageviews, engaged time, and videostarts (a query sketch follows the list):
- All data in the Parse.ly Dashboard is specified according to the local timezone configured in the dashboard preferences, while Data Pipeline data is always in UTC. The Data Pipeline data for a single day in your timezone will usually be found in multiple S3 buckets.
- Apple News data is excluded from Data Pipeline data by default. However, if you already have Apple News data in your Parse.ly dashboard and would like it included in your Data Pipeline data, please contact support@parsely.com and we will set that up for you.
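For example, a daily pageview count in the dashboard's local timezone is a reasonable starting point for comparison. This is a minimal sketch (Athena syntax): parsely_events is a hypothetical table name, ts_action is assumed to be an ISO 8601 string, and pageviews are assumed to be recorded with action = 'pageview'; adjust to your schema.

```sql
-- Minimal sketch: daily pageview counts in the dashboard's local timezone.
-- Table name, timestamp parsing, and the action value are assumptions.
SELECT date(from_iso8601_timestamp(ts_action) AT TIME ZONE 'America/New_York') AS local_day,
       count(*) AS pageviews
FROM parsely_events
WHERE action = 'pageview'
GROUP BY 1
ORDER BY 1;
```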
How to understand the expected latency for the standard Data Pipeline when loading events into S3?
The standard Parse.ly Data Pipeline writes events into S3 buckets in 15-minute chunks, with a maximum of 128 MiB per chunk. On rare occasions, latency of between 1 and 24 hours is possible.
Latency is a system-wide Parse.ly metric and is not specific to one customer. If we are experiencing system-wide latency with one of our tracking systems, then the Data Pipeline will also experience latency. Possible causes could include a massive traffic spike or server issues. For any real-time latency delays, which directly impact Data Pipeline latency, please see our status.parsely.com page or follow us on Twitter @parselysupport.
How to explain missing session data?
Session data is not available for all distributed channels. For detailed descriptions of the limitations of each channel, see the Channel FAQs. A summary is below, followed by a query sketch for checking session coverage in your own data:
- The following channels do not support sessions. You can find these in the channel field:
  - Accelerated Mobile Pages (channel = 'amp')
  - Apple News Real Time (channel = 'apln-rta')
- The following channels do support sessions (caveats noted where applicable):
  - Website: only if the user or browser has not blocked third-party cookies
  - Facebook Instant Articles (channel = 'fbia')
- The following integrations may or may not include session data, depending on how they were configured by the customer:
  - Mobile SDK and server-side integrations (see "How to define a Parse.ly session?" below)
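To see how this plays out in your own data, the sketch below (Athena syntax) reports how many events in each channel carry a session identifier. parsely_events is a hypothetical table name and session_id is assumed to be the session field in your schema; adjust both as needed.

```sql
-- Minimal sketch: events with and without session identifiers, by channel.
-- Table and column names are assumptions about your schema.
SELECT channel,
       count(*) AS events,
       count_if(session_id IS NOT NULL) AS events_with_session
FROM parsely_events
GROUP BY channel
ORDER BY events DESC;
```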
How to interpret the value in the engaged_time_inc column?
Engaged time is defined and handled differently for pages and videos, with details described in our engaged time documentation. In general, the engaged_time_inc field indicates how many seconds a user was active either:
- on the page, if action = 'heartbeat'
- watching a video, if action = 'vheartbeat'
To correctly sum or aggregate total engaged time, use the fields pageview_id, videostart_id, and pageload_id detailed in the ID fields documentation, as in the sketch below.
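The following minimal sketch (Athena syntax) sums on-page engaged time per pageview; the video equivalent would filter on action = 'vheartbeat' and group by videostart_id. parsely_events is a hypothetical table name; adjust to your schema.

```sql
-- Minimal sketch: total engaged seconds per pageview (pages only).
-- The table name parsely_events is an assumption.
SELECT pageview_id,
       sum(engaged_time_inc) AS engaged_seconds
FROM parsely_events
WHERE action = 'heartbeat'
GROUP BY pageview_id;
```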
How to explain why the visitor_site_id is empty?
Visitor data can be missing for any of the following reasons:
- The following channel does not support visitors. You can identify this traffic in the channel field or the url_domain field:
  - Apple News Real Time (channel = 'apln-rta')
- When a user or browser blocks cookies, the visitor_site_id field is null.
For detailed descriptions of the limitations of each channel, see the Channel FAQs.
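If you want to measure how much of your traffic is affected, a minimal sketch (Athena syntax) that counts events with and without a visitor ID, by channel, is shown below; parsely_events is a hypothetical table name.

```sql
-- Minimal sketch: events missing visitor_site_id, by channel.
-- The table name is an assumption; adjust to your schema.
SELECT channel,
       count(*) AS events,
       count_if(visitor_site_id IS NULL OR visitor_site_id = '') AS missing_visitor_id
FROM parsely_events
GROUP BY channel;
```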
How to define a Parse.ly session?
The Parse.ly JavaScript tracker uses a 30-minute expiry on the session cookie and imposes no maximum length of time for a session. The session cookie is stored on the host page’s first-party domain. This means that a device that repeatedly visits pages on the same Parse.ly-integrated domain with less than 30 minutes between each visit will create a session that lasts indefinitely. If a user returns to a browser that has been inactive for more than 30 minutes, that counts as a new session.
Note that sessions for server-side and mobile SDK integrations are up to the integrator as those integrations are not created or maintained by Parse.ly.
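If you need to reconstruct sessions yourself (for example, for a server-side integration), the 30-minute-inactivity rule can be approximated in SQL. The sketch below (Athena syntax) is an illustration of that rule, not Parse.ly's exact sessionization logic; parsely_events is a hypothetical table name and ts_action is assumed to be an ISO 8601 string.

```sql
-- Minimal sketch: assign session numbers per visitor using a 30-minute
-- inactivity gap. Table name and timestamp parsing are assumptions.
WITH events AS (
  SELECT visitor_site_id,
         from_iso8601_timestamp(ts_action) AS ts
  FROM parsely_events
  WHERE visitor_site_id IS NOT NULL
),
flagged AS (
  SELECT visitor_site_id,
         ts,
         CASE
           WHEN lag(ts) OVER (PARTITION BY visitor_site_id ORDER BY ts) IS NULL
             OR date_diff('minute',
                          lag(ts) OVER (PARTITION BY visitor_site_id ORDER BY ts),
                          ts) >= 30
           THEN 1 ELSE 0
         END AS new_session
  FROM events
)
SELECT visitor_site_id,
       ts,
       sum(new_session) OVER (PARTITION BY visitor_site_id ORDER BY ts) AS session_number
FROM flagged;
```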
How to define what data is excluded from the Data Pipeline?
Parse.ly automatically removes bot and spider traffic, customer-specified IP addresses to ignore, customer-specified domains to ignore, and incorrectly formatted event data. For more details, see our filtering documentation.
How to find the Parse.ly Data Pipeline data retention & archiving policy?
Parse.ly retains your data for 3 months inside the Data Pipeline AWS S3 bucket and separately for 13 months in the Parse.ly Dashboard. Both retention periods are specified in your contract.
Why is some of the geolocation data missing?
Per Parse.ly’s geolocation provider, MaxMind, “Not all IP addresses can be geolocated with enough specificity to return information about the subdivision, city, or metro code of an IP address.” (Source)
You may test some IP addresses here.
Last updated: October 23, 2024