Types of Data Sources
posted on 03 January 2014 by stephen compston
Gnip provides access to two general types of data sources through its APIs – Complete Access Sources, and Public API Access Sources. The differences in these types of access dictate what characteristics to expect from the data, and how to integrate it into your app.
Complete Access Sources
Complete Access sources are built on a source’s full firehose of public data. These sources give Gnip access to their full firehoses through formal partnerships, and Gnip integrates that data into the APIs your client application connects to. These APIs are able to provide full fidelity: complete access to the data you need from the full firehose.
Gnip APIs which allow you to integrate with Complete Access Sources include:
- Firehose Streams
- Search API
- Historical PowerTrack
- Rehydration API
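Complete Access sources offer many advantages over Public API Access sources. Three key advantages are full coverage of data, the ability to precisely define the data you need, and the option of low latency delivery.
First, Complete Access sources offer full coverage of data. Because Gnip’s APIs for these sources are built on the full firehose, the data you need is available in full, rather than limited to what the rate limits and volume limits of a public API allow.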
Second, your app can define a set of data to be delivered with a Complete Access source that is much more precise than what is possible via a public API. Where public APIs offer at most a few basic filtering options (e.g. keywords in activity text), Gnip’s APIs enable very precise filtering on the metadata associated with social activities. Products like PowerTrack, Search API, and Historical PowerTrack provide complex filtering logic that ensures you only get the data you need.
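The difference between keyword-only filtering and filtering on metadata can be sketched in a few lines of Python. This is only an illustration of the idea: the activity fields (`body`, `lang`, `country`) and the helper functions are hypothetical, and the code is not PowerTrack rule syntax or any part of a Gnip API.

```python
from typing import Callable

# An "activity" here is just a dict; the field names are made up for illustration.
Activity = dict
Rule = Callable[[Activity], bool]


def keyword(word: str) -> Rule:
    """Match activities whose body contains the word (case-insensitive)."""
    return lambda a: word.lower() in a["body"].lower().split()


def field_equals(field: str, value: str) -> Rule:
    """Match activities whose metadata field has exactly this value."""
    return lambda a: a.get(field) == value


def all_of(*rules: Rule) -> Rule:
    """Boolean AND across rules."""
    return lambda a: all(rule(a) for rule in rules)


def none_of(*rules: Rule) -> Rule:
    """Boolean NOT: exclude activities matching any of the rules."""
    return lambda a: not any(rule(a) for rule in rules)


activities = [
    {"body": "Great coffee downtown", "lang": "en", "country": "US"},
    {"body": "Coffee is overrated", "lang": "en", "country": "GB"},
    {"body": "Gran cafe, coffee perfecto", "lang": "es", "country": "US"},
]

# Keyword-only filtering, roughly what a basic public API search offers.
basic = keyword("coffee")
basic_matches = [a for a in activities if basic(a)]

# Keyword plus metadata conditions and an exclusion.
precise = all_of(
    keyword("coffee"),
    field_equals("lang", "en"),
    field_equals("country", "US"),
    none_of(keyword("overrated")),
)
precise_matches = [a for a in activities if precise(a)]

print(len(basic_matches), len(precise_matches))
```

In this toy data set the keyword filter matches all three activities, while the combined rule matches only the first one.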
Third, Complete Access sources offer the option of very low latency at scale. Because Gnip consumes Complete Access sources in real time, our APIs can pass that data on to your app very quickly. PowerTrack and Firehose streams can deliver high volumes of data with low latency as it is created. Additionally, Search API offers low-latency delivery of recent data for specific queries, with data available very quickly after it is created on the source platform.
Public API Access Sources
Public API access sources are social data publishers (e.g. Facebook) which make an API available to the public for retrieval of various types of data.
Each Public API is unique, but the following are some of the common characteristics that define them:
- Endpoints - Each public API has different endpoints. An endpoint provides a method for retrieving a specific type of data, with specific limits and parameters. For example, an endpoint might provide a way to search for “posts” and “comments” by keyword on the platform, or another might provide a means for getting “video upload” activities from a specific user.
- Query Parameters - What types of queries can be made? For example, on a “keyword search” endpoint, is the query limited to simple keywords, or does it also support more complex filtering like exact phrase matching, the ability to exclude specific keywords from results, or boolean logic in queries?
- Request Protocols - How does your app get the data from the API? Generally, this is either a “pull” method, in which an app sends periodic requests for batches of data, or a “push” method, in which the API proactively sends new data to the app on its own, either in batches or via a streaming connection.
- Rate limits - Rate limits answer the question, “how often can your app request data from the API?” This is most pertinent in pull-model APIs, and is a main reason (along with volume limits) why full coverage of data is not available from public APIs. Rate limits are also a primary cause of latency between the time an activity is created on a platform and the time it is collected by your app.
- Volume limits - Volume limits restrict how much data you can get from the API with a single request, and are another contributing factor in the inability to get full coverage from public APIs.
- Data format - The Public APIs use a variety of data formats (e.g. XML, JSON), and each has unique fields within the metadata that are not consistent from source to source.
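To see how rate limits and volume limits combine to cap coverage and add latency in a pull-model API, a rough calculation helps. The sketch below uses invented numbers (15 requests per 15-minute window, 100 results per request) and hypothetical names; it does not describe the limits of any particular public API.

```python
from dataclasses import dataclass


@dataclass
class PullLimits:
    """Hypothetical limits for one pull-model endpoint."""
    requests_per_window: int       # rate limit: requests allowed per window
    window_seconds: int            # length of the rate-limit window
    max_results_per_request: int   # volume limit: activities per response


def max_activities_per_hour(limits: PullLimits) -> int:
    """Upper bound on how many activities an app can collect in one hour."""
    windows_per_hour = 3600 / limits.window_seconds
    return int(windows_per_hour
               * limits.requests_per_window
               * limits.max_results_per_request)


def min_poll_interval_seconds(limits: PullLimits) -> float:
    """Shortest steady polling interval that stays within the rate limit.

    A new activity can wait up to this long before the next request sees it.
    """
    return limits.window_seconds / limits.requests_per_window


def coverage_ratio(limits: PullLimits, created_per_hour: int) -> float:
    """Best-case fraction of matching activities that can be collected."""
    if created_per_hour <= 0:
        return 1.0
    return min(1.0, max_activities_per_hour(limits) / created_per_hour)


# Assumed example limits, chosen only to make the arithmetic easy to follow.
limits = PullLimits(requests_per_window=15,
                    window_seconds=900,
                    max_results_per_request=100)

print(max_activities_per_hour(limits))    # ceiling on activities per hour
print(min_poll_interval_seconds(limits))  # seconds between polls
print(coverage_ratio(limits, 50_000))     # best-case coverage of a busy topic
```

With these assumed limits the app can collect at most 6,000 activities per hour and poll no more than once a minute, so a topic producing 50,000 matching activities per hour could be covered at 12% at best.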
As described above, each public API is unique in a variety of ways, each of which defines the types of data and the level of coverage available.
Gnip’s Data Collector product provides a more normalized way to integrate public API sources than building a direct integration with each one. Rather than writing a custom client for each public API you want to integrate into your app, your app integrates with the Data Collector API, which makes requests to the public APIs on your behalf and passes the data through in a normalized format and structure, using the delivery protocol of your choice (“push” or “pull” model). Data Collector has already integrated with the sources’ public APIs, and handles their different rate limits, volume limits, and performance characteristics, as well as ongoing maintenance for any changes the public APIs make in the future.
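As a rough picture of what normalization means here, the sketch below maps a JSON payload and an XML payload from two imaginary sources onto one common structure. The payload shapes, field names, and output keys are all assumptions made for the example; they are not Data Collector's actual format or any real source's schema.

```python
import json
import xml.etree.ElementTree as ET


def normalize_json_post(raw: str) -> dict:
    """Map a hypothetical JSON 'post' payload onto the common structure."""
    doc = json.loads(raw)
    return {
        "id": str(doc["post_id"]),
        "author": doc["user"]["name"],
        "body": doc["message"],
        "posted_at": doc["created"],
    }


def normalize_xml_comment(raw: str) -> dict:
    """Map a hypothetical XML 'comment' payload onto the same structure."""
    root = ET.fromstring(raw)
    return {
        "id": root.attrib["id"],
        "author": root.findtext("author"),
        "body": root.findtext("text"),
        "posted_at": root.findtext("published"),
    }


# Two made-up payloads, one per imaginary source.
json_raw = (
    '{"post_id": 42, "user": {"name": "alice"}, '
    '"message": "hello", "created": "2014-01-03T12:00:00Z"}'
)
xml_raw = (
    '<comment id="c7"><author>bob</author><text>hi there</text>'
    '<published>2014-01-03T12:05:00Z</published></comment>'
)

normalized = [normalize_json_post(json_raw), normalize_xml_comment(xml_raw)]

# Downstream code can now treat both sources identically.
for activity in normalized:
    print(activity["id"], activity["author"], activity["body"])
```

Without a normalizing layer, each of these mapping functions (plus the request handling for each source) would have to be written and maintained inside your own app.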
Data collection is performed in accordance with the terms and requirements each API sets for data retrieval, and provides the same level of access that you would have with a direct integration. Data Collector does not provide increased access to the data provided by these sources, but it is set up to get as much data as possible within the restrictions put in place by the API, so that you have a reliable source of data to build into your app.