
Data Platform Engineering

From Wikitech


The infrastructure and services maintained by the Data Platform Engineering team support data producers and consumers in collecting, discovering, and using trustworthy data to derive insights, conduct research, and build new data products.

Documentation for users and administrators of Data Platform systems, including the Data Lake, analytics tools, Event Platform, Metrics Platform, and data pipelines.

List of teams in the Data Platform Engineering group, links to their documentation, and information about team processes, current projects, and roadmaps.

To contact us, please use the intake process.

About the Data Platform Engineering docs

Content structure

  • Path: Data_Platform/
    Intended contents: Documentation primarily meant for users of the Data Platform and its systems.
    Category tag: category:Data_platform
  • Path: Data_Platform/Systems
    Intended contents: Technical documentation focused on the administration and maintenance of the infrastructure, pipelines, components, and systems that make up the Data Platform. Category:Data_platform_systems is under Category:Data_platform.
    Category tag: category:Data_platform AND category:Data_platform_systems
  • Path: Data_Platform_Engineering/
    Intended contents: Team pages left over from the migration of the former Data_Engineering docs. Their content covers procedures and team processes, so it should be integrated into the relevant team or organizational pages on mediawiki.org (see phab:T367580 and phab:T364572).
    Category tag: varies
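The mapping from page path to category tags can be sketched as a small lookup. This is an illustrative sketch only, not an existing Wikitech tool: the function name `expected_categories` and the example title `Data_Platform/Systems/Example_system` are hypothetical, and the sketch assumes that page titles use underscores and that subpages are separated by `/`.

```python
from typing import Optional, Set


def expected_categories(title: str) -> Optional[Set[str]]:
    """Return the category tags suggested by the content structure above.

    Returns None for paths where the tags vary (for example,
    Data_Platform_Engineering/ pages) or that the structure does not cover.
    """
    systems_root = "Data_Platform/Systems"
    # Check the more specific prefix first: systems pages carry both tags.
    if title == systems_root or title.startswith(systems_root + "/"):
        return {"Data_platform", "Data_platform_systems"}
    # Any other page under Data_Platform/ carries the user-facing tag only.
    if title.startswith("Data_Platform/"):
        return {"Data_platform"}
    return None


# Hypothetical page titles, used only to illustrate the lookup.
print(expected_categories("Data_Platform/Systems/Example_system"))
print(expected_categories("Data_Platform/Data_Lake"))
print(expected_categories("Data_Platform_Engineering/Example_team_page"))
```

Checking the `Data_Platform/Systems` prefix first matters because those pages also match the broader `Data_Platform/` prefix.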

Page categories

Categories are easier to maintain and less disruptive to change than page structure. Categories also enable us to organize and navigate pages along multiple axes simultaneously, regardless of where the page is located in the content structure. Expand the section below to see the major categories that exist for these docs.

Disclaimer: this is an incomplete list! Other categories on Wikitech could be useful for some of these docs, and the docs may already be tagged in other categories not listed here.

Category list

Categories for types of data platform systems/components:

Categories for specific data platform systems/components:

Categories for topics covered by the data platform documentation:

Categories for types of documentation (in order of their usage within the data platform docs, from most-used to least-used):

Categories for maintaining / navigating the docs (see details in the previous section):

FAQs and guidelines for maintaining these docs

Find technical documentation guidance and templates at mw:Documentation.
Where to put decision records?
This may vary by product or project. The Data Platform Engineering teams already have multiple places where these docs may be living. Some are on Wikitech at Metrics_Platform/Decision_Records, or in pages under /Evaluations. Some are on mediawiki.org at Data_Platform_Engineering/Data_Products/Decision_Records. For systems or products that already have a decision record somewhere, it may be best to continue that pattern and keep things consistent, but you should link to your decision record location from other locations where people might look for it. If you publish on Wikitech, add Category:Decision_log. Tip: the technical documentation toolkit has a Decision log template.
Where to put evaluations or design docs?
This may vary by product or project. For systems or products that already have evaluations or design docs on Wikitech, it may be simplest to continue that pattern. For those that don't have any extant evaluations or design docs, it's up to you. Consider your primary audience, and put the documentation in the place they're most likely to look for it. Then, add cross-references to and from the other places where people might look for the documentation. If you publish on Wikitech, add the
Where to put project updates and product roadmaps?
In Phabricator, and/or with the team's pages on mediawiki.org. For example: Data_Products/work_focus.
Where to put metrics documentation?
This may vary by product or project. Consider your primary audience for the metrics dataset, and put the documentation in the place they're most likely to look for it. Then, add cross-references to and from the other places where people might look for the documentation. So, for example, Commons Impact Metrics documentation (of the dataset, not the project) may be published under Data_Platform/Data_Lake so that it's collocated with many other dataset documentation pages. More important than where the docs live is that you add links to those docs in DataHub, from the project pages, and anywhere else someone may be looking when they're in need of that information.

Keys to sustainable doc maintenance:

  • Apply categories to pages: this helps them remain discoverable through methods other than relying only on page structure / prefixing.
  • Be consistent about where you put docs related to a given product or component. Then, add cross-references between that place and all the other places people might expect to find the information (other wikis, DataHub, GitHub READMEs, etc.).
  • Avoid many levels of deep page nesting (more than 3 is probably too deep).
  • Docs on Wikitech should generally document how to use or administer a technology/system, not things about the team or org that maintains it. (Note: this pattern has not been followed consistently in the past, so the current state of docs on Wikitech doesn't always reflect it.)
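As a rough illustration of the nesting guideline above, the following sketch flags page titles that are nested more than three levels deep. It assumes that depth means the number of subpage levels below the root page, counted by `/` separators; that reading of the guideline, the names `nesting_depth` and `is_too_deep`, and the example titles are assumptions made for illustration.

```python
MAX_DEPTH = 3  # "more than 3 is probably too deep"


def nesting_depth(title: str) -> int:
    """Count subpage levels below the root page (assumed: one per '/')."""
    parts = [part for part in title.strip("/").split("/") if part]
    return max(len(parts) - 1, 0)


def is_too_deep(title: str, max_depth: int = MAX_DEPTH) -> bool:
    """Return True when a title is nested deeper than the guideline suggests."""
    return nesting_depth(title) > max_depth


# Hypothetical titles, used only to illustrate the check.
for page in ["Data_Platform/Data_Lake", "Root/A/B/C/D"]:
    print(page, nesting_depth(page), is_too_deep(page))
```

The threshold is a rule of thumb, so a check like this is better suited to surfacing pages for review than to blocking edits.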