TFS System Architecture

Some great resources are available elsewhere, but a lot of thought and research went into our design, so I thought I’d pass it along. Also, not all the information on the internet is accurate or clear.

Click here for a PDF version of this design.

Overview

In the diagram, Services are listed on the left, along with their DNS name (where applicable). The background boxes exist to show VLAN, data center, or security boundaries. One primary function of the illustration is to show data flow and port information.

The most important considerations were data center catastrophe and the malicious deployment of unapproved software. No one should be able to short-circuit the TFS build and deployment process, and TFS should still be available shortly after a data center disaster.

Security

Code from these TFS Projects handles millions of credit card numbers and billions of credit card transactions. The organization is held to the highest PCI (and ISO 2700x PII) standards.

The best way to prevent a malicious coder from trying to do something creative is to capture who made each change, fully review and approve the changes, and ensure that there is no other means of getting malicious code onto the servers.

The build output file shares are on dedicated servers. Firewall rules prevent Octopus Deploy from accessing any file shares other than these dedicated build output file shares. Permissions allow only the TFS Administrators and the build agent service accounts to write to these file shares (although any TFS user can read from them). Consequently, the only code that can be deployed comes from the TFS build shares, and only code generated by TFS ends up on those shares.
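As a rough illustration, a share like that could be audited with a minimal Python sketch along the lines below; the UNC path and account names are placeholders, and it assumes a Windows host where the icacls tool is available.

```python
# Sketch: list which accounts can write to a build output share and flag any
# that are not on the approved list. Path and account names are placeholders.
import subprocess

BUILD_SHARE = r"\\buildshare01\Drops"                      # placeholder UNC path
APPROVED_WRITERS = {r"DOMAIN\TFSBuildSvc", r"DOMAIN\TFS Administrators"}

def writable_accounts(path: str) -> set[str]:
    """Return accounts granted write (W), modify (M), or full (F) rights per icacls."""
    output = subprocess.run(["icacls", path],
                            capture_output=True, text=True, check=True).stdout
    writers = set()
    for line in output.splitlines():
        line = line.replace(path, "").strip()              # first line repeats the path
        if ":(" not in line:
            continue                                       # skip summary/blank lines
        account, _, rights = line.partition(":")
        if any(flag in rights for flag in ("(F)", "(M)", "(W)")):
            writers.add(account)
    return writers

if __name__ == "__main__":
    unexpected = writable_accounts(BUILD_SHARE) - APPROVED_WRITERS
    print("Unexpected write access:", sorted(unexpected) or "none")
```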

The Octopus Deploy environment has distinct Deployment Agent servers. Firewalls prevent each Agent server from accessing resources outside of its defined area. When an Octopus user is given access to an Environment, including its Agent server, he or she cannot initiate work outside that area of concern.
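For illustration, a small sketch like the following could confirm which deployment targets belong to each Environment; the server URL and API key are placeholders, and it assumes the standard Octopus Deploy REST API endpoints (/api/environments, /api/machines).

```python
# Sketch: list the deployment targets (Tentacle agents) registered in each
# Octopus environment, to confirm that agents stay within their boundary.
import requests

OCTOPUS_URL = "https://octopus.example.com"    # placeholder server URL
HEADERS = {"X-Octopus-ApiKey": "API-XXXXXXXX"} # placeholder API key

def get(path: str):
    response = requests.get(f"{OCTOPUS_URL}{path}", headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()

environments = {e["Id"]: e["Name"] for e in get("/api/environments/all")}
machines = get("/api/machines/all")

for env_id, env_name in environments.items():
    targets = [m["Name"] for m in machines if env_id in m.get("EnvironmentIds", [])]
    print(f"{env_name}: {', '.join(targets) or '(no targets)'}")
```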

Availability and Disaster Recovery

In the event of a data center disaster, we wanted the TFS environment to remain available, since it might be the key to quickly implementing any necessary business work-arounds. The DR plan includes a panoply of DNS failover, SQL Availability Groups, Distributed File Shares, SAN replication, and virtual machine migrations.

SQL Availability Groups

The two database servers in the local data center are in a synchronous SQL Availability Group. SQL 2014 has improved this functionality over SQL 2012. The Availability Group has proven to be a great solution; when the DBAs fail over between nodes, TFS responds so quickly that in most cases the users don't even notice.
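To illustrate, a quick check of replica roles and synchronization health might look like the sketch below (assuming pyodbc and the Microsoft ODBC driver are installed; the server name is a placeholder).

```python
# Sketch: query the Availability Group DMVs for each replica's mode, role,
# and synchronization health.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tfssql01.example.com;"             # placeholder server name
    "Trusted_Connection=yes;"
)

query = """
SELECT ar.replica_server_name,
       ar.availability_mode_desc,
       rs.role_desc,
       rs.synchronization_health_desc
FROM sys.availability_replicas AS ar
JOIN sys.dm_hadr_availability_replica_states AS rs
  ON ar.replica_id = rs.replica_id;
"""

for row in conn.cursor().execute(query):
    print(row.replica_server_name, row.availability_mode_desc,
          row.role_desc, row.synchronization_health_desc)
```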

The third SQL server is also part of the Availability Group, but it is an asynchronous mirror. Previously, our standard practice had been to alias all connections with an application-specific DNS name, so that migrating a database from one server to another required only a DNS change. However, when testing failover to this remote (different subnet) database server, we discovered that the DNS information was not being updated (even CNAME values). This was not an issue for the synchronous mirrored pairs because the IP address didn't change. After some thought, we decided that an Availability Group Listener was essentially serving the same purpose as the DNS record, and we didn't need both. So be warned: using a DNS record to resolve to a SQL Availability Group (even a CNAME) will cause failover problems across multiple subnets.
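For example, a client connection through the Listener might look like the sketch below; the listener and database names are placeholders, and it assumes the ODBC driver's MultiSubnetFailover keyword, which has the client try the listener's registered addresses in parallel so a cross-subnet failover doesn't hang on a stale address.

```python
# Sketch: connect through the Availability Group listener rather than a
# per-server DNS alias, with multi-subnet failover enabled.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=tfs-ag-listener.example.com;"      # AG listener, not a server alias
    "DATABASE=Tfs_Configuration;"              # placeholder database name
    "MultiSubnetFailover=Yes;"
    "Trusted_Connection=yes;"
)
# Show which replica actually answered the connection.
print(conn.cursor().execute("SELECT @@SERVERNAME").fetchone()[0])
```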

DNS

Globally resolving the TFS name is handled by an F5 Global Traffic Manager. The TFS application server in the remote data center is always online (so it can be patched and validated), connecting back to the database in the primary data center. The GTM ignores this web server, but if the load balancer in the primary data center were to report a system unavailability, the GTM would automatically direct all traffic to the web server in the remote data center.
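As a simple illustration, the following sketch checks which data center the GTM-managed name currently resolves to; the hostname and VIP addresses are placeholders.

```python
# Sketch: resolve the GTM-managed TFS name and report which data center's
# virtual IP it currently points at.
import socket

TFS_NAME = "tfs.example.com"                   # placeholder GTM-managed name
PRIMARY_VIPS = {"10.1.10.50"}                  # placeholder primary-DC addresses
DR_VIPS = {"10.2.10.50"}                       # placeholder remote-DC addresses

resolved = {info[4][0] for info in socket.getaddrinfo(TFS_NAME, 443)}
if resolved & PRIMARY_VIPS:
    print("TFS is resolving to the primary data center:", resolved)
elif resolved & DR_VIPS:
    print("TFS has failed over to the remote data center:", resolved)
else:
    print("Unexpected resolution:", resolved)
```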

Once the GTM fails over to the remote data center, it is incumbent upon the DBAs to manually fail over the database shortly afterward.
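Because the remote replica is asynchronous, that manual step is a forced failover. A hedged sketch of what the DBAs might run against the DR replica follows; the Availability Group and server names are placeholders.

```python
# Sketch: force failover of the Availability Group onto the asynchronous DR
# replica. Run against the DR replica itself; any unsynchronized transactions
# are lost, which is why this is a deliberate, manual DBA step.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqldr01.example.com;"              # placeholder DR replica name
    "Trusted_Connection=yes;",
    autocommit=True,
)
conn.cursor().execute(
    "ALTER AVAILABILITY GROUP [TfsAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;"
)
```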

Distributed File Shares

This technology is used to ensure that we have off-site backups of all of our builds. We also use it to cache binaries for developers and testers who are regionally separated from their data center.

Virtual Machine Migration

Most of the “stateless” servers (web servers, controllers and build servers, etc.) can be rehydrated in the remote data center using the standard VM technology stack.

Licensing

Unfortunately, much of this implementation requires SQL Enterprise Edition, which is really expensive. A single SQL Standard edition license is included with your TFS license, but you will pay handsomely for the Enterprise licenses.

The SQL Availability Groups are a SQL Enterprise feature.

Also, the design shows a load-balanced set of Reporting Services servers. If I were doing this again, I would not do that. Load-balancing Reporting Services is also a SQL Enterprise Edition-only feature. The truth is that not many of our users even use the Reporting Services reports, and it is simply not a critical function. With virtualized instances, we would be able to quickly recover from hardware issues. The benefit of load-balancing the SQL Reporting Services server is not worth the expense in our environment.

There was a lot of initial interest in the Kanban reports, but they use a Tabular data model, another SQL Enterprise Edition-only feature. That interest has since fallen off, and it is probably not worth the expense of a SQL Enterprise license.

So, in the end, the only SQL instance that uses Standard Edition is the Analysis Services instance.