Add more ops debugging tips

2025-08-16 06:24:12 +02:00 · 2023-06-13 14:10:54 -05:00 · 737184130f
commit 737184130f
parent 78a9c50561
1 changed files with 111 additions and 1 deletions
--- a/docs/operations/README.md
+++ b/docs/operations/README.md
@@ -1,5 +1,4 @@
 # Operations
 ========================
 Some basic information and setup steps are included in this README.
@@ -46,3 +45,114 @@ Your sandbox space should've been setup as part of the onboarding process. If th
 We are using the [WhiteNoise](http://whitenoise.evans.io/en/stable/index.html) plugin to serve our static assets on cloud.gov. This plugin is added to the `MIDDLEWARE` list in our app's `settings.py`.
 Note that it’s a good idea to run `collectstatic` locally or in the docker container before pushing files up to your sandbox. This is because `collectstatic` relies on timestamps when deciding whether to overwrite the existing assets in `/public`. Due to the way files are uploaded, the compiled css in the `/assets/css` folder on your sandbox will have a slightly earlier timestamp than the files in `/public/css`, and consequently running `collectstatic` on your sandbox will not update `public/css` as you may expect. For convenience, both the `deploy.sh` and `build.sh` scripts will take care of that.
 # Debugging
 Debugging errors observed in applications running on Cloud.gov requires being
 able to see the log information from the environment that the application is
 running in. There are (at least) three different ways to see that information:
 the Cloud.gov dashboard, the CloudFoundry CLI, and Cloud.gov Kibana logging
 queries. There are also SSH access into Cloud.gov containers and GitHub Actions,
 which can be used for specific tasks.
 ## Cloud.gov dashboard
 At <https://dashboard.fr.cloud.gov/applications> there is a list of all the
 applications that a Cloud.gov user has access to. Clicking on an application
 opens a page for that individual application, e.g.
 <https://dashboard.fr.cloud.gov/applications/2oBn9LBurIXUNpfmtZCQTCHnxUM/53b88024-1492-46aa-8fb6-1429bdb35f95/summary>.
 On that page is a left-hand link for "Log Stream", e.g.
 <https://dashboard.fr.cloud.gov/applications/2oBn9LBurIXUNpfmtZCQTCHnxUM/53b88024-1492-46aa-8fb6-1429bdb35f95/log-stream>.
 That log stream shows a stream of Cloud.gov log messages. Cloud.gov has
 different layers that log requests. One is `RTR`, which is the router within
 Cloud.gov. Messages from our Django app are prefixed with `APP/PROC/WEB`. While
 it is possible to search inside the browser for particular log messages, this
 is not a sophisticated interface for querying logs.
 ## CloudFoundry CLI
 When logged in with the CloudFoundry CLI (see
 [above](#authenticating-to-cloudgov-via-the-command-line)), CloudFoundry
 application logs can be viewed with the `cf logs <application>` command, where
 `<application>` is the name of the application in the currently targeted space.
 By default, `cf logs` starts a streaming view of log messages from the
 application. It appears to show the same information as the dashboard web
 application, but in the terminal. There is a `--recent` option that dumps
 recent log messages and exits rather than starting a stream of the present log
 messages, but that is also not a full log archive and search system.
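Because `--recent` prints a fixed batch of lines, its output can be passed through a small filter that keeps only the Django messages. The following is a minimal sketch and not part of the repository: the function name `filter_by_source` is hypothetical, and it assumes each log line carries its source in square brackets, as in `[APP/PROC/WEB/0]` or `[RTR/0]`.

```python
from typing import Iterable, Iterator


def filter_by_source(lines: Iterable[str], source: str = "APP/PROC/WEB") -> Iterator[str]:
    """Yield only the lines whose bracketed source tag starts with `source`.

    Assumes the source appears as "[<source>...]" somewhere in the line.
    """
    marker = "[" + source
    for line in lines:
        if marker in line:
            # Strip the trailing newline so callers can print or store the line.
            yield line.rstrip("\n")


# Made-up sample lines, for illustration only.
sample = [
    "2023-06-13T14:10:54.00-0500 [RTR/0] OUT example request line",
    "2023-06-13T14:10:54.10-0500 [APP/PROC/WEB/0] OUT example Django message",
]
for kept in filter_by_source(sample):
    print(kept)
```

To use it on real output, the text of `cf logs <application> --recent` could be captured to a file and its lines passed to `filter_by_source`; passing `"RTR"` as the second argument would keep the router messages instead.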
 CloudFoundry also offers a `run-task` command that can be used to run a single
 command in a separate, short-lived container built from the application's
 deployed code. For example, to run our Django admin command that loads test
 fixture data:
 ```
 cf run-task getgov-nmb --command "./manage.py load" --name fixtures
 ```
 However, this task runs asynchronously in the background without printing any
 command output to the terminal, so it can be hard to know whether the command
 has completed and, if so, whether it succeeded.
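One way to find out whether a task has finished is `cf tasks <application>`, which lists the application's tasks along with their state. The sketch below looks up the state of a named task in that listing. It assumes each task row starts with whitespace-separated `id`, `name`, and `state` columns in that order; the helper names `task_state` and `fetch_task_state` are hypothetical.

```python
import subprocess
from typing import Optional


def task_state(listing: str, task_name: str) -> Optional[str]:
    """Return the state of the first listed task named `task_name`, or None.

    Assumes each task row starts with "<id> <name> <state>"; this is an
    assumption about the CLI's output format, not a documented contract.
    """
    for row in listing.splitlines():
        fields = row.split()
        # Task rows begin with a numeric id; this skips headers and banners.
        if len(fields) >= 3 and fields[0].isdigit() and fields[1] == task_name:
            return fields[2]
    return None


def fetch_task_state(app: str, task_name: str) -> Optional[str]:
    """Run `cf tasks <app>` and look up one task (needs a logged-in cf CLI)."""
    result = subprocess.run(
        ["cf", "tasks", app], capture_output=True, text=True, check=True
    )
    return task_state(result.stdout, task_name)
```

For the example above, `fetch_task_state("getgov-nmb", "fixtures")` would be expected to return a state such as `RUNNING`, `SUCCEEDED`, or `FAILED`, or `None` if no task with that name is listed.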
 ## Cloud.gov Kibana
 Cloud.gov provides an instance of the log query program Kibana at
 <https://logs.fr.cloud.gov>. Kibana is powerful, but also complicated software
 that can take time to learn how to use most effectively. A few hints:
  - Set the timeframe of the display appropriately; the default is the last
    15 minutes, which may not show any results in some environments.
  - Kibana queries and filters can be used to narrow in on particular
    environments. Try the query `@source.type:APP` to focus on messages from the
    Django application or `@cf.app:"getgov-nmb"` to see results from a single
    environment.
 Currently, our application emits Python's default log format, which is textual
 and not record-based. In particular, tracebacks span multiple lines and show
 up in Kibana as multiple records that are not necessarily connected. As the
 application gets closer to production, we may want to switch to a JSON log
 format, in which each error is captured by Kibana as a single message, at the
 cost of a slightly harder developer experience when reading logs by eye.
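To illustrate what a JSON log format could look like, here is a minimal sketch using only Python's standard `logging` and `json` modules. It is not the project's configuration: the class name `JsonFormatter`, the field names, and the logger name `json_demo` are all hypothetical choices.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record, traceback included, as one line of JSON."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # The multi-line traceback becomes a single JSON string value,
            # with its newlines escaped by json.dumps.
            payload["traceback"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("json_demo")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
    1 / 0
except ZeroDivisionError:
    # Emitted as one JSON line instead of several plain-text lines.
    logger.exception("request failed")
```

In a Django project, a formatter like this would typically be referenced from the `LOGGING` setting rather than attached by hand. The trade-off described above still applies: the output is easier for Kibana to treat as one record but harder to read in a terminal.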
 ## SSH access
 The CloudFoundry CLI provides SSH access to the running container of an
 application. Use `cf ssh <application>` to SSH into the container. To make sure
 that your shell sees the same configuration as the running application, run
 `/tmp/lifecycle/shell` first, before anything else.
 Inside the container, the Python code should be in `/app`, and you can check
 there to see whether the expected version of the code is deployed in a
 particular file. There is no hot-reloading inside the container, so it isn't
 possible to make code changes there and see the results reflected in the
 running application. (Templates may be read directly from disk on every page
 load, so it is possible that you could change a page template and see the
 result in the application.)
 Inside the container, it can be useful to run various Django admin commands
 using `./manage.py`. For example, `./manage.py shell` opens a Python
 interpreter where code can be run to modify objects in the database, say to
 make a user an administrator.
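As a sketch of that last case, the helper below marks a user as an administrator. It assumes the project's user model has Django's standard `is_staff` and `is_superuser` fields, which holds for models based on `AbstractUser` but has not been checked against this codebase; the function name `make_admin` and the username in the comment are hypothetical.

```python
def make_admin(user):
    """Grant admin-site access and all permissions to `user`, then save it.

    Assumes the user object has `is_staff` and `is_superuser` attributes
    and a `save()` method, as Django's standard user models do.
    """
    user.is_staff = True
    user.is_superuser = True
    user.save()
    return user


# Inside `./manage.py shell` (the username below is hypothetical):
#   from django.contrib.auth import get_user_model
#   make_admin(get_user_model().objects.get(username="someone@example.gov"))
```

Since this writes directly to the environment's database, it is worth double-checking which application the SSH session is connected to before running it.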
 ## GitHub Actions
 In order to allow some ops activities by people who don't have the CloudFoundry
 CLI on a laptop, we have some ops-related actions under
 <https://github.com/cisagov/getgov/actions>.
 ### Migrate data
 This GitHub action runs Django's `manage.py migrate` command on the specified
 environment. **This is the first thing to try when fixing 500 errors from an
 application environment**. The migrations should be idempotent, so running the
 same migrations more than once should never cause an additional problem.
 ### Reset database
 Very occasionally, there are migrations that don't succeed when run against a
 database with data already in it. This action drops the database and re-creates
 it with the latest model schema. Once the application has launched, this action
 should never be used on the `stable` environment, but during development it may
 be useful on the various sandbox environments. After launch, fixing problems
 like this for some schema changes may require the involvement of a skilled DBA.