Skip to main content

Pysa Overview

Pyre has applications beyond type checking python code: it can also run static analysis, more specifically called Taint Analysis, to identify potential security issues. The Python Static Analyzer feature of Pyre is usually abbreviated to Pysa (pronounced like the Leaning Tower of Pisa).

Taint Analysis​

Tainted data is data that must be treated carefully. Pysa works by tracking flows of data from where they originate (sources) to where they terminate in a dangerous location (sinks). For example, we might use it to track flows where user-controllable request data flows into an eval call, leading to a remote code execution vulnerability. This analysis is made possible by user-created models which provide annotations on source code, as well as rules that define which sources are dangerous for which sinks. Pysa comes with many pre-written models and rules for builtin and common python libraries.

Pysa propagates taint as operations are performed on tainted data. For example, if we start with a tainted integer and perform a number of operations on it, the end results will still be tainted:

x = some_function_that_returns_a_tainted_value() # 'x' is marked as tainted
y = x + 10
s = str(x)
f = f"Value = {s}" # 'f' is marked with the same taint 'x' had

Pysa will only analyze the code in the repo that it runs on, as well as code in directories listed in the search_path of your .pyre_configuration file. It does not see the source of your dependencies. Just because you can see code in your editor does not mean Pysa has access to that code during analysis. Because of this limitation, Pysa makes some simplifying assumptions. If taint flows into a function Pysa doesn't have the source for, it will assume that the return type of that function has the same taint. This helps prevents false negatives, but can also lead to false positives.

When an object is tainted, that means that all attributes of that object are also tainted. Note that this is another source of potential false positives, such as taint flows that include some_obj.__class__. This means that Pysa will detect all of the following flows:

x = some_source() # 'x' is marked as tainted

some_sink(x) # This is detected
some_sink(x.some_attribute) # This is also detected
some_sink(x.__class__) # This is (unfortunately) also detected

Pysa Configuration​

Pysa uses two types of files for configuration: a single taint.config file, and an unlimited number of files with a .pysa extension. The taint.config file is a JSON document which stores definitions for sources, sinks, features, and rules (discussed below). The .pysa files are model files (also discussed below) which annotate your code with the sources, sinks, and features defined in your taint.config file. Examples of these files can be found in the Pyre repository.

These files live in the directory configured by taint_models_path in your .pyre_configuration file. Any .pysa file found in this folder will be parsed by Pysa and the models will be used during the analysis.

Sources​

Sources are where tainted data originates. They are declared in your taint.config file like this:

"sources": [
{
"name": "Cookies",
"comment": "used to annotate cookie sources"
}
]

Models that indicate what is a source are then defined in .pysa files. Sources are declared with the same syntax as type annotations in Python 3. Function return types, class/model attributes, and even entire classes can be declared as sources by adding TaintSource[SOURCE_NAME] wherever you would add a python type:

# Function return source
def django.http.request.HttpRequest.get_signed_cookie(
self,
key,
default=...,
salt=...,
max_age=...
) -> TaintSource[Cookies]: ...

# Class attribute source:
django.http.request.HttpRequest.COOKIES: TaintSource[Cookies]

When tainting an entire class, any return from a method or access of an attribute of the class will count as a returning tainted data. The specifics of these model files are discussed further in the Models section.

# Class source:
class BaseException(TaintSource[Exception]): ...

When tainting indexable return types such as Dicts, Lists, and Tuples, the ReturnPath syntax can be used to only mark a portion of the return type as tainted:

def applies_to_index.only_applies_to_nested() -> TaintSource[Test, ReturnPath[_[0][1]]]: ...
def applies_to_index.only_applies_to_a_key() -> TaintSource[Test, ReturnPath[_["a"]]]: ...

Note that ReturnPath syntax can also be applied to fields of classes and globals, which can be particularly helpful when annotating dictionaries.

# Source file: a.py
class C:
dictionary_field = {"text": "will_be_tainted"}

# Model file: models.pysa
a.C.dictionary_field: TaintSource[Test, ReturnPath[_["text"]]]

See Parameter and Return Path for additional information.

Sinks​

Sinks are where tainted data terminates. They are declared in your taint.config file like this:

"sinks": [
{
"name": "SQL",
"comment": "use to annotate places of SQL injection risk"
}
]

Models that indicate what is a sink are then defined in .pysa files. Sinks can be added to the same files as sources. Like sources, sinks are declared with the same syntax as type annotations in Python 3. Function parameters, class attributes, and even whole classes can be declared as sinks by adding TaintSink[SINK_NAME] where you would add a python type:

# Function parameter sink
def sqlite3.dbapi2.Cursor.execute(self, sql: TaintSink[SQL], parameters): ...

# Attribute sink
file_name.ClassName.attribute_name: TaintSink[RemoteCodeExecution]

When tainting an entire class, any flow into a method or attribute of the class will count as a flow to a taint sink. The specifics of these model files are discussed further in the Models section.

# Entire class sink
class BaseException(TaintSink[Logging]): ...

Implicit Sinks​

Implicit sinks are program expressions that we want to act as sinks, but that cannot be specified via taint signatures in .pysa files. Currently, only conditional tests are supported as implicit sinks. This allows writing rules that track whether a particular source is used in a conditional test expression.

"implicit_sinks": {
"conditional_test": [ <your kind> ]
}

Rules​

Rules declare which flows from source to sink we are concerned about. They are declared in your taint.config file like this:

"rules": [
{
"name": "SQL injection.",
"code": 1,
"sources": [ "UserControlled" ],
"sinks": [ "SQL" ],
"message_format": "Data from [{$sources}] source(s) may reach [{$sinks}] sink(s)"
}
]

Each rule needs a brief name that explains its purpose and a unique code. The rule must define a list of one or more sources, which we are concerned about flowing into one or more sinks. message_format can further explain the issue. When a flow is detected the {$sources} and {$sinks} variables will be replaced with the name of the specific source(s) and sink(s) that were involved in the detected flow.

Sanitizers​

Sanitizers break a taint flow by removing taint from data. Models that indicate sanitizing functions are defined in .pysa files. Sanitizers can be added to the same files as sources and sinks. Functions are declared as sanitizers by adding a special decorator:

# This will remove any taint passing through a function, regardless of whether
# it is a taint source returned by this function, taint reaching sinks within
# the function via 'text', or taint propagateing through 'text' to the
# return value.
@Sanitize
def django.utils.html.escape(text): ...

This annotation is useful in the case of explicit sanitizers such as escape, which helps prevent cross site scripting (XSS) by escaping HTML characters. The annotation is also useful, however, in cases where a function is not intended to sanitize inputs, but is known to always return safe data despite touching tainted data. One such example could be hmac.digest(key, msg, digest), which returns sufficiently unpredictable data that the output should no longer be considered attacker-controlled after passing through.

Sanitizers can also be scoped to only remove taint returned by a function, passing through a specific argument, or passing through all arguments.

# This will remove any taint returned by this function, but allow taint
# to be passed in to the function via 'argument'. It also prevents taint
# from propagating from any argument to the return value.
def module.sanitize_return(argument) -> Sanitize: ...

# This prevents any taint which passes through 'argument' from reaching a sink within
# the function, but allows taint which originates within the function to be returned.
def module.sanitize_parameter(argument: Sanitize): ...

# This prevents any taint which passes through any parameter from entering the function,
# but allows taint which originates within the function to be returned. It also prevents
# taint from propagating from any argument to the return value.
@Sanitize(Parameters)
def module.sanitize_all_parameters(): ...

# This will remove any taint which propagates through any argument to the return
# value, but allow taint sources to be returned from the function as well as
# allow taint to reach sinks within the function via any argument.
@Sanitize(TaintInTaintOut)
def module.sanitize_tito(a, b, c): ...

# Same as before, but only for parameter 'b'
def module.sanitize_tito_b(a, b: Sanitize[TaintInTaintOut], c): ...

Pysa also supports only sanitizing specific sources or sinks to ensure that the sanitizers used for a rule don't have adverse effects on other rules. The syntax used is identical to how taint sources and sinks are specified normally:

# Sanitizes only the `UserControlled` source kind.
def module.return_not_user_controlled() -> Sanitize[TaintSource[UserControlled]]: ...

# Sanitizes both the `SQL` and `Logging` sinks.
def module.sanitizes_sql_and_logging_sinks(
flows_to_sql: Sanitize[TaintSink[SQL]],
logged_parameter: Sanitize[TaintSink[Logging]],
): ...

For taint-in-taint-out (TITO) sanitizers, Pysa supports only sanitizing specific sources and sinks through TITO:

# With this annotation, whenever `escape(data)` is called, the UserControlled taint of `data`
# will be sanitized, whereas other taint that might be present on `data` will be preserved.
@Sanitize(TaintInTaintOut[TaintSource[UserControlled]])
def django.utils.html.escape(text): ...

@Sanitize(TaintInTaintOut[TaintSink[SQL, Logging]])
def module.sanitize_for_logging_and_sql(): ...

Note that you can use any combination of annotations, i.e sanitizing specific sources or specific sinks, on the return value, a specific parameter or all parameters:

def sanitize.sanitize_return() -> Sanitize: ...

def sanitize.sanitize_return_no_user_controlled() -> Sanitize[TaintSource[UserControlled]]: ...

def sanitize.sanitize_return_no_sql() -> Sanitize[TaintSink[SQL]]: ...
def sanitize.sanitize_parameter(x: Sanitize): ...

def sanitize.sanitize_parameter_all_tito(x: Sanitize[TaintInTaintOut]): ...

def sanitize.sanitize_parameter_no_user_controlled(x: Sanitize[TaintSource[UserControlled]]): ...

def sanitize.sanitize_parameter_no_sql(x: Sanitize[TaintSink[SQL]]): ...

def sanitize.sanitize_parameter_no_rce(x: Sanitize[TaintSink[RemoteCodeExecution]]): ...

def sanitize.sanitize_parameter_no_user_controlled_tito(x: Sanitize[TaintInTaintOut[TaintSource[UserControlled]]]): ...

def sanitize.sanitize_parameter_no_sql_tito(x: Sanitize[TaintInTaintOut[TaintSink[SQL]]]): ...
@Sanitize(Parameters)
def sanitize.sanitize_all_parameters(): ...

@Sanitize(Parameters[TaintInTaintOut])
def sanitize.sanitize_all_parameters_all_tito(): ...

@Sanitize(Parameters[TaintSource[UserControlled]])
def sanitize.sanitize_all_parameters_no_user_controlled(): ...

@Sanitize(Parameters[TaintSink[SQL]])
def sanitize.sanitize_all_parameters_no_sql(): ...

@Sanitize(Parameters[TaintSink[RemoteCodeExecution]])
def sanitize.sanitize_all_parameters_no_rce(): ...

@Sanitize(Parameters[TaintInTaintOut[TaintSource[UserControlled]]])
def sanitize.sanitize_all_parameters_no_user_controlled_tito(): ...

@Sanitize(Parameters[TaintInTaintOut[TaintSink[SQL]]])
def sanitize.sanitize_all_parameters_no_sql_tito(): ...

@Sanitize(Parameters[TaintInTaintOut[TaintSource[Cookies], TaintSink[SQL]]])
def sanitize.sanitize_all_parameters_no_cookies_sql_tito(): ...

Attributes can also be marked as sanitizers to remove all taint passing through them:

django.http.request.HttpRequest.GET: Sanitize

Sanitizing specific sources and sinks can also be used with attributes:

def module.Node.id: Sanitize[TaintSource[UserSecrets]] = ...
def module.Node.id: Sanitize[TaintSink[Logging]] = ...

Note that sanitizers come with the risk of losing legitimate taint flows. They remove all taint and aren't restricted to a specific rule or individual source to sink flows. This means you need to ensure you aren't potentially affecting other flows when you add a sanitizer for a flow you care about. For this reason, some of the above sanitizer examples might not be a good idea to use. For example, if you are trying to track flows where SQL injection occurs, the escape sanitizer removing all taint kinds would prevent you from seeing any flows where data going into your SQL query happened to be HTML escaped. The best practice with sanitizers, then, is to make them as specific as possible. It's recommended to sanitize specific sources and sinks over using the general @Sanitize, -> Sanitize or : Sanitize annotations.

TITO Sanitizers vs Source/Sink Sanitizers​

Source/Sink sanitizers are used to sanitize functions belonging to the source/sink trace. Example

def render_string_safe(string: str):
safe_strings_list = ["safe", "string", "list"]
if string in safe_strings_list:
return render(string)

def render_input_view(request: HttpRequest):
user_input = request.GET["user_input"]
return render_string_safe(user_input)

Without any sanitizer this code would raise a pysa issue since the UserControlled input is flowing into the render function (imagining that the render function is an XSS sink). To avoid this we can create a model:

def render_string_safe(string: Sanitize[TaintSink[XSS]]): ...

This will instruct pysa to remove the XSS taint on the string parameter in this way even if we have a XSS sink (render) inside the render_string_safe function we will not trigger an issue.

TITO Sanitizers instead are used to remove the taint when tainted value is flowing into (TaintIn) a function as a parameter and then it is returned (TaintOut) by the same function.

def sanitize_string(string: str):
return re.sub('[^0-9a-z]+', '*', string)

def render_input_view(request: HttpRequest):
user_input = request.GET["user_input"]
safe_str = sanitize_string(user_input)
return render(safe_str)

Like in the example before this code would generate a Pysa XSS issue. To avoid this we can create a model:

def sanitize_string(string: Sanitize[TaintInTaintOut[TaintSink[XSS]]]): ...

This will instruct pysa to remove the XSS taint from the value returned by the sanitize_string when a tainted value is passed as string parameter to the sanitize_string function.

Taint Propagation​

Sometimes, Pysa is unable to infer that tainted data provided as an argument to a function will be returned by that function. In such cases, Pysa models can be annotated with TaintInTaintOut[LocalReturn] to encode this information for the analysis. This annotation can be applied to any parameter, including self, and is useful in scenarios such as when retrieving a value from a collection containting tainted data:

# This tells Pysa that if a 'dict' contains tainted data, the result
# of calling 'get' on that dict will also contain tainted data
def dict.get(self: TaintInTaintOut[LocalReturn], key, default): ...

Note that TaintInTaintOut (ie. without square brackets) is also accepted and can be used as a short hand for TaintInTaintOut[LocalReturn]. LocalReturn is ony ever required when using the Updates syntax below and wanting to preserve the LocalReturn behaviour.

For performance reasons, Pysa does not keep track of when functions place taint into their parameters, such as when a function adds a tainted entry to a list it received (with some notable exceptions for taint assigned to self in a constructor or property). The TaintInTaintOut[Updates[PARAMETER]] annotation can be used to work around Pysa's limitations by telling Pysa that taint will flow the the annotated parameter into the parameter named PARAMETER:

# This tells Pysa that if 'dict.update' is called with tainted data,
# then the 'self' object (ie. the dictionary itself) should then be
# considered tainted.
def dict.update(self, __m: TaintInTaintOut[Updates[self]]): ...

Note that constructors and property setters are treated as if they were returning self. This means you should use LocalReturn instead of Updates[self] when writing models those. For instance:

def MyClass.__init__(self, argument: TaintInTaintOut[LocalReturn]): ...
@foo.setter
def MyClass.foo(self, value: TaintInTaintOut[LocalReturn]): ...

Feature annotations may also be placed inside the [] blocks of TaintInTaintOut[...] annotations.

Features​

Feature annotations are also placed in your taint.config and .pysa files. This is a larger topic and will be covered in detail on its own page.

Model files​

Usage​

By default, Pysa computes an inferred model for each function and combines it with any declared models in .pysa files (of which there can be more than one). The union of these models and their annotations will be used. For example, cookies are both user controlled and potentially sensitive to log, and Pysa allows us apply two different annotations to them:

django.http.request.HttpRequest.COOKIES: TaintSource[UserControlled]
django.http.request.HttpRequest.COOKIES: TaintSource[Cookies]

Requirements and Features​

Fully qualified names​

Any declarations in .pysa files must use the fully qualified name for the function/attribute they are attempting to annotate. You can usually find the fully qualified name for a type by looking at how it is imported, however, it's important to note that fully qualified names correspond to where something is declared, not necessarily where it is imported from. For example, you can import HttpRequest directly from the django.http module, even though it is defined in django.http.request. If you wanted to taint an attribute of HttpRequest, you would need to use the module in which it was defined:

django.http.request.HttpRequest.GET: TaintSource[UserControlled]

Matching signatures​

The signature of any modeled function needs to match the signature of the function, as seen by Pyre. Note that Pyre doesn't always see the definition of the functions directly. If .pyi stub files are present, Pyre will use the signatures from those files, rather than the actual signature from the function definition in your or your dependencies' code. See the Gradual Typing page for more info about these .pyi stubs.

This matching signature requirement means that all parameters being modelled must be named identically to the parameters in the corresponding code or .pyi file. Unmodelled parameters, *args, and **kwargs may be included, but are not required. When copying parameters to your model, all type information must be removed, and all default values must be elided (see below).

If a function includes an * that indicates keyword only parameters, or a / that indicates positional-only parameters, then that may be included in your model. Note that unlike when modeling named parameters, you need to include all positional only parameters the model so that Pysa knows what position is being tainted.

For example, urllib.request.urlopen has the following signature:

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None,
capath=None, cadefault=False, context=None):

Given that signature, either of the following models are acceptable:

def urllib.request.urlopen(url: TaintSink[HTTPClientRequest], data,
timeout, *, cafile, capath,
cadefault, context): ...
def urllib.request.urlopen(url: TaintSink[HTTPClientRequest]): ...

Pysa will complain if the signature of your model doesn't match the implementation. When working with functions defined outside your project, where you don't directly see the source, you can use pyre query with the signature argument to have Pysa dump it's internal model of a function, so you know exactly how to write your model.

Eliding​

As you can see from the above examples, unmodelled parameters and function bodies can both be elided with .... Additionally, type annotations must be entirely omitted (not replaced with ...), even when present on the declaration of the function. This is done to make parsing taint annotations unambiguous.