Model Domain Specific Language (DSL)
We have started developing a model Domain Specific Language (DSL) that can be used to solve many of the same problems as model generators, while still keeping model information in .pysa files. The DSL aims to provide a compact way to generate models for all code that matches a given query. This allows users to avoid writing hundreds or thousands of models.
Basics
The most basic form of querying Pysa's DSL is by generating models based on function names. To do so, add a ModelQuery to your .pysa file:
ModelQuery(
# Indicates the name of the query
name = "get_foo_sources",
# Indicates that this query is looking for functions
find = "functions",
# Indicates those functions should be called 'foo'
where = [name.matches("foo")],
# Indicates that matched function should be modeled as returning 'Test' taint
model = [
Returns(TaintSource[Test]),
],
# Indicates that the generated models should include the 'foo' and 'foo2' functions
expected_models = [
"def file.foo() -> TaintSource[Test]: ...",
"def file.foo2() -> TaintSource[Test]: ..."
],
# Indicates that the generated models should not include the 'bar' function
unexpected_models = [
"def file.bar() -> TaintSource[Test]: ..."
]
)
Things to note in this example:
- The name clause is the name of your query.
- The find clause lets you pick whether you want to model functions, methods, attributes or globals.
- The where clause is how you refine your criteria for when a model should be generated. In this example, we're filtering for functions whose names contain foo.
- The model clause is a list of models to generate. Here, the syntax means that the functions matching the where clause should be modelled as returning TaintSource[Test].
- The expected_models and unexpected_models clauses are optional and allow you to specify models that should or should not be generated by your query.
When invoking Pysa, if you add the --dump-model-query-results /path/to/output/file flag to your invocation, the generated models, sorted under the respective ModelQuery that created them, will be written to a file in JSON format.
$ pyre analyze --dump-model-query-results /path/to/output/file.txt
...
> Emitting the model query results to `/my/home/dir/.pyre/model_query_results.pysa`
You can then view this file to see the generated models.
You can also test DSL queries using pyre query.
Name clauses
The name clause describes what the query is meant to find. Normally it follows the format of get_ + [what the query matches with in the where clause] + [_sinks, _sources and/or _tito]. This clause should be unique for every ModelQuery within a file.
Find clauses
The find clause specifies what entities to model, and currently supports "functions", "methods", "attributes", and "globals". "functions" indicates that you're querying for free functions, "methods" indicates that you're only querying class methods, "attributes" indicates that you're querying for attributes on classes, and "globals" indicates that you're querying for names available in the global scope.
Note that "attributes" also includes constructor-initialized attributes, such as C.y in the following case:
class C:
    x = ...
    def __init__(self):
        self.y = ...
Note that queries for "globals" currently don't infer the type annotation of a global's value, so querying is more effective when globals are explicitly annotated:
def fun(x: int, y: str) -> int:
    return x + int(y)
a = fun(1, "2") # -> typing.Any
b: int = fun(1, "2") # -> int
Where clauses
where clauses are a list of predicates, all of which must match for an entity to be modelled. Note that certain predicates are only compatible with specific find clause kinds.
fully_qualified_name.matches
The most basic query predicate is a name match: the name you're searching for is compiled as a regex, and the entity's fully qualified name is compared against it. A fully qualified name includes the module and class. For example, for a method foo in class C which is part of module bar, the fully qualified name is bar.C.foo.
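As a rough illustration of what a fully qualified name looks like, the sketch below assembles one at runtime from Python's __module__ and __qualname__ attributes. This is only an analogy: Pysa resolves names statically, and the helper runtime_qualified_name is hypothetical, not part of Pysa.

```python
class C:
    def foo(self) -> None:
        pass


def runtime_qualified_name(obj) -> str:
    # Hypothetical helper: join the defining module and the dotted path inside it.
    return f"{obj.__module__}.{obj.__qualname__}"


# If this code lived in a module named `bar`, the result would be "bar.C.foo".
qualified = runtime_qualified_name(C.foo)
```

The module prefix depends on where the code is defined, which is why the same method name can have different fully qualified names in different modules.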
Example:
ModelQuery(
name = "get_starting_with_foo",
find = ...,
where = [
fully_qualified_name.matches("foo.*")
],
model = ...
)
matches performs a partial match! For instance, matches("bar") will match against a function named my_module.foobarbaz.
To perform a full match, use ^ and $. For instance: matches("^.*\.bar$").
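The difference between a partial and an anchored match can be tried out with Python's re module. This is a sketch that uses re.search as a stand-in for the partial-match behaviour described above; Pysa's own regex engine may differ in syntax details.

```python
import re

name = "my_module.foobarbaz"

# Unanchored pattern: succeeds because "bar" occurs somewhere in the name.
partial = re.search("bar", name) is not None

# Anchored pattern: the last dotted component must be exactly "bar".
anchored_miss = re.search(r"^.*\.bar$", name) is not None
anchored_hit = re.search(r"^.*\.bar$", "my_module.bar") is not None
```

Here partial and anchored_hit are True, while anchored_miss is False because the name ends in baz rather than .bar.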
fully_qualified_name.equals
This clause will match when the entity's fully qualified name is exactly the same as the specified string.
Example:
ModelQuery(
name = "get_bar_C_foo",
find = ...,
where = [
fully_qualified_name.equals("bar.C.foo")
],
model = ...
)
name.matches
The name.matches clause is similar to fully_qualified_name.matches, but matches against the actual name of the entity, excluding module and class names.
Example:
ModelQuery(
name = "get_starting_with_foo",
find = ...,
where = [
name.matches("foo.*")
],
model = ...
)
matches performs a partial match! For instance, matches("bar") will match against a function named foobarbaz.
To perform a full match, use ^ and $. For instance: matches("^.*bar$").
name.equals
The name.equals clause is similar to fully_qualified_name.equals, but matches against the actual name of the entity, excluding module and class names.
ModelQuery(
name = "get_foo",
find = ...,
where = [
name.equals("foo")
],
model = ...
)
return_annotation clauses
Model queries allow for querying based on the return annotation of a callable. Note that this where clause does not work when the find clause specifies "attributes".
return_annotation.equals
The clause will match when the fully-qualified name of the callable's return type matches the specified value exactly.
ModelQuery(
name = "get_return_HttpRequest_sources",
find = "functions",
where = [
return_annotation.equals("django.http.HttpRequest"),
],
model = Returns(TaintSource[UserControlled, Via[http_request]])
)
return_annotation.matches
This is similar to the previous clause, but will match when the fully-qualified name of the callable's return type matches the specified pattern.
ModelQuery(
name = "get_return_Request_sources",
find = "methods",
where = [
return_annotation.matches(".*Request"),
],
model = Returns(TaintSource[UserControlled, Via[http_request]])
)
return_annotation.is_annotated_type
This will match when a callable's return type is annotated with typing.Annotated. This is a type used to decorate existing types with context-specific metadata, e.g.
from typing import Annotated
def bad() -> Annotated[str, "SQL"]:
    ...
Example:
ModelQuery(
name = "get_return_annotated_sources",
find = "functions",
where = [
return_annotation.is_annotated_type(),
],
model = Returns(TaintSource[SQL])
)
This query would match on functions like the one shown above.
return_annotation.extends
This will match when a callable's return type is a class that is a subclass of the provided class name. Note that this will only work on class names. More complex types such as Union or Callable are not supported. The extends clause also takes two boolean parameters: is_transitive, which when set to True means it will match when the class is a transitive subclass (otherwise it will only match when it is a direct subclass), and includes_self, which determines whether extends(T) should include T itself.
Example:
ModelQuery(
name = "get_return_annotation_extends",
find = "functions",
where = [
return_annotation.extends("test.A", is_transitive=True, includes_self=False),
],
model = Returns(TaintSource[Test])
)
Given the following Python code in module test:
class A:
    pass
class B(A):
    pass
class C(B):
    pass
def foo() -> A: ...
def bar() -> B: ...
def baz() -> C: ...
The above query would match bar and baz, whose return types are transitive subclasses of A, but not foo, since includes_self was False.
If the return type is Optional[T] or ReadOnly[T], it will be effectively treated as if it were type T for the purpose of matching.
from typing import Optional
from pyre_extensions import ReadOnly
# These should all also match
def bar_optional() -> Optional[B]: ...
def bar_readonly() -> ReadOnly[B]: ...
def baz2() -> Optional[ReadOnly[Optional[C]]]: ...
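The unwrapping of Optional described above can be pictured with Python's typing helpers. This is a sketch under stated assumptions: strip_optional is a hypothetical helper, it works on runtime annotations rather than on Pysa's static view of the code, and it does not handle ReadOnly, which comes from pyre_extensions.

```python
from typing import Optional, Union, get_args, get_origin


class A: ...
class B(A): ...


def strip_optional(annotation):
    # Hypothetical helper: unwrap Optional[T] (that is, Union[T, None]) down to T.
    while get_origin(annotation) is Union:
        non_none = [arg for arg in get_args(annotation) if arg is not type(None)]
        if len(non_none) != 1:
            break  # a genuine Union of several types is left untouched
        annotation = non_none[0]
    return annotation


def bar_optional() -> Optional[B]: ...


unwrapped = strip_optional(bar_optional.__annotations__["return"])
# Subclass check on the unwrapped type, excluding A itself (includes_self=False).
matches = issubclass(unwrapped, A) and unwrapped is not A
```

After unwrapping, the subclass check runs against B directly, which is why bar_optional is matched in the same way as bar.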
type_annotation clauses
Model queries allow for querying based on the type annotation of a global. Note that this is similar to the return_annotation clauses shown previously. See also: Parameters model type_annotation clauses.
type_annotation.equals
The clause will match when the fully-qualified name of the global's explicitly annotated type matches the specified value exactly.
ModelQuery(
name = "get_string_dicts",
find = "globals",
where = [
type_annotation.equals("typing.Dict[(str, str)]"),
],
model = GlobalModel(TaintSource[SelectDict])
)
For example, the above query when run on the following code:
unannotated_dict = {"hello": "world", "abc": "123"}
annotated_dict: Dict[str, str] = {"hello": "world", "abc": "123"}
will result in a model for annotated_dict: TaintSource[SelectDict].
type_annotation.matches
This is similar to the previous clause, but will match when the fully-qualified name of the global's explicit type annotation matches the specified pattern.
ModelQuery(
name = "get_anys",
find = "globals",
where = [
type_annotation.matches(".*typing.Any.*"),
],
model = GlobalModel(TaintSource[SelectAny])
)
type_annotation.is_annotated_type
This will match when a global's type is annotated with typing.Annotated. This is a type used to decorate existing types with context-specific metadata, e.g.
from typing import Annotated
result: Annotated[str, "SQL"] = ...
Example:
ModelQuery(
name = "get_return_annotated_sources",
find = "globals",
where = [
type_annotation.is_annotated_type(),
],
model = GlobalModel(TaintSource[SQL])
)
This query would match on globals like the one shown above.
type_annotation.extends
This behaves the same way as the return_annotation.extends() clause. Please refer to the section above.
any_parameter clauses
Model queries allow matching callables where any parameter matches a given clause. For now, the only clauses we support for parameters are conditions on the type annotation of a callable's parameters. These can be used in conjunction with the Parameters model clause (see type_annotation) to taint specific parameters. Note that this where clause does not work when the find clause specifies "attributes".
any_parameter.annotation.equals
This clause will match all callables which have at least one parameter where the fully-qualified name of the parameter type matches the specified value exactly.
Example:
ModelQuery(
name = "get_parameter_HttpRequest_sources",
find = "functions",
where = [
any_parameter.annotation.equals("django.http.HttpRequest")
],
model =
Parameters(
TaintSource[UserControlled],
where=[
name.equals("request"),
name.matches("data$")
]
)
)
any_parameter.annotation.matches
This clause will match all callables which have at least one parameter where the fully-qualified name of the parameter type matches the specified pattern.
Example:
ModelQuery(
name = "get_parameter_Request_sources",
find = "methods",
where = [
any_parameter.annotation.matches(".*Request")
],
model =
Parameters(
TaintSource[UserControlled],
where=[
type_annotation.matches(".*Request"),
]
)
)
any_parameter.annotation.is_annotated_type
This clause will match all callables which have at least one parameter with type typing.Annotated.
Example:
ModelQuery(
name = "get_parameter_annotated_sources",
find = "functions",
where = [
any_parameter.annotation.is_annotated_type()
],
model =
Parameters(
TaintSource[Test],
where=[
type_annotation.is_annotated_type(),
]
)
)
AnyOf clauses
There are cases when we want to model entities which match any of a set of clauses. The AnyOf clause represents exactly this case.
Example:
ModelQuery(
name = "get_AnyOf_example",
find = "methods",
where = [
AnyOf(
any_parameter.annotation.is_annotated_type(),
return_annotation.is_annotated_type(),
)
],
model = ...
)
AllOf clauses
There are cases when we want to model entities which match all of a set of clauses. The AllOf clause may be used in this case.
Example:
ModelQuery(
name = "get_AllOf_example",
find = "methods",
where = [
AnyOf(
AllOf(
cls.extends("a.b"),
cls.name.matches("Foo"),
),
AllOf(
cls.extends("c.d"),
cls.name.matches("Bar")
)
)
],
model = ...
)
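The way AnyOf and AllOf nest can be pictured as ordinary boolean combinators over predicates. The sketch below is an analogy only: any_of, all_of and the four string predicates are hypothetical stand-ins for the cls.extends and cls.name.matches clauses in the query above, not Pysa APIs.

```python
from typing import Callable

Predicate = Callable[[str], bool]


def any_of(*predicates: Predicate) -> Predicate:
    # Matches when at least one inner predicate matches.
    return lambda entity: any(p(entity) for p in predicates)


def all_of(*predicates: Predicate) -> Predicate:
    # Matches when every inner predicate matches.
    return lambda entity: all(p(entity) for p in predicates)


# Hypothetical predicates over a dotted class name.
starts_with_a = lambda name: name.startswith("a.")
contains_foo = lambda name: "Foo" in name
starts_with_c = lambda name: name.startswith("c.")
contains_bar = lambda name: "Bar" in name

where = any_of(
    all_of(starts_with_a, contains_foo),
    all_of(starts_with_c, contains_bar),
)
```

With these stand-ins, where("a.b.Foo") and where("c.d.Bar") hold, while where("a.b.Bar") does not, because neither inner AllOf group is fully satisfied.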
Decorator clauses
Decorator clauses are used to find callables decorated with decorators that match a pattern. This clause takes decorator clauses as arguments.
Decorator fully_qualified_callee clauses
The fully_qualified_callee decorator clause is used to match on the fully qualified name of a decorator, that is, the fully qualified name of a higher order function.
The supported name clauses are the same as the ones discussed above for model query constraints, i.e.,
- fully_qualified_callee.matches("pattern"), which will match when the decorator matches the regex pattern specified as a string, and
- fully_qualified_callee.equals("foo.bar.d1"), which will match when the fully-qualified name of the decorator equals the specified string exactly.
For example, if you wanted to find all functions that are decorated by @App().route(), a decorator whose definition is in file my_module.py:
from typing import Callable
class App:
    def route(self, func: Callable) -> Callable:
        ...
You can write:
ModelQuery(
name = "get_my_module_app_route_decorator",
find = "functions",
where = Decorator(fully_qualified_callee.equals("my_module.App.route")),
...
)
which is arguably better because it is more precise than regex matching, or
ModelQuery(
name = "get_app_route_decorator",
find = "functions",
where = Decorator(fully_qualified_callee.matches(".*\.App\.route")),
...
)
Clarification: as another example, assume the following code is in file test.py:
from typing import Callable
class Flask:
    def route(self, func: Callable) -> Callable:
        ...
application = Flask()
@application.route
def my_view():
    pass
Then, for the decorator @application.route, the fully_qualified_callee clause matches against the decorator's fully qualified name test.Flask.route, as opposed to test.application.route, the fully qualified name of the local identifier that refers to this decorator.
Decorator name clauses
The name clause is similar to fully_qualified_name, but matches against the actual name of the entity, excluding module and class names.
Decorator arguments clauses
The arguments clause is used to match on the arguments provided to the decorator. The supported arguments clauses are arguments.contains(...), which will match when the arguments specified are a subset of the decorator's arguments, and arguments.equals(...), which will match when the decorator has the specified arguments exactly.
arguments.contains() supports both positional and keyword arguments. For positional arguments, the list of positional arguments supplied to the arguments.contains() clause must be a prefix of the list of positional arguments on the actual decorator, i.e. the value of the argument at each position should be the same. For example, with the following Python code:
@d1(a, 2)
def match1():
    ...
@d1(a, 2, 3, 4)
def match2():
    ...
@d1(2, a)
def nomatch():
    ...
This query will match both match1() and match2(), but not nomatch(), since the values of the positional arguments don't match up.
ModelQuery(
name = "get_d1_decorator",
find = "functions",
where = Decorator(
fully_qualified_name.matches("d1"),
arguments.contains(a, 2)
),
...
)
For keyword arguments in arguments.contains(), the specified keyword arguments must be a subset of the decorator's keyword arguments, but can be specified in any order. For example, with the following Python code:
@d1(a, 2, foo="Bar")
def match1():
    ...
@d1(baz="Boo", foo="Bar")
def match2():
    ...
This query will match both match1() and match2():
ModelQuery(
name = "get_d1_decorator",
find = "functions",
where = Decorator(
fully_qualified_name.matches("d1"),
arguments.contains(foo="Bar")
),
...
)
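The prefix rule for positional arguments and the subset rule for keyword arguments can be summarised in a few lines of Python. This is a minimal sketch of the semantics described above, not Pysa's implementation; arguments_contain is a hypothetical helper, and the string "a" stands in for the identifier a used in the decorators.

```python
def arguments_contain(query_args, query_kwargs, actual_args, actual_kwargs):
    # Positional arguments in the query must be a prefix of the actual ones.
    prefix_ok = tuple(actual_args[: len(query_args)]) == tuple(query_args)
    # Keyword arguments in the query must be a subset of the actual ones.
    kwargs_ok = all(
        key in actual_kwargs and actual_kwargs[key] == value
        for key, value in query_kwargs.items()
    )
    return prefix_ok and kwargs_ok


a = "a"  # stand-in for the identifier `a`

# arguments.contains(a, 2)
match1 = arguments_contain((a, 2), {}, (a, 2), {})
match2 = arguments_contain((a, 2), {}, (a, 2, 3, 4), {})
nomatch = arguments_contain((a, 2), {}, (2, a), {})

# arguments.contains(foo="Bar")
kw_match1 = arguments_contain((), {"foo": "Bar"}, (a, 2), {"foo": "Bar"})
kw_match2 = arguments_contain((), {"foo": "Bar"}, (), {"baz": "Boo", "foo": "Bar"})
```

Only nomatch evaluates to False: its positional values appear in a different order, so the queried arguments are not a prefix of the actual ones.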
arguments.equals() operates similarly, but will only match if the specified arguments match the decorator's arguments exactly. This means that for positional arguments, all arguments in each position must match by value exactly. Keyword arguments can be specified in a different order, but the set of specified keyword arguments and the set of the decorator's actual keyword arguments must be the same. For example, with the following Python code:
@d1(a, 2, foo="Bar", baz="Boo")
def match1():
    ...
@d1(a, 2, baz="Boo", foo="Bar")
def match2():
    ...
@d1(2, a, baz="Boo", foo="Bar")
def nomatch1():
    ...
@d1(a, 2, 3, baz="Boo", foo="Bar")
def nomatch2():
    ...
This query will match both match1() and match2(), but not nomatch1() or nomatch2():
ModelQuery(
name = "get_d1_decorator",
find = "functions",
where = Decorator(
fully_qualified_name.matches("d1"),
arguments.equals(a, 2, foo="Bar", baz="Boo")
),
...
)
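The exact-match rule can be sketched in the same spirit: positional arguments are compared in order, while keyword arguments are compared as an unordered mapping. arguments_equal is a hypothetical helper written for illustration, and the string "a" again stands in for the identifier a.

```python
def arguments_equal(query_args, query_kwargs, actual_args, actual_kwargs):
    # Positional arguments are compared position by position;
    # dict equality ignores the order in which keywords were written.
    return (
        tuple(query_args) == tuple(actual_args)
        and dict(query_kwargs) == dict(actual_kwargs)
    )


a = "a"  # stand-in for the identifier `a`
query_args, query_kwargs = (a, 2), {"foo": "Bar", "baz": "Boo"}

match1 = arguments_equal(query_args, query_kwargs, (a, 2), {"foo": "Bar", "baz": "Boo"})
match2 = arguments_equal(query_args, query_kwargs, (a, 2), {"baz": "Boo", "foo": "Bar"})
nomatch1 = arguments_equal(query_args, query_kwargs, (2, a), {"baz": "Boo", "foo": "Bar"})
nomatch2 = arguments_equal(query_args, query_kwargs, (a, 2, 3), {"baz": "Boo", "foo": "Bar"})
```

Unlike the prefix rule of arguments.contains(), the extra positional argument 3 is enough to make nomatch2 fail here.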
Decorator Not, AllOf and AnyOf clauses
The Not, AllOf and AnyOf clauses can be used in decorator clauses in the same way as they are in the main where clause of the model query.
cls.fully_qualified_name.equals clause
You may use the cls clause to specify predicates on the class. This predicate can only be used when the find clause specifies methods or attributes.
The cls.fully_qualified_name.equals clause is used to model entities when the class's fully qualified name is an exact match for the specified string.
Example:
ModelQuery(
name = "get_childOf_foo_Bar",
find = "methods",
where = cls.fully_qualified_name.equals("foo.Bar"),
...
)
cls.fully_qualified_name.matches clause
The cls.fully_qualified_name.matches clause is used to model entities when the class's fully qualified name matches the provided regex.
Example:
ModelQuery(
name = "get_childOf_Foo",
find = "methods",
where = cls.fully_qualified_name.matches(".*Foo.*"),
...
)
cls.name.matches clause
The cls.name.matches clause is similar to cls.fully_qualified_name.matches, but matches against the actual name of the class, excluding modules.
cls.name.equals clause
The cls.name.equals clause is similar to cls.fully_qualified_name.equals, but matches against the actual name of the class, excluding modules.
cls.extends clause
The cls.extends clause is used to model entities when the class is a subclass of the provided class name.
Example:
ModelQuery(
name = "get_subclassOf_C",
find = "attributes",
where = cls.extends("C"),
...
)
The default behavior is that it will only match if the class is the specified class itself or a direct subclass of it. For example, with classes:
class C:
    x = ...
class D(C):
    y = ...
class E(D):
    z = ...
the above query will only model the attributes C.x and D.y, since C is considered to extend itself, and D is a direct subclass of C. However, it will not model E.z, since E is a sub-subclass of C.
If you would like to model a class and all subclasses transitively, you can use the is_transitive flag.
Example:
ModelQuery(
name = "get_transitive_subclassOf_C",
find = "attributes",
where = cls.extends("C", is_transitive=True),
...
)
This query will model C.x, D.y and E.z.
If you do not want to match on the class itself, you can use the includes_self flag.
Example:
ModelQuery(
name = "get_transitive_subclassOf_C",
find = "attributes",
where = cls.extends("C", is_transitive=True, includes_self=False),
...
)
This query will model D.y and E.z.
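The three variations of cls.extends can be compared side by side with a small runtime model. This is a sketch only: extends is a hypothetical function whose defaults (is_transitive=False, includes_self=True) are taken from the behaviour described above, and it inspects live class objects rather than Pysa's static class hierarchy.

```python
class C: ...
class D(C): ...
class E(D): ...


def extends(cls, base, is_transitive=False, includes_self=True):
    # Hypothetical model of the two flags, using runtime class objects.
    if cls is base:
        return includes_self
    if is_transitive:
        return issubclass(cls, base)
    return base in cls.__bases__  # direct subclasses only


classes = (C, D, E)
default = [k.__name__ for k in classes if extends(k, C)]
transitive = [k.__name__ for k in classes if extends(k, C, is_transitive=True)]
strict = [
    k.__name__
    for k in classes
    if extends(k, C, is_transitive=True, includes_self=False)
]
```

The three lists come out as C and D for the default, all of C, D and E with is_transitive=True, and D and E once includes_self=False is added.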
cls.decorator clause
The cls.decorator clause is used to specify constraints on a class decorator, so you can choose to model entities only if the class they are part of has the specified decorator.
The arguments for this clause are identical to the non-class constraint Decorator. For more information, please see the Decorator clauses section.
Example:
ModelQuery(
name = "get_childOf_d1_decorator_sources",
find = "methods",
where = [
cls.decorator(
fully_qualified_name.matches("d1"),
arguments.contains(2)
),
name.equals("__init__")
],
model = [
Parameters(TaintSource[Test], where=[
Not(name.equals("self")),
Not(name.equals("a"))
])
]
)
For example, the above query when run on the following code:
@d1(2)
class Foo:
    def __init__(self, a, b):
        ...
@d1()
class Bar:
    def __init__(self, a, b):
        ...
@d2(2)
class Baz:
    def __init__(self, a, b):
        ...
will result in a model for def Foo.__init__(b: TaintSource[Test]).
cls.any_child clause
The cls.any_child clause is used to model entities when any child of the current class meets the specified constraints.
The arguments for this clause are any combination of valid class constraints (cls.name.equals, cls.name.matches, cls.fully_qualified_name.equals, cls.fully_qualified_name.matches, cls.extends, cls.decorator) and logical clauses (AnyOf, AllOf, Not), along with the optional is_transitive and includes_self clauses.
Example:
ModelQuery(
name = "get_parent_of_d1_decorator_sources",
find = "methods",
where = [
cls.any_child(
cls.decorator(
fully_qualified_name.matches("d1"),
arguments.contains(2)
)
),
name.equals("__init__")
],
model = [
Parameters(TaintSource[Test], where=[
Not(name.equals("self")),
Not(name.equals("a"))
])
]
)
Similar to the cls.extends constraint, the default behavior is that it will only match if the class of the method or attribute, or one of its immediate children, matches the inner clause. For example, with classes:
class Foo:
    def __init__(self, a, b):
        ...
class Bar(Foo):
    def __init__(self, a, b):
        ...
@d1(2)
class Baz(Bar):
    def __init__(self, a, b):
        ...
The above query will only model the methods Bar.__init__ and Baz.__init__, since Bar is an immediate parent of Baz, and Baz is considered to extend itself. However, it will not model Foo.__init__, since Baz is a sub-subclass of Foo.
If you would like to model a class and all subclasses transitively, you can use the is_transitive flag.
Example:
ModelQuery(
name = "get_transitive_parent_of_d1_decorator_sources",
find = "methods",
where = [
cls.any_child(
cls.decorator(
fully_qualified_name.matches("d1"),
arguments.contains(2)
),
is_transitive=True
),
name.equals("__init__")
],
...
)
This query will model Foo.__init__, Bar.__init__ and Baz.__init__.
If you would like to model all subclasses of a class excluding itself, you can use the includes_self flag.
Example:
ModelQuery(
name = "get_transitive_parent_of_d1_decorator_sources",
find = "methods",
where = [
cls.any_child(
cls.decorator(
fully_qualified_name.matches("d1"),
arguments.contains(2)
),
is_transitive=True,
includes_self=False
),
name.equals("__init__")
],
...
)
This query will model Foo.__init__ and Bar.__init__, but NOT Baz.__init__.
We recommend always specifying both is_transitive and includes_self to avoid confusion.
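To see how is_transitive and includes_self interact for cls.any_child, the sketch below walks a runtime class hierarchy downwards. It is an illustration under stated assumptions: any_child and descendants are hypothetical helpers, the defaults follow the behaviour described above, and the decorated set stands in for the cls.decorator(...) inner clause.

```python
def descendants(cls, is_transitive):
    # Direct subclasses, optionally followed down the whole hierarchy.
    found = []
    for child in cls.__subclasses__():
        found.append(child)
        if is_transitive:
            found.extend(descendants(child, True))
    return found


def any_child(cls, predicate, is_transitive=False, includes_self=True):
    candidates = ([cls] if includes_self else []) + descendants(cls, is_transitive)
    return any(predicate(candidate) for candidate in candidates)


class Foo: ...
class Bar(Foo): ...
class Baz(Bar): ...

decorated = {Baz}  # stand-in: only Baz carries the @d1(2) decorator
has_d1 = lambda c: c in decorated

classes = (Foo, Bar, Baz)
default = [c.__name__ for c in classes if any_child(c, has_d1)]
transitive = [c.__name__ for c in classes if any_child(c, has_d1, is_transitive=True)]
excluding_self = [
    c.__name__
    for c in classes
    if any_child(c, has_d1, is_transitive=True, includes_self=False)
]
```

The resulting lists are Bar and Baz by default, all three classes with is_transitive=True, and Foo and Bar once includes_self=False removes Baz's match on itself.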
cls.any_parent clause
The cls.any_parent clause is used to model entities when any parent of the current class meets the specified constraints.
The arguments for this clause are any combination of valid class constraints (cls.name.equals, cls.name.matches, cls.fully_qualified_name.equals, cls.fully_qualified_name.matches, cls.extends, cls.decorator) and logical clauses (AnyOf, AllOf, Not), along with the optional is_transitive and includes_self clauses.
Example:
ModelQuery(
name = "get_children_of_d1_decorator_sources",
find = "methods",
where = [
cls.any_parent(
cls.decorator(
fully_qualified_name.matches("d1"),
arguments.contains(2)
)
),
name.equals("__init__")
],
model = [
Parameters(TaintSource[Test], where=[
Not(name.equals("self")),
Not(name.equals("a"))
])
]
)
Similar to the cls.extends constraint, the default behavior is that it will only match if the class of the method or attribute, or one of its immediate parents, matches the inner clause. For example, with classes:
@d1(2)
class Foo:
    def __init__(self, a, b):
        ...
class Bar(Foo):
    def __init__(self, a, b):
        ...
class Baz(Bar):
    def __init__(self, a, b):
        ...
The above query will only model the methods Bar.__init__ and Foo.__init__, since Foo is an immediate parent of Bar, and Foo is considered to extend itself. However, it will not model Baz.__init__, since Foo is not an immediate parent of Baz.
If you would like to model a class and all transitive parents, you can use the is_transitive flag.
Example:
ModelQuery(
name = "get_transitive_children_of_d1_decorator_sources",
find = "methods",
where = [
cls.any_parent(
cls.decorator(
fully_qualified_name.matches("d1"),
arguments.contains(2)
),
is_transitive=True
),
name.equals("__init__")
],
...
)
This query will model Foo.__init__, Bar.__init__ and Baz.__init__.
If you would like to model all parents of a class excluding itself, you can use the includes_self flag.
Example:
ModelQuery(
name = "get_transitive_parent_of_d1_decorator_sources",
find = "methods",
where = [
cls.any_parent(
cls.decorator(
fully_qualified_name.matches("d1"),
arguments.contains(2)
),
is_transitive=True,
includes_self=False
),
name.equals("__init__")
],
...
)
This query will model Bar.__init__ and Baz.__init__, but NOT Foo.__init__.
We recommend always specifying both is_transitive and includes_self to avoid confusion.
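The same flags for cls.any_parent can be modelled by walking the hierarchy upwards instead. As before, this is a sketch: any_parent and ancestors are hypothetical helpers operating on runtime classes, and the decorated set stands in for the cls.decorator(...) inner clause.

```python
def ancestors(cls, is_transitive):
    # __mro__[1:] lists every ancestor; __bases__ lists only the immediate parents.
    parents = cls.__mro__[1:] if is_transitive else cls.__bases__
    return [parent for parent in parents if parent is not object]


def any_parent(cls, predicate, is_transitive=False, includes_self=True):
    candidates = ([cls] if includes_self else []) + ancestors(cls, is_transitive)
    return any(predicate(candidate) for candidate in candidates)


class Foo: ...
class Bar(Foo): ...
class Baz(Bar): ...

decorated = {Foo}  # stand-in: only Foo carries the @d1(2) decorator
has_d1 = lambda c: c in decorated

classes = (Foo, Bar, Baz)
default = [c.__name__ for c in classes if any_parent(c, has_d1)]
transitive = [c.__name__ for c in classes if any_parent(c, has_d1, is_transitive=True)]
excluding_self = [
    c.__name__
    for c in classes
    if any_parent(c, has_d1, is_transitive=True, includes_self=False)
]
```

This yields Foo and Bar by default, all three classes with is_transitive=True, and Bar and Baz when includes_self=False is also set.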
Not clauses
The Not clause negates any existing clause that is valid for the entity being modelled.
Example:
ModelQuery(
name = "get_Not_example",
find = "methods",
where = [
Not(
name.matches("foo.*"),
cls.fully_qualified_name.matches("testing.unittest.UnitTest"),
)
],
model = ...
)
Generated models (Model clauses)
The last part of a model query generates models for all entities that match the provided where clauses. For callables, we support generating models for parameters by name or position, as well as generating models for all parameters. Additionally, we support generating models for the return annotation.
Returned taint
Returned taint takes the form of Returns(TaintSpecification), where TaintSpecification is either a taint annotation or a list of taint annotations.
ModelQuery(
name = "get_Returns_sources",
find = "methods",
where = ...,
model = [
Returns(TaintSource[Test, Via[foo]])
]
)
Parameter taint
Parameters can be tainted using the Parameters() clause. By default, all parameters will be tainted with the supplied taint specification. If you would like to only taint specific parameters matching certain conditions, an optional where clause can be specified to accomplish this, allowing for constraints on parameter names, the annotation type of the parameter, or parameter position. For example:
ModelQuery(
name = "get_Parameters_sources",
find = "methods",
where = ...,
model = [
Parameters(TaintSource[A]), # will taint all parameters by default
Parameters(
TaintSource[B],
where=[
Not(index.equals(0)) # will only taint parameters that are not the first parameter
]
),
]
)
name clauses
To specify a constraint on parameter name, the name.equals() or name.matches() clauses can be used. As in the main where clause of the model query, equals() searches for an exact match on the specified string, while matches() allows a regex to be supplied as a pattern to match against.
Example:
ModelQuery(
name = "get_request_data_sources",
find = "methods",
where = ...,
model = [
Parameters(
TaintSource[Test],
where=[
name.equals("request"),
name.matches("data$")
]
)
]
)
index clause
To specify a constraint on parameter position, the index.equals() clause can be used. It takes a single integer denoting the position of the parameter.
Example:
ModelQuery(
name = "get_index_sources",
find = "methods",
where = ...,
model = [
Parameters(
TaintSource[Test],
where=[
index.equals(1)
]
)
]
)
has_position clause
To match on parameters that have a position, the has_position() clause can be used. This is mostly used to exclude keyword-only parameters, *args and **kwargs.
Example:
ModelQuery(
name = "get_index_sources",
find = "methods",
where = ...,
model = [
Parameters(
TaintSource[Test],
where=[
has_position()
]
)
]
)
has_name clause
To match on parameters that have a name, the has_name() clause can be used. This is mostly used to exclude *args and **kwargs.
Example:
ModelQuery(
name = "get_index_sources",
find = "methods",
where = ...,
model = [
Parameters(
TaintSource[Test],
where=[
has_name()
]
)
]
)
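Which parameters has_position() and has_name() are meant to keep can be pictured with Python's inspect module. This is a sketch that mirrors the exclusions described above using runtime parameter kinds; the function example and the two kind sets are made up for illustration, and Pysa's own classification may differ in edge cases.

```python
import inspect


def example(a, b=0, *args, c, **kwargs):
    pass


POSITIONAL_KINDS = {
    inspect.Parameter.POSITIONAL_ONLY,
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
}
VARIADIC_KINDS = {
    inspect.Parameter.VAR_POSITIONAL,  # *args
    inspect.Parameter.VAR_KEYWORD,  # **kwargs
}

parameters = list(inspect.signature(example).parameters.values())

# has_position(): drops keyword-only parameters, *args and **kwargs.
with_position = [p.name for p in parameters if p.kind in POSITIONAL_KINDS]

# has_name(): drops only *args and **kwargs.
with_name = [p.name for p in parameters if p.kind not in VARIADIC_KINDS]
```

For this signature, with_position contains a and b, while with_name additionally keeps the keyword-only parameter c.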
type_annotation clause
This clause is used to specify a constraint on parameter type annotation. Currently the clauses supported are: type_annotation.equals(), which takes the fully-qualified name of a Python type or class and matches when there is an exact match, type_annotation.matches(), which takes a regex pattern to match type annotations against, and type_annotation.is_annotated_type(), which will match parameters of type typing.Annotated.
Example:
ModelQuery(
name = "get_annotated_parameters_sources",
find = "methods",
where = ...,
model = [
Parameters(
TaintSource[Test],
where=[
type_annotation.equals("foo.bar.C"), # exact match
type_annotation.matches("^List\["), # regex match
type_annotation.is_annotated_type(), # matches Annotated[T, x]
]
)
]
)
To match on the annotation portion of Annotated types, consider the following example. Suppose this code was in test.py:
from enum import Enum
from typing import Annotated, Optional
class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3
class Foo:
    x: Annotated[Optional[int], Color.RED]
    y: Annotated[Optional[int], Color.BLUE]
    z: Annotated[int, "z"]
Note that the type name that should be matched against is its fully qualified name, which also includes the fully qualified name of any other types referenced (for example, typing.Optional rather than just Optional). When multiple arguments are provided to the type they are implicitly treated as being in a tuple.
Here are some examples of where clauses that can be used to specify models for the annotated attributes in this case:
ModelQuery(
name = "get_annotated_attributes_sources",
find = "attributes",
where = [
AnyOf(
type_annotation.equals("typing.Annotated[(typing.Optional[int], test.Color.RED)]"),
type_annotation.equals("typing.Annotated[(int, z)]"),
type_annotation.matches(".*Annotated\[.*Optional\[int\].*Color\..*\]"),
type_annotation.is_annotated_type()
)
],
model = [
AttributeModel(TaintSource[Test]),
]
)
This query should generate the following models:
test.Foo.x: TaintSource[Test]
test.Foo.y: TaintSource[Test]
test.Foo.z: TaintSource[Test]
Not, AllOf and AnyOf clauses
The Not, AllOf and AnyOf clauses can be used in the same way as they are in the main where clause of the model query. Not can be used to negate any existing clause, AllOf to match when all of several supplied clauses match, and AnyOf to match when any one of several supplied clauses matches.
Example:
ModelQuery(
name = "get_Not_AnyOf_AllOf_example_sources",
find = "methods",
where = ...,
model = [
Parameters(
TaintSource[Test],
where=[
Not(
AnyOf(
AllOf(
cls.extends("a.b"),
cls.name.matches("Foo"),
),
AllOf(
cls.extends("c.d"),
cls.name.matches("Bar")
)
)
)
]
)
]
)
Using ViaTypeOf with the Parameters clause
Usually when specifying a ViaTypeOf, the argument whose value or type you want to capture should be specified. However, when writing model queries and trying to find all parameters that match certain conditions, we may not know the exact name of the parameters that will be modelled. For example:
def f1(bad_1, good_1, good_2):
    pass
def f2(good_3, bad_2, good_4):
    pass
Suppose we wanted to model all parameters with the prefix bad_ here and attach a ViaTypeOf to them. In this case it is still possible to attach these features to the parameter model, by using a standalone ViaTypeOf as follows:
ModelQuery(
name = "get_f_sinks",
find = "functions",
where = name.matches("f"),
model = [
Parameters(
TaintSink[Test, ViaTypeOf],
where=[
name.matches("bad_")
]
)
]
)
This would produce models equivalent to the following:
def f1(bad_1: TaintSink[Test, ViaTypeOf[bad_1]]): ...
def f2(bad_2: TaintSink[Test, ViaTypeOf[bad_2]]): ...
Models for attributes
Taint for attribute models requires an AttributeModel model clause, which can only be used when the find clause specifies attributes.
Example:
ModelQuery(
name = "get_attribute_sources_sinks",
find = "attributes",
where = ...,
model = [
AttributeModel(TaintSource[Test], TaintSink[Test])
]
)
Using ViaAttributeName with the AttributeModel clause
ViaAttributeName can be used within AttributeModel to add a feature containing the name of the attribute to any taint flowing through the given attributes.
For instance:
ModelQuery(
name = "get_attribute_of_Foo",
find = "attributes",
where = [cls.name.equals("Foo")],
model = [
AttributeModel(ViaAttributeName[WithTag["Foo"]])
]
)
On the following code:
class Foo:
first_name: str
last_name: str
def last_name_to_sink(foo: Foo):
sink(foo.last_name)
This will add the feature via-Foo-attribute:last_name
on the flow to the sink.
Models for globals
Taint for global models requires a GlobalModel
model clause, which can only be used when the find clause specifies globals.
Example:
ModelQuery(
name = "get_global_sources",
find = "globals",
where = ...,
model = [
GlobalModel(TaintSource[Test])
]
)
Models for setting modes
This model clause is different from the others in this section in the sense that it doesn't produce taint for the models it targets, but updates their models with specific modes to change their behavior with taint analysis.
The available modes are:
- Obscure: Marks the function or method as obscure
- SkipObscure: Prevents a function or method from being marked as obscure
- SkipAnalysis: Skips inference of the targeted function or method, and forces the use of user-defined models for taint flow
- SkipOverrides: Prevents taint propagation from the targeted model into and from overridden methods on subclasses
- Entrypoint: Specifies functions or methods to be used as entrypoints for analysis, so only transitive calls from that function are analyzed
- SkipDecoratorWhenInlining: Prevents the selected decorator from being inlined during analysis. Note: this mode will be a no-op, since model queries are generated after decorators are inlined
- SkipModelBroadening: Prevents model broadening for the given function or method
For instance, instead of annotating each function separately, as in the following .pysa
file:
@Entrypoint
def myfile.func1(): ...
@Entrypoint
def myfile.func2(): ...
@Entrypoint
def myfile.func3(): ...
@Entrypoint
def myfile.func4(): ...
One could instead use the following model query:
ModelQuery(
name = "get_myfile_entrypoint_functions",
find = "functions",
where = [
name.matches("myfile\.func.*")
],
model = [
Modes([Entrypoint])
]
)
The benefit is that any new function that matches that name will also be considered an entrypoint.
Note that it is also possible to include multiple modes in a Modes
model clause by extending the list, e.g. Modes([SkipOverrides, Obscure])
.
Expected and Unexpected Models clauses
The optional expected_models
and unexpected_models
clauses allow you to specify models that your ModelQuery should or should not generate the equivalent of. The models in these clauses should be syntactically correct Pysa models (see this documentation for a guide on how to write a Pysa model). If your query does not generate a model in expected_models
, or if it generates a model in unexpected_models
, an error will be raised.
Example:
ModelQuery(
name = "get_foo_returns_sources",
find = "functions",
where = [name.matches("foo")],
model = [
Returns(TaintSource[Test]),
],
expected_models = [
"def file.foo() -> TaintSource[Test]: ...",
"def file.foo2() -> TaintSource[Test]: ..."
],
unexpected_models = [
"def file.bar() -> TaintSource[Test]: ..."
]
)
This would not produce any errors, since the models the ModelQuery generates will contain expected_models
and not unexpected_models
.
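The check described above can be pictured as two set-membership tests. The Python sketch below is a simplification: check_models is a hypothetical helper, and it compares model strings literally, whereas Pysa checks for equivalent models.

```python
def check_models(generated, expected, unexpected):
    """Return one error message per violated expectation."""
    errors = []
    for model in expected:
        if model not in generated:
            errors.append(f"expected model was not generated: {model}")
    for model in unexpected:
        if model in generated:
            errors.append(f"unexpected model was generated: {model}")
    return errors


# Models assumed to be generated by the query above.
generated = {
    "def file.foo() -> TaintSource[Test]: ...",
    "def file.foo2() -> TaintSource[Test]: ...",
}
expected = [
    "def file.foo() -> TaintSource[Test]: ...",
    "def file.foo2() -> TaintSource[Test]: ...",
]
unexpected = ["def file.bar() -> TaintSource[Test]: ..."]

errors = check_models(generated, expected, unexpected)
```

If the query also produced a model for file.bar, the same check would report one error.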
Cache Queries
Generating models for a large number of queries can be quite slow. Cache queries speed up model generation by factoring out queries with similar where
clauses into a single query, which builds a mapping from an arbitrary name to a set of matching entities. Then, other queries can read from this cache, making them quick to execute.
For instance, imagine having the following queries:
ModelQuery(
...
find = "methods",
where = [
AnyOf(cls.extends("my_module.Foo"), cls.extends("other_module.Bar")),
fully_qualified_name.matches("\.ClassA\.method$"),
],
model = ...
)
ModelQuery(
...
find = "methods",
where = [
AnyOf(cls.extends("my_module.Foo"), cls.extends("other_module.Bar")),
fully_qualified_name.matches("\.ClassB\.method$"),
],
model = ...
)
ModelQuery(
...
find = "methods",
where = [
AnyOf(cls.extends("my_module.Foo"), cls.extends("other_module.Bar")),
fully_qualified_name.matches("\.ClassC\.other_method$"),
],
model = ...
)
# etc.
We can factor out the expensive where clause into a single query which writes to a key-value cache,
using the WriteToCache
clause.
ModelQuery(
...
find = "methods",
where = [AnyOf(cls.extends("my_module.Foo"), cls.extends("other_module.Bar"))],
model = WriteToCache(kind="FooBar", name=f"{class_name}:{function_name}")
)
All matching methods will be stored in a cache named FooBar
, under the key {class_name}:{function_name}
.
After executing the query, we might get the following cache FooBar
:
ClassA:method -> {some_module.ClassA.method}
ClassB:method -> {some_other_module.ClassB.method}
ClassC:other_method -> {some_module.ClassC.other_method}
We can then read from the cache using the where clause read_from_cache
:
ModelQuery(
find = "methods",
where = read_from_cache(kind="FooBar", name="ClassA:method"),
model = ...
)
ModelQuery(
find = "methods",
where = read_from_cache(kind="FooBar", name="ClassB:method"),
model = ...
)
ModelQuery(
find = "methods",
where = read_from_cache(kind="FooBar", name="ClassC:other_method"),
model = ...
)
This will generate the same models as the first example, but model generation will be a lot faster.
In terms of time complexity, if the number of entities (methods here) is N
, the number of queries is Q
and the average cost of evaluating a where clause is C
, the first example would have an O(N*Q*C)
complexity. Using cache queries, this turns into O(N*C+Q)
, which is much better.
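The difference in cost can be made concrete by counting how often the expensive where clause is evaluated. The Python sketch below is a toy simulation, not Pysa's implementation: ENTITIES, SUBCLASSES, expensive_where and cache_key are hypothetical stand-ins for the methods, the cls.extends checks and the f"{class_name}:{function_name}" key.

```python
from collections import defaultdict

# Hypothetical entities: fully qualified method names.
ENTITIES = [
    "some_module.ClassA.method",
    "some_other_module.ClassB.method",
    "some_module.ClassC.other_method",
    "some_module.Unrelated.method",
]
QUERY_KEYS = ["ClassA:method", "ClassB:method", "ClassC:other_method"]

# Stand-in for the classes that extend my_module.Foo or other_module.Bar.
SUBCLASSES = {"ClassA", "ClassB", "ClassC"}


def expensive_where(entity):
    return entity.split(".")[-2] in SUBCLASSES


def cache_key(entity):
    # Mirrors f"{class_name}:{function_name}".
    _, class_name, function_name = entity.rsplit(".", 2)
    return f"{class_name}:{function_name}"


def run_without_cache(entities, keys):
    evaluations = 0
    results = {}
    for key in keys:  # Q queries ...
        matched = set()
        for entity in entities:  # ... each scanning all N entities
            evaluations += 1
            if expensive_where(entity) and cache_key(entity) == key:
                matched.add(entity)
        results[key] = matched
    return results, evaluations


def run_with_cache(entities, keys):
    evaluations = 0
    cache = defaultdict(set)
    for entity in entities:  # one pass over the N entities
        evaluations += 1
        if expensive_where(entity):
            cache[cache_key(entity)].add(entity)
    # Each of the Q queries is now a single dictionary lookup.
    results = {key: set(cache.get(key, set())) for key in keys}
    return results, evaluations


slow_results, slow_evaluations = run_without_cache(ENTITIES, QUERY_KEYS)
fast_results, fast_evaluations = run_with_cache(ENTITIES, QUERY_KEYS)
```

With four entities and three queries, the first version evaluates the expensive clause twelve times and the cached version four times, while both return the same matches.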
WriteToCache clause
WriteToCache
is a model clause that is used to store entities into a cache. It takes the following arguments:
- A kind, which is the name of the cache.
- A name as a format string, which will be the key for the entity in the cache.
For instance:
ModelQuery(
...
find = "methods",
model = WriteToCache(kind="cache_name", name=f"{class_name}:{function_name}")
)
Note that you can write multiple entities under the same name. For instance, this happens if you use name=f"{class_name}"
and multiple methods of the same class match against the where clause.
read_from_cache clause
read_from_cache
is a where clause that will only match against entities with the given name in the cache. It takes the following arguments:
- A kind, which is the name of the cache.
- A name as a string, which is the key for the entities in the cache.
For instance:
ModelQuery(
find = "methods",
where = read_from_cache(kind="cache_name", name="Class:method"),
model = ...
)
Note that you can use read_from_cache
in combination with other where clauses, as long as at least one read_from_cache
clause is active on all branches.
For instance, this is disallowed:
ModelQuery(
find = "methods",
where = AnyOf(
read_from_cache(kind="cache_name", name="Class:method"),
cls.extends("module.Foo")
),
model = ...
)
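One way to read the "active on all branches" rule is as a recursive check over the where clause. The Python sketch below encodes that reading and is an assumption, not Pysa's actual validation: clauses are modelled as hypothetical (kind, payload) tuples, an AllOf is accepted if any child is cache-backed, an AnyOf only if every child is, and every other clause (including Not) is treated as not cache-backed.

```python
def is_cache_backed(clause):
    """Return True if every way of matching `clause` goes through read_from_cache."""
    kind, payload = clause
    if kind == "read_from_cache":
        return True
    if kind == "AllOf":
        # One cache-backed conjunct restricts the whole conjunction.
        return any(is_cache_backed(child) for child in payload)
    if kind == "AnyOf":
        # Every alternative must be restricted by a cache read.
        return bool(payload) and all(is_cache_backed(child) for child in payload)
    return False


read = ("read_from_cache", ("cache_name", "Class:method"))
extends = ("cls.extends", "module.Foo")

disallowed = is_cache_backed(("AnyOf", [read, extends]))  # the example above
allowed = is_cache_backed(("AllOf", [read, extends]))
```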
Format strings
Format strings can be used to craft a string using information from the matched entity.
They can be used in the WriteToCache
name argument as well as the CrossRepositoryTaintAnchor
canonical name and port arguments.
For instance:
WriteToCache(kind="cache_name", name=f"{class_name}:{function_name}")
CrossRepositoryTaintAnchor[TaintSink[Thrift], f"{class_name}:{function_name}", f"formal({parameter_position + 1})"]
The following variables can be used:
- function_name: The (non-qualified) name of the function;
- method_name: The (non-qualified) name of the method;
- class_name: The (non-qualified) name of the class;
- parameter_name: The parameter name, when used within the Parameters clause;
- parameter_position: The parameter position, when used within the Parameters clause. This will give -1 for keyword only parameters;
- capture(identifier): The regular expression capture group called identifier. See documentation below.
Math operators such as +
, -
and *
can be used on parameter_position
and integer literals, such as f"{parameter_position * 2 + 1}"
.
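Since these format strings share their syntax with Python f-strings, their results can be previewed in plain Python. The sketch below is only an analogy: Pysa substitutes the variables itself for each matched entity, and render_names with its sample values is hypothetical.

```python
def render_names(class_name, function_name, parameter_position):
    """Evaluate the example format strings for one hypothetical match."""
    cache_key = f"{class_name}:{function_name}"
    port = f"formal({parameter_position + 1})"
    scaled = f"{parameter_position * 2 + 1}"
    return cache_key, port, scaled


first_parameter = render_names("ClassA", "method", 0)
third_parameter = render_names("ClassA", "method", 2)
```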
Regular expression capture
name.matches
and cls.name.matches
clauses can use named capturing groups, which can be used in the name
of WriteToCache
clauses.
For instance:
ModelQuery(
find = "functions",
where = name.matches("^get_(?P<attribute>[a-z]+)$"),
model = WriteToCache(kind="cache_name", name=f"{capture(attribute)}")
)
For a function get_foo
, this will create a cache entry under the key foo
.
Be careful when using regular expression captures. If the capture group is not found (e.g. because of a typo), WriteToCache
will use the empty string.
Note that we do not support numbered capture groups, e.g. Foo(.*)
.
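The capture behaviour, including the empty-string fallback, can be tried out with Python's re module, which accepts the same (?P<name>...) syntax for named groups. This is a sketch of the described behaviour rather than Pysa's own regular expression engine, and capture_key is a hypothetical helper.

```python
import re


def capture_key(pattern, function_name, group):
    """Return the cache key produced by capture(group), or None if there is no match."""
    match = re.search(pattern, function_name)
    if match is None:
        # The where clause does not match, so nothing would be written.
        return None
    # A missing or misspelled group falls back to the empty string.
    return match.groupdict().get(group) or ""


PATTERN = r"^get_(?P<attribute>[a-z]+)$"

key = capture_key(PATTERN, "get_foo", "attribute")
typo_key = capture_key(PATTERN, "get_foo", "atribute")
no_match = capture_key(PATTERN, "set_foo", "attribute")
```

The typo case shows why care is needed: every matching function would end up under the same empty key.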
Logging group clauses
The logging_group_name
clause specifies that the model query should be considered part of the given group for logging purposes.
This is useful when auto-generating a large number of model queries.
When verbose logging is enabled (-n
), Pysa will print a single line Model Query group 'XXX' generated YYY models
instead of printing one line per model query in the group.
For instance:
ModelQuery(
name = "generated_dangerous_foo",
logging_group_name = "generated_dangerous",
find = "methods",
where = read_from_cache(kind="annotated", name="foo"),
model = ...
)
ModelQuery(
name = "generated_dangerous_bar",
logging_group_name = "generated_dangerous",
find = "methods",
where = read_from_cache(kind="annotated", name="bar"),
model = ...
)
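The effect on logging can be pictured as summing the per-query model counts for each group. The Python sketch below is illustrative only: group_log_lines and the counts are hypothetical, and only the message format is taken from the text above.

```python
from collections import Counter


def group_log_lines(query_results):
    """Collapse (logging_group_name, model_count) pairs into one line per group."""
    totals = Counter()
    for group, count in query_results:
        totals[group] += count
    return [
        f"Model Query group '{group}' generated {total} models"
        for group, total in totals.items()
    ]


# Hypothetical counts for the two queries above.
lines = group_log_lines([
    ("generated_dangerous", 3),  # generated_dangerous_foo
    ("generated_dangerous", 2),  # generated_dangerous_bar
])
```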