---------------------------------------------------------------------- This is the API documentation for the tidysdmx library. ---------------------------------------------------------------------- ## FMR & Schemas Fetch and parse SDMX schemas from a Fusion Metadata Registry. fetch_schema(base_url: str, artefact_id: str, context: Literal['dataflow', 'datastructure', 'provisionagreement']) -> pysdmx.model.dataflow.Schema Fetch the schema of a specified artefact from an SDMX registry. Args: base_url: The base URL of the FMR. artefact_id: The identifier of the artefact, typically in the format ``"agency:id(version)"``. context: The context of the artefact to fetch. Returns: The fetched schema object. fetch_dsd_schema(fmr_params: dict, env: str, dsd_id: str) -> pysdmx.model.dataflow.Schema Fetch a DSD schema from a Fusion Metadata Registry (FMR). .. deprecated:: Use :func:`fetch_schema` instead. Args: fmr_params: Base URL and endpoints to access FMR's API. env: FMR environment (e.g. ``'sandbox'``, ``'qa'``, ``'dev'``, ``'prod'``). dsd_id: The DSD identifier in the format ``"agency:id(version)"``. Returns: The schema of the requested Data Structure Definition. Raises: ValueError: If the URL is not syntactically valid. parse_artefact_id(artefact_id: str) -> tuple[str, str, str] Parse an artefact identifier into its components: agency, id and version. Args: artefact_id: The identifier of the artefact, typically in the format ``"agency:id(version)"``. Returns: A tuple containing the agency, id, and version. Raises: ValueError: If the artefact_id is not in the expected format. parse_dsd_id(dsd_id: str) -> tuple[str, str, str] Parse a DSD identifier into its components. .. deprecated:: Use :func:`parse_artefact_id` instead. Args: dsd_id: The DSD identifier in the format ``"agency:id(version)"``. Returns: A tuple containing the agency, id, and version. Raises: ValueError: If the dsd_id is not in the expected format. create_schema_from_table(dataframe: pandas.core.frame.DataFrame, dimensions: list[str], measure: str, time_dimension: str, attributes: list[str] | None = None, agency_id: str = 'WB.DP', schema_id: str = 'DP_SCHEMA', version: str = '1.0', uppercase_code_ids: bool = True) -> tidysdmx.structures.SchemaComponents Create a DSD, ConceptScheme, and Codelists from a DataFrame. Args: dataframe: The source DataFrame. dimensions: Column names to use as SDMX dimensions. measure: Column name to use as the SDMX measure. time_dimension: Column name to use as the SDMX time dimension. attributes: Optional column names to use as SDMX attributes. agency_id: Agency identifier for the generated artefacts. schema_id: Base identifier for the generated DSD and concept scheme. version: Version string for the generated artefacts. uppercase_code_ids: If True (default), codelist code IDs are uppercased. Set to False to preserve the original casing of code values. Returns: SchemaComponents: A named tuple with ``dsd``, ``concept_scheme``, and ``codelists`` fields. Raises: ValueError: If any of the specified column names are missing from the DataFrame. ## Structure Maps Build, parse, validate, and write SDMX structure maps. parse_mapping_template_wb(path: str | pathlib.Path) -> dict[str, pandas.core.frame.DataFrame] Read an Excel mapping template and return all sheets as DataFrames. Args: path: Path to the Excel file. Returns: A dictionary where keys are sheet names and values are DataFrames. Raises: FileNotFoundError: If the provided file path does not exist. 
ValueError: If the file is not an Excel file (.xlsx or .xls). RuntimeError: If reading the Excel file fails. build_structure_map_from_template_wb(mappings: dict[str, pandas.core.frame.DataFrame], agency: str = 'SDMX', structure_map_id: str = 'WB_STRUCTURE_MAP', structure_type: Literal['datastructure', 'dataflow', 'provisionagreement'] = 'datastructure', version: str = '1.0', required_keys: collections.abc.Iterable[str] = ('INFO', 'COMP_MAPPING', 'REP_MAPPING'), valid_rules: collections.abc.Iterable[str] = ('representation', 'implicit'), valid_prefixes: collections.abc.Iterable[str] = ('fixed:',), generate_urns: bool = True, source_structure_id: str | None = None, target_structure_id: str | None = None) -> pysdmx.model.map.StructureMap Build a complete StructureMap object by parsing a WB-format Excel template. Args: mappings: Dictionary of DataFrames containing all sheets. agency: Fallback agency ID if not found in INFO. structure_map_id: ID for the resulting StructureMap. structure_type: The type of artefact to extract from INFO. version: Fallback version if not found in INFO. required_keys: Required sheet names to validate. valid_rules: Valid literal mapping rules. valid_prefixes: Valid prefixes for parameterized mapping rules. generate_urns: If True, automatically generate URNs for StructureMap and nested RepresentationMaps. Defaults to True. source_structure_id: Optional source structure reference in ``"AGENCY:ID(VERSION)"`` format (e.g. ``"WB:DSD_ASPIRE(1.0)"``). When provided and ``generate_urns`` is True, a full SDMX URN is built and set as the StructureMap's ``source``. target_structure_id: Optional target structure reference in ``"AGENCY:ID(VERSION)"`` format (e.g. ``"WB:DSD_WDI(1.0)"``). When provided and ``generate_urns`` is True, a full SDMX URN is built and set as the StructureMap's ``target``. Returns: A valid pysdmx StructureMap object. Raises: ValueError: If mandatory sheets/columns are missing or mapping rules are invalid. Examples: >>> mappings = { ... "INFO": pd.DataFrame({"Key": ["FMR_AGENCY"], "Value": ["TEST_AGENCY"]}), ... "COMP_MAPPING": pd.DataFrame({"SOURCE": ["src"], "TARGET": ["tgt"], "MAPPING_RULES": ["fixed:VAL"]}), ... "REP_MAPPING": pd.DataFrame({"source": ["a"], "target": ["b"]}) ... } >>> smap = build_structure_map_from_template_wb(mappings) >>> isinstance(smap, StructureMap) True build_fixed_map(target: str, value: str, located_in: str | None = 'target') -> pysdmx.model.map.FixedValueMap Build a pysdmx FixedValueMap for setting a component to a fixed value. Args: target: The ID of the target component in the structure map. value: The fixed value to assign to the target component. located_in: Indicates whether the mapping is located in 'source' or 'target'. Defaults to 'target'. Returns: A pysdmx FixedValueMap object representing the fixed mapping. Raises: ValueError: If ``target`` or ``value`` is empty. ValueError: If ``located_in`` is not 'source' or 'target'. Examples: >>> mapping = build_fixed_map("CONF_STATUS", "F") >>> isinstance(mapping, FixedValueMap) True build_implicit_component_map(source: str, target: str) -> pysdmx.model.map.ImplicitComponentMap Build a pysdmx ImplicitComponentMap for implicit mapping rules. Args: source: The ID of the source component in the structure map. target: The ID of the target component in the structure map. Returns: A pysdmx ImplicitComponentMap object. Raises: ValueError: If ``source`` or ``target`` is empty. 
Examples: >>> mapping = build_implicit_component_map("FREQ", "FREQUENCY") >>> isinstance(mapping, ImplicitComponentMap) True build_date_pattern_map(source: str, target: str, pattern: str, frequency: str, id: str | None = None, locale: str = 'en', pattern_type: Literal['fixed', 'variable'] = 'fixed', resolve_period: Optional[Literal['startOfPeriod', 'endOfPeriod', 'midPeriod']] = None) -> pysdmx.model.map.DatePatternMap Build a DatePatternMap object for mapping date patterns between SDMX components. Args: source: The ID of the source component. target: The ID of the target component. pattern: The SDMX date pattern describing the source date (e.g., "MMM yy"). frequency: The frequency code or reference (e.g., "M" for monthly). id: Optional map ID as defined in the registry. locale: Locale for parsing the input date pattern. Defaults to "en". pattern_type: Type of date pattern. Defaults to "fixed". - "fixed": frequency is a fixed value (e.g., "A" for annual). - "variable": frequency references a dimension or attribute (e.g., "FREQ"). resolve_period: Point in time to resolve when mapping from low to high frequency periods. Returns: A fully constructed DatePatternMap instance. Raises: ValueError: If any required argument is empty or invalid. TypeError: If argument types do not match expected types. Examples: >>> dpm = build_date_pattern_map( ... source="DATE", ... target="TIME_PERIOD", ... pattern="MMM yy", ... frequency="M" ... ) >>> print(dpm) source: DATE, target: TIME_PERIOD, pattern: MMM yy, frequency: M build_value_map(source: str, target: str, valid_from: datetime.datetime | None = None, valid_to: datetime.datetime | None = None) -> pysdmx.model.map.ValueMap Create a pysdmx ValueMap object mapping a source value to a target value. Args: source: The source value to map. target: The target value to map to. valid_from: Start of business validity for the mapping. valid_to: End of business validity for the mapping. Returns: A pysdmx ValueMap object representing the mapping. Raises: ValueError: If source or target is empty. TypeError: If source or target is not a string. Examples: >>> from datetime import datetime >>> vm = build_value_map("BE", "BEL") >>> isinstance(vm, ValueMap) True >>> vm.source 'BE' >>> vm.target 'BEL' >>> vm2 = build_value_map("DE", "GER", valid_from=datetime(2020, 1, 1)) >>> vm2.valid_from.year 2020 build_value_map_list(df: pandas.core.frame.DataFrame, source_col: str = 'source', target_col: str = 'target', valid_from_col: str = 'valid_from', valid_to_col: str = 'valid_to') -> list[pysdmx.model.map.ValueMap] Build a list of ValueMap objects from a pandas DataFrame, optionally including validity periods. Args: df: DataFrame where each row represents a mapping. source_col: Column name for source values. target_col: Column name for target values. valid_from_col: Optional column name for validity start date. Defaults to "valid_from". valid_to_col: Optional column name for validity end date. Defaults to "valid_to". Returns: List of ValueMap objects created from the DataFrame. Raises: ValueError: If DataFrame is empty or required columns are missing. TypeError: If source or target columns contain non-string values. Notes: - If validity columns exist and contain non-null values, they will be used. - If validity columns are absent or contain only nulls, they are ignored. Examples: >>> import pandas as pd >>> data = { ... 'source': ['BE', 'FR'], ... 'target': ['BEL', 'FRA'], ... 'valid_from': ['2020-01-01', None], ... 'valid_to': ['2025-12-31', None] ... 
} >>> df = pd.DataFrame(data) >>> value_maps = build_value_map_list(df, 'source', 'target') >>> isinstance(value_maps[0], ValueMap) True build_multi_value_map_list(df: pandas.core.frame.DataFrame, source_cols: collections.abc.Sequence[str], target_cols: collections.abc.Sequence[str], valid_from_col: str = 'valid_from', valid_to_col: str = 'valid_to') -> list[pysdmx.model.map.MultiValueMap] Build a list of MultiValueMap objects from a pandas DataFrame. Iterates through the DataFrame rows to create mapping objects that map values from multiple source columns to multiple target columns. Args: df: DataFrame where each row represents a mapping. source_cols: Column names for source values. target_cols: Column names for target values. valid_from_col: Optional column name for validity start date. Defaults to "valid_from". valid_to_col: Optional column name for validity end date. Defaults to "valid_to". Returns: List of MultiValueMap objects created from the DataFrame. Raises: ValueError: If DataFrame is empty or required columns are missing. TypeError: If source or target columns contain non-string values. Examples: >>> import pandas as pd >>> data = { ... 'country': ['DE', 'CH'], ... 'currency_src': ['LC', 'LC'], ... 'currency_tgt': ['EUR', 'CHF'], ... 'region_tgt': ['EU', 'Non-EU'] ... } >>> df = pd.DataFrame(data) >>> maps = build_multi_value_map_list( ... df, ... ['country', 'currency_src'], ... ['currency_tgt', 'region_tgt'] ... ) >>> len(maps) 2 >>> maps[0].source ('DE', 'LC') >>> maps[0].target ('EUR', 'EU') build_representation_map(df: pandas.core.frame.DataFrame, agency: str = 'FAKE_AGENCY', id: str | None = None, name: str | None = None, source_cl: str | None = None, target_cl: str | None = None, version: str = '1.0', description: str | None = None, source_col: str = 'source', target_col: str = 'target', valid_from_col: str = 'valid_from', valid_to_col: str = 'valid_to', generate_urn: bool = True) -> pysdmx.model.map.RepresentationMap Build a RepresentationMap object from a pandas DataFrame using build_value_map_list. Args: df: DataFrame where each row represents a mapping. agency: Agency maintaining the representation map. id: Identifier for the representation map. name: Name of the representation map. source_cl: URN or identifier for the source codelist or data type. target_cl: URN or identifier for the target codelist or data type. version: Version of the representation map. Defaults to "1.0". description: Optional description of the representation map. source_col: Column name for source values. Defaults to "source". target_col: Column name for target values. Defaults to "target". valid_from_col: Column name for validity start date. Defaults to "valid_from". valid_to_col: Column name for validity end date. Defaults to "valid_to". generate_urn: If True, automatically generate URN. Defaults to True. Returns: A RepresentationMap object containing the mappings. Raises: ValueError: If DataFrame is empty or required columns are missing. TypeError: If source or target columns contain non-string values. Examples: >>> import pandas as pd >>> data = { ... 'source': ['BE', 'FR'], ... 'target': ['BEL', 'FRA'], ... 'valid_from': ['2020-01-01', None], ... 'valid_to': ['2025-12-31', None] ... 
} >>> df = pd.DataFrame(data) >>> rm = build_representation_map(df, agency='ECB', id='RM1', name='Country Map', source_cl='urn:source:codelist', target_cl='urn:target:codelist') >>> isinstance(rm, RepresentationMap) True build_multi_representation_map(df: pandas.core.frame.DataFrame, agency: str = 'FAKE_AGENCY', id: str | None = None, name: str | None = None, source_cls: list[str] | None = None, target_cls: list[str] | None = None, version: str = '1.0', description: str | None = None, source_cols: list[str] | None = None, target_cols: list[str] | None = None, valid_from_col: str = 'valid_from', valid_to_col: str = 'valid_to') -> pysdmx.model.map.MultiRepresentationMap Build a MultiRepresentationMap object from a pandas DataFrame. Wraps the creation of individual MultiValueMap objects and bundles them into a MultiRepresentationMap container. Args: df: DataFrame where each row represents a multi-mapping. agency: Agency maintaining the map. Defaults to "FAKE_AGENCY". id: Identifier for the map. name: Name of the map. source_cls: URNs/IDs for source codelists/types. target_cls: URNs/IDs for target codelists/types. version: Version of the map. Defaults to "1.0". description: Description of the map. source_cols: Source columns. Defaults to ["source"]. target_cols: Target columns. Defaults to ["target"]. valid_from_col: Validity start column. Defaults to "valid_from". valid_to_col: Validity end column. Defaults to "valid_to". Returns: The constructed MultiRepresentationMap object. Raises: ValueError: If DataFrame is empty or columns are missing. TypeError: If non-string data is found in source/target columns. build_single_component_map(df: pandas.core.frame.DataFrame, source_component: str, target_component: str, agency: str = 'FAKE_AGENCY', id: str | None = None, name: str | None = None, source_cl: str | None = None, target_cl: str | None = None, version: str = '1.0', description: str | None = None, source_col: str = 'source', target_col: str = 'target', valid_from_col: str = 'valid_from', valid_to_col: str = 'valid_to', generate_urn: bool = True) -> pysdmx.model.map.ComponentMap Build a ComponentMap mapping one source component to one target component using a RepresentationMap built from a pandas DataFrame. Args: df: DataFrame where each row represents a mapping. source_component: ID of the source component. target_component: ID of the target component. agency: Agency maintaining the representation map. Defaults to "FAKE_AGENCY". id: Identifier for the representation map. name: Name of the representation map. source_cl: URN or identifier for the source codelist or data type. target_cl: URN or identifier for the target codelist or data type. version: Version of the representation map. Defaults to "1.0". description: Optional description of the representation map. source_col: Column name for source values. Defaults to "source". target_col: Column name for target values. Defaults to "target". valid_from_col: Column name for validity start date. Defaults to "valid_from". valid_to_col: Column name for validity end date. Defaults to "valid_to". generate_urn: If True, generate URN for the RepresentationMap. Defaults to True. Returns: A ComponentMap object mapping the source to the target component. Raises: ValueError: If DataFrame is empty or required columns are missing. TypeError: If source or target columns contain non-string values. Examples: >>> import pandas as pd >>> data = { ... 'source': ['BE', 'FR'], ... 'target': ['BEL', 'FRA'], ... 'valid_from': ['2020-01-01', None], ... 'valid_to': ['2025-12-31', None] ...
} >>> df = pd.DataFrame(data) >>> cm = build_single_component_map( ... df, ... source_component="COUNTRY", ... target_component="COUNTRY", ... agency="ECB", ... id="CM1", ... name="Country Component Map", ... source_cl="urn:source:codelist", ... target_cl="urn:target:codelist" ... ) >>> isinstance(cm, ComponentMap) True collect_structure_map_artifacts(structure_map: pysdmx.model.map.StructureMap, convert_to_urns: bool = True) -> list[pysdmx.model.__base.MaintainableArtefact] Collect the StructureMap and all its dependent RepresentationMaps. When a StructureMap contains RepresentationMap objects, this function extracts them and converts the StructureMap to use URN references. Args: structure_map: The StructureMap to process. convert_to_urns: If True, converts embedded RepresentationMap objects to URN references in the output StructureMap. Defaults to True. Returns: A list containing RepresentationMaps followed by the StructureMap. Example: >>> from pysdmx.io import write_sdmx >>> from pysdmx.io.format import Format >>> >>> # Collect all artifacts >>> artifacts = collect_structure_map_artifacts(my_structure_map) >>> >>> # Write them all together >>> xml = write_sdmx( ... artifacts, ... sdmx_format=Format.STRUCTURE_SDMX_ML_3_0, ... prettyprint=True ... ) validate_structure_map_references(structure_map: pysdmx.model.map.StructureMap) -> None Validate that all RepresentationMap references are resolved. This function checks if ComponentMap and MultiComponentMap rules contain actual RepresentationMap objects rather than just URN strings. It also validates that RepresentationMaps have required fields set. Args: structure_map: The StructureMap to validate. Raises: ValueError: If any ComponentMap or MultiComponentMap contains only a URN string reference instead of the actual object, or if RepresentationMaps have missing required fields. Example: >>> try: ... validate_structure_map_references(my_structure_map) ... print("All references are resolved!") ... except ValueError as e: ... print(f"Unresolved references: {e}") prepare_structure_map_for_upload(structure_map: pysdmx.model.map.StructureMap, validate: bool = True) -> list[pysdmx.model.__base.MaintainableArtefact] Prepare a StructureMap for upload by collecting all dependencies. This is a convenience function that combines validation (optional) and artifact collection. Args: structure_map: The StructureMap to prepare. validate: If True, validates that all references are resolved. Defaults to True. Returns: A list of all artifacts ready to write/upload. Raises: ValueError: If validate=True and unresolved references are found. Example: >>> from pysdmx.api.fmr.maintenance import ( ... RegistryMaintenanceClient, StructureAction, ... ) >>> from tidysdmx.structure_map_writer import prepare_structure_map_for_upload >>> >>> # Prepare artifacts >>> artifacts = prepare_structure_map_for_upload(my_structure_map) >>> >>> # Upload to FMR >>> client = RegistryMaintenanceClient( ... api_endpoint="https://your-fmr/sdmx/v2/", ... user="username", ... password="password" ... ) >>> client.put_structures(artifacts, action=StructureAction.Replace) ## Mapping Apply structure maps to tidy DataFrames. map_structures(df: pandas.core.frame.DataFrame, structure_map: pysdmx.model.map.StructureMap, verbose: bool = False) -> pandas.core.frame.DataFrame Apply all mapping components from a StructureMap to a DataFrame. Separates the maps by type and applies them in order: FixedValueMap, ImplicitComponentMap, ComponentMap, MultiComponentMap. Args: df: The source dataset. 
structure_map: A StructureMap containing various mapping components. verbose: If True, print logs about applied mappings. Returns: Modified DataFrame with all mappings applied. apply_fixed_value_maps(df: pandas.core.frame.DataFrame, fixed_value_maps: list[pysdmx.model.map.FixedValueMap]) -> pandas.core.frame.DataFrame Apply FixedValueMap rules to a DataFrame. Args: df: The source dataset. fixed_value_maps: A list of FixedValueMap objects containing target and value. Returns: DataFrame with fixed value columns added. apply_implicit_component_maps(df: pandas.core.frame.DataFrame, implicit_maps: list[pysdmx.model.map.ImplicitComponentMap], verbose: bool = False) -> pandas.core.frame.DataFrame Apply ImplicitComponentMap rules to a DataFrame. Copies values from source to target columns, supporting different source/target names. Args: df: The source dataset. implicit_maps: A list of ImplicitComponentMap objects. verbose: If True, print logs about applied mappings and conflicts. Returns: DataFrame with implicit component mappings applied. apply_multi_component_map(df: pandas.core.frame.DataFrame, multi_component_map: pysdmx.model.map.MultiComponentMap, verbose: bool = False) -> pandas.core.frame.DataFrame Apply a single MultiComponentMap with regex support, preserving rule order. Rules are applied in the order they appear in the MultiRepresentationMap. The first matching rule wins. Patterns prefixed with ``"regex:"`` are matched using ``re.fullmatch``. Only the first target column is used; multi-target MultiComponentMaps are not supported. Args: df: Source data. multi_component_map: MultiComponentMap with source columns, target column, and values. verbose: If True, print progress. Returns: DataFrame with the target column added or overwritten. map_to_sdmx(df: pandas.core.frame.DataFrame, mapping: dict) -> pandas.core.frame.DataFrame Map DataFrame columns to SDMX values using a lookup mapping. This function transforms the given DataFrame columns to conform to the SDMX representation by applying either a fixed mapping or an ordered, regex-based mapping. For each key present in the DataFrame: - Fixed Mapping: If the mapping for a key contains a ``TARGET`` column but no ``SOURCE`` column, the entire column is replaced with the fixed value provided by ``TARGET``. - Regex-based Mapping: If the mapping for a key contains both ``SOURCE`` and ``TARGET`` columns, the function applies ordered regex-based matching using a first-match-wins strategy. Args: df: The input DataFrame containing the data to be mapped. mapping: The lookup mapping as a dict. Each key represents an SDMX component and its value is expected to be either a list of dictionaries (with keys ``SOURCE`` and ``TARGET``) or a DataFrame with those columns. Returns: The transformed DataFrame with mapped column values. Raises: ValueError: If the schema version is unsupported. transform_source_to_target(df: pandas.core.frame.DataFrame, mapping: dict) -> pandas.core.frame.DataFrame Transform a raw DataFrame into the format defined by a components map. Creates a new DataFrame with columns as defined in ``mapping["components"]["TARGET"]`` and populates it with data from the source DataFrame based on the column names in ``["SOURCE"]``. Args: df: The input DataFrame with raw data. mapping: The master mapping dictionary containing a mapping between the input file columns and the columns defined in the schema. Returns: The transformed DataFrame with columns as defined in the components map's TARGET. 
Raises: KeyError: If the mapping does not contain a ``"components"`` key or its value is empty. ## Standardisation Prepare a mapped DataFrame for SDMX upload. standardize_output(df: pandas.core.frame.DataFrame, artefact_id: str, schema: pysdmx.model.dataflow.Schema, action: Literal['I', 'U', 'D'] = 'I') -> pandas.core.frame.DataFrame Standardize the output DataFrame by adding SDMX reference columns. Enriches the given DataFrame with SDMX-related metadata columns (``STRUCTURE``, ``STRUCTURE_ID``, ``ACTION``) based on the provided artefact ID and schema, then ensures these columns appear first. Args: df: Input DataFrame containing SDMX data. artefact_id: Unique identifier of the SDMX artefact (e.g., Dataflow ID). schema: A pysdmx Schema object used to determine artefact type and filter columns. action: Action indicator for SDMX operations. Defaults to ``"I"``. Allowed values: ``"I"`` (Insert), ``"U"`` (Update), ``"D"`` (Delete). Returns: A new DataFrame with SDMX reference columns added and reordered. Raises: ValueError: If ``df`` is empty or ``artefact_id``/``schema`` is empty. TypeError: If ``df`` is not a pandas DataFrame. standardize_sdmx(df: pandas.core.frame.DataFrame, mapping: dict, cat_indicator: bool = False) -> pandas.core.frame.DataFrame Standardize a DataFrame by applying column and value transformations. Args: df: The input DataFrame with raw data. mapping: A dictionary containing the mapping DataFrame and other relevant information. cat_indicator: Whether OBS_VALUE is a categorical indicator. Default is False. Returns: The standardized DataFrame with columns transformed according to the mapping. standardize_data_for_upload(df: pandas.core.frame.DataFrame, dsd: str, structure: str = 'datastructure', action: str = 'I', cat_indicator: bool = False) -> pandas.core.frame.DataFrame Standardize a DataFrame for SDMX upload. .. deprecated:: Use :func:`standardize_output` instead. Finalizes the DataFrame for upload by fixing INDICATOR values, adding reference columns, and reordering columns. Args: df: The input DataFrame to modify. dsd: The Data Structure Definition (DSD) identifier. structure: The structure type. Default is ``'datastructure'``. Options: ``'datastructure'``, ``'metadataflow'``, ``'dataflow'``. action: The action type. Default is ``'I'`` (Insert). Options: ``'I'``, ``'U'``, ``'D'``. cat_indicator: Whether OBS_VALUE is a categorical indicator. Default is False. Returns: The modified DataFrame with corrected INDICATOR values, added reference columns, and reordered columns. standardize_indicator_id(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame Fix the INDICATOR column to be uppercase and prefixed with dataset ID. Ensures all values in the ``INDICATOR`` column are upper case, prefixed with the dataset identifier, and have dots replaced with underscores. Args: df: The DataFrame to modify. Must contain an ``INDICATOR`` column and either a ``DATABASE_ID`` or ``DATASET_ID`` column. Returns: The modified DataFrame with corrected INDICATOR values. Raises: ValueError: If the database/dataset ID column contains more than one unique value. Examples: >>> df = pd.DataFrame({ ... "DATABASE_ID": ["WB.DATA360", "WB.DATA360"], ... "INDICATOR": ["indicator.one", "indicator.two"], ... }) >>> result = standardize_indicator_id(df) >>> list(result["INDICATOR"]) ['WB_DATA360_INDICATOR_ONE', 'WB_DATA360_INDICATOR_TWO'] sanitize_variable(value: str, uppercase: bool = True) -> str Sanitize a raw string value into a valid SDMX code ID. 
Applies the same sanitization used internally by ``create_schema_from_table`` when building codelist code IDs from DataFrame column values. Use this function during your data cleaning phase to ensure that the values in your DataFrame will match the code IDs generated in the schema. The sanitization rules are: - Non-alphanumeric/underscore characters (including dots) are replaced with ``_``. - Leading/trailing underscores are stripped. - IDs starting with a digit are prefixed with ``_``. - Result is uppercased by default (controlled by ``uppercase``). Args: value: The raw string value to sanitize (e.g. ``"per_allsp.adq_ep_preT_tot"``). uppercase: If True (default), the result is uppercased, matching the default behaviour of ``create_schema_from_table``. Set to False if you called ``create_schema_from_table`` with ``uppercase_code_ids=False``. Returns: A sanitized SDMX-safe identifier string. Examples: >>> sanitize_variable("per_allsp.adq_ep_preT_tot") 'PER_ALLSP_ADQ_EP_PRET_TOT' >>> sanitize_variable("per_allsp.adq_ep_preT_tot", uppercase=False) 'per_allsp_adq_ep_pret_tot' add_sdmx_reference_cols(df: pandas.core.frame.DataFrame, dsd: str, structure: str = 'datastructure', action: str = 'I') -> pandas.core.frame.DataFrame Add SDMX reference columns to a DataFrame. .. deprecated:: Use :func:`standardize_output` instead. Args: df: The input DataFrame to which the columns will be added. dsd: The Data Structure Definition (DSD) identifier. structure: The structure type. Default is ``'datastructure'``. action: The action type. Default is ``'I'`` (Insert). Returns: The DataFrame with the added SDMX reference columns. ## Validation Validate datasets against schemas and codelists. validate_dataset_local(df: pandas.core.frame.DataFrame, schema: pysdmx.model.dataflow.Schema | None = None, valid: dict[str, object] | None = None, sdmx_cols: list[str] | None = None, max_errors: int = 1000) -> pandas.core.frame.DataFrame Validate that a DataFrame is SDMX compliant and return a DataFrame of errors. Either a schema or a precomputed ``valid`` object must be provided to avoid recomputing validation info for multiple datasets. Args: df: The DataFrame to be validated. schema: The schema object (optional if ``valid`` is provided). valid: Precomputed validation information returned by :func:`~tidysdmx.utils.extract_validation_info` (optional). sdmx_cols: SDMX reference columns expected in the dataset. Defaults to ``['STRUCTURE', 'STRUCTURE_ID', 'ACTION']``. max_errors: Maximum number of individual errors to report per validation check. Defaults to ``1000``. Returns: A DataFrame containing validation errors. Each row is one error, with columns ``Validation`` and ``Error``. validate_columns(df: pandas.core.frame.DataFrame, valid_columns: list[str], sdmx_cols: list[str] | None = None, max_errors: int = 1000) -> None Validate that all DataFrame columns are valid components or SDMX references. Args: df: The DataFrame to validate. valid_columns: List of valid component names. sdmx_cols: List of additional allowed column names. Defaults to ``['STRUCTURE', 'STRUCTURE_ID', 'ACTION']``. max_errors: Maximum number of unexpected columns to include in the error message. Defaults to ``1000``. Raises: ValueError: If any columns in the DataFrame are not in ``valid_columns`` or ``sdmx_cols``, listing all offending names up to ``max_errors``. 
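Below is a minimal, illustrative sketch of the column check described above, assuming ``validate_columns`` is importable from the package root like the helpers shown in the user guide (adjust the import path if it lives in a submodule):

```python
import pandas as pd

from tidysdmx import validate_columns  # assumed import path; adjust if not re-exported

df = pd.DataFrame({
    "FREQ": ["A"],
    "REF_AREA": ["GHA"],
    "OBS_VALUE": [12.3],
    "NOTES": ["stray column not defined in the schema"],
})

try:
    # This frame carries no SDMX reference columns, so pass an empty sdmx_cols list.
    validate_columns(df, valid_columns=["FREQ", "REF_AREA", "OBS_VALUE"], sdmx_cols=[])
except ValueError as err:
    print(err)  # expected to report NOTES as an unexpected column
```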
validate_mandatory_columns(df: pandas.core.frame.DataFrame, mandatory_columns: list[str], sdmx_cols: list[str] | None = None) -> None Validate that all mandatory columns are present in the DataFrame. Args: df: The DataFrame to validate. mandatory_columns: List of mandatory component names. sdmx_cols: List of additional mandatory column names. Defaults to ``['STRUCTURE', 'STRUCTURE_ID', 'ACTION']``. Raises: ValueError: If any mandatory column is absent from the DataFrame. validate_codelist_ids(df: pandas.core.frame.DataFrame, codelist_ids: dict[str, list[str]], max_errors: int = 1000) -> None Validate that all values in coded columns are within the allowed codelist IDs. Reports violations across all coded columns in a single error, capped at ``max_errors`` entries. Args: df: The DataFrame to validate. codelist_ids: Mapping of column name to list of allowed code IDs. max_errors: Maximum number of invalid-value entries to include in the error message across all columns. Defaults to ``1000``. Raises: ValueError: If any value in a coded column is not in the allowed IDs, listing all offending values (up to ``max_errors``) with their column. validate_duplicates(df: pandas.core.frame.DataFrame, dim_comp: list[str], max_errors: int = 1000) -> None Validate that there are no duplicate rows for a given set of key columns. Args: df: The DataFrame to validate. dim_comp: Column names forming the uniqueness key (dimensions). max_errors: Maximum number of duplicate key combinations to include in the error message. Defaults to ``1000``. Raises: ValueError: If duplicate rows are found, reporting the count and the offending key combinations (up to ``max_errors``). validate_no_missing_values(df: pandas.core.frame.DataFrame, mandatory_columns: list[str], max_errors: int = 1000) -> None Validate that there are no missing values in mandatory columns. Args: df: The DataFrame to validate. mandatory_columns: List of mandatory column names to check. max_errors: Maximum number of rows with missing values to include in the error message. Defaults to ``1000``. Raises: ValueError: If missing values are found in any mandatory column, reporting the count and the offending rows (up to ``max_errors``). ## Tidy Raw Filter and shape raw inputs. filter_tidy_raw(df: pandas.core.frame.DataFrame, schema: pysdmx.model.dataflow.Schema) -> pandas.core.frame.DataFrame Filter an SDMX DataFrame by removing rows that violate codelist constraints. Args: df: The input DataFrame. schema: The SDMX schema to validate against. Returns: A filtered DataFrame with invalid code rows removed. filter_rows(df: pandas.core.frame.DataFrame, codelist_ids: dict[str, list[str]]) -> pandas.core.frame.DataFrame Filter out rows where values are not in the allowed codelist. Compares as strings but does not change DataFrame dtypes. Does not mutate the input DataFrame. Args: df: The input DataFrame. codelist_ids: A mapping of column names to allowed codelist IDs. Returns: A filtered copy of the DataFrame containing only selected rows. ## Utilities Helpers for codelists, components, Excel templates, and XML. extract_validation_info(schema: pysdmx.model.dataflow.Schema) -> dict[str, object] Extract validation information from a given schema. Args: schema: The schema object containing all necessary validation information. Returns: A dictionary containing validation information with the following keys: - valid_comp: List of valid component names. - mandatory_comp: List of mandatory component names. - coded_comp: List of coded component names. 
- codelist_ids: Dictionary with coded components as keys and list of codelist IDs as values. - dim_comp: List of dimension component names. get_codelist_ids(comp: pysdmx.model.dataflow.Components, coded_comp: list[str]) -> dict[str, list[str]] Retrieve all codelist IDs for given coded components. Args: comp: A pysdmx Components collection. coded_comp: List of coded component IDs. Returns: Dictionary with coded component IDs as keys and lists of codelist IDs as values. extract_component_ids(schema: pysdmx.model.dataflow.Schema) -> list[str] Retrieve all component IDs from a given pysdmx Schema. Args: schema: A pysdmx Schema object representing an SDMX structure. Returns: A list of component IDs contained in the schema. Raises: TypeError: If the input is not a Schema instance. ValueError: If the schema has no components. Examples: >>> from pysdmx.model import Schema, Components, Component >>> comp1 = Component(id="FREQ") >>> comp2 = Component(id="TIME_PERIOD") >>> schema = Schema(context="datastructure", agency="ECB", id_="EXR", ... components=Components([comp1, comp2]), ... version="1.0.0", urns=[]) >>> extract_component_ids(schema) ['FREQ', 'TIME_PERIOD'] create_mapping_rules(components: collections.abc.Sequence[str], rep_maps: collections.abc.Set[str] | None = None) -> list[str] Create Excel hyperlink formulas for components with representation maps. Args: components: A list or sequence of SDMX component IDs. rep_maps: A set of component IDs for which a representation map exists and a hyperlink should be generated. Returns: A list of strings, where each element is either an Excel hyperlink formula or an empty string. Raises: TypeError: If any input argument fails type validation via @typechecked. Examples: >>> components = ["FREQ", "REF_AREA", "SEX", "OBS_VALUE"] >>> rep_maps = {"REF_AREA", "SEX"} >>> create_mapping_rules(components, rep_maps) ['', '=HYPERLINK("#REF_AREA!A1","REF_AREA")', '=HYPERLINK("#SEX!A1","SEX")', ''] >>> create_mapping_rules(components, None) ['', '', '', ''] >>> create_mapping_rules([], {"ANY"}) [] build_excel_workbook(components: collections.abc.Sequence[str], rep_maps: collections.abc.Sequence[str] | None = None) -> openpyxl.workbook.workbook.Workbook Build a Workbook with component mapping and representation map sheets. The primary sheet ``comp_mapping`` contains three columns: ``source``, ``target``, and ``mapping_rules``, with hyperlinks for components that have a representation map. Args: components: An ordered list of unique target component IDs. rep_maps: A sequence of names (matching component IDs) for which dedicated representation mapping tabs should be created. Internally deduplicated via conversion to a set. Returns: An openpyxl Workbook object populated with sheets and headers. Raises: ValueError: If ``components`` validation fails (delegated to helper). TypeCheckError: If any input argument fails type validation. RuntimeError: If sheet creation fails due to invalid sheet names. write_excel_mapping_template(components: collections.abc.Sequence[str], rep_maps: collections.abc.Sequence[str] | None = None, output_path: pathlib.Path = PosixPath('mapping.xlsx')) -> pathlib.Path Generate an Excel mapping template with component and representation tabs. Args: components: An ordered list of unique target component IDs. rep_maps: A sequence of unique names for which dedicated representation mapping tabs should be created. output_path: The full path where the Excel file will be saved. Returns: The file path to the saved Excel workbook. 
Raises: ValueError: If `components` validation fails (delegated to helper). FileNotFoundError: If the parent directory for `output_path` does not exist. RuntimeError: If saving the workbook fails due to I/O issues. read_mapping(path: str) -> dict Read a JSON mapping file and parse its content into DataFrames. The function processes JSON data with four main keys: 1. ``schema_version``: The version of the mapping schema. 2. ``dsd_id``: The Data Structure Definition ID. 3. ``components``: A flat structure converted into a DataFrame. 4. ``representation``: Multiple sub-keys, each converted into a separate DataFrame. Empty sub-keys are skipped. All occurrences of the string ``"NA"`` are converted to ``pd.NA``. Args: path: The file path to the JSON file to be parsed. Returns: A dictionary where: - ``schema_version`` is stored under key ``'schema_version'``. - ``dsd_id`` is stored under key ``'dsd_id'``. - The components DataFrame is stored under key ``'components'``. - Each valid representation sub-key is stored as a DataFrame under its corresponding key. Raises: ValueError: If required keys are missing or have unexpected formats. fix_sdmx_xml_datatype_tags(input_path: str | pathlib.Path, output_path: str | pathlib.Path | None = None) -> pathlib.Path Fix incorrect SourceCodelist/TargetCodelist tags in SDMX-ML. The pysdmx XML writer emits ``SourceCodelist`` and ``TargetCodelist`` tags (with a ``String`` data-type value) when a RepresentationMap uses a plain DataType. The correct SDMX 3.0 tags are ``SourceDataType`` and ``TargetDataType``. Args: input_path: Path to the SDMX-ML XML file to fix. output_path: Path to write the corrected XML. If ``None``, the input file is overwritten in place. Returns: The path to the written output file. Raises: FileNotFoundError: If ``input_path`` does not exist. gen_urn(artefact_type: str, agency: str, artefact_id: str, version: str = '1.0') -> str Generate a full SDMX URN for any maintainable artefact. Args: artefact_type: The type of artefact (e.g., "StructureMap", "RepresentationMap") agency: The agency ID artefact_id: The artefact ID version: The version (default "1.0") Returns: Full URN string Example: >>> gen_urn("StructureMap", "BIS", "SM_TEST", "1.0") 'urn:sdmx:org.sdmx.infomodel.structuremapping.StructureMap=BIS:SM_TEST(1.0)' ## QA Quality-assurance helpers. qa_coerce_numeric(df: pandas.core.frame.DataFrame, numeric_columns: list[str]) -> pandas.core.frame.DataFrame Coerce specified columns to numeric, removing rows with invalid values. Args: df: The input DataFrame. numeric_columns: Column names to coerce to numeric. Returns: A new DataFrame with numeric columns coerced and invalid rows removed. qa_remove_duplicates(df: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame Remove duplicate rows from a DataFrame. Args: df: The input DataFrame. Returns: A new DataFrame with duplicate rows removed. ## Kedro Integration Kedro pipeline node wrappers. kd_read_mappings(mapping_files: dict) -> dict Fetch multiple mappings from different files. Args: mapping_files: A dictionary where keys are dataset-specific keys and values are file paths to the mapping files. Returns: A dictionary where keys are dataset-specific keys and values are the mappings. kd_standardize_sdmx(data: dict, mappings: dict, boolean: bool = True) -> dict Standardize a partitioned dataset into SDMX format. Applies transform_source_to_target to each input dataframe with its corresponding mapping. Args: data: A dictionary where keys are dataset-specific keys and values are input DataFrames.
mappings: A dictionary where keys are dataset-specific keys and values are mapping DataFrames. boolean: A flag used to force node execution order in Kedro. Returns: A dictionary where keys are dataset-specific keys and values are transformed DataFrames. kd_validate_dataset_local(df: pandas.core.frame.DataFrame, schema=None, valid=None) -> tuple[bool, dict] Validate a single DataFrame for SDMX compliance. Wrapper that calls validate_dataset_local to obtain a DataFrame of errors, then logs messages and returns a tuple of (success, errors). Args: df: The DataFrame to be validated. schema: The schema object containing validation information (optional if ``valid`` is provided). valid: Precomputed validation information (optional). Returns: A tuple where the first element is True if the dataset passed validation (no errors) and False otherwise, and the second element is an empty dict on success or a dict with key ``ValidationReport`` mapping to the list of error messages. kd_validate_datasets_local(datasets: dict, schema, boolean: bool) -> tuple[dict, dict] Validate multiple datasets for SDMX compliance. Ensures each dataset has ``STRUCTURE``, ``STRUCTURE_ID``, and ``ACTION`` columns. See the SDMX-CSV field guide for more details. Args: datasets: Dictionary of datasets to be validated. schema: Schema object containing validation information. boolean: A flag used to force node execution order in Kedro. Returns: A tuple of two dictionaries. The first maps each key to True/False, and the second maps each key to its error dictionary. ## Lookups Vectorised lookup helpers. vectorized_lookup_ordered_v1(series: pandas.core.series.Series, mapping_df: pandas.core.frame.DataFrame) -> pandas.core.series.Series Apply ordered regex matching to a Pandas Series. For each regex pattern in mapping_df, check if the value in series matches the pattern. The corresponding TARGET is assigned when a match is found, and later rules are skipped. Any cell that does not match any pattern retains its original value. Args: series: The input data series (e.g., a DataFrame column). mapping_df: A DataFrame with at least two columns: - ``SOURCE``: regex patterns (ordered by priority) - ``TARGET``: corresponding replacement values Returns: A new series with values replaced according to the first matching regex, or the original value if no match is found. vectorized_lookup_ordered_v2(series: pandas.core.series.Series, mapping_df: pandas.core.frame.DataFrame) -> pandas.core.series.Series Apply ordered matching (regex or exact) to a Pandas Series. For each row in mapping_df: - If ``IS_REGEX`` is True, perform regex matching. - If ``IS_REGEX`` is False, perform exact string matching. The corresponding TARGET is assigned when a match is found, and later rules are skipped. Any cell that does not match retains its original value. Args: series: The input data series (e.g., a DataFrame column). mapping_df: A DataFrame with at least three columns: - ``SOURCE``: regex patterns or exact strings (ordered by priority) - ``TARGET``: corresponding replacement values - ``IS_REGEX``: boolean indicating whether SOURCE is a regex Returns: A new series with values replaced according to the first matching rule, or the original value if no match is found. ---------------------------------------------------------------------- This is the User Guide documentation for the package. ----------------------------------------------------------------------

# Introduction

**tidysdmx** is a Python toolbox for producing SDMX-conformant data with as little ceremony as possible.
It sits on top of [pysdmx](https://py.sdmx.io) and adds higher-level helpers for the workflows that statistical agencies and research teams actually run every day:

- pulling schemas from a Fusion Metadata Registry (FMR);
- describing messy raw inputs as SDMX schemas;
- expressing source-to-target mappings in an Excel template;
- applying those mappings to tidy DataFrames;
- validating results against the dissemination DSD;
- emitting SDMX-ML 3.0 artefacts ready for FMR upload.

## Who is this for?

- **Data engineers** wiring up reproducible pipelines that ingest CSVs, Excel files, or database extracts and publish SDMX-conformant outputs.
- **Statisticians and domain experts** who own a mapping between their raw indicators and an official dissemination schema and want to express it without writing XML.
- **Platform teams** integrating SDMX production into orchestrators such as Kedro or Airflow.

## How does it relate to pysdmx?

tidysdmx **wraps pysdmx** — it does not reimplement it. Wherever pysdmx already provides a model class, reader, writer, or FMR client, tidysdmx calls it directly. The value tidysdmx adds is concentrated at the boundaries: turning a pandas DataFrame into a pysdmx `Schema`, turning an Excel workbook into a `StructureMap`, and turning a mapped DataFrame into the tabular shape that pysdmx writers expect.

If you already know pysdmx, you can think of tidysdmx as a set of opinionated convenience functions; if you don't, you can use tidysdmx without ever touching pysdmx internals.

## What's covered in this guide?

- [Installation](01-installation.qmd) — install with pip or Poetry and verify your environment.
- [Quick start](02-quick-start.qmd) — the smallest end-to-end example.

More chapters — end-to-end workflow, SDMX concepts, FMR integration, mapping templates, validation, and LLM/agent artefacts — are in progress and will be linked here as they land.

## Status

tidysdmx is **under active development**. Public APIs are stabilising but may still change between minor versions. Pin a version in production and check the [changelog](../changelog.html) before upgrading.

# Installation

tidysdmx requires **Python 3.11.9 or later**.

## With pip

```bash
pip install tidysdmx
```

## With Poetry

```bash
poetry add tidysdmx
```

## From source (development)

```bash
git clone https://github.com/WB-DECIS/tidysdmx.git
cd tidysdmx
poetry install
poetry run pytest -m "not integration"
```

## Runtime dependencies

tidysdmx depends on:

- [`pysdmx`](https://py.sdmx.io) — the underlying SDMX information-model library.
- [`pandas`](https://pandas.pydata.org) — the tabular workhorse.
- [`typeguard`](https://typeguard.readthedocs.io) — runtime type checking.
- [`openpyxl`](https://openpyxl.readthedocs.io) — Excel mapping template support.

## Verify the install

```python
import tidysdmx

print(tidysdmx.__version__)

from tidysdmx import (
    create_schema_from_table,
    parse_mapping_template_wb,
    build_structure_map_from_template_wb,
    map_structures,
    standardize_output,
    validate_dataset_local,
)
```

If the imports succeed, you're ready to go. Continue to the [quick start](02-quick-start.qmd).

## Optional: connect to FMR

Most real-world workflows fetch a dissemination schema from a Fusion Metadata Registry.
The `pysdmx` FMR client is bundled with tidysdmx and can be instantiated directly:

```python
from pysdmx.api.fmr import RegistryClient

client = RegistryClient("https://fmrqa.worldbank.org/FMR/sdmx/v2")
```

# Quick start

This page shows the smallest possible tidysdmx pipeline: tidy DataFrame in, validated SDMX-ready output out. A full iterative-development walkthrough (with FMR fetch, Excel mapping, and XML export) is in progress.

## 1. Describe your tidy raw data as a schema

```python
import pandas as pd

from tidysdmx import create_schema_from_table

tidy_raw = pd.DataFrame({
    "SERIES": ["PER_ALLSP_ADQ_EP_TOT"] * 4,
    "ECONOMY": ["GHA", "GHA", "KEN", "KEN"],
    "TIME_PERIOD": ["2018", "2019", "2018", "2019"],
    "VALUE": [12.3, 13.1, 9.8, 10.4],
})

raw_schema = create_schema_from_table(
    tidy_raw,
    dimensions=["SERIES", "ECONOMY"],
    time_dimension="TIME_PERIOD",
    measure="VALUE",
)
```

`raw_schema` is a `SchemaComponents` named tuple bundling the generated DSD, ConceptScheme, and the codelists inferred from your data.

## 2. Validate against that schema

```python
from tidysdmx import validate_dataset_local

errors = validate_dataset_local(
    df=tidy_raw,
    schema=raw_schema.dsd.to_schema(),
    sdmx_cols=[],
)
assert errors.empty
```

## 3. Apply a structure map (when you have one)

With a `StructureMap` from pysdmx (or one built via `build_structure_map_from_template_wb`):

```python
from tidysdmx import map_structures, standardize_output

mapped = map_structures(df=tidy_raw, structure_map=sm, verbose=True)

out = standardize_output(
    df=mapped,
    artefact_id="WB.GGH.HSP:DS_ASPIRE(1.0.0)",
    schema=dis_schema,
    action="I",
)
```

`out` is a tidy DataFrame whose columns and codes match the dissemination schema, ready to be written as SDMX-ML or uploaded to FMR. A minimal write sketch is included at the end of this page.

## Where to next?

- The [API reference](../reference/index.html) documents every public function.
- A full end-to-end workflow walkthrough and recipe collection are in progress.
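In the meantime, here is a minimal sketch of the write step referenced at the end of step 3. It reuses `out` and `dis_schema` from step 3; because `standardize_output` has already added the `STRUCTURE`, `STRUCTURE_ID`, and `ACTION` reference columns, a plain CSV dump gives you a file shaped like SDMX-CSV (the file name below is illustrative):

```python
from tidysdmx import validate_dataset_local

# Re-check the mapped output against the dissemination schema before writing.
errors = validate_dataset_local(df=out, schema=dis_schema)
assert errors.empty, errors

# The standardized frame already carries STRUCTURE / STRUCTURE_ID / ACTION,
# so writing it as CSV produces an SDMX-CSV-style file (file name illustrative).
out.to_csv("ds_aspire_sdmx.csv", index=False)
```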